
Fix ShouldRestart for on-failure handle #20853

Merged
merged 1 commit into moby:master from WeiZhang555:fix-ShouldRestart on Apr 11, 2016

Conversation

WeiZhang555
Contributor

Currently, if you restart the Docker daemon, all containers with restart
policy on-failure are started regardless of their RestartCount, which
makes the daemon take extra time to restart.

This commit stops these containers from doing an unnecessary start when
the daemon restarts.

How to reproduce:

  1. Start a container with restart policy on-failure:3:
    $ docker run -tid --restart on-failure:3 busybox sh -c "sleep 1; exit 127"
  2. Wait for it to fail and restart 3 times; make sure it has stopped and no longer shows up in docker ps.
  3. Restart the Docker daemon.
  4. (DO THIS QUICKLY) docker ps shows that the stopped container has been started again; it fails and restarts 3 more times.

This is unnecessary, because the container already failed 3 times the last time the daemon was running, and starting it again only slows down the daemon's restart.
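For convenience, here are the steps above condensed into a shell session (a sketch assuming systemd manages the daemon; adjust the restart command for your setup):

    # start a container that keeps failing, limited to 3 restarts
    $ docker run -tid --restart on-failure:3 busybox sh -c "sleep 1; exit 127"
    # wait until it has failed and restarted 3 times and no longer shows up
    $ docker ps
    # restart the daemon (systemd example)
    $ sudo systemctl restart docker
    # check quickly: the stopped container is running again and goes
    # through another 3 failing restarts
    $ docker ps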


Edit: added a small refactor.
Merge Container.ShouldRestart() and containerMonitor.shouldRestart() so
they can share the same logic. Besides that, some duplicated fields of
containerMonitor are removed.

Signed-off-by: Zhang Wei zhangwei555@huawei.com

@thaJeztah
Member

Should we have a test for this?

Change SGTM, although the code is becoming a bit complex to read; splitting it up at some point would be good for readability.

@WeiZhang555
Contributor Author

This test case will need to restart the Docker daemon; I'm not familiar with that part, could someone give me some instructions?

I'm also worried it may become a flaky test, because it needs to capture the container's restarting/running state.

@thaJeztah
Member

ping @vdemeester wdyt, is it worth a test, or are we okay without?

@WeiZhang555
Contributor Author

@thaJeztah I've figured out how to write the test case now, I think it'll work 😄

/cc @vdemeester @cpuguy83 @calavera

@@ -505,7 +505,8 @@ func copyEscapable(dst io.Writer, src io.ReadCloser, keys []byte) (written int64
func (container *Container) ShouldRestart() bool {
return container.HostConfig.RestartPolicy.Name == "always" ||
(container.HostConfig.RestartPolicy.Name == "unless-stopped" && !container.HasBeenManuallyStopped) ||
(container.HostConfig.RestartPolicy.Name == "on-failure" && container.ExitCode != 0)
(container.HostConfig.RestartPolicy.Name == "on-failure" && container.ExitCode != 0 &&
Member

Can we have container/monitor.go use this fn as well? It looks like it's defining its own rules, which are exactly the same.
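For orientation, here is a self-contained sketch of the rule the diff above extends; since the added line is truncated, the field names RestartCount and MaximumRetryCount are assumptions for illustration, not a quote of the merged code:

    // Minimal sketch, not the actual container package; the Container
    // fields RestartCount and MaximumRetryCount are assumed names.
    package main

    import "fmt"

    type restartPolicy struct {
        Name              string
        MaximumRetryCount int
    }

    type container struct {
        Policy                 restartPolicy
        HasBeenManuallyStopped bool
        ExitCode               int
        RestartCount           int
    }

    func shouldRestart(c container) bool {
        switch c.Policy.Name {
        case "always":
            return true
        case "unless-stopped":
            return !c.HasBeenManuallyStopped
        case "on-failure":
            // a MaximumRetryCount of 0 means "no limit"
            return c.ExitCode != 0 &&
                (c.Policy.MaximumRetryCount == 0 || c.RestartCount < c.Policy.MaximumRetryCount)
        }
        return false
    }

    func main() {
        // a container that already reached its retry limit is not restarted again
        c := container{
            Policy:       restartPolicy{Name: "on-failure", MaximumRetryCount: 3},
            ExitCode:     127,
            RestartCount: 3,
        }
        fmt.Println(shouldRestart(c)) // false
    }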

@WeiZhang555 force-pushed the fix-ShouldRestart branch 3 times, most recently from 6ae40a4 to 1e52a09 on March 8, 2016 16:07
@WeiZhang555
Contributor Author

@cpuguy83 I made an implementation that has container/monitor.go use Container.ShouldRestart(); it's a little complicated now and I'm worried it may introduce race conditions.
But in general it should make sense; let's see whether Janky is happy with this change.

@WeiZhang555
Contributor Author

I double-checked this change and believe it is right; Janky's test result also supports that.

/cc @vdemeester @calavera

@thaJeztah
Member

WindowsTP4 timed out after 2 hours 😢 restarting

@WeiZhang555
Contributor Author

@thaJeztah Looks like 2 hours is definitely insufficient for Windows now 🔮

@thaJeztah
Member

@WeiZhang555 yeah, I brought it up yesterday, but we think #21017 will be a better solution than raising the timeout limit. Perhaps I should discuss it again, even if it's just temporarily raising the limit 😢

@thaJeztah
Member

ping @WeiZhang555 sorry, needs a rebase now 😢

@WeiZhang555
Contributor Author

@thaJeztah Rebased, thank you for your reminder!

/cc @cpuguy83

@cpuguy83
Member

LGTM, thanks!

case rp.IsAlways():
return true
case rp.IsUnlessStopped():
if !container.HasBeenManuallyStopped {
Member

It's a nit but I would have inverted the condition, so it's if container.HasBeenManuallyStopped { instead (feels easier to read).

Contributor Author

OK
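For illustration, the inverted form would read roughly like this (a sketch only; the body of the branch is assumed, since the diff fragment above is cut off):

    case rp.IsUnlessStopped():
        // early return when the user stopped the container explicitly
        if container.HasBeenManuallyStopped {
            return false
        }
        return true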

@thaJeztah
Member

ping @vdemeester PR was updated, ptal

@WeiZhang555
Contributor Author

@thaJeztah Thank you 😜

@vdemeester
Member

LGTM 🐼

@WeiZhang555
Contributor Author

@thaJeztah Never mind, this change is still necessary as far as I can see; it just needs some more investigation, and also another code review 😢

@WeiZhang555
Contributor Author

@cpuguy83 Because of the changes brought by containerd, it's hard to merge ShouldRestart() now, so I removed the refactor and kept only the bug fix.

I can try to do the refactor after I get more familiar with the new code, if you don't mind.

Also ping @vdemeester @thaJeztah for another code review. 😄

@vdemeester
Member

Re-LGTM if it's 📗

@@ -310,6 +310,8 @@ func (daemon *Daemon) restore() error {
}
}

// reset container restart count
c.RestartManager(true)
Member

Do we really want to reset all containers' restart policies here?

Contributor Author

RestartManager decides whether the container should be restarted based on its failureCount. If we don't reset the container's restart count, a container with on-failure:3 may end up with a RestartCount of 5, e.g.:

  1. The container failed 2 times, RestartCount=2.
  2. The daemon shut down.
  3. The daemon rebooted; RestartCount was not reset, the container failed and restarted 3 more times, and RestartCount turned out to be 2+3=5.

This is incorrect.
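A toy illustration of the mismatch described above (a standalone example, not the daemon's actual types): the restart manager's in-memory counter starts from zero after a daemon restart, while the persisted RestartCount does not.

    // Toy example: the in-memory failure counter is not persisted, so after a
    // daemon restart it starts at 0 and allows 3 more restarts, pushing the
    // persisted RestartCount past the on-failure:3 limit.
    package main

    import "fmt"

    func main() {
        const max = 3 // on-failure:3

        persistedRestartCount := 2 // container had already failed twice before shutdown
        inMemoryFailureCount := 0  // fresh counter after the daemon comes back up

        for inMemoryFailureCount < max {
            inMemoryFailureCount++
            persistedRestartCount++
        }

        fmt.Println(persistedRestartCount) // 5, even though the policy is on-failure:3
    }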

Member

So this is a problem with some internal counter that's not persisted?

Contributor Author

This is because we use two variables to calculate ShouldRestart: failureCount and RestartPolicy.

OK, I'd better merge these two params into one again. I had done that before, so I'll try to migrate that part into this PR.

@thaJeztah
Member

ping @cpuguy83 ptal, what's the status here?

@WeiZhang555 looks like it needs a rebase 😢

@cpuguy83
Member

I think we're just waiting on the changes mentioned above.

@WeiZhang555
Contributor Author

Yes, sorry for the delay. I'll be travelling these days and can't reach my computer. I'll post more updates once I come back in 3 or 4 days.

@thaJeztah
Member

@WeiZhang555 no worries; we were just going through some old PR's, thanks in advance!

@WeiZhang555
Contributor Author

I'm back. The code has been updated based on the previous comments, please take a look. @cpuguy83

restart = true
case rm.policy.IsOnFailure():
// the default value of 0 for MaximumRetryCount means that we will not enforce a maximum count
if max := rm.policy.MaximumRetryCount; max == 0 || rm.failureCount <= max {
if max := rm.policy.MaximumRetryCount; max == 0 || rm.restartCount < max {
Member

How come you changed this to just <?

Contributor Author

I replaced failureCount in restartManager with restartCount so that restartManager.ShouldRestart() can be reused in container.ShouldRestart(). When using failureCount, we were doing this:

  1. failureCount++
  2. Check if failureCount <= max

After replacing it with restartCount, we do:

  1. Check if restartCount < max
  2. If a restart is needed, restartCount++

They have different meanings: restartCount means how many times the container has already been restarted, so it must be less than MaximumRetryCount; if the limit is reached, we shall not restart it again.
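A simplified sketch of the two orderings described above (illustration only, not the actual restartmanager code):

    // Both variants allow exactly `max` restarts for a container that keeps
    // failing; they just order the increment and the comparison differently.
    package main

    import "fmt"

    const max = 3 // MaximumRetryCount for on-failure:3

    // old ordering: increment first, then compare with <=
    func oldCheck(failureCount int) (bool, int) {
        failureCount++
        return failureCount <= max, failureCount
    }

    // new ordering: compare with < first, increment only when restarting
    func newCheck(restartCount int) (bool, int) {
        if restartCount < max {
            return true, restartCount + 1
        }
        return false, restartCount
    }

    func main() {
        f, r := 0, 0
        for i := 0; i < 4; i++ {
            okOld, nf := oldCheck(f)
            okNew, nr := newCheck(r)
            fmt.Println(okOld, okNew) // true true, true true, true true, false false
            f, r = nf, nr
        }
    }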

@cpuguy83
Member

cpuguy83 commented Apr 4, 2016

1 minor question, otherwise LGTM

@WeiZhang555
Contributor Author

ping @cpuguy83

@cpuguy83
Member

cpuguy83 commented Apr 8, 2016

LGTM but sadly needs a rebase again :(

@WeiZhang555 force-pushed the fix-ShouldRestart branch 2 times, most recently from c30bdd2 to e5aeaf6 on April 8, 2016 03:31
@WeiZhang555
Contributor Author

Rebased. Waiting for Janky to turn green 😄

Currently, if you restart the Docker daemon, all containers with restart
policy `on-failure` are started regardless of their `RestartCount`, which
makes the daemon take extra time to restart.

This commit stops these containers from doing an unnecessary start when
the daemon restarts.

Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
@thaJeztah added this to the 1.12.0 milestone on Apr 11, 2016
@vdemeester
Member

LGTM 👍

@vdemeester merged commit a692910 into moby:master on Apr 11, 2016
@thaJeztah
Member

Thanks @WeiZhang555!

@WeiZhang555
Contributor Author

😆

@WeiZhang555 deleted the fix-ShouldRestart branch on April 11, 2016 12:10
@thaJeztah
Member

Added "changelog" label; May be a nice optimization to mention in the change logs
