
Fix ShouldRestart for on-failure handle #20853

Merged
merged 1 commit into moby:master from WeiZhang555:fix-ShouldRestart on Apr 11, 2016

Conversation

WeiZhang555
Contributor

Currently, if you restart the Docker daemon, all containers with restart
policy on-failure are started regardless of their RestartCount, which
makes the daemon take extra time to restart.

This commit stops these containers from doing an unnecessary start when
the daemon restarts.

How to reproduce:

  1. Start a container with restart policy on-failure:3:
    $ docker run -tid --restart on-failure:3 busybox sh -c "sleep 1; exit 127"
  2. Wait for it to fail and restart 3 times; make sure it has stopped and no longer shows up in docker ps.
  3. Restart the Docker daemon.
  4. (DO THIS QUICKLY) docker ps shows that the stopped container has been started again; it fails and restarts 3 more times.

This is unnecessary, because the container already failed 3 times the last time the daemon was running, and starting it again only slows down the daemon's restart.
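For convenience, here are the steps above condensed into a shell session (a sketch assuming systemd manages the daemon; adjust the restart command for your setup):

    # start a container that keeps failing, limited to 3 restarts
    $ docker run -tid --restart on-failure:3 busybox sh -c "sleep 1; exit 127"
    # wait until it has failed and restarted 3 times and no longer shows up
    $ docker ps
    # restart the daemon (systemd example)
    $ sudo systemctl restart docker
    # check quickly: the stopped container is running again and goes
    # through another 3 failing restarts
    $ docker ps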


Edit: added a small refactor.
Merge Container.ShouldRestart() and containerMonitor.shouldRestart() so
they can share the same logic. Besides that, some duplicated fields of
containerMonitor are removed.

Signed-off-by: Zhang Wei zhangwei555@huawei.com

@thaJeztah
Member

Should we have a test for this?

Change SGTM, although the code is becoming a bit complex to read; splitting it up at some point would be good for readability.

@WeiZhang555
Contributor Author

This test case will need to restart the Docker daemon; I'm not familiar with that part, could someone give me some instructions?

I'm also worried it may become a flaky test, because it needs to capture the container's restarting/running state.

@thaJeztah
Member

ping @vdemeester wdyt, is it worth a test, or are we okay without?

@WeiZhang555
Contributor Author

@thaJeztah I've figured out how to write the test case now, I think it'll work 😄

/cc @vdemeester @cpuguy83 @calavera

@@ -505,7 +505,8 @@ func copyEscapable(dst io.Writer, src io.ReadCloser, keys []byte) (written int64
func (container *Container) ShouldRestart() bool {
return container.HostConfig.RestartPolicy.Name == "always" ||
(container.HostConfig.RestartPolicy.Name == "unless-stopped" && !container.HasBeenManuallyStopped) ||
(container.HostConfig.RestartPolicy.Name == "on-failure" && container.ExitCode != 0)
(container.HostConfig.RestartPolicy.Name == "on-failure" && container.ExitCode != 0 &&
Member

Can we have container/monitor.go use this fn as well? It looks like it's defining its own rules, which are exactly the same.
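For orientation, here is a self-contained sketch of the rule the diff above extends; since the added line is truncated, the field names RestartCount and MaximumRetryCount are assumptions for illustration, not a quote of the merged code:

    // Minimal sketch, not the actual container package; the Container
    // fields RestartCount and MaximumRetryCount are assumed names.
    package main

    import "fmt"

    type restartPolicy struct {
        Name              string
        MaximumRetryCount int
    }

    type container struct {
        Policy                 restartPolicy
        HasBeenManuallyStopped bool
        ExitCode               int
        RestartCount           int
    }

    func shouldRestart(c container) bool {
        switch c.Policy.Name {
        case "always":
            return true
        case "unless-stopped":
            return !c.HasBeenManuallyStopped
        case "on-failure":
            // a MaximumRetryCount of 0 means "no limit"
            return c.ExitCode != 0 &&
                (c.Policy.MaximumRetryCount == 0 || c.RestartCount < c.Policy.MaximumRetryCount)
        }
        return false
    }

    func main() {
        // a container that already reached its retry limit is not restarted again
        c := container{
            Policy:       restartPolicy{Name: "on-failure", MaximumRetryCount: 3},
            ExitCode:     127,
            RestartCount: 3,
        }
        fmt.Println(shouldRestart(c)) // false
    }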

@WeiZhang555 force-pushed the fix-ShouldRestart branch 3 times, most recently from 6ae40a4 to 1e52a09 on March 8, 2016 16:07
@WeiZhang555
Contributor Author

@cpuguy83 I made an implementation that has container/monitor.go use Container.ShouldRestart(); it's a little complicated now and I'm worried it may introduce race conditions.
But in general it should make sense; let's see whether Janky is happy with this change.

@WeiZhang555
Contributor Author

I double-checked this change and believe it is right; Janky's test result also supports that.

/cc @vdemeester @calavera

@thaJeztah
Member

WindowsTP4 timed out after 2 hours 😢 restarting

@WeiZhang555
Contributor Author

@thaJeztah Looks like 2 hours is definitely insufficient for Windows now 🔮

@thaJeztah
Member

@WeiZhang555 yeah, I brought it up yesterday, but we think #21017 will be a better solution than raising the timeout limit. Perhaps I should discuss it again, even if it's just temporarily raising the limit 😢

@thaJeztah
Member

ping @WeiZhang555 sorry, needs a rebase now 😢

@WeiZhang555
Contributor Author

@thaJeztah Rebased, thank you for your reminder!

/cc @cpuguy83

@cpuguy83
Member

LGTM, thanks!

case rp.IsAlways():
return true
case rp.IsUnlessStopped():
if !container.HasBeenManuallyStopped {
Member

It's a nit but I would have inverted the condition, so it's if container.HasBeenManuallyStopped { instead (feels easier to read).

Contributor Author

OK
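For illustration, the inverted form would read roughly like this (a sketch only; the body of the branch is assumed, since the diff fragment above is cut off):

    case rp.IsUnlessStopped():
        // early return when the user stopped the container explicitly
        if container.HasBeenManuallyStopped {
            return false
        }
        return true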

@thaJeztah
Member

ping @vdemeester PR was updated, ptal

@WeiZhang555
Contributor Author

@thaJeztah Thank you 😜

@vdemeester
Member

LGTM 🐼

@WeiZhang555
Contributor Author

@thaJeztah Never mind, this change is still necessary as far as I can see; it just needs some more investigation, and also another code review 😢

@WeiZhang555
Contributor Author

@cpuguy83 Because of the changes brought by containerd, it's hard to merge ShouldRestart() now, so I removed the refactor and kept only the bug fix.

I can try to do the refactor after I get more familiar with the new code, if you don't mind.

Also ping @vdemeester @thaJeztah for another code review. 😄

@vdemeester
Member

Re-LGTM if it's 📗

@@ -310,6 +310,8 @@ func (daemon *Daemon) restore() error {
}
}

// reset container restart count
c.RestartManager(true)
Member

Do we really want to reset all containers' restart policies here?

Contributor Author

RestartManager decides whether the container should be restarted based on its failureCount. If we don't reset the container's restart count, a container with on-failure:3 may end up with a RestartCount of 5, e.g.:

  1. The container failed 2 times, RestartCount=2.
  2. The daemon shut down.
  3. The daemon rebooted; RestartCount was not reset, the container failed and restarted 3 more times, and RestartCount turned out to be 2+3=5.

This is incorrect.
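A toy illustration of the mismatch described above (a standalone example, not the daemon's actual types): the restart manager's in-memory counter starts from zero after a daemon restart, while the persisted RestartCount does not.

    // Toy example: the in-memory failure counter is not persisted, so after a
    // daemon restart it starts at 0 and allows 3 more restarts, pushing the
    // persisted RestartCount past the on-failure:3 limit.
    package main

    import "fmt"

    func main() {
        const max = 3 // on-failure:3

        persistedRestartCount := 2 // container had already failed twice before shutdown
        inMemoryFailureCount := 0  // fresh counter after the daemon comes back up

        for inMemoryFailureCount < max {
            inMemoryFailureCount++
            persistedRestartCount++
        }

        fmt.Println(persistedRestartCount) // 5, even though the policy is on-failure:3
    }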

Member

So this is a problem with some internal counter that's not persisted?

Contributor Author

This is because we use two variables to calculate ShouldRestart: failureCount and RestartPolicy.

OK, I'd better merge these two params into one again. I had done that before, so I'll try to migrate that part into this PR.

@thaJeztah
Member

ping @cpuguy83 ptal, what's the status here?

@WeiZhang555 looks like it needs a rebase 😢

@cpuguy83
Member

I think we're just waiting on the changes mentioned above.

@WeiZhang555
Contributor Author

Yes, sorry for the delay. I'll be travelling these days and can't reach my computer. I'll post more updates once I come back in 3 or 4 days.

@thaJeztah
Member

@WeiZhang555 no worries; we were just going through some old PR's, thanks in advance!

@WeiZhang555
Contributor Author

I'm back. The code has been updated based on the previous comments, please take a look. @cpuguy83

restart = true
case rm.policy.IsOnFailure():
// the default value of 0 for MaximumRetryCount means that we will not enforce a maximum count
if max := rm.policy.MaximumRetryCount; max == 0 || rm.failureCount <= max {
if max := rm.policy.MaximumRetryCount; max == 0 || rm.restartCount < max {
Member

How come you changed this to just <?

Contributor Author

I replaced failureCount in restartManager with restartCount so that restartManager.ShouldRestart() can be reused in container.ShouldRestart(). When using failureCount, we were doing this:

  1. failureCount++
  2. Check if failureCount <= max

After replacing it with restartCount, we do:

  1. Check if restartCount < max
  2. If a restart is needed, restartCount++

They have different meanings: restartCount means how many times the container has already been restarted, so it must be less than MaximumRetryCount; if the limit is reached, we shall not restart it again.
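A simplified sketch of the two orderings described above (illustration only, not the actual restartmanager code):

    // Both variants allow exactly `max` restarts for a container that keeps
    // failing; they just order the increment and the comparison differently.
    package main

    import "fmt"

    const max = 3 // MaximumRetryCount for on-failure:3

    // old ordering: increment first, then compare with <=
    func oldCheck(failureCount int) (bool, int) {
        failureCount++
        return failureCount <= max, failureCount
    }

    // new ordering: compare with < first, increment only when restarting
    func newCheck(restartCount int) (bool, int) {
        if restartCount < max {
            return true, restartCount + 1
        }
        return false, restartCount
    }

    func main() {
        f, r := 0, 0
        for i := 0; i < 4; i++ {
            okOld, nf := oldCheck(f)
            okNew, nr := newCheck(r)
            fmt.Println(okOld, okNew) // true true, true true, true true, false false
            f, r = nf, nr
        }
    }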

@cpuguy83
Member

cpuguy83 commented Apr 4, 2016

1 minor question, otherwise LGTM

@WeiZhang555
Contributor Author

ping @cpuguy83

@cpuguy83
Member

cpuguy83 commented Apr 8, 2016

LGTM but sadly needs a rebase again :(

@WeiZhang555 force-pushed the fix-ShouldRestart branch 2 times, most recently from c30bdd2 to e5aeaf6 on April 8, 2016 03:31
@WeiZhang555
Contributor Author

Rebased. Waiting for Janky to turn green 😄

Currently, if you restart the Docker daemon, all containers with restart
policy `on-failure` are started regardless of their `RestartCount`, which
makes the daemon take extra time to restart.

This commit stops these containers from doing an unnecessary start when
the daemon restarts.

Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
@thaJeztah added this to the 1.12.0 milestone on Apr 11, 2016
@vdemeester
Member

LGTM 👍

@vdemeester merged commit a692910 into moby:master on Apr 11, 2016
@thaJeztah
Member

Thanks @WeiZhang555!

@WeiZhang555
Contributor Author

😆

@WeiZhang555 deleted the fix-ShouldRestart branch on April 11, 2016 12:10
@thaJeztah
Member

Added "changelog" label; May be a nice optimization to mention in the change logs
