-
Notifications
You must be signed in to change notification settings - Fork 18.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix ShouldRestart for on-failure handle #20853
Conversation
Should we have a test for this? change sgtm, although the code is becoming a bit complex to read, possibly splitting it up at some point would be good for readability |
This test case will need to restart docker daemon, I'm not familiar with that part, could someone give me some instructions? And I doubt that it will become a flaky test, because it need to capture the container's restarting/running state. |
ping @vdemeester wdyt, is it worth a test, or are we okay without? |
c1fddff
to
b5607b0
Compare
@thaJeztah I figure out how to make the test case now, I think it'll work 😄 |
@@ -505,7 +505,8 @@ func copyEscapable(dst io.Writer, src io.ReadCloser, keys []byte) (written int64 | |||
func (container *Container) ShouldRestart() bool { | |||
return container.HostConfig.RestartPolicy.Name == "always" || | |||
(container.HostConfig.RestartPolicy.Name == "unless-stopped" && !container.HasBeenManuallyStopped) || | |||
(container.HostConfig.RestartPolicy.Name == "on-failure" && container.ExitCode != 0) | |||
(container.HostConfig.RestartPolicy.Name == "on-failure" && container.ExitCode != 0 && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have container/monitor.go
use this fn as well? It looks like it's defining it's own rules, which are exactly the same.
6ae40a4
to
1e52a09
Compare
@cpuguy83 I made an implementation for having |
1e52a09
to
9be9bdb
Compare
Double checked this change, I believe that it should be right, Janky's test result also supported me. /cc @vdemeester @calavera |
WindowsTP4 timed out after 2 hours 😢 restarting |
@thaJeztah Looks like 2 hours is definitely insufficient for Windows now 🔮 |
@WeiZhang555 yeah, I brought it up yesterday, but we think #21017 will be a better solution than raising the timeout limit. Perhaps I should discuss it again, even if just temporarily raising the limit 😢 |
9be9bdb
to
d86f464
Compare
ping @WeiZhang555 sorry, needs a rebase now 😢 |
d86f464
to
a84f4b2
Compare
@thaJeztah Rebased, thank you for your reminder! /cc @cpuguy83 |
LGTM, thanks! |
case rp.IsAlways(): | ||
return true | ||
case rp.IsUnlessStopped(): | ||
if !container.HasBeenManuallyStopped { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a nit but I would have inverted the condition, so it's if container.HasBeenManuallyStopped {
instead (feels easier to read).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
a84f4b2
to
a7bef0f
Compare
ping @vdemeester PR was updated, ptal |
@thaJeztah Thank you 😜 |
LGTM 🐼 |
@thaJeztah Never mind, this change is still necessary as I can see, just need some time to do more investigate, and also another code review 😢 |
a7bef0f
to
c8f21e6
Compare
@cpuguy83 because of the change brought by I can try to do the refactor after I get more familiar with new code if you don't mind. Also ping @vdemeester @thaJeztah for another code review. 😄 |
Re-LGTM if it's 📗 |
@@ -310,6 +310,8 @@ func (daemon *Daemon) restore() error { | |||
} | |||
} | |||
|
|||
// reset container restart count | |||
c.RestartManager(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really want to reset all container's restart policies here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RestartManager decides whether the container should be restarted based on its failureCount
, If we don't reset container's restart count, a container with on-failure:3
may have a RestartCount
5. e.g.
- container failed 2 times,
RestartCount
=2 - daemon shutdown.
- daemon reboot, RestartCount not reset, container failed and restart 3 times, RestartCount turns out to be 2+3=5.
This is incorrect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is a problem with some internal counter that's not persisted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because we used two variables to calculate ShouldRestart: failureCount and RestartPolicy.
OK, I'd better merge these two params into one again, have had done it before, I'll try to migrate that part into this.
ping @cpuguy83 ptal, what's the status here? @WeiZhang555 looks like it needs a rebase 😢 |
I think we're just waiting on the changes mentioned above. |
Yes, sorry for my delay. I'll be on travel these days and can't get reach of my computer.I'll post more updates once I come back in 3 or 4 days. |
@WeiZhang555 no worries; we were just going through some old PR's, thanks in advance! |
c8f21e6
to
622f976
Compare
I'm back again, the code is updated based on previous comment, please help take a look. @cpuguy83 |
restart = true | ||
case rm.policy.IsOnFailure(): | ||
// the default value of 0 for MaximumRetryCount means that we will not enforce a maximum count | ||
if max := rm.policy.MaximumRetryCount; max == 0 || rm.failureCount <= max { | ||
if max := rm.policy.MaximumRetryCount; max == 0 || rm.restartCount < max { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come you changed this to just <
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replaced the "failureCount" in restartManager
with restartCount
for reusing the restartManager.ShouldRestart()
in container.ShouldRestart()
, when using failureCount, we're doing this:
- failureCount++
- Judge if failureCount <= max
After replaced withrestartCount
, we're doing: - Judge if restartCount < max
- if need restart, restartCount++
And they have different meaning, restartCount
means how many times container has been restarted before, it must be less than MaximumRetryCount, if the limit is reached, we shall not restart it again.
1 minor question, otherwise LGTM |
ping @cpuguy83 |
LGTM but sadly needs a rebase again :( |
c30bdd2
to
e5aeaf6
Compare
Rebased. Waiting Janky turn to green 😄 |
Currently if you restart docker daemon, all the containers with restart policy `on-failure` regardless of its `RestartCount` will be started, this will make daemon cost more extra time for restart. This commit will stop these containers to do unnecessary start on daemon's restart. Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
e5aeaf6
to
51e42e6
Compare
LGTM 👍 |
Thanks @WeiZhang555! |
😆 |
Added "changelog" label; May be a nice optimization to mention in the change logs |
Currently if you restart docker daemon, all the containers with restart
policy
on-failure
regardless of itsRestartCount
will be started, thiswill make daemon cost more extra time for restart.
This commit will stop these containers to do unnecessary start on
daemon's restart.
How to reproduce:
on-failure:3
$docker run -tid --restart on-failure:3 busybox sh -c "sleep 1; exit 127"
docker ps
can't see it.docker ps
shows that the stopped container is started, and fails, and restart 3 times again.This is unnecessary because it already failed 3 times last time daemon is running, and starting this container will apparently lower the restart speed of daemon.
Edit: make some small refactor
Merge
Container.ShouldRestart()
andcontainerMonitor.shouldRestart()
, sothey can share same logic. Besides that, some duplicated fields of
containerMonitor are removed.
Signed-off-by: Zhang Wei zhangwei555@huawei.com