libpod: wait another interval for healthcheck #22764
Conversation
Branch force-pushed from 1c4f37b to 86c0acd
Ephemeral COPR build failed. @containers/packit-build please check.
Branch force-pushed from 86c0acd to 71c98e8
ready for review
libpod/container_api.go (outdated)
if !c.valid {
	return -1, define.ErrCtrRemoved
}
@mheon PTAL
Not sure how I feel about this; the container struct is invalid at this point, so taking the lock now, taking the lock later, etc. is no longer correct.
You are correct. The locking model is explicitly built to bail the moment a container struct is invalid. Another container may already hold this lock number, which can lead to us deadlocking ourselves if we do anything after this point that takes a container lock. I don't see anything right now that does that, but that doesn't mean it won't be added in the future.
Also, we can't really guarantee that any of the container's functions that touch the database will actually work after this point. It seems extremely risky to continue.
I saw that the code in this case would just end up in c.runtime.state.GetContainerExitCode() and retrieve the status from there; but sure, let's make this safer. I'll just call c.runtime.state.GetContainerExitCode() if !c.valid; would that work for you?
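A rough sketch of what that fallback might look like; this is illustrative only, and the exact GetContainerExitCode signature is assumed rather than taken from the libpod code:

// Sketch: when the container struct is no longer valid, skip any further
// locking and read the exit code straight from the state database instead
// of returning ErrCtrRemoved.
if !c.valid {
	// Assumed signature: GetContainerExitCode(id string) (int32, error).
	exitCode, err := c.runtime.state.GetContainerExitCode(c.ID())
	if err != nil {
		return -1, err
	}
	return exitCode, nil
}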
sounds good to me at least
libpod/container_internal.go (outdated)
// release the lock while waiting for healthcheck processes to terminate.
if !c.batched {
	c.lock.Unlock()
}
if err := c.removeTransientFiles(ctx, c.config.StartupHealthCheckConfig != nil && !c.state.StartupHCPassed); err != nil {
	logrus.Errorf("Removing timer for container %s healthcheck: %v", c.ID(), err)
}
if !c.batched {
	c.lock.Lock()
}
What is the point of the unlock? This seems super dangerous: we can now end up in this same code path twice, and the code cannot handle that properly if it happens.
Also, unlocking in the middle of the code is bad coding IMO; it completely breaks any sane locking model that a person can keep track of. cleanup() can be called from so many different places, even during init; if we start to unlock during it, who knows how many impossible-to-debug race conditions we get.
And going by bugs like #22651, it is practically impossible to catch these bugs during review.
Killing other podman processes in the middle of whatever they are doing is also bad practice and causes all sorts of race issues. Before this patch the current code just exits on SIGTERM; that is why I added the shutdown inhibit. I don't see how the healthcheck code handles that; the files are not even written in an atomic fashion. The race we hit in the CI is the healthcheck not being able to record the ready status because it can be killed before doing so, and the test I've added causes the process to terminate immediately.
If we want a clean shutdown for the healthcheck process (and I think we should aim for that), we cannot hold the lock, since the healthcheck process takes the lock during exec().
Either podman healthcheck run is already in a locked state, which means we cannot be here in cleanup at the same time, or podman healthcheck run is not locked, in which case we can just kill it IMO: being in cleanup must mean the container is not running, so the healthcheck cannot run and would not get a proper result anyway.
I think the fundamental issue is that we cannot expect a healthcheck to work on a short-lived container.
> Either podman healthcheck run is already in a locked state, which means we cannot be here in cleanup at the same time, or podman healthcheck run is not locked, in which case we can just kill it IMO: being in cleanup must mean the container is not running, so the healthcheck cannot run and would not get a proper result anyway.
In theory yes, but in practice the healthcheck code runs unlocked because it calls a bunch of different operations that do their locking internally. If we try to change that, it is enough to forget one occurrence and we end up in a deadlock.
> I think the fundamental issue is that we cannot expect a healthcheck to work on a short-lived container.
I didn't foresee opening this can of worms when I added the new test to the CI. But now that I am aware of the root issue, I expect the container to transition to healthy at some point without having to add an artificial delay to it:
podman run --health-cmd="touch /terminate" --sdnotify=healthy $IMAGE sh -c 'while test \! -e /terminate; do sleep 0.1; done'
Yeah, you are right; in your test case we definitely can execute the healthcheck once.
So basically the issue is that we write to the sdnotify socket from the podman run command and not from the podman healthcheck run command, which also means that podman run -d blocks until the container is healthy.
Ideally we could have podman healthcheck run write directly to the notify socket, but I guess systemd will not allow this since that process is not part of the unit, so this will not work.
I also just realized that you are changing the wait behavior, and this code is part of the public podman wait API (podman wait --condition healthy), so we need to be careful not to cause regressions.
Maybe it is time to re-design this logic. I am thinking about having the podman run command expose a unix socket that the podman healthcheck run command can write to as a signal that it is healthy now. Then podman run can send the notify message and exit, so we would not have to use the polling WaitForConditionWithInterval() at all. We can make podman healthcheck run just ignore failures when the socket cannot be connected to (for future healthcheck runs), or store the fact that it did connect in the container state db.
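A rough sketch of how that signalling could look; all of these helpers are hypothetical, and nothing like them exists in podman today:

package healthsignal

import "net"

// Hypothetical sketch of the proposed design; socket path handling and
// error reporting are intentionally simplified.

// waitForHealthySignal would run on the `podman run` side: block until
// `podman healthcheck run` connects, after which the caller can send
// READY=1 on the sd-notify socket and exit.
func waitForHealthySignal(socketPath string) error {
	l, err := net.Listen("unix", socketPath)
	if err != nil {
		return err
	}
	defer l.Close()

	conn, err := l.Accept() // blocks until the healthcheck process connects
	if err != nil {
		return err
	}
	return conn.Close()
}

// signalHealthy would run from `podman healthcheck run` after a successful
// check: connect once and ignore failures, so later healthcheck runs keep
// working after `podman run` has gone away.
func signalHealthy(socketPath string) {
	conn, err := net.Dial("unix", socketPath)
	if err != nil {
		return // socket gone: podman run already exited, nothing to signal
	}
	conn.Close()
}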
That sounds good, but do you agree that such a re-design is out of scope for this PR? Let's deal with the CI flake (which is a real issue) first.
Well, IMO the unlock in the critical section is not safe, so I do not want to merge it like this.
> I do not want to merge it like this.
So far the only alternative seems to be inventing a new IPC mechanism just to handle sdnotify=healthy.
Assuming the new IPC mechanism is the way to go (it might be a bit overkill for a feature that was broken for almost a year and nobody noticed), how do we deal with it in the short term?
Should we just remove the test?
The unlock pattern has already been used with --sdnotify=healthy since its initial implementation: 0cfd127#diff-694b11f8e3815ed5986dc6e81cbc40ecff32b50938a38b44c8d0cb405adb3330R1254-R1260
Right, start did it at the end, which is less risky, but it still caused the initial bug here; unlocking in the middle is by no means simple and has to be avoided at all costs IMO.
Unlocking in cleanup is much worse because we can now get two commands into cleanup directly after each other, and the code is not set up to handle that, I think. At the very least I already have little confidence in the correctness of the code there. Adding more failure points will hurt us given how many places cleanup is called from.
If the goal is to fix the flake, then let's disable the test until we can work on a better fix.
Well, if this is what you want: I dropped the last patch and added a new one to disable the tests.
wait for another interval when the container transitioned to "stopped" to give more time to the healthcheck status to change. Closes: containers#22760 Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Now WaitForExit returns the exit code as stored in the db instead of returning an error when the container was removed. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Branch force-pushed from 71c98e8 to 1a7d1e3
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Branch force-pushed from 1a7d1e3 to 7f567a4
LGTM
@mheon PTAL
We should create a new issue once this is merged for the proper race-free fix.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: giuseppe, Luap99.
Approvers can indicate their approval by writing /lgtm.
wait for another interval when the container transitioned to "stopped" to give more time to the healthcheck status to change.
Closes: #22760
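For reference, a simplified sketch of the polling idea behind this change; it is illustrative only, not the actual WaitForConditionWithInterval code, and the getState callback is made up:

package wait

import (
	"context"
	"time"
)

// waitForHealthy polls a container's state and, when the container is first
// seen as stopped, waits one more interval before giving up, since the last
// healthcheck result may not have been recorded yet.
func waitForHealthy(ctx context.Context, interval time.Duration,
	getState func() (running, healthy bool, err error)) (bool, error) {

	sawStopped := false
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		running, healthy, err := getState()
		if err != nil {
			return false, err
		}
		if healthy {
			return true, nil
		}
		if !running {
			if sawStopped {
				// Second look after the stop: the status will not
				// change anymore, so give up.
				return false, nil
			}
			// First time we see the container stopped: wait one more
			// interval before concluding it never became healthy.
			sawStopped = true
		}

		select {
		case <-ctx.Done():
			return false, ctx.Err()
		case <-ticker.C:
		}
	}
}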