libpod: wait another interval for healthcheck #22764

Conversation

giuseppe
Member

Wait for another interval when the container transitions to "stopped", to give the healthcheck status more time to change.

Closes: #22760

None
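As a rough, self-contained sketch of the behavior described above (hypothetical names, Go for illustration only, not the actual libpod diff):

package main

import (
	"errors"
	"fmt"
	"time"
)

// pollHealth keeps polling the healthcheck status at a fixed interval.
// When the container is seen as stopped, it allows one more interval before
// giving up, so a final healthcheck run still has a chance to record its result.
func pollHealth(interval time.Duration, probe func() (status string, stopped bool, err error)) (string, error) {
	waitedExtra := false
	for {
		status, stopped, err := probe()
		if err != nil {
			return "", err
		}
		if status == "healthy" || status == "unhealthy" {
			return status, nil
		}
		if stopped {
			if waitedExtra {
				// the container stopped and we already waited one extra interval
				return status, errors.New("container is stopped")
			}
			waitedExtra = true
		}
		time.Sleep(interval)
	}
}

func main() {
	// Toy probe: the container stops right away, but the healthcheck result
	// only becomes visible on the second poll; the extra interval covers it.
	calls := 0
	status, err := pollHealth(50*time.Millisecond, func() (string, bool, error) {
		calls++
		if calls >= 2 {
			return "healthy", true, nil
		}
		return "starting", true, nil
	})
	fmt.Println(status, err) // prints: healthy <nil>
}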

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 20, 2024
@giuseppe giuseppe force-pushed the give-more-time-to-healthcheck-status-change branch 6 times, most recently from 1c4f37b to 86c0acd Compare May 22, 2024 09:16

Ephemeral COPR build failed. @containers/packit-build please check.

@giuseppe giuseppe force-pushed the give-more-time-to-healthcheck-status-change branch from 86c0acd to 71c98e8 Compare May 22, 2024 09:30
@giuseppe giuseppe marked this pull request as ready for review May 22, 2024 12:58
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 22, 2024
@giuseppe
Member Author

ready for review

Comment on lines 584 to 586
if !c.valid {
	return -1, define.ErrCtrRemoved
}
Member

@mheon PTAL

Not sure how I feel about this: the container struct is invalid at this point, so taking the lock (or taking it later, etc.) is no longer correct.

Member

You are correct. The locking model is explicitly built to bail the moment a container struct is invalid. Another container may already hold this lock number, which can lead to us deadlocking ourselves if we do anything after this point that takes a container lock. I don't see anything right now that does that, but that doesn't mean it won't be added in the future.

Member

Also, we can't really guarantee that any of the container's functions that touch the database will actually work after this point. It seems extremely risky to continue.

Member Author

I saw that the code in this case would just end up in c.runtime.state.GetContainerExitCode() and retrieve the status from there; but sure, let's make this safer. I'll just call c.runtime.state.GetContainerExitCode() if !c.valid; would that work for you?
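A minimal sketch of that fallback, assuming a stand-in interface for libpod's state database (the shape below is illustrative, not the merged WaitForExit code):

// stateDB is a stand-in for the subset of libpod's state used here.
type stateDB interface {
	GetContainerExitCode(id string) (int32, error)
}

// exitCodeFor returns the live exit code while the container struct is still
// valid; once the container has been removed (valid == false) it skips the
// container lock entirely and reads the stored exit code from the database.
func exitCodeFor(valid bool, id string, live func() (int32, error), db stateDB) (int32, error) {
	if !valid {
		return db.GetContainerExitCode(id)
	}
	return live()
}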

Member

sounds good to me at least

Comment on lines 2049 to 2057
// release the lock while waiting for healthcheck processes to terminate.
if !c.batched {
	c.lock.Unlock()
}
if err := c.removeTransientFiles(ctx, c.config.StartupHealthCheckConfig != nil && !c.state.StartupHCPassed); err != nil {
	logrus.Errorf("Removing timer for container %s healthcheck: %v", c.ID(), err)
}
if !c.batched {
	c.lock.Lock()
Member

What is the point of the unlock?
This seems super dangerous, as we can now end up in this same code path twice, which it cannot handle properly if that happens.

Also, the unlock in the middle of the code is bad coding IMO; it completely breaks any sane locking model that a person can keep track of. cleanup() can be called from so many different places, even during init; if we start to unlock during that, who knows how many impossible-to-debug race conditions we get.

And going by bugs like #22651, it is practically impossible to catch these bugs during review.

Member Author

Killing other podman processes in the middle of whatever they are doing is also a bad practice and causes all sorts of race issues. The current code just exits on SIGTERM before this patch; that is why I added the shutdown inhibit. I don't see how the healthcheck code is handling that, as the files are not even written in an atomic fashion. The race we hit in the CI is the healthcheck not being able to record the ready status, since it can be killed before doing so, and the test I've added causes the process to terminate immediately.

If we want a clean shutdown for the healthcheck process (and I think we should aim for that), we cannot hold the lock, since the healthcheck process takes the lock during exec().
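A toy illustration of that ordering problem (not libpod code: the container lock is simulated with a plain sync.Mutex, and TryLock is used only so the example terminates instead of actually deadlocking):

package main

import (
	"fmt"
	"sync"
	"time"
)

// "cleanup" holds the container lock while waiting for the healthcheck
// process, but the healthcheck itself needs that same lock to do its work.
func main() {
	var containerLock sync.Mutex

	// cleanup: takes the lock, then waits for the healthcheck to finish.
	containerLock.Lock()
	done := make(chan struct{})

	// healthcheck run: needs the container lock before it can record its result.
	go func() {
		defer close(done)
		for i := 0; i < 5; i++ {
			if containerLock.TryLock() {
				fmt.Println("healthcheck: got the lock, recording result")
				containerLock.Unlock()
				return
			}
			fmt.Println("healthcheck: lock is held by cleanup, cannot proceed")
			time.Sleep(100 * time.Millisecond)
		}
		fmt.Println("healthcheck: gave up; with a blocking Lock() this would deadlock")
	}()

	<-done // cleanup waits for the healthcheck while still holding the lock
	containerLock.Unlock()
}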

Member

Either podman healthcheck run is already in a locked state, which means we cannot be here in cleanup at the same time, or podman healthcheck run is not locked, in which case we can just kill it IMO: being in cleanup must mean the container is not running, which means the healthcheck cannot run anyway, so it would not get a proper healthcheck result anyway.

I think the fundamental issue is that we cannot expect a healthcheck to work on a short-lived container.

Member Author

Either podman healthcheck run is already in a locked state, which means we cannot be here in cleanup at the same time, or podman healthcheck run is not locked, in which case we can just kill it IMO: being in cleanup must mean the container is not running, which means the healthcheck cannot run anyway, so it would not get a proper healthcheck result anyway.

In theory yes, but in practice the healthcheck code runs unlocked, as it calls a bunch of different operations that do the locking internally. If we try to change that, it is enough to forget one occurrence and we end up in a deadlock.

I think the fundamental issue is that we cannot expect a healthcheck to work on a short-lived container.

I didn't foresee opening this can of worms when I added the new test to the CI. But now that I am aware of the root issue, I expect the container to transition to healthy at some point without having to add an artificial delay:

podman run --health-cmd="touch /terminate" --sdnotify=healthy $IMAGE sh -c 'while test \! -e /terminate; do sleep 0.1; done'

Member

Yeah, you are right: in your test case we definitely can execute the healthcheck once.

So basically the issue is that we write to the sdnotify socket from the podman run command and not from the podman healthcheck run command; this also means that podman run -d blocks until the container is healthy.

Ideally we could have podman healthcheck run write directly to the notify socket, but I guess systemd will not allow this, as it is not part of the unit, so this will not work.

I also just realized that you are changing the wait behavior, and this code is part of the public podman wait API (podman wait --condition healthy), so we need to be careful not to cause regressions.

Maybe it is time to re-design this logic. I am thinking about having the podman run command expose a unix socket that the podman healthcheck run command can write to as a signal that the container is healthy now. Then podman run can send the notify message and exit, so we would not have to use the polling WaitForConditionWithInterval() at all. We can make podman healthcheck run just ignore failures when the socket cannot be connected to on future healthcheck runs, or store the fact that it did connect in the container state db.
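A very rough sketch of that idea (all socket paths and function names here are hypothetical; this is not an implementation of the proposal, just the shape of it):

package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
	"time"
)

// "podman run" side: listen on a per-container unix socket and block until
// the healthcheck process connects, then forward READY=1 to systemd.
func waitUntilHealthy(runDir string) error {
	sock := filepath.Join(runDir, "healthy.sock") // hypothetical path
	l, err := net.Listen("unix", sock)
	if err != nil {
		return err
	}
	defer l.Close()
	conn, err := l.Accept() // first connection == "healthy" signal
	if err != nil {
		return err
	}
	conn.Close()
	// here the real code would send the sd_notify READY=1 message and exit
	fmt.Println("container reported healthy, notifying systemd")
	return nil
}

// "podman healthcheck run" side: after a successful check, best-effort
// connect to the socket; ignore errors once it is gone on later runs.
func signalHealthy(runDir string) {
	sock := filepath.Join(runDir, "healthy.sock")
	conn, err := net.DialTimeout("unix", sock, time.Second)
	if err != nil {
		return // socket already gone or never created: ignore
	}
	conn.Close()
}

func main() {
	dir, _ := os.MkdirTemp("", "hc-demo")
	defer os.RemoveAll(dir)
	go func() {
		time.Sleep(200 * time.Millisecond)
		signalHealthy(dir) // simulate the healthcheck side
	}()
	if err := waitUntilHealthy(dir); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}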

Member Author

That sounds good, but do you agree that such a re-design is out of scope for this PR? Let's deal with the CI flake (which is a real issue) first.

Member

Well, IMO the unlock in the critical section is not safe, so I do not want to merge this like this.

Member Author

I do not want to merge this like this.

So far the only alternative seems to be inventing a new IPC mechanism just to handle sdnotify=healthy.

Assuming the new IPC mechanism is the way to go (it might be a bit overkill for a feature that was broken for almost a year and nobody noticed), how do we deal with it in the short term?

Should we just remove the test?

The unlock pattern has been used with --sdnotify=healthy since its initial implementation:

0cfd127#diff-694b11f8e3815ed5986dc6e81cbc40ecff32b50938a38b44c8d0cb405adb3330R1254-R1260

Member

Right, start did it at the end, which is less risky, but it still caused the initial bug here, which means unlocking in the middle is by no means simple and has to be avoided at all costs IMO.

Unlocking in cleanup is much worse, because we can now get two commands into cleanup directly after each other, and the code is not set up to handle this, I think. At least, I have little confidence in the correctness of the code there already. Adding more failure points will hurt us, given how many places cleanup is called from.

If the goal is to fix the flake, then let's disable the test until we can work on a better fix.

Member Author

Well, if this is what you want: I dropped the last patch and added a new one to disable the tests.

giuseppe added 2 commits May 22, 2024 21:34
wait for another interval when the container transitions to "stopped"
to give the healthcheck status more time to change.

Closes: containers#22760

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Now WaitForExit returns the exit code as stored in the db instead of
returning an error when the container was removed.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@giuseppe giuseppe force-pushed the give-more-time-to-healthcheck-status-change branch from 71c98e8 to 1a7d1e3 Compare May 22, 2024 19:34
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@giuseppe giuseppe force-pushed the give-more-time-to-healthcheck-status-change branch from 1a7d1e3 to 7f567a4 Compare May 27, 2024 11:02
Member

@Luap99 Luap99 left a comment

LGTM
@mheon PTAL

We should create a new issue once this is merged for the proper race-free fix.

Contributor

openshift-ci bot commented May 28, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mheon
Copy link
Member

mheon commented May 28, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 28, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit af8fe2b into containers:main May 28, 2024
89 checks passed
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Aug 27, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Aug 27, 2024

Successfully merging this pull request may close these issues.

podman healthcheck + sdnotify: Error: container is stopped