libpod: wait another interval for healthcheck #22764

Conversation

giuseppe
Member

Wait for another interval when the container transitions to "stopped", to give the healthcheck status more time to change.

Closes: #22760

None
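As a rough, self-contained sketch of the behavior described above (hypothetical names, Go for illustration only, not the actual libpod diff):

package main

import (
	"errors"
	"fmt"
	"time"
)

// pollHealth keeps polling the healthcheck status at a fixed interval.
// When the container is seen as stopped, it allows one more interval before
// giving up, so a final healthcheck run still has a chance to record its result.
func pollHealth(interval time.Duration, probe func() (status string, stopped bool, err error)) (string, error) {
	waitedExtra := false
	for {
		status, stopped, err := probe()
		if err != nil {
			return "", err
		}
		if status == "healthy" || status == "unhealthy" {
			return status, nil
		}
		if stopped {
			if waitedExtra {
				// the container stopped and we already waited one extra interval
				return status, errors.New("container is stopped")
			}
			waitedExtra = true
		}
		time.Sleep(interval)
	}
}

func main() {
	// Toy probe: the container stops right away, but the healthcheck result
	// only becomes visible on the second poll; the extra interval covers it.
	calls := 0
	status, err := pollHealth(50*time.Millisecond, func() (string, bool, error) {
		calls++
		if calls >= 2 {
			return "healthy", true, nil
		}
		return "starting", true, nil
	})
	fmt.Println(status, err) // prints: healthy <nil>
}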

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 20, 2024
@giuseppe giuseppe force-pushed the give-more-time-to-healthcheck-status-change branch 6 times, most recently from 1c4f37b to 86c0acd Compare May 22, 2024 09:16

Ephemeral COPR build failed. @containers/packit-build please check.

@giuseppe giuseppe force-pushed the give-more-time-to-healthcheck-status-change branch from 86c0acd to 71c98e8 Compare May 22, 2024 09:30
@giuseppe giuseppe marked this pull request as ready for review May 22, 2024 12:58
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 22, 2024
@giuseppe
Member Author

ready for review

Comment on lines 584 to 586
if !c.valid {
	return -1, define.ErrCtrRemoved
}
Member

@mheon PTAL

Not sure how I feel about this: the container struct is invalid at this point, so taking the lock (or taking it later, etc.) is no longer correct.

Member

You are correct. The locking model is explicitly built to bail the moment a container struct is invalid. Another container may already hold this lock number, which can lead to us deadlocking ourselves if we do anything after this point that takes a container lock. I don't see anything right now that does that, but that doesn't mean it won't be added in the future.

Member

Also, we can't really guarantee that any of the container's functions that touch the database will actually work after this point. It seems extremely risky to continue.

Member Author

I saw that the code in this case would just end up in c.runtime.state.GetContainerExitCode() and retrieve the status from there; but sure, let's make this safer. I'll just call c.runtime.state.GetContainerExitCode() if !c.valid; would that work for you?
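A minimal sketch of that fallback, assuming a stand-in interface for libpod's state database (the shape below is illustrative, not the merged WaitForExit code):

// stateDB is a stand-in for the subset of libpod's state used here.
type stateDB interface {
	GetContainerExitCode(id string) (int32, error)
}

// exitCodeFor returns the live exit code while the container struct is still
// valid; once the container has been removed (valid == false) it skips the
// container lock entirely and reads the stored exit code from the database.
func exitCodeFor(valid bool, id string, live func() (int32, error), db stateDB) (int32, error) {
	if !valid {
		return db.GetContainerExitCode(id)
	}
	return live()
}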

Member

sounds good to me at least

Comment on lines 2049 to 2057
// release the lock while waiting for healthcheck processes to terminate.
if !c.batched {
	c.lock.Unlock()
}
if err := c.removeTransientFiles(ctx, c.config.StartupHealthCheckConfig != nil && !c.state.StartupHCPassed); err != nil {
	logrus.Errorf("Removing timer for container %s healthcheck: %v", c.ID(), err)
}
if !c.batched {
	c.lock.Lock()
Member

What is the point of the unlock?
This seems super dangerous, as we can now end up in this same code path twice, which it cannot handle properly if that happens.

Also, the unlock in the middle of the code is bad coding IMO; it completely breaks any sane locking model that a person can keep track of. cleanup() can be called from so many different places, even during init; if we start to unlock during that, who knows how many impossible-to-debug race conditions we get.

And going by bugs like #22651, it is practically impossible to catch these bugs during review.

Member Author

Killing other podman processes in the middle of whatever they are doing is also a bad practice and causes all sorts of race issues. The current code just exits on SIGTERM before this patch; that is why I added the shutdown inhibit. I don't see how the healthcheck code is handling that, as the files are not even written in an atomic fashion. The race we hit in the CI is the healthcheck not being able to record the ready status, since it can be killed before doing so, and the test I've added causes the process to terminate immediately.

If we want a clean shutdown for the healthcheck process (and I think we should aim for that), we cannot hold the lock, since the healthcheck process takes the lock during exec().
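A toy illustration of that ordering problem (not libpod code: the container lock is simulated with a plain sync.Mutex, and TryLock is used only so the example terminates instead of actually deadlocking):

package main

import (
	"fmt"
	"sync"
	"time"
)

// "cleanup" holds the container lock while waiting for the healthcheck
// process, but the healthcheck itself needs that same lock to do its work.
func main() {
	var containerLock sync.Mutex

	// cleanup: takes the lock, then waits for the healthcheck to finish.
	containerLock.Lock()
	done := make(chan struct{})

	// healthcheck run: needs the container lock before it can record its result.
	go func() {
		defer close(done)
		for i := 0; i < 5; i++ {
			if containerLock.TryLock() {
				fmt.Println("healthcheck: got the lock, recording result")
				containerLock.Unlock()
				return
			}
			fmt.Println("healthcheck: lock is held by cleanup, cannot proceed")
			time.Sleep(100 * time.Millisecond)
		}
		fmt.Println("healthcheck: gave up; with a blocking Lock() this would deadlock")
	}()

	<-done // cleanup waits for the healthcheck while still holding the lock
	containerLock.Unlock()
}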

Member

Either podman healthcheck run is already in a locked state, which means we cannot be here in cleanup at the same time, or podman healthcheck run is not locked, in which case we can just kill it IMO: being in cleanup must mean the container is not running, which means the healthcheck cannot run anyway, so it would not get a proper healthcheck result anyway.

I think the fundamental issue is that we cannot expect a healthcheck to work on a short-lived container.

Member Author

Either podman healthcheck run is already in a locked state, which means we cannot be here in cleanup at the same time, or podman healthcheck run is not locked, in which case we can just kill it IMO: being in cleanup must mean the container is not running, which means the healthcheck cannot run anyway, so it would not get a proper healthcheck result anyway.

In theory yes, but in practice the healthcheck code runs unlocked, as it calls a bunch of different operations that do the locking internally. If we try to change that, it is enough to forget one occurrence and we end up in a deadlock.

I think the fundamental issue is that we cannot expect a healthcheck to work on a short-lived container.

I didn't foresee opening this can of worms when I added the new test to the CI. But now that I am aware of the root issue, I expect the container to transition to healthy at some point without having to add an artificial delay:

podman run --health-cmd="touch /terminate" --sdnotify=healthy $IMAGE sh -c 'while test \! -e /terminate; do sleep 0.1; done'

Member

Yeah, you are right: in your test case we definitely can execute the healthcheck once.

So basically the issue is that we write to the sdnotify socket from the podman run command and not from the podman healthcheck run command; this also means that podman run -d blocks until the container is healthy.

Ideally we could have podman healthcheck run write directly to the notify socket, but I guess systemd will not allow this, as it is not part of the unit, so this will not work.

I also just realized that you are changing the wait behavior, and this code is part of the public podman wait API (podman wait --condition healthy), so we need to be careful not to cause regressions.

Maybe it is time to re-design this logic. I am thinking about having the podman run command expose a unix socket that the podman healthcheck run command can write to as a signal that the container is healthy now. Then podman run can send the notify message and exit, so we would not have to use the polling WaitForConditionWithInterval() at all. We can make podman healthcheck run just ignore failures when the socket cannot be connected to on future healthcheck runs, or store the fact that it did connect in the container state db.
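A very rough sketch of that idea (all socket paths and function names here are hypothetical; this is not an implementation of the proposal, just the shape of it):

package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
	"time"
)

// "podman run" side: listen on a per-container unix socket and block until
// the healthcheck process connects, then forward READY=1 to systemd.
func waitUntilHealthy(runDir string) error {
	sock := filepath.Join(runDir, "healthy.sock") // hypothetical path
	l, err := net.Listen("unix", sock)
	if err != nil {
		return err
	}
	defer l.Close()
	conn, err := l.Accept() // first connection == "healthy" signal
	if err != nil {
		return err
	}
	conn.Close()
	// here the real code would send the sd_notify READY=1 message and exit
	fmt.Println("container reported healthy, notifying systemd")
	return nil
}

// "podman healthcheck run" side: after a successful check, best-effort
// connect to the socket; ignore errors once it is gone on later runs.
func signalHealthy(runDir string) {
	sock := filepath.Join(runDir, "healthy.sock")
	conn, err := net.DialTimeout("unix", sock, time.Second)
	if err != nil {
		return // socket already gone or never created: ignore
	}
	conn.Close()
}

func main() {
	dir, _ := os.MkdirTemp("", "hc-demo")
	defer os.RemoveAll(dir)
	go func() {
		time.Sleep(200 * time.Millisecond)
		signalHealthy(dir) // simulate the healthcheck side
	}()
	if err := waitUntilHealthy(dir); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}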

Member Author

That sounds good, but do you agree that such a re-design is out of scope for this PR? Let's deal with the CI flake (which is a real issue) first.

Member

Well, IMO the unlock in the critical section is not safe, so I do not want to merge this like this.

Member Author

I do not want to merge this like this.

So far the only alternative seems to be inventing a new IPC mechanism just to handle sdnotify=healthy.

Assuming the new IPC mechanism is the way to go (it might be a bit overkill for a feature that was broken for almost a year and nobody noticed), how do we deal with it in the short term?

Should we just remove the test?

The unlock pattern has been used with --sdnotify=healthy since its initial implementation:

0cfd127#diff-694b11f8e3815ed5986dc6e81cbc40ecff32b50938a38b44c8d0cb405adb3330R1254-R1260

Member

Right, start did it at the end, which is less risky, but it still caused the initial bug here, which means unlocking in the middle is by no means simple and has to be avoided at all costs IMO.

Unlocking in cleanup is much worse, because we can now get two commands into cleanup directly after each other, and the code is not set up to handle this, I think. At least, I have little confidence in the correctness of the code there already. Adding more failure points will hurt us, given how many places cleanup is called from.

If the goal is to fix the flake, then let's disable the test until we can work on a better fix.

Member Author

Well, if this is what you want: I dropped the last patch and added a new one to disable the tests.

giuseppe added 2 commits May 22, 2024 21:34
wait for another interval when the container transitions to "stopped"
to give the healthcheck status more time to change.

Closes: containers#22760

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Now WaitForExit returns the exit code as stored in the db instead of
returning an error when the container was removed.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@giuseppe giuseppe force-pushed the give-more-time-to-healthcheck-status-change branch from 71c98e8 to 1a7d1e3 Compare May 22, 2024 19:34
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@giuseppe giuseppe force-pushed the give-more-time-to-healthcheck-status-change branch from 1a7d1e3 to 7f567a4 Compare May 27, 2024 11:02
Member

@Luap99 Luap99 left a comment

LGTM
@mheon PTAL

We should create a new issue once this is merged for the proper race-free fix.

Contributor

openshift-ci bot commented May 28, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mheon
Copy link
Member

mheon commented May 28, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 28, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit af8fe2b into containers:main May 28, 2024
89 checks passed
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Aug 27, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Aug 27, 2024

Successfully merging this pull request may close these issues.

podman healthcheck + sdnotify: Error: container is stopped