auto-update: test getting "rolled back" instead of "true" #17607
Comments
A "rolled back" instead of "true" suggests that the update did not succeed, so Podman reverted/rolled back to the previous image. My best guess is that it's Debian's systemd. I probably need access to a Debian VM to analyze it correctly. If it flakes too hard, I suggest skipping the test on Debian for now.
The symptoms in containers#17607 point to some race since it does not always flake on Debian (and Debian only). Hence, wait for the service to be ready before building the image to make sure that the service is started with the old image and that everything's in order. Fixes: containers#17607 Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
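The "wait for the service to be ready" idea in the commit above can be sketched as a simple polling loop. This is only an illustration, not the actual test helper: the name `wait_until`, its parameters, and the default timeout are hypothetical, and the readiness check itself is left to the caller.

```python
import time
from typing import Callable


def wait_until(is_ready: Callable[[], bool],
               timeout: float = 10.0,
               interval: float = 0.5,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll is_ready() until it returns True or `timeout` seconds elapse.

    Hypothetical helper: returns True once ready, False on timeout.
    `sleep` and `clock` are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            # Give up; the caller decides whether this is a test failure.
            return False
        sleep(interval)
```

Only once such a check returns `True` for the service would the test go on to build the new image, so the unit is known to be running on the old image first.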
Looks like it's not quite fixed. Reopening.
Thanks for reopening! I noticed the flake today as well and will take a look.
This one looks very strange to me. The image has been pulled successfully but something must have gone wrong when restarting the service, hence the rollback. Unfortunately, no logs indicate what went wrong.
This could either be another race or the 10 seconds are not enough (unlikely). @edsantiago do you have any suspicion on what may be wrong or whether there's another race in the tests?
I don't have a sense for it... but I'm seeing lots of new flakes in auto-update tests. I need some time to curate those; will check back in later.
OK, I found a smoking gun that confirms the "Timed out waiting" flake is the same as this one:
Example:
("c_registry" is the one showing "rolled back"). I've been seeing a lot of quay.io flakes today... but the logs above do not show those. And if it were a quay.io flake, I would expect non-Debian systems to hit it too. Something on Debian is causing "rolled back" even when all looks (superficially) successful. Here are the "Timed out waiting" flakes: [sys] podman auto-update with multiple services
To help debug containers#17607, turn off rollbacks for tests that do not require rollbacks. Errors when restarting the systemd units are then not suppressed but returned, which should give us more information about what is going on on the Debian systems. Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
Wait for all generated services to be ready to be sure we can iron out race conditions. Also disable rollbacks to make sure we can analyze the error if restarting a service fails. This information may be crucial to understand the flakes on Debian as tracked in containers#17607. Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
I opened #17770 to help debug what's going on. I tried reproducing in a CI VM but did not manage to, so I am flying a bit blind.
auto update system tests: help debug #17607
Hmmm, just saw a similar-but-not-identical failure in f37 root:
What a coincidence. I think this should be fixed by #17770 as we are now waiting for all services to be ready before running auto update.
Well, foo. Failure seen in 17786, and it looks like that one was based on main which included your debug PR. I see nothing useful in the logs, but (fingers crossed) maybe you will.
My bad. The error is still being eaten. Will prepare a PR.
Return the error when restarting the unit failed during an update. The task is correctly marked to have failed but we really need to return the error to the user. [NO NEW TESTS NEEDED] - The flakes in containers#17607 will reveal errors. Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
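To make the difference between a swallowed and a returned restart error concrete, here is a minimal sketch. It is not Podman's implementation: `update_unit` and the `restart` callback are hypothetical, and the "failed" status string is an assumption (only "true" and "rolled back" appear in this thread).

```python
from typing import Callable, Optional, Tuple


def update_unit(restart: Callable[[str], None],
                old_image: str,
                new_image: str,
                rollback: bool = True) -> Tuple[str, Optional[Exception]]:
    """Restart a unit on new_image and return (status, error).

    Hypothetical sketch: `restart` stands in for whatever restarts the
    systemd unit with the given image and raises on failure.
    """
    try:
        restart(new_image)
        return "true", None
    except Exception as err:
        if not rollback:
            # No rollback: report the failure and hand the error back.
            return "failed", err
        # Rollback: restart on the previous image. If this restart also
        # fails, its exception propagates to the caller.
        restart(old_image)
        # Return the original error too, so it is not silently eaten.
        return "rolled back", err
```

With a sketch like this, a test that sees "rolled back" also gets the underlying restart error back, which is the information that was missing from the logs.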
Hey, this one popped up yesterday, is it helpful?
It would seem to suggest a quay flake, except, why only on Debian?
Thanks for sharing! It's probably a temporary fart on the registry. The other symptoms are after having pulled the image and when restarting the systemd units.
OOH! We got it on f37!
@vrothberg here ya go. I see nothing of value, but hope you do.
That was indeed somehow helpful as it encouraged me to look up what this
If #17959 really fixes the flake/bug, I will remove the additional echoes from the test, but I prefer to keep them until it has had some time to bake in CI.
It turns out the restart is _not_ a stop+start but keeps certain resources open and is subject to some timeouts that may differ across distributions' default settings. [NO NEW TESTS NEEDED] as I have absolutely no idea how to reliably cause the failure/flake/race. Also ignore ENOENT errors for the CID file when removing a container, which has been identified as actually fixing containers#17607. Fixes: containers#17607 Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
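Ignoring ENOENT on removal of the CID file amounts to treating "already gone" as success. A minimal sketch of that pattern follows; `remove_cid_file` is a hypothetical name, and this is not the code from the commit.

```python
import os


def remove_cid_file(path: str) -> bool:
    """Remove the file at `path`, tolerating that it may already be gone.

    Hypothetical helper: returns True if the file was removed, False if
    it did not exist. Any other OSError still propagates to the caller.
    """
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        # ENOENT: the file was already removed, so there is nothing to do.
        return False
```

Only the "file does not exist" case is ignored here; permission errors and the like still surface, so real problems are not hidden.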
Commit f131eaa changed restart to a stop+start, motivated by comments in the systemd man pages that restart behaves differently from stop+start, for instance, that it keeps certain resources open and treats timers differently. Yet, the actual fix for containers#17607 in the very same commit was dealing with an ENOENT of the CID file on container removal. As it turns out in containers#18926, changing to stop+start regressed on restarting dependencies when auto updating a systemd unit. Hence, move back to using restart to make sure that dependent systemd units are restarted as well. An alternative could be recommending to use `BindsTo=` in Quadlet files, but this seems less common than `Requires=` and hence more likely to cause issues on user sites. Fixes: containers#18926 Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
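The dependency regression described above can be illustrated with a toy model. This is a deliberately simplified sketch and not real systemd: `ToyManager` is hypothetical, it only models `Requires=`-style propagation to direct dependents, and it assumes that stopping a unit stops the active units requiring it while a plain start does not bring them back.

```python
from typing import Dict, List, Set


class ToyManager:
    """Toy model of Requires=-style propagation; not real systemd."""

    def __init__(self, requires: Dict[str, List[str]]) -> None:
        # requires[unit] lists the units that `unit` requires.
        self.requires = requires
        self.active: Set[str] = set()

    def _dependents(self, unit: str) -> List[str]:
        # Units that list `unit` among their requirements.
        return [u for u, deps in self.requires.items() if unit in deps]

    def start(self, unit: str) -> None:
        # Starting pulls in what the unit requires, not what requires it.
        for dep in self.requires.get(unit, []):
            self.start(dep)
        self.active.add(unit)

    def stop(self, unit: str) -> None:
        # Stopping also stops active units that require this one.
        for dependent in self._dependents(unit):
            if dependent in self.active:
                self.stop(dependent)
        self.active.discard(unit)

    def restart(self, unit: str) -> None:
        # Remember the direct dependents that were running, then bring
        # them back after the unit itself has been stopped and started.
        affected = [u for u in self._dependents(unit) if u in self.active]
        self.stop(unit)
        self.start(unit)
        for dependent in affected:
            self.start(dependent)
```

In this model, with `app` requiring `db`, a `stop("db")` followed by `start("db")` leaves only `db` active, whereas `restart("db")` ends with both `db` and `app` active, which mirrors the reason given for moving back to restart.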
Debian-only so far, and only one logged instance, but I saw this one twice (in non-preserved logs) while reviewing #17305. This is going to start hitting us hard now that Debian is live.