signal flake: Timed out waiting for BYE #16091
Comments
Could be a timeout issue? We're waiting at most 5 seconds for the container to echo BYE. Maybe the machine was under heavy load? There are many syscalls involved and many chances the scheduler could kick in.
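For readers unfamiliar with this kind of bounded wait, here is a minimal sketch of polling for a line of output with a configurable timeout. It is not the actual podman system-test code: `wait_for_line`, `BYE_TIMEOUT`, and the temporary log file are hypothetical names chosen for illustration, and a background subshell stands in for the container.

```shell
#!/usr/bin/env bash
# wait_for_line <file> <pattern> <timeout_seconds>
# Hypothetical helper: polls <file> every 0.1s until a line matching
# <pattern> appears or roughly <timeout_seconds> have elapsed.
# Returns 0 on match, 1 on timeout.
wait_for_line() {
    local file=$1 pattern=$2 timeout=$3
    local tries=$(( timeout * 10 ))
    while [ "$tries" -gt 0 ]; do
        grep -q -- "$pattern" "$file" 2>/dev/null && return 0
        sleep 0.1
        tries=$(( tries - 1 ))
    done
    echo "Timed out waiting for $pattern" >&2
    return 1
}

# Assumed knob: let slow hosts raise the limit without editing the test.
: "${BYE_TIMEOUT:=5}"

log=$(mktemp)
( sleep 1; echo BYE >>"$log" ) &     # stand-in for the container's output
if wait_for_line "$log" BYE "$BYE_TIMEOUT"; then
    echo "got BYE"
fi
wait
rm -f "$log"
```

Raising the limit only helps if the output eventually arrives; if the signal never reaches the process that prints BYE, the wait fails at any timeout.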
@edsantiago, I wonder if the test could just do a |
Timeout: are you talking load on the host? On the bare-metal? Could be, but is there any way to determine that? (If you're talking about the VM itself: the only thing running is system tests. Should not be high load).
@edsantiago can we start the container with |
That's the only guess I have. 5 seconds seems like a small timeout compared to other ones.
Too late for a coffee for me today :-) Thanks, Ed!
Failed in Fedora gating tests:
@edsantiago do the flakes in the gating tests support the theory of the machine being under load (or just sufficiently slow) to justify bumping timeouts?
I don't know enough about the gating-test environment to answer that. I actually don't know ANYTHING about that environment. But yeah, I guess there's not much choice other than to bump timeouts.
Bump the timeout waiting for the container to process the signal. The comparatively short timeout is most likely responsible for flakes in gating tests. Fixes: containers#16091 Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
New symptom in remote f37-aarch64 root:
Hmmmmm, then again, this is |
Bump the timeout waiting for the container to process the signal. The comparatively short timeout is most likely responsible for flakes in gating tests. Fixes: containers#16091 Signed-off-by: Valentin Rothberg <vrothberg@redhat.com> (cherry picked from commit f0ba2d8) Signed-off-by: Lokesh Mandvekar <lsm5@fedoraproject.org>
Reopening, because we have all-new sigproxy tests, and ALL of them are flaking in gating:
Should we bump the timeout to 20? This seems to have worked for other tests, assuming the nodes are under stress.
I guess there's not much else we can do right now. I'll open a PR.
Reopening: failed again in 4.4 gating tests, and this is 4.4, so it presumably has the timeout bump.
One of our oldest, most frustrating flakes is containers#16091, "Timed out waiting for BYE". In containers#17489 we added some debug output to see if the problem was a container hang of some sort. It does not seem to be (see containers#17675), and the debug output makes it hard to read failure logs, so let's remove it. Signed-off-by: Ed Santiago <santiago@redhat.com>
Continuing to fail in Fedora gating.
@edsantiago, does it also happen in CI? I'd love to see (journal) logs.
It has happened in CI, but so rarely (most recently Nov 2022) that the logs are unavailable.
This is causing RHEL gating tests to fail. And, I can reproduce 100% on a 1mt f38 VM. I think sigproxy is broken:
Terminal 1:
$ podman run -i quay.io/libpod/testimage:20221018 sh -c 'trap "echo BYE; exit 0" INT;while :;do sleep 1;done'
[blocks, as expected]
Terminal 2:
# ps auxww --forest |grep -2 podman
...
fedora 2336 0.0 0.2 7528 4224 pts/0 S 10:25 0:00 \_ -bash
fedora 4613 0.0 2.1 1131356 42880 pts/0 Sl+ 10:31 0:00 \_ podman run -i quay.io/libpod/testimage:20221018 sh -c trap "echo BYE; exit 0" INT;while :;do sleep 1;done
fedora 4618 0.2 2.4 1279076 49792 pts/0 Sl+ 10:31 0:00 \_ podman run -i quay.io/libpod/testimage:20221018 sh -c trap "echo BYE; exit 0" INT;while :;do sleep 1;done
fedora 4629 0.0 0.1 5012 2944 pts/0 S 10:31 0:00 \_ /usr/bin/slirp4netns --disable-host-loopback --mtu=65520 --enable-sandbox --enable-seccomp --enable-ipv6 -c -e 3 -r 4 --netns-type=path /run/user/1000/netns/netns-7b481dfb-9731-fb9e-34dd-0331d9806ec8 tap0
# kill -INT 4613 <--- this brings terminal 1 back to the prompt BUT DOES NOT EMIT BYE
# ps auxww --forest | grep -2 podman
[still shows the 4618 process]
# kill -INT 4618 <---- ***NOW*** I get the BYE in terminal 1
@containers/podman-maintainers PTAL ASAP. This has been a nasty flake for many months. Now that it's reproducible, let's squash it. Side note: how can this pass in CI????
VM available on request
I give up. The only thing I've learned is that sometimes
^^^ the good case In the good case, like this one, killing the child of All tests were run using This is way beyond me. @containers/podman-maintainers, once again, please look into this. |
The extra reexec happens when we cannot join the rootless pause process namespace. While I cannot reproduce the missing BYE, I can still reproduce this behaviour locally by checking the exit code. With the extra reexec we exit 1, while without it we exit 0 as expected.
There are quite a lot of places in podman where we have signal handlers, most notably libpod/shutdown/handler.go. However, when we reexec we do not want any of that; we just want to send all signals we get down to the child. So before we install our signal handler we must first reset all others with signal.Reset(). Also, while at it, fix a problem where the joinUserAndMountNS() code path would not forward signals at all. This code path is used when you have running containers but the pause process was killed. Fixes containers#16091 Given that signal handlers run in parallel in different goroutines, it would explain why it flakes sometimes in CI. However, to my understanding this flake can only happen when the pause process is dead before we run the podman command. So the question still is: what kills the pause process? Signed-off-by: Paul Holzinger <pholzing@redhat.com>
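To make the forwarding requirement concrete, here is a rough shell analogy of a wrapper process that either relays a signal to its child or swallows it. This is only a sketch, not podman's Go code: `proxy`, `signal_wrapper`, and `workload` are hypothetical names, and the sketch uses SIGTERM instead of SIGINT because non-interactive bash starts background commands with SIGINT ignored, so an INT trap would never fire in a script like this.

```shell
#!/usr/bin/env bash
# Stand-in for the container workload (hypothetical): announces readiness,
# prints BYE on SIGTERM, and gives up after ~5s so no orphan lingers.
workload='trap "echo BYE; exit 0" TERM; echo READY
          i=0; while [ "$i" -lt 50 ]; do sleep 0.1; i=$((i+1)); done'

# proxy <forward|drop> <outfile>: start the workload and wait for it.
# In "forward" mode SIGTERM is relayed to the child; in "drop" mode the
# wrapper dies of the signal and the child never hears about it.
proxy() {
    local mode=$1 out=$2 child=
    if [ "$mode" = forward ]; then
        trap '[ -n "$child" ] && kill -TERM "$child"' TERM
    fi
    bash -c "$workload" >"$out" 2>&1 &
    child=$!
    wait "$child" || true   # interrupted (status > 128) when the trap fires
    wait "$child" || true   # reap the child once it has handled the signal
}

# signal_wrapper <forward|drop>: run proxy in the background, wait until the
# workload is ready, signal only the wrapper, then print what the workload wrote.
signal_wrapper() {
    local out pid i=0
    out=$(mktemp)
    proxy "$1" "$out" &
    pid=$!
    until grep -q READY "$out" || [ "$i" -ge 50 ]; do
        sleep 0.1; i=$((i+1))
    done
    kill -TERM "$pid"
    wait "$pid" 2>/dev/null || true
    sleep 0.5               # leave time for any late output to land
    cat "$out"
    rm -f "$out"
}

echo "forward mode wrote: $(signal_wrapper forward | tr '\n' ' ')"
echo "drop mode wrote:    $(signal_wrapper drop | tr '\n' ' ')"
```

In drop mode the wrapper exits and the caller gets its prompt back, yet the workload keeps running without ever seeing the signal, which resembles the symptom reported above where killing the outer podman process did not produce BYE.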
Haven't seen this one before (note, "sigkill" in test name is a misnomer, it's actually SIGINT)
Basically, it looks like the container is never receiving the INT signal.
ubuntu root