add ebpf program to trace podman cleanup errors in CI #23487
Conversation
Force-pushed from 4cf3a21 to f8a2000.
Cockpit tests failed for commit f241ed073bd388cd730c44b432c87caec82754c3. @martinpitt, @jelly, @mvollmer please check.
I'd like to run a bpftrace-based program in CI to collect better logs for specific processes not observed in normal testing, such as the podman container cleanup command. Given that you need full privileges to run ebpf, and the package pulls in an entire toolchain that is almost 500MB in install size, we do not add it to the container images so as not to bloat them without reason. containers/podman#23487 Signed-off-by: Paul Holzinger <pholzing@redhat.com>
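For illustration, a CI hook along these lines could install the tracer at runtime instead of baking it into the images. This is a minimal sketch only, assuming a Fedora VM with root access; the script path is a placeholder, not the actual runner change from this PR (only the log file name matches):

```bash
# Hypothetical CI hook: install bpftrace on the VM at runtime so the
# images stay small, then start the tracer in the background.
# "./hack/podman-cleanup-tracer.bt" is a placeholder path.
dnf -y install bpftrace
bpftrace ./hack/podman-cleanup-tracer.bt >podman-cleanup-tracer.log 2>&1 &
```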
In case anyone is interested, here is an example log. It doesn't look nice, mostly because the argv lines are printed with zero bytes instead of spaces. I don't have a way to get rid of that in my program, but I guess I could pipe the output into sed to replace all zero bytes with spaces and make it easier to read.
@edsantiago please pick this one as well for the no-retry PR; maybe I can find the cause in this new log.
Here ya go: f39 root. [EDIT: direct link to cleanup tracer log]. This is my second run; the first one did not have any relevant flakes. This run had only one. I'll rerun at least once more today.
This one has two (debian root). It was the only failure in the third run. I've pushed one more time, but will not be checking its status again tonight.
OK, that doesn't seem to include anything new AFAICS, but it does confirm what I have been saying: the cleanup is called twice. The only way I see this happening is if the cleanup process failed, but in the logged exit codes I clearly see no errors or kill signals sent. So there truly must be a condition in our code that calls cleanup twice even if there was no error the first time...
@edsantiago I assume the no-flake-retry setting is the only thing I need to trigger this? I added the parent pid info to my log, as it wasn't clear whether netavark teardown was run by the same podman command or by two different ones.
Not to trigger it; more precisely, to see it. I think the error is happening all the time; it's just that normally we have to examine each individual log to find it. Yes, setting retries to 0 should make failures visible.
Yes, that is what I meant.
Interesting: I found the netns error in my log for the cleanup process as well. So it can happen without us ever seeing it, and without causing a test failure.
https://api.cirrus-ci.com/v1/artifact/task/5220866676490240/cleanup_tracer/podman-cleanup-tracer.log
Force-pushed from 4f5828b to ed8dc4d.
@edsantiago Great news: I am confident that I found the root cause, a locking/container-state misuse bug that caused us to add the netns back after we had already cleaned it up.
One new failure seen, f39 root:
Looks more like a race in
It took a LOT of runs, but here's what looks like the same issue. rawhide rootless:
Note it's the same issue (EBUSY vs ENOENT); this should be fixed by containers/common#2112, vendored in #23519.
Also, d6bcc14 addresses such a generic issue that it would not surprise me if it fixes other weird stop flakes as well.
Add a new program based on bpftrace[1] to trace all podman processes with their arguments and exit codes/signals. Additionally, this captures stderr from all podman container cleanup processes spawned by conmon, which otherwise goes to /dev/null and is never seen in any CI logs. Hopefully this allows us to debug the strange network cleanup errors seen in CI; my plan is to add this to the cirrus setup and upload the logs so we can check them when the flakes happen. [1] /~https://github.com/bpftrace/bpftrace Signed-off-by: Paul Holzinger <pholzing@redhat.com>
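For readers who don't grok ebpf: below is a minimal bpftrace sketch in the same spirit, wrapped in a shell invocation. This is not the program from this commit; the traced binary path, the choice of probes, and the output format are all illustrative assumptions:

```bash
# Minimal sketch, NOT the tracer from this PR: log every exec of a podman
# binary with its argv and parent pid, plus podman exit statuses. Needs root.
bpftrace -e '
tracepoint:syscalls:sys_enter_execve
/str(args->filename) == "/usr/bin/podman"/
{
    /* pid, parent pid, and the full argv of each podman exec */
    printf("exec pid=%d ppid=%d argv=", pid, curtask->real_parent->pid);
    join(args->argv);
}

tracepoint:syscalls:sys_enter_exit_group
/comm == "podman"/
{
    /* status handed to exit_group(2); kill signals need extra probes */
    printf("exit pid=%d status=%d\n", pid, args->error_code);
}'
```

Capturing the cleanup processes' stderr, as the commit describes, needs more than this (e.g. tracing write(2) calls on fd 2 for those pids), so treat this only as a starting point.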
In order to get better debug data for cleanup flakes. The argv is printed with 0 bytes, so replace them with spaces to make the log readable for humans. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
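Concretely, that amounts to a pipe through tr. A sketch, with the tracer file name as a placeholder:

```bash
# Replace NUL bytes in the tracer output with spaces so argv is readable;
# "./podman-cleanup-tracer.bt" is a placeholder name.
bpftrace ./podman-cleanup-tracer.bt | tr '\0' ' ' >podman-cleanup-tracer.log
```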
@edsantiago @mheon PTAL. I would like to get this merged, as it just helped me again today for #24044, so I think it is useful in general.
Qualified LGTM (I don't grok ebpf, so I've only re-reviewed the runner changes)
@@ -57,3 +57,4 @@ contrib/win-installer/shasums
 contrib/win-installer/*.wixobj
 contrib/win-installer/*.wixpdb
 contrib/win-installer/*.log
+podman-cleanup-tracer.log*
Why the asterisk? (nonblocking)
Oh, that was a leftover from a previous iteration where I wrote to a *.tmp file and then called tr later. Then I figured I can just pipe into tr and do not need an extra file, so we can drop the asterisk in a future cleanup, I guess.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: edsantiago, Luap99.
/lgtm
See commits; I just want to check if this even works. Hopefully this allows me to properly understand cleanup errors.