add ebpf program to trace podman cleanup errors in CI #23487
Conversation
Force-pushed from 4cf3a21 to f8a2000.
Cockpit tests failed for commit f241ed073bd388cd730c44b432c87caec82754c3. @martinpitt, @jelly, @mvollmer please check.
I'd like to run a bpftrace-based program in CI to collect better logs for specific processes not observed in normal testing, such as the podman container cleanup command. Given that you need full privileges to run ebpf, and the package pulls in an entire toolchain that is almost 500MB in install size, we do not add it to the container images so as not to bloat them without reason. containers/podman#23487 Signed-off-by: Paul Holzinger <pholzing@redhat.com>
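For illustration, a CI hook along these lines could install the tracer at runtime instead of baking it into the images. This is a minimal sketch only, assuming a Fedora VM with root access; the script path is a placeholder, not the actual runner change from this PR (only the log file name matches):

```bash
# Hypothetical CI hook: install bpftrace on the VM at runtime so the
# images stay small, then start the tracer in the background.
# "./hack/podman-cleanup-tracer.bt" is a placeholder path.
dnf -y install bpftrace
bpftrace ./hack/podman-cleanup-tracer.bt >podman-cleanup-tracer.log 2>&1 &
```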
In case anyone is interested, here is an example log. It doesn't look nice, mostly because the argv lines are printed with zero bytes instead of spaces. I don't have a way to get rid of that in my program, but I guess I could pipe the output into sed to replace all zero bytes with spaces and make it easier to read.
@edsantiago please pick this one as well for the no-retry PR; maybe I can find the cause in this new log.
Here ya go: f39 root. [EDIT: direct link to cleanup tracer log]. This is my second run; the first one did not have any relevant flakes. This run had only one. I'll rerun at least once more today.
This one has two (debian root). It was the only failure in the third run. I've pushed one more time, but will not be checking its status again tonight.
OK, that doesn't seem to include anything new AFAICS, but it does confirm what I have been saying: the cleanup is called twice. The only way I see this happening is if the cleanup process failed, but in the logged exit codes I clearly see no errors or kill signals sent. So there truly must be a condition in our code that calls cleanup twice even if there was no error the first time...
@edsantiago I assume the no-flake-retry setting is the only thing I need to trigger this? I added the parent pid info to my log, as it wasn't clear whether netavark teardown was run by the same podman command or by two different ones.
Not to trigger it; more precisely, to see it. I think the error is happening all the time; it's just that normally we have to examine each individual log to find it. Yes, setting retries to 0 should make failures visible.
Yes, that is what I meant.
Interesting: I found the netns error in my log for the cleanup process as well. So it can happen without us ever seeing it, and without causing a test failure.
https://api.cirrus-ci.com/v1/artifact/task/5220866676490240/cleanup_tracer/podman-cleanup-tracer.log
Force-pushed from 4f5828b to ed8dc4d.
@edsantiago Great news: I am confident that I found the root cause, a locking/container-state misuse bug that caused us to add the netns back after we had already cleaned it up.
One new failure seen, f39 root:
Looks more like a race in
It took a LOT of runs, but here's what looks like the same issue. rawhide rootless:
Note it's the same issue (EBUSY vs ENOENT); this should be fixed by containers/common#2112, vendored in #23519.
Also, d6bcc14 addresses such a generic issue that it would not surprise me if it fixes other weird stop flakes as well.
Add a new program based on bpftrace[1] to trace all podman processes with their arguments and exit codes/signals. Additionally, this captures stderr from all podman container cleanup processes spawned by conmon, which otherwise goes to /dev/null and is never seen in any CI logs. Hopefully this allows us to debug the strange network cleanup errors seen in CI; my plan is to add this to the cirrus setup and upload the logs so we can check them when the flakes happen. [1] /~https://github.com/bpftrace/bpftrace Signed-off-by: Paul Holzinger <pholzing@redhat.com>
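For readers who don't grok ebpf: below is a minimal bpftrace sketch in the same spirit, wrapped in a shell invocation. This is not the program from this commit; the traced binary path, the choice of probes, and the output format are all illustrative assumptions:

```bash
# Minimal sketch, NOT the tracer from this PR: log every exec of a podman
# binary with its argv and parent pid, plus podman exit statuses. Needs root.
bpftrace -e '
tracepoint:syscalls:sys_enter_execve
/str(args->filename) == "/usr/bin/podman"/
{
    /* pid, parent pid, and the full argv of each podman exec */
    printf("exec pid=%d ppid=%d argv=", pid, curtask->real_parent->pid);
    join(args->argv);
}

tracepoint:syscalls:sys_enter_exit_group
/comm == "podman"/
{
    /* status handed to exit_group(2); kill signals need extra probes */
    printf("exit pid=%d status=%d\n", pid, args->error_code);
}'
```

Capturing the cleanup processes' stderr, as the commit describes, needs more than this (e.g. tracing write(2) calls on fd 2 for those pids), so treat this only as a starting point.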
In order to get better debug data for cleanup flakes. The argv is printed with 0 bytes, so replace them with spaces to make the log readable for humans. Signed-off-by: Paul Holzinger <pholzing@redhat.com>
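Concretely, that amounts to a pipe through tr. A sketch, with the tracer file name as a placeholder:

```bash
# Replace NUL bytes in the tracer output with spaces so argv is readable;
# "./podman-cleanup-tracer.bt" is a placeholder name.
bpftrace ./podman-cleanup-tracer.bt | tr '\0' ' ' >podman-cleanup-tracer.log
```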
@edsantiago @mheon PTAL. I would like to get this merged, as it just helped me again today for #24044, so I think it is useful in general.
Qualified LGTM (I don't grok ebpf, so I've only re-reviewed the runner changes)
@@ -57,3 +57,4 @@ contrib/win-installer/shasums
 contrib/win-installer/*.wixobj
 contrib/win-installer/*.wixpdb
 contrib/win-installer/*.log
+podman-cleanup-tracer.log*
Why the asterisk? (nonblocking)
Oh, that was a leftover from a previous iteration where I wrote to a *.tmp file and then called tr later. Then I figured I can just pipe into tr and do not need an extra file, so we can drop the asterisk in a future cleanup, I guess.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: edsantiago, Luap99.
/lgtm
See commits; I just want to check if this even works. Hopefully this allows me to properly understand cleanup errors.