Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

race: system df: failed to get read/write size of container: /proc/something: ENOENT or EINVAL #23909

Closed
edsantiago opened this issue Sep 9, 2024 · 6 comments · Fixed by #24019
Assignees
Labels
flakes Flakes from Continuous Integration locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@edsantiago
Copy link
Member

In window 1:

$ sudo bash -c 'while :;do bin/podman system df --format "{{\"\"}}" || break;done';beep  

In window 2:

$ while :;do hack/bats --root 252:rootfs || break;done

Doesn't take too long for window 1 to fail:

Error: failed to get read/write size of container xxx: readdirent /proc/1347012/net: invalid argument

Also seen:

Error: failed to get read/write size of container xxx: lstat /proc/2688266/fd/5: no such file or directory

Found in parallel testing. It should be possible to minimize the reproducer if necessary.

@edsantiago edsantiago added the flakes Flakes from Continuous Integration label Sep 9, 2024
@Luap99
Copy link
Member

Luap99 commented Sep 18, 2024

Why do we even read /proc when computing the size? That doesn't make sense as these are not real files.

@Luap99
Copy link
Member

Luap99 commented Sep 18, 2024

Oh I see why, the quadlet is using a overlyfs rootfs over / and podman of course in order to know the size has to walk over everything and because well /proc is a normal mount point on the host it walks /proc there too. The actual containers /proc is mounted in the container mount namespace only so we would not walk that by default.
Only because we see the /proc on the host here it happens...

@Luap99 Luap99 self-assigned this Sep 18, 2024
Luap99 added a commit to Luap99/storage that referenced this issue Sep 18, 2024
As documented in (fs.DirEntry) Info() it can return ENOENT so we must
ignore it there as well. Reported in
containers/podman#23909

Given this is a race I don't think I can add a test here.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
@Luap99
Copy link
Member

Luap99 commented Sep 18, 2024

So I think found the cause for the ENOENT errors (containers/storage#2097) which seems like a real bug that can always happen, however walking /proc in general seem to generate a bunch of weird errors, i.e. EINVAL and that is certainly not an error that we like to ignore by default when walking+stating each file to get the size.

I am not sure how important this is, I guess it blocks you from running this rootfs test in parallel?
We can try something like this:

diff --git a/test/system/252-quadlet.bats b/test/system/252-quadlet.bats
index 3ef297b869..566a512ab5 100644
--- a/test/system/252-quadlet.bats
+++ b/test/system/252-quadlet.bats
@@ -726,7 +726,12 @@ EOF
     local quadlet_file=$PODMAN_TMPDIR/basic_$(safename).container
     cat > $quadlet_file <<EOF
 [Container]
-Rootfs=/:O
+Rootfs=$PODMAN_TMPDIR:O
+Volume=/usr:/usr
+Volume=/lib:/lib
+Volume=/lib64:/lib64
+Volume=/etc:/etc
+Volume=/var:/var
 Exec=sh -c "echo STARTED CONTAINER; echo "READY=1" | socat -u STDIN unix-sendto:\$NOTIFY_SOCKET; top -b"
 Notify=yes
 EOF

Or like we do in other tests use an actual mounted container image as source for rootfs which I guess is more robust then only mounting certain host directories.

@ygalblum
Copy link
Contributor

@Luap99 but then what are you really testing? If we want to avoid this walk, I think the test will need to prepare the rootfs directory and use it and not mount all these directories.

@edsantiago
Copy link
Member Author

See 522c718

@ygalblum
Copy link
Contributor

See 522c718

Yes, this is what I meant. Thanks

edsantiago added a commit to edsantiago/libpod that referenced this issue Sep 19, 2024
Test was written to use / (root). This is not parallel-safe.

Fixes: containers#23909

Signed-off-by: Ed Santiago <santiago@redhat.com>
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Dec 20, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Dec 20, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
flakes Flakes from Continuous Integration locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants