Fix netns leak on container creation and exit code 1 on SIGTERM. #24082

Merged
merged 3 commits into from
Oct 1, 2024

Conversation

Luap99
Member

@Luap99 Luap99 commented Sep 26, 2024

libpod: ensure we are not killed during netns creation

When we are killed during netns setup, the netns path is leaked because it
was never committed to the db. This is rather common if you run systemctl
stop on a podman systemd unit. Of course we cannot protect against
SIGKILL, but in the systemd case we get SIGTERM and we really should not exit
in a critical section like this.

Fixes #24044
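
A minimal sketch of the idea, not the actual libpod code: critical sections take a read lock and the shutdown path takes the write lock, so a SIGTERM-triggered exit has to wait until the netns path has been recorded. The names `inhibit`, `uninhibit`, `setupNetns`, `shutdownAfterCriticalSections`, and the map standing in for the db are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// shutdownInhibit is read-locked by critical sections and write-locked by
// the shutdown path, so shutdown has to wait until every section is done.
var shutdownInhibit sync.RWMutex

// inhibit and uninhibit bracket a section that must not be interrupted.
func inhibit()   { shutdownInhibit.RLock() }
func uninhibit() { shutdownInhibit.RUnlock() }

// setupNetns is a hypothetical stand-in for netns creation: it "creates"
// a path and only then records it in db (a stand-in for the database).
func setupNetns(db map[string]string, id string, started chan<- struct{}) {
	inhibit()
	defer uninhibit()
	close(started)                    // the critical section is now active
	time.Sleep(20 * time.Millisecond) // simulate slow netns creation
	db[id] = "netns-" + id            // commit the (made-up) path
}

// shutdownAfterCriticalSections models the SIGTERM path: it blocks until no
// critical section is active and reports how many entries were committed.
func shutdownAfterCriticalSections(db map[string]string) int {
	shutdownInhibit.Lock()
	defer shutdownInhibit.Unlock()
	return len(db)
}

// simulate starts a netns setup, "receives SIGTERM" while it is running, and
// returns the number of committed entries seen by the shutdown path.
func simulate() int {
	db := map[string]string{}
	started := make(chan struct{})
	go setupNetns(db, "ctr1", started)
	<-started
	return shutdownAfterCriticalSections(db)
}

func main() {
	fmt.Println("committed entries at shutdown:", simulate())
}
```

Because the shutdown path cannot acquire the write lock while the read lock is held, it always observes the committed entry. As the commit message notes, nothing like this can help against SIGKILL.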


libpod: rework shutdown handler flow

Currently podman run -d can exit 0 if we send SIGTERM during startup
even though the container was never started. That just doesn't make any
sense and is horribly confusing for an external job manager like systemd.

The original motivation was to exit 0 for the podman.service in commit
ca7376b. That does make sense but it should only do so for the
service and only if the server did indeed shut down gracefully.

So we rework how the exit logic works: do not let the handler perform
the exit. Instead the shutdown package does the exit after all handlers
have run, which solves the ordering issue. Then we default to exit code
1 like we did before and allow the service exit handler to override the
exit code to 0 in case of a graceful shutdown.
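
The flow described above could look roughly like the following sketch. It is an illustration under assumptions, not the real shutdown package: `shutdownState`, `register`, `setExitCode`, and `runHandlers` are hypothetical names, handlers simply run in registration order, and the final `os.Exit` call is left out so the example can be run and inspected.

```go
package main

import "fmt"

// shutdownState holds the registered handlers and the exit code that the
// package itself will use once every handler has finished.
type shutdownState struct {
	handlers []func()
	exitCode int
}

// newShutdownState defaults to exit code 1: being terminated by a signal
// is treated as a failure unless a handler says otherwise.
func newShutdownState() *shutdownState {
	return &shutdownState{exitCode: 1}
}

// register adds a handler. Handlers only clean up; they never exit.
func (s *shutdownState) register(h func()) {
	s.handlers = append(s.handlers, h)
}

// setExitCode lets a handler (for example the service one) override the
// default exit code after a graceful shutdown.
func (s *shutdownState) setExitCode(code int) {
	s.exitCode = code
}

// runHandlers runs every handler and returns the code the process should
// exit with. A real implementation would pass this value to os.Exit.
func (s *shutdownState) runHandlers() int {
	for _, h := range s.handlers {
		h()
	}
	return s.exitCode
}

func main() {
	// A plain command that was interrupted: nothing overrides the default.
	cli := newShutdownState()
	cli.register(func() { fmt.Println("cleaning up") })
	fmt.Println("interrupted command exits with", cli.runHandlers())

	// The service: its handler reports a graceful shutdown.
	svc := newShutdownState()
	svc.register(func() { svc.setExitCode(0) })
	fmt.Println("service exits with", svc.runHandlers())
}
```

Keeping the exit in one place means no handler can end the process before the remaining handlers have run.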


libpod: remove shutdown.Unregister()

It is never used or needed, so let's just remove some dead code.

Does this PR introduce a user-facing change?

Podman no longer exits 0 on SIGTERM by default.
Fixed a race that could cause podman to leak netns files when it was interrupted during the netns creation.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
@github-actions github-actions bot added the kind/api-change Change to remote API; merits scrutiny label Sep 26, 2024
@Luap99
Member Author

Luap99 commented Sep 26, 2024

@mheon @edsantiago PTAL

Contributor

openshift-ci bot commented Sep 26, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 26, 2024

Ephemeral COPR build failed. @containers/packit-build please check.

@Luap99 Luap99 added the No New Tests Allow PR to proceed without adding regression tests label Sep 26, 2024
@edsantiago edsantiago changed the title from "Fix netns leak on contianer creation and exit code 0 on SIGTERM." to "Fix netns leak on container creation and exit code 0 on SIGTERM." Sep 26, 2024
@edsantiago
Member

LGTM, and I can no longer reproduce the netns leak

@@ -75,6 +83,7 @@ func Start() error {
}
handlerLock.Unlock()
shutdownInhibit.Unlock()
os.Exit(exitCode)
Member

Wouldn't this prevent a lot of defer functions from running? I know I avoided it deliberately when I wrote this

Member Author

I mean yes, but if we exit, what do we want to defer here? And most importantly, the current handler also just exited, so there should not be any functional difference, I would say.

Member

IIRC we have a bunch of things running in defer that do things like removing files to clean up after ourselves

Member Author

Sure, but this PR doesn't change the behavior here. Critical sections where we cannot leak need to use shutdown.Inhibit(), which is how I fixed the issue with the netns leak.

@Luap99 Luap99 changed the title from "Fix netns leak on container creation and exit code 0 on SIGTERM." to "Fix netns leak on container creation and exit code 1 on SIGTERM." Sep 26, 2024
@baude
Member

baude commented Sep 30, 2024

LGTM, but deferring for merge to @mheon given his questions ...

@edsantiago
Member

Saw the netns leak flake today, almost wept in despair, then remembered that this PR hasn't merged yet.

@mheon
Member

mheon commented Oct 1, 2024

/lgtm
I am hesitant but I cannot remember why we didn't want to use exit so I'll merge and hopefully we don't find out the hard way later

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 1, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 857a47d into containers:main Oct 1, 2024
74 of 80 checks passed
@Luap99 Luap99 deleted the netns-leak branch October 2, 2024 08:29
@Luap99
Member Author

Luap99 commented Oct 2, 2024

> I am hesitant but I cannot remember why we didn't want to use exit so I'll merge and hopefully we don't find out the hard way later

The old code did use exit(), so there is no functional change in that regard.

@mheon
Member

mheon commented Oct 2, 2024

LGTM

@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Jan 1, 2025
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Jan 1, 2025
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/api-change Change to remote API; merits scrutiny lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. No New Tests Allow PR to proceed without adding regression tests release-note
Development

Successfully merging this pull request may close these issues.

CI: system tests: netns leak
4 participants