This issue was moved to a discussion.
Systemd in container hits critical errors when using private cgroup namespace on CentOS 8 #17727
Comments
@giuseppe PTAL |
The issue probably comes from bind mounting. Unfortunately, I don't see any way to fix it: Podman doesn't know the destination cgroup in advance, so it cannot point to the final cgroup path, and there is no way to express this in the OCI runtime spec, unless we make the entire cgroup directory rw (which is a security issue). I think all we can do is add an error when this happens. |
opened a PR: #17736 |
Hey @giuseppe, I'm trying to understand this, why can't the systemd cgroup mount be created in the normal way as is done with Is there definitely a security issue with making the mount rw when there's a private cgroup namespace involved? |
the other cgroups are created 'ro'. We want only the systemd named cgroup to be mounted 'rw'. In order to do that without a cgroupns we'd need to know in advance what the target cgroup is so we can bind mount only that; but podman doesn't know it because the cgroup is created by the OCI runtime. |
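To illustrate what "the target cgroup" means here: once a process exists, its cgroup v1 membership can be read from /proc/<pid>/cgroup, where each line has the form hierarchy-ID:controller-list:cgroup-path. The sketch below is not Podman code; the function name and the sample paths are made up for illustration. It only shows how the name=systemd path could be looked up after the fact, which is exactly the information Podman lacks when it generates the spec, because the OCI runtime has not created the cgroup yet.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// systemdCgroupPath (hypothetical helper) scans the text of a
// /proc/<pid>/cgroup file and returns the path within the cgroup v1
// named hierarchy "name=systemd", if that hierarchy is listed.
func systemdCgroupPath(procCgroup string) (string, bool) {
	scanner := bufio.NewScanner(strings.NewReader(procCgroup))
	for scanner.Scan() {
		// Each line is "hierarchy-ID:controller-list:cgroup-path".
		parts := strings.SplitN(scanner.Text(), ":", 3)
		if len(parts) != 3 {
			continue
		}
		for _, ctrl := range strings.Split(parts[1], ",") {
			if ctrl == "name=systemd" {
				return parts[2], true
			}
		}
	}
	return "", false
}

func main() {
	// Invented sample content; a real caller would read /proc/<pid>/cgroup.
	sample := "12:memory:/machine.slice/libpod-example.scope\n" +
		"1:name=systemd:/machine.slice/libpod-example.scope\n"
	path, ok := systemdCgroupPath(sample)
	fmt.Println(path, ok)
}
```

With the path in hand, the bind mount source would be /sys/fs/cgroup/systemd plus that path, but that is only available after the runtime has placed the container in its cgroup.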
Ok I follow, but why can't the systemd cgroup mount be namespaced in the container and just mounted read-write? Would that require OCI changes too - is this job of making the systemd mount rw falling on Podman rather than the runtime that normally creates the mounts? |
Yes to specify that we will need some way to specify it in the specs, which is not currently doable. |
But if podman is currently modifying mounts that have been created by the runtime (e.g. creating a rw bind mount), why can't podman just create the private cgroupns systemd mount as a 'cgroup' mount rather than a bind mount? Something like this (untested)?
--- a/libpod/container_internal_linux.go
+++ b/libpod/container_internal_linux.go
@@ -270,6 +271,18 @@ func (c *Container) setupSystemd(mounts []spec.Mount, g generate.Generator) erro
 				}
 			}
 			g.AddMount(systemdMnt)
+		} else if hasCgroupNs { // cgroups v1 with cgroupns=private
+			if MountExists(mounts, "/sys/fs/cgroup/systemd") {
+				g.RemoveMount("/sys/fs/cgroup/systemd")
+			}
+			systemdMnt = spec.Mount{
+				Destination: "/sys/fs/cgroup/systemd",
+				Type:        "cgroup",
+				Source:      "cgroup",
+				Options:     []string{"private", "rw", "none", "name=systemd"},
+			}
+			g.AddMount(systemdMnt)
+			g.AddLinuxMaskedPaths("/sys/fs/cgroup/systemd/release_agent")
 		} else {
 			mountOptions := []string{"bind", "rprivate"}
 			skipMount := false
the OCI runtime will mount the entire cgroup hierarchy not just the systemd controller. The |
we can argue it is an issue in the OCI runtimes, but this is what currently happens; so all that Podman can do is to use a bind mount |
Ohh I thought the podman code was directly creating the individual cgroup/bind mounts, but actually it's just specifying what the runtime should create at a higher-level abstraction, which doesn't allow mounting an individual cgroup v1 mount... I see the problem, and this seems quite problematic in robustly supporting systemd. |
At this point this seems more like a discussion than an issue. |
On cgroup v1 we need to mount only the systemd named hierarchy as writeable, so we configure the OCI runtime to mount /sys/fs/cgroup as read-only and on top of that bind mount /sys/fs/cgroup/systemd.

But when we use a private cgroupns, we cannot do that since we don't know the final cgroup path.

Also, do not override the mount if there is already one for /sys/fs/cgroup/systemd.

Closes: containers#17727

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
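The decision described in this commit message can be sketched roughly as follows. This is not the code from the PR: the Mount type is a simplified, hypothetical stand-in for the OCI spec mount entry, planSystemdMount is an invented name, and both the order of the checks and the choice to return an error for a private cgroupns are assumptions based on the discussion above. The "bind" and "rprivate" options are the ones visible in the diff earlier in the thread.

```go
package main

import (
	"errors"
	"fmt"
)

// Mount is a hypothetical, simplified stand-in for an OCI spec mount entry.
type Mount struct {
	Destination string
	Type        string
	Source      string
	Options     []string
}

const systemdCgroupDir = "/sys/fs/cgroup/systemd"

// errPrivateCgroupNS is an assumed error; the real message may differ.
var errPrivateCgroupNS = errors.New(
	"cgroup v1: cannot bind mount " + systemdCgroupDir +
		" with a private cgroup namespace")

// planSystemdMount returns the extra mount to add for systemd on cgroup v1.
// It returns (nil, nil) when a mount for the destination already exists,
// so an existing mount is never overridden.
func planSystemdMount(existing []Mount, privateCgroupNS bool) (*Mount, error) {
	for _, m := range existing {
		if m.Destination == systemdCgroupDir {
			return nil, nil
		}
	}
	if privateCgroupNS {
		// The final cgroup path is unknown, so there is nothing to bind.
		return nil, errPrivateCgroupNS
	}
	return &Mount{
		Destination: systemdCgroupDir,
		Type:        "bind",
		Source:      systemdCgroupDir,
		Options:     []string{"bind", "rprivate"},
	}, nil
}

func main() {
	m, err := planSystemdMount(nil, false)
	fmt.Println(m, err)
	_, err = planSystemdMount(nil, true)
	fmt.Println(err)
}
```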
Issue Description
When you run a systemd container on a CentOS 8 host and pass --cgroupns private, systemd fails to start in non-privileged mode. In privileged mode, it emits some warnings.
This is probably because /sys/fs/cgroup/systemd does not get mounted properly. From inside the container, it appears empty.
The issue was observed on CentOS 8.2 and CentOS 8.4.
It was also tested on Ubuntu 18.04 and Ubuntu 20.04, where the issue was NOT present, so it looks CentOS-specific.
Workaround
If you run systemd in legacy mode and turn off podman's systemd mode, there is no error. That is, passing -e SYSTEMD_PROC_CMDLINE=systemd.legacy_systemd_cgroup_controller=1 --systemd false along with --cgroupns private makes it run as expected.
Note: In non-privileged mode, systemd automatically runs in legacy cgroup mode, so that option isn't required.
Additional info which might help
On running the image specified above in non-privileged mode with entrypoint bash:
podman run --rm -it --name sys_ctr_non_priv_bash_entry --cgroupns private --systemd=always --entrypoint bash systemd
If you check the mounts, it is expected behaviour:
But after you run systemd and check the mounts:
Looks like somehow systemd unmounts the setup created by podman
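For anyone who wants to compare the two states programmatically, here is a small sketch that lists what is mounted at a given path, based on /proc/mounts style text (fields: device, mount point, filesystem type, options, dump, pass). The helper name and the sample lines are invented for illustration and are not the output from this report; inside a real container you would feed it the contents of /proc/self/mounts before and after starting systemd.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// mountsAt (hypothetical helper) returns the filesystem types of all
// entries in a /proc/mounts style listing whose mount point equals target.
func mountsAt(procMounts, target string) []string {
	var types []string
	scanner := bufio.NewScanner(strings.NewReader(procMounts))
	for scanner.Scan() {
		// Fields: device, mount point, fstype, options, dump, pass.
		fields := strings.Fields(scanner.Text())
		if len(fields) >= 3 && fields[1] == target {
			types = append(types, fields[2])
		}
	}
	return types
}

func main() {
	// Invented sample; not taken from the affected system.
	sample := "tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec 0 0\n" +
		"cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,name=systemd 0 0\n"
	fmt.Println(mountsAt(sample, "/sys/fs/cgroup/systemd"))
}
```

An empty result for /sys/fs/cgroup/systemd after systemd has started would be consistent with the mount having been removed.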
Steps to reproduce the issue
podman build . -t systemd
podman run --rm -it --name sys_ctr --cgroupns private systemd
podman exec -it sys_ctr bash
/sys/fs/cgroup/systemd/
Describe the results you received
There are unexpected errors:
From inside the container, /sys/fs/cgroup/systemd/ appears empty.
Describe the results you expected
Systemd should be running with no warnings.
And systemd cgroup shouldn't be empty. I expected:
podman info output
Podman in a container
No
Privileged Or Rootless
Privileged
Upstream Latest Release
No
Additional environment details
Additional information