-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove tun/tap from the default device rules #3468
Conversation
LGTM |
Looking through git blame, this was added by commit 9fac183 aka "Initial commit of runc binary", most probably by mistake. Obviously, a container should not have access to tun/tap device, unless it is explicitly specified in configuration. Now, removing this might create a compatibility issue, but I see no other choice. Aside from the obvious misconfiguration, this should also fix the annoying > Apr 26 03:46:56 foo.bar systemd[1]: Couldn't stat device /dev/char/10:200: No such file or directory messages from systemd on every container start, when runc uses systemd cgroup driver, and the system runs an old (< v240) version of systemd (the message was presumably eliminated by [1]). [1] systemd/systemd@d5aecba Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
/cc @djs55 @justincormack @crosbymichael ptal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM
(I'm asking around a bit if others can see potential issues with the change)
Got some feedback from some people; one person mentioned that this could affect users such as Tailscale, who may be using this. We have a joined Slack channel with them, so I put out a question to ask for feedback. Also asking internally at Docker (Docker Desktop team didn't foresee issues they wouldn't be able to resolve), and asked the containerd maintainers. |
Tailscale's instructions and sample docker-compose.yml files say to explicitly include /dev/net/tun, anyone who referenced those shouldn't be impacted.
support@tailscale.com can watch for symptoms that look like a container which had been relying on getting a /dev/net/tun device via the prior default setting, and advise adding it in the docker-compose.yml or a command line argument. |
Cc @giuseppe Is this safe for rootless Podman? |
Include explicit device access for /dev/net/tun , see opencontainers/runc#3468 Signed-off-by: David Scott <dave@recoil.org>
@kolyshkin would it be possible for you to post some test-builds from this PR somewhere for people to download? |
from a quick test, it seems the slirp4netns network works fine |
LGTM |
Those are actually available (for every PR), although not in a very consumable form. Click on "Checks" tab, when click "validate" from the left-side list, and on the bottom of that screen you'll see a big |
oh! I didn't know we were storing artefacts; let me pass that on! |
And.. there I went looking, but it's not in the And download link is /~https://github.com/opencontainers/runc/suites/6379868023/artifacts/231538496 (super confusing!) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM(nb)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but I don't think this was a mistake -- the initial set of devices added was based on what was needed to get Docker containers to work with most programs people were using.
When I was reworking the devices handling code a year or two ago, I also wondered whether we should drop this (and I dropped some other far more dangerous rules, but left this one because @crosbymichael couldn't remember what this was needed for and I didn't want to break anything in a security update if I could help it).
That being said, Docker has their own default devices rules that are set externally to runc (so keeping this default is not necessary) and I don't see why a standard container would need tun/tap devices (though since the container is in a network namespace, I suspect that it wouldn't be an issue that they can create such devices). However I believe Docker generates their rules using our library functions, so maybe we should do a test build of Docker with this change and test that before we do a release?
Yes, I was thinking about that later; perhaps this change wouldn't affect the defaults on Docker (or Containerd for that matter).
Good point; perhaps won't hurt to do a draft PR in the docker repo to verify (at least CI). |
opened moby/moby#43569 to get a run of CI with these changes |
Seems passing. Seems that we have a consensus this can be merged. @opencontainers/runc-maintainers this is your last chance to say no :) |
We removed CAP_MKNOD from default list in container tools years ago. I see no reason in the world that the default for containers should have this rule. We never hear complaints about this. and I believe CRI-O also defaults to no CAP_MKNOD. |
Biggest impact on the Kubernetes side for me so far, is having to run pods with |
It appears that we are trading one aspect of security for another. While I may not have extensive experience, isn’t specifying NET_ADMIN to limit access to /dev/net/tun an example of implementing the principle of least privilege? Although this change was introduced two years ago, the recent update to containerd, which now uses the runc binary version 1.2.1 instead of 1.1.14, is beginning to impact a broader range of users. This includes Docker Swarm users, who are particularly affected by the inability to specify device mappings like /dev/net/tun directly in the deploy section of a Compose file. |
So, it looks like upper-level tools do not have ways to specify an additional device to be added to the list, and because of that users have to use privileged containers. This appears very wrong to me, so I'm open to reverting the change. |
It is certainly problematic having to grant privileged access, but that doesn’t change the fact that the original issue still stands, a container should not have access to tun/tap device, unless it is explicitly specified in configuration. I’m not sure the scope of NET_ADMIN, but mounting the device like with Docker is probably a better option overall than NET_ADMIN if it could be supported by upper level tools. |
For docker, it looks like it's possible for I'd have to check on the swarm services side; I know swarm services (by design) have been more constrained on some parts to keep the service-spec portable between nodes in a cluster. |
It also seems to be impacting a fair number of users, so we might have to revert it (especially if there really isn't a practical workaround in higher-level orchestration tools). Though, given my earlier comment about Docker having it's own device list maybe we could push for upstream runtimes to add this explicitly on their end? Idk... It would be really nice if we could give a warning to users to let them know they should stop depending on this mis-feature and specify device configurations properly, but unfortunately there isn't a way of knowing ahead of time if a given container is going to use |
Is there an issue we can follow that is tracking decisions on removal? Is the only known way to get this working on kubernetes with 1.31 to elevate all of the internet accessable security critical containers to be fully privileged? k3s, for example, has hit 1.31 with the stable channel, so it's going to be going out widely, and once people put |
- Bumped runc to 1.2.3 - In new runc default list of devices was changed (/dev/net/tun is removed) - opencontainers/runc#3468 - Switched to containerd config v2. v1 is deprecated. - There are no subsystems in cgroup v2. If Tag is provided cgroup2 is mounted to /tmp/cgroup-N/unified (for N parallel tests). If Tag is not provided garden cgroup is in format /sys/fs/cgroup/garden. - CPU shares are now replaced with CPU weight. - In cgroups v2 kernel throws an error when large number is provided for CPU weight. In cgroup v1 kernel accepts the number for CPU shares and saves as MAX_SHARES. This behavior is replicated in the SharesBalancer. - CPUCgrouper is manually enabling cgroup controllers since bad cgroup folder is manually created. - CPU usage is read from cpu.stat file for cgroup v2. - In cgroup v2 only leaf cgroups can have processes. Cgroup for containerd garden-init is moved from /sys/fs/cgroup/garden/handle to /sys/fs/cgroup/garden/handle/init since /sys/fs/cgroup/garden/handle will contain pea cgroups and can not be leaf. Cgroup resources are manually set on /sys/fs/cgroup/garden/handle and this folder is manually cleaned up. - Switched to updated cloudfoundry docker images from unsupported cfgarden docker images.
- Bumped runc to 1.2.3 - In new runc default list of devices was changed (/dev/net/tun is removed) - opencontainers/runc#3468 - Switched to containerd config v2. v1 is deprecated. - There are no subsystems in cgroup v2. If Tag is provided cgroup2 is mounted to /tmp/cgroup-N/unified (for N parallel tests). If Tag is not provided garden cgroup is in format /sys/fs/cgroup/garden. - CPU shares are now replaced with CPU weight. - In cgroups v2 kernel throws an error when large number is provided for CPU weight. In cgroup v1 kernel accepts the number for CPU shares and saves as MAX_SHARES. This behavior is replicated in the SharesBalancer. - CPUCgrouper is manually enabling cgroup controllers since bad cgroup folder is manually created. - CPU usage is read from cpu.stat file for cgroup v2. - In cgroup v2 only leaf cgroups can have processes. Cgroup for containerd garden-init is moved from /sys/fs/cgroup/garden/handle to /sys/fs/cgroup/garden/handle/init since /sys/fs/cgroup/garden/handle will contain pea cgroups and can not be leaf. Cgroup resources are manually set on /sys/fs/cgroup/garden/handle and this folder is manually cleaned up. - Switched to updated cloudfoundry docker images from unsupported cfgarden docker images.
- Bumped runc to 1.2.3 - In new runc default list of devices was changed (/dev/net/tun is removed) - opencontainers/runc#3468 - Switched to containerd config v2. v1 is deprecated. - There are no subsystems in cgroup v2. If Tag is provided cgroup2 is mounted to /tmp/cgroup-N/unified (for N parallel tests). If Tag is not provided garden cgroup is in format /sys/fs/cgroup/garden. - CPU shares are now replaced with CPU weight. - In cgroups v2 kernel throws an error when large number is provided for CPU weight. In cgroup v1 kernel accepts the number for CPU shares and saves as MAX_SHARES. This behavior is replicated in the SharesBalancer. - CPUCgrouper is manually enabling cgroup controllers since bad cgroup folder is manually created. - CPU usage is read from cpu.stat file for cgroup v2. - In cgroup v2 only leaf cgroups can have processes. Cgroup for containerd garden-init is moved from /sys/fs/cgroup/garden/handle to /sys/fs/cgroup/garden/handle/init since /sys/fs/cgroup/garden/handle will contain pea cgroups and can not be leaf. Cgroup resources are manually set on /sys/fs/cgroup/garden/handle and this folder is manually cleaned up. - Switched to updated cloudfoundry docker images from unsupported cfgarden docker images.
- Bumped runc to 1.2.3 - In new runc default list of devices was changed (/dev/net/tun is removed) - opencontainers/runc#3468 - Switched to containerd config v2. v1 is deprecated. - There are no subsystems in cgroup v2. If Tag is provided cgroup2 is mounted to /tmp/cgroup-N/unified (for N parallel tests). If Tag is not provided garden cgroup is in format /sys/fs/cgroup/garden. - CPU shares are now replaced with CPU weight. - In cgroups v2 kernel throws an error when large number is provided for CPU weight. In cgroup v1 kernel accepts the number for CPU shares and saves as MAX_SHARES. This behavior is replicated in the SharesBalancer. - CPUCgrouper is manually enabling cgroup controllers since bad cgroup folder is manually created. - CPU usage is read from cpu.stat file for cgroup v2. - In cgroup v2 only leaf cgroups can have processes. Cgroup for containerd garden-init is moved from /sys/fs/cgroup/garden/handle to /sys/fs/cgroup/garden/handle/init since /sys/fs/cgroup/garden/handle will contain pea cgroups and can not be leaf. Cgroup resources are manually set on /sys/fs/cgroup/garden/handle and this folder is manually cleaned up. - Switched to updated cloudfoundry docker images from unsupported cfgarden docker images.
- Bumped runc to 1.2.3 - In new runc default list of devices was changed (/dev/net/tun is removed) - opencontainers/runc#3468 - Switched to containerd config v2. v1 is deprecated. - There are no subsystems in cgroup v2. If Tag is provided cgroup2 is mounted to /tmp/cgroup-N/unified (for N parallel tests). If Tag is not provided garden cgroup is in format /sys/fs/cgroup/garden. - CPU shares are now replaced with CPU weight. - In cgroups v2 kernel throws an error when large number is provided for CPU weight. In cgroup v1 kernel accepts the number for CPU shares and saves as MAX_SHARES. This behavior is replicated in the SharesBalancer. - CPUCgrouper is manually enabling cgroup controllers since bad cgroup folder is manually created. - CPU usage is read from cpu.stat file for cgroup v2. - In cgroup v2 only leaf cgroups can have processes. Cgroup for containerd garden-init is moved from /sys/fs/cgroup/garden/handle to /sys/fs/cgroup/garden/handle/init since /sys/fs/cgroup/garden/handle will contain pea cgroups and can not be leaf. Cgroup resources are manually set on /sys/fs/cgroup/garden/handle and this folder is manually cleaned up. - Switched to updated cloudfoundry docker images from unsupported cfgarden docker images.
- Bumped runc to 1.2.3 - In new runc default list of devices was changed (/dev/net/tun is removed) - opencontainers/runc#3468 - Switched to containerd config v2. v1 is deprecated. - There are no subsystems in cgroup v2. If Tag is provided cgroup2 is mounted to /tmp/cgroup-N/unified (for N parallel tests). If Tag is not provided garden cgroup is in format /sys/fs/cgroup/garden. - CPU shares are now replaced with CPU weight. - In cgroups v2 kernel throws an error when large number is provided for CPU weight. In cgroup v1 kernel accepts the number for CPU shares and saves as MAX_SHARES. This behavior is replicated in the SharesBalancer. - CPUCgrouper is manually enabling cgroup controllers since bad cgroup folder is manually created. - CPU usage is read from cpu.stat file for cgroup v2. - In cgroup v2 only leaf cgroups can have processes. Cgroup for containerd garden-init is moved from /sys/fs/cgroup/garden/handle to /sys/fs/cgroup/garden/handle/init since /sys/fs/cgroup/garden/handle will contain pea cgroups and can not be leaf. Cgroup resources are manually set on /sys/fs/cgroup/garden/handle and this folder is manually cleaned up. - Switched to updated cloudfoundry docker images from unsupported cfgarden docker images.
This breaks my application, I'll need to set privileged=true to use the tun device, which is very riscky than NET_ADMIN. |
- Bumped runc to 1.2.3 - In new runc default list of devices was changed (/dev/net/tun is removed) - opencontainers/runc#3468 - Switched to containerd config v2. v1 is deprecated. - There are no subsystems in cgroup v2. If Tag is provided cgroup2 is mounted to /tmp/cgroup-N/unified (for N parallel tests). If Tag is not provided garden cgroup is in format /sys/fs/cgroup/garden. - CPU shares are now replaced with CPU weight. - In cgroups v2 kernel throws an error when large number is provided for CPU weight. In cgroup v1 kernel accepts the number for CPU shares and saves as MAX_SHARES. This behavior is replicated in the SharesBalancer. - CPUCgrouper is manually enabling cgroup controllers since bad cgroup folder is manually created. - CPU usage is read from cpu.stat file for cgroup v2. - In cgroup v2 only leaf cgroups can have processes. Cgroup for containerd garden-init is moved from /sys/fs/cgroup/garden/handle to /sys/fs/cgroup/garden/handle/init since /sys/fs/cgroup/garden/handle will contain pea cgroups and can not be leaf. Cgroup resources are manually set on /sys/fs/cgroup/garden/handle and this folder is manually cleaned up. - Switched to updated cloudfoundry docker images from unsupported cfgarden docker images.
- Bumped runc to 1.2.3 - In new runc default list of devices was changed (/dev/net/tun is removed) - opencontainers/runc#3468 - Switched to containerd config v2. v1 is deprecated. - There are no subsystems in cgroup v2. If Tag is provided cgroup2 is mounted to /tmp/cgroup-N/unified (for N parallel tests). If Tag is not provided garden cgroup is in format /sys/fs/cgroup/garden. - CPU shares are now replaced with CPU weight. - In cgroups v2 kernel throws an error when large number is provided for CPU weight. In cgroup v1 kernel accepts the number for CPU shares and saves as MAX_SHARES. This behavior is replicated in the SharesBalancer. - CPUCgrouper is manually enabling cgroup controllers since bad cgroup folder is manually created. - CPU usage is read from cpu.stat file for cgroup v2. - In cgroup v2 only leaf cgroups can have processes. Cgroup for containerd garden-init is moved from /sys/fs/cgroup/garden/handle to /sys/fs/cgroup/garden/handle/init since /sys/fs/cgroup/garden/handle will contain pea cgroups and can not be leaf. Cgroup resources are manually set on /sys/fs/cgroup/garden/handle and this folder is manually cleaned up. - Switched to updated cloudfoundry docker images from unsupported cfgarden docker images.
- Bumped runc to 1.2.3 - In new runc default list of devices was changed (/dev/net/tun is removed) - opencontainers/runc#3468 - Switched to containerd config v2. v1 is deprecated. - There are no subsystems in cgroup v2. If Tag is provided cgroup2 is mounted to /tmp/cgroup-N/unified (for N parallel tests). If Tag is not provided garden cgroup is in format /sys/fs/cgroup/garden. - CPU shares are now replaced with CPU weight. - In cgroups v2 kernel throws an error when large number is provided for CPU weight. In cgroup v1 kernel accepts the number for CPU shares and saves as MAX_SHARES. This behavior is replicated in the SharesBalancer. - CPUCgrouper is manually enabling cgroup controllers since bad cgroup folder is manually created. - CPU usage is read from cpu.stat file for cgroup v2. - In cgroup v2 only leaf cgroups can have processes. Cgroup for containerd garden-init is moved from /sys/fs/cgroup/garden/handle to /sys/fs/cgroup/garden/handle/init since /sys/fs/cgroup/garden/handle will contain pea cgroups and can not be leaf. Cgroup resources are manually set on /sys/fs/cgroup/garden/handle and this folder is manually cleaned up. - Switched to updated cloudfoundry docker images from unsupported cfgarden docker images.
Looking through git blame, this was added by commit 9fac183
aka "Initial commit of runc binary", most probably by mistake.
Obviously, a container should not have access to tun/tap device, unless
it is explicitly specified in configuration.
Now, removing this might create a compatibility issue, but I see no
other choice.
Aside from the obvious misconfiguration, this should also fix the
annoying
messages from systemd on every container start, when runc uses systemd
cgroup driver, and the system runs an old (< v240) version of systemd
(the message was presumably eliminated by [1]).
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1945929
[1] systemd/systemd@d5aecba