
Remove tun/tap from the default device rules #3468

Merged · 1 commit · May 10, 2022

Conversation

@kolyshkin (Contributor) commented May 4, 2022

Looking through git blame, this was added by commit 9fac183
aka "Initial commit of runc binary", most probably by mistake.

Obviously, a container should not have access to the tun/tap device
unless it is explicitly specified in the configuration.

Now, removing this might create a compatibility issue, but I see no
other choice.

Aside from the obvious misconfiguration, this should also fix the
annoying

> Apr 26 03:46:56 foo.bar systemd[1]: Couldn't stat device /dev/char/10:200: No such file or directory

messages from systemd on every container start, when runc uses systemd
cgroup driver, and the system runs an old (< v240) version of systemd
(the message was presumably eliminated by [1]).

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1945929

[1] systemd/systemd@d5aecba
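
For anyone who was relying on the old default, the device now has to be granted explicitly. As a rough sketch (not a complete config.json, and assuming the standard /dev/net/tun character device, major 10, minor 200), the relevant OCI runtime-spec fragment would look something like this:

```json
{
  "linux": {
    "devices": [
      { "path": "/dev/net/tun", "type": "c", "major": 10, "minor": 200, "fileMode": 438 }
    ],
    "resources": {
      "devices": [
        { "allow": false, "access": "rwm" },
        { "allow": true, "type": "c", "major": 10, "minor": 200, "access": "rwm" }
      ]
    }
  }
}
```

The first resources.devices entry is the usual default-deny rule; the second re-adds an allow rule for the tun device node created by the linux.devices entry above (fileMode 438 is 0666 in decimal, since JSON has no octal literals).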

@rhatdan (Contributor) commented May 4, 2022

LGTM

@thaJeztah (Member):

/cc @djs55 @justincormack @crosbymichael ptal

@thaJeztah (Member) left a comment

SGTM

(I'm asking around a bit to see whether others see potential issues with the change)

@thaJeztah (Member):

Got some feedback from a few people; one person mentioned that this could affect users such as Tailscale, who may be relying on this. We have a shared Slack channel with them, so I put out a question to ask for feedback.

Also asking internally at Docker (the Docker Desktop team didn't foresee any issues they wouldn't be able to resolve), and I've asked the containerd maintainers.

@DentonGentry:

> Got some feedback from a few people; one person mentioned that this could affect users such as Tailscale, who may be relying on this. We have a shared Slack channel with them, so I put out a question to ask for feedback.

Tailscale's instructions and sample docker-compose.yml files say to explicitly include /dev/net/tun, so anyone who referenced those shouldn't be impacted:

  devices:
    - /dev/net/tun:/dev/net/tun
  cap_add:
    - NET_ADMIN
    - NET_RAW

support@tailscale.com can watch for symptoms that look like a container which had been relying on getting a /dev/net/tun device via the prior default setting, and advise adding it in docker-compose.yml or via a command-line argument.
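
For containers started with plain docker run rather than Compose, the equivalent invocation would be something along these lines (illustrative only; the image name is a placeholder):

```sh
docker run -d \
  --device /dev/net/tun:/dev/net/tun \
  --cap-add NET_ADMIN \
  --cap-add NET_RAW \
  tailscale/tailscale
```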

@AkihiroSuda (Member):

Cc @giuseppe Is this safe for rootless Podman?

@AkihiroSuda AkihiroSuda added this to the 1.2.0 milestone May 5, 2022
djs55 added a commit to djs55/vpnkit that referenced this pull request May 5, 2022
Include explicit device access for /dev/net/tun, see
opencontainers/runc#3468

Signed-off-by: David Scott <dave@recoil.org>
@thaJeztah (Member):

@kolyshkin would it be possible for you to post some test-builds from this PR somewhere for people to download?

@giuseppe (Member) commented May 5, 2022

> Cc @giuseppe Is this safe for rootless Podman?

From a quick test, it seems the slirp4netns network works fine.

@mrunalp (Contributor) commented May 5, 2022

LGTM

@kolyshkin (Contributor, Author):

> @kolyshkin would it be possible for you to post some test-builds from this PR somewhere for people to download?

Those are actually available (for every PR), although not in a very consumable form. Click on the "Checks" tab, then click "validate" in the left-side list, and at the bottom of that screen you'll see a big release-xxxxxx tarball under "Artifacts". It contains all the static binaries.

@thaJeztah (Member):

> @kolyshkin would it be possible for you to post some test-builds from this PR somewhere for people to download?

> Those are actually available (for every PR), although not in a very consumable form. Click on the "Checks" tab, then click "validate" in the left-side list, and at the bottom of that screen you'll see a big release-xxxxxx tarball under "Artifacts". It contains all the static binaries.

oh! I didn't know we were storing artefacts; let me pass that on!

@thaJeztah (Member):

And... there I went looking, but it's not in the "Checks" tab, it's in the top-level "Actions" tab; this is the last run for this PR: /~https://github.com/opencontainers/runc/actions/runs/2272679304

And the download link is /~https://github.com/opencontainers/runc/suites/6379868023/artifacts/231538496

(super confusing!)

@fuweid (Member) left a comment

LGTM(nb)

@cyphar (Member) left a comment

LGTM, but I don't think this was a mistake -- the initial set of devices added was based on what was needed to get Docker containers to work with most programs people were using.

When I was reworking the devices handling code a year or two ago, I also wondered whether we should drop this (and I dropped some other far more dangerous rules, but left this one because @crosbymichael couldn't remember what this was needed for and I didn't want to break anything in a security update if I could help it).

That being said, Docker has its own default device rules that are set externally to runc (so keeping this default is not necessary), and I don't see why a standard container would need tun/tap devices (though since the container is in a network namespace, I suspect that it wouldn't be an issue that they can create such devices). However, I believe Docker generates its rules using our library functions, so maybe we should do a test build of Docker with this change and test that before we do a release?

@thaJeztah (Member):

> That being said, Docker has its own default device rules that are set externally to runc (so keeping this default is not necessary)

Yes, I was thinking about that later; perhaps this change wouldn't affect the defaults on Docker (or containerd, for that matter).

> However, I believe Docker generates its rules using our library functions, so maybe we should do a test build of Docker with this change and test that before we do a release?

Good point; it probably won't hurt to open a draft PR in the docker repo to verify (at least in CI).

@thaJeztah (Member):

opened moby/moby#43569 to get a run of CI with these changes

@kolyshkin (Contributor, Author):

> opened moby/moby#43569 to get a run of CI with these changes

Seems to be passing, and it looks like we have a consensus that this can be merged. @opencontainers/runc-maintainers this is your last chance to say no :)

@rhatdan (Contributor) commented Dec 9, 2024

We removed CAP_MKNOD from the default list in container tools years ago. I see no reason in the world that the default for containers should include this rule. We never hear complaints about this, and I believe CRI-O also defaults to no CAP_MKNOD.

@maggie44 commented Dec 12, 2024

The biggest impact on the Kubernetes side for me so far is having to run pods as privileged instead of with NET_ADMIN (tailscale/tailscale#14262 (comment)). It seems like a big jump up in permissions just to regain functionality that could previously be achieved with NET_ADMIN. I am creating /dev/net/tun in the container rather than host-mounting it, but I have to grant elevated privileges to use it.
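
For context, this is roughly the Kubernetes shape of the problem (a hedged sketch; the pod name and image are illustrative). Before this change, a pod along these lines could use a tun device created inside the container with just NET_ADMIN; with the default rule gone, the device cgroup denies access to c 10:200, so users report having to fall back to privileged: true:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tun-example            # illustrative name
spec:
  containers:
    - name: vpn
      image: tailscale/tailscale
      securityContext:
        capabilities:
          add: ["NET_ADMIN"]   # used to be enough while runc allowed c 10:200 by default
        # privileged: true     # the heavier workaround people now resort to
```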

@nillestimothy:

> The biggest impact on the Kubernetes side for me so far is having to run pods as privileged instead of with NET_ADMIN (tailscale/tailscale#14262 (comment)). It seems like a big jump up in permissions just to regain functionality that could previously be achieved with NET_ADMIN. I am creating /dev/net/tun in the container rather than host-mounting it, but I have to grant elevated privileges to use it.

It appears that we are trading one aspect of security for another. While I may not have extensive experience, isn’t specifying NET_ADMIN to limit access to /dev/net/tun an example of implementing the principle of least privilege?

Although this change was introduced two years ago, the recent update to containerd, which now uses the runc binary version 1.2.1 instead of 1.1.14, is beginning to impact a broader range of users. This includes Docker Swarm users, who are particularly affected by the inability to specify device mappings like /dev/net/tun directly in the deploy section of a Compose file.

@kolyshkin (Contributor, Author):

So, it looks like upper-level tools do not have ways to specify an additional device to be added to the list, and because of that users have to use privileged containers. This appears very wrong to me, so I'm open to reverting the change.

@maggie44:

> So, it looks like upper-level tools do not have ways to specify an additional device to be added to the list, and because of that users have to use privileged containers. This appears very wrong to me, so I'm open to reverting the change.

It is certainly problematic having to grant privileged access, but that doesn't change the fact that the original issue still stands: a container should not have access to the tun/tap device unless it is explicitly specified in the configuration. I'm not sure of the scope of NET_ADMIN, but mounting the device, as with Docker, is probably a better option overall than NET_ADMIN if it could be supported by upper-level tools.

@thaJeztah (Member):

> So, it looks like upper-level tools do not have ways to specify an additional device to be added to the list,

For docker, it looks like it's possible for docker run (and docker compose) by adding it as a device; see #3468 (comment).

I'd have to check on the swarm services side; I know swarm services (by design) have been more constrained on some parts to keep the service-spec portable between nodes in a cluster.

@cyphar (Member) commented Dec 14, 2024

It also seems to be impacting a fair number of users, so we might have to revert it (especially if there really isn't a practical workaround in higher-level orchestration tools). Though, given my earlier comment about Docker having its own device list, maybe we could push for upstream runtimes to add this explicitly on their end? Idk...

It would be really nice if we could give a warning to users to let them know they should stop depending on this mis-feature and specify device configurations properly, but unfortunately there isn't a way of knowing ahead of time if a given container is going to use /dev/net/tun (there are ways of getting that information by tracing the container in various ways, but that kind of tracing would not fit with the runtime model of runc -- this would be something that higher level runtimes would need to implement, and I suspect they'd have little interest in doing that).

@ClashTheBunny:

Is there an issue we can follow that is tracking decisions on removal?

Is the only known way to get this working on Kubernetes 1.31 to elevate all of the internet-accessible, security-critical containers to be fully privileged? k3s, for example, has hit 1.31 on the stable channel, so it's going to roll out widely, and once people put privileged: true into their manifests, you're not going to get that removed anytime soon, if ever.

@kolyshkin (Contributor, Author):

#4555

mariash added a commit to cloudfoundry/guardian that referenced this pull request Dec 17, 2024
- Bumped runc to 1.2.3
- In new runc default list of devices was changed (/dev/net/tun is
  removed) - opencontainers/runc#3468
- Switched to containerd config v2. v1 is deprecated.
- There are no subsystems in cgroup v2. If Tag is provided cgroup2 is
  mounted to /tmp/cgroup-N/unified (for N parallel tests). If Tag is not
  provided garden cgroup is in format /sys/fs/cgroup/garden.
- CPU shares are now replaced with CPU weight.
- In cgroups v2 kernel throws an error when large number is provided for
  CPU weight. In cgroup v1 kernel accepts the number for CPU shares and
  saves as MAX_SHARES. This behavior is replicated in the
  SharesBalancer.
- CPUCgrouper is manually enabling cgroup controllers since bad cgroup
  folder is manually created.
- CPU usage is read from cpu.stat file for cgroup v2.
- In cgroup v2 only leaf cgroups can have processes. Cgroup for
  containerd garden-init is moved from /sys/fs/cgroup/garden/handle to
  /sys/fs/cgroup/garden/handle/init since /sys/fs/cgroup/garden/handle
  will contain pea cgroups and can not be leaf. Cgroup resources are
  manually set on /sys/fs/cgroup/garden/handle and this folder is
  manually cleaned up.
- Switched to updated cloudfoundry docker images from unsupported
  cfgarden docker images.
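
As a side note on the "CPU shares are now replaced with CPU weight" point above, the mapping runc applies when translating cgroup v1 cpu.shares to cgroup v2 cpu.weight is, to the best of my knowledge, the formula sketched below (the clamping mirrors the v1 kernel's MIN/MAX_SHARES behaviour described in the commit message; this is not guardian's actual SharesBalancer code):

```go
package main

import "fmt"

// cpuSharesToWeight maps cgroup v1 cpu.shares (2..262144) onto
// cgroup v2 cpu.weight (1..10000), clamping out-of-range values the
// way the v1 kernel effectively did for cpu.shares.
func cpuSharesToWeight(shares uint64) uint64 {
	if shares == 0 {
		return 0 // unset
	}
	if shares < 2 {
		shares = 2 // MIN_SHARES
	}
	if shares > 262144 {
		shares = 262144 // MAX_SHARES
	}
	return 1 + ((shares-2)*9999)/262142
}

func main() {
	for _, s := range []uint64{2, 1024, 262144, 1 << 32} {
		fmt.Printf("shares %d -> weight %d\n", s, cpuSharesToWeight(s))
	}
}
```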
@SuzukiHonoka:

This breaks my application; I'll need to set privileged=true to use the tun device, which is much riskier than NET_ADMIN.
