Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

api, cgroupv2: skip setting the devices cgroup #2474

Closed
giuseppe opened this issue Jun 16, 2020 · 14 comments · Fixed by #2490
Closed

api, cgroupv2: skip setting the devices cgroup #2474

giuseppe opened this issue Jun 16, 2020 · 14 comments · Fixed by #2490

Comments

@giuseppe
Copy link
Member

The Kubelet uses libcontainer/cgroups to setup cgroups. It would be nice to have a way to skip setting the devices cgroup at all.

@mrunalp
Copy link
Contributor

mrunalp commented Jun 16, 2020

@kolyshkin @AkihiroSuda fyi

@cyphar
Copy link
Member

cyphar commented Jun 17, 2020

😬 Is there a reason for wanting to skip the devices cgroup? The hard requirement for the devices cgroup is to make sure we have a fail-secure setup (if you don't set the devices whitelist, you will allow users to do all sorts of scary things). We could make it so that this requirement is only present for runc but I'm not sure how ugly that would become...

@giuseppe
Copy link
Member Author

the Kubelet uses the libcontainer code to create the parent cgroups (e.g. /sys/fs/cgroup/kubepods). Each container will have its own cgroup under /sys/fs/cgroup/kubepods where the device cgroup is configured.
I don't think it is a problem on cgroup v1, but on cgroup v2, every time the Kubelet restarts and tries to configure the cgroup, it leaks an eBPF program

@cyphar
Copy link
Member

cyphar commented Jun 17, 2020

Ah okay. Yeah we might need to add a way to configure that. The only important thing is that this should be strictly opt-out with a fairly large warning sign next to the configuration option.

@kolyshkin
Copy link
Contributor

Addressed this one in #2490

@cyphar
Copy link
Member

cyphar commented Jul 3, 2020

As discussed in #2490, I believe that the cgroupv2 eBPF issue is a runc bug that we should fix anyway (tracked by #2366). Would the Kubernetes folks still need this if we fixed the cgroupv2 eBPF issue?

@giuseppe
Copy link
Member Author

giuseppe commented Jul 3, 2020

the devices cgroup is not used at all by Kubernetes: /~https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/container_manager_linux.go#L378-L385

Since the cost of running such eBPF program is close to 0, I am fine if the underlying issue is addressed instead of offering a different API.

@odinuge
Copy link
Contributor

odinuge commented Jan 25, 2021

the devices cgroup is not used at all by Kubernetes:

Well, that is kinda true, but only applies to kubelet and its usage of libcontainer.

However, Kubernetes is implicitly using it via the CRI api since most CRI implementations use runc by default. As I understand the cgroup.SkipDevices, is only for users of the go api, not the binary.

When running kubernetes with with eg. containerd as the runtime (via CRI), it uses the runc binary. Kubernetes is also constantly updating the cpus-set for containers via a component called the "CPU Manager". When trying to run that setup with cgroup v2, containerd start replying to container resource updates (CRI update consists of only CPUset, but unsure what command containerd sends to runc, but can take a look at that) with runc did not terminate successfully: failed to call BPF_PROG_ATTACH (BPF_CGROUP_DEVICE, BPF_F_ALLOW_MULTI): argument list too long. This makes it essentially block cpuset updates for containers when using runc.

Can x-post this to #2366 tho, since it might be useful in that context.

@xiaoxubeii
Copy link

xiaoxubeii commented Apr 1, 2021

the devices cgroup is not used at all by Kubernetes:

Well, that is kinda true, but only applies to kubelet and its usage of libcontainer.

However, Kubernetes is implicitly using it via the CRI api since most CRI implementations use runc by default. As I understand the cgroup.SkipDevices, is only for users of the go api, not the binary.

When running kubernetes with with eg. containerd as the runtime (via CRI), it uses the runc binary. Kubernetes is also constantly updating the cpus-set for containers via a component called the "CPU Manager". When trying to run that setup with cgroup v2, containerd start replying to container resource updates (CRI update consists of only CPUset, but unsure what command containerd sends to runc, but can take a look at that) with runc did not terminate successfully: failed to call BPF_PROG_ATTACH (BPF_CGROUP_DEVICE, BPF_F_ALLOW_MULTI): argument list too long. This makes it essentially block cpuset updates for containers when using runc.

Can x-post this to #2366 tho, since it might be useful in that context.

Yes, I think it is no way to set cgroup.SkipDevices directly by CRI api. There is a bit tricky for Kubernetes to use cgroups v2 : (

@bharathguvvala
Copy link

bharathguvvala commented Jun 8, 2021

We are seeing the update cpu resources flow breaking in kubernetes as mentioned in this comment with runc 1.0.0-rc93. Should this issue be reopened?

@bharathguvvala
Copy link

We are seeing the update cpu resources flow breaking in kubernetes as mentioned in this comment with runc 1.0.0-rc93. Should this issue be reopened?

@kolyshkin @AkihiroSuda can we discuss approaches to fix this, so that this will not be an issue for kubernetes?

@cyphar
Copy link
Member

cyphar commented Jun 8, 2021

If CRI is using runc update, #2994 should've already fix this issue -- runc update now skips devices cgroup updates.

But if CRI implementations are calling something other than runc update that seems a bit questionable -- I'd really prefer to not add a new runc flag to disable a security feature (we already have --no-pivot-root and --no-new-keyring -- both of which have caused more headaches than we'd like, because people use them rather than running runc in a usable environment, reducing the security of their containers).

@bharathguvvala
Copy link

Thanks @cyphar -for the comment. I have verified it with the dev build from master and the issue seems to be absent with the fix #2951. May I know in when this is expected to rollout as part of a release.

@cyphar
Copy link
Member

cyphar commented Jun 9, 2021

We're working on a 1.0.0 release at the moment, it was going to be released last week but there's a regression under Docker's CI (we've fixed the issue though -- see #3009). I would expect a new release in a week or so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants