-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cgroups: add pids controller support #58
Conversation
@@ -49,6 +49,11 @@ type MemoryStats struct { | |||
Stats map[string]uint64 `json:"stats,omitempty"` | |||
} | |||
|
|||
type PidsStats struct { | |||
// current counter usage | |||
Current uint64 `json:"current,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NumPids? We should also document that it is the number of pids in the cgroup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer it matches the name of the sysfs file. But it doesn't really matter I guess.
I think we also need to make some changes for the systemd portion. |
@vmarmol Yes, quite. I'll update it in a few hours. |
Fixed up the issues (it also wouldn't build properly because I didn't fix the imports from the old ones). |
The change to systemd part doesn't look sufficient. You also need to add some changes to Apply(). |
We also need to update the spec if we want support in |
The PIDs changes have been merged into Linus' tree as of torvalds/linux@8bdc69b. I'll update this patch in a bit (if it needs updating). |
Since #242 has been merged, it's probably time to get around to reviewing this. /ping @crosbymichael @vmarmol @mrunalp @LK4D4 |
I will take a look tomorrow Sent from my iPhone
|
code, LGTM |
I am just testing this and noticed that we need to set pids.max to 4 for runc to be able to spawn the container. |
@cyphar Can we delay applying the setting so that we can set it to lower values like 1 or 2 for the exec'd process? |
ping @cyphar |
@@ -57,6 +57,11 @@ type Cgroup struct { | |||
// MEM to use | |||
CpusetMems string `json:"cpuset_mems"` | |||
|
|||
// Process limit; set to `0' to disable limit. | |||
// While technically `0' is a valid PID limit, it does not make sense in the | |||
// context of a container -- it is identical to a limit of `1'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But SPEC say
// Maximum number of PIDs. A value < 0 implies "no limit".
Limit int64 `json:"limit"`
Please consider base on the PR #369
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gah, that should read <= 0
. Whoops. I'll send a PR to fix the comment when I get a chance. The problem with 0
being a valid limit is that it makes the spec non-backwards compatible (and you have to specify it in config.json
) ... unless the change in #369 changes the default?
-- not to mention the fact that a limit of 0
in a pids cgroup doesn't mean anything different to a limit of 1
. Since attaches aren't blocked by cgroup core, you need to have at least one process in the cgroup in order for the limits to affect anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cyphar I think limit of 0 means don't allow the process in container to fork() or clone(). That's really a reasonable scenario I think. Yes, there is at lease 1 process in container, that said, the pids.current is always >= 1. But that does not prevent us to set the pids.max to 0, where I want to prevent fork() and clone() in container.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yangdongsheng But ... it doesn't actually have a practical meaning. Having a limit of 0
is precisely identical to having a limit of 1
. Both prevent fork()
and clone()
in a container with a single process in it. There's no special code path that the pids controller goes through if the limit is specifically 0
(ref: I wrote it). I do understand that it might seem nice to have a limit of zero to specify "this container should never have any processes in it", but I'm not sure if the resource limiting part of the spec is the right way to go (I'd rather have container types as first-class citizens).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think make 0 being a valid limit is necessary, since a cgroup without any processes is meaningless, and we agree that there would always be at least 1 process in container, so set pids.max=1
would be enough to prevent any fork() or clone() in container.
And take the default value of the type (such as 0 of int, null of mapping etc..) as the invalid value is always what we do, except some special types such as bool (like cgroup.OomKillDisable). So I prefer take 0 as the invalid value, and change specs instead. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cyphar So, the reason sounds like following the convention in runc, right? But opencontainers/runtime-spec#233 is trying to set the default value to -1.
And I think pids cgroup is a supporter for opencontainers/runtime-spec#233 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I think pids cgroup is a supporter for opencontainers/runtime-spec#233 .
@yangdongsheng I don't understand why you would think that. You can't set a value of -1
to mean max
in the PIDs controller ...?
I also don't agree with opencontainers/runtime-spec#233 (I just commented with my reservations). I'd be okay with it if we change all of the limits to be pointers (so you can set null
to mean default, or omit the option to also mean default) and then get consumers to deal with the default values explicitly (by checking for null).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cyphar Okey, on my second thought, I think idea of "making Limit = 0 valid" from myself would also not reduce the confusion from user. Let's wait for the conclusion in ML about removal cgroup and opencontainers/runtime-spec#233 . For now, I would say, Limit <= 0 measn max
is good enough. :)
But I would read the discussion in cgroup to see why we have to let attaching of pids cgroup break pids.max, if I got some time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yangdongsheng ... pids.max
isn't broken. In what way do you think it's broken? It's discussed here: http://thread.gmane.org/gmane.linux.kernel.cgroups/13292.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cyphar Yes, I see that's intentional. But I just want to read more about the discussion of it to see what did TJ said about it. Thanx for the reference, will open and read it later. :)
@cyphar So, |
@LK4D4 I'd prefer that we wait until opencontainers/runtime-spec#233 is decided on. The current state is that |
@mrunalp Sorry for the late response. Do you know where |
@cyphar I think its here /~https://github.com/opencontainers/runc/blob/master/libcontainer/process_linux.go#L196-L200 So basically we run a bootstrap process, put the bootstrap process into correct cgroups and then start the actual process. So its likely that limit=1 will not work ? |
@dqminh Oh, I see. I can see two solutions to this problem:
I've implemented the second, but I'm not sure if it's the best solution. Full disclosure: I'm not currently on a system where I can test that it works (I'm currently compiling the kernel). |
@cyphar yep, both options are bad unfortunately :( Transparently "Increment pids limit by 1" is actually worse imo because it's only used for bootstrap, also it doesnt correspond to what user asked for. Temporarily increase then decrease the value may not work as well because then we have race condition. Not allow "PidsLimit=1" seems just a bit more reasonable, just because i dont think its possible to really allow it given how the container is bootstrapped. WDYT ? @crosbymichael @LK4D4 @mrunalp @vishh @avagin |
@dqminh I'd like to point out that an off-by-one isn't that bad, because in the context of PID resource exhaustion one PID is not going to make a significant difference. But I do agree that it is not the best solution (and that incrementing then decrementing will have a race). We can't get the bootstrapping process to change the limit, and we don't have control over the final process running in the container. Is there any "callback" when the bootstrapping is complete? |
I am okay with having the setting be some minimum other than 1. |
for _, sys := range subsystems { | ||
// We can't set this here, because after being applied, memcg doesn't | ||
// allow a non-empty cgroup from having its limits changed. | ||
if sys.Name() == "memory" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure this is true? Did it change in a kernel version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@crosbymichael No, this hasn't changed (see 39279b1), it's a specific issue with kernel memory limiting. This is the case for kmemcg
limits. Since the kmemcg
limits are integrated into the memory cgroup, I couldn't think of a "nice" way of only setting half of the options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can make this change now or in this pr because updating memory limits, not kernel limits, is a big usecase and this just stops it. Maybe can we keep the changes small and only do the pid change in this pr and update the other things later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
On Fri, Dec 18, 2015 at 10:22 AM, Michael Crosby notifications@github.com
wrote:
In libcontainer/cgroups/fs/apply_raw.go
#58 (comment):@@ -179,11 +180,24 @@ func (m _Manager) GetStats() (_cgroups.Stats, error) {
}func (m *Manager) Set(container *configs.Config) error {
- for name, path := range m.Paths {
sys, err := subsystems.Get(name)
if err == errSubsystemDoesNotExist || !cgroups.PathExists(path) {
- for _, sys := range subsystems {
// We can't set this here, because after being applied, memcg doesn't
// allow a non-empty cgroup from having its limits changed.
if sys.Name() == "memory" {
I don't think we can make this change now or in this pr because updating
memory limits, not kernel limits, is a big usecase and this just stops it.
Maybe can we keep the changes small and only do the pid change in this pr
and update the other things later?—
Reply to this email directly or view it on GitHub
/~https://github.com/opencontainers/runc/pull/58/files#r48053452.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is how this code has always worked. The only change I've made is separating .Set()
and .Apply()
. The previous code explicitly did this this way (I actually haven't modified the Apply()
method for MemoryGroup
). In other words, I'm not making any change to how the MemoryGroup
works.
Memory limits are still being set, they're just being set in MemoryGroup.Apply()
rather than MemoryGroup.Set()
.
And we can't "only do the pid change in this PR" because we need to Set()
the cgroup value as late as possible in order for us to use the PIDs cgroup for all legal values. Sure, we could only do late setting for only the PIDs cgroup, but that's just ugly and there are other problems that this solves for other cgroups (the late setting is a common requirement for cgroups).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping @crosbymichael @vishh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the info @cyphar. LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Essentially the workflow is as follows:
- Create cgroups
- Set the appropriate limits
- Apply cgroups to the init process.
As of now 1
and 3
are essentially together. We should clean that up soon.
LGTM |
continue | ||
} | ||
|
||
// Get the subsystem path, but don't fial out for not found cgroups. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: fial
Add support for the pids cgroup controller to libcontainer, a recent feature that is available in Linux 4.3+. Unfortunately, due to the init process being written in Go, it can spawn an an unknown number of threads due to blocked syscalls. This results in the init process being unable to run properly, and thus small pids.max configs won't work properly. Signed-off-by: Aleksa Sarai <asarai@suse.com>
Apply and Set are two separate operations, and it doesn't make sense to group the two together (especially considering that the bootstrap process is added to the cgroup as well). The only exception to this is the memory cgroup, which requires the configuration to be set before processes can join. Signed-off-by: Aleksa Sarai <asarai@suse.com>
It is vital to loudly fail when a user attempts to set a cgroup limit (rather than using the system default). Otherwise the user will assume they have security they do not actually have. This mirrors the original Apply() (that would set cgroup configs) semantics. Signed-off-by: Aleksa Sarai <asarai@suse.com>
Due to the fact that the init is implemented in Go (which seemingly randomly spawns new processes and loves eating memory), most cgroup configurations are required to have an arbitrary minimum dictated by the init. This confuses users and makes configuration more annoying than it should. An example of this is pids.max, where Go spawns multiple processes that then cause init to violate the pids cgroup constraint before the container can even start. Solve this problem by setting the cgroup configurations as late as possible, to avoid hitting as many of the resources hogged by the Go init as possible. This has to be done before seccomp rules are applied, as the parent and child must synchronise in order for the parent to correctly set the configurations (and writes might be blocked by seccomp). Signed-off-by: Aleksa Sarai <asarai@suse.com>
LGTM. Nice work @cyphar 👍 |
Please merge this before we vendor specs to include opencontainers/runtime-spec#233, as it would cause quite a few merge conflicts. I've fixed up your comment nit, and explained the test nit. PTAL. 🐳 /cc @mrunalp |
@cyphar you mentioned an issue with systemd. Is that resolved? |
@mrunalp I just ran the tests on my local machine (which has systemd). All of the tests that pass on master pass on my branch. I think we're ready to go. I was mistaken about the systemd issue. |
@cyphar Thanks! 👍 LGTM |
cgroups: add pids controller support
\o/ On Fri, Dec 18, 2015 at 7:55 PM, Mrunal Patel notifications@github.com
|
w00t w00t. 🐳 |
@cyphar Good work.. now carry it on to Docker :) |
@mrunalp I'm sorry. The tests don't actually check that the limits work, which mislead me to believe they worked as expected. I can't seem to test how well systemd runs, because it looks my build of However, from what I can see, the limits systemd doesn't support are broken (we never join them). I'm making a patch to try to fix this, but I can't test it (and the automated tests don't actually test that limits fail). |
@cyphar You can use this to test systemd cgroup support /~https://github.com/opencontainers/runc/compare/master...mrunalp:systemd-cgroups?expand=1 |
Looks like there will be difference in behavior between fs and systemd implementation around what cgroup is set when. I am okay with the differences as long as the final outcome is the same. But others may disagree. I am reverting this PR till will we get the systemd support to work correctly and then we can merge it again. Sorry @cyphar , my bad as well. @crosbymichael @LK4D4 @hqhq Let me know what you think. |
@mrunalp I'm very confused. I've tried to run using systemd on both openSUSE 13.2 and Arch Linux, and (even though both systems run systemd and the kernels have full cgroup support) |
@cyphar I will have limited time to debug that before Monday. Meanwhile, could you try your patch directly using docker?
|
@mrunalp @crosbymichael @hqhq @dqminh @lizf-os I just found out that, actually, I didn't break the memory cgroup (as it was broken before then). If you try to run with the memory cgroup, blkio or the cpu cgroup, it will crash with an error about "no such file or directory" (meaning that the path hasn't had MkdirAll run on it). A lot of these issues appear to stem from deeper issues of The kernel memory cgroup is the most brazen example, where the code doesn't crash, but the limit is not set (this is the case for I've fixed the problem as best I can in my new PR (#446). But I really have doubts about how stable our systemd support really is. |
…piness Add memory swappiness to linux spec
Add support for the pids cgroup controller, a recent feature that is
(see: will be) available in Linux 4.3.
Closes #382
Signed-off-by: Aleksa Sarai cyphar@cyphar.com