libcontainer: add support for Intel RDT/CAT in runc #1198


xiaochenshen
Contributor

This PR fixes issue #433

About Intel RDT/CAT feature:
Intel platforms with newer Xeon CPUs support Intel Resource Director Technology
(RDT). Cache Allocation Technology (CAT) is a sub-feature of RDT, which
currently supports L3 cache resource allocation.

This feature provides a way for the software to restrict cache allocation to a
defined 'subset' of L3 cache which may be overlapping with other 'subsets'.
The different subsets are identified by class of service (CLOS) and each CLOS
has a capacity bitmask (CBM).

More information about Intel RDT/CAT can be found in section 17.17 of the
Intel Software Developer Manual.

About Intel RDT/CAT kernel interface:
In the Linux kernel, the interface is defined and exposed via the "resource
control" filesystem, which is a "cgroup-like" interface.

Compared with cgroups, it has a similar process management lifecycle and
similar interfaces in a container. But unlike the cgroups hierarchy, it has a
single-level filesystem layout.

Intel RDT "resource control" filesystem hierarchy:

mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
|   |-- L3
|       |-- cbm_mask
|       |-- num_closids
|-- cpus
|-- schemata
|-- tasks
|-- <container_id>
    |-- cpus
    |-- schemata
    |-- tasks

For runc, we can make use of the `tasks` and `schemata` configuration for L3
cache resource constraints.

The `tasks` file lists the tasks that belong to this group (e.g., the
"<container_id>" group). Tasks can be added to a group by writing the task ID
to the `tasks` file (which automatically removes them from the group they
previously belonged to). New tasks created by fork(2) and clone(2) are added
to the same group as their parent. If a PID is not in any sub-group, it is in
the root group.
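
As a rough illustration of the `tasks` semantics described above, here is a minimal Go sketch that moves a process into a group by writing its PID to the group's `tasks` file. The helper names `tasksPath` and `joinGroup` and the group name `my_container` are hypothetical and are not part of this PR; the mount point is the one used in the description.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// resctrlRoot is the mount point used in the description above.
const resctrlRoot = "/sys/fs/resctrl"

// tasksPath (hypothetical helper) returns the path of a group's "tasks" file.
// An empty group name refers to the root group.
func tasksPath(root, group string) string {
	return filepath.Join(root, group, "tasks")
}

// joinGroup (hypothetical helper) writes pid to the group's "tasks" file.
// The file is opened write-only and never created, so a missing group
// directory is reported as an error.
func joinGroup(root, group string, pid int) error {
	f, err := os.OpenFile(tasksPath(root, group), os.O_WRONLY, 0)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.WriteString(strconv.Itoa(pid))
	return err
}

func main() {
	// "my_container" is a placeholder for <container_id>.
	if err := joinGroup(resctrlRoot, "my_container", os.Getpid()); err != nil {
		fmt.Fprintln(os.Stderr, "could not join group:", err)
	}
}
```

On a machine without resctrl mounted, or without the group directory, `joinGroup` simply returns the error from opening the file.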

The `schemata` file holds the allocation bitmasks/values for the L3 cache on
each socket. Each entry contains an L3 cache id and a capacity bitmask (CBM).

	Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."

For example, on a two-socket machine, the L3 schema line could be L3:0=ff;1=c0,
which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0.
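
To make the format concrete, the following Go sketch parses a schema line into a map from cache id to CBM. `parseL3Schema` is a hypothetical helper written for this illustration, not a function from this PR, and it only handles the single `L3:` line format shown above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseL3Schema (hypothetical helper) parses a line such as "L3:0=ff;1=c0"
// into a map from L3 cache id to capacity bitmask (CBM).
func parseL3Schema(line string) (map[int]uint64, error) {
	line = strings.TrimSpace(line)
	if !strings.HasPrefix(line, "L3:") {
		return nil, fmt.Errorf("schema %q does not start with \"L3:\"", line)
	}
	masks := make(map[int]uint64)
	for _, entry := range strings.Split(strings.TrimPrefix(line, "L3:"), ";") {
		parts := strings.SplitN(entry, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("malformed entry %q", entry)
		}
		id, err := strconv.Atoi(strings.TrimSpace(parts[0]))
		if err != nil {
			return nil, fmt.Errorf("bad cache id in %q: %v", entry, err)
		}
		// CBMs are written in hexadecimal without a "0x" prefix.
		cbm, err := strconv.ParseUint(strings.TrimSpace(parts[1]), 16, 64)
		if err != nil {
			return nil, fmt.Errorf("bad CBM in %q: %v", entry, err)
		}
		masks[id] = cbm
	}
	return masks, nil
}

func main() {
	masks, err := parseL3Schema("L3:0=ff;1=c0")
	if err != nil {
		panic(err)
	}
	fmt.Printf("cache 0: %#x, cache 1: %#x\n", masks[0], masks[1])
	// prints "cache 0: 0xff, cache 1: 0xc0"
}
```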

A valid L3 cache CBM is a contiguous set of bits, and the number of bits set
cannot exceed the maximum CBM length. The maximum CBM length varies among
supported Intel Xeon platforms. In the Intel RDT "resource control" filesystem
layout, the CBM of a group should be a subset of the CBM in root. The kernel
checks validity on write. For example, 0xfffff in root indicates that the
maximum CBM length is 20 bits, which maps to the entire L3 cache capacity. Some
valid CBM values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00, etc.
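
The two rules above (contiguous bits, subset of the root CBM) can be sketched in Go as follows. `isContiguous` and `validCBM` are hypothetical helpers for illustration only. The kernel performs its own checks on write, which this sketch does not claim to reproduce in full.

```go
package main

import (
	"fmt"
	"math/bits"
)

// isContiguous reports whether the set bits of cbm form one unbroken run.
// An empty mask is treated as invalid.
func isContiguous(cbm uint64) bool {
	if cbm == 0 {
		return false
	}
	run := cbm >> uint(bits.TrailingZeros64(cbm)) // drop the trailing zeros
	return run&(run+1) == 0                       // what remains must be 0b0...01...1
}

// validCBM checks a group CBM against the root CBM: it must be contiguous
// and must not set any bit that is clear in root.
func validCBM(cbm, root uint64) bool {
	return isContiguous(cbm) && cbm&^root == 0
}

func main() {
	const root = 0xfffff // 20-bit root CBM from the example above
	for _, cbm := range []uint64{0xf, 0xf0, 0x3ff, 0x1f00, 0x5, 0x1fffff} {
		fmt.Printf("%#x valid=%v\n", cbm, validCBM(cbm, root))
	}
}
```

The first four values are the valid examples from the text. 0x5 is rejected because its bits are not contiguous, and 0x1fffff because it sets a bit outside the 20-bit root mask.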

More information about the Intel RDT/CAT kernel interface:
https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=x86/cache&id=f20e57892806ad244eaec7a7ae365e78fee53377

An example for runc:

There are two L3 caches in the two-socket machine. The default CBM is 0xfffff
and the max CBM length is 20 bits. This configuration assigns 4/5 of L3 cache
id 0 and the whole of L3 cache id 1 to the container:

"linux": {
	"resources": {
		"intelRdt": {
			"l3CacheSchema": "L3:0=ffff0;1=fffff"
		}
	}
}
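
As a worked check of the "4/5" figure, the Go sketch below decodes the fragment above (wrapped in an enclosing object so it is valid JSON) and computes the share of the cache each CBM covers: 0xffff0 has 16 of 20 bits set. The `spec` type, `l3SchemaFromConfig`, and `cacheShare` are hypothetical names that mirror only this fragment. They are not the runtime-spec definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math/bits"
)

// spec is a hypothetical type covering only the JSON fragment shown above.
type spec struct {
	Linux struct {
		Resources struct {
			IntelRdt struct {
				L3CacheSchema string `json:"l3CacheSchema"`
			} `json:"intelRdt"`
		} `json:"resources"`
	} `json:"linux"`
}

// l3SchemaFromConfig extracts the l3CacheSchema string from a config document.
func l3SchemaFromConfig(doc []byte) (string, error) {
	var s spec
	if err := json.Unmarshal(doc, &s); err != nil {
		return "", err
	}
	return s.Linux.Resources.IntelRdt.L3CacheSchema, nil
}

// cacheShare returns the fraction of the cache covered by cbm, given the
// maximum CBM length in bits.
func cacheShare(cbm uint64, maxBits int) float64 {
	return float64(bits.OnesCount64(cbm)) / float64(maxBits)
}

func main() {
	doc := []byte(`{"linux":{"resources":{"intelRdt":{"l3CacheSchema":"L3:0=ffff0;1=fffff"}}}}`)
	schema, err := l3SchemaFromConfig(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(schema)
	// 16 of 20 bits for cache id 0, 20 of 20 bits for cache id 1.
	fmt.Println(cacheShare(0xffff0, 20), cacheShare(0xfffff, 20))
}
```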

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>

@xiaochenshen
Contributor Author

This PR obsoletes #447 because the Intel RDT/CAT kernel interface has changed.
The design is described in #433.

@xiaochenshen xiaochenshen force-pushed the rdt-cat-resctrl-cgroup-v1 branch 3 times, most recently from 540606e to 12877b2 Compare November 22, 2016 05:15
NOTE: this patch exists only so that runc compiles. It is not necessary
once the dependent runtime-spec patch is merged.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
@crosbymichael
Member

If this is not a cgroup why are you modeling it after that? Wouldn't it be much simpler to not try to hide this behind the cgroups interface and just have a simple function call to enable this and get the stats?

@xiaochenshen
Contributor Author

@crosbymichael
There are several reasons. I hope these address your concerns. Thank you!

  1. By nature, the Intel RDT/CAT feature is really a "hardware control group" for the L3 cache on Intel platforms. Do you remember our face-to-face talk at DockerCon Seattle? ☺️ The original Linux kernel patch was an intel_rdt cgroup interface (you can find the history in #433, "Proposal: Intel RDT/CAT support for OCI/runc and Docker"), but the cgroup patch was rejected by the kernel cgroup maintainer for several reasons, such as incompatibility and corner cases. The new kernel interface, the "resource control" filesystem, is a "cgroup-like" interface: it bypasses the kernel cgroup path but implements very similar functionality in the kernel (e.g., process management lifecycle, mkdir/rmdir group filesystem operations). What's more, those kernel issues do not exist in the libcontainer cgroup framework.

    From my understanding, resource constraints in libcontainer are managed by a single cgroupManager. We can use clean and simple interfaces like Apply(), Set(), GetStats() for process operations. In my opinion, the resource constraints framework is still applicable to Intel RDT/CAT.

    You can find more information about Intel RDT/CAT (e.g., kernel patch, user interface, history, old/new designs) in #433.

  2. Design and implementation considerations:
    Actually, I implemented two solutions for #433 before submitting this PR:

    (1) Adding a package intelrdt and a new resource manager (intelRdtManager) as new infrastructure in libcontainer. It implements a Manager interface similar to the cgroup manager framework.

    This is the previous version of the design, for which I did not submit a PR. Here are the working patches in another branch for your reference: https://github.com/xiaochenshen/runc/commits/rdt-cat-resctrl-v1

    In #433 (comment), @cyphar did not like this design ("Please don't make it so that we have to track an intel-specific feature within the generic code for libcontainer."). I agree with @cyphar that creating a new resource manager for Intel RDT/CAT scatters Intel-specific code through libcontainer. It is not graceful even though I work for Intel 😇

    From my understanding, @cyphar does not object to consolidating the Intel RDT/CAT feature with cgroupManager ("If we can't consolidate it with the cgroup code..."). @cyphar also suggested having a generic Manager that consolidates cgroup and other resource management, including Intel RDT/CAT. I tried, but found the effort huge and risky: the current code is hard to refactor because the cgroup code is very tightly coupled.

    (2) (i.e., this PR) Making use of cgroupManager to consolidate the Intel RDT "resource control" filesystem.

    In this design, we implement Intel RDT/CAT as a cgroup subsystem, IntelRdtGroup. In both the functionality and the unit test changes, this has minimal impact on libcontainer generic code and a small impact on other libcontainer/cgroup code. In addition, we can dynamically add IntelRdtGroup to the cgroup subsystemSet only when Intel RDT/CAT is enabled by the hardware and the kernel.

    At present, I personally prefer the second design even though it is not a real "cgroup" in the kernel. With it, we can keep the libcontainer resource constraints code clean and simple.

@cyphar
Member

cyphar commented Nov 29, 2016

@xiaochenshen

> Design and implementation consideration:

Understand that when I made those comments they were in response to a "design proposal". Personally, it's quite hard for me to reason about a patch if you're going to talk about how you're going to write it. I'm reading through this PR at the moment and I'm thinking that maybe this might be uglier than if it was handled outside of the cgroup code (all of the special casing around == "intel_rdt" is an example of what I was talking about). It probably will be nicer to have another manager for the resource control filesystem.

@xiaochenshen
Contributor Author

@cyphar

> I'm reading through this PR at the moment and I'm thinking that maybe this might be uglier than if it was handled outside of the cgroup code (all of the special casing around == "intel_rdt" is an example of what I was talking about). It probably will be nicer to have another manager for the resource control filesystem.

I have opened a new PR #1279 to address this.

As you suggested in #433 (comment):

It adds a new "ResourceManager" structure as the base interface for all resource managers, including the cgroups manager and the incoming IntelRdt manager.

All registered resource managers are consolidated in the linuxContainer structure. We can apply unified operations (e.g., Apply(), Set(), Destroy()) using all of the registered resource managers.
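
A minimal Go sketch of the idea described above, under stated assumptions: only the names ResourceManager, Apply(), Set(), and Destroy() come from this comment. The method signatures, the `Config` type, the `recorder` toy manager, and the `container` type (standing in for linuxContainer) are hypothetical and are not taken from #1279.

```go
package main

import "fmt"

// Config is a hypothetical stand-in for the container configuration.
type Config struct {
	L3CacheSchema string
}

// ResourceManager is a hypothetical base interface. The method names come
// from the comment above; the signatures are assumptions.
type ResourceManager interface {
	Apply(pid int) error
	Set(cfg Config) error
	Destroy() error
}

// recorder is a toy manager that only records the calls it receives.
type recorder struct {
	name  string
	calls []string
}

func (r *recorder) Apply(pid int) error {
	r.calls = append(r.calls, fmt.Sprintf("apply %d", pid))
	return nil
}

func (r *recorder) Set(cfg Config) error {
	r.calls = append(r.calls, "set "+cfg.L3CacheSchema)
	return nil
}

func (r *recorder) Destroy() error {
	r.calls = append(r.calls, "destroy")
	return nil
}

// container stands in for linuxContainer and holds the registered managers.
type container struct {
	managers []ResourceManager
}

// forEach runs op on every registered manager and stops at the first error.
func (c *container) forEach(op func(ResourceManager) error) error {
	for _, m := range c.managers {
		if err := op(m); err != nil {
			return err
		}
	}
	return nil
}

func (c *container) Apply(pid int) error {
	return c.forEach(func(m ResourceManager) error { return m.Apply(pid) })
}

func (c *container) Set(cfg Config) error {
	return c.forEach(func(m ResourceManager) error { return m.Set(cfg) })
}

func (c *container) Destroy() error {
	return c.forEach(func(m ResourceManager) error { return m.Destroy() })
}

func main() {
	cg, rdt := &recorder{name: "cgroups"}, &recorder{name: "intelrdt"}
	c := &container{managers: []ResourceManager{cg, rdt}}
	_ = c.Apply(1234)
	_ = c.Set(Config{L3CacheSchema: "L3:0=ffff0;1=fffff"})
	_ = c.Destroy()
	fmt.Println(cg.name, cg.calls)
	fmt.Println(rdt.name, rdt.calls)
}
```

With this shape, the container code never names a specific manager, so an Intel-specific manager can be registered only when the feature is available.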

@justincormack
Contributor

@xiaochenshen do you want to close this one now the other is open?

@xiaochenshen
Contributor Author

@justincormack #1279 obsoletes this PR. We can close this one. Thank you.

@caniszczyk
Contributor

closing in favor of #1279

@caniszczyk caniszczyk closed this Jan 23, 2017