Skip to content

Commit

Permalink
specs: introduce the concept of a runtime.json
Browse files Browse the repository at this point in the history
Based on our discussion in-person yesterday it seems necessary to
separate the concept of runtime configuration from application
configuration. There are a few motivators:

- To support runtime updates of things like cgroups, rlimits, etc we
  should separate things that are inherently runtime specific from
  things that are static to the application running in the container.

- To support the goal of being able to move a bundle between hosts we
  should make it clear what parts of the spec are and are not portable
  between hosts so that upon landing on a new host the non-portable
  options may be rewritten or removed.

- In order to attach a cryptographic identity to a bundle we must not
  include details in the bundle that are host specific.
  • Loading branch information
Brandon Philips committed Aug 26, 2015
1 parent 9ad789f commit 7232e4b
Show file tree
Hide file tree
Showing 9 changed files with 356 additions and 326 deletions.
20 changes: 10 additions & 10 deletions bundle.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,19 +12,19 @@ A standard container bundle is made of the following 3 parts:

# Directory layout

A Standard Container bundle is a directory containing all the content needed to load and run a container. This includes its configuration file (`config.json`) and content directories. The main property of this directory layout is that it can be moved as a unit to another machine and run the same container.
A Standard Container bundle is a directory containing all the content needed to load and run a container.
This includes two configuration files `config.json` and `runtime.json`, and a rootfs directory.
The `config.json` file contains settings that are host independent and application specific such as security permissions, environment variables and arguments.
The `runtime.json` file contains settings that are host specific such as memory limits, local device access and mount points.
The goal is that the bundle can be moved as a unit to another machine and run the same application if `runtime.json` is removed or reconfigured.

The syntax and semantics for `config.json` are described in [this specification](config.md).

One or more *content directories* may be adjacent to the configuration file. This must include at least the root filesystem (referenced in the configuration file by the *root* field) and may include other related content (signatures, other configs, etc.). The interpretation of these resources is specified in the configuration. The names of the directories may be arbitrary, but users should consider using conventional names as in the example below.
A single `rootfs` directory MUST be in the same directory as the `config.json`.
The names of the directories may be arbitrary, but users should consider using conventional names as in the example below.

```
/
!
--- config.json
!
--- rootfs
!
--- signatures
config.json
runtime.json
rootfs/
```

212 changes: 9 additions & 203 deletions config-linux.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,142 +5,7 @@ cgroups, capabilities, LSM, and file system jails to fulfill the spec.
Additional information is needed for Linux over the [default spec configuration](config.md)
in order to configure these various kernel features.

## Linux namespaces

A namespace wraps a global system resource in an abstraction that makes it
appear to the processes within the namespace that they have their own isolated
instance of the global resource. Changes to the global resource are visible to
other processes that are members of the namespace, but are invisible to other
processes. For more information, see [the man page](http://man7.org/linux/man-pages/man7/namespaces.7.html)

Namespaces are specified in the spec as an array of entries. Each entry has a
type field with possible values described below and an optional path element.
If a path is specified, that particular file is used to join that type of namespace.

```json
"namespaces": [
{
"type": "pid",
"path": "/proc/1234/ns/pid"
},
{
"type": "net",
"path": "/var/run/netns/neta"
},
{
"type": "mnt",
},
{
"type": "ipc",
},
{
"type": "uts",
},
{
"type": "user",
},
]
```

#### Namespace types

* **pid** processes inside the container will only be able to see other processes inside the same container.
* **network** the container will have its own network stack.
* **mnt** the container will have an isolated mount table.
* **ipc** processes inside the container will only be able to communicate to other processes inside the same
container via system level IPC.
* **uts** the container will be able to have its own hostname and domain name.
* **user** the container will be able to remap user and group IDs from the host to local users and groups
within the container.

### Access to devices

Devices is an array specifying the list of devices to be created in the container.
Next parameters can be specified:

* type - type of device: 'c', 'b', 'u' or 'p'. More info in `man mknod`
* path - full path to device inside container
* major, minor - major, minor numbers for device. More info in `man mknod`.
There is special value: `-1`, which means `*` for `device`
cgroup setup.
* permissions - cgroup permissions for device. A composition of 'r'
(read), 'w' (write), and 'm' (mknod).
* fileMode - file mode for device file
* uid - uid of device owner
* gid - gid of device owner

```json
"devices": [
{
"path": "/dev/random",
"type": "c",
"major": 1,
"minor": 8,
"permissions": "rwm",
"fileMode": 0666,
"uid": 0,
"gid": 0
},
{
"path": "/dev/urandom",
"type": "c",
"major": 1,
"minor": 9,
"permissions": "rwm",
"fileMode": 0666,
"uid": 0,
"gid": 0
},
{
"path": "/dev/null",
"type": "c",
"major": 1,
"minor": 3,
"permissions": "rwm",
"fileMode": 0666,
"uid": 0,
"gid": 0
},
{
"path": "/dev/zero",
"type": "c",
"major": 1,
"minor": 5,
"permissions": "rwm",
"fileMode": 0666,
"uid": 0,
"gid": 0
},
{
"path": "/dev/tty",
"type": "c",
"major": 5,
"minor": 0,
"permissions": "rwm",
"fileMode": 0666,
"uid": 0,
"gid": 0
},
{
"path": "/dev/full",
"type": "c",
"major": 1,
"minor": 7,
"permissions": "rwm",
"fileMode": 0666,
"uid": 0,
"gid": 0
}
]
```

## Linux control groups

Also known as cgroups, they are used to restrict resource usage for a container and handle
device access. cgroups provide controls to restrict cpu, memory, IO, and network for
the container. For more information, see the [kernel cgroups documentation](https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt)

## Linux capabilities
## Capabilities

Capabilities is an array that specifies Linux capabilities that can be provided to the process
inside the container. Valid values are the string after `CAP_` for capabilities defined
Expand All @@ -154,33 +19,15 @@ in [the man page](http://man7.org/linux/man-pages/man7/capabilities.7.html)
]
```

## Linux sysctl

sysctl allows kernel parameters to be modified at runtime for the container.
For more information, see [the man page](http://man7.org/linux/man-pages/man8/sysctl.8.html)

```json
"sysctl": {
"net.ipv4.ip_forward": "1",
"net.core.somaxconn": "256"
}
```
## Rootfs Mount Propagation

## Linux rlimits
rootfsPropagation sets the rootfs's mount propagation. Its value is either slave, private, or shared. [The kernel doc](https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt) has more information about mount propagation.

```json
"rlimits": [
{
"type": "RLIMIT_NPROC",
"soft": 1024,
"hard": 102400
}
]
"rootfsPropagation": "slave",
```

rlimits allow setting resource limits. The type is from the values defined in [the man page](http://man7.org/linux/man-pages/man2/setrlimit.2.html). The kernel enforces the soft limit for a resource while the hard limit acts as a ceiling for that value that could be set by an unprivileged process.

## Linux user namespace mappings
## User namespace mappings

```json
"uidMappings": [
Expand All @@ -199,48 +46,7 @@ rlimits allow setting resource limits. The type is from the values defined in [t
]
```

uid/gid mappings describe the user namespace mappings from the host to the container. *hostID* is the starting uid/gid on the host to be mapped to *containerID* which is the starting uid/gid in the container and *size* refers to the number of ids to be mapped. The Linux kernel has a limit of 5 such mappings that can be specified.

## Rootfs Mount Propagation
rootfsPropagation sets the rootfs's mount propagation. Its value is either slave, private, or shared. [The kernel doc](https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt) has more information about mount propagation.

```json
"rootfsPropagation": "slave",
```

## Selinux process label

Selinux process label specifies the label with which the processes in a container are run.
For more information about SELinux, see [Selinux documentation](http://selinuxproject.org/page/Main_Page)
```json
"selinuxProcessLabel": "system_u:system_r:svirt_lxc_net_t:s0:c124,c675"
```

## Apparmor profile

Apparmor profile specifies the name of the apparmor profile that will be used for the container.
For more information about Apparmor, see [Apparmor documentation](https://wiki.ubuntu.com/AppArmor)

```json
"apparmorProfile": "acme_secure_profile"
```

## Seccomp

Seccomp provides application sandboxing mechanism in the Linux kernel.
Seccomp configuration allows one to configure actions to take for matched syscalls and furthermore also allows
matching on values passed as arguments to syscalls.
For more information about Seccomp, see [Seccomp kernel documentation](https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt)
The actions and operators are strings that match the definitions in seccomp.h from [libseccomp](/~https://github.com/seccomp/libseccomp) and are translated to corresponding values.

```json
"seccomp": {
"defaultAction": "SCMP_ACT_ALLOW",
"syscalls": [
{
"name": "getcwd",
"action": "SCMP_ACT_ERRNO"
}
]
}
```
uid/gid mappings describe the user namespace mappings from the host to the container.
The mappings represent how the bundle `rootfs` expects the user namespace to be setup and the runtime SHOULD NOT modify the permissions on the rootfs to realize the mapping.
*hostID* is the starting uid/gid on the host to be mapped to *containerID* which is the starting uid/gid in the container and *size* refers to the number of ids to be mapped.
There is a limit of 5 mappings which is the Linux kernel hard limit.
36 changes: 7 additions & 29 deletions spec.go → config.go
Original file line number Diff line number Diff line change
Expand Up @@ -14,30 +14,7 @@ type Spec struct {
// Hostname is the container's host name.
Hostname string `json:"hostname"`
// Mounts profile configuration for adding mounts to the container's filesystem.
Mounts []Mount `json:"mounts"`
// Hooks are the commands run at various lifecycle events of the container.
Hooks Hooks `json:"hooks"`
}

type Hooks struct {
// Prestart is a list of hooks to be run before the container process is executed.
// On Linux, they are run after the container namespaces are created.
Prestart []Hook `json:"prestart"`
// Poststop is a list of hooks to be run after the container process exits.
Poststop []Hook `json:"poststop"`
}

// Mount specifies a mount for a container.
type Mount struct {
// Type specifies the mount kind.
Type string `json:"type"`
// Source specifies the source path of the mount. In the case of bind mounts on
// linux based systems this would be the file on the host.
Source string `json:"source"`
// Destination is the path where the mount will be placed relative to the container's root.
Destination string `json:"destination"`
// Options are fstab style mount options.
Options string `json:"options"`
MountPoints []MountPoint `json:"mounts"`
}

// Process contains information to start a specific application inside the container.
Expand Down Expand Up @@ -72,9 +49,10 @@ type Platform struct {
Arch string `json:"arch"`
}

// Hook specifies a command that is run at a particular event in the lifecycle of a container.
type Hook struct {
Path string `json:"path"`
Args []string `json:"args"`
Env []string `json:"env"`
// MountPoint describes a directory that may be fullfilled by a mount in the runtime.json.
type MountPoint struct {
// Name is a unique descriptive identifier for this mount point.
Name string `json:"name"`
// Path specifies the path of the mount. The path and child directories MUST exist, a runtime MUST NOT create directories automatically to a mount point.
Path string `json:"path"`
}
57 changes: 1 addition & 56 deletions config.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Configuration file

The containers top-level directory MUST contain a configuration file called `config.json`.
The container's top-level directory MUST contain a configuration file called `config.json`.
For now the canonical schema is defined in [spec.go](spec.go) and [spec_linux.go](spec_linux.go), but this will be moved to a formal JSON schema over time.

The configuration file contains metadata necessary to implement standard operations against the container.
Expand Down Expand Up @@ -34,61 +34,6 @@ Each container has exactly one *root filesystem*, specified in the *root* object
}
```

## Mount Configuration

Additional filesystems can be declared as "mounts", specified in the *mounts* array. The parameters are similar to the ones in Linux mount system call. [http://linux.die.net/man/2/mount](http://linux.die.net/man/2/mount)

* **type** (string, required) Linux, *filesystemtype* argument supported by the kernel are listed in */proc/filesystems* (e.g., "minix", "ext2", "ext3", "jfs", "xfs", "reiserfs", "msdos", "proc", "nfs", "iso9660"). Windows: ntfs
* **source** (string, required) a device name, but can also be a directory name or a dummy. Windows, the volume name that is the target of the mount point. \\?\Volume\{GUID}\ (on Windows source is called target)
* **destination** (string, required) where the source filesystem is mounted relative to the container rootfs.
* **options** (string, optional) in the fstab format [https://wiki.archlinux.org/index.php/Fstab](https://wiki.archlinux.org/index.php/Fstab).

*Example (Linux)*

```json
"mounts": [
{
"type": "proc",
"source": "proc",
"destination": "/proc",
"options": ""
},
{
"type": "tmpfs",
"source": "tmpfs",
"destination": "/dev",
"options": "nosuid,strictatime,mode=755,size=65536k"
},
{
"type": "devpts",
"source": "devpts",
"destination": "/dev/pts",
"options": "nosuid,noexec,newinstance,ptmxmode=0666,mode=0620,gid=5"
},
{
"type": "bind",
"source": "/volumes/testing",
"destination": "/data",
"options": "rbind,rw"
}
]
```

*Example (Windows)*

```json
"mounts": [
{
"type": "ntfs",
"source": "\\\\?\\Volume\\{2eca078d-5cbc-43d3-aff8-7e8511f60d0e}\\",
"destination": "C:\\Users\\crosbymichael\\My Fancy Mount Point\\",
"options": ""
}
]
```

See links for details about [mountvol](http://ss64.com/nt/mountvol.html) and [SetVolumeMountPoint](https://msdn.microsoft.com/en-us/library/windows/desktop/aa365561(v=vs.85).aspx) in Windows.

## Process configuration

* **terminal** (bool, optional) specifies whether you want a terminal attached to that process. Defaults to false.
Expand Down
Loading

0 comments on commit 7232e4b

Please sign in to comment.