All computation actually happens on the head node #737

Closed · 13 tasks done · wrpscott opened this issue Jun 19, 2018 · 5 comments

wrpscott (Contributor) commented Jun 19, 2018

Presently, a docker run command, whether executed on a compute node or the head node, results in the container actually running on the head node (look for a docker-containerd-shim process there), because it is launched by the single dockerd running on the head node.
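A quick way to confirm this (a sketch; the alpine image is just an example, and node 0 matches the bpsh usage later in this thread):

  # Launch a container from compute node 0...
  bpsh 0 docker run --rm alpine sleep 60 &
  # ...then, on the head node, the shim process shows up locally:
  ps -ef | grep docker-containerd-shim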

  • experiment with local registry in VirtualBox
  • test local registry on Bulbasaur (registry worked, running didn't)
  • experiment with Singularity
  • create Singularity image files in CodeResources folder instead of creating a docker image
  • add removal plan for containers and their families
  • check that a container's permissions are less than its family's permissions
  • run singularity check foo.simg before accepting an image file, based on ImageField.
  • replace docker launch with Singularity launch (@wrpscott already started)
  • dockerlib.SingularityDockerHandler.docker_is_alive() assumes an existing docker and local repo installation.
  • support Docker and Singularity during transition
  • add management command to convert docker images to singularity containers
  • convert all the current docker containers to Singularity
  • set Singularity container when adding or revising a method
wrpscott added the bug label Jun 19, 2018
wrpscott (Contributor, Author) commented:

Running docker jobs on a compute node requires dockerd to run there as well. Running dockerd on a compute node is not straightforward (missing drivers, problems with iptables, etc.). I did finally succeed in starting dockerd on a compute node and defining a docker swarm with the head node and the single compute node. However, trying to run a job on the compute node currently crashes it (it needs a hardware reset).
Docker swarm essentially reproduces a lot of functionality we already have in slurm.
Since we experienced serious stability issues with dockerd on our production machine today (requiring a reboot), it might be worth finding a way to avoid dockerd altogether.
I came across the Singularity project, which can run docker images as a subprocess of the current user's shell (and as that user, not as root). It does not need the docker runtime, i.e. dockerd.

This system could give us the advantages of docker (traceability of the software in pipelines) while being more robust (no dockerd), more secure (no rights escalation, so no need for docker_wrap), and more efficient (sandbox directories are mounted directly, with no copying into docker volumes), and it works nicely with slurm (Singularity runs a container on the machine where it is invoked).
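For example, Singularity can run a docker image straight from a registry, with no dockerd involved, and the contained process runs as the invoking user (a quick sketch; the alpine image is just an example):

  # Pulls the docker layers and runs the command as the current user,
  # as an ordinary child process of this shell.
  singularity exec docker://alpine:3.8 id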

donkirkby (Member) commented:

The current plan is to run docker on each compute node and distribute the images through a local registry. Each compute node will have a local hard drive that stores docker images and data volumes. This tutorial shows how to set up a local registry.

docker service create --name registry --publish=5000:5000 \
 --constraint=node.role==manager \
 --mount=type=bind,src=/home/docker,dst=/certs \
 -e REGISTRY_HTTP_ADDR=0.0.0.0:5000 \
 -e REGISTRY_HTTP_TLS_CERTIFICATE=/certs/registry.crt \
 -e REGISTRY_HTTP_TLS_KEY=/certs/registry.key \
 registry:latest

The documentation is also helpful.

If we run a registry on the head node as kive-int.cfenet.ubc.ca:5000, and push each image to the local registry as part of the build process, then docker can fetch images from the local registry when we launch a job. I think it could be as simple as adding the registry address before the image names:

docker_wrap.py --sudo --inputs /path/to/sandbox/names.csv \
  --output /path/to/sandbox/ -- \
  kive-int.cfenet.ubc.ca:5000/my-image:v1.0 sandbox1 \
  my_command /mnt/input/names.csv /mnt/output/greetings.csv
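Pushing each image to that registry as part of the build could then be just a tag that includes the registry address (a sketch; my-image:v1.0 is the example name from above):

  docker build -t kive-int.cfenet.ubc.ca:5000/my-image:v1.0 .
  docker push kive-int.cfenet.ubc.ca:5000/my-image:v1.0
  # Any node configured to trust the registry can then pull it by the same name:
  docker pull kive-int.cfenet.ubc.ca:5000/my-image:v1.0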

donkirkby (Member) commented Jul 12, 2018

Steps to configure a local hard drive for the compute nodes:

  1. List the current partitions:

     bpsh 0 parted -l
     bpsh 0 blkid
     bpsh 0 lsblk
    
  2. Turn off swap so you can resize the swap partition: sudo bpsh 0 swapoff /dev/sda1

  3. Unmount the local partition if you want to resize it: sudo bpsh 0 umount /dev/sda2

  4. Remove the existing partitions. This step assumes you don't need to keep any data on the local hard drive; back it up if you do. Repeat this step for each partition number given by parted -l.

     sudo bpsh 0 /usr/sbin/parted /dev/sda rm 1
    
  5. Create a swap partition, considering the Red Hat guidance for swap size.

     sudo bpsh 0 /usr/sbin/parted /dev/sda mkpart primary 'linux-swap(v1)' 1 4GB
     sudo bpsh 0 /usr/sbin/mkswap /dev/sda1
     sudo bpsh 0 /usr/sbin/swapon /dev/sda1
    
  6. Create an ext4 partition on the rest of the disk.

     sudo bpsh 0 /usr/sbin/parted /dev/sda mkpart -- primary ext4 4GB -1
     sudo bpsh 0 /usr/sbin/mkfs -t ext4 /dev/sda2
     sudo bpsh 0 mkdir -p /media/local
     sudo bpsh 0 mount /dev/sda2 /media/local
    
  7. Configure the two new partitions in /etc/beowulf/fstab. The nonfatal option lets the entry be ignored on compute nodes that don't have that partition.

     /dev/sda1		swap            swap    nonfatal        0 0
     /dev/sda2		/media/local		ext4	nonfatal		0 0
    

I sometimes had to reboot before running mkswap or mkfs, because the new partition didn't show up in /dev.
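Running partprobe after changing the partition table might avoid that reboot by asking the kernel to re-read it (untested here; assumes partprobe is installed at the same path as the other parted tools):

  sudo bpsh 0 /usr/sbin/partprobe /dev/sda
  bpsh 0 lsblk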

donkirkby (Member) commented:

Even with the local hard drive configured, docker still crashes the compute node. We opened a ticket with Penguin, and they suggested Singularity instead. If we can use a single image file for several container instances, then it seems like a good alternative to docker.

donkirkby (Member) commented:

Singularity seems to work fine on the compute nodes, and I can run several processes at the same time using a single image file.
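For example, several runs can share one image file at the same time, whether started by hand or through slurm (a sketch; the image and data paths are made up):

  # Two concurrent processes from the same .simg, no daemon involved:
  singularity exec /media/local/my-image.simg my_command /data/run1/names.csv /data/run1/greetings.csv &
  singularity exec /media/local/my-image.simg my_command /data/run2/names.csv /data/run2/greetings.csv &
  wait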

Installing Singularity on the compute nodes needed an fstab entry, as well as a script in /etc/beowulf/init.d to create an empty folder at /var/singularity/mnt/final.
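The init script could be as small as this (a hypothetical sketch; it assumes the node number is passed to the script as its first argument, the same number used with bpsh elsewhere in this thread):

  #!/bin/sh
  # /etc/beowulf/init.d/singularity (sketch): create the mount point that
  # Singularity expects on a compute node as it comes up.
  NODE="$1"
  bpsh "$NODE" mkdir -p /var/singularity/mnt/final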

donkirkby added this to the 0.12 simpler pipeline setup milestone Jul 26, 2018
donkirkby added a commit that referenced this issue Aug 1, 2018
donkirkby added a commit that referenced this issue Aug 2, 2018
donkirkby added a commit to cfe-lab/kive-default-docker that referenced this issue Aug 7, 2018
donkirkby self-assigned this Aug 8, 2018
donkirkby added a commit to cfe-lab/MiCall that referenced this issue Aug 9, 2018
donkirkby added a commit that referenced this issue Aug 9, 2018
Support both Docker and Singularity while we test conversions.
Remove methods when removing a singularity container.
donkirkby added a commit that referenced this issue Aug 10, 2018
Also fix Singularity installation on Travis.
donkirkby added a commit that referenced this issue Sep 20, 2018
Works around a problem with launching Debian Singularity images on a compute node running a CentOS host system.
Also add -n to all sudo calls, to avoid getting blocked by password prompts.
donkirkby added a commit that referenced this issue Sep 25, 2018
For now, Debian Singularity images are not supported.
Convert dump_pipeline to use environment variables for configuration.