Table of Contents
- Setup
- Prerequisites
- OS Configuration and Basic networking setup
- Network File System (NFS) setup
- SLURM setup
- Conda (Mamba) setup
- Install OpenMPI for each node
- Testing the Marzipan Cluster
- Install Raspberry Pi OS imager on your machine (host) to flash the microSD cards
⚠️ Note: the instructions below were tested on macOS Big Sur; no effort was made to generalise them to Windows or Linux. The steps will be similar but some details will vary.
field | value
---|---
device | Raspberry Pi 4
OS | Raspberry Pi OS Lite (64-bit)
hostnames | node01, node02, node03, node04
username | admin, as this has root access
user password | different for each node
enable SSH | allow public-key authentication only
When only allowing public-key authentication, you will need to add your machine's public RSA key to the Pi to be able to log in via SSH later.
# copy the public RSA key of your machine to clipboard
cat ~/.ssh/id_rsa.pub | pbcopy
Paste the public RSA key into the Set authorized_keys for 'admin' field under the Services tab.
If the commands report no such file or directory, you don't have an RSA key set up and need to run ssh-keygen. This is well documented in the Raspberry Pi Foundation's docs.
📌
When enabling SSH with public-key authentication only, you save your host machine's public key (ecdsa.pub) and the user public key (rsa.pub) into the image. You can see the user public key in grey in the Raspberry Pi Imager's GUI. When you SSH into the node, you are already authorised to enter: the node accepts your machine's request to connect, and the node and your host machine exchange their public keys, if not done already, to encrypt each other's messages before sending them across the network.
Your host machine may not have its own RSA key; the Raspberry Pi site has docs detailing every step here.
# try login to each node via its hostname
ssh admin@node01.local
# this allows you to connect to your node without
# having to know its local IP address. It works by
# sending a broadcast to all machines on the network
# to resolve the IP address. It is made possible by
# Avahi and Bonjour (by Apple)
# get the IP address of each node
hostname -I
# or run the following to ssh into each node,
# echo their IP address and hostname, and save
# them to a local file.
ssh admin@node01.local 'echo "$(hostname -I) $(hostname)"' >> node-ip-addr.txt
ssh admin@node02.local 'echo "$(hostname -I) $(hostname)"' >> node-ip-addr.txt
ssh admin@node03.local 'echo "$(hostname -I) $(hostname)"' >> node-ip-addr.txt
ssh admin@node04.local 'echo "$(hostname -I) $(hostname)"' >> node-ip-addr.txt
The above assumes no two nodes share the same hostname. You can also use your router/modem's local page to find the IP addresses; just make sure you connect one node at a time so you can tell them apart.
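If you prefer, the same collection can be done in one small loop. This is just a sketch and assumes the four hostnames used above:
# loop over the nodes and record "<ip> <hostname>" for each
for n in node01 node02 node03 node04; do
  ssh admin@"$n".local 'echo "$(hostname -I) $(hostname)"'
done >> node-ip-addr.txt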
We will use apt, Raspberry Pi OS's package manager, to install new software.
It keeps an index of the latest available versions of all packages. Let's update
that index and upgrade any software that's already installed:
# update apt's source list
sudo apt update
# check all the software awaiting upgrade
apt list --upgradable
# return a count if that's easier
apt list --upgradable | wc -l
# upgrade installed software
sudo apt full-upgrade
# reboot your nodes
sudo reboot
Repeat this for all nodes.
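Rather than logging into each node by hand, you can run the same update from your host machine over SSH. A rough sketch, assuming the admin user has passwordless sudo (the default for the first user on Raspberry Pi OS):
# update, upgrade and reboot every node from the host machine
for n in node01 node02 node03 node04; do
  ssh admin@"$n".local 'sudo apt update && sudo apt full-upgrade -y && sudo reboot'
done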
munge's authentication is based on credentials that expire a short time after they are issued.
So every node must have its time in sync.
sudo apt install ntpdate -y
# reboot
sudo reboot
It is much easier to refer to a node via its hostname. For each node, append
the IP addresses and hostnames of the other nodes to /etc/hosts, using the
node-ip-addr.txt from earlier. When SLURM connects to a node by its hostname,
the hosts file will resolve the IP address.
Example:
# @node 01 /etc/hosts
127.0.1.1 node01
<ip addr> node02
# .. more sibling nodes
# @node 02 /etc/hosts
127.0.1.1 node02
<ip addr> node01
# .. more sibling nodes
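If you copied node-ip-addr.txt onto a node, its lines are already in the "IP address, hostname" shape that /etc/hosts expects, so a sketch of the append step might look like this (the grep drops the node's own entry; check the result, as hostname -I can list more than one address):
# append the sibling nodes' entries to /etc/hosts
grep -v "$(hostname)" node-ip-addr.txt | sudo tee -a /etc/hosts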
⚠️ It's there to keep the node from freezing
Swapping happens on the microSD card, and neither it nor USB sticks can handle this kind of I/O without degrading significantly [^rpi-se_swap-expansion]. It is also much slower than the internal RAM, so once swapping kicks in, the process and the node generally become significantly slower to respond. The point of the swap is to prevent the OS from killing slurmd, which would cause the SLURM controller to lose all control.
# on each node
sudo vi /etc/dphys-swapfile
# inside /etc/dphys-swapfile
# change to 2GB, actual units MB
CONF_SWAPSIZE=2048
# back in the terminal
# restart dphys-swapfile
sudo /etc/init.d/dphys-swapfile restart
# verify changes
top -n 1 | grep Swap
# should show something like this
MiB Swap: 2048.0 total, 2048.0 free, 0.0 used. 1700.6 avail Mem
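You can also apply the same swap change to every node from your host machine. A sketch, again assuming passwordless sudo for admin:
# set CONF_SWAPSIZE=2048 on each node and restart dphys-swapfile
for n in node01 node02 node03 node04; do
  ssh admin@"$n".local "sudo sed -i 's/^CONF_SWAPSIZE=.*/CONF_SWAPSIZE=2048/' /etc/dphys-swapfile && sudo /etc/init.d/dphys-swapfile restart"
done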
The main node must be able to ssh into its workers. We'll need this later to
perform our munge smoke test.
- on the main node, generate the ssh key
- for each worker node:
  - at the worker node, change password authentication to yes
  - from the main node, copy the public key to the target host
- afterwards, for each worker node, change password authentication back to no
ssh-keygen
As you have flashed the nodes to accept public-key authentication only,
you need to temporarily enable username-password authentication to authorise
the client's (node01) public key. Make sure you disable it afterwards, as
leaving it enabled makes the node vulnerable to unauthorised logins.
sudo vi /etc/ssh/sshd_config
# change to yes or no
PasswordAuthentication yes
# restart the ssh daemon
service sshd restart
# double-check PasswordAuthentication in the sshd config with
cat /etc/ssh/sshd_config | grep ^PasswordAuthentication
ssh-copy-id <user-in-worker-node>@<worker-node-hostname>
[Enter user password]
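To avoid repeating this for every node, a sketch of the loop run from the main node (it assumes the same admin user exists on every worker):
# copy node01's public key to each worker in turn
for n in node02 node03 node04; do
  ssh-copy-id admin@"$n"
done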
When code for a job is securely copied to the main node, you want the worker nodes to have access to it so they can perform their part. This also allows results from each worker to be placed in one place.
- on your main node:
- for each worker node:
📌
Everything is a file, including devices, which live inside the /dev folder. Roughly, sda refers to storage device 'a'; the number following it is called a partition [ref]. Very generally speaking, 'sd' is usually given to a removable storage device. You want to use a removable storage device for the network file system.
Plug the USB stick into node01, which will act as the network file storage.
# run the following command on your main node
lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 1 3.8G 0 disk << USB stick
└─sda1 8:1 1 3.8G 0 part /clusterfs << partition 1
...
sudo mkfs.ext4 /dev/sda1
sudo mkdir /clusterfs
Not exactly a great idea when there are multiple users, but fine for now as it's just me.
sudo chown nobody:nogroup -R /clusterfs
sudo chmod 777 -R /clusterfs
# get blkid and find 'sda'
blkid /dev/sda*
# get the UUID
/dev/sda1: UUID="5d72d438-90bc-4cd9-9136-d72863c20934"
On boot you want the flash drive to be automatically mounted as /clusterfs
.
sudo vi /etc/fstab
# append the following with your drive's UUID
UUID=<your-drive-uuid-no-quotes> /clusterfs ext4 defaults 0 2
# mount the drive now
sudo mount -a
You may need to run systemctl daemon-reload when prompted. Also, if the
node throws a can't find UUID=... error, get the flash drive's UUID
again as it may have changed.
sudo chown nobody:nogroup -R /clusterfs
sudo chmod 766 -R /clusterfs/
sudo apt install nfs-kernel-server -y
sudo vi /etc/exports
# append the following
/clusterfs <ip-subnet-mask>.0/24(rw,sync,no_root_squash,no_subtree_check)
# example
# if your nodes sit on 192.168.1.XXX then set 192.168.1.0/24
📌
The /24 means the first 24 bits define the subnet, and the host addresses start at 0 (192.168.1.0, 192.168.1.1, etc.). If your IP addresses range from 192.168.0.0 to 192.168.255.255, then mask as 192.168.0.0/16, as it takes 16 bits to define 192.168 in binary.
📌
name | meaning
---|---
rw | client has read/write access
sync | sync occurs on every transaction
no_root_squash | root users on client nodes can write files with root permissions
no_subtree_check | prevents errors caused when one node writes and another reads at the same time
# update the NFS kernel server
sudo exportfs -a
🍎
macOS's NFS client mounts from a non-privileged port (above 1024) as a non-root user, so the server needs the insecure export option to accept the connection. This allows the client to make changes to node01's clusterfs folder as any user, which is obviously insecure. However, the convenience for this exercise is that you can copy and paste files across without scp. Excluding insecure won't make the server invulnerable to unauthorised entry either.
/clusterfs <ip-address-schema>.0/24(rw,sync,no_root_squash,no_subtree_check,insecure)
# update the NFS kernel server
sudo exportfs -a
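With insecure exported, a sketch of mounting the share from the macOS host might look like this (the local mount point is an assumption):
# on the macOS host
sudo mkdir -p /private/clusterfs
sudo mount -t nfs node01.local:/clusterfs /private/clusterfs
# unmount when done
sudo umount /private/clusterfs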
Security is a big topic, and an NFS server should never be exposed outside of a
trusted network. You can read more about NFS security here.
sudo apt install nfs-common -y
sudo mkdir /clusterfs
sudo chown nobody:nogroup /clusterfs/
sudo chmod -R 777 /clusterfs/
# setup automatic mounting of the nfs share to local /clusterfs
sudo vi /etc/fstab
node01:/clusterfs /clusterfs nfs defaults 0 0
# mount the NFS share
sudo mount -a
# you may need to reload the fstab daemon like so
systemctl daemon-reload
# main node
# you may need to re-set the permission with chmod/chown
touch /clusterfs/node01-says-hello
# worker nodes
# check if files available
ls /clusterfs
touch /clusterfs/node02-says-hello
# remove those files, on worker nodes
rm /clusterfs/node*
We will set up our slurm controller on the main node (node01), then
set up the worker nodes.
# install slurm controller and munge
sudo apt install slurm-wlm -y
Use scp
to copy this repo's node-config
directory to /clusterfs
.
scp -r ./clusterfs/* admin@node01.local:/clusterfs
📌
It contains declarations for each node, the partition of node01 to node04, and the cluster name: marzipan 🎄. A cluster of nodes can be members of a group referred to as a Partition. I have defined two additional partitions apart from all:
- workers: excludes the main node, node01.
- main: excludes all worker nodes.
Each node also has a few Features; these depend on the physical setup. For example, I marked two of mine as high-capacity (128GB storage) and fast (all cores).
The main node is also constrained to have 2 cores available for SLURM jobs. The other 2 cores will handle the Network File Server and the SLURM controller. If those services freeze, the entire cluster freezes.
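The actual config lives in this repo, but as a rough sketch of the kind of lines it contains (the core counts and Feature tags here are assumptions based on the description above, not the repo's exact values):
# sketch of /clusterfs/node-config/etc/slurm/slurm.conf
ClusterName=marzipan
SlurmctldHost=node01
# node01 only offers 2 of its cores to SLURM jobs
NodeName=node01 CPUs=2 Features=main,high-capacity State=UNKNOWN
NodeName=node02 CPUs=4 Features=fast State=UNKNOWN
NodeName=node03 CPUs=4 Features=fast State=UNKNOWN
NodeName=node04 CPUs=4 Features=fast,high-capacity State=UNKNOWN
PartitionName=all Nodes=node[01-04] Default=YES MaxTime=INFINITE State=UP
PartitionName=main Nodes=node01 MaxTime=INFINITE State=UP
PartitionName=workers Nodes=node[02-04] MaxTime=INFINITE State=UP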
The configuration will be shared across all the nodes. You could copy the
config to the correct location on each node, but updating the config would
quickly become error-prone. So we will use GNU stow to create a symbolic link.
This way we have one copy, accessible in all the right places. Afterwards,
restart all your nodes, making sure the main node is rebooted last.
# install stow
sudo apt install stow
# go to our config
cd /clusterfs/node-config
# Remove empty environment file
sudo rm /etc/environment
# stow slurm config
sudo stow etc/ -t /etc
# check if successful
ls -lah /etc/slurm
# files that are symbolic links appear like below
slurm.conf -> ../../clusterfs/node-config/etc/slurm/slurm.conf
# reboot
sudo reboot
Copy the main node's munge key to /clusterfs
for sharing
# copy the munge key to be shared with worker nodes
mkdir -p /clusterfs/node-config/.secrets/munge
sudo cp /etc/munge/munge.key /clusterfs/node-config/.secrets/munge
The node-config/etc directory also contains the file environment, which holds a
single environment variable, CLUSTER_ENVS. We will use this later to create,
and refer to, shared micromamba / conda environments between jobs. Any
environment variable inside the environment file will be available to
interactive and non-interactive shells and to processes such as sbatch.
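As a sketch, the environment file holds a single key=value line; the path here is an assumption, not the repo's actual value:
# /clusterfs/node-config/etc/environment (sketch)
CLUSTER_ENVS=/clusterfs/envs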
🍋
# copy, extract and rename the example config to /etc/slurm
cd /etc/slurm
sudo cp /usr/share/doc/slurm-client/examples/slurm.conf.simple.gz .
sudo gzip -d slurm.conf.simple.gz
sudo mv slurm.conf.simple slurm.conf
# install SLURM Client
sudo apt install slurmd slurm-client -y
Then stow the slurm config as on the main node.
Copy node01
's munge.key
to the right place in each worker node. This will
enable all nodes to encrypt and decrypt each other's messages.
sudo cp /clusterfs/node-config/.secrets/munge/munge.key \
/etc/munge/munge.key
❓
The munge.key is used by the nodes to encrypt and decrypt each other's messages. For security reasons, munge will not follow a symbolic link. We could do better and ensure only munge has permission to read and write the key, but this has been omitted; see the sketch below.
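For reference, the omitted hardening step would look roughly like this on each node:
# restrict the key so only the munge user can read it
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key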
- Enable munge, slurm daemon and slurm controller daemon on the main node.
- Then, enable munge and slurm daemon on the worker nodes.
sudo systemctl enable munge
sudo systemctl start munge
sudo systemctl start slurmd
sudo systemctl enable slurmd
# main node only!
sudo systemctl enable slurmctld
sudo systemctl start slurmctld
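Before moving on, you can confirm the daemons came up; a quick sketch:
# on every node
systemctl status munge slurmd
# on the main node only
systemctl status slurmctld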
Please note you need to have set up passwordless SSH between the main node and the workers.
Here we check if something munged in node02
can be unmunged by node01
:
# from node01, ssh into nodes02-04, example below
ssh admin@node02 munge -n | unmunge | grep STATUS
# expected result like
Success (0)
# if there is a credential error
# reboot your main and worker nodes
# then try again
sudo reboot
Check your nodes on your main node with sinfo
# from you host machine
ssh admin@node01.local sinfo
# expected output
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
main up infinite 1 idle node01
workers up infinite 3 idle node[02-04]
all* up infinite 4 idle node[01-04]
Smoke test job
# let's get the hostname for each node in the partition
srun --nodes=4 hostname
# each node echoes its hostname
✅ More tests, a tiny tour of SLURM
SLURM is incredibly good at scheduling: when, on which nodes, for how long, and at what priority. Below are some tests that demonstrate the slurm.conf implemented above.
# try targeting specific partitions
srun -p main -N 1 hostname
srun -p workers -N 3 hostname
srun -p all -N 4 hostname
# notice SLURM schedules jobs on worker _before_ main
# only workers assigned
srun -p all -N 1 hostname
srun -p all -N 2 hostname
srun -p all -N 3 hostname
# this submits 4 jobs so the main gets to work
srun -p all -N 4 hostname
# you can target by constraint
# the constraints are 'Features' defined in slurm.conf
# set up to reflect my nodes
srun -C main -N 1 hostname
srun -C fast,high-capacity -N 2 hostname
srun -C fast -N 3 hostname
# main is constrained to only allow 2 jobs,
# that's 1 job per core; the other 2 of its 4 cores are reserved for SLURM and NFS
# notice SLURM cannot assign more than two jobs
srun -C main -n 2 hostname
srun -C main -n 3 hostname
Conda is a versatile package manager for installing complex polyglot
packages; it is widely used in the data, scientific and finance communities.
However it is a little heavy, so I have opted for micromamba, which is a single
statically linked binary. There is also miniconda, but so far I have had a good
time with micromamba.
See micromamba installation page.
You can examine the script here.
Install micromamba
:
srun -p all --nodes 4 bash -c \
'"${SHELL}" <(curl -L micro.mamba.pm/install.sh)'
# the script is interactive when a terminal is
# attached; otherwise it installs quietly,
# as intended here
# check micromamba version on each node
srun -p all --nodes 4 bash -c \
'echo "$(hostname) $(sudo ~/.local/bin/micromamba --version)"'
# each node should report the version of micromamba
# if there is an issue, restart the nodes
# let the workers take the lead
srun -p workers -N 3 sudo reboot
# restart main node
sudo reboot
# enter super user mode
sudo su -
# submit job to install OpenMPI packages on all nodes
srun --nodes=4 apt install \
openmpi-bin \
openmpi-common \
libopenmpi3 \
libopenmpi-dev \
-y
# restart your nodes
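A quick way to confirm the install on every node before running the jobs; a sketch:
# report the OpenMPI version from each node
srun --nodes=4 bash -c 'echo "$(hostname) $(mpirun --version | head -n 1)"'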
cd /clusterfs/jobs/mpi-hello-c
sbatch SUB
cd /clusterfs/jobs/micromamba-activate
sbatch SUB
# in each, expect a new result.out file;
# it should be clear it is free of errors
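The SUB files ship with the repo's jobs directories; as a rough sketch of what such a submission script typically looks like (the job name, output file and binary name here are assumptions, not the repo's actual contents):
#!/bin/bash
#SBATCH --job-name=mpi-hello
#SBATCH --nodes=4
#SBATCH --output=result.out
# run the compiled MPI binary across the allocated nodes
mpirun ./hello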