Mamba Server Setup
This guide gives a quick overview of how to install everything required for the Mamba Server.
- Install Fedora Server
- Install NVidia drivers

```bash
# https://www.reddit.com/r/Fedora/comments/12ju2sg/i_need_help_with_installing_nvidia_drivers_to/
sudo dnf install https://mirrors.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm https://mirrors.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm
sudo dnf update -y
sudo dnf install akmod-nvidia -y
sudo dnf install xorg-x11-drv-nvidia-cuda -y
sudo reboot now
```
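After the reboot, a quick way to confirm the akmod driver was built and loaded is to check the kernel module and query the GPUs (the exact driver and CUDA versions reported will depend on what dnf installed):

```bash
# Confirm the NVidia kernel module is loaded and the GPUs are visible
lsmod | grep nvidia
nvidia-smi
```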
- Alternative: install NVidia drivers via dnf module

Link to the official install procedure supported by NVidia: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#fedora

```bash
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/$distro/x86_64/cuda-$distro.repo
```
Replace `$distro` with the latest available version matching the server, currently `fedora39`.

Remark: NVidia is often late in bumping the version number of their repository. For example, the current version of Fedora is 40 while the latest repo version is fedora39. Even though the repository targets an older release, there is no issue installing from it until a newer one is made available.

```bash
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/fedora39/x86_64/cuda-fedora39.repo
sudo dnf makecache
```
`cuda-fedora39-x86_64` should appear in the list of enabled repositories, along with an entry in `/etc/yum.repos.d/cuda-fedora39.repo`.

Next, install the NVidia driver and CUDA toolkit via the now available dnf module. Choose the appropriate version and pick between the closed-source or open-source variant of the module. This installs a DKMS driver that is automatically rebuilt on every kernel update.

Choose between the latest version of the driver, `latest-dkms`, which will be updated by `dnf update`, or pin a specific version, for example `555-dkms`. Do the same for the CUDA toolkit via the meta-package `cuda-toolkit`, or target a specific version of CUDA.

If a driver previously installed via rpmfusion is present, remove everything first.

```bash
sudo dnf autoremove akmod-nvidia xorg-x11-drv-nvidia-*
```
Then install the new driver via its module.

```bash
sudo dnf module list
sudo dnf module install nvidia-driver:latest-dkms
```

Check that the DKMS module is built successfully for all installed kernels.

```bash
$ sudo dkms status
nvidia/555.42.02, 6.8.10-300.fc40.x86_64, x86_64: installed
```
Finally, proceed with the installation of the CUDA toolkit.

```bash
sudo dnf install cuda-toolkit
```

Select the default CUDA version.

```bash
sudo update-alternatives --display cuda
sudo update-alternatives --config cuda
```
Set the PATH for `nvcc` and the other CUDA utilities, either per user in `.bashrc` or `.bash_profile`, or for all users in `/etc/environment` or `/etc/profile.d/cuda.sh`.

```bash
export CUDACXX=/usr/local/cuda/bin/nvcc
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
```
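With the PATH in place, a quick check that the toolkit is picked up (the reported release depends on the alternative selected above):

```bash
# Should resolve to /usr/local/cuda/bin/nvcc and print the selected CUDA release
which nvcc
nvcc --version
```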
Make the driver persistent.

```bash
sudo systemctl enable nvidia-persistenced.service
sudo systemctl start nvidia-persistenced.service
```

Before rebooting, update the initial ramdisk.

```bash
sudo dracut -f
sudo reboot now
```
- Install Munge

```bash
export MUNGEUSER=1111
sudo groupadd -g $MUNGEUSER munge
sudo useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
export SLURMUSER=1121
sudo groupadd -g $SLURMUSER slurm
sudo useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
sudo dnf install munge munge-devel munge-libs -y
sudo dnf install rng-tools -y
rngd -r /dev/urandom
sudo mungekey
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
sudo chown -R munge: /etc/munge/ /var/log/munge/
sudo systemctl enable munge.service
sudo systemctl start munge.service
```
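Once munge is running, a standard self-test is to encode a credential and decode it on the same host; it should report a Success status:

```bash
# Encode a credential and immediately decode it locally
munge -n | unmunge
```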
- Install Slurm

```bash
sudo dnf install openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel man2html libibmad libibumad -y
sudo dnf install libcgroup libcgroup-tools libcgroup-devel mariadb mariadb-devel mariadb-server -y
sudo dnf install autoconf automake perl -y
sudo dnf install dbus-devel -y

# Build Slurm RPM
sudo su
cd
wget https://download.schedmd.com/slurm/slurm-24.05.0-0rc1.tar.bz2
rpmbuild -ta slurm-24.05.0-0rc1.tar.bz2
cd rpmbuild/RPMS/x86_64/
dnf --nogpgcheck localinstall *.rpm -y
# To reinstall if you recompile
dnf --nogpgcheck reinstall *.rpm -y
```
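Once the RPMs are installed, you can confirm the binaries are on the PATH and report the version that was just built:

```bash
# Both should print the Slurm version built above, e.g. slurm 24.05.0-0rc1
slurmd --version
sinfo --version
```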
- Configure Slurm: copy the configs from norlab-ulaval/dotfiles-mamba-server to `/etc/slurm`
- Ensure the permissions are correct

```bash
mkdir /var/spool/slurmctld
chown slurm: /var/spool/slurmctld
chmod 755 /var/spool/slurmctld
touch /var/log/slurmctld.log
chown slurm: /var/log/slurmctld.log
touch /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log
chown slurm: /var/log/slurm_jobacct.log /var/log/slurm_jobcomp.log
```
- Check that Slurm is correctly configured:

```bash
slurmd -C
```
- Start the services

```bash
systemctl enable slurmd.service
systemctl start slurmd.service
systemctl status slurmd.service
systemctl enable slurmctld.service
systemctl start slurmctld.service
systemctl status slurmctld.service
```
- Setup accounting

```bash
systemctl enable mariadb.service
systemctl start mariadb.service
# Inspired by: https://github.com/Artlands/Install-Slurm/blob/master/README.md#setting-up-mariadb-database-master
mysql
# Change password in the following line
> create user 'slurm'@'localhost' identified by '${DB_USER_PASSWORD}';
> grant all on slurm_acct_db.* TO 'slurm'@'localhost';
> create database slurm_acct_db;
> GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost' IDENTIFIED BY '1234' with grant option;
> SHOW VARIABLES LIKE 'have_innodb';
> FLUSH PRIVILEGES;
> CREATE DATABASE slurm_acct_db;
> quit;
# Verify you can login
mysql -p -u slurm
```
- Copy `/etc/my.cnf.d/innodb.cnf` from norlab-ulaval/dotfiles-mamba-server

- Restart mariadb

```bash
systemctl stop mariadb
mv /var/lib/mysql/ib_logfile? /tmp/
mv /var/lib/mysql/* /tmp/
systemctl start mariadb
```
- Check the ownership of some files

```bash
# Run in the directory containing the Slurm configs (i.e. /etc/slurm)
chown slurm slurmdbd.conf
touch /var/log/slurmctld.log
chown slurm /var/log/slurmctld.log
chown slurm slurm*
```
- Check that slurmdbd can start correctly using

```bash
slurmdbd -D -vvv
```
- Start the services

```bash
systemctl enable slurmdbd
systemctl start slurmdbd
systemctl status slurmdbd
systemctl enable slurmctld.service
systemctl start slurmctld.service
systemctl status slurmctld.service
```
- Add accounts

```bash
sudo sacctmgr add account norlab Description="Norlab mamba-server" Organization=norlab
sacctmgr add user wigum Account=norlab
```
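To confirm the account and user were registered in the accounting database, you can list them (the cluster name shown will be the one configured in `slurm.conf`):

```bash
# List accounts and user/account associations known to slurmdbd
sacctmgr show account
sacctmgr show associations format=Cluster,Account,User
```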
- If the Slurm services crash at startup, add the following lines to each Slurm service (`slurmctld`, `slurmdbd` and `slurmd`) using `systemctl edit slurmXXX.service`

```
[Service]
Restart=always
RestartSec=5s
```
- Setup the LVM: resize the LVM volume.

```bash
lvextend -l +100%FREE /dev/fedora/root
xfs_growfs /dev/fedora/root
# Verify the fs took all the place
lsblk -f
```
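Before and after extending, you can check how much free space the volume group has and whether the filesystem actually grew (assuming the volume group is named `fedora`, as suggested by the paths above):

```bash
# Free extents remaining in the volume group
sudo vgs fedora
# Logical volume sizes after the extension
sudo lvs fedora
# Filesystem size as seen by the OS
df -h /
```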
- Install `nvidia-container-toolkit` and set it up for container use with CDI support

```bash
sudo dnf install dkms
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf config-manager --enable nvidia-container-toolkit-experimental
sudo dnf install nvidia-container-toolkit -y
sudo sed -i 's/^#no-cgroups = false/no-cgroups = true/;' /etc/nvidia-container-runtime/config.toml
# Create a systemd service to run the following line at startup
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
sudo reboot now
```
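The comment above mentions creating a systemd service to regenerate the CDI spec at boot; a minimal sketch of such a unit is given below. The unit name `nvidia-cdi-generate.service` is an illustrative choice, not something shipped with the toolkit.

```
# /etc/systemd/system/nvidia-cdi-generate.service (hypothetical unit name)
[Unit]
Description=Regenerate the NVidia CDI specification at boot
After=systemd-modules-load.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

[Install]
WantedBy=multi-user.target
```

After creating the file, enable it with `sudo systemctl daemon-reload && sudo systemctl enable nvidia-cdi-generate.service`.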
- Verify that the network interface is correctly configured. BE SURE TO USE THE `enp68s0` interface.

```bash
sudo dnf install speedtest-cli -y
speedtest-cli --secure
# Should print approximately 1000Mb/s
```

If not, change the ethernet port and make sure the config `/etc/NetworkManager/system-connections/enp68s0.nmconnection` looks like this:

```
# /etc/NetworkManager/system-connections/enp68s0.nmconnection
[connection]
id=enp68s0
uuid=493e6911-a544-4f84-a708-dd54a2fe1aef
type=ethernet
autoconnect=true
interface-name=enp68s0

[ethernet]
auto-negotiate=true
duplex=full
speed=2500

[ipv4]
method=auto

[ipv6]
addr-gen-mode=eui64
method=auto

[proxy]
```
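To check the negotiated link speed without running a full speed test, something like the following can help (assuming the `ethtool` package is installed):

```bash
# Show devices and which connection profile each one uses
nmcli device status
# Negotiated link speed and duplex for the interface
sudo ethtool enp68s0 | grep -E 'Speed|Duplex'
```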
- Install a docker version that supports the buildx plugin: Install docker on fedora

```bash
sudo useradd -c 'Full name' -m <username> -G docker
sudo passwd <username>
```

We also recommend setting the following env variable in their `.bashrc`:

```bash
export SQUEUE_FORMAT="%.18i %.9P %.25j %.8u %.2t %.10M %.6D %.20e %b %.8c"
```
- Cronjob to clean podman cache

Add the following cronjob to `sudo crontab -u root -e`:

```bash
cat /etc/passwd | grep /bin/bash | awk -F: '{ print $1}' | while read user; do echo "Processing user $user..." && sudo -u $user -H bash -c "cd && podman system prune -af"; done
```
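Note that a crontab entry needs a schedule in front of the command; for example, to run the cleanup every Sunday at 03:00 (the schedule here is an arbitrary choice):

```bash
# m h dom mon dow  command
0 3 * * 0  cat /etc/passwd | grep /bin/bash | awk -F: '{ print $1}' | while read user; do echo "Processing user $user..." && sudo -u $user -H bash -c "cd && podman system prune -af"; done
```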
As users do not have root access on the Mamba Server, every project should be run in a container. We recommend using podman.
First, on your host machine, write a `Dockerfile` to run your project inside a container.
Then, build and test that everything works on your machine before testing it on the server.
We recommend putting your data in a directory and symlinking it to the `data` folder of your project, as shown below.
We describe here how to add volumes to avoid copying the data into the container.
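For example, if the COCO dataset lives in a shared directory on the host, a symlink keeps the layout expected by the training command below (the source path is illustrative):

```bash
# Illustrative dataset location; adjust to where the data actually lives
ln -s /path/to/datasets/coco ~/myproject/data/coco
```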
```bash
# Build the image
buildah build --layers -t myproject .

# Run docker image
export CONFIG=path/to/config  # for example `config/segsdet.yaml`
export CUDA_VISIBLE_DEVICES=0  # or `0,1` for specific GPUs, will be automatically set by SLURM
podman run --gpus all --rm -it --ipc host \
    -v .:/app/ \
    -v /app/data \
    -v ./data/coco/:/app/data/coco \
    -v /dev/shm:/dev/shm \
    myproject bash -c "python3 tools/train.py $CONFIG --gpu $CUDA_VISIBLE_DEVICES"
```
After you have verified everything works on your machine, copy the code to the server and write a Slurm job script.
```bash
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=16
#SBATCH --time=10-00:00
#SBATCH --job-name=$NAME
#SBATCH --output=%x-%j.out

cd ~/myproject || exit

buildah build --layers -t myproject .

export CONFIG=path/to/config  # for example `config/segsdet.yaml`

# Notice there is no -it option
podman run --gpus all --rm --ipc host \
    -v .:/app/ \
    -v /app/data \
    -v ./data/coco/:/app/data/coco \
    -v /dev/shm:/dev/shm \
    myproject bash -c "python3 tools/train.py $CONFIG --gpu $CUDA_VISIBLE_DEVICES"
```
Then, you can queue the job using `sbatch job.sh` and see the queued jobs using `squeue`.
For an easier experience, you can use willGuimont/sjm.
After you've verified this works, use the following code to kill the container when the Slurm job stops.
```bash
# Notice the -d option to detach the process, and no -it option
container_id=$(
    podman run --gpus all --rm -d --ipc host \
        -v .:/app/ \
        -v /app/data \
        -v ./data/coco/:/app/data/coco \
        -v /dev/shm:/dev/shm \
        myproject bash -c "python3 tools/train.py $CONFIG --gpu $CUDA_VISIBLE_DEVICES"
)

stop_container() {
    podman logs $container_id
    podman container stop $container_id
}
trap stop_container EXIT

echo "Container ID: $container_id"
podman wait $container_id
```
You can then run the job using:

```bash
sbatch job.sh
```

And see the running jobs using:

```bash
squeue
```
- SSH and X11 forwarding with GLX support

First make sure X11 forwarding is enabled server-side. Check these two lines in `/etc/ssh/sshd_config`:

```
X11Forwarding yes
X11DisplayOffset 10
```
If they were not, enable them and restart sshd.

```bash
sudo systemctl restart sshd
```
Make sure `xauth` is available on the server or install it.

```bash
sudo dnf install xorg-x11-xauth
```

Next, install basic utilities to test GLX and Vulkan capabilities on the server. We'll need them to benchmark the remote connection's performance.

```bash
sudo dnf install glx-utils vulkan-tools
```
If you encounter problems, make sure on the client side that the server is allowed to display. Make sure the IP of the server is valid; `+` adds to the trusted list, `-` removes from it.

```bash
xhost + 132.203.26.231
```
Connect to the server from your client using ssh. Use the options `-X` or `-Y` to redirect X11 through the ssh tunnel. The redirection works despite the fact that the server is headless, but Xorg must be installed. The `-X` option will automatically update the `DISPLAY` env variable. Note: the IP of the server is subject to change, make sure to use the latest one.

```bash
ssh -X user@132.203.26.231
```

Test that X redirection is working by executing a simple X graphical application.

```bash
$ xterm
```
Test GLX support with `glxinfo`.

```bash
glxinfo
```

Test which GLX implementation is used by default.

```bash
$ glxinfo | grep -i vendor
server glx vendor string: SGI
client glx vendor string: Mesa Project and SGI
Vendor: Mesa (0xffffffff)
OpenGL vendor string: Mesa
```
Check that both the NVidia and Mesa implementations work for GLX passthrough.

```bash
__GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo | grep -i vendor
__GLX_VENDOR_LIBRARY_NAME=mesa glxinfo | grep -i vendor
```

Choose the best implementation between NVidia and Mesa. On NVidia GPUs, NVidia's implementation gives the best results.

```bash
export __GLX_VENDOR_LIBRARY_NAME=nvidia
glxgears
```
For Vulkan applications the process is similar.

```bash
vulkaninfo
VK_DRIVER_FILES="/usr/share/vulkan/icd.d/nvidia_icd.x86_64.json" vkcube
```