A description of the system can be found in the system_info document.
The containers are run using Docker. Install instructions are found here.
Remember the post-install steps.
Compose is used to start the main services. Install instructions can be found here.
The slaves need the NVIDIA Container Toolkit installed. The installation can be finicky, so test the install before proceeding. To test the install, run
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
If all GPUs are listed, you should be OK.
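The GPU check above can also be scripted. The following is a minimal sketch, not part of the repository: it adds the -L flag to nvidia-smi so that each GPU is printed on its own line, then counts those lines. The function names count_gpus and check_gpu_container, and the expected-count parameter, are hypothetical. If the nvidia/cuda:11.0-base tag is no longer available from the registry, a current CUDA base tag would need to be substituted.

```shell
#!/bin/sh
# Count GPUs in `nvidia-smi -L` output read from stdin.
# Each GPU is printed as a line such as "GPU 0: <name> (UUID: ...)".
count_gpus() {
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the printed count and hides the exit status.
    grep -c '^GPU [0-9][0-9]*:' || true
}

# Run nvidia-smi inside the CUDA base image and compare the number of
# GPUs visible to the container with the number expected on this host.
# $1: expected GPU count (hypothetical parameter for this sketch)
check_gpu_container() {
    expected=$1
    found=$(sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi -L | count_gpus)
    if [ "$found" -eq "$expected" ]; then
        echo "OK: $found GPU(s) visible inside the container"
    else
        echo "FAIL: expected $expected GPU(s), container sees $found" >&2
        return 1
    fi
}
```

On a slave with two GPUs, for example, the file could be sourced and check_gpu_container 2 called before continuing with the rest of the setup.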
For the system to run, you first have to configure a wildcard DNS record that points to the master instance's IP. The current domain is written in a couple of different config files, but it can be changed with a quick find and replace.
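Both steps can be sketched in shell. This is an illustration under assumptions, not the project's own tooling: the function names are hypothetical, resolve_host assumes getent is available (as on most Linux systems), and replace_domain assumes GNU sed. The wildcard check asks for a made-up subdomain, since a wildcard record should answer for any label.

```shell
#!/bin/sh
# Resolve a hostname to its first address using the system resolver.
resolve_host() {
    getent hosts "$1" | awk '{ print $1; exit }'
}

# Check that an arbitrary subdomain of $1 resolves to the master IP $2.
check_wildcard() {
    domain=$1
    master_ip=$2
    probe="wildcard-check-$$.$domain"
    resolved=$(resolve_host "$probe")
    if [ "$resolved" = "$master_ip" ]; then
        echo "OK: $probe -> $resolved"
    else
        echo "FAIL: $probe -> ${resolved:-<no answer>} (expected $master_ip)" >&2
        return 1
    fi
}

# Replace every occurrence of the old domain with the new one in all
# files under a directory. Dots in the old domain are escaped so that
# they match literally. Assumes GNU sed (for -i without a suffix).
replace_domain() {
    old=$1
    new=$2
    dir=$3
    old_re=$(printf '%s' "$old" | sed 's/\./\\./g')
    grep -rlF -- "$old" "$dir" | while IFS= read -r file; do
        sed -i "s|$old_re|$new|g" "$file"
    done
}
```

It is worth running replace_domain on a copy of the config directory first, or checking the result with a diff, because it rewrites files in place.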
The Traefik instance needs an API key for whichever DNS provider holds the record. Change the Traefik config file to use that provider and inject the correct environment variables for it. A list of the providers and their keys can be found here.
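A small pre-flight check can catch a missing credential before Traefik starts. The sketch below is an assumption-based helper, not part of the repository: require_env is a hypothetical function name, and the usage example assumes Cloudflare as the provider, for which the variable is CF_DNS_API_TOKEN. Other providers use different variable names, so the list of names has to be adapted.

```shell
#!/bin/sh
# Return non-zero if any of the named variables is unset or empty.
# Every missing name is reported, not only the first one.
require_env() {
    missing=0
    for name in "$@"; do
        # Indirect lookup: read the value of the variable named by $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing required environment variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

With Cloudflare as the assumed provider, the token would be exported and then checked with require_env CF_DNS_API_TOKEN before running the master start script.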
The master and slave can, but do not have to, run on the same server.
- To start the master, run the start_master.sh script.
- To start the slave, run the start_slave.sh script.
The master instance is health checked and will auto-heal.
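To see the health state from the command line, something like the following could be used. It is a sketch that assumes the health check is a Docker HEALTHCHECK, which this document does not confirm. The function names are hypothetical, and the container name has to be taken from docker ps.

```shell
#!/bin/sh
# Print the Docker health status of a container:
# "starting", "healthy" or "unhealthy".
health_status() {
    docker inspect --format '{{.State.Health.Status}}' "$1"
}

# Poll once per second until the container reports "healthy" or the
# timeout expires.
# $1: container name, $2: timeout in seconds (default 60)
wait_healthy() {
    container=$1
    remaining=${2:-60}
    while [ "$remaining" -gt 0 ]; do
        if [ "$(health_status "$container" 2>/dev/null)" = "healthy" ]; then
            echo "$container is healthy"
            return 0
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    echo "$container did not become healthy in time" >&2
    return 1
}
```

Calling wait_healthy with the master container's name after start_master.sh gives a simple way to block until the services are up.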
The Traefik GUI can be found on port 8090 on the master's offline IP.
The Kafka GUI can be found on port 8081 on the master's offline IP.
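A quick reachability check for both interfaces might look like this. It is a sketch only: the function names are hypothetical, and plain HTTP is an assumption, since the scheme is not stated here. The check succeeds as soon as anything answers on the port, whatever the HTTP status.

```shell
#!/bin/sh
# Print the GUI URLs for a given master IP, one per line.
gui_urls() {
    echo "http://$1:8090"   # Traefik
    echo "http://$1:8081"   # Kafka
}

# Try each URL and report whether anything answers on it.
# $1: the master's offline IP
check_guis() {
    status=0
    for url in $(gui_urls "$1"); do
        if curl --silent --output /dev/null --max-time 5 "$url"; then
            echo "reachable: $url"
        else
            echo "unreachable: $url" >&2
            status=1
        fi
    done
    return "$status"
}
```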