This document contains notes* on configuring a 4-way 8xH100 cluster for use with SimpleTuner.
*This guide does not contain full end-to-end installation instructions; these notes are considerations to keep in mind when following the INSTALL document or one of the quickstart guides.
By default, multi-node training requires shared storage between the nodes for the `output_dir`.
The following is a basic NFS storage example that will get you started.
On the master node (the machine that will host `output_dir`):
1. Install NFS Server Packages
```bash
sudo apt update
sudo apt install nfs-kernel-server
```
2. Configure the NFS Export
Edit the NFS exports file to share the directory:
```bash
sudo nano /etc/exports
```
Add the following line at the end of the file (replace `slave_ip` with the actual IP address of your slave machine):
```
/home/ubuntu/simpletuner/output slave_ip(rw,sync,no_subtree_check)
```
If you want to allow multiple slaves or an entire subnet, you can use:
```
/home/ubuntu/simpletuner/output subnet_ip/24(rw,sync,no_subtree_check)
```
3. Export the Shared Directory
```bash
sudo exportfs -a
```
4. Restart the NFS Server
```bash
sudo systemctl restart nfs-kernel-server
```
5. Verify NFS Server Status
```bash
sudo systemctl status nfs-kernel-server
```
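To additionally confirm that the export itself is active (rather than just the service), `exportfs -v` lists the shared directories along with their options:
```bash
sudo exportfs -v
```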
On each slave node:
1. Install NFS Client Packages
```bash
sudo apt update
sudo apt install nfs-common
```
2. Create the Mount Point Directory
Ensure the directory exists (it should already exist based on your setup):
```bash
sudo mkdir -p /home/ubuntu/simpletuner/output
```
Note: If the directory contains data, back it up, as mounting will hide existing contents.
3. Mount the NFS Share
Mount the master's shared directory to the slave's local directory (replace `master_ip` with the master's IP address):
```bash
sudo mount master_ip:/home/ubuntu/simpletuner/output /home/ubuntu/simpletuner/output
```
4. Verify the Mount
Check that the mount was successful:
```bash
mount | grep /home/ubuntu/simpletuner/output
```
5. Test Write Access
Create a test file to ensure you have write permissions:
```bash
touch /home/ubuntu/simpletuner/output/test_file_from_slave.txt
```
Then, check on the master machine that the file appears in `/home/ubuntu/simpletuner/output`.
6. Make the Mount Persistent
To ensure the mount persists across reboots, add it to the `/etc/fstab` file:
```bash
sudo nano /etc/fstab
```
Add the following line at the end:
```
master_ip:/home/ubuntu/simpletuner/output /home/ubuntu/simpletuner/output nfs defaults 0 0
```
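To confirm the `fstab` entry works without rebooting, unmount the share (if it is still mounted from step 3) and let `mount -a` re-mount everything listed in `fstab`:
```bash
sudo umount /home/ubuntu/simpletuner/output
sudo mount -a
mount | grep /home/ubuntu/simpletuner/output
```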
- User Permissions: Ensure that the `ubuntu` user has the same UID and GID on both machines so that file permissions are consistent. You can check UIDs with `id ubuntu` (see the snippet after this list).
- Firewall Settings: If you have a firewall enabled, make sure to allow NFS traffic. On the master machine:
  ```bash
  sudo ufw allow from slave_ip to any port nfs
  ```
- Synchronize Clocks: It's good practice to have both systems' clocks synchronized, especially in distributed setups. Use `ntp` or `systemd-timesyncd`.
- Testing DeepSpeed Checkpoints: Run a small DeepSpeed job to confirm that checkpoints are correctly written to the master's directory.
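A quick way to run the UID/GID check (and confirm clock synchronization while you're at it) on the master and each slave; the `ubuntu` username comes from the paths above, so adjust it to your own user:
```bash
# Compare this output across nodes; the uid and gid values should match.
id ubuntu

# Confirm the system clock is being synchronized (systemd-timesyncd, chrony, or ntp).
timedatectl status
```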
Very large datasets can be a challenge to manage efficiently. SimpleTuner will automatically shard datasets over each node and distribute pre-processing across every available GPU in the cluster, while using asynchronous queues and threads to maintain throughput.
The discovery backend currently restricts aspect bucket data collection to a single node. This can take an extremely long time with very large datasets, as each image has to be read from storage to retrieve its geometry.
To work around this problem, the parquet `metadata_backend` should be used, allowing you to preprocess your data in any manner accessible to you. As outlined in the linked document section, the parquet table contains the `filename`, `width`, `height`, and `caption` columns to help quickly and efficiently sort the data into its respective buckets.
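As a rough illustration, such a table can be produced ahead of time with pandas and Pillow; the dataset location and the `.txt` caption sidecar convention below are assumptions for the example, not requirements of SimpleTuner:
```python
# Sketch: build a metadata parquet with the filename, width, height, and caption
# columns used by the parquet metadata_backend. Requires pandas, pyarrow, and Pillow.
from pathlib import Path

import pandas as pd
from PIL import Image

DATASET_DIR = Path("/home/ubuntu/datasets/my-dataset")  # assumed location

rows = []
for image_path in sorted(DATASET_DIR.rglob("*.jpg")):
    # Opening the image is enough for PIL to report its geometry.
    with Image.open(image_path) as img:
        width, height = img.size
    caption_file = image_path.with_suffix(".txt")
    caption = caption_file.read_text().strip() if caption_file.exists() else ""
    rows.append(
        {
            # Check the linked document section for whether filenames should be
            # relative or absolute paths for your backend type.
            "filename": str(image_path.relative_to(DATASET_DIR)),
            "width": width,
            "height": height,
            "caption": caption,
        }
    )

pd.DataFrame(rows).to_parquet(DATASET_DIR / "metadata.parquet", index=False)
```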
Huge datasets, especially when using the T5-XXL text encoder, will consume enormous quantities of space for the original data, the image embeds, and the text embeds.
Using a provider such as Cloudflare R2, one can generate extremely large datasets for very little in storage fees.
See the dataloader configuration guide for an example of how to configure the `aws` type in `multidatabackend.json`.
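As a hedged sketch only, an R2-backed entry might look roughly like the following; treat the key names as illustrative and take the authoritative list from the dataloader configuration guide:
```json
{
  "id": "training-images",
  "type": "aws",
  "aws_bucket_name": "my-training-data",
  "aws_endpoint_url": "https://<account-id>.r2.cloudflarestorage.com",
  "aws_access_key_id": "<access key>",
  "aws_secret_access_key": "<secret key>"
}
```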
- Image data can be stored locally or via S3
- If images are in S3, preprocessing speed is limited by network bandwidth
- If images are stored locally, this does not take advantage of NVMe throughput during training
- Image embeds and text embeds can be separately stored on local or cloud storage
- Placing embeds on cloud storage reduces the training rate very little, as they are fetched in parallel
Ideally, all images and all embeds are simply maintained in a cloud storage bucket. This greatly reduces the risk of issues during pre-processing and when resuming training.
If your datasets are so large that scanning for new images becomes a bottleneck, adding `preserve_data_backend_cache=true` to each dataloader config entry will prevent the backend from being scanned for new images.
Note that you should then use the `image_embeds` data backend type (more information here) to allow these cache lists to live separately in case your pre-processing job is interrupted. This will prevent the image list from being re-scanned at startup.
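A minimal sketch of how this could look in `multidatabackend.json`; only `preserve_data_backend_cache` and the `image_embeds` type come from the text above, and the remaining keys plus the way the two entries are linked should be confirmed against the linked section:
```json
[
  {
    "id": "training-images",
    "type": "aws",
    "preserve_data_backend_cache": true
  },
  {
    "id": "image-embed-cache",
    "type": "image_embeds"
  }
]
```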
Data compression should be enabled by adding the following to `config.json`:
```json
{
  ...
  "--compress_disk_cache": true,
  ...
}
```
This will use inline gzip to reduce the amount of redundant disk space consumed by the rather large text and image embeds.
When using `accelerate config` (`/home/user/.cache/huggingface/accelerate/default_config.yaml`) to deploy SimpleTuner, these options will take priority over the contents of `config/config.env`.
An example default_config for Accelerate that does not include DeepSpeed:
```yaml
# this should be updated on EACH node.
machine_rank: 0
# Everything below here is the same on EACH node.
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_config:
  dynamo_backend: NO
enable_cpu_affinity: false
main_process_ip: 10.0.0.100
main_process_port: 8080
main_training_function: main
mixed_precision: bf16
num_machines: 4
num_processes: 32
rdzv_backend: static
same_network: false
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
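With this file in place on every node (each with its own `machine_rank`), the same launch command is run on all four nodes; `train.py` as the entrypoint is an assumption here, so substitute your usual SimpleTuner launch script if it differs:
```bash
accelerate launch --config_file /home/user/.cache/huggingface/accelerate/default_config.yaml train.py
```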
This document doesn't go into as much detail on DeepSpeed as the dedicated page.
When optimising training on DeepSpeed for multi-node, using the lowest possible ZeRO level is essential.
For example, an 80G NVIDIA GPU can successfully train Flux with ZeRO level 1 offload, minimising overhead substantially.
Add the following lines to each node's Accelerate config to enable DeepSpeed:
```yaml
# Update this from MULTI_GPU to DEEPSPEED
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  gradient_clipping: 0.01
  zero3_init_flag: false
  zero_stage: 1
```
For extra performance (at the cost of possible compatibility issues), you can enable torch compile by adding the following lines to each node's yaml config:
```yaml
dynamo_config:
  # Update this from NO to INDUCTOR
  dynamo_backend: INDUCTOR
  dynamo_mode: max-autotune
  dynamo_use_dynamic: false
  dynamo_use_fullgraph: false
```
- 4x H100 SXM5 nodes connected via local network
- 1TB of memory per node
- Training cache streaming from shared S3-compatible data backend (Cloudflare R2) in same region
- Batch size of 8 per accelerator, and no gradient accumulation steps
- Total effective batch size is 256
- Resolution is at 1024px with data bucketing enabled
- Speed: 15 seconds per step with 1024x1024 data when full-rank training Flux.1-dev (12B)
Lower batch sizes, lower resolution, and enabling torch compile can bring the speed into the iterations-per-second range:
- Reduce resolution to 512px and disable data bucketing (square crops only)
- Swap DeepSpeed from AdamW to Lion fused optimiser
- Enable torch compile with max-autotune
- Speed: 2 iterations per second
- Every node must have the same number of accelerators available
- Only LoRA/LyCORIS can be quantised, so full distributed model training requires DeepSpeed instead
- This is a very high-cost operation, and high batch sizes might slow you down more than you want, which may require scaling up the number of GPUs in the cluster; careful budgeting is needed to strike the right balance
- (DeepSpeed) Validations might need to be disabled when training with DeepSpeed ZeRO 3
- (DeepSpeed) Model saving ends up creating weird sharded copies when saving with ZeRO level 3, but levels 1 and 2 function as expected
- (DeepSpeed) DeepSpeed's CPU-based optimisers become required, as they handle sharding and offloading of the optimiser states