Flux LoRA multi-GPU training #997
Answered by bghira

PCanavelli asked this question in Q&A
Hey there, I'm aware that it's currently impossible to train LoRAs with DeepSpeed enabled (just realised: is that also true for LyCORIS and other adapters?). Is it also true without DeepSpeed? I have been trying to set up Accelerate to use all my GPUs. The training starts fine, but it looks like only GPU 0 is being used while all the others sit idle (0% load, and the same iteration speed as with a single GPU).

**Accelerate config**

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

**config.json**

```json
{
"--resume_from_checkpoint": "latest",
"--data_backend_config": "config/multidatabackend.json",
"--aspect_bucket_rounding": 2,
"--seed": 42,
"--minimum_image_size": 0,
"--disable_benchmark": false,
"--output_dir": "output/models",
"--lora_type": "standard",
"--lora_rank": 64,
"--max_train_steps": 100000,
"--num_train_epochs": 0,
"--checkpointing_steps": 5000,
"--checkpoints_total_limit": 100,
"--model_type": "lora",
"--pretrained_model_name_or_path": "/workspace/checkpoints/my_checkpoint",
"--model_family": "flux",
"--train_batch_size": 1,
"--gradient_checkpointing": "true",
"--caption_dropout_probability": 0.0,
"--resolution_type": "pixel_area",
"--resolution": 1024,
"--validation_seed": 42,
"--validation_steps": 5000,
"--validation_resolution": "1024x1024",
"--validation_guidance": 3.0,
"--validation_guidance_rescale": "0.0",
"--validation_num_inference_steps": "20",
"--validation_prompt": "my test prompt",
"--mixed_precision": "bf16",
"--optimizer": "adamw_bf16",
"--learning_rate": "1e-4",
"--lr_scheduler": "polynomial",
"--lr_warmup_steps": 100,
"--validation_torch_compile": "false"
}
```

**multidatabackend.json**

```json
[
{
"id": "my_training",
"type": "local",
"crop": "true",
"crop_aspect": "square",
"crop_style": "center",
"resolution": 1.0,
"minimum_image_size": 0.25,
"maximum_image_size": 1.0,
"target_downsample_size": 1.0,
"resolution_type": "area",
"cache_dir_vae": "cache/vae/my_training",
"instance_data_dir": "/workspace/datasets/my_training",
"disabled": false,
"skip_file_discovery": "",
"caption_strategy": "textfile",
"metadata_backend": "json"
},
{
"id": "text-embeds",
"type": "local",
"dataset_type": "text_embeds",
"default": true,
"cache_dir": "cache/text/my_training",
"disabled": false,
"write_batch_size": 128
}
]
```

**nvidia-smi mid-training**

[screenshot: nvidia-smi output showing only GPU 0 under load]
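For anyone hitting the same symptom, a quick way to confirm it (not from the original thread, just a generic check) is to watch per-GPU utilization while training runs:

```bash
# Watch utilization across all GPUs once per second. If only GPU 0 shows
# load while the others stay at 0%, only one process is actually training.
watch -n 1 nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
```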
Answered by bghira, Sep 27, 2024
check the first few lines of output.
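To see what the answer is pointing at: on startup, Accelerate reports the distributed setup it actually built. A minimal standalone sketch (my own, not part of SimpleTuner) using Accelerate's public API:

```python
# check_accel.py — minimal sketch: print what Accelerate actually launched.
# With a working 8-GPU setup, each of the 8 ranks prints its own line;
# if multi-GPU isn't active, you only see "process 0/1".
from accelerate import Accelerator

accelerator = Accelerator()
print(
    f"process {accelerator.process_index}/{accelerator.num_processes} "
    f"on device {accelerator.device}"
)
```

Run it with `accelerate launch check_accel.py` and compare against what the trainer prints in its first few lines.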
9 replies (one shown below)
bingo! The contents of Accelerate's config file are only used for DeepSpeed details, so that we don't have to expect users to constantly change that obscurely located file for things like the number of GPUs.
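Given that explanation, the GPU count has to be supplied at launch time rather than through the Accelerate config file. A hedged sketch (the `train.py` entry point is a placeholder; SimpleTuner's own launcher may instead read something like `TRAINING_NUM_PROCESSES` from its env file — check your version's docs):

```bash
# Hedged example: pass the process count directly to accelerate launch,
# bypassing the saved Accelerate config file entirely.
# "train.py" is a placeholder for the trainer's actual entry point.
accelerate launch --multi_gpu --num_processes=8 --mixed_precision=bf16 train.py
```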