
Weird text encoder NaNs specifically for FSDP + multi GPU #33376

Closed
2 of 4 tasks
Tracked by #33345
christopher-beckham opened this issue Sep 8, 2024 · 3 comments
Labels: bug, Core: Tokenization, PyTorch FSDP

Comments


christopher-beckham commented Sep 8, 2024

System Info

  • transformers version: 4.45.0.dev0
  • Platform: Linux-5.15.0-1027-gcp-x86_64-with-glibc2.31
  • Python version: 3.9.19
  • Huggingface_hub version: 0.24.5
  • Safetensors version: 0.4.4
  • Accelerate version: 0.35.0.dev0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: distributed, yes, but I test with two custom yml files (see below)
  • Using GPU in script?: yes
  • GPU type: NVIDIA A100-SXM4-80GB

Both accelerate and transformers are recent, installed fresh from GitHub.

Who can help?

@ArthurZucker @muellerz, since it seems to involve the combination of FSDP and the instantiation of the tokenizer classes

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am running into a weird issue when loading certain models from transformers under multi-GPU. In the toy task below I simply load some tokenizers and text encoders from a pretrained model, yet when I run the script with multi-GPU + FSDP the text encoder weights contain NaNs.

For instance, with this script:

from accelerate import Accelerator
from transformers import CLIPTokenizer, T5EncoderModel, T5TokenizerFast, CLIPTextModel
from diffusers.utils import (
    check_min_version
)
import torch

def has_nan(tensor):
    if not isinstance(tensor, torch.Tensor):
        return f"not a tensor, but a {type(tensor)}"
    return torch.isnan(tensor).any().item()

def check_nan_weights(model, mod_name):
    nan_params = []
    for name, param in model.named_parameters():
        if torch.isnan(param.data).any():
            nan_params.append(name)
    
    if nan_params:
        print(f"[{torch.cuda.current_device()}, {mod_name}]: NaN weights detected in the following parameters:")
        for param_name in nan_params:
            print(f"  - {param_name}")
        return True
    return False

from logging import getLogger
logger = getLogger(__name__)

def load_pipeline(accelerator,
                  pretrained_model_name_or_path: str,
                  load_tokenizers: bool = True,
                  revision: str = None,
                  variant: str = None):

    #with accelerator.main_process_first():

    if load_tokenizers:

        # Load the tokenizers
        tokenizer_one = CLIPTokenizer.from_pretrained(
            pretrained_model_name_or_path,
            subfolder="tokenizer",
            revision=revision,
        )
        tokenizer_two = T5TokenizerFast.from_pretrained(
            pretrained_model_name_or_path,
            subfolder="tokenizer_2",
            revision=revision,
        )

    #accelerator.wait_for_everyone()

    text_encoder_one = CLIPTextModel.from_pretrained(
        pretrained_model_name_or_path, subfolder="text_encoder", 
        revision=revision, variant=variant
    )

    text_encoder_two = T5EncoderModel.from_pretrained(
        pretrained_model_name_or_path, subfolder="text_encoder_2", 
        revision=revision, variant=variant,
    )

    logger.info("check nan weights...")
    check_nan_weights(text_encoder_one, 'te')
    check_nan_weights(text_encoder_two, 'te2')

def main():

    accelerator = Accelerator()

    load_pipeline(
        accelerator,
        "black-forest-labs/FLUX.1-dev",
        load_tokenizers=True
    )

if __name__ == "__main__":
    #from torch.multiprocessing import Pool, Process, set_start_method
    #set_start_method('spawn')
    main()

If we run this on 1 GPU via accelerate launch --config_file 1gpu.yml test.py, we get no errors. However, on 2 GPUs with accelerate launch --config_file 2gpu.yml test.py we get:

[1, te]: NaN weights detected in the following parameters:
  - text_model.encoder.layers.0.self_attn.k_proj.weight
  - text_model.encoder.layers.0.self_attn.k_proj.bias
  - text_model.encoder.layers.0.self_attn.v_proj.weight
  - text_model.encoder.layers.2.self_attn.out_proj.bias
  - text_model.encoder.layers.2.layer_norm1.weight
  - text_model.encoder.layers.2.layer_norm1.bias
 ...
 ...

Note that if we set load_tokenizers=False in load_pipeline, there are no issues, so it seems to be something with the tokenizers. I thought this might be a race-condition-related issue, but when I tried to isolate that behaviour, e.g. with accelerator.wait_for_everyone() (see the sketch below), I still got the same errors.
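
Roughly, the synchronisation I tried looks like the following (a minimal sketch only; the exact placement in my script differs slightly, and the commented-out calls in the repro above show where it sat):

from accelerate import Accelerator
from transformers import CLIPTokenizer, T5TokenizerFast

accelerator = Accelerator()
pretrained_model_name_or_path = "black-forest-labs/FLUX.1-dev"

# Let rank 0 populate the HF cache first; the other ranks then load from it.
with accelerator.main_process_first():
    tokenizer_one = CLIPTokenizer.from_pretrained(
        pretrained_model_name_or_path, subfolder="tokenizer"
    )
    tokenizer_two = T5TokenizerFast.from_pretrained(
        pretrained_model_name_or_path, subfolder="tokenizer_2"
    )

# Explicit barrier before the text encoders are instantiated.
accelerator.wait_for_everyone()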

Furthermore, if I run the script with accelerate launch test.py and a default config (as vanilla as can be: no FSDP, just multi-GPU enabled), there are no errors to be found. So this seems to be an issue specifically at the intersection of FSDP and the tokenizer classes.
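
For completeness, the "vanilla" config I compared against looks roughly like this (a sketch of a default multi-GPU accelerate config; my actual file was generated by accelerate config and may differ in minor fields):

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
use_cpu: false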

The FSDP config files I use are as follows for 1 GPU and 2 GPUs (for 2 GPUs, just set num_processes: 2).

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: false
  fsdp_auto_wrap_policy: SIZE_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_min_num_params: 100000000
  fsdp_offload_params: false
  # SHARD_GRAD_OP was the previous strat
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  # SHARDED_STATE_DICT was the old value for above
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Expected behavior

No NaNs.


Zars19 commented Sep 9, 2024

I got the same problem

LysandreJik added the Core: Tokenization and PyTorch FSDP labels on Sep 9, 2024
@ArthurZucker (Collaborator) commented:

Hey! You are using the T5 architecture, so I think you should be extra careful with the dtype you are using. #17978 and #4287 are heavily related!
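
Concretely, a minimal sketch of what that could look like for the repro above (loading the T5 encoder with an explicit torch_dtype such as bfloat16 rather than float16, since T5 is known to overflow in fp16; this illustrates the suggestion and is not a confirmed fix for the FSDP issue):

import torch
from transformers import T5EncoderModel

# Load the T5 text encoder in bfloat16 (or float32) rather than float16.
text_encoder_two = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    torch_dtype=torch.bfloat16,
)

# Sanity check: report any parameters containing inf/NaN values.
bad = [name for name, p in text_encoder_two.named_parameters()
       if not torch.isfinite(p).all()]
print("non-finite parameters:", bad or "none")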


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this as completed on Nov 7, 2024.