
Weird text encoder NaNs specifically for FSDP + multi GPU #33376

Closed
2 of 4 tasks
Tracked by #33345
christopher-beckham opened this issue Sep 8, 2024 · 3 comments
Labels: bug, Core: Tokenization, PyTorch FSDP

Comments


christopher-beckham commented Sep 8, 2024

System Info

  • transformers version: 4.45.0.dev0
  • Platform: Linux-5.15.0-1027-gcp-x86_64-with-glibc2.31
  • Python version: 3.9.19
  • Huggingface_hub version: 0.24.5
  • Safetensors version: 0.4.4
  • Accelerate version: 0.35.0.dev0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: distributed, yes, but I test with two custom yml files (see below)
  • Using GPU in script?: yes
  • GPU type: NVIDIA A100-SXM4-80GB

Both accelerate and transformers are recent, installed fresh from GitHub.

Who can help?

@ArthurZucker @muellerz, since it seems to involve the combination of FSDP and the instantiation of the tokenizer classes

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am running into a weird issue when loading certain models from transformers under multi-GPU. In the toy task below I simply load some tokenizers and text encoders from a pretrained model, yet when I run the script with multi-GPU + FSDP the text encoder weights contain NaNs.

For instance, with this script:

from accelerate import Accelerator
from transformers import CLIPTokenizer, T5EncoderModel, T5TokenizerFast, CLIPTextModel
from diffusers.utils import (
    check_min_version
)
import torch

def has_nan(tensor):
    if not isinstance(tensor, torch.Tensor):
        return f"not a tensor, but a {type(tensor)}"
    return torch.isnan(tensor).any().item()

def check_nan_weights(model, mod_name):
    nan_params = []
    for name, param in model.named_parameters():
        if torch.isnan(param.data).any():
            nan_params.append(name)
    
    if nan_params:
        print(f"[{torch.cuda.current_device()}, {mod_name}]: NaN weights detected in the following parameters:")
        for param_name in nan_params:
            print(f"  - {param_name}")
        return True
    return False

from logging import getLogger
logger = getLogger(__name__)

def load_pipeline(accelerator,
                  pretrained_model_name_or_path: str,
                  load_tokenizers: bool = True,
                  revision: str = None,
                  variant: str = None):

    #with accelerator.main_process_first():

    if load_tokenizers:

        # Load the tokenizers
        tokenizer_one = CLIPTokenizer.from_pretrained(
            pretrained_model_name_or_path,
            subfolder="tokenizer",
            revision=revision,
        )
        tokenizer_two = T5TokenizerFast.from_pretrained(
            pretrained_model_name_or_path,
            subfolder="tokenizer_2",
            revision=revision,
        )

    #accelerator.wait_for_everyone()

    text_encoder_one = CLIPTextModel.from_pretrained(
        pretrained_model_name_or_path, subfolder="text_encoder", 
        revision=revision, variant=variant
    )

    text_encoder_two = T5EncoderModel.from_pretrained(
        pretrained_model_name_or_path, subfolder="text_encoder_2", 
        revision=revision, variant=variant,
    )

    logger.info("check nan weights...")
    check_nan_weights(text_encoder_one, 'te')
    check_nan_weights(text_encoder_two, 'te2')

def main():

    accelerator = Accelerator()

    load_pipeline(
        accelerator,
        "black-forest-labs/FLUX.1-dev",
        load_tokenizers=True
    )

if __name__ == "__main__":
    #from torch.multiprocessing import Pool, Process, set_start_method
    #set_start_method('spawn')
    main()

If we run this on 1 GPU via accelerate launch --config_file 1gpu.yml test.py, we get no errors. However, on 2 GPUs with accelerate launch --config_file 2gpu.yml test.py we get:

[1, te]: NaN weights detected in the following parameters:
  - text_model.encoder.layers.0.self_attn.k_proj.weight
  - text_model.encoder.layers.0.self_attn.k_proj.bias
  - text_model.encoder.layers.0.self_attn.v_proj.weight
  - text_model.encoder.layers.2.self_attn.out_proj.bias
  - text_model.encoder.layers.2.layer_norm1.weight
  - text_model.encoder.layers.2.layer_norm1.bias
 ...
 ...

Note that if we set load_tokenizers=False in load_pipeline, there are no issues, so it seems to be something with the tokenizers. I thought this might be a race-condition-related issue, but when I tried to isolate that behaviour, e.g. with accelerator.wait_for_everyone() (see the sketch below), I still got the same errors.
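
Roughly, the synchronisation I tried looks like the following (a minimal sketch only; the exact placement in my script differs slightly, and the commented-out calls in the repro above show where it sat):

from accelerate import Accelerator
from transformers import CLIPTokenizer, T5TokenizerFast

accelerator = Accelerator()
pretrained_model_name_or_path = "black-forest-labs/FLUX.1-dev"

# Let rank 0 populate the HF cache first; the other ranks then load from it.
with accelerator.main_process_first():
    tokenizer_one = CLIPTokenizer.from_pretrained(
        pretrained_model_name_or_path, subfolder="tokenizer"
    )
    tokenizer_two = T5TokenizerFast.from_pretrained(
        pretrained_model_name_or_path, subfolder="tokenizer_2"
    )

# Explicit barrier before the text encoders are instantiated.
accelerator.wait_for_everyone()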

Furthermore, if I run the script with accelerate launch test.py and a default config (as vanilla as can be: no FSDP, just multi-GPU enabled), there are no errors to be found. So this seems to be an issue specifically at the intersection of FSDP and the tokenizer classes.
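
For completeness, the "vanilla" config I compared against looks roughly like this (a sketch of a default multi-GPU accelerate config; my actual file was generated by accelerate config and may differ in minor fields):

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
use_cpu: false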

The FSDP config files I use are as follows for 1 GPU and 2 GPUs (for 2 GPUs, just set num_processes: 2).

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: false
  fsdp_auto_wrap_policy: SIZE_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_min_num_params: 100000000
  fsdp_offload_params: false
  # SHARD_GRAD_OP was the previous strat
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  # SHARDED_STATE_DICT was the old value for above
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Expected behavior

No NaNs.


Zars19 commented Sep 9, 2024

I got the same problem

LysandreJik added the Core: Tokenization and PyTorch FSDP labels on Sep 9, 2024
@ArthurZucker (Collaborator) commented:

Hey! You are using the T5 architecture, so I think you should be extra careful with the dtype you are using. #17978 and #4287 are heavily related!
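
Concretely, a minimal sketch of what that could look like for the repro above (loading the T5 encoder with an explicit torch_dtype such as bfloat16 rather than float16, since T5 is known to overflow in fp16; this illustrates the suggestion and is not a confirmed fix for the FSDP issue):

import torch
from transformers import T5EncoderModel

# Load the T5 text encoder in bfloat16 (or float32) rather than float16.
text_encoder_two = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    torch_dtype=torch.bfloat16,
)

# Sanity check: report any parameters containing inf/NaN values.
bad = [name for name, p in text_encoder_two.named_parameters()
       if not torch.isfinite(p).all()]
print("non-finite parameters:", bad or "none")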


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this as completed on Nov 7, 2024.