Tasks
One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
My own task or dataset (give details below)
Reproduction
Load a T5 encoder-decoder model.
Use accelerator.prepare() on the model and the evaluation dataloader for parallel evaluation.
Run generation in the evaluation loop and gather the outputs:

for data in eval_dataloader:
    out = model.generate(data["input_ids"])
    accelerator.gather(out)
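For reference, a minimal self-contained sketch of the setup above. The checkpoint name and the toy dataloader are illustrative assumptions, not details from the original report:

```python
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator
from transformers import AutoTokenizer, T5ForConditionalGeneration

accelerator = Accelerator()

# "t5-small" is an assumed checkpoint; the report only says "T5 encoder-decoder model".
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Toy evaluation data, purely for illustration.
texts = ["translate English to German: Hello, world!"] * 8
enc = tokenizer(texts, return_tensors="pt", padding=True)
eval_dataloader = DataLoader(
    list(zip(enc["input_ids"], enc["attention_mask"])), batch_size=2
)

model, eval_dataloader = accelerator.prepare(model, eval_dataloader)

for input_ids, attention_mask in eval_dataloader:
    with torch.no_grad():
        # unwrap_model is used because prepare() may wrap the model in DDP,
        # which does not expose .generate() directly.
        out = accelerator.unwrap_model(model).generate(
            input_ids=input_ids, attention_mask=attention_mask
        )
    # The report says execution freezes here after a few steps.
    gathered = accelerator.gather(out)
```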
Expected behavior
The generated token ids should be gathered without the script freezing. I also tried synced_gpus=True in the generate function, but it doesn't help. I also tried inserting accelerator.wait_for_everyone() in several places, and that doesn't work either. Note that model.generate() works fine on its own; when it's combined with accelerator.gather(out), it freezes after a few steps.
It's very likely that each process is generating outputs of different lengths, so you are stuck because accelerator.gather requires tensors of the same shape on every process. You can use accelerator.pad_across_processes to pad your outputs to the same size prior to gathering.
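A sketch of that suggestion, assuming out is the tensor of generated ids from the loop above and tokenizer is in scope for the pad token id:

```python
# Pad the generated ids along the sequence dimension so every process
# holds a tensor of the same shape, then gather safely.
out = accelerator.pad_across_processes(out, dim=1, pad_index=tokenizer.pad_token_id)
gathered = accelerator.gather(out)
```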