Generate: Fix modern llm generate calls with synced_gpus #34095

Merged
5 commits merged into huggingface:main on Oct 12, 2024

Conversation

@gante (Member) commented on Oct 11, 2024

What does this PR do?

Step 5 in #32685
Fixes #32885
Fixes #32603
Fixes #32641

Modern LLMs, i.e. LLMs that support our cache classes, currently fail when the input has a batch size > 1 and synced_gpus = True.

On main, this is what happens with synced_gpus:

  1. cache_position stops being updated when generation finishes on a given device, causing cache indexing errors on that device (the cache keeps growing because we keep doing dummy forward passes)
  2. if we instead keep updating cache_position, slicing input_ids goes out of bounds during the dummy computations (we stop updating input_ids, so it stops growing); see the sketch below
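
To make the failure concrete, here is a rough sketch of what synced_gpus implies (illustrative names only, not the actual generation loop in transformers): every rank must execute the same number of forward passes, so a rank that finishes early keeps running dummy forwards while the other ranks are still generating.

import torch
import torch.distributed as dist

def synced_step(model, input_ids, this_rank_is_done: bool) -> bool:
    # All ranks agree on whether anyone still has real work left.
    still_running = torch.tensor(
        0.0 if this_rank_is_done else 1.0, device=input_ids.device
    )
    dist.all_reduce(still_running, op=dist.ReduceOp.SUM)
    if still_running.item() == 0.0:
        return False  # every rank is finished; the loop can stop
    # A finished rank still runs a dummy forward so the collectives inside the
    # model stay in sync. Its KV cache keeps growing, but its input_ids do not,
    # which is exactly the bookkeeping mismatch described above.
    model(input_ids=input_ids)
    return True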

This PR makes the changes needed for generation to handle the behavior above correctly.

💛 Please note that, because of the efforts in #32685, updating model input preparation requires an update in a single function, as opposed to an update per model 💛
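
Conceptually, the required change lives in that shared input-preparation step: cache_position has to keep advancing during the dummy passes (so cache indexing stays valid), while the slice taken from input_ids has to be clamped to tokens that actually exist. A minimal sketch of the idea (illustrative only, not the exact code inside prepare_inputs_for_generation):

import torch

def prepare_step_inputs(input_ids: torch.LongTensor, cache_position: torch.LongTensor):
    # On a finished rank, input_ids stops growing, so clamp the positions used
    # for slicing to the last real token while cache_position keeps advancing.
    safe_positions = cache_position.clamp(max=input_ids.shape[1] - 1)
    step_input_ids = input_ids[:, safe_positions]
    next_cache_position = cache_position + 1
    return step_input_ids, next_cache_position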


Test script (run with 2+ GPUs) that fails before this PR (from this comment):

import transformers
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, size):
    # dist.initialize_dist('gpu')

    name = 'meta-llama/Meta-Llama-3-8B-Instruct'
    tokenizer = transformers.AutoTokenizer.from_pretrained(name)
    pad_token_id = tokenizer.eos_token_id
    model = transformers.AutoModelForCausalLM.from_pretrained(name)

    # rank = dist.get_global_rank()

    model.to(f'cuda:{rank}')

    # give each rank a prompt of a different length, so the ranks finish
    # generating at different times
    if rank == 0:
        content = 'Write one short sentence.'
    else:
        content = 'Write one long paragraph.'

    messages = [
        {
            'role': 'user',
            'content': content,
        }
    ]

    tokenized_messages = tokenizer.apply_chat_template(messages, return_tensors='pt')

    # left-pad the prompt with EOS tokens so the input is roughly 4096 tokens long
    padded_messages = torch.cat(
        [
            torch.LongTensor((4096 - 20) * [pad_token_id]),
            tokenized_messages[0],  # [seq]
        ],
        dim=0,
    )
    padded_messages = padded_messages.unsqueeze(0)
    padded_messages = padded_messages.to(f'cuda:{rank}')
    # attention mask that ignores the padding tokens
    attention_mask = ~(padded_messages == pad_token_id)
    attention_mask = attention_mask.to(f'cuda:{rank}')
    output = model.generate(input_ids=padded_messages, attention_mask=attention_mask, synced_gpus=True, max_new_tokens=200)

    print(tokenizer.decode(output[0]))

def init_process(rank, size, fn, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    size = 2
    processes = []
    mp.set_start_method("spawn")
    for rank in range(size):
        p = mp.Process(target=init_process, args=(rank, size, run))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()
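
To reproduce, launch the script with plain python on a machine with at least two visible GPUs; each process loads the model onto its own device, and before this PR the run fails with a cache/shape indexing error once one rank finishes generating before the other.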

@gante (Member, Author) commented on Oct 11, 2024

@SunMarc this should help with FSDP + generate 🤗

Comment on lines -4165 to -4174
# This is needed if return_dict_in_generate is True
start_from_empty_dynamic_cache = False
past_key_values = model_kwargs.get("past_key_values", None)
if isinstance(past_key_values, DynamicCache) or (
    isinstance(past_key_values, EncoderDecoderCache)
    and isinstance(past_key_values.self_attention_cache, DynamicCache)
):
    if past_key_values.get_seq_length() == 0:
        start_from_empty_dynamic_cache = True

@gante (Member, Author) commented on Oct 11, 2024

Simplifies logic in assisted generation: see the new is_first_iteration variable and its uses :)
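
For illustration, a rough before/after of that simplification (approximate names, not the exact diff):

from transformers.cache_utils import DynamicCache

# Before: infer "first step" by probing whether the dynamic cache is still empty.
past_key_values = DynamicCache()
start_from_empty_dynamic_cache = past_key_values.get_seq_length() == 0

# After (conceptually): carry an explicit flag through the candidate-generation loop.
is_first_iteration = True
for _ in range(3):
    if is_first_iteration:
        pass  # first-step-only bookkeeping, e.g. when return_dict_in_generate=True
    is_first_iteration = False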

@ArthurZucker (Collaborator) left a comment:

The decoupling between input preparation for generation and the modeling is very nice.
I don't know how well we test this (whether the slow CIs were complaining or not), but if they were, then this path is already covered and it's good to go!

@ringohoffman (Contributor) commented:

This fixes the error I was seeing here:

Thank you so much!

@gante (Member, Author) commented on Oct 12, 2024

@ArthurZucker I don't think this is being tested!

@SunMarc -- I couldn't find any related test, but the multi-GPU tests have a more elaborate setup, so I could be missing something. Can you confirm?

Meanwhile, I'm merging since this PR unblocks users. If there is no test, I'll open a follow-up PR :)

@gante gante merged commit 37ea040 into huggingface:main Oct 12, 2024
21 of 23 checks passed
@gante gante deleted the prepare_sync_gpus branch October 12, 2024 15:45
@SunMarc (Member) commented on Oct 14, 2024

> @SunMarc -- I couldn't find any related test, but the multi-GPU tests have a more elaborate setup, so I could be missing something. Can you confirm?

I'm not aware of any tests related to multi-GPU generate with synced_gpus=True. I will have a look at this, since we also need to add them for DeepSpeed and FSDP! cc @muellerzr
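
For reference, one hypothetical shape such a test could take (the name, structure, and reuse of the init_process/run functions from the repro script above are all assumptions, not an existing test in the repository):

import torch.multiprocessing as mp
from transformers.testing_utils import require_torch_multi_gpu

@require_torch_multi_gpu
def test_generate_synced_gpus_uneven_prompts():
    # Spawn one process per GPU; each rank gets a prompt of a different length,
    # so one rank finishes early and has to keep issuing dummy forward passes.
    world_size = 2
    ctx = mp.get_context("spawn")
    procs = [
        ctx.Process(target=init_process, args=(rank, world_size, run))
        for rank in range(world_size)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # If generate raised on any rank, that process exits with a non-zero code.
    assert all(p.exitcode == 0 for p in procs)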

Successfully merging this pull request may close these issues:
- Multi GPU generate with llama shape error
- Shape mismatch when generating with multiple processes