Generate: Fix modern LLM generate calls with synced_gpus (#34095)
Conversation
@SunMarc this should help with FSDP +
```python
# This is needed if return_dict_in_generate is True
start_from_empty_dynamic_cache = False
past_key_values = model_kwargs.get("past_key_values", None)
if isinstance(past_key_values, DynamicCache) or (
    isinstance(past_key_values, EncoderDecoderCache)
    and isinstance(past_key_values.self_attention_cache, DynamicCache)
):
    if past_key_values.get_seq_length() == 0:
        start_from_empty_dynamic_cache = True
```
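For context, a small illustrative snippet (not part of the PR) of what the check above relies on: a freshly constructed `DynamicCache` reports a sequence length of 0 until the first forward pass populates it, which is how the prefill step can be detected.

```python
# Illustrative only: a new DynamicCache holds no key/value states yet,
# so get_seq_length() == 0 identifies the first (prefill) generation step.
from transformers.cache_utils import DynamicCache

cache = DynamicCache()
assert cache.get_seq_length() == 0  # no key/value states stored yet
```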
Simplifies logic in assisted generation: see the new `is_first_iteration` variable and its uses :)
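As a rough, hypothetical sketch of that pattern (the function name and slicing below are illustrative, not the exact code from the PR): a single flag is derived from the cache state instead of tracking separate per-step booleans.

```python
# Hypothetical sketch: derive "first iteration" from the cache itself.
def prepare_assistant_inputs(candidate_input_ids, past_key_values):
    is_first_iteration = (
        past_key_values is None or past_key_values.get_seq_length() == 0
    )
    if is_first_iteration:
        # prefill: the whole prompt goes through the model
        return candidate_input_ids
    # later iterations: only the tokens the cache has not seen yet
    return candidate_input_ids[:, past_key_values.get_seq_length() :]
```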
The decoupling between input preparation for generation and the modeling code is very nice.
I don't know how well we test this, i.e. whether the slow CIs were crying or not, but if they were, then it's already covered and good to go!
This fixes the error I was seeing here. Thank you so much!
@ArthurZucker I don't think this is being tested! @SunMarc -- I couldn't find any related test, but the multi-GPU tests have a more elaborate setup, so I could be missing something. Can you confirm? Meanwhile, I'm merging since this PR unblocks users. If there is no test, I'll open a follow-up PR :)
I'm not aware of any tests related to multi-GPU generation with `synced_gpus=True`. I will have a look at this, since we also need to add them for DeepSpeed and FSDP! cc @muellerzr
What does this PR do?
Step 5 in #32685
Fixes #32885
Fixes #32603
Fixes #32641
Modern LLMs, i.e. LLMs that support our cache classes, currently fail when the input has a batch size > 1 and `synced_gpus = True`. On `main`, this is what happens with `synced_gpus`:
- `cache_position` stops being updated when generation finishes on a given device, causing cache indexing errors on that device (the cache continues growing because we keep doing dummy forward passes)
- if we keep updating `cache_position` instead, then slicing `input_ids` goes out of bounds for the dummy computations (we stop updating `input_ids`, so it stops growing); see the sketch below

This PR makes the changes needed to enable generation under the conditions above.
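A minimal sketch of the slicing issue described in the second point, with assumed tensor shapes (the values below are illustrative, not taken from the PR):

```python
# Illustrative shapes only: input preparation slices input_ids with
# cache_position to keep just the tokens the cache has not seen yet.
import torch

input_ids = torch.tensor([[5, 6, 7, 8]])     # 4 tokens so far on this device
cache_position = torch.tensor([3])           # only the last position is new
step_inputs = input_ids[:, cache_position]   # -> tensor([[8]])

# During dummy forward passes, if cache_position kept advancing (e.g. to [4])
# while input_ids stopped growing, this slice would index out of bounds.
```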
💛 Please note that, because of the efforts in #32685, updating model input preparation requires an update in a single function, as opposed to an update per model 💛
Test script (call with 2+ GPUs) that fails before this PR (from this comment):
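The original reproduction script is not included here; below is a minimal sketch of the kind of call that triggers the failure, assuming 2+ GPUs and an FSDP/`accelerate launch`-style setup (omitted for brevity). The model name and prompts are illustrative placeholders.

```python
# Minimal sketch (not the original script): a batched generate call with
# synced_gpus=True, intended to be launched on 2+ GPUs, e.g. via
# `accelerate launch` or `torchrun`, with the model wrapped in FSDP.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # any "modern" LLM using cache classes
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# batch size > 1 with uneven lengths, so devices finish generating at different steps
tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer(
    ["Hello", "A much longer prompt about text generation"],
    return_tensors="pt",
    padding=True,
)

# synced_gpus=True keeps all ranks stepping together; before this PR the
# "finished" ranks hit cache_position / input_ids indexing errors.
outputs = model.generate(**inputs, max_new_tokens=20, synced_gpus=True)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```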