DO NOT MERGE: generate compatible with torch.compile(fullgraph=True) #29374

Closed
gante wants to merge 20 commits

Conversation

@gante (Member) commented on Feb 29, 2024

What does this PR do?

This PR is a 🔪 mangled 🔪 version of generate where torch.compile(model.generate, fullgraph=True) works and returns the same values. It should NOT be merged, but rather used as a reference -- other PRs will be created that push the needed changes, one at a time, to ensure we don't break other features.


Script to test correctness

from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch
import copy

torch_device = "cuda"

EXPECTED_GENERATION = [
    "The best color is the one that complements the skin tone of the",
    "We should not undermind the issues at hand.\nWe should not undermind the issues",
]

tokenizer = AutoTokenizer.from_pretrained(
    "NousResearch/Llama-2-7b-chat-hf", padding_side="left", pad_token="<s>"
)
model = AutoModelForCausalLM.from_pretrained(
    "NousResearch/Llama-2-7b-chat-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
).to(torch_device)
inputs = tokenizer(
    ["The best color is", "We should not undermind the issues at hand"], padding=True, return_tensors="pt"
).to(model.device)

generation_kwargs = {
    "do_sample": False,
    "max_new_tokens": 10,
}

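# Baseline: greedy decoding with the default dynamic cache, no compilation.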
print("Dynamic cache")
gen_out = model.generate(**inputs, **generation_kwargs)
decoded = tokenizer.batch_decode(gen_out, skip_special_tokens=True)
print(decoded)
assert decoded == EXPECTED_GENERATION

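# The static cache pre-allocates the key/value tensors to a fixed maximum length,
# so every decoding step sees the same tensor shapes.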
print("Static cache")
model.generation_config.cache_implementation = "static"
gen_out = model.generate(**inputs, **generation_kwargs)
decoded = tokenizer.batch_decode(gen_out, skip_special_tokens=True)
print(decoded)
assert decoded == EXPECTED_GENERATION

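# Compile the full generate call. fullgraph=True requires the whole call to trace
# without graph breaks; reduce-overhead additionally uses CUDA graphs to cut
# per-step kernel launch overhead.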
print("Compiled static cache")
generation_config = copy.deepcopy(model.generation_config)
generation_config.update(**generation_kwargs)
compiled_generate = torch.compile(model.generate, fullgraph=True, mode="reduce-overhead")
gen_out = compiled_generate(**inputs, generation_config=generation_config)
decoded = tokenizer.batch_decode(gen_out, skip_special_tokens=True)
print(decoded)
assert decoded == EXPECTED_GENERATION

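For reference, a related pattern that works without modifying generate is to compile only the per-token forward pass and drive a greedy decoding loop by hand over a StaticCache. The sketch below is not part of this PR and reuses the model, tokenizer, and torch_device defined above; it assumes a transformers version that exposes StaticCache and accepts cache_position in the model forward (exact keyword names may differ between releases).

from transformers import StaticCache

prompt = tokenizer(["The best color is"], return_tensors="pt").to(torch_device)
input_ids = prompt.input_ids
batch_size, prompt_len = input_ids.shape
max_new_tokens = 10

# Pre-allocate the key/value tensors to a fixed size so every decoding step
# sees the same shapes (constructor argument names may vary across versions).
past_key_values = StaticCache(
    config=model.config,
    max_batch_size=batch_size,
    max_cache_len=prompt_len + max_new_tokens,
    device=torch_device,
    dtype=torch.bfloat16,
)

compiled_forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

with torch.no_grad():
    # Prefill: run the whole prompt through the uncompiled forward once, so the
    # compiled graph only ever sees the fixed single-token decode shape.
    cache_position = torch.arange(prompt_len, device=torch_device)
    out = model(input_ids=input_ids, past_key_values=past_key_values,
                cache_position=cache_position, use_cache=True)
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated = torch.cat([input_ids, next_token], dim=-1)

    # Decode: one token per step, constant shapes, compiled graph reused each time.
    for _ in range(max_new_tokens - 1):
        cache_position = cache_position[-1:] + 1
        out = compiled_forward(input_ids=next_token, past_key_values=past_key_values,
                               cache_position=cache_position, use_cache=True)
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)

print(tokenizer.batch_decode(generated, skip_special_tokens=True))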
Fixes #27837

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@amyeroberts (Collaborator) commented

@gante If this PR is going to be long-lived, you can add the WIP label and it will stop the bot from closing it as stale.

@ArthurZucker added the WIP, Compilation, and Cache labels on Apr 22, 2024
@gante mentioned this pull request on May 13, 2024
@gante (Member, Author) commented on May 29, 2024

Closed in favor of #30788

@gante closed this on May 29, 2024
Labels
Cache, Compilation, WIP
Projects
None yet
Development

Successfully merging this pull request may close these issues.

torch CUDA graphs with HF generate
4 participants