Generate: assisted decoding with sample #22862

gante · 2023-04-19T14:42:36Z

What does this PR do?

This PR expands the previous assisted generation PR so as to work with sampling.

Two important notes to review the PR:

1. I'd suggest starting the review with the docs, so you understand what's going on at a high level. Sampling adds an additional (controllable) heuristic, so the user can trade off speed against pure sampling behavior.
2. In terms of implementation, I've decided to overload the assisted generation function with a few extra lines to handle the sample case. This avoids adding a near copy of a 500-line function.

Below are some results, so you can understand the balancing act. Execution times were measured on a 3090.

Script

from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch
import time

model_id = "EleutherAI/pythia-6.9b-deduped"
assistant_id = "EleutherAI/pythia-160m-deduped"

tokenizer = AutoTokenizer.from_pretrained(model_id)

assistant_model = AutoModelForCausalLM.from_pretrained(assistant_id)
assistant_model = assistant_model.to("cuda")

model_kwargs = {
    "pretrained_model_name_or_path": model_id,
    "device_map": "auto",
    "max_memory": {0: "20GiB", "cpu": "50GiB"},
    "torch_dtype": torch.float16,
}
model = AutoModelForCausalLM.from_pretrained(**model_kwargs)

inputs = tokenizer("Here's how to cook a good ramen:", return_tensors="pt").to("cuda")

streamer = TextStreamer(tokenizer=tokenizer)

print("Greedy with assistance:")
start = time.time()
model.generate(**inputs, assistant_model=assistant_model, streamer=streamer, max_new_tokens=64)
print(f"Elapsed time: {time.time() - start:.2f} seconds")

for p in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    print(f"Sample with assistance (assisted_keep_proba = {p})")
    torch.manual_seed(0)
    start = time.time()
    model.generate(
        **inputs,
        do_sample=True,
        assistant_model=assistant_model,
        assisted_keep_proba=p,
        streamer=streamer,
        max_new_tokens=64
    )
    print(f"Elapsed time: {time.time() - start:.2f} seconds")

print("Original sample")
torch.manual_seed(0)
start = time.time()
model.generate(**inputs, do_sample=True, streamer=streamer, max_new_tokens=64)
print(f"Elapsed time: {time.time() - start:.2f} seconds")

Sample results

Decoding strategy	Result	Execution time
Greedy (w/assistance)	Here's how to cook a good ramen: 1. Make sure you have a good stock. 2. Make sure you have a good broth. 3. Make sure you have a good ramen. 4. Make sure you have a good ramen. 5. Make sure you have a good ramen.	1.44 seconds
Sample (w/assistance `assisted_keep_proba=0.0`)	Here's how to cook a good ramen: 1. Get a noodle. 2. Get a stock. 3. Get a packet of dried ingredients. 4. Cook the noodles. 5. Cook the stock. 6. Cook the packet of dried ingredients. 7. Enjoy! And	1.44 seconds
Sample (w/assistance `assisted_keep_proba=0.2`)	Here's how to cook a good ramen: 1. Get a noodle vendor. The noodle vendor makes the noodles. Japanese restaurants often have the noodle vendor on-site. 2. Get a pot. The pot is used to cook ramen. 3. Get a pot of boiling water.	1.59 seconds
Sample (w/assistance `assisted_keep_proba=0.4`)	Here's how to cook a good ramen: Step 1: Collect your ingredients. For this recipe you need a big stock pot. That's good. And some water. Step 2: Peel the eggs. Yes, that's it. Four eggs. Step 3: Separate the yolks.	1.71 seconds
Sample (w/assistance `assisted_keep_proba=0.6`)	Here's how to cook a good ramen: Nothing much to take out of the packet. Just a big block of pork fat, some Chinese chilli paste and seasonings. Preheat the oven to 210ºC (410ºF/Gas 6). Place the pork fat, chilli paste and seasoning into a mixing bowl and	2.08 seconds
Sample (w/assistance `assisted_keep_proba=0.8`)	Here's how to cook a good ramen: You'll need: A large pot for boiling noodles A small saucepan for cooking the noodles BBQ chicken or roasted fish, or any grilled healthy protein A box of ramen noodles, noodles that come in shapes and sizes Soups or broth,	2.32 seconds
Sample (w/assistance `assisted_keep_proba=1.0`)	Here's how to cook a good ramen: You take your pre-scalloped noodles, pour boiling water (or your preferred water-to-noodle ratio) over them, and leave them alone for four to five minutes. Once that's done, drain them, season with salt, and heat them up on the stove (microwave won	2.56 seconds
Original sample	Here's how to cook a good ramen: You take your pre-scalloped noodles, pour boiling water (or your preferred cooking liquid) over it, and after that you go get your ramen broth, add-ins, and other condiments. You make your seasoning sauce, and heat that up. Mix it all together, and put	2.05 seconds

As can be seen above, there is a trade-off between execution time and quality. This will certainly be application-specific: factual applications will be able to make the most of assisted decoding. In my brief experiments, `assisted_keep_proba=0.3` seems like a sensible default.
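To put numbers on the trade-off, the timings from the table above can be turned into speedups relative to plain sampling. This is a small sketch over the single-run measurements reported here (the variable names are mine, not part of the PR), so treat the ratios as indicative only.

```python
# Execution times (seconds) copied from the results table above.
assisted_times = {0.0: 1.44, 0.2: 1.59, 0.4: 1.71, 0.6: 2.08, 0.8: 2.32, 1.0: 2.56}
original_sample_time = 2.05  # plain sampling, no assistant model

# Speedup > 1.0 means assisted sampling was faster than plain sampling.
speedups = {p: original_sample_time / t for p, t in assisted_times.items()}

# Thresholds for which assistance actually paid off in this run.
faster = [p for p, t in assisted_times.items() if t < original_sample_time]

for p, s in speedups.items():
    print(f"assisted_keep_proba={p}: {s:.2f}x")
```

On these numbers, values up to 0.4 are faster than plain sampling, while 0.6 and above are slower.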

HuggingFaceDocBuilderDev · 2023-04-19T15:01:21Z

The documentation is not available anymore as the PR was closed or merged.

sgugger

Thanks for working on this! I think the doc can be made a little bit better before we merge this.

sgugger · 2023-04-20T13:03:59Z

docs/source/en/generation_strategies.mdx

@@ -359,3 +360,26 @@ To enable assisted generation, set the `assistant_model` argument with a model.
 >>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
 ['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']
 ```
+
+When using assisted decoding with sampling methods, the `assisted_keep_proba` argument will balance speed with


I do not like this name, and I don't completely understand what this argument does from the doc, so I can't suggest a new one 😅 `assisted_threshold` maybe?

sgugger · 2023-04-20T13:04:15Z

src/transformers/generation/configuration_utils.py

+        assisted_keep_proba (`float`, *optional*):
+            Used with assisted decoding. When `do_sample` is true, this controls the threshold at which the model will
+            resample candidate tokens. When the model's predicted probability for a candidate token is below this
+            threshold, the candidate token is invalidated and a sampling step. Decreasing this value will aproximate


and a sampling step is performed?

sgugger · 2023-04-20T13:05:20Z

src/transformers/generation/utils.py

+            if do_sample:
+                probs = new_logits[:, -candidate_length - 1 :, :].softmax(dim=-1)
+                max_probs, max_logits = probs[:, :-1, :].topk(1, dim=-1)
+                max_logits[max_probs < assisted_keep_proba] = -1  # invalidate candidate tokens with low proba


So yeah it looks like assisted_threshold could work?
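For readers following the hunk above, here is a minimal sketch of what the threshold appears to do, written in plain Python lists rather than tensors. It is my reading of the four quoted lines, not the PR's implementation: `count_accepted` and its arguments are hypothetical names, and the comparison against the candidate tokens is an assumption about the code that follows the hunk.

```python
def count_accepted(candidate_tokens, main_probs, keep_proba):
    """Count how many leading candidate tokens survive the threshold check.

    candidate_tokens: token ids proposed by the assistant model.
    main_probs: one probability distribution (list of floats) per candidate
        position, as predicted by the main model.
    keep_proba: candidates whose position has a top probability below this
        value are invalidated, which forces a regular sampling step there.
    """
    accepted = 0
    for token, probs in zip(candidate_tokens, main_probs):
        # Most likely token according to the main model (top-1).
        top_token = max(range(len(probs)), key=probs.__getitem__)
        if probs[top_token] < keep_proba:
            top_token = -1  # invalidate: no real token id can match -1
        if token != top_token:
            break  # first mismatch ends the accepted prefix
        accepted += 1
    return accepted


candidates = [2, 0, 1]
probs = [[0.1, 0.1, 0.8], [0.5, 0.3, 0.2], [0.2, 0.6, 0.2]]
print(count_accepted(candidates, probs, 0.0))  # nothing invalidated
print(count_accepted(candidates, probs, 0.6))  # second position invalidated
print(count_accepted(candidates, probs, 1.0))  # first position invalidated
```

Under this reading, a value of 0.0 never invalidates anything and the check reduces to greedy matching, while a value of 1.0 invalidates almost every candidate, which fits the timings reported in the PR description.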

amyeroberts

LGTM - thanks for adding this, and for the detailed script and results! ❤️ 🚀

amyeroberts · 2023-04-20T12:17:59Z

src/transformers/generation/configuration_utils.py

@@ -179,6 +181,11 @@ class GenerationConfig(PushToHubMixin):
            A list of pairs of integers which indicates a mapping from generation indices to token indices that will be
            forced before sampling. For example, `[[1, 123]]` means the second generated token will always be a token
            of index 123.
+        assisted_keep_proba (`float`, *optional*):


Default value should be mentioned here

amyeroberts · 2023-04-20T12:26:35Z

src/transformers/generation/utils.py

+                When `do_sample` is true, this controls the threshold at which the model will resample candidate
+                tokens. When the model's predicted probability for a candidate token is below this threshold, the
+                candidate token is invalidated and a sampling step. Decreasing this value will aproximate the decoding
+                process to greedy search, but it will be faster.
            logits_processor (`LogitsProcessorList`, *optional*):
                An instance of [`LogitsProcessorList`]. List of instances of class derived from [`LogitsProcessor`]
                used to modify the prediction scores of the language modeling head applied at each generation step.


`logits_warper` is missing from the docstring here

gante · 2023-04-22T18:24:56Z

I'm closing this PR because I found a much much better way to handle the sample case 🧠

Stay tuned 🚀

gante added 2 commits April 19, 2023 09:48

add code and docs

85161d2

add tests (and fix corner cases)

c7ccfbe

gante requested review from sgugger and amyeroberts April 19, 2023 17:19

sgugger reviewed Apr 20, 2023

amyeroberts approved these changes Apr 20, 2023

gante closed this Apr 22, 2023

gante mentioned this pull request Apr 23, 2023

Generate: assisted generation with sample (take 2) #22949

Merged

gante deleted the assisted_sample branch May 18, 2023 15:25
