Generate: speculative decoding #27979

gante · 2023-12-12T15:57:32Z

What does this PR do?

Useful context:
In a recent PR (#27750), the candidate generation in assisted generation got abstracted, so we can host new candidate generation techniques (such as #27722).

This PR:

Reworks assisted candidate generation to call .generate(), instead of having its own custom generation loop. For most models this is nothing more than a nice abstraction. However, for models with a custom generate() function, this means the assistant model will now make use of it! (🤔 does this mean that DistilWhisper gets better numbers with this refactor?) Edit: moved to Generate: assisted decoding now uses generate for the assistant #28030
Adds speculative decoding (paper, see Algorithm 1). This implied a minor interface change in the candidate generation class, which should be okay since it hasn't been released :)

The following tests were run locally and are passing:

RUN_SLOW=1 py.test tests/models/whisper/ -k speculative
py.test tests/ -k test_assisted (which now triggers speculative decoding)

TODO:

Benchmark speculative decoding

gante · 2023-12-12T17:13:21Z

@patrickvonplaten tagging you here for a 2nd set of eyes on the speculative decoding method (changes in utils.py), which I'm assuming you're familiar with. Feel free to delegate to someone else who is familiar with the method! 🤗

gante · 2023-12-12T17:14:25Z

src/transformers/generation/utils.py

+            if do_sample and candidate_logits is not None:
+                # Gets the probabilities from the logits. q_i and p_i denote the model and assistant (respectively)
+                # probabilities of the tokens selected by the assistant.
+                q = candidate_logits.softmax(dim=-1)


These are not the best variable names, but it's hard to compare against the original algorithm if they don't match 🤔 As such, I've decided to keep the original names

I'm fine with it as there's good comments and other variables are well names e.g. is_rejected :)

amyeroberts · 2023-12-13T14:03:28Z

Thanks for adding this! Can we split this up into two separate PRs: one changing the assisted generation and the other adding speculative decoding?

gante · 2023-12-14T09:58:18Z

@amyeroberts pulled the assisted generation changes into this PR: #28030

After it is merged, I will rebase this one and ping you again -- this one will become exclusively about speculative decoding 🤗

gante · 2023-12-14T14:18:32Z

@amyeroberts I've rerun the slow tests, and I can confirm they are passing. Ready for a review :)

amyeroberts

Thanks for adding this!

Can we add some tests, in particular one which checks case 1. and one which makes sure the correct logic branch is being selected e.g. checking candidate_logits is None when expected (might be a test on the candidate generator instead)?

amyeroberts · 2023-12-15T17:19:35Z

src/transformers/generation/utils.py

+            if do_sample and candidate_logits is not None:
+                # Gets the probabilities from the logits. q_i and p_i denote the model and assistant (respectively)
+                # probabilities of the tokens selected by the assistant.
+                q = candidate_logits.softmax(dim=-1)


I'm fine with it as there's good comments and other variables are well names e.g. is_rejected :)

src/transformers/generation/utils.py

patrickvonplaten · 2023-12-15T20:39:56Z

src/transformers/generation/utils.py

+                if do_sample:
+                    probs = new_logits.softmax(dim=-1)
+                    selected_tokens = torch.multinomial(probs[0, :, :], num_samples=1).squeeze(1)[None, :]
+                else:
+                    selected_tokens = new_logits.argmax(dim=-1)


Suggested change

if do_sample:

probs = new_logits.softmax(dim=-1)

selected_tokens = torch.multinomial(probs[0, :, :], num_samples=1).squeeze(1)[None, :]

else:

selected_tokens = new_logits.argmax(dim=-1)

if do_sample:

probs = new_logits.softmax(dim=-1)

selected_tokens = torch.multinomial(probs[0, :, :], num_samples=1).squeeze(1)[None, :]

else:

selected_tokens = new_logits.argmax(dim=-1)

It's probably time to soon factor this out into something like:

selected_tokens = Categorical(new_logits / temperature).sample()

everywhere in generate

Yes! Then equivalent sampling/non-sampling methods (e.g. greedy decoding/samplinh) could be merged into a single function, facilitating maintenance. I'm going to leave it to a follow-up PR, though, to keep this PR exclusively about speculative decoding.

patrickvonplaten · 2023-12-15T20:45:45Z

src/transformers/generation/utils.py

            else:
-                selected_tokens = new_logits.argmax(dim=-1)
+                if do_sample:
+                    probs = new_logits.softmax(dim=-1)


is this case still relevant? Not sure it's a good idea to have two "assisted decoding" do_sample=True cases in our generate. Should we maybe just deprecate this case?

src/transformers/generation/utils.py

patrickvonplaten

Super cool addition!

Not really related to this PR, but I feel like we should start putting all the generation submethods (assisted decoding, greedy & sample (guess we can merge these two), beam search, ...) into their own files by now

My only important comment here is that I don't think it's great that we have 2 assisted generation cases now where do_sample=True. Can we deprecate the "non-official" one?

gante · 2023-12-17T19:38:29Z

@patrickvonplaten the two types of sampling are needed :D

New candidate-based methods are popping up (e.g. #27775), and they don't necessarily have logits. As such, speculative decoding, which needs the candidates' logits, can't be applied to those methods.

patrickvonplaten · 2023-12-18T10:11:08Z

@patrickvonplaten the two types of sampling are needed :D

New candidate-based methods are popping up (e.g. #27775), and they don't necessarily have logits. As such, speculative decoding, which needs the candidates' logits, can't be applied to those methods.

But shouldn't they just be the "own" method now? I.e. I don't think we should put #27775 into the speculative decoding method no?

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

gante · 2023-12-18T10:53:37Z

@patrickvonplaten #27775 does not introduce changes to assisted generation 🤗 In #28030 I've abstracted the candidate generation part of assisted generation. We now load candidate generators the same way as we load the logits processors:

transformers/src/transformers/generation/utils.py

Lines 899 to 919 in e6dcf8a

    
               def _get_candidate_generator( 
        
                   self, 
        
                   generation_config: GenerationConfig, 
        
                   input_ids: torch.LongTensor, 
        
                   inputs_tensor: torch.Tensor, 
        
                   assistant_model: "PreTrainedModel", 
        
                   logits_processor: LogitsProcessorList, 
        
                   model_kwargs: Dict, 
        
               ) -> CandidateGenerator: 
        
                   """ 
        
                   Returns the candidate generator to be used in `assisted_generation` 
        
                   """ 
        
                   candidate_generator = AssistedCandidateGenerator( 
        
                       input_ids=input_ids, 
        
                       assistant_model=assistant_model, 
        
                       logits_processor=logits_processor, 
        
                       model_kwargs=model_kwargs, 
        
                       inputs_tensor=inputs_tensor, 
        
                       eos_token_id=generation_config.eos_token_id, 
        
                   ) 
        
                   return candidate_generator

In assisted generation, we call the candidate generator to get candidate sequences (which may or may not contain associated logits, depending on the method)

transformers/src/transformers/generation/utils.py

Line 4588 in e6dcf8a

    
           candidate_input_ids, candidate_logits = candidate_generator.get_candidates(input_ids)

The technique in #27775 can thus be added by adding a new candidate generator in _get_candidate_generator. Other candidate generators may be added the same way, enabling users to experiment with the concept of candidates!

Because needing the logits (for speculative decoding) is a very limiting constraint, I'd rather keep the two sampling paths.

HuggingFaceDocBuilderDev · 2023-12-18T11:07:47Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

gante · 2023-12-18T11:56:52Z

@amyeroberts PR comments addressed 🤗

@patrickvonplaten Unless you don't strongly oppose, I'd like to keep the two sampling paths, for the reasons I've written here -- I think it will be beneficial in the long run! :) (otherwise, a whole new generation method has to be written for #27775)

gante · 2023-12-18T15:57:19Z

@amyeroberts -- @patrickvonplaten and I had a chat about whether to keep the two sampling paths or not. For context, here's what we agreed on:

It's okay to leave it as is, and perhaps abstract the different ways we accept candidates into a candidate_checker block.
Be conservative on adding new candidate generators, so we don't end up with unused methods
[in a follow-up PR] squash other cases where the decoding method is the same except for the token selection, like greedy_decoding + sample
[in a follow-up PR] mode each decoding method into its own file. There are several private functions in generation/utils.py that are exclusively used with one generation method.

amyeroberts

Thanks for iterating!

* speculative decoding * fix test * space * better comments * remove redundant test * test nit * Apply suggestions from code review Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * PR comments --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

jmamou · 2024-01-18T17:01:24Z

@gante
According to experiments reported in Leviathan's paper, speculative decoding (SD) has higher speedup with greedy decoding (temp=0). However, in the current implementation, SD works only with do_sample=True.

gante · 2024-01-19T11:13:01Z

@jmamou speculative decoding with do_sample=False (or temp=0) was already encoded in assisted_generation, before this PR -- try calling model.generate(input_ids, do_sample=False, assistant_model=assistant_model) :)

jmamou · 2024-01-21T10:21:37Z

@gante
Since acceptance criteria are different between speculative decoding and assisted generation, I think that it would be great to be able to run both speculative decoding and assisted generation with no sampling.

jmamou · 2024-01-21T12:39:58Z

@gante
I implemented it. I can submit a PR.

jmamou · 2024-01-24T13:45:29Z

@gante
In previous implementation of assisted generation (4.33) with heuristical update of num_assistant_tokens (or max_assistant_tokens), the value of num_assistant_tokens was preserved between 2 consecutive generate() calls.

In current implementation (4.38), num_assistant_tokens is updated by the candidate_generator during the generation but assistant_model.generation_config.num_assistant_tokens is not updated at the end of the generation. Therefore, next call to generate will start with the initial value of assistant_model.generation_config.num_assistant_tokens (5).

Is it intentional? If that's a bug, I can open a PR to fix it.

gante · 2024-01-27T16:11:58Z

@jmamou

Since acceptance criteria are different between speculative decoding and assisted generation, I think that it would be great to be able to run both speculative decoding and assisted generation with no sampling.

Not sure if this is a good idea

if we see greedy decoding as applying temperature=0, the model probability will be 1 at the most likely token and 0 everywhere else. In turn, this implies that p_i/q_i is >=1 at all positions, and thus all candidate tokens would be accepted 👉 speculative decoding would be the same as simply using the assistant model
If we don't apply temperature=0, then it would be sampling -- in other words, it wouldn't be greedy decoding

In previous implementation of assisted generation (4.33) with heuristical update of num_assistant_tokens (or max_assistant_tokens), the value of num_assistant_tokens was preserved between 2 consecutive generate() calls.
In current implementation (4.38), num_assistant_tokens is updated by the candidate_generator during the generation but assistant_model.generation_config.num_assistant_tokens is not updated at the end of the generation. Therefore, next call to generate will start with the initial value of assistant_model.generation_config.num_assistant_tokens (5).
Is it intentional? If that's a bug, I can open a PR to fix it.

This is a good point! A PR to revert to the previous behaviour (with a test) would be appreciated 🙏

gante force-pushed the candidate_generate_refactor branch from 18a4eda to 993c9ee Compare December 12, 2023 17:08

gante requested review from patrickvonplaten and amyeroberts December 12, 2023 17:10

gante marked this pull request as ready for review December 12, 2023 17:11

gante commented Dec 12, 2023

View reviewed changes

gante mentioned this pull request Dec 14, 2023

Generate: assisted decoding now uses generate for the assistant #28030

Merged

gante added 4 commits December 14, 2023 14:03

speculative decoding

8dbb065

fix test

a726936

space

7e4deab

better comments

e234e1e

gante force-pushed the candidate_generate_refactor branch from 7bf05a9 to e234e1e Compare December 14, 2023 14:03

gante added 2 commits December 14, 2023 14:04

remove redundant test

b4dab21

test nit

f2f99f3

amyeroberts reviewed Dec 15, 2023

View reviewed changes

patrickvonplaten reviewed Dec 15, 2023

View reviewed changes

src/transformers/generation/utils.py Show resolved Hide resolved

patrickvonplaten reviewed Dec 15, 2023

View reviewed changes

src/transformers/generation/utils.py Outdated Show resolved Hide resolved

patrickvonplaten reviewed Dec 15, 2023

View reviewed changes

Apply suggestions from code review

64c59a5

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

PR comments

c7f1d12

patrickvonplaten approved these changes Dec 18, 2023

View reviewed changes

amyeroberts approved these changes Dec 18, 2023

View reviewed changes

gante merged commit ac97419 into huggingface:main Dec 19, 2023
21 checks passed

gante deleted the candidate_generate_refactor branch January 9, 2024 16:06

wasertech mentioned this pull request Jan 17, 2024

Add support for prompt-lookup speculative decoding vllm-project/vllm#2469

Closed

jmamou mentioned this pull request Jan 29, 2024

fix num_assistant_tokens with heuristic schedule #28759

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Generate: speculative decoding #27979

Generate: speculative decoding #27979

gante commented Dec 12, 2023 •

edited

Loading

gante commented Dec 12, 2023

gante Dec 12, 2023

amyeroberts Dec 15, 2023

amyeroberts commented Dec 13, 2023

gante commented Dec 14, 2023 •

edited

Loading

gante commented Dec 14, 2023 •

edited

Loading

amyeroberts left a comment

amyeroberts Dec 15, 2023

patrickvonplaten Dec 15, 2023 •

edited

Loading

gante Dec 18, 2023

patrickvonplaten Dec 15, 2023

patrickvonplaten left a comment •

edited

Loading

gante commented Dec 17, 2023

patrickvonplaten commented Dec 18, 2023

gante commented Dec 18, 2023 •

edited

Loading

HuggingFaceDocBuilderDev commented Dec 18, 2023

gante commented Dec 18, 2023 •

edited

Loading

gante commented Dec 18, 2023 •

edited

Loading

amyeroberts left a comment

jmamou commented Jan 18, 2024

gante commented Jan 19, 2024 •

edited

Loading

jmamou commented Jan 21, 2024

jmamou commented Jan 21, 2024

jmamou commented Jan 24, 2024

gante commented Jan 27, 2024 •

edited

Loading

Generate: speculative decoding #27979

Generate: speculative decoding #27979

Conversation

gante commented Dec 12, 2023 • edited Loading

What does this PR do?

gante commented Dec 12, 2023

gante Dec 12, 2023

Choose a reason for hiding this comment

amyeroberts Dec 15, 2023

Choose a reason for hiding this comment

amyeroberts commented Dec 13, 2023

gante commented Dec 14, 2023 • edited Loading

gante commented Dec 14, 2023 • edited Loading

amyeroberts left a comment

Choose a reason for hiding this comment

amyeroberts Dec 15, 2023

Choose a reason for hiding this comment

patrickvonplaten Dec 15, 2023 • edited Loading

Choose a reason for hiding this comment

gante Dec 18, 2023

Choose a reason for hiding this comment

patrickvonplaten Dec 15, 2023

Choose a reason for hiding this comment

patrickvonplaten left a comment • edited Loading

Choose a reason for hiding this comment

gante commented Dec 17, 2023

patrickvonplaten commented Dec 18, 2023

gante commented Dec 18, 2023 • edited Loading

HuggingFaceDocBuilderDev commented Dec 18, 2023

gante commented Dec 18, 2023 • edited Loading

gante commented Dec 18, 2023 • edited Loading

amyeroberts left a comment

Choose a reason for hiding this comment

jmamou commented Jan 18, 2024

gante commented Jan 19, 2024 • edited Loading

jmamou commented Jan 21, 2024

jmamou commented Jan 21, 2024

jmamou commented Jan 24, 2024

gante commented Jan 27, 2024 • edited Loading

gante commented Dec 12, 2023 •

edited

Loading

gante commented Dec 14, 2023 •

edited

Loading

gante commented Dec 14, 2023 •

edited

Loading

patrickvonplaten Dec 15, 2023 •

edited

Loading

patrickvonplaten left a comment •

edited

Loading

gante commented Dec 18, 2023 •

edited

Loading

gante commented Dec 18, 2023 •

edited

Loading

gante commented Dec 18, 2023 •

edited

Loading

gante commented Jan 19, 2024 •

edited

Loading

gante commented Jan 27, 2024 •

edited

Loading