Granite speech speedup + model saving bugfix #39028
Conversation
- avoid unused parameters that DDP does not like
- trainers often pass this argument automatically
- this ensures save_pretrained will not crash when saving the processor during training https://github.com/huggingface/transformers/blob/d5d007a1a0f0c11a726a54c8f00bd71825f84d02/src/transformers/feature_extraction_utils.py#L595
- avoid modifying `self._hf_peft_config_loaded` when saving
- adapter_config automatically points to the original base model; a finetuned version should point to the model save dir
- fixing model weights names that are changed by adding an adapter
- …transformers into granite_speech_updates
# rel_pos_emb_expanded = rel_pos_emb.view([1, 1, 1] + list(rel_pos_emb.shape))
# pos_attn = torch.sum(query_states.unsqueeze(-2) * rel_pos_emb_expanded, dim=-1) * self.scale
# einsum gives x30 speedup:
pos_attn = torch.einsum('b m h c d, c r d -> b m h c r', query_states, rel_pos_emb) * self.scale
einsum runs significantly faster (measured with 500 repetitions), and has a smaller memory footprint:
einsum: 25.089 ms
existing (explicit dot): 594.220 ms
I was hoping we could use the einsum implementation to speed up inference/finetuning, and keep the equivalent formulation for readability.
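For reference, here is a minimal standalone check that the einsum form matches the explicit dot product (the shapes are made up for illustration, not the actual Granite speech configuration):

import torch

b, m, h, c, r, d = 2, 3, 4, 16, 31, 32  # batch, blocks, heads, context, relative positions, head dim
query_states = torch.randn(b, m, h, c, d)
rel_pos_emb = torch.randn(c, r, d)
scale = d ** -0.5

# einsum form used in the PR
pos_attn_einsum = torch.einsum("b m h c d, c r d -> b m h c r", query_states, rel_pos_emb) * scale

# explicit broadcast-and-sum form kept in the comment
rel_pos_emb_expanded = rel_pos_emb.view([1, 1, 1] + list(rel_pos_emb.shape))
pos_attn_dot = torch.sum(query_states.unsqueeze(-2) * rel_pos_emb_expanded, dim=-1) * scale

print(torch.allclose(pos_attn_einsum, pos_attn_dot, atol=1e-5))  # True, up to float tolerance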
if is_peft_available and self._hf_peft_config_loaded:
super().save_pretrained(*args, **kwargs)
adapter_name = self._get_adapter_name()
self.peft_config[adapter_name].base_model_name_or_path = save_directory
ensures the adapter config points to the finetuned model
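A quick way to sanity-check this after the change (the save directory name and the `model` variable are hypothetical; `model` stands for a Granite speech model with a loaded adapter):

import json, os

save_directory = "./granite-speech-finetuned"  # hypothetical path
model.save_pretrained(save_directory)

with open(os.path.join(save_directory, "adapter_config.json")) as f:
    adapter_config = json.load(f)

# with the fix, the adapter config points at the finetuned checkpoint
# rather than at the original base model
print(adapter_config["base_model_name_or_path"])  # "./granite-speech-finetuned"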
self._hf_peft_config_loaded = False
super().save_pretrained(*args, **kwargs)
super().save_pretrained(save_directory, *args, **kwargs)
self._hf_peft_config_loaded = prev_val
bugfix: ensures save_pretrained does not change the original value
@staticmethod
def _fix_state_dict_key_on_save(key) -> Tuple[str, bool]:
    # save the model with the original weights format
    return key.replace(".base_layer", ""), False
The adapter changes the original parameter names by adding `.base_layer` to each one.
This hack enables save_pretrained() and from_pretrained() to work as expected.
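For illustration, this is roughly what the renaming does to a couple of state dict keys (the module paths here are hypothetical):

# example PEFT-wrapped keys; module paths are made up
wrapped_keys = [
    "encoder.layers.0.self_attn.q_proj.base_layer.weight",
    "encoder.layers.0.self_attn.k_proj.base_layer.weight",
]

for key in wrapped_keys:
    # mirrors _fix_state_dict_key_on_save: strip ".base_layer" so the saved
    # checkpoint keeps the original (non-PEFT) parameter names
    print(key, "->", key.replace(".base_layer", ""))

# encoder.layers.0.self_attn.q_proj.base_layer.weight -> encoder.layers.0.self_attn.q_proj.weight
# encoder.layers.0.self_attn.k_proj.base_layer.weight -> encoder.layers.0.self_attn.k_proj.weight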
Thanks! Happy to merge as is, but in general einsum is not magic; there is an equivalent implementation out there that only uses matrix notation!
# faster implementation, equivalent to:
# rel_pos_emb_expanded = rel_pos_emb.view([1, 1, 1] + list(rel_pos_emb.shape))
# pos_attn = torch.sum(query_states.unsqueeze(-2) * rel_pos_emb_expanded, dim=-1) * self.scale
# einsum gives x30 speedup:
pos_attn = torch.einsum("b m h c d, c r d -> b m h c r", query_states, rel_pos_emb) * self.scale
If einsum is possible I am fairly sure there is a way to do this with just matrix notation! We always avoid einsum in transformers!
Let's add a TODO here as you probably want to have this merged fast!
I really tried finding an alternative!
matmul is not a great choice, since we need a vectorized dot product, and matmul would do redundant computations.
vecdot was a promising direction, but it was actually slower.
I think the einsum speedup has to do with either broadcasting or the half-precision kernels - not sure.
I'll make sure to update it if I learn something new.
Adding a todo - Thanks Arthur!
Hi @avihu111, just saw this while on watch! Try
(query_states.unsqueeze(-2) @ rel_pos_emb.transpose(-1, -2)).squeeze(-2)
Thanks @Rocketknight1 for a great suggestion.
That's a cool trick to get bmm to perform a vectorized dot product.
For some reason, it still performs on par with the explicit dot product, which is still about 50x slower than einsum 😮
I ran the following code to compare all methods:
        # run inside the attention forward, where query_states, rel_pos_emb,
        # self.scale and the reference pos_attn (the einsum result) are in scope
        import time

        for method in ["einsum", "explicit_dot", "vecdot", "bmm"]:
            with torch.amp.autocast("cuda", torch.bfloat16):
                t1 = time.time()
                for _ in range(500):
                    if method == "einsum":
                        cur_pos_attn = torch.einsum('b m h c d, c r d -> b m h c r', query_states, rel_pos_emb) * self.scale
                    elif method == "explicit_dot":
                        rel_pos_emb_expanded = rel_pos_emb.view([1, 1, 1] + list(rel_pos_emb.shape))
                        cur_pos_attn = torch.sum(query_states.unsqueeze(-2) * rel_pos_emb_expanded, dim=-1) * self.scale
                    elif method == "vecdot":
                        rel_pos_emb_expanded = rel_pos_emb.view([1, 1, 1] + list(rel_pos_emb.shape))
                        cur_pos_attn = torch.linalg.vecdot(query_states.unsqueeze(-2), rel_pos_emb_expanded, dim=-1) * self.scale
                    elif method == "bmm":
                        cur_pos_attn = (query_states.unsqueeze(-2) @ rel_pos_emb.transpose(-1, -2)).squeeze(-2) * self.scale
                print(f"{method} took {(time.time() - t1) * 1000:.3f} ms\t max abs diff is {(cur_pos_attn - pos_attn).abs().max().item():.5f}")
Results:
einsum took 27.996 ms    max abs diff is 0.00000
explicit_dot took 1450.317 ms    max abs diff is 0.01862
vecdot took 919.140 ms   max abs diff is 0.03125
bmm took 1426.787 ms     max abs diff is 0.00195
Thanks for the thorough checks! 🤗 makes a lot of sense when we have this huge perf diff!
What does this PR do?
Speeding up the encoder
Reverting Shaw's positional embedding calculation to einsum results in a significant speedup in both inference and training.
We found it to be about 30x faster than the current explicit dot product using bfloat16.
I kept the explicit dot product in a comment for readability - I hope that is acceptable.
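For readers unfamiliar with the notation, the einsum subscripts map to the following dimensions (the sizes below are illustrative, not the real Granite speech configuration):

import torch

batch, blocks, heads, context, rel_positions, head_dim = 2, 3, 4, 16, 31, 32

query_states = torch.randn(batch, blocks, heads, context, head_dim)  # "b m h c d"
rel_pos_emb = torch.randn(context, rel_positions, head_dim)          # "c r d"

# for each query position c, dot its head_dim vector against each of its
# relative-position embedding vectors, giving one score per relative position
pos_attn = torch.einsum("b m h c d, c r d -> b m h c r", query_states, rel_pos_emb)
print(pos_attn.shape)  # torch.Size([2, 3, 4, 16, 31])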
Fixing issues with loading and saving with an adapter
- avoid modifying `_hf_peft_config_loaded` when saving
- adapter_config automatically points to the original base model; a finetuned version should point to the model save dir
- fixing model weights names that are changed by adding an adapter

Maybe there's a better solution for the problems I was facing - I'll be happy to hear your opinion.
I added comments on each code change, along with the necessary context and justification for the change.
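Putting the review snippets together, here is a self-contained toy sketch of how the pieces compose. FakeBase and GraniteSpeechSketch are made-up stand-ins (not the PR's actual classes); only the method bodies mirror the diff excerpts above, and is_peft_available is replaced by a constant for brevity:

is_peft_available = True  # stand-in for transformers' PEFT availability check

class FakeBase:
    # stand-in for PreTrainedModel.save_pretrained
    def save_pretrained(self, save_directory, *args, **kwargs):
        print(f"base save_pretrained -> {save_directory} (peft flag = {self._hf_peft_config_loaded})")

class GraniteSpeechSketch(FakeBase):
    def __init__(self):
        self._hf_peft_config_loaded = True
        self.peft_config = {"default": type("AdapterCfg", (), {"base_model_name_or_path": "original-base-model"})()}

    def _get_adapter_name(self):
        return "default"

    def save_pretrained(self, save_directory, *args, **kwargs):
        prev_val = self._hf_peft_config_loaded
        if is_peft_available and self._hf_peft_config_loaded:
            # point the saved adapter config at the finetuned checkpoint,
            # not at the original base model
            adapter_name = self._get_adapter_name()
            self.peft_config[adapter_name].base_model_name_or_path = save_directory
        # temporarily clear the PEFT flag for the base-class save,
        # then restore it so saving has no side effects on the live model
        self._hf_peft_config_loaded = False
        super().save_pretrained(save_directory, *args, **kwargs)
        self._hf_peft_config_loaded = prev_val

    @staticmethod
    def _fix_state_dict_key_on_save(key):
        # strip the ".base_layer" infix PEFT adds to wrapped modules
        return key.replace(".base_layer", ""), False

model = GraniteSpeechSketch()
model.save_pretrained("./finetuned-dir")
print(model._hf_peft_config_loaded, model.peft_config["default"].base_model_name_or_path)
# True ./finetuned-dir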
Who can review?
@ArthurZucker @eustlb can you give that a look? 🙏
CC: @avishaiElmakies @alex-jw-brooks