
feat: Sequential beam search(a.k.a Low-memory beam search) #26304

Merged
merged 18 commits into huggingface:main on Jan 19, 2024

Conversation

Saibo-creator
Contributor

@Saibo-creator Saibo-creator commented Sep 21, 2023

What does this PR do?

This addresses the issue in #22639.
This PR is based on the idea from #22639 (comment).

The original implementation of beam search effectively multiplies the batch size, memory-wise and compute-wise, by the number of beams. If you have a batch size of 1 and a beam width of 8, model.forward sees an effective batch size of 8. This is not strictly necessary and can consume a lot of memory.

The new implementation splits the full_batch (num_beams x batch_size) inputs into a list of reduced_batch (beam_search_batch_size) inputs, runs them sequentially, and concatenates the results back into a single ModelOutput object. It involves two helper functions (a minimal sketch follows the list below):

  • def concat_model_outputs(objs: List[ModelOutput]) -> ModelOutput
  • def split_model_inputs(obj: Union[ModelOutput, Dict], split_size: int, full_batch_size: int) -> List[Union[ModelOutput, Dict]]
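
For intuition, here is a minimal sketch (not the PR's actual code) of what these helpers do for plain batched tensors:

    import torch

    def split_model_inputs_sketch(inputs: dict, split_size: int, full_batch_size: int):
        # cut every batched tensor into sub-batches of at most `split_size` rows
        return [
            {k: (v[i : i + split_size] if torch.is_tensor(v) else v) for k, v in inputs.items()}
            for i in range(0, full_batch_size, split_size)
        ]

    def concat_model_outputs_sketch(logits_chunks):
        # glue the per-sub-batch logits back together into one full-batch tensor
        return torch.cat(logits_chunks, dim=0)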

The new implementation can be used in 4 decoding methods: beam_search, beam_sample, group_beam_search, and constrained_beam_search.

The expected behavior is that it produces exactly the same output (logits) as the original implementation.

I tested it with the following quick test, and it works regardless of the input:

from transformers import GPT2Tokenizer, AutoModelForCausalLM
import numpy as np
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    LogitsProcessorList,
    MinLengthLogitsProcessor,
    BeamSearchScorer,
)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.pad_token_id = tokenizer.eos_token_id
model_inputs = tokenizer('I enjoy walking with my cute dog', return_tensors='pt')

# activate beam search and early_stopping
beam_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=5,
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

beam_output_w_subbatch = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=5,
    early_stopping=True,
    beam_search_batch_size=1
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output_w_subbatch[0], skip_special_tokens=True))

assert (beam_output == beam_output_w_subbatch).all(), "Beam search results from sub batch and full batch are different"

TODO: I only did this for PyTorch models; if you think this PR is promising, I can do it for TensorFlow too.

What does this PR do?

Fixes #22639

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@gante @ArthurZucker

Member

@gante gante left a comment

Hey @Saibo-creator 👋 My apologies for the delayed response, I was out of office 🌴

Thank you for opening this PR! Looking at the code, in principle, it looks good to me. The only thing I would change is the flag that triggers this low-memory behavior. We have a flag for that in contrastive search, so we could repurpose it -- when it is True, we would run one beam at a time. That way we avoid adding a new argument to generate (which already has too many arguments).

If you update your code to reuse this existing flag, I'd be happy to approve the PR :) I suspect the code could also be simplified as a result of this change.

P.S.: a test is also missing, to confirm that the low-memory code path yields the same results. LMK if you need a hand with that :)
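
A hedged sketch of what the repurposed call could look like (assuming the contrastive-search flag name, low_memory, is kept):

    beam_output = model.generate(
        **model_inputs,
        max_new_tokens=40,
        num_beams=5,
        early_stopping=True,
        low_memory=True,  # run the beams in sub-batches instead of all at once
    )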


github-actions bot commented Nov 7, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@Saibo-creator
Contributor Author

repurpose

Hey @gante, sorry, I just saw your message. I agree with your suggestion and will update the code and add a test!

@Saibo-creator
Contributor Author

Hello @gante ,
I have reused the low_memory flag in the code and tried to unify the parts of the code shared between sequential beam search and sequential contrastive search.
A test is also added.

What confuses me is this failing test case:

self = <tests.models.xglm.test_modeling_xglm.XGLMModelTest testMethod=test_tf_from_pt_safetensors>

    @is_pt_tf_cross_test
    def test_tf_from_pt_safetensors(self):
        for model_class in self.all_model_classes:
            config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
    
            tf_model_class_name = "TF" + model_class.__name__  # Add the "TF" at the beginning
            if not hasattr(transformers, tf_model_class_name):
                # transformers does not have this model in TF version yet
                return
    
            tf_model_class = getattr(transformers, tf_model_class_name)
    
            pt_model = model_class(config)
    
            with tempfile.TemporaryDirectory() as tmpdirname:
                pt_model.save_pretrained(tmpdirname, safe_serialization=True)
                tf_model_1 = tf_model_class.from_pretrained(tmpdirname, from_pt=True)
    
                pt_model.save_pretrained(tmpdirname, safe_serialization=False)
                tf_model_2 = tf_model_class.from_pretrained(tmpdirname, from_pt=True)
    
                # Check models are equal
                for p1, p2 in zip(tf_model_1.weights, tf_model_2.weights):
>                   self.assertTrue(np.allclose(p1.numpy(), p2.numpy()))
E                   AssertionError: False is not true

tests/test_modeling_common.py:3246: AssertionError


FAILED tests/models/speech_to_text/test_modeling_speech_to_text.py::Speech2TextModelTest::test_tf_from_pt_safetensors - AssertionError: False is not true
FAILED tests/models/transfo_xl/test_modeling_transfo_xl.py::TransfoXLModelTest::test_tf_from_pt_safetensors - AssertionError: False is not true
FAILED tests/models/xglm/test_modeling_xglm.py::XGLMModelTest::test_tf_from_pt_safetensors - AssertionError: False is not true

Exited with code exit status 255

This doesn't seem to be directly related to my changes. Maybe you know what's wrong here ;D

@Saibo-creator Saibo-creator changed the title feat: Enable beam search to run without … feat: Sequential beam search Nov 13, 2023
@gante
Member

gante commented Nov 15, 2023

Hey @Saibo-creator!

The failing test is indeed unrelated :) We have skipped the test while we work on it, rebasing this branch should yield a green CI :)

Member

@gante gante left a comment

Happy to approve the PR as soon as we replace the added test with a mixin test :)

src/transformers/generation/utils.py
@@ -2637,6 +2637,21 @@ def test_beam_search_example_integration(self):

self.assertListEqual(outputs, ["Wie alt bist du?"])

def test_beam_search_low_memory(self):
Member

Instead of adding an integration test, I'd like to request adding a mixin -- like this one for contrastive search with low memory.

That way, we can ensure it is fully compatible with all models :) Like in the test linked above, it is okay to skip the models for which it fails.

Contributor Author

Sure, sounds good, I will do it ;)

One question: should I add @slow for this mixin test?

Member

@gante gante Nov 20, 2023

@Saibo-creator no need to add the @slow decorator -- we reserve that decorator for tests that use large models. In the mixin tests, we are using small and randomly initialized models :)

src/transformers/generation/utils.py Outdated
src/transformers/generation/utils.py Outdated
@Saibo-creator
Contributor Author

Saibo-creator commented Nov 17, 2023

Hello @gante, I would like your opinion on these failure cases:

FAILED tests/models/blenderbot/test_modeling_blenderbot.py::BlenderbotModelTest::test_beam_search_low_memory - AssertionError: Lists differ: [[1, 51, 51, 51, 51], [1, 37, 37, 43, 43]] != [[1, 37, 37, 43, 43], [1, 74, 74, 74, 74]]
FAILED tests/models/bloom/test_modeling_bloom.py::BloomModelTest::test_beam_search_low_memory - RuntimeError: Sizes of tensors must match except in dimension 2. Expected size 2 but got size 8 for tensor number 1 in the list.
FAILED tests/models/codegen/test_modeling_codegen.py::CodeGenModelTest::test_beam_search_low_memory - AssertionError: Lists differ: [[135, 102, 167, 94, 172, 103, 209], [233, 32, 101, 213, 234, 172, 103]] != [[135, 102, 167, 94, 172, 103, 209], [233, 32, 101, 203, 147, 47, 164]]
FAILED tests/models/clvp/test_modeling_clvp.py::ClvpDecoderTest::test_beam_search_low_memory - AssertionError: Lists differ: [[38, 235, 110, 144, 159], [93, 177, 267, 144, 55]] != [[38, 235, 110, 177, 144], [93, 177, 67, 235, 116]]
FAILED tests/models/git/test_modeling_git.py::GitModelTest::test_beam_search_low_memory - AssertionError: Lists differ: [[56, 96, 46, 38, 92, 42, 1], [25, 63, 12, 22, 49, 91, 19]] != [[56, 96, 46, 38, 92, 42, 1], [25, 63, 12, 45, 72, 77, 69]]
FAILED
FAILED tests/models/gpt_neox/test_modeling_gpt_neox.py::GPTNeoXModelTest::test_beam_search_low_memory - AssertionError: Lists differ: [[6, 34, 87, 20, 93, 20, 93], [82, 10, 76, 68, 62, 48, 10]] != [[6, 34, 87, 20, 93, 20, 93], [82, 10, 76, 68, 62, 68, 62]]
FAILED tests/models/gptj/test_modeling_gptj.py::GPTJModelTest::test_beam_search_low_memory - AssertionError: Lists differ: [[84, 2, 29, 55, 59, 47, 88], [50, 23, 76, 50, 76, 16, 20]] != [[84, 2, 29, 94, 76, 16, 20], [50, 23, 76, 50, 94, 76, 50]]
FAILED
FAILED tests/models/imagegpt/test_modeling_imagegpt.py::ImageGPTModelTest::test_beam_search_low_memory - AssertionError: Lists differ: [[57, 97, 6, 36, 77, 62, 18], [11, 23, 94, 85, 88, 81, 61]] != [[57, 97, 6, 36, 77, 62, 18], [11, 23, 94, 54, 9, 72, 18]]
FAILED tests/models/llama/test_modeling_llama.py::LlamaModelTest::test_beam_search_low_memory - AssertionError: Lists differ: [[83, 48, 56, 54, 56, 54, 56], [91, 84, 33, 72, 56, 54, 56]] != [[83, 48, 56, 54, 56, 54, 56], [91, 84, 33, 72, 56, 70, 78]]
FAILED tests/models/mega/test_modeling_mega.py::MegaModelTest::test_beam_search_low_memory - AssertionError: Lists differ: [[53, 16, 33, 21, 21, 34, 70], [82, 45, 21, 21, 34, 34, 70]] != [[53, 16, 33, 21, 34, 34, 70], [82, 45, 21, 21, 34, 34, 70]]
FAILED tests/models/mistral/test_modeling_mistral.py::MistralModelTest::test_beam_search_low_memory - AssertionError: Lists differ: [[52, 24, 42, 54, 4, 34, 21], [31, 71, 18, 5, 27, 28, 52]] != [[52, 24, 42, 54, 4, 34, 21], [31, 71, 18, 5, 73, 73, 73]]
FAILED tests/models/mpt/test_modeling_mpt.py::MptModelTest::test_beam_search_low_memory - AssertionError: Lists differ: [[8, 13, 85, 19, 19, 19, 19], [35, 61, 88, 19, 19, 19, 19]] != [[8, 13, 85, 19, 19, 19, 19], [35, 61, 88, 92, 92, 92, 92]]
FAILED tests/models/persimmon/test_modeling_persimmon.py::PersimmonModelTest::test_beam_search_low_memory - AssertionError: Lists differ: [[81, 83, 11, 58, 81, 28, 26], [54, 81, 18, 38, 18, 38, 18]] != [[81, 83, 11, 85, 58, 13, 38], [54, 81, 18, 38, 18, 38, 18]]
FAILED tests/models/phi/test_modeling_phi.py::PhiModelTest::test_beam_search_low_memory - AssertionError: Lists differ: [[28, 14, 49, 19, 56, 62, 50], [9, 30, 60, 19, 19, 19, 57]] != [[28, 14, 49, 37, 88, 35, 80], [9, 30, 60, 19, 57, 9, 57]]

The mixin tests reveal that a subset of models generate different outputs under sequential beam search vs. regular beam search, though the difference is not very big.

It seems all these models are recent LLMs, such as Phi, Llama, Mistral, etc., while for the other models the outputs are identical.

I have tried debugging locally, and I confirm that:

  • the beam search is deterministic across multiple runs
  • the sequential beam search is also deterministic across multiple runs

So these models' outputs do indeed differ between sequential beam search (SBS) and regular beam search (BS).

Do you have any insights into what may have caused this difference?

One interesting question is: do these models give the same outputs when run with different batch sizes? Because SBS is essentially unbatched BS.
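
One quick way to probe that (a minimal sketch, not part of this PR) is to compare the logits for the same prompt run alone and duplicated into a batch:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    prompt = "I enjoy walking with my cute dog"
    single = tok(prompt, return_tensors="pt")
    batched = tok([prompt, prompt], return_tensors="pt")  # identical rows, so no padding needed

    with torch.no_grad():
        logits_single = model(**single).logits[0]
        logits_batched = model(**batched).logits[0]

    # exact equality is not guaranteed: different batch shapes can hit different kernels
    print(torch.allclose(logits_single, logits_batched, atol=1e-5))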

Thank you!

@gante
Member

gante commented Nov 20, 2023

Hey @Saibo-creator 👋 We might be seeing the same effect as described here

If you run the test multiple times on a single model, from the list of failing tests above, do you always get a failure? In other words, is the test stable or flaky?

To see whether a test is flaky, install

pip install pytest-flakefinder

and then run

pytest --flake-finder --flake-runs=100 tests/test_failing_test.py

👉 Since the models in the mixin are randomly initialized, and the inputs are random as well, all runs are different. If the test is flaky, then the mismatch is probably due to the effect described in the thread at the beginning of this comment, and there are quick solutions for the test :)

@Saibo-creator
Contributor Author

Saibo-creator commented Nov 22, 2023

Hello @gante, thanks for your feedback. I tried running pytest-flakefinder, and indeed the tests pass 7 out of 100 times and fail the other 93 times.

I think it's indeed due to the effect you explained in the thread (btw, bravo for the detailed explanation!).

What is your solution for it? :)

In addition to the flaky failures, there are some other failure cases, mostly due to model-specific implementations. I dove into some of them:

  • bloom: special implementation of past key,value tensor shape
  • ctrl: TODO
  • fsmt: old model with different cache format, won't fix
  • gpt_bigcode: due to a non-standard implementation of past key values, easy to fix
  • reformer: old model with different cache format, won't fix
  • transfo_xl: TODO
  • xlnet: TODO
  • cpm: TODO

I would like to know the general principle for handling these specific models. On the one hand, we could add if-else branches in the code to handle their specificities, but this also decreases code quality a bit and costs extra effort.

Let me know which models you think we must support. :)

(just saw this from https://github.com/huggingface/transformers/blob/main/tests/generation/test_utils.py#L1979)

  # TODO (joao): A few models have different formats, skipping those until the cache refactor is complete
  models_without_standard_cache = ("bloom", "ctrl", "fsmt", "gptbigcode", "mega", "reformer")

@gante
Member

gante commented Nov 29, 2023

Hey @Saibo-creator!

Regarding the model-specific implementations, feel free to ignore them for now (including skipping the tests, as you pasted at the end of your comment) :)

Regarding flakiness: you mentioned "the tests pass 7 out of 100 times and fail the other 93 times". Is this per model, or when you test all models at once?
👉 If it is the latter, for all models at once, then it means the per-model failure rate is low -- we can simply add the @is_flaky test decorator, adding a comment pointing to the comment I linked.
👉 If it is the former, per model, then the failure rate is super high! There may be something else that we must uncover before merging :)
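
For reference, the @is_flaky option would look roughly like this (a sketch, assuming the existing is_flaky helper in transformers.testing_utils):

    from transformers.testing_utils import is_flaky

    @is_flaky()  # batched vs. sequential runs can differ numerically, see the linked comment
    def test_beam_search_low_memory(self):
        ...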

@Saibo-creator
Contributor Author

Saibo-creator commented Dec 1, 2023

Hey @gante

I think it's the former case, per model...

I need to dive deeper into the problem to understand why...

If you are interested, here is what is happening. If I run pytest --flake-finder --flake-runs=100 tests/models/llama/test_modeling_llama.py,

I get 93 failed, 8307 passed, 4700 skipped, 1613 warnings in 930.30s (0:15:30)

and the failing cases are from test_beam_search_low_memory like:

FAILED tests/models/llama/test_modeling_llama.py::LlamaModelTest::test_beam_search_low_memory - AssertionError: Lists differ: [[7, 68, 64, 28, 26, 37, 49], [52, 58, 92, 50, 7, 37, 49]] != [[7, 68, 64, 7, 37, 49, 37], [52, 58, 92, 71, 41, 7, 41]]
FAILED tests/models/llama/test_modeling_llama.py::LlamaModelTest::test_beam_search_low_memory - AssertionError: Lists differ: [[32, 79, 18, 50, 94, 48, 93], [52, 51, 3, 56, 24, 12, 24]] != [[32, 79, 18, 50, 94, 5, 25], [52, 51, 3, 56, 24, 12, 24]]
FAILED tests/models/llama/test_modeling_llama.py::LlamaModelTest::test_beam_search_low_memory - AssertionError: Lists differ: [[39, 97, 28, 57, 0, 0, 0], [34, 69, 91, 34, 63, 0, 0]] != [[39, 97, 28, 57, 0, 0, 0], [34, 69, 91, 5, 58, 18, 14]]

So in 93/100 runs the two outputs are different.

For other models such as GPT2, I also get some failure cases, though not frequently,
with pytest --flake-finder --flake-runs=10 tests/models/gpt2/test_modeling_gpt2.py:

FAILED tests/models/gpt2/test_modeling_gpt2.py::GPT2ModelTest::test_beam_search_low_memory - AssertionError: Lists differ: [[29, 58, 3, 68, 68, 68, 68], [90, 65, 45, 45, 68, 68, 68]] != [[29, 58, 3, 68, 68, 68, 68], [90,...
FAILED tests/models/gpt2/test_modeling_gpt2.py::GPT2ModelTest::test_beam_search_low_memory - AssertionError: Lists differ: [[18, 32, 29, 29, 87, 87, 87], [40, 85, 64, 64, 30, 30, 30]] != [[18, 32, 29, 87, 87, 87, 87], [4...
FAILED tests/models/gpt2/test_modeling_gpt2.py::GPT2ModelTest::test_beam_search_low_memory - AssertionError: Lists differ: [[98, 79, 92, 9, 9, 9, 9], [46, 16, 93, 9, 9, 9, 9]] != [[98, 79, 92, 9, 9, 9, 9], [46, 16, 93, 9...
=============================== 3 failed, 907 passed, 510 skipped, 197 warnings in 470.60s (0:07:50) 

@gante
Member

gante commented Dec 7, 2023

@Saibo-creator sadly I have no easy advice here :( it could be a kernel-related numerical issue due to different shapes, like when using past_key_values, or it could be a subtle bug.

I see two paths forward:
1 - you are able to pin down the source of the mismatch, and we can safely confirm that it is something like a numerical issue
2 - you run a few benchmarks over your PR -- if the resulting metrics are similar with the low-memory mode, then there shouldn't be a bug

@Saibo-creator
Contributor Author

I spent some time investigating the reason for the mismatch. I noticed that if batch_size = 1, all tests pass without any failure (as shown in the latest commit).

So it seems this is not a numerical issue but more like something to do with batch processing.

I will try to figure out if my implementation has a bug with batched input.

@Saibo-creator
Contributor Author

Saibo-creator commented Dec 22, 2023

I confirm that if we disable use_cache, the outputs are identical regardless of batch_size.
So we have now identified that the origin is the key/value cache.
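
The check amounts to something like the following (a hedged sketch; use_cache and low_memory as in generate):

    # with the cache disabled, full-batch and sequential beam search match exactly
    out_full = model.generate(**model_inputs, num_beams=5, use_cache=False, low_memory=False)
    out_seq = model.generate(**model_inputs, num_beams=5, use_cache=False, low_memory=True)
    assert (out_full == out_seq).all()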

@Saibo-creator Saibo-creator reopened this Dec 22, 2023
@Saibo-creator
Contributor Author

Saibo-creator commented Dec 22, 2023

Hello @gante
After a whole afternoon of debugging, I found the bug: it's a tiny bug in this line. I somehow wrote for i in range(0, full_batch_size, 1) instead of for i in range(0, full_batch_size, split_size). The former makes no sense at all, so I guess it was just a typo, but it was very well hidden...
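
For illustration (a sketch, not the PR's exact code), the step size controls how the full beam batch is sliced into sub-batches:

    full_batch_size, split_size = 8, 2
    # buggy: range(0, full_batch_size, 1) yields 8 overlapping slices
    # fixed: range(0, full_batch_size, split_size) yields 4 disjoint sub-batches of 2
    sub_batches = [
        list(range(i, min(i + split_size, full_batch_size)))
        for i in range(0, full_batch_size, split_size)
    ]
    print(sub_batches)  # [[0, 1], [2, 3], [4, 5], [6, 7]]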

I'm happy that it works now : )

(The remaining failure cases are unrelated.)

izyForever and others added 3 commits January 3, 2024 14:42
@Saibo-creator
Contributor Author

Saibo-creator commented Jan 16, 2024

@ArthurZucker I updated the code to incorporate your feedback. Let me know if there is anything else to do.

@JulesGM

JulesGM commented Jan 17, 2024

Is there a way the batch size of the beam-search could be made separate from num_return_sequences?

@JulesGM

JulesGM commented Jan 17, 2024

For example, for RL, I would like to generate, let's say, 64 different outputs from a single input sequence. However, my hardware (multi-GPU or not) can't handle more than 16 beams at once.

@JulesGM

JulesGM commented Jan 17, 2024

From trying to run the code, it seems like I couldn't get those 64 outputs? Just 16?

@Saibo-creator
Contributor Author

Saibo-creator commented Jan 17, 2024

@JulesGM

For example, for RL, I would like to generate, let's say, 64 different outputs from a single input sequence. However, my hardware (multi-GPU or not) can't handle more than 16 beams at once.

In your case, say you have a batch size of 1 (a single input), you want to do beam search with k=64 and num_return_sequences=64, and the maximum number of sequences your GPU can handle in parallel is 16.
If you run beam search with low_memory=True, this method will divide your beams into 64 sub-batches (each of 1 sequence) and run them sequentially.
Currently there is no argument to specify the parallel size; it is set to be identical to the batch size. This is something we could discuss adding, but it would involve a new argument to the generation function.
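
For example (a hedged sketch of the scenario above; arguments as in generate, exact behavior per the merged implementation):

    outputs = model.generate(
        **inputs,                  # a single input sequence, i.e. batch size 1
        num_beams=64,
        num_return_sequences=64,
        low_memory=True,           # beams are processed sequentially in sub-batches of size batch_size
    )
    # `outputs` should still contain 64 sequences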

From trying to run the code, it seems like I couldn't get those 64 outputs? Just 16?

That is something I don't expect to happen. If you set the beam width k=64 and num_return_sequences=64, you should get 64 outputs, regardless of whether low_memory is True or False. Am I wrong? Could you provide a reproducible example?

Collaborator

@ArthurZucker ArthurZucker left a comment

LGTM, thanks for your patience!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@gante
Member

gante commented Jan 19, 2024

Merging, as it's approved and it solves the original target use case.

@JulesGM don't refrain from commenting if it doesn't solve your use case: we may still open a follow-up PR to further refine this 🤗

@gante gante merged commit d4fc1eb into huggingface:main Jan 19, 2024
21 checks passed
@gante
Member

gante commented Jan 19, 2024

@Saibo-creator thank you for iterating with us 💛

@Saibo-creator
Contributor Author

Merging, as it's approved and it solves the original target use case.

@JulesGM don't refrain from commenting if it doesn't solve your use case: we may still open a follow-up PR to further refine this 🤗

@JulesGM If you think it's necessary, we could consider adding more flexibility by allowing users to specify the parallel_size for example. It would be an easy add-on.

@Saibo-creator Saibo-creator changed the title feat: Sequential beam search feat: Sequential beam search(aka Low-memory beam search) Jan 19, 2024
@Saibo-creator Saibo-creator changed the title feat: Sequential beam search(aka Low-memory beam search) feat: Sequential beam search(a.k.a Low-memory beam search) Jan 19, 2024
@gante
Member

gante commented Jan 19, 2024

@Saibo-creator alternatively, a try/except loop could be tried (try with batch_size -> if it fails, try with batch_size/2 -> ... -> try with batch_size = 1). I'd prefer that to an additional flag -- we already have too many flags in generate :)
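
In pseudocode, that fallback might look like this (a sketch of the suggestion; run_beams is a hypothetical callable, not an existing API):

    import torch

    def generate_with_fallback(run_beams, full_batch_size):
        """Try the full beam batch first, halving the sub-batch size on CUDA OOM."""
        sub_batch = full_batch_size
        while sub_batch >= 1:
            try:
                return run_beams(sub_batch)   # one generation attempt with this sub-batch size
            except torch.cuda.OutOfMemoryError:
                torch.cuda.empty_cache()
                sub_batch //= 2               # retry with half as many beams per forward pass
        raise RuntimeError("even one beam at a time does not fit in memory")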

wgifford pushed a commit to wgifford/transformers that referenced this pull request Jan 21, 2024
AjayP13 pushed a commit to AjayP13/transformers that referenced this pull request Jan 22, 2024
@Saibo-creator
Contributor Author

@Saibo-creator alternatively, a try/except loop could be tried (try with batch_size -> if it fails, try with batch_size/2 -> ... -> try with batch_size = 1). I'd prefer that to an additional flag -- we already have too many flags in generate :)

Thanks for the suggestion, @gante! I think this would be a good improvement to reduce work on the user's side. Should I open another PR to proceed?

@JulesGM

JulesGM commented Jan 26, 2024

Merging, as it's approved and it solves the original target use case.
@JulesGM don't refrain from commenting if it doesn't solve your use case: we may still open a follow-up PR to further refine this 🤗

@JulesGM If you think it's necessary, we could consider adding more flexibility by allowing users to specify the parallel_size for example. It would be an easy add-on.

I think that this would be great. In my personal fork, I currently check whether low_memory is an int and use it as the sub-batch size if so, haha. Of course, I can see arguments for why a separate argument would be better; I'm not attached to this solution at all.
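
Roughly, that check would be (a sketch based on the description above, not upstream transformers code):

    def resolve_sub_batch_size(low_memory, batch_size, full_batch_size):
        # the bool check must come first: in Python, True/False are also ints
        if isinstance(low_memory, bool):
            return batch_size if low_memory else full_batch_size
        if isinstance(low_memory, int):
            return low_memory  # an integer value is used directly as the sub-batch size
        return full_batch_size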

@JulesGM

JulesGM commented Jan 26, 2024

I also extended the code in my personal fork to beam_sample, basically copy-pasting the beam_search sequential section. I think that would be a good thing to add.

@gante
Member

gante commented Jan 27, 2024

@Saibo-creator yeah, open a new PR, I'd be happy to merge it 🤗

@Saibo-creator
Contributor Author

@Saibo-creator yeah, open a new PR, I'd be happy to merge it 🤗

👍🏻 I will make a PR shortly

@JulesGM

JulesGM commented Jan 28, 2024

So I noticed that in practice this didn't allow me to increase the beam size by much when the sequences are long. So I'm experimenting with offloading the key-value cache to CPU and only loading the part of the cache that is needed, and it seems to really help.
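
A minimal sketch of that offloading pattern (an illustration of the idea, not JulesGM's actual code):

    import torch

    def offload_cache_to_cpu(past_key_values):
        # keep every layer's (key, value) tensors on CPU between decoding steps
        return tuple(tuple(t.to("cpu", non_blocking=True) for t in layer) for layer in past_key_values)

    def fetch_layer_cache(past_key_values, layer_idx, device="cuda"):
        # move back only the layer that is about to run
        return tuple(t.to(device, non_blocking=True) for t in past_key_values[layer_idx])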

@Saibo-creator
Contributor Author

So I noticed that in practice this didn't allow me to increase the beam size by much when the sequences are long. So I'm experimenting with offloading the key-value cache to CPU and only loading the part of the cache that is needed, and it seems to really help.

Oh, thanks for pointing that out, @JulesGM. I think your observation makes sense. Offloading the KV cache should indeed be more efficient!

@Saibo-creator
Contributor Author

Saibo-creator commented Jan 31, 2024

So I noticed that in practice this didn't allow me to increase the beam size by much when the sequences are long. So I'm experimenting with offloading the key-value cache to CPU and only loading the part of the cache that is needed, and it seems to really help.

Hey @JulesGM, this approach is really cool. Would you like to share a link to your implementation of this feature? I could benefit from it too :)

KaifAhmad1 added a commit to KaifAhmad1/transformers that referenced this pull request Feb 20, 2024

* Move CLIP _no_split_modules to CLIPPreTrainedModel (#27841)

Add _no_split_modules to CLIPModel

* `HfQuantizer` class for quantization-related stuff in `modeling_utils.py` (#26610)

* squashed earlier commits for easier rebase

* rm rebase leftovers

* 4bit save enabled @quantizers

* TMP gptq test use exllama

* fix AwqConfigTest::test_wrong_backend for A100

* quantizers AWQ fixes

* _load_pretrained_model low_cpu_mem_usage branch

* quantizers style

* remove require_low_cpu_mem_usage attr

* rm dtype arg from process_model_before_weight_loading

* rm config_origin from Q-config

* rm inspect from q_config

* fixed docstrings in QuantizationConfigParser

* logger.warning fix

* mv is_loaded_in_4(8)bit to BnbHFQuantizer

* is_accelerate_available error msg fix in quantizer

* split is_model_trainable in bnb quantizer class

* rm llm_int8_skip_modules as separate var in Q

* Q rm todo

* fwd ref to HFQuantizer in type hint

* rm note re optimum.gptq.GPTQQuantizer

* quantization_config in __init__ simplified

* replaced NonImplemented with  create_quantized_param

* rm load_in_4/8_bit deprecation warning

* QuantizationConfigParser refactoring

* awq-related minor changes

* awq-related changes

* awq config.modules_to_not_convert

* raise error if no q-method in q-config in args

* minor cleanup

* awq quantizer docstring

* combine common parts in bnb process_model_before_weight_loading

* revert test_gptq

* .process_model_ cleanup

* restore dict config warning

* removed typevars in quantizers.py

* cleanup post-rebase 16 jan

* QuantizationConfigParser classmethod refactor

* rework of handling of unexpected aux elements of bnb weights

* moved q-related stuff from save_pretrained to quantizers

* refactor v1

* more changes

* fix some tests

* remove it from main init

* ooops

* Apply suggestions from code review

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* fix awq issues

* fix

* fix

* fix

* fix

* fix

* fix

* add docs

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update docs/source/en/hf_quantizer.md

* address comments

* fix

* fixup

* Update src/transformers/modeling_utils.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/modeling_utils.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* address final comment

* update

* Update src/transformers/quantizers/base.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/quantizers/auto.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* fix

* add kwargs update

* fixup

* add `optimum_quantizer` attribute

* oops

* rm unneeded file

* fix doctests

---------

Co-authored-by: younesbelkada <younesbelkada@gmail.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* [`HfQuantizer`] Move it to "Developper guides" (#28768)

Update _toctree.yml

* Use Conv1d for TDNN (#25728)

* use conv for tdnn

* run make fixup

* update TDNN

* add PEFT LoRA check

* propagate tdnn warnings to others

* add missing imports

* update TDNN in wav2vec2_bert

* add missing imports

* Fix transformers.utils.fx compatibility with torch<2.0 (#28774)

guard sdpa on torch>=2.0

* Further pin pytest version (in a temporary way) (#28780)

fix

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

* [`Backbone`] Use `load_backbone` instead of `AutoBackbone.from_config` (#28661)

* Enable instantiating model with pretrained backbone weights

* Remove doc updates until changes made in modeling code

* Use load_backbone instead

* Add use_timm_backbone to the model configs

* Add missing imports and arguments

* Update docstrings

* Make sure test is properly configured

* Include recent DPT updates

* Task-specific pipeline init args (#28439)

* Abstract out pipeline init args

* Address PR comments

* Reword

* BC PIPELINE_INIT_ARGS

* Remove old arguments

* Small fix

* Add tf_keras imports to prepare for Keras 3 (#28588)

* Port core files + ESM (because ESM code is odd)

* Search-replace in modelling code

* Fix up transfo_xl as well

* Fix other core files + tests (still need to add correct import to tests)

* Fix cookiecutter

* make fixup, fix imports in some more core files

* Auto-add imports to tests

* Cleanup, add imports to sagemaker tests

* Use correct exception for importing tf_keras

* Fixes in modeling_tf_utils

* make fixup

* Correct version parsing code

* Ensure the pipeline tests correctly revert to float32 after each test

* Ensure the pipeline tests correctly revert to float32 after each test

* More tf.keras -> keras

* Add dtype cast

* Better imports of tf_keras

* Add a cast for tf.assign, just in case

* Fix callback imports

* Pin Torch to <2.2.0 (#28785)

* Pin torch to <2.2.0

* Pin torchvision and torchaudio as well

* Playing around with versions to see if this helps

* twiddle something to restart the CI

* twiddle it back

* Try changing the natten version

* make fixup

* Revert "Try changing the natten version"

This reverts commit de0d6592c35dc39ae8b5a616c27285db28262d06.

* make fixup

* fix fix fix

* fix fix fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

* [`bnb`] Fix bnb slow tests (#28788)

fix bnb slow tests

* Prevent MLflow exception from disrupting training (#28779)

Modified MLflow logging metrics from synchronous to asynchronous

Co-authored-by: codiceSpaghetti <alessio.ser@hotmail.it>

* don't initialize the output embeddings if we're going to tie them to input embeddings (#28192)

* test that tied output embeddings aren't initialized on load

* don't initialize the output embeddings if we're going to tie them to the input embeddings

* [`HFQuantizer`] Remove `check_packages_compatibility` logic (#28789)

remove `check_packages_compatibility` logic

* [Whisper] Refactor forced_decoder_ids & prompt ids (#28687)

* up

* Fix more

* Correct more

* Fix more tests

* fix fast tests

* Fix more

* fix more

* push all files

* finish all

* make style

* Fix timestamp wrap

* make style

* make style

* up

* up

* up

* Fix lang detection behavior

* Fix lang detection behavior

* Add lang detection test

* Fix lang detection behavior

* make style

* Update src/transformers/models/whisper/generation_whisper.py

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* better error message

* make style tests

* add warning

---------

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* Resolve DeepSpeed cannot resume training with PeftModel (#28746)

* fix: resolve deepspeed resume peft model issues

* chore: update something

* chore: update model instance pass into is peft model checks

* chore: remove hard code value to tests

* fix: format code

* canonical repos moves (#28795)

* canonical repos moves

* Style

---------

Co-authored-by: Lysandre <lysandre@huggingface.co>

* Wrap Keras methods to support BatchEncoding (#28734)

* Shim the Keras methods to support BatchEncoding

* Extract everything to a convert_batch_encoding function

* Convert BatchFeature too (thanks Amy)

* tf.keras -> keras

* Flax mistral (#26943)

* direct copy from llama work

* mistral modules forward pass working

* flax mistral forward pass with sliding window

* added tests

* added layer collection approach

* Revert "added layer collection approach"

This reverts commit 0e2905bf2236ec323163fc1a9f0c016b21aa8b8f.

* Revert "Revert "added layer collection approach""

This reverts commit fb17b6187ac5d16da7c461e1130514dc3d137a43.

* fixed attention outputs

* added mistral to init and auto

* fixed import name

* fixed layernorm weight dtype

* freeze initialized weights

* make sure conversion consideres bfloat16

* added backend

* added docstrings

* added cache

* fixed sliding window causal mask

* passes cache tests

* passed all tests

* applied make style

* removed commented out code

* applied fix-copies ignored other model changes

* applied make fix-copies

* removed unused functions

* passed generation integration test

* slow tests pass

* fixed slow tests

* changed default dtype from jax.numpy.float32 to float32 for docstring check

* skip cache test  for FlaxMistralForSequenceClassification since if pad_token_id in input_ids it doesn't score previous input_ids

* updated checkpoint since from_pt not included

* applied black style

* removed unused args

* Applied styling and fixup

* changed checkpoint for doc back

* fixed rf after adding it to hf hub

* Add dummy ckpt

* applied styling

* added tokenizer to new ckpt

* fixed slice format

* fix init and slice

* changed ref for placeholder TODO

* added copies from Llama

* applied styling

* applied fix-copies

* fixed docs

* update weight dtype reconversion for sharded weights

* removed Nullable input ids

* Removed unnecessary output attentions in Module

* added embedding weight initialziation

* removed unused past_key_values

* fixed deterministic

* Fixed RMS Norm and added copied from

* removed input_embeds

* applied make style

* removed nullable input ids from sequence classification model

* added copied from GPTJ

* added copied from Llama on FlaxMistralDecoderLayer

* added copied from to FlaxMistralPreTrainedModel methods

* fix test deprecation warning

* freeze gpt neox random_params and fix copies

* applied make style

* fixed doc issue

* skipped docstring test to allign # copied from

* applied make style

* removed FlaxMistralForSequenceClassification

* removed unused padding_idx

* removed more sequence classification

* removed sequence classification

* applied styling and consistency

* added copied from in tests

* removed sequence classification test logic

* applied styling

* applied make style

* removed freeze and fixed copies

* undo test change

* changed repeat_kv to tile

* fixed to key value groups

* updated copyright year

* split casual_mask

* empty to rerun failed pt_flax_equivalence test FlaxWav2Vec2ModelTest

* went back to 2023 for tests_pr_documentation_tests

* went back to 2024

* changed tile to repeat

* applied make style

* empty for retry on Wav2Vec2

* DeepSpeed: hardcode `torch.arange` dtype on `float` usage to avoid incorrect initialization (#28760)

* Add artifact name in job step to maintain job / artifact correspondence (#28682)

* avoid using job name

* apply to other files

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

* Split daily CI using 2 level matrix (#28773)

* update / add new workflow files

* Add comment

* Use env.NUM_SLICES

* use scripts

* use scripts

* use scripts

* Fix

* using one script

* Fix

* remove unused file

* update

* fail-fast: false

* remove unused file

* fix

* fix

* use matrix

* inputs

* style

* update

* fix

* fix

* no model name

* add doc

* allow args

* style

* pass argument

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

* [docs] Correct the statement in the docstirng of compute_transition_scores in generation/utils.py (#28786)

* Adding [T5/MT5/UMT5]ForTokenClassification (#28443)

* Adding [T5/MT5/UMT5]ForTokenClassification

* Add auto mappings for T5ForTokenClassification and variants

* Adding ForTokenClassification to the list of models

* Adding attention_mask param to the T5ForTokenClassification test

* Remove outdated comment in test

* Adding EncoderOnly and Token Classification tests for MT5 and UMT5

* Fix typo in umt5 string

* Add tests for all the existing MT5 models

* Fix wrong comment in dependency_versions_table

* Reverting change to common test for _keys_to_ignore_on_load_missing

The test is correctly picking up redundant keys in _keys_to_ignore_on_load_missing.

* Removing _keys_to_ignore_on_missing from MT5 since the key is not used in the model

* Add fix-copies to MT5ModelTest

* Make `is_torch_bf16_available_on_device` more strict (#28796)

fix

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

* Fix symbolic_trace with kv cache (#28724)

* fix symbolic_trace with kv cache

* comment & better test

* Add tip on setting tokenizer attributes (#28764)

* Add tip on setting tokenizer attributes

* Grammar

* Remove the bit that was causing doc builds to fail

* enable graident checkpointing in DetaObjectDetection and add tests in Swin/Donut_Swin (#28615)

* enable graident checkpointing in DetaObjectDetection

* fix missing part in original DETA

* make style

* make fix-copies

* Revert "make fix-copies"

This reverts commit 4041c86c29248f1673e8173b677c20b5a4511358.

* remove fix-copies of DetaDecoder

* enable swin gradient checkpointing

* fix gradient checkpointing in donut_swin

* add tests for deta/swin/donut

* Revert "fix gradient checkpointing in donut_swin"

This reverts commit 1cf345e34d3cc0e09eb800d9895805b1dd9b474d.

* change supports_gradient_checkpointing pipeline to PreTrainedModel

* Revert "add tests for deta/swin/donut"

This reverts commit 6056ffbb1eddc3cb3a99e4ebb231ae3edf295f5b.

* Revert "Revert "fix gradient checkpointing in donut_swin""

This reverts commit 24e25d0a14891241de58a0d86f817d0b5d2a341f.

* Simple revert

* enable deformable detr gradient checkpointing

* add gradient in encoder

* [docs] fix some bugs about parameter description (#28806)

Co-authored-by: p_spozzhang <p_spozzhang@tencent.com>

* Add models from deit (#28302)

* Add modelss

* Add 2 more models

* add models to tocrree

* Add modles

* Update docs/source/ja/model_doc/detr.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/ja/model_doc/deit.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/ja/model_doc/deplot.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* fix bugs

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* [docs] Backbone (#28739)

* backbones

* fix path

* fix paths

* fix code snippet

* fix links

* [docs] HfQuantizer (#28820)

* tidy

* fix path

* [Docs] Fix spelling and grammar mistakes (#28825)

* Fix typos and grammar mistakes in docs and examples

* Fix typos in docstrings and comments

* Fix spelling of `tokenizer` in model tests

* Remove erroneous spaces in decorators

* Remove extra spaces in Markdown link texts

* Explicitly check if token ID's are None in TFBertTokenizer constructor (#28824)

Add an explicit none-check, since token ids can be 0

* Add missing None check for hf_quantizer (#28804)

* Add missing None check for hf_quantizer

* Add test, fix logic.

* make style

* Switch test model to Mistral

* Comment

* Update tests/test_modeling_utils.py

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* Fix issues caused by natten (#28834)

try

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

* fix / skip (for now) some tests before switch to torch 2.2 (#28838)

* fix / skip some tests before we can switch to torch 2.2

* style

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

* Use `-v` for `pytest` on CircleCI  (#28840)

use -v in pytest

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

* Reduce GPU memory usage when using FSDP+PEFT (#28830)

support FSDP+PEFT

* Mark `test_encoder_decoder_model_generate` for `vision_encoder_deocder` as flaky (#28842)

Mark test as flaky

* Bump dash from 2.3.0 to 2.15.0 in /examples/research_projects/decision_transformer (#28845)

Bump dash in /examples/research_projects/decision_transformer

Bumps [dash](https://github.com/plotly/dash) from 2.3.0 to 2.15.0.
- [Release notes](https://github.com/plotly/dash/releases)
- [Changelog](https://github.com/plotly/dash/blob/dev/CHANGELOG.md)
- [Commits](https://github.com/plotly/dash/compare/v2.3.0...v2.15.0)

---
updated-dependencies:
- dependency-name: dash
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Support custom scheduler in deepspeed training (#26831)

Reuse trainer.create_scheduler to create scheduler for deepspeed

* [Docs] Fix bad doc: replace save with logging (#28855)

Fix bad doc: replace save with logging

* Ability to override clean_code_for_run (#28783)

* Add clean_code_for_run function

* Call clean_code_for_run from agent method

* [WIP] Hard error when ignoring tensors. (#27484)

* [WIP] Hard error when ignoring tensors.

* Better selection/error when saving a checkpoint.

- Find all names we should normally drop (those are in the transformers
  config)
- Find all disjoint tensors (for those we can safely trigger a copy to
  get rid of the sharing before saving)
- Clone those disjoint tensors getting rid of the issue
- Find all identical names (those should be declared in the config
  but we try to find them all anyway.)
- For all identical names:
  - If they are in the config, just ignore them everything is fine
  - If they are not, warn about them.
- For all remainder tensors which are shared yet neither identical NOR
  disjoint. raise a hard error.

* Adding a failing test on `main` that passes here.

* We don't need to keep the subfolder logic in this test.

* Apply suggestions from code review

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* [`Doc`] update contribution guidelines (#28858)

update guidelines

* Correct wav2vec2-bert inputs_to_logits_ratio (#28821)

* Correct wav2vec2-bert inputs_to_logits_ratio

* correct ratio

* correct ratio, clean asr pipeline

* refactor on one line

* Image Feature Extraction pipeline (#28216)

* Draft pipeline

* Fixup

* Fix docstrings

* Update doctest

* Update pipeline_model_mapping

* Update docstring

* Update tests

* Update src/transformers/pipelines/image_feature_extraction.py

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>

* Fix docstrings - review comments

* Remove pipeline mapping for composite vision models

* Add to pipeline tests

* Remove for flava (multimodal)

* safe pil import

* Add requirements for pipeline run

* Account for super slow efficientnet

* Review comments

* Fix tests

* Swap order of kwargs

* Use build_pipeline_init_args

* Add back FE pipeline for Vilt

* Include image_processor_kwargs in docstring

* Mark test as flaky

* Update TODO

* Update tests/pipelines/test_pipelines_image_feature_extraction.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Add license header

---------

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* ClearMLCallback enhancements: support multiple runs and handle logging better (#28559)

* add clearml tracker

* support multiple train runs

* remove bad code

* add UI entries for config/hparams overrides

* handle models in different tasks

* run ruff format

* tidy code based on code review

---------

Co-authored-by: Eugen Ajechiloae <eugenajechiloae@gmail.com>

* Do not use mtime for checkpoint rotation. (#28862)

Resolve https://github.com/huggingface/transformers/issues/26961

* Adds LlamaForQuestionAnswering class in modeling_llama.py along with AutoModel Support  (#28777)

* This is a test commit

* testing commit

* final commit with some changes

* Removed copy statement

* Fixed formatting issues

* Fixed error added past_key_values in the forward method

* Fixed a trailing whitespace. Damn the formatting rules are strict

* Added the copy statement

* Bump cryptography from 41.0.2 to 42.0.0 in /examples/research_projects/decision_transformer (#28879)

Bump cryptography in /examples/research_projects/decision_transformer

Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.2 to 42.0.0.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/41.0.2...42.0.0)

---
updated-dependencies:
- dependency-name: cryptography
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [Docs] Update project names and links in awesome-transformers (#28878)

Update project names and repository links in awesome-transformers

* Fix LongT5ForConditionalGeneration initialization of lm_head (#28873)

* Raise error when using `save_only_model` with `load_best_model_at_end` for DeepSpeed/FSDP (#28866)

* Raise error when using `save_only_model` with `load_best_model_at_end` for DeepSpeed/FSDP

* Update trainer.py

* Fix `FastSpeech2ConformerModelTest` and skip it on CPU (#28888)

* fix

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

* Revert "[WIP] Hard error when ignoring tensors." (#28898)

Revert "[WIP] Hard error when ignoring tensors. (#27484)"

This reverts commit 2da28c4b41bba23969a8afe97c3dfdcbc47a57dc.

* unpin torch (#28892)

* unpin torch

* check

* check

* check

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

* Explicit server error on gated model (#28894)

* [Docs] Fix backticks in inline code and documentation links (#28875)

Fix backticks in code blocks and documentation links

* Hotfix - make `torchaudio` get the correct version in `torch_and_flax_job` (#28899)

* check

* check

* check

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

* [Docs] Add missing language options and fix broken links (#28852)

* Add missing entries to the language selector

* Add links to the Colab and AWS Studio notebooks for ONNX

* Use anchor links in CONTRIBUTING.md

* Fix broken hyperlinks due to spaces

* Fix links to OpenAI research articles

* Remove confusing footnote symbols from author names, as they are also considered invalid markup

* fix: Fixed the documentation for `logging_first_step` by removing "evaluate" (#28884)

Fixed the documentation for logging_first_step by removing evaluate.

* fix Starcoder FA2 implementation (#28891)

* Fix Keras scheduler import so it works for older versions of Keras (#28895)

Fix our schedule import so it works for older versions of Keras

* ⚠️ Raise `Exception` when trying to generate 0 tokens ⚠️ (#28621)

* change warning to exception

* Update src/transformers/generation/utils.py

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

* validate `max_new_tokens` > 0 in `GenerationConfig`

* fix truncation test parameterization in `TextGenerationPipelineTests`

---------

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

* Update the cache number (#28905)

* fix

* fix

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

* Add npu device for pipeline (#28885)

add npu device for pipeline

Co-authored-by: unit_test <test@unit.com>

* [Docs] Fix placement of tilde character (#28913)

Fix placement of tilde character

* [Docs] Revert translation of '@slow' decorator (#28912)

* Fix utf-8 yaml load for marian conversion to pytorch in Windows (#28618)

Fix utf-8 yaml in marian conversion

* [`Core generation`] Adds support for static KV cache (#27931)

Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

* Remove dead TF loading code (#28926)

Remove dead code

* fix: torch.int32 instead of torch.torch.int32 (#28883)

* pass kwargs in stopping criteria list (#28927)

* Support batched input for decoder start ids (#28887)

* support batched input for decoder start ids

* Fix typos

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

* minor changes

* fix: decoder_start_id as list

* empty commit

* empty commit

* empty commit

* empty commit

* empty commit

* empty commit

* empty commit

* empty commit

* empty commit

---------

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

* [Docs] Fix broken links and syntax issues (#28918)

* Fix model documentation links in attention.md

* Fix external link syntax

* Fix target anchor names of section links

* Fix copyright statement comments

* Fix documentation headings

* Fix max_position_embeddings default value for llama2 to 4096 #28241 (#28754)

* Changed max_position_embeddings default value from 2048 to 4096

* force push

* Fixed formatting issues. Fixed missing argument in write_model.

* Reverted to the default value 2048 in the Llama config. Added comments for the llama_version argument.

* Fixed issue with default value value of max_position_embeddings in docstring

* Updated help message for llama versions

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

---------

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Fix a wrong link to CONTRIBUTING.md section in PR template (#28941)

* Fix type annotations on neftune_noise_alpha and fsdp_config TrainingArguments parameters (#28942)

* [i18n-de] Translate README.md to German (#28933)

* Translate README.md to German

* Add links to README_de.md

* Remove invisible characters in README

* Change to a formal tone and fix punctuation marks

* [Nougat] Fix pipeline (#28242)

* Fix pipeline

* Remove print statements

* Address comments

* Address issue

* Remove unused imports

* [Docs] Update README and default pipelines (#28864)

* Update README and docs

* Update README

* Update README

* Convert `torch_dtype` as `str` to actual torch data type (i.e. "float16" …to `torch.float16`) (#28208)

* Convert torch_dtype as str to actual torch data type (i.e. "float16" to torch.float16)

* Check if passed torch_dtype is an attribute in torch

* Update src/transformers/pipelines/__init__.py

Check type via isinstance

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

---------

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* [`pipelines`] updated docstring with vqa alias (#28951)

updated docstring with vqa alias

* Tests: tag `test_save_load_fast_init_from_base` as flaky (#28930)

* Updated requirements for image-classification samples: datasets>=2.14.0 (#28974)

Updated datasets requirements. Need a package version >= 2.14.0

* Always initialize tied output_embeddings if it has a bias term (#28947)

Continue to initialize tied output_embeddings if it has a bias term

The bias term is not tied, and so will need to be initialized accordingly.

* Clean up staging tmp checkpoint directory (#28848)

clean up remaining tmp checkpoint dir

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

* [Docs] Add language identifiers to fenced code blocks (#28955)

Add language identifiers to code blocks

* [Docs] Add video section (#28958)

Add video section

* [i18n-de] Translate CONTRIBUTING.md to German (#28954)

* Translate contributing.md to German

* Fix formatting issues in contributing.md

* Address review comments

* Fix capitalization

* [`NllbTokenizer`] refactor with added tokens decoder (#27717)

* refactor with addedtokens decoder

* style

* get rid of lang code to id

* style

* keep some things for BC

* update tests

* add the mask token at the end of the vocab

* nits

* nits

* fix final tests

* style

* nits

* Update src/transformers/models/nllb/tokenization_nllb_fast.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* nits

* style?

* Update src/transformers/convert_slow_tokenizer.py

* make it a tad bit more custom

* ruff please stop
Co-Authored by avidale

<dale.david@mail.ru>

* Update
Co-authored-by: avidale
<dale.david@mail.ru>

* Update
Co-authored-by: avidale <dale.david@mail.ru>

* oupts

* ouft

* nites

* test

* fix the remaining failing tests

* style

* fix failing test

* ficx other test

* temp dir + test the raw init

* update test

* style

---------

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Add sudachi_projection option to BertJapaneseTokenizer (#28503)

* add sudachi_projection option

* Upgrade sudachipy>=0.6.8

* add a test case for sudachi_projection

* Compatible with older versions of SudachiPy

* make fixup

* make style

* error message for unidic download

* revert jumanpp test cases

* format options for sudachi_projection

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* format options for sudachi_split_mode and sudachi_dict_type

* comment

* add tests for full_tokenizer kwargs

* pass projection arg directly

* require_sudachi_projection

* make style

* revert upgrade sudachipy

* check is_sudachi_projection_available()

* revert dependency_version_table and bugfix

* style format

* simply raise ImportError

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* simply raise ImportError

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Static Cache: load models with MQA or GQA (#28975)

* Update configuration_llama.py: fixed broken link (#28946)

* Update co…
@thanhlecongg

thanhlecongg commented Aug 7, 2024

Hi @Saibo-creator, I have tried running generation with low_memory=True and a high number of beams. However, it is counter-intuitive that memory usage with low-memory beam search is higher than with the original implementation.

### low_memory=True
Memory usage (GB):  19.74514389038086
Time taken (s):  63.430684089660645

### Normal
Memory usage (GB):  10.743547439575195
Time taken (s):  2.9829838275909424

This is my code:

import time

import torch
from transformers import AutoModelForCausalLM, GPT2Tokenizer


def get_memory_usage():
    # Peak GPU memory allocated so far, in GB
    return torch.cuda.max_memory_allocated(device=None) / 1024 / 1024 / 1024


tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.to("cuda")
tokenizer.pad_token_id = tokenizer.eos_token_id

# Long prompt: the sentence repeated 50 times
model_inputs = tokenizer("I enjoy walking with my cute dog" * 50, return_tensors="pt")
model_inputs = model_inputs.to("cuda")

start = time.time()

beam_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=100,
    early_stopping=True,
)

print("Output:\n" + 100 * "-")
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

# Uncomment the following lines to run the code with low_memory=True
# beam_output_w_subbatch = model.generate(
#     **model_inputs,
#     max_new_tokens=40,
#     num_beams=100,
#     early_stopping=True,
#     low_memory=True,
# )

# print("Output:\n" + 100 * "-")
# print(tokenizer.decode(beam_output_w_subbatch[0], skip_special_tokens=True))

print("Memory usage: ", get_memory_usage())
print("Time taken: ", time.time() - start)

I'm thinking it is because you stack the outputs on these lines before releasing memory, so roughly double the expected memory is consumed (both outputs_per_sub_batch and the concatenated outputs are alive at the same time). Do you have any thoughts on this?
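
For illustration, here is a minimal sketch of the pattern being described (the variable names and sizes are made up for the example and are not the exact code in generate): while the concatenated tensor is being built, the list of per-sub-batch outputs is still alive, so the peak briefly holds roughly two copies of the full output.

import torch

# Toy stand-in for the per-sub-batch logits: (sub_batch, seq_len, vocab_size)
sub_batch, seq_len, vocab_size, n_sub_batches = 1, 64, 50257, 8
outputs_per_sub_batch = [
    torch.empty(sub_batch, seq_len, vocab_size) for _ in range(n_sub_batches)
]

# torch.cat allocates the full (8, 64, 50257) tensor while the list above is
# still referenced, so at this point both copies exist in memory.
outputs = torch.cat(outputs_per_sub_batch, dim=0)
del outputs_per_sub_batch  # the per-sub-batch tensors are only freed after this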

Thanks.

@Saibo-creator
Contributor Author

Saibo-creator commented Aug 8, 2024

Hello @thanhlecongg, thanks for the question!

You are looking in the right place.

Why is memory usage higher with low_memory=True than with vanilla beam search?

Stacking outputs_per_sub_batch back into a single tensor effectively doubles the memory used by the model output.

When the model output is small, this is not a problem. But when the output is large (as in your case), this extra copy can dominate the memory needed for the forward pass itself.

For example, with beam size = 20 and sequence length = 350, model_output.logits takes 20 x 350 x vocab_size (~50K) x 4 bytes ≈ 1.4 GB.
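
For concreteness, a quick back-of-the-envelope check of that number (50K approximates GPT-2's 50257-token vocabulary):

num_beams, seq_len, vocab_size, bytes_per_float32 = 20, 350, 50257, 4
logits_bytes = num_beams * seq_len * vocab_size * bytes_per_float32
print(logits_bytes / 1024**3)  # ~1.31 GiB, i.e. roughly 1.4 GB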

Isn't this counter-intuitive, given that low_memory=True is supposed to reduce memory usage?

It is not a big problem per se: depending on the use case, the output may not be huge, and the memory bottleneck is often the model weights and the intermediate activations of the forward pass.

One example of a use case where low_memory=True helps is when the model is very large and the input is relatively short, for example:

  • gpt2-xl (~1.5 billion params) instead of gpt2 (~124 million params)
  • 'I enjoy walking with my cute dog' * 2 instead of 'I enjoy walking with my cute dog' * 50
  • num_beams=4

Check this gist to verify.

In this setting, vanilla beam search hits OOM (my GPU has 8 GB) but low_memory=True works.

Fix

One simple fix is to do the stacking on the CPU and then move the concatenated result back to GPU memory.

This would ensure that memory usage with low_memory=True is always lower than with vanilla beam search.
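
A minimal sketch of that idea, assuming a helper along these lines (the function name is illustrative, not the actual helper in generation/utils.py): each sub-batch tensor is moved to the CPU before concatenation, and only the single merged tensor is transferred back to the GPU.

import torch

def concat_on_cpu(sub_batch_tensors, device):
    # Build the full tensor on the CPU so the GPU never holds the sub-batch
    # outputs and the concatenated output at the same time.
    full = torch.cat([t.cpu() for t in sub_batch_tensors], dim=0)
    return full.to(device)

device = "cuda" if torch.cuda.is_available() else "cpu"
parts = [torch.randn(2, 8, 16, device=device) for _ in range(4)]
merged = concat_on_cpu(parts, device)  # shape (8, 8, 16), back on `device`

The trade-off is extra host-device transfers, so the lower peak GPU memory would likely come at some cost in speed.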

@gante, do you think this is a valid issue? If so, I can open a PR for it.

@thanhlecongg

thanhlecongg commented Aug 8, 2024

Thanks for your quick response and clear explanation. I wonder whether we really need to stack all of the model outputs, or whether keeping only the top-k is enough? If only the top-k is needed, memory usage could be much lower.
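
A rough sketch of that direction, under the assumption (not something this PR does) that only the last-position scores of each sub-batch are needed by the beam scorer, so each sub-batch output could be reduced to a small top-k slice before anything is concatenated:

import torch

def reduce_sub_batch(sub_logits: torch.Tensor, k: int = 10):
    # `sub_logits` stands for model_output.logits of one sub-batch:
    # shape (sub_batch, seq_len, vocab_size).
    last_step = sub_logits[:, -1, :]                   # (sub_batch, vocab_size)
    topk_scores, topk_ids = last_step.topk(k, dim=-1)  # keep only k candidates
    return topk_scores, topk_ids                       # tiny vs. the full logits

scores, ids = reduce_sub_batch(torch.randn(1, 350, 50257))
print(scores.shape, ids.shape)  # torch.Size([1, 10]) torch.Size([1, 10])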

Development

Successfully merging this pull request may close these issues.

Have a beam search sub batch size to limit memory use