Unbreak torch export with static cache #33287
Conversation
Force-pushed from 63fbea6 to 8b6b7ee
Documentation added.

The repo consistency failure is irrelevant to this PR. Running […]
Thanks for fixing the issue @guangy10! Really appreciate it! I've checked that the static cache multi-GPU and compile tests are passing.

About the CI failure, can you rebase on the latest commit? That should solve it! Thanks
@guangy10 (sorry, I don't want to be your antagonist in your PRs, but this might be a source of future related problems. Please bear with me 🙏) In […] we are observing that the same doesn't happen in […]. Ideally, a more scalable solution would be implemented 🤗 I'm wondering whether one of the options below is possible: […]

WDYT? On top of this, I think we […] (@SunMarc, this should enable us to get rid of the […])
That's right! Yeah, it would be better to put the cache on the correct device at initialization time. This should be feasible as long as we can get the […]
This workaround works for me. Note the reply from @bdhirsh here: #33179 (comment). They plan to look into this issue on the PyTorch side to avoid any workaround, but it's unclear how long it will take.
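To make the initialization-time device placement discussed above concrete, here is a minimal sketch of the idea; the class and parameter names are hypothetical (not `transformers` API), and the buffers follow the usual (batch, heads, seq, head_dim) layout:

```
import torch

class DeviceAwareStaticCache:
    """Hypothetical illustration: allocate per-layer KV buffers on their final
    devices at construction, so update() never needs a .to(device) call that
    would be baked into a torch.export graph."""

    def __init__(self, num_layers, shape, layer_devices, dtype=torch.float16):
        # layer_devices[i] is the device that layer i's weights live on
        self.key_cache = [torch.zeros(shape, dtype=dtype, device=layer_devices[i])
                          for i in range(num_layers)]
        self.value_cache = [torch.zeros(shape, dtype=dtype, device=layer_devices[i])
                            for i in range(num_layers)]

    def update(self, key_states, value_states, layer_idx, cache_position):
        # No device move here: the buffers were created in the right place.
        k, v = self.key_cache[layer_idx], self.value_cache[layer_idx]
        k.index_copy_(2, cache_position, key_states)
        v.index_copy_(2, cache_position, value_states)
        return k, v
```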
Force-pushed from 8b6b7ee to 8aa6e33
Rebased on the latest commit.
Well, first, it's not an edge case for […]
Force-pushed from 8aa6e33 to 300b0ed
@guangy10 happy to merge with a TODO, if you believe #33303 wouldn't fix this issue :) (our current initialization doesn't handle multi-device setups, so that's why we have a […])
@gante @SunMarc If the export test is fixed in #33303 and it can land soon, we can discard this workaround and merge #33303 instead. What matters is that we want to start the […]
#33303 will be merged today!
Summary:
bypass-github-export-checks

[Done] ~~Require PR [Make StaticCache configurable at model construct time](huggingface/transformers#32830) in order to export, lower and run the 🤗 model OOTB.~~

[Done] ~~Require huggingface/transformers#33303 or huggingface/transformers#33287 to be merged to 🤗 `transformers` to resolve the export issue introduced by huggingface/transformers#32543.~~

-----------

Now we can take the integration point from 🤗 `transformers` to lower compatible models to ExecuTorch OOTB.

- This PR creates a simple script with an XNNPACK recipe.
- This PR also creates a secret `EXECUTORCH_HT_TOKEN` to allow downloading checkpoints in the CI.
- This PR connects the 🤗 "Export to ExecuTorch" e2e workflow to ExecuTorch CI.

### Instructions to run the demo:

1. Run export_hf_model.py to lower gemma-2b to ExecuTorch:
```
python -m extension.export_util.export_hf_model -hfm "google/gemma-2b" # The model is exported with static dims and static KV cache
```
2. Run tokenizer.py to generate the binary format for the ExecuTorch runtime:
```
python -m extension.llm.tokenizer.tokenizer -t <path_to_downloaded_gemma_checkpoint_dir>/tokenizer.model -o tokenizer.bin
```
3. Build the llm runner by following [step 4](https://github.com/pytorch/executorch/tree/main/examples/models/llama2#step-4-run-on-your-computer-to-validate) of this guide.
4. Run the lowered model:
```
cmake-out/examples/models/llama2/llama_main --model_path=gemma.pte --tokenizer_path=tokenizer.bin --prompt="My name is"
```

OOTB output and perf:
```
I 00:00:00.003110 executorch:cpuinfo_utils.cpp:62] Reading file /sys/devices/soc0/image_version
I 00:00:00.003360 executorch:cpuinfo_utils.cpp:78] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.003380 executorch:cpuinfo_utils.cpp:158] Number of efficient cores 4
I 00:00:00.003384 executorch:main.cpp:65] Resetting threadpool with num threads = 6
I 00:00:00.014716 executorch:runner.cpp:51] Creating LLaMa runner: model_path=gemma.pte, tokenizer_path=tokenizer_gemma.bin
I 00:00:03.065359 executorch:runner.cpp:66] Reading metadata from model
I 00:00:03.065391 executorch:metadata_util.h:43] get_n_bos: 1
I 00:00:03.065396 executorch:metadata_util.h:43] get_n_eos: 1
I 00:00:03.065399 executorch:metadata_util.h:43] get_max_seq_len: 123
I 00:00:03.065402 executorch:metadata_util.h:43] use_kv_cache: 1
I 00:00:03.065404 executorch:metadata_util.h:41] The model does not contain use_sdpa_with_kv_cache method, using default value 0
I 00:00:03.065405 executorch:metadata_util.h:43] use_sdpa_with_kv_cache: 0
I 00:00:03.065407 executorch:metadata_util.h:41] The model does not contain append_eos_to_prompt method, using default value 0
I 00:00:03.065409 executorch:metadata_util.h:43] append_eos_to_prompt: 0
I 00:00:03.065411 executorch:metadata_util.h:41] The model does not contain enable_dynamic_shape method, using default value 0
I 00:00:03.065412 executorch:metadata_util.h:43] enable_dynamic_shape: 0
I 00:00:03.130388 executorch:metadata_util.h:43] get_vocab_size: 256000
I 00:00:03.130405 executorch:metadata_util.h:43] get_bos_id: 2
I 00:00:03.130408 executorch:metadata_util.h:43] get_eos_id: 1
My name is Melle. I am a 20 year old girl from Belgium. I am living in the southern part of Belgium. I am 165 cm tall and I weigh 45kg. I like to play sports like swimming, running and playing tennis. I am very interested in music and I like to listen to classical music. I like to sing and I can play the piano. I would like to go to the USA because I like to travel a lot. I am looking for a boy from the USA who is between 18 and 25 years old. I
PyTorchObserver {"prompt_tokens":4,"generated_tokens":118,"model_load_start_ms":1723685715497,"model_load_end_ms":1723685718612,"inference_start_ms":1723685718612,"inference_end_ms":1723685732965,"prompt_eval_end_ms":1723685719087,"first_token_ms":1723685719087,"aggregate_sampling_time_ms":182,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:17.482472 executorch:stats.h:70] Prompt Tokens: 4 Generated Tokens: 118
I 00:00:17.482475 executorch:stats.h:76] Model Load Time: 3.115000 (seconds)
I 00:00:17.482481 executorch:stats.h:86] Total inference time: 14.353000 (seconds) Rate: 8.221278 (tokens/second)
I 00:00:17.482483 executorch:stats.h:94] Prompt evaluation: 0.475000 (seconds) Rate: 8.421053 (tokens/second)
I 00:00:17.482485 executorch:stats.h:105] Generated 118 tokens: 13.878000 (seconds) Rate: 8.502666 (tokens/second)
I 00:00:17.482486 executorch:stats.h:113] Time to first generated token: 0.475000 (seconds)
I 00:00:17.482488 executorch:stats.h:120] Sampling time over 122 tokens: 0.182000 (seconds)
```

Pull Request resolved: #4723
Reviewed By: huydhn, kirklandsign
Differential Revision: D62543933
Pulled By: guangy10
fbshipit-source-id: 00401a39ba03d7383e4b284d25c8fc62a6695b34
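For context on what the "lower to ExecuTorch" step in the demo relies on, below is a minimal, hypothetical sketch of the capture stage: a wrapper with a fixed signature is traced once with `torch.export` on CPU, and the resulting program is what the ExecuTorch toolchain then lowers (e.g. to XNNPACK). The wrapper class and example shapes are assumptions, not the actual export_hf_model.py code:

```
import torch

class ExportWrapper(torch.nn.Module):
    """Hypothetical wrapper fixing the signature that torch.export captures."""

    def __init__(self, model):
        super().__init__()
        self.model = model  # assumed to hold a StaticCache internally

    def forward(self, input_ids, cache_position):
        # Static shapes + pre-allocated cache => no data-dependent control flow
        return self.model(input_ids=input_ids, cache_position=cache_position).logits

# Usage (assumption: `model` is a causal LM already configured with a static
# cache; export always happens on CPU, per the discussion in this PR):
# ep = torch.export.export(
#     ExportWrapper(model).eval(),
#     args=(torch.zeros(1, 1, dtype=torch.long), torch.tensor([0])),
# )
```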
What does this PR do?
Fix the issue introduced in #32543, which breaks `torch.export` use cases. The fix makes sense for `torch.export` because we always export on the "cpu" device at the moment; hence we avoid calling `.to(device=key_states.device)` on the cache during export, as sketched below.

Fixes # (issue)
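As an illustration of the approach (an assumed shape of the change, not the verbatim diff), the fix amounts to skipping the device move whenever the cache already lives where the incoming states are, which is always the case when exporting on "cpu":

```
import torch

def update_layer_cache(key_cache: torch.Tensor,
                       key_states: torch.Tensor,
                       cache_position: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: only move the cache when devices actually differ.
    # During torch.export everything is on "cpu", so no .to(device) call is
    # ever recorded in the exported graph.
    if key_cache.device != key_states.device:
        key_cache = key_cache.to(key_states.device)
    # Write the new key states into the static buffer at cache_position.
    key_cache.index_copy_(2, cache_position, key_states)
    return key_cache
```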
Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@SunMarc
@ArthurZucker
@dvrogozh