
Unbreak torch export with static cache #33287

Closed

Conversation

guangy10
Contributor

@guangy10 guangy10 commented Sep 4, 2024

What does this PR do?

Fixes the issue introduced in #32543, which breaks torch.export use cases. The fix makes sense for torch.export because we currently always export on the "cpu" device, so we avoid calling .to(device=key_states.device) on the cache during export.
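For illustration, a minimal sketch of the idea behind the workaround (not the exact diff in this PR; the helper name is made up): guard the device move so the redundant `.to()` is skipped whenever the cache already lives on the right device, which is always the case when exporting on "cpu".

```python
import torch

def _maybe_move_cache(cache_tensor: torch.Tensor, key_states: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: only move the cache when the devices actually differ.
    # When exporting on "cpu", both tensors share a device, so no .to() call
    # is traced into the exported graph at all.
    if cache_tensor.device != key_states.device:
        cache_tensor = cache_tensor.to(key_states.device)
    return cache_tensor
```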

Fixes # (issue)

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@SunMarc
@ArthurZucker
@dvrogozh

@guangy10
Contributor Author

guangy10 commented Sep 4, 2024

Verified that `RUN_SLOW=1 pytest tests/utils/test_cache_utils.py -k test_static_cache_exportability` passes.
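For readers unfamiliar with the test, here is a rough, self-contained sketch of the kind of thing an exportability check exercises (a toy module, not the actual transformers test): a pre-allocated cache buffer mutated in place, exported with torch.export on CPU.

```python
import torch
from torch.export import export

class TinyCachedModel(torch.nn.Module):
    """Toy stand-in for a model carrying a pre-allocated (static) cache buffer."""

    def __init__(self, hidden: int = 8, max_len: int = 16):
        super().__init__()
        self.proj = torch.nn.Linear(hidden, hidden)
        # Pre-allocated cache storage, analogous in spirit to StaticCache.
        self.register_buffer("cache", torch.zeros(max_len, hidden))

    def forward(self, x: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        out = self.proj(x)
        # In-place cache update keeps all shapes static for export.
        self.cache.index_copy_(0, pos, out)
        return out

program = export(TinyCachedModel(), (torch.randn(1, 8), torch.tensor([0])))
print(program)
```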

@guangy10
Contributor Author

guangy10 commented Sep 4, 2024

Documentation added

@guangy10
Contributor Author

guangy10 commented Sep 4, 2024

The repo consistency failure is unrelated to this PR. Running `make fix-copies` from the latest main (ecd61c62862f925a18b4f063dc17fcaf01826e25) generates the same patches.

@SunMarc SunMarc self-requested a review September 4, 2024 11:08
Member

@SunMarc SunMarc left a comment


Thanks for fixing the issue @guangy10! Really appreciate it! I've checked that the static cache multi-GPU and compile tests are passing.

@SunMarc SunMarc requested a review from ArthurZucker September 4, 2024 11:37
@SunMarc
Member

SunMarc commented Sep 4, 2024

About the CI failure, can you rebase on the latest commit? This should solve it! Thanks

@gante
Member

gante commented Sep 4, 2024

@guangy10 (sorry, I don't want to be your antagonist in your PRs, but this might be a source of future related problems. Please bear with me 🙏 )

In torch's original `to` docs, we can read: "If the self Tensor already has the correct torch.dtype and torch.device, then self is returned." In other words, a harmless operation: `self = self`.

We are observing that the same doesn't happen in torch.export. As such, this is an ad hoc fix, and we will likely need similar fixes in other places. On top of that, because the change is not intuitive versus base torch, it is (correctly) accompanied by a long comment explaining why it is needed. Both problems apply outside of transformers as well.
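(As a side note for readers, the eager-mode behaviour quoted above is easy to check; the snippet below is purely illustrative and not part of this PR:)

```python
import torch

t = torch.randn(2, 2)              # already on the default device and dtype
assert t.to(device=t.device) is t  # eager mode: the very same tensor object comes back
# Under torch.export, this thread reports that the call is not treated as a
# no-op, which is what the PR works around.
```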

Ideally, a more scalable solution would be implemented 🤗 I'm wondering whether one of the options below is possible:

  1. Handle this edge case in torch.export directly. If the device is correctly set before the `to` call, this would become a no-op.
  2. Alternatively, document this edge case on torch.export's side, so we can link to those docs instead of having a 4-line comment every time we apply this pattern.

WDYT?
(If that is not possible/too hard, of course we will accept this PR)


On top of this, I think we (transformers) can do a better job at setting devices at cache initialization time -- we could initialize the cache given how the model is mapped across devices, as opposed to taking a single device argument.

@SunMarc this should enable us to get rid of the to operations after a deprecation cycle, correct?

@SunMarc
Member

SunMarc commented Sep 4, 2024

> @SunMarc this should enable us to get rid of the to operations after a deprecation cycle, correct?

That's right! Yeah, it would be better to set the device at cache initialization time. This should be feasible as long as we can get the model or the device_map at init.
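To make the idea concrete, here is a hypothetical sketch (not the actual transformers API; class and parameter names are invented) of a cache that consumes a per-layer device map at construction time instead of a single device argument:

```python
import torch

class PerLayerDeviceCache:
    """Hypothetical static cache that places each layer's key/value buffers on
    the device that layer is mapped to, so no later .to() calls are needed."""

    def __init__(self, num_layers, shape, dtype, layer_device_map):
        self.key_cache, self.value_cache = [], []
        for layer_idx in range(num_layers):
            device = layer_device_map.get(layer_idx, "cpu")
            self.key_cache.append(torch.zeros(shape, dtype=dtype, device=device))
            self.value_cache.append(torch.zeros(shape, dtype=dtype, device=device))

# Illustrative mapping, e.g. derived from a model's hf_device_map:
cache = PerLayerDeviceCache(
    num_layers=2,
    shape=(1, 4, 16, 8),  # (batch, num_heads, max_seq_len, head_dim)
    dtype=torch.float32,
    layer_device_map={0: "cpu", 1: "cpu"},
)
```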

@dvrogozh
Contributor

dvrogozh commented Sep 4, 2024

This workaround works for me. Note the reply from @bdhirsh here: #33179 (comment). They plan to look into this issue on the PyTorch side to avoid any workaround, but it's unclear how long that will take.

@guangy10 guangy10 force-pushed the workaround_to_unbreak_export branch from 8b6b7ee to 8aa6e33 on September 4, 2024 19:33
@guangy10
Contributor Author

guangy10 commented Sep 4, 2024

Rebased on latest main.

@guangy10
Contributor Author

guangy10 commented Sep 4, 2024

> @guangy10 (sorry, I don't want to be your antagonist in your PRs, but this might be a source of future related problems. Please bear with me 🙏 )
>
> In torch's original `to` docs, we can read: "If the self Tensor already has the correct torch.dtype and torch.device, then self is returned." In other words, a harmless operation: `self = self`.
>
> We are observing that the same doesn't happen in torch.export. As such, this is an ad hoc fix, and we will likely need similar fixes in other places. On top of that, because the change is not intuitive versus base torch, it is (correctly) accompanied by a long comment explaining why it is needed. Both problems apply outside of transformers as well.
>
> Ideally, a more scalable solution would be implemented 🤗 I'm wondering whether one of the options below is possible:
>
>   1. Handle this edge case in torch.export directly. If the device is correctly set before the `to` call, this would become a no-op.
>   2. Alternatively, document this edge case on torch.export's side, so we can link to those docs instead of having a 4-line comment every time we apply this pattern.
>
> WDYT? (If that is not possible/too hard, of course we will accept this PR)
>
> On top of this, I think we (transformers) can do a better job at setting devices at cache initialization time -- we could initialize the cache given how the model is mapped across devices, as opposed to taking a single device argument.
>
> @SunMarc this should enable us to get rid of the to operations after a deprecation cycle, correct?

Well, first, it's not an edge case for torch.export. Setting the device explicitly doesn't seem to help. As you can see in #32830, the device is already set at init time. To unblock, I think we should just merge this workaround since it passes both the export and compile tests, and leave a TODO for a proper fix in the future. Your thoughts? CC: @SunMarc @gante

@guangy10 guangy10 force-pushed the workaround_to_unbreak_export branch from 8aa6e33 to 300b0ed on September 4, 2024 21:05
@guangy10 guangy10 requested review from dvrogozh and SunMarc September 4, 2024 21:05
@gante
Member

gante commented Sep 5, 2024

@guangy10 happy to merge with a TODO, if you believe #33303 wouldn't fix this issue :)

(our current initialization doesn't handle multi-device setups, so that's why we have a `to` operation. #33303 should fix it and remove the need to have a `to` op. Nevertheless, there are other places in our codebase where we assume `to` operations to be harmless, which may clash with torch.export 🤔 )

@guangy10
Contributor Author

guangy10 commented Sep 6, 2024

> @guangy10 happy to merge with a TODO, if you believe #33303 wouldn't fix this issue :)
>
> (our current initialization doesn't handle multi-device setups, so that's why we have a `to` operation. #33303 should fix it and remove the need to have a `to` op. Nevertheless, there are other places in our codebase where we assume `to` operations to be harmless, which may clash with torch.export 🤔 )

@gante @SunMarc If the export test is fixed in #33303 and it can land soon, we can discard this workaround and merge #33303 instead. What matters is that we want to start the transformers integration work in ExecuTorch once #32830 is merged, which requires the export test to be working again.

@guangy10
Contributor Author

@gante If #33303 needs more time to review, can we merge this workaround to unblock?

@ArthurZucker
Collaborator

#33303 will be merged today!

facebook-github-bot pushed a commit to pytorch/executorch that referenced this pull request Sep 14, 2024
Summary:
bypass-github-export-checks

[Done] ~~Require PR [Make StaticCache configurable at model construct time](huggingface/transformers#32830) in order to export, lower and run the 🤗 model OOTB.~~
[Done] ~~Require huggingface/transformers#33303 or huggingface/transformers#33287 to be merged into 🤗 `transformers` to resolve the export issue introduced by huggingface/transformers#32543.~~

-----------

Now we can take the integration point from 🤗 `transformers` to lower compatible models to ExecuTorch OOTB.
  - This PR creates a simple script with the XNNPACK recipe.
  - This PR also creates a secret `EXECUTORCH_HT_TOKEN` to allow downloading checkpoints in the CI.
  - This PR connects the 🤗 "Export to ExecuTorch" e2e workflow to ExecuTorch CI

### Instructions to run the demo:

1. Run export_hf_model.py to lower gemma-2b to ExecuTorch:
```
python -m extension.export_util.export_hf_model -hfm "google/gemma-2b" # The model is exported with static dims and a static KV cache
```
2. Run tokenizer.py to generate the binary format for the ExecuTorch runtime:
```
python -m extension.llm.tokenizer.tokenizer -t <path_to_downloaded_gemma_checkpoint_dir>/tokenizer.model -o tokenizer.bin
```
3. Build the LLM runner by following [step 4](https://github.com/pytorch/executorch/tree/main/examples/models/llama2#step-4-run-on-your-computer-to-validate) of this guide.

4. Run the lowered model:
```
cmake-out/examples/models/llama2/llama_main --model_path=gemma.pte --tokenizer_path=tokenizer.bin --prompt="My name is"
```
OOTB output and perf
```
I 00:00:00.003110 executorch:cpuinfo_utils.cpp:62] Reading file /sys/devices/soc0/image_version
I 00:00:00.003360 executorch:cpuinfo_utils.cpp:78] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.003380 executorch:cpuinfo_utils.cpp:158] Number of efficient cores 4
I 00:00:00.003384 executorch:main.cpp:65] Resetting threadpool with num threads = 6
I 00:00:00.014716 executorch:runner.cpp:51] Creating LLaMa runner: model_path=gemma.pte, tokenizer_path=tokenizer_gemma.bin
I 00:00:03.065359 executorch:runner.cpp:66] Reading metadata from model
I 00:00:03.065391 executorch:metadata_util.h:43] get_n_bos: 1
I 00:00:03.065396 executorch:metadata_util.h:43] get_n_eos: 1
I 00:00:03.065399 executorch:metadata_util.h:43] get_max_seq_len: 123
I 00:00:03.065402 executorch:metadata_util.h:43] use_kv_cache: 1
I 00:00:03.065404 executorch:metadata_util.h:41] The model does not contain use_sdpa_with_kv_cache method, using default value 0
I 00:00:03.065405 executorch:metadata_util.h:43] use_sdpa_with_kv_cache: 0
I 00:00:03.065407 executorch:metadata_util.h:41] The model does not contain append_eos_to_prompt method, using default value 0
I 00:00:03.065409 executorch:metadata_util.h:43] append_eos_to_prompt: 0
I 00:00:03.065411 executorch:metadata_util.h:41] The model does not contain enable_dynamic_shape method, using default value 0
I 00:00:03.065412 executorch:metadata_util.h:43] enable_dynamic_shape: 0
I 00:00:03.130388 executorch:metadata_util.h:43] get_vocab_size: 256000
I 00:00:03.130405 executorch:metadata_util.h:43] get_bos_id: 2
I 00:00:03.130408 executorch:metadata_util.h:43] get_eos_id: 1
My name is Melle. I am a 20 year old girl from Belgium. I am living in the southern part of Belgium. I am 165 cm tall and I weigh 45kg. I like to play sports like swimming, running and playing tennis. I am very interested in music and I like to listen to classical music. I like to sing and I can play the piano. I would like to go to the USA because I like to travel a lot. I am looking for a boy from the USA who is between 18 and 25 years old. I
PyTorchObserver {"prompt_tokens":4,"generated_tokens":118,"model_load_start_ms":1723685715497,"model_load_end_ms":1723685718612,"inference_start_ms":1723685718612,"inference_end_ms":1723685732965,"prompt_eval_end_ms":1723685719087,"first_token_ms":1723685719087,"aggregate_sampling_time_ms":182,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:17.482472 executorch:stats.h:70] 	Prompt Tokens: 4    Generated Tokens: 118
I 00:00:17.482475 executorch:stats.h:76] 	Model Load Time:		3.115000 (seconds)
I 00:00:17.482481 executorch:stats.h:86] 	Total inference time:		14.353000 (seconds)		 Rate: 	8.221278 (tokens/second)
I 00:00:17.482483 executorch:stats.h:94] 		Prompt evaluation:	0.475000 (seconds)		 Rate: 	8.421053 (tokens/second)
I 00:00:17.482485 executorch:stats.h:105] 		Generated 118 tokens:	13.878000 (seconds)		 Rate: 	8.502666 (tokens/second)
I 00:00:17.482486 executorch:stats.h:113] 	Time to first generated token:	0.475000 (seconds)
I 00:00:17.482488 executorch:stats.h:120] 	Sampling time over 122 tokens:	0.182000 (seconds)
```

Pull Request resolved: #4723

Reviewed By: huydhn, kirklandsign

Differential Revision: D62543933

Pulled By: guangy10

fbshipit-source-id: 00401a39ba03d7383e4b284d25c8fc62a6695b34