[tests] Add Context-parallel CI tests #41860
Conversation
Looks fantastic, Kashif! Thank you for adding the missing tests. Now I don't need to worry about breaking your CP integration while integrating ALST/UlyssesSP.
Oh, one request: since we will now have more than one context-parallel implementation, you might want to rename the file and class to include
```python
training_args = parser.parse_args_into_dataclasses()[0]
```

```python
# Use SmolLM (small Llama-based model that works with CP)
model_name = "HuggingFaceTB/SmolLM-135M"
```
It's interesting that "hf-internal-testing/tiny-random-LlamaForCausalLM" fails here:

```
[rank0]: File "/code/users/stas/github/transformers-alst-integration/src/transformers/models/llama/modeling_llama.py", line 292, in forward
[rank0]:   attn_output = self.o_proj(attn_output)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank0]:   return self._call_impl(*args, **kwargs)
[rank0]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
[rank0]:   return forward_call(*args, **kwargs)
[rank0]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/torch/nn/modules/linear.py", line 125, in forward
[rank0]:   return F.linear(input, self.weight, self.bias)
[rank0]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: mat1 and mat2 shapes cannot be multiplied (64x32 and 16x16)
```
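The error is a plain linear-layer shape mismatch: `o_proj` receives an activation whose last dimension no longer matches the weight's `in_features`. A minimal, dependency-free sketch of the shape rule that `F.linear` enforces (the helper name and the 2-D shapes are illustrative, not taken from the model):

```python
def check_linear_shapes(input_shape, weight_shape):
    """Mimic the shape rule behind torch.nn.functional.linear:
    the input's last dim must equal the weight's in_features
    (weights are stored as [out_features, in_features]).
    Assumes 2-D input shapes for simplicity."""
    out_features, in_features = weight_shape
    if input_shape[-1] != in_features:
        raise RuntimeError(
            f"mat1 and mat2 shapes cannot be multiplied "
            f"({input_shape[0]}x{input_shape[1]} and "
            f"{in_features}x{out_features})"
        )
    return (input_shape[0], out_features)

# The shapes from the traceback: a 64x32 activation hitting a 16x16 projection.
try:
    check_linear_shapes((64, 32), (16, 16))
except RuntimeError as err:
    print(err)  # mat1 and mat2 shapes cannot be multiplied (64x32 and 16x16)
```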
Yes, I got the same error, so I switched to "HuggingFaceTB/SmolLM-135M".
Thanks a lot. I will check on our multi-gpu CI runners too.
Would it be possible to exercise the rest of the torch/fsdp PC config here? That is, to add all config values for the Torch PC plugin, like
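One way to exercise every plugin value without duplicating test bodies is to generate the combinations from a single table. A sketch using only the stdlib; the option names below are made-up placeholders, not the real Torch CP plugin flags:

```python
from itertools import product

# Placeholder option names -- NOT the real plugin flags; substitute the
# actual Torch CP plugin config keys when wiring this into the test.
CP_OPTIONS = {
    "cp_comm_strategy": ["allgather", "alltoall"],
    "cp_shard_rotation": [True, False],
}

def config_matrix(options):
    """Expand a dict of option-value lists into one dict per combination,
    so a test loop can exercise every config value exactly once."""
    keys = list(options)
    return [dict(zip(keys, combo)) for combo in product(*(options[k] for k in keys))]

for cfg in config_matrix(CP_OPTIONS):
    print(cfg)  # 4 combinations in total
```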
Kashif, I'd suggest dropping the basic test, since it is repeated almost identically in the second test; removing it will save money and CI time.
@kashif, I took your test as a foundation for mine and worked on it further, while also making it self-contained, without needing to add more files to the already bloated repo. Please feel free to re-use it, with the note that I made some tweaks to
@stas00 I've made the script self-contained and incorporated your suggestions.
@ydshieh we should ideally run this only on the multi-gpu setup; is that possible?
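For reference, transformers gates such tests with decorators from `transformers.testing_utils` (e.g. `require_torch_multi_gpu`). A self-contained sketch of the same idea, using only the stdlib plus an optional torch import; the decorator name here is my own, not the library's:

```python
import unittest

def device_count():
    """Report visible GPUs; falls back to 0 when torch isn't installed.
    Stand-in for torch.cuda.device_count()."""
    try:
        import torch
        return torch.cuda.device_count()
    except ImportError:
        return 0

def require_multi_gpu(test_case):
    """Skip a test (or whole TestCase) unless at least 2 GPUs are visible --
    the same idea as transformers' require_torch_multi_gpu decorator."""
    return unittest.skipUnless(device_count() >= 2, "test requires multiple GPUs")(test_case)

@require_multi_gpu
class TestContextParallel(unittest.TestCase):
    def test_cp_runs(self):
        # Placeholder body: the real test would launch the CP training script.
        self.assertTrue(True)
```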
Looks great, Kashif. |
SunMarc left a comment:
Thanks for adding this! Just a minor nit.
@SunMarc moved to the fsdp folder.
Yes, we have multi-gpu runners (A10). I will check how it goes with that.
@kashif The test runs and passes on our multi-gpu A10 runner, great!
Amazing, thank you @ydshieh!
* initial
* simplify tests
* add test_cp_equivalence
* removed fsdp_transformer_layer_cls_to_wrap
* use DataCollatorForLanguageModeling
* remove use_cache=False
* changes from review
* make script self contained
* moved to fsdp folder
* fix class name
What does this PR do?
Adds two context parallel tests for the CI
Before submitting
* Did you read the contributor guideline, Pull Request section?
* Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
* Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.