Conversation

@kashif
Contributor

@kashif kashif commented Oct 25, 2025

What does this PR do?

Adds two context parallel tests for the CI

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@kashif kashif requested a review from SunMarc October 25, 2025 14:04
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@stas00
Contributor

stas00 commented Oct 25, 2025

Looks fantastic, Kashif! Thank you for adding the missing tests. Now I don't need to worry about breaking your CP integration while integrating ALST/UlyssesSP.

@stas00
Contributor

stas00 commented Oct 26, 2025

Oh, one request: since we will now have more than one context parallel backend, you might want to rename the file and class to include "torch". Via #41832 and huggingface/accelerate#3817 we will now have `pc.cp_backend = {torch|deepspeed}`, with torch being the default.
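For illustration, a hedged sketch of how such a backend switch might look in a config file, mirroring the `pc.cp_backend = {torch|deepspeed}` note above (the exact key name and YAML shape here are assumptions based on this comment, not the merged API):

```yaml
parallelism_config:
  parallelism_config_cp_size: 2
  # assumed key name; torch is the default, deepspeed would select ALST/UlyssesSP
  parallelism_config_cp_backend: torch
```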

training_args = parser.parse_args_into_dataclasses()[0]

# Use SmolLM (small Llama-based model that works with CP)
model_name = "HuggingFaceTB/SmolLM-135M"
Contributor

It's interesting that "hf-internal-testing/tiny-random-LlamaForCausalLM" fails here:

stderr: [rank0]:   File "/code/users/stas/github/transformers-alst-integration/src/transformers/models/llama/modeling_llama.py", line 292, in forward
stderr: [rank0]:     attn_output = self.o_proj(attn_output)
stderr: [rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^
stderr: [rank0]:   File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
stderr: [rank0]:     return self._call_impl(*args, **kwargs)
stderr: [rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
stderr: [rank0]:   File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
stderr: [rank0]:     return forward_call(*args, **kwargs)
stderr: [rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
stderr: [rank0]:   File "/home/yak/miniconda3/envs/dev/lib/python3.12/site-packages/torch/nn/modules/linear.py", line 125, in forward
stderr: [rank0]:     return F.linear(input, self.weight, self.bias)
stderr: [rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
stderr: [rank0]: RuntimeError: mat1 and mat2 shapes cannot be multiplied (64x32 and 16x16)
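The RuntimeError above is the generic matmul shape rule surfacing inside `o_proj`: `F.linear(x, W, b)` computes `x @ W.T`, so the last dimension of the input must equal the layer's `in_features`. Here the attention output arrives as (64, 32) while the tiny model's `o_proj` expects `in_features=16` (why the activation ends up at 32 under CP is not diagnosed in this thread). A minimal stdlib-only sketch of the shape check, with shapes taken from the traceback:

```python
def linear_out_shape(x_shape, weight_shape):
    """Output shape of F.linear(x, W): x @ W.T, where W is (out_features, in_features)."""
    out_features, in_features = weight_shape
    if x_shape[-1] != in_features:
        # Mirrors the wording of torch's error for a 2D matmul mismatch.
        raise RuntimeError(
            f"mat1 and mat2 shapes cannot be multiplied "
            f"({x_shape[-2]}x{x_shape[-1]} and {in_features}x{out_features})"
        )
    return (*x_shape[:-1], out_features)

# A compatible input passes through unchanged in width:
print(linear_out_shape((64, 16), (16, 16)))  # (64, 16)

# The case from the traceback: a (64, 32) activation into in_features=16 fails.
try:
    linear_out_shape((64, 32), (16, 16))
except RuntimeError as e:
    print(e)  # mat1 and mat2 shapes cannot be multiplied (64x32 and 16x16)
```

SmolLM-135M avoids this because its hidden size is large and divisible in a way the CP sharding expects, which is presumably why the swap below fixed it.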

Contributor Author

Yes... I got the same error, so I switched to "HuggingFaceTB/SmolLM-135M".

@ydshieh
Collaborator

ydshieh commented Oct 27, 2025

Thanks a lot. I will check on our multi-gpu CI runners too.

@stas00
Contributor

stas00 commented Oct 27, 2025

Would it be possible to exercise the rest of the torch/fsdp PC config here:

parallelism_config:
  parallelism_config_dp_replicate_size: 1
  parallelism_config_dp_shard_size: 1
  parallelism_config_tp_size: 1
  parallelism_config_cp_size: 2
  # additionally:
  parallelism_config_cp_comm_strategy: alltoall

That is, to add all the config values for the Torch PC plugin, like `cp_comm_strategy`? It also helps with tests-as-documentation/examples.

@stas00
Contributor

stas00 commented Oct 27, 2025

Kashif, I'd suggest dropping the basic test, since it gets repeated almost identically in the 2nd test - will save $$ and time.

@stas00
Contributor

stas00 commented Oct 28, 2025

@kashif, I took your test as a foundation for mine and worked on it further, while also making it self-contained, without needing to add more files to the already bloated repo.

https://github.com/huggingface/transformers/pull/41832/files#diff-bfd2f0924ca5096a9cf7bff5929081cf6552b70df8bb40632d3b11273c9554af

Please feel free to re-use it, with the note that I made some tweaks to testing_utils.py that it relies on. Your version of my lib is quite outdated; I have made quite a few changes over recent years. If you want to sync back, the latest version is here: https://github.com/stas00/ml-engineering/blob/master/testing/testing_utils.py

@kashif
Contributor Author

kashif commented Nov 1, 2025

@stas00 I've made the script self-contained and incorporated your suggestions.

@kashif
Contributor Author

kashif commented Nov 1, 2025

@ydshieh we should ideally run this only on the multi-gpu setup; is that possible?

@stas00
Contributor

stas00 commented Nov 1, 2025

Looks great, Kashif.

@kashif kashif requested a review from ydshieh November 4, 2025 08:51
Member

@SunMarc SunMarc left a comment


Thanks for adding this! Just a minor nit.

@kashif
Contributor Author

kashif commented Nov 5, 2025

@SunMarc moved to fsdp folder

@SunMarc SunMarc merged commit 0c4a202 into main Nov 5, 2025
16 checks passed
@SunMarc SunMarc deleted the cp-ci-tests branch November 5, 2025 10:40
@ydshieh
Collaborator

ydshieh commented Nov 5, 2025

> @ydshieh we should ideally run this only on the multi-gpu setup, is that possible?

Yes, we have multi-gpu runners (A10). I will check how it goes with that.

@ydshieh
Collaborator

ydshieh commented Nov 5, 2025

@kashif The test runs and passes on our multi-gpu A10 runner, great!

@kashif
Contributor Author

kashif commented Nov 5, 2025

Amazing, thank you @ydshieh!

Abdennacer-Badaoui pushed a commit to Abdennacer-Badaoui/transformers that referenced this pull request Nov 10, 2025
* intial

* simplify tests

* add test_cp_equivalence

* removed fsdp_transformer_layer_cls_to_wrap

* use DataCollatorForLanguageModeling

* remove use_cache=False.

* changes from review

* make script self contained

* moved to fsdp folder

* fix class name
* fix class name
