Enhances process_group_test #113

Merged 7 commits on Feb 21, 2025

allenwang28
Contributor

See #109 for more context. The goals were to:

  1. Test each collective on its own
  2. Test true resiliency behaviors

What does this PR do?

  • Removes _test_multi_pg and replaces it with individual run_{collective}_test functions
  • Makes each run_{collective}_test robust to different world sizes (i.e. 1, 2, or 3) so that it generalizes better
  • Introduces MultiPgBaseTest, a base test that spins up process groups and thread pool executors on setup (and destroys them on teardown) so they can be reused between test invocations.
  • MultiPgBaseTest introduces:
    • _run_parallel, a convenience function that runs a collective test across its process groups, and
    • _run_with_resiliency, a convenience function that runs a collective test, invokes failure modes, and checks for correct resiliency behaviors.
  • Moves all tests that used _test_multi_pg into GlooMultiPgTest, BabyGlooMultiPgTest, BabyNcclMultiPgTest and BabyNcclResiliencyTest
    • Note that BabyNcclMultiPgTest and BabyNcclResiliencyTest are kept separate so that the former can test with 2 GPUs and the latter with 3. While all collectives could theoretically be tested with only 2 GPUs, it seems a bit trivial to run with only one process group.
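
To make the structure described above concrete, here is a minimal sketch of the pattern, not the code from this PR. FakeProcessGroup, allreduce_sum, MultiPgBaseSketch and the two test classes are hypothetical names invented for illustration, and the failure injection is a simple flag rather than a real process group failure. It also assumes class-level setup and teardown, which the description above does not specify.

```python
import unittest
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


class FakeProcessGroup:
    """Hypothetical in-memory stand-in; NOT a torchft or torch.distributed class."""

    def __init__(self, rank: int, world_size: int) -> None:
        self.rank = rank
        self.world_size = world_size
        self.healthy = True

    def allreduce_sum(self, value: int) -> int:
        # Simplification: assume every rank contributes the same `value`.
        if not self.healthy:
            raise RuntimeError(f"rank {self.rank} is down")
        return value * self.world_size


def run_allreduce_test(pg: FakeProcessGroup) -> int:
    # Written against pg.world_size so it holds for world sizes 1, 2, or 3.
    result = pg.allreduce_sum(1)
    assert result == pg.world_size
    return result


class MultiPgBaseSketch(unittest.TestCase):
    WORLD_SIZE = 2

    @classmethod
    def setUpClass(cls) -> None:
        # Created once per class (an assumption) so every test can reuse them.
        cls.pgs = [FakeProcessGroup(r, cls.WORLD_SIZE) for r in range(cls.WORLD_SIZE)]
        cls.executor = ThreadPoolExecutor(max_workers=cls.WORLD_SIZE)

    @classmethod
    def tearDownClass(cls) -> None:
        cls.executor.shutdown(wait=True)

    def _run_parallel(self, fn: Callable[[FakeProcessGroup], int]) -> List[int]:
        # Submit one call per process group; .result() re-raises worker errors.
        futures = [self.executor.submit(fn, pg) for pg in self.pgs]
        return [f.result() for f in futures]

    def _run_with_resiliency(
        self, fn: Callable[[FakeProcessGroup], int], failed_rank: int = 0
    ) -> None:
        self._run_parallel(fn)  # healthy baseline
        self.pgs[failed_rank].healthy = False  # inject a (fake) failure
        try:
            with self.assertRaises(RuntimeError):
                self._run_parallel(fn)
        finally:
            self.pgs[failed_rank].healthy = True  # recover for later tests
        self._run_parallel(fn)  # the collective works again after recovery


class TwoRankSketchTest(MultiPgBaseSketch):
    WORLD_SIZE = 2

    def test_allreduce(self) -> None:
        self.assertEqual(self._run_parallel(run_allreduce_test), [2, 2])


class ThreeRankResiliencySketchTest(MultiPgBaseSketch):
    WORLD_SIZE = 3

    def test_allreduce_resiliency(self) -> None:
        self._run_with_resiliency(run_allreduce_test, failed_rank=1)
```

Saved as a test module, these classes can be collected by pytest or by python -m unittest. In the sketch, only the failed rank raises; a real collective would likely surface errors on the surviving ranks as well.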

Note that the test time increases from ~17s to ~43.48s:


==============================  60 passed in 43.48s ============================== 

mostly due to the resiliency tests taking a while to run. Without the resiliency tests, it takes 24.3 seconds:

$ pytest torchft/process_group_test.py -k "not resiliency"  && pkill -U $(whoami) python
====================================== test session starts ======================================
platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/allencwang/workspace/torchft
configfile: pytest.ini
plugins: typeguard-2.13.3
collected 60 items / 23 deselected / 37 selected                                                

torchft/process_group_test.py .....................................                       [100%]

============================== 37 passed, 23 deselected in 24.29s ==============================

facebook-github-bot added the CLA Signed label Feb 20, 2025
allenwang28 linked an issue Feb 21, 2025 that may be closed by this pull request
allenwang28 merged commit c782f4e into pytorch:main Feb 21, 2025
6 checks passed
allenwang28 deleted the coll_test branch February 21, 2025 16:54
Linked issue: process_group_test - Enhance fault tolerance collective tests