@sanketpurandare (Contributor) commented Nov 4, 2025

Real run: `torchrun --standalone --nproc-per-node 8 examples/example_ds3_pp.py`
Fake run: `torchrun --standalone --nproc-per-node 4 examples/example_ds3_pp.py --fake-evaluate`

@meta-cla bot added the CLA Signed label Nov 4, 2025
@sanketpurandare requested review from ezyang, fmassa, wconstab and xmfan and removed request for fmassa November 4, 2025 01:47
2: [2, 6],
3: [3, 7],
}
if fake_evaluate:
Contributor

Is it important to maintain two paths, one for 8 stages and one for 4 stages? If not, it'd be nice to clean this up and keep only one code path.

Member

Second this, just get an 8-GPU devserver pls

Contributor Author

Unified the code path
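
For readers following the thread: the `2: [2, 6]` and `3: [3, 7]` entries in the diff context are consistent with a looped placement in which rank `r` holds stages `r` and `r + world_size`. Below is a minimal sketch of how one code path could derive such a mapping from the world size instead of branching on `fake_evaluate`. The names `build_rank_to_stages` and `stages_per_rank` are hypothetical and not taken from the PR, and the merged change may have unified the paths differently.

```python
def build_rank_to_stages(world_size: int, stages_per_rank: int = 2) -> dict[int, list[int]]:
    """Hypothetical helper: map each pipeline rank to the stage indices it owns.

    Assumes a looped layout where rank r owns stages r, r + world_size,
    r + 2 * world_size, and so on. This is an illustration only, not the
    code from the PR.
    """
    if world_size < 1 or stages_per_rank < 1:
        raise ValueError("world_size and stages_per_rank must be positive")
    return {
        rank: [rank + k * world_size for k in range(stages_per_rank)]
        for rank in range(world_size)
    }


# The same call covers both launch sizes mentioned in the PR description.
four_rank_map = build_rank_to_stages(world_size=4)
eight_rank_map = build_rank_to_stages(world_size=8)
```

With `world_size=4` this reproduces the entries visible in the diff (rank 2 owns stages 2 and 6, rank 3 owns stages 3 and 7), so a hard-coded dictionary per configuration would not be needed under this assumption.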

zhxchen17 added a commit to zhxchen17/autoparallel that referenced this pull request Nov 4, 2025
Summary:
The issue from torch nightly has been fixed for the new export API. Relanding.

Test Plan:
Also tested on meta-pytorch#227
```
=================================================================================== test session starts ===================================================================================
platform linux -- Python 3.12.11, pytest-7.3.2, pluggy-1.6.0
rootdir: /data/users/zhxchen17/autoparallel
plugins: xdoctest-1.1.0, hypothesis-5.35.1, xdist-3.3.1, subtests-0.13.1, rerunfailures-14.0, flakefinder-1.1.0, cpp-2.3.0, anyio-4.10.0
collected 21 items

tests/test_aot_eager.py ..x                                                                                                                                                         [ 14%]
tests/test_api.py ....                                                                                                                                                              [ 33%]
tests/test_dtensor.py ....                                                                                                                                                          [ 52%]
tests/test_optimize_placement.py ........                                                                                                                                           [ 90%]
tests/test_ordered_sharding.py ..                                                                                                                                                   [100%]

======================================================================== 20 passed, 1 xfailed in 86.30s (0:01:26) =========================================================================
```
@sanketpurandare changed the title from "Enabling real PP run on 4 GPUs" to "Enabling real PP run on 8 GPUs" Nov 4, 2025
zhxchen17 added a commit to zhxchen17/autoparallel that referenced this pull request Nov 5, 2025
@xmfan xmfan merged commit c583870 into meta-pytorch:main Nov 5, 2025
6 of 8 checks passed
zhxchen17 added a commit to zhxchen17/autoparallel that referenced this pull request Nov 7, 2025
zhxchen17 added a commit to zhxchen17/autoparallel that referenced this pull request Nov 12, 2025
zhxchen17 added a commit that referenced this pull request Nov 12, 2025