@sanketpurandare sanketpurandare commented Oct 30, 2025

Repro:

  • Running graph-based pipeline runner: torchrun --nproc_per_node=4 examples/example_ds3_pp.py
  • Running graph passes isolated: python tests/test_graph_partition.py
    • Covers only the FSDP splitting pass for now
    • dI/dW splitting and multiplexing are not yet implemented (NYI) here

Placeholder description (Claude Code):

This PR adds comprehensive Pipeline Parallelism (PP) support to AutoParallel, including a new graph-based PP runner, FSDP collective splitting passes, and significant improvements to cost estimation and caching. The branch includes 1,621 insertions and 119 deletions across 15 files.

Major Features

Pipeline Parallelism Support

  • Graph PP Runner (Sanket Purandare, Simon Fan): Added graph_pp_runner.py with 417 lines implementing a complete pipeline parallel execution engine that operates on FX graphs
    • Supports arbitrary layers per stage configuration
    • Includes forward and backward pipeline scheduling with tlparse markers for profiling
    • Handles multi-stage traced execution
  • DSv3 Pipeline Example (Simon Fan): Added examples/example_ds3_pp.py (464 lines) demonstrating 8-stage pipeline parallelism with DeepSeek v3 model
    • Splits DSv3 into 8 stages before applying AutoParallel to the first stage
    • Includes tlparse markers for forward/backward pipeline profiling
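
As a rough illustration of the "arbitrary layers per stage" idea above, here is a minimal sketch in plain Python. The names `split_layers_into_stages` and `fill_drain_schedule` are hypothetical and do not come from graph_pp_runner.py; the all-forward-then-all-backward ordering is only one simple schedule, assumed here for illustration, and may differ from what the runner uses.

```python
from typing import List, Tuple


def split_layers_into_stages(
    num_layers: int, layers_per_stage: List[int]
) -> List[List[int]]:
    """Assign contiguous layer indices to pipeline stages.

    layers_per_stage[i] is the number of layers placed on stage i, so
    stages do not need to be the same size.
    """
    if any(n <= 0 for n in layers_per_stage):
        raise ValueError("every stage needs at least one layer")
    if sum(layers_per_stage) != num_layers:
        raise ValueError("layers_per_stage must sum to num_layers")
    stages: List[List[int]] = []
    start = 0
    for n in layers_per_stage:
        stages.append(list(range(start, start + n)))
        start += n
    return stages


def fill_drain_schedule(num_microbatches: int) -> List[Tuple[str, int]]:
    """Per-stage action list: run every forward, then every backward.

    Each action is ("F" or "B", microbatch index). This is an assumed,
    simplified ordering chosen for the sketch.
    """
    forwards = [("F", mb) for mb in range(num_microbatches)]
    backwards = [("B", mb) for mb in range(num_microbatches)]
    return forwards + backwards


if __name__ == "__main__":
    # Uneven split: 5 layers over 2 stages.
    print(split_layers_into_stages(5, [2, 3]))
    print(fill_drain_schedule(2))
```

An uneven list such as `[2, 3]` is what lets a heavier first or last stage (for example one that also holds embeddings or the output head) carry fewer transformer layers.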

Graph Transformation Passes

  • FSDP Collective Splitting (Sanket Purandare, Ivan Kobzarev): New split_fsdp_collectives.py pass (162 lines) to split all_gather prologue and reduce_scatter epilogue from FSDP graphs
    • Clears partitioner tags for proper handling
    • Supports multiple ag/rs operations per graph
  • dI/dW Graph Splitting (Brian Hirsh): New split_di_dw_graph.py pass (64 lines) to separate gradient computation for inputs (dI) and weights (dW)
  • Graph Multiplexing (Sanket Purandare): New graph_multiplex.py pass (105 lines) for combining multiple graph execution paths
  • Pass Utilities (Sanket Purandare et al.): New _passes/utils.py (93 lines) with shared utilities for graph transformations
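
To make the FSDP collective splitting idea concrete, the toy below separates a list of nodes into an all_gather prologue, a main body, and a reduce_scatter epilogue. It is a sketch on a hand-rolled `Node` type, not the code in split_fsdp_collectives.py: the real pass works on FX graphs, and the classification rule used here (all_gather fed only by graph inputs, reduce_scatter whose result is a graph output) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a graph node: name, op kind, input names."""
    name: str
    op: str
    inputs: Tuple[str, ...]


def split_fsdp_toy(
    nodes: Sequence[Node],
    graph_inputs: Sequence[str],
    graph_outputs: Sequence[str],
) -> Tuple[List[Node], List[Node], List[Node]]:
    """Partition nodes into (prologue, main, epilogue), preserving order."""
    inputs = set(graph_inputs)
    outputs = set(graph_outputs)
    prologue: List[Node] = []
    main: List[Node] = []
    epilogue: List[Node] = []
    for node in nodes:
        if node.op == "all_gather" and all(i in inputs for i in node.inputs):
            # Depends only on graph inputs, so it can run ahead of the body.
            prologue.append(node)
        elif node.op == "reduce_scatter" and node.name in outputs:
            # Produces a graph output, so it can run after the body.
            epilogue.append(node)
        else:
            main.append(node)
    return prologue, main, epilogue


if __name__ == "__main__":
    # A backward-like toy graph with two all_gathers and two reduce_scatters.
    toy = [
        Node("ag_w1", "all_gather", ("w1_shard",)),
        Node("ag_w2", "all_gather", ("w2_shard",)),
        Node("grad_h", "matmul", ("grad_out", "ag_w2")),
        Node("gw2", "matmul", ("h", "grad_out")),
        Node("gw1", "matmul", ("x", "grad_h")),
        Node("rs_w1", "reduce_scatter", ("gw1",)),
        Node("rs_w2", "reduce_scatter", ("gw2",)),
    ]
    pro, body, epi = split_fsdp_toy(
        toy,
        graph_inputs=["w1_shard", "w2_shard", "x", "h", "grad_out"],
        graph_outputs=["rs_w1", "rs_w2"],
    )
    print([n.name for n in pro], [n.name for n in body], [n.name for n in epi])
```

Splitting the collectives out this way gives a scheduler three separately runnable pieces, which is what allows communication to be placed independently of the compute body.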

Caching & Performance

  • Improved Caching (Francisco Massa, Sanket Purandare): Refactored caching system in optimize_sharding.py and api.py
    • PP Runner now only deals with graphs (separation of concerns)
    • Better cache invalidation and reuse strategies

Infrastructure & Testing

  • Test Enhancements (Simon Fan, Ivan Kobzarev): Updated test_graph_partition.py (54 insertions) and test_optimize_placement.py (43 insertions) with new test cases for graph passes
  • DSv3 Model Updates (Simon Fan): Enhanced _testing/models/dsv3.py with 80 additions for multi-stage pipeline support, including arg count assertions before boxed_run
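
The arg count assertion mentioned above can be pictured with a small sketch of a boxed calling convention, where the callee receives its arguments in a single list and empties it so the caller no longer holds references. `boxed_sum` and `checked_boxed_call` are hypothetical names for this illustration and are not taken from dsv3.py.

```python
from typing import Any, Callable, List


def boxed_sum(args: List[Any]) -> Any:
    """Toy boxed callee: unpack three arguments, then clear the list."""
    a, b, c = args
    args.clear()  # drop the caller-visible references early
    return a + b + c


def checked_boxed_call(
    fn: Callable[[List[Any]], Any], args: List[Any], expected_num_args: int
) -> Any:
    """Check the argument count before handing the list to a boxed callee."""
    assert len(args) == expected_num_args, (
        f"expected {expected_num_args} args, got {len(args)}"
    )
    return fn(args)


if __name__ == "__main__":
    boxed_args = [1, 2, 3]
    print(checked_boxed_call(boxed_sum, boxed_args, 3))
    print(boxed_args)  # the callee emptied the list
```

Checking the count up front turns a mismatched stage input list into an immediate, readable failure instead of an unpacking error deep inside the compiled graph call.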

Code Quality

  • API Refactoring (Simon Fan): Reorganized api.py (104 insertions, 69 deletions), moving PP functionality to the bottom of the file and improving code organization
  • Cleanup (Simon Fan): Removed unused example/test scripts and moved pipeline stage logic into dsv3.py
  • Linting (Edward Z. Yang, Simon Fan): Multiple linting passes and black formatting across the branch

Contributors

  • Sanket Purandare (@sanketpurandare): PP Runner, graph passes, caching refactor
  • Simon Fan (@xmfan): DSv3 PP example, multi-stage support, testing, code organization
  • Francisco Massa (@fmassa): Cost estimation improvements, caching
  • Ivan Kobzarev (@IvanKobzarev): FSDP collective pass enhancements
  • Brian Hirsh (@hirsheybar): dI/dW splitting pass
  • Edward Z. Yang (@ezyang): Code review, merging, linting

@meta-cla meta-cla bot added the CLA Signed label Oct 30, 2025
ezyang and others added 7 commits October 30, 2025 06:44
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Not yet working
@ezyang ezyang mentioned this pull request Oct 30, 2025
@sanketpurandare sanketpurandare merged commit 9aebf3b into main Nov 3, 2025
6 checks passed
@fmassa fmassa deleted the war-oct29 branch November 4, 2025 19:43