feat: Improve Dynamo partitioning System Performance on Large Models #2175

Merged
merged 5 commits into from
Aug 15, 2023

Conversation

gs-olive
Collaborator

@gs-olive gs-olive commented Aug 4, 2023

Description

Problem Context

The Dynamo partitioning system was very slow on large models (>1000 nodes) whose graphs are segmented. The existing partitioner used an exhaustive partitioning mechanism whose cost grew more than quadratically in the number of nodes and worsened as segmentation increased. This PR replaces it with a simpler adjacency-based partitioner, which is much faster on large models.

  • Upgrade Dynamo partitioning to use a custom version of the Torch _SplitterBase for efficiency and optimized usage in the Dynamo case
  • Validate existing use cases are still functional, with the same partitioning schema as before
  • Upgrade qualified name checking
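
To illustrate the adjacency-based idea, here is a minimal sketch in plain Python. It is not the Torch `_SplitterBase` or the code in this PR; the `Node` class and `partition_by_adjacency` function are hypothetical names, and the sketch assumes the nodes already arrive in topological order with a precomputed `supported` flag. It groups neighboring nodes that share the same support status in a single linear pass, which is the kind of behavior that avoids the exhaustive search described above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a graph node (not a torch.fx.Node)."""
    name: str
    supported: bool  # assumed to be precomputed by an operator-support check


def partition_by_adjacency(nodes: List[Node]) -> List[Tuple[bool, List[str]]]:
    """Group consecutive nodes that share the same support status.

    Assumes `nodes` is in topological order. Runs in O(n) because each
    node is only compared against the partition that is currently open.
    """
    partitions: List[Tuple[bool, List[str]]] = []
    for node in nodes:
        if partitions and partitions[-1][0] == node.supported:
            # Same status as the previous node: extend the open partition.
            partitions[-1][1].append(node.name)
        else:
            # Status flipped (or first node): start a new partition.
            partitions.append((node.supported, [node.name]))
    return partitions


example = [
    Node("conv", True),
    Node("relu", True),
    Node("custom_op", False),
    Node("linear", True),
]
result = partition_by_adjacency(example)
```

In this sketch each unsupported node splits the graph into one more segment, so the number of partitions grows with segmentation while the running time stays linear in the node count.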

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

Checklist:

  • [ x ] My code follows the style guidelines of this project (You can use the linters)
  • [ x ] I have performed a self-review of my own code
  • [ x ] I have commented my code, particularly in hard-to-understand areas and hacks
  • [ x ] I have made corresponding changes to the documentation
  • [ - ] I have added tests to verify my fix or my feature
  • [ x ] New and existing unit tests pass locally with my changes
  • [ x ] I have added the relevant labels to my PR so that relevant reviewers are notified

@gs-olive gs-olive added component: dynamo Issues relating to the `torch.compile` or `torch._dynamo.export` paths Story: Export/Compile Unification Issues relating to unification of Dynamo compile/export paths Story: Dynamo Compile Improvements Issues relating to improvement of the Dynamo compile path labels Aug 4, 2023
@gs-olive gs-olive requested review from narendasan and peri044 August 4, 2023 17:40
@gs-olive gs-olive self-assigned this Aug 4, 2023
@github-actions github-actions bot added component: api [Python] Issues re: Python API component: conversion Issues re: Conversion stage component: lowering Issues re: The lowering / preprocessing passes component: torch_compile labels Aug 4, 2023
@gs-olive gs-olive force-pushed the dynamo_partitioning_perf_improvement branch from 6cfbb59 to f5e8dff Compare August 4, 2023 20:16
@github-actions github-actions bot added the component: tests Issues re: Tests label Aug 4, 2023
@gs-olive gs-olive force-pushed the dynamo_partitioning_perf_improvement branch 2 times, most recently from ca94dca to 292f5ce Compare August 4, 2023 23:18
@github-actions github-actions bot added the component: build system Issues re: Build system label Aug 4, 2023
@gs-olive gs-olive force-pushed the dynamo_partitioning_perf_improvement branch 4 times, most recently from bd0b0c5 to bbf514f Compare August 7, 2023 16:51
@gs-olive gs-olive requested a review from narendasan August 7, 2023 19:02
@gs-olive
Collaborator Author

gs-olive commented Aug 8, 2023

  • Inform user if none of the nodes are supported/no valid partitions
  • Automatically fall back to global partitioning if adjacency/fast partitioning fails
    • Alert user via warning
    • Show trace in debug logs
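
The fallback behavior listed above could look roughly like the following sketch. It uses only the standard `logging` module; `partition_with_fallback`, the logger name, and the two partitioner callables are hypothetical and stand in for whatever the PR actually wires together.

```python
import logging
from typing import Callable, List, Sequence

# Hypothetical logger name, chosen only for this sketch.
logger = logging.getLogger("partitioning_sketch")

# A partitioner here is any callable mapping node names to groups of names.
Partitioner = Callable[[Sequence[str]], List[List[str]]]


def partition_with_fallback(
    nodes: Sequence[str],
    fast_partitioner: Partitioner,
    global_partitioner: Partitioner,
) -> List[List[str]]:
    """Try the fast partitioner first; fall back to the global one on error."""
    try:
        partitions = fast_partitioner(nodes)
    except Exception:
        # Alert the user at warning level, keep the full trace for debug logs.
        logger.warning(
            "Fast partitioning failed; falling back to global partitioning."
        )
        logger.debug("Fast partitioner traceback:", exc_info=True)
        partitions = global_partitioner(nodes)

    if not partitions:
        # Inform the user when nothing could be partitioned.
        logger.warning("No supported nodes found; no valid partitions produced.")
    return partitions
```

Keeping the traceback at debug level means the default output stays short, while anyone investigating the failure can raise the log level to see why the fast path was abandoned.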

- Upgrade Dynamo partitioning to use a custom version of the Torch _SplitterBase for efficiency and optimized usage in the Dynamo case
- Validate existing use cases are still functional, with the same partitioning schema as before
- Upgrade qualified name checking
- Update testing for new partitioner
- Add new directory to store available partitioners
@gs-olive gs-olive force-pushed the dynamo_partitioning_perf_improvement branch from bbf514f to ffed9d6 Compare August 9, 2023 01:18
- Fall back to global partitioner if fast partitioner fails
@gs-olive gs-olive force-pushed the dynamo_partitioning_perf_improvement branch from ffed9d6 to 631b7b7 Compare August 9, 2023 01:26
Collaborator

@peri044 peri044 left a comment

LGTM

@gs-olive gs-olive merged commit b57d83e into pytorch:main Aug 15, 2023
@gs-olive gs-olive deleted the dynamo_partitioning_perf_improvement branch August 15, 2023 21:55
Labels
cla signed component: api [Python] Issues re: Python API component: build system Issues re: Build system component: conversion Issues re: Conversion stage component: dynamo Issues relating to the `torch.compile` or `torch._dynamo.export` paths component: lowering Issues re: The lowering / preprocessing passes component: tests Issues re: Tests component: torch_compile Story: Dynamo Compile Improvements Issues relating to improvement of the Dynamo compile path Story: Export/Compile Unification Issues relating to unification of Dynamo compile/export paths
4 participants