[PyTorch] Bugfix for wgrad bulk overlap conflict when dgrad overlap is reduce-scatter #1341

Open
wants to merge 4 commits into
base: main

@denera (Collaborator) commented Nov 18, 2024

Description

When the Userbuffers config dictionary sets the overlap method to ring-exchange or pipeline for any *_dgrad layer, that layer's *_wgrad overlap must be disabled for the ub_overlap_rs_dgrad=True option of the related TE modules to function correctly.

This PR fixes a bug where the *_wgrad overlap persisted in the Userbuffers configuration and the corresponding UB object was initialized even when it was not needed.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • *_wgrad overlap is now removed from methods["bulk"] list when the same layer's *_dgrad overlap has its method set to either ring-exchange or pipeline.
  • add_ub(name, **ub_cfg) is now only called if name is in the original user-provided ub_cfg. This avoids creating UB objects with default configs that may conflict with the user's intended TP overlap use.
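
The two changes above can be illustrated with a minimal, self-contained sketch. It is not the code in transformer_engine/pytorch/module/base.py: the helper names `resolve_overlaps` and `layers_to_initialize`, the `"ring_exchange"` key spelling, and the layer names in the usage note are assumptions made for illustration only.

```python
# Hypothetical sketch of the behavior described in this PR; not the actual TE code.

# Assumed spellings of the dgrad methods that imply a reduce-scatter overlap.
RS_DGRAD_METHODS = ("ring_exchange", "pipeline")


def resolve_overlaps(user_cfgs, default_methods):
    """Return a copy of the method lists with conflicting *_wgrad bulk overlaps removed."""
    methods = {method: list(names) for method, names in default_methods.items()}
    for name, cfg in user_cfgs.items():
        method = cfg.get("method")
        if not (name.endswith("_dgrad") and method in RS_DGRAD_METHODS):
            continue
        # Move the dgrad layer from wherever it was listed to the requested method.
        for names in methods.values():
            if name in names:
                names.remove(name)
        methods.setdefault(method, []).append(name)
        # The same layer's wgrad bulk overlap would conflict, so drop it.
        wgrad = name[: -len("_dgrad")] + "_wgrad"
        if wgrad in methods.get("bulk", []):
            methods["bulk"].remove(wgrad)
    return methods


def layers_to_initialize(user_cfgs, methods):
    """List only the layers that the user explicitly configured and that remain active."""
    active = [name for names in methods.values() for name in names]
    return [name for name in active if name in user_cfgs]
```

With defaults that list `qkv_dgrad` and `qkv_wgrad` under `"bulk"` and a user config that sets `qkv_dgrad` to `"ring_exchange"`, `resolve_overlaps` drops `qkv_wgrad` from the bulk list, and `layers_to_initialize` returns only names present in the user config, so no UB object would be created from defaults alone.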

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@denera added the bug label (Something isn't working) Nov 18, 2024
@denera self-assigned this Nov 18, 2024
@timmoon10 (Collaborator) left a comment

LGTM, pending CI

transformer_engine/pytorch/module/base.py (outdated review thread, resolved)
@timmoon10 (Collaborator) commented

/te-ci pytorch L0 L1

denera and others added 4 commits November 22, 2024 16:09
…dgrad overlap is enabled

Signed-off-by: Alp Dener <adener@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Alp Dener <adener@nvidia.com>
@denera (Collaborator, Author) commented Nov 22, 2024

/te-ci pytorch L0 L1
