[checkpointio] gather tensor before unpad it if the tensor is both padded and distributed #6168

Merged
1 commit merged into hpcaitech:main on Jan 21, 2025

Conversation

Lemon-412
Contributor

@Lemon-412 Lemon-412 commented Dec 24, 2024

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs
  • I have installed pre-commit: pip install pre-commit && pre-commit install

🚨 Issue number

fixed #6167

📝 What does this PR do?

To prevent #6167, gather the tensor before unpadding it when the tensor is both padded and distributed.
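
The ordering can be illustrated with a small sketch. It is not the code from this PR: plain Python lists stand in for tensors, every helper name (`pad_rows`, `shard`, `gather`, `unpad`) is hypothetical, and it assumes the padding is appended along the same dimension that is sharded. Under those assumptions, slicing each local shard back to the original length does nothing, because every shard is already shorter than that length, so the padding survives the gather.

```python
# Hypothetical helpers for illustration only; these are not Colossal-AI APIs.
# Plain lists stand in for tensors, and each element stands in for one row.

def pad_rows(rows, multiple, pad_value=0):
    """Append pad rows so that len(rows) becomes a multiple of `multiple`."""
    remainder = len(rows) % multiple
    if remainder == 0:
        return list(rows)
    return list(rows) + [pad_value] * (multiple - remainder)

def shard(rows, world_size):
    """Split rows into `world_size` contiguous, equally sized shards."""
    size = len(rows) // world_size
    return [rows[i * size:(i + 1) * size] for i in range(world_size)]

def gather(shards):
    """Concatenate the shards in rank order (stand-in for an all-gather)."""
    return [row for local in shards for row in local]

def unpad(rows, origin_length):
    """Keep only the first `origin_length` rows."""
    return rows[:origin_length]

origin = list(range(1, 11))            # 10 real rows, values 1..10
world_size = 4
padded = pad_rows(origin, world_size)  # 12 rows, the last two are padding
shards = shard(padded, world_size)     # 4 shards of 3 rows each

# Unpadding each shard first is a no-op: every shard has 3 rows, fewer than 10.
unpad_then_gather = gather([unpad(local, len(origin)) for local in shards])

# Gathering first restores the full padded tensor, so the slice removes padding.
gather_then_unpad = unpad(gather(shards), len(origin))
```

In this sketch `gather_then_unpad` equals `origin`, while `unpad_then_gather` still carries the two padding rows at the end.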

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

@Lemon-412 Lemon-412 requested a review from a team as a code owner December 24, 2024 09:50
@Lemon-412 Lemon-412 changed the title from "[checkpoint_io]" to "[checkpoint_io] gather tensor before unpad it if the tensor is both padded and distributed" Dec 24, 2024
@Lemon-412
Contributor Author

request review from @flybird11111 @ver217 .

@Lemon-412 Lemon-412 changed the title from "[checkpoint_io] gather tensor before unpad it if the tensor is both padded and distributed" to "[checkpointio] gather tensor before unpad it if the tensor is both padded and distributed" Jan 20, 2025
@Lemon-412
Contributor Author

it seems like one unittest (test_dist_lamb) fails with OOM.
weird, should we rerun the test?

@ver217
Member

ver217 commented Jan 20, 2025

> it seems like one unittest (test_dist_lamb) fails with OOM. weird, should we rerun the test?

This was fixed in main branch. Could you rebase the main branch?

@Lemon-412
Contributor Author

> > it seems like one unittest (test_dist_lamb) fails with OOM. weird, should we rerun the test?
>
> This was fixed in main branch. Could you rebase the main branch?

Done. Approval is required again because the rebase required a force push.

@ver217 ver217 merged commit 97e60cb into hpcaitech:main Jan 21, 2025
4 checks passed