Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fix] Several bugs when using HvdAllToAllEmbedding to train model and save/restore KV files. #376

Merged
merged 2 commits into from
Jan 16, 2024

Conversation

MoFHeka
Copy link
Collaborator

@MoFHeka MoFHeka commented Jan 3, 2024

Description

BUG1

The id key may be multiple copies in one tensor before real shadow lookup when using HvdAllToAllEmbedding.
Here is how the bug happens.

Before ALL2ALL:
Rank0 IDs: [0,1,1,3,3] --unique--> [0,1,3]
Rank1 IDs: [0,1,2,2,3] --unique--> [0,1,2,3]
After ALL2ALL:
Rank0 IDs: [0,0,2]
Rank1 IDs: [1,1,3,3]

Rank0 has duplicated key 0, and Rank1 has duplicated keys 1 and 3. When updating the gradient, in Rank0, the key 0 will insert twice which means only one gradient takes effect and another one was covered. That may lead to under-training, same as Rank1 because of multiple copies of key 1 and key 3.

So we need to do secondary unique operation after ALL2ALL. The first time for reducing transmission overhead, the second time for training properly.

BUG2

Fix bug that de_hvd_save_model and CheckpointManager were unable to use together. Which cause by CheckpointManager compatibility code in DE hack changing the storage path in DE saveable object when runtime. The modified storage path was not wrote back after Checkpoint saving, and then when call saved_model saving function it would dump to an unexpected directory which was set in DECheckpoint class.

Others

Modify function name de_hvd_save_model to de_save_model.
Also make user more easy to use de_save_model by writing fewer code.

Type of change

  • Bug fix
  • New Tutorial
  • Updated or additional documentation
  • Additional Testing
  • New Feature

Checklist:

  • I've properly formatted my code according to the guidelines
    • By running yapf
    • By running clang-format
  • This PR addresses an already submitted issue for TensorFlow Recommenders-Addons
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works

How Has This Been Tested?

Train any model using HvdAllToAllEmbedding, and comparing the loss between set parameter with_secondary_unique True and False.

Run the new test.

@MoFHeka MoFHeka requested a review from rhdong as a code owner January 3, 2024 07:42
Also make user more easy to use de_save_model by writing fewer code.
Fix bug that de_hvd_save_model and CheckpointManager were unable to use together.  Which cause by CheckpointManager compatibility code in DE hack changing the storage path in DE saveable object when runtime. The modified storage path was not wrote back after Checkpoint saving, and then when call saved_model saving function it would dump to a unexpected directory which was set in DECheckpoint class.
@MoFHeka MoFHeka closed this Jan 15, 2024
@MoFHeka MoFHeka reopened this Jan 15, 2024
@MoFHeka MoFHeka changed the title [Fix] The id key may be multiple copies in one tensor before real shadow lookup when using HvdAllToAllEmbedding. [Fix] Several bugs when using HvdAllToAllEmbedding for to train and save/restore KV files. Jan 15, 2024
@MoFHeka MoFHeka changed the title [Fix] Several bugs when using HvdAllToAllEmbedding for to train and save/restore KV files. [Fix] Several bugs when using HvdAllToAllEmbedding to train model and save/restore KV files. Jan 15, 2024
include_optimizer=True,
save_traces=True,
options=save_options)
de.keras.models.de_save_model(model,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering the most left is de., the name can be de.keras.models.save_model

Copy link
Member

@rhdong rhdong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rhdong rhdong merged commit b33afb8 into tensorflow:master Jan 16, 2024
117 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants