BMTrain New Version Release v1.0.0 #182
Merged
Conversation
* Remove inappropriate import in __init__.py
* Use hooks to implement ZeRO and Checkpoint (Co-authored-by: zhangkaihuo <zhangkaihuo@modelbest.cn>)
* Fix error: tensor slice in gather()
* Only initialize tp_comm when tp_size > 1 (this does not affect current BMTrain, since BMTrain passes grain_size=0 in all parallel_for calls)
* Fix loss scale for tp
* Fix Adam bf16 load (changed to fp16); support saving optim_manager.state (including optimizer, lr_scheduler, loss_scale)
* Fix: allgather_object stuck
zkh2016 approved these changes on Feb 20, 2024
* Support loading gathered optimizer state and the record-delta feature
* Implement vocab-parallel Embedding and make the example work when tp_size > 1
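The vocab-parallel Embedding mentioned above can be sketched conceptually: each tensor-parallel rank owns a contiguous slice of the vocabulary, ids outside the local slice contribute zero vectors, and an all-reduce (simulated here by summing over ranks) restores the full lookup. This is an illustrative stand-in, not BMTrain's actual implementation.

```python
# Conceptual sketch of vocab-parallel embedding lookup (illustrative only;
# not BMTrain's actual code). Each rank owns a contiguous vocabulary slice;
# out-of-range ids yield zeros, and an all-reduce sums the partial results.

def shard_range(vocab_size, tp_size, rank):
    """Vocabulary slice [start, end) owned by `rank` (assumes divisibility)."""
    per_rank = vocab_size // tp_size
    return rank * per_rank, (rank + 1) * per_rank

def local_lookup(ids, shard, start, end):
    """Look up ids inside this rank's shard; others yield zero vectors."""
    dim = len(shard[0])
    return [shard[i - start] if start <= i < end else [0.0] * dim
            for i in ids]

def vocab_parallel_embed(ids, full_table, tp_size):
    """Simulate tp_size ranks plus the final all-reduce (elementwise sum)."""
    vocab_size, dim = len(full_table), len(full_table[0])
    partials = []
    for rank in range(tp_size):
        start, end = shard_range(vocab_size, tp_size, rank)
        partials.append(local_lookup(ids, full_table[start:end], start, end))
    # all-reduce step: sum the partial embeddings across ranks
    return [[sum(p[t][d] for p in partials) for d in range(dim)]
            for t in range(len(ids))]
```

For example, with an 8-word vocabulary split across 2 ranks, looking up ids that live on different shards still returns the full embedding rows after the summed all-reduce.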
BMTrain New Version Release v1.0.0
Issue Reference
Issue #174
Description
This new version of BMTrain introduces Tensor Parallelism and a significant restructuring of the codebase. We have enhanced flexibility by allowing more granular control over ZeRO levels and Activation Checkpointing. This control is now available at the bmt.Block (alias of bmt.CheckpointBlock) level, enabling targeted optimizations: for instance, it is now possible to apply aggressive ZeRO settings, or to disable Checkpointing, in specific layers as needed. Additionally, the updated BMTrain features a suite of operators designed explicitly for Tensor Parallel training.
Type of Change
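As a rough sketch of what the per-block control described above could look like, the following plain-Python stand-in merges per-layer overrides onto shared defaults. The setting names `zero_level` and `use_checkpoint` are assumptions for illustration, not confirmed BMTrain API; consult the docs for the actual bmt.Block signature.

```python
# Illustrative stand-in for granular, per-block ZeRO/Checkpointing control.
# Setting names (zero_level, use_checkpoint) are hypothetical placeholders.

DEFAULTS = {"zero_level": 3, "use_checkpoint": True}

def block_settings(layer_idx, overrides):
    """Return the settings for one block: defaults plus per-layer overrides."""
    cfg = dict(DEFAULTS)
    cfg.update(overrides.get(layer_idx, {}))
    return cfg

# e.g. disable Checkpointing in the first layer, relax ZeRO in the last
overrides = {0: {"use_checkpoint": False}, 23: {"zero_level": 2}}
```

The point of the design is that most blocks share one default configuration, while individual blocks can be tuned for targeted memory/compute trade-offs.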
How Has This Been Tested?
(Describe the tests used to validate the changes. Please provide instructions for reproduction.)
Checklist
Additional Information
(Provide any additional information, configuration details, or data necessary for the review.)