forked from NVIDIA/NeMo
S3 Dirpath + Async Uploading Support for Default Checkpoints (NVIDIA#9045)

* Add S3 dirpath and asynchronous uploading support for basic checkpointing
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* Update megtron_gpt_pretraining config to support S3 checkpointing
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* Removed unused imports
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* move s3_checkpoint_io into callbacks. consolidate checkpoint_file_utils into s3_utils.py
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* Update setup() in nemo_model_checkpoint to broadcast checkpoint path and work with upstreamed implementation of removing unfinished checkpoints
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* Add boto3 dependency for testing
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* Remove redundant setup() in nemo_model_checkpoint
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* Remove comment line from import
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Removed explicit CRT calls since boto[crt] automatically uses CRT for file upload and download
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* Style fix
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* remove un-used s3transfer import
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* add s3 prefix for s3-related checkpointing config
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* dummy sleep function lowered from 1 to 0.01 seconds
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* Remove local_rank checking for rank, and use is_global_rank_zero.
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* Style fix
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* Apply isort and black reformatting
  Signed-off-by: alxzhang-amazon <alxzhang-amazon@users.noreply.github.com>
* add tenacity dependency
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* Apply isort and black reformatting
  Signed-off-by: alxzhang-amazon <alxzhang-amazon@users.noreply.github.com>
* Add filtering of unfinished checkpoint to non-s3 checkpoint resuming
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* isort black reformatting
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* Apply isort and black reformatting
  Signed-off-by: alxzhang-amazon <alxzhang-amazon@users.noreply.github.com>
* Remove dependency requirement for checking if dirpath is an s3 path
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* Make dependencies fully optional; allow exp_manager to optionally import S3Utils depending on whether dirpath is an S3 address or not
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* Add rst doc for s3 checkpointing
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* Remove unneeded assert
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* Removed dependencies
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* Apply isort and black reformatting
  Signed-off-by: alxzhang-amazon <alxzhang-amazon@users.noreply.github.com>
* Updated documentation on async save to S3
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* Apply isort and black reformatting
  Signed-off-by: alxzhang-amazon <alxzhang-amazon@users.noreply.github.com>
* Update S3 checkpointing doc and fix visibility on website. Update the nlp_overrides DDP initializer to properly assign updated checkpoint io to base class.
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
* Apply isort and black reformatting
  Signed-off-by: alxzhang-amazon <alxzhang-amazon@users.noreply.github.com>
* Slight fix in s3 checkpoint doc
  Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

---------

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
Signed-off-by: alxzhang-amazon <166076199+alxzhang-amazon@users.noreply.github.com>
Signed-off-by: alxzhang-amazon <alxzhang-amazon@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: alxzhang-amazon <alxzhang-amazon@users.noreply.github.com>
Showing 11 changed files with 887 additions and 55 deletions.
@@ -0,0 +1,96 @@
****************
S3 Checkpointing
****************

S3CheckpointIO
==============

S3CheckpointIO is the checkpoint_io used to save files to, and load files from, S3.
It must be initialized with a dirpath that is an S3 dirpath.

**Example Usage:**

.. code-block:: python

    # Import paths are assumed from the NeMo source tree at this commit.
    from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy
    from nemo.utils.callbacks.s3_checkpoint_io import S3CheckpointIO

    # Read the S3 checkpointing settings from the config.
    async_checkpointing = self.cfg.s3_checkpointing.get('enable_async_checkpointing', False)
    chunk_size_MB = self.cfg.s3_checkpointing.get('chunk_size_MB')
    max_read_concurrency = self.cfg.s3_checkpointing.get('max_read_concurrency')
    max_write_concurrency = self.cfg.s3_checkpointing.get('max_write_concurrency')
    dirpath = self.cfg.exp_manager.checkpoint_callback_params.get('dirpath')

    s3_checkpoint_io = S3CheckpointIO(
        dirpath=dirpath,
        chunk_size_MB=chunk_size_MB,
        max_read_concurrency=max_read_concurrency,
        max_write_concurrency=max_write_concurrency,
        async_checkpointing=async_checkpointing,
    )

    strategy = NLPDDPStrategy(
        no_ddp_communication_hook=True,
        checkpoint_io=s3_checkpoint_io,
        gradient_as_bucket_view=self.cfg.model.gradient_as_bucket_view,
        find_unused_parameters=False,
        nccl_communicator_config_path=self.cfg.model.get('nccl_communicator_config_path', None),
        sharp=self.cfg.model.get('sharp', False),
    )

**Config changes:**

.. code-block:: yaml

    checkpoint_callback_params:
      dirpath: s3://mstar-eks-dev-us-east-2/alxzhang/nemo123/1n/checkpoints
      ...

    s3_checkpointing:
      # max_write_concurrency * tp * pp * 1.15 (buffer) should be within the 3500 S3 TPS limit per partition
      max_write_concurrency: 10
      # max_read_concurrency * tp * pp * 1.15 (buffer) should be within the 5500 S3 TPS limit per partition
      max_read_concurrency: 15
      chunk_size_MB: 64
      # enables asynchronous checkpoint writing to S3
      enable_async_checkpointing: False

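The sizing rule in the config comments can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the ``tp=8``, ``pp=4`` values are assumed for the example rather than taken from the commit. The limits (3500 and 5500 TPS) and the 1.15 buffer come from the comments above.

```python
import math

# Limits and buffer factor quoted in the config comments.
S3_WRITE_TPS_LIMIT = 3500
S3_READ_TPS_LIMIT = 5500
BUFFER = 1.15


def estimated_tps(concurrency: int, tp: int, pp: int, buffer: float = BUFFER) -> float:
    """Estimated request rate: concurrency * tp * pp * buffer."""
    return concurrency * tp * pp * buffer


def max_concurrency(tps_limit: int, tp: int, pp: int, buffer: float = BUFFER) -> int:
    """Largest concurrency whose estimated request rate stays within tps_limit."""
    return math.floor(tps_limit / (tp * pp * buffer))


# Hypothetical parallelism: tensor parallel 8, pipeline parallel 4.
tp, pp = 8, 4
write_ok = estimated_tps(10, tp, pp) <= S3_WRITE_TPS_LIMIT
read_ok = estimated_tps(15, tp, pp) <= S3_READ_TPS_LIMIT
print("write within limit:", write_ok, "| read within limit:", read_ok)
print("max write concurrency:", max_concurrency(S3_WRITE_TPS_LIMIT, tp, pp))
print("max read concurrency:", max_concurrency(S3_READ_TPS_LIMIT, tp, pp))
```

Under these assumed values, the configured ``max_write_concurrency: 10`` and ``max_read_concurrency: 15`` both fit well inside the limits; larger ``tp * pp`` products shrink the headroom proportionally.
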
**Asynchronous**

By default, the S3CheckpointIO class saves synchronously.
The async feature currently does not check whether the previous async save has completed, so
an old checkpoint can be removed even when the current save fails.
To guard against this, use the feature together with saving the top k checkpoints.

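To see why keeping the top k checkpoints helps, consider a deliberately simplified model of the failure mode. This is a toy simulation, not NeMo's retention logic: it assumes the oldest tracked checkpoint is evicted as soon as a new save is started, without checking whether that new save succeeded. The function name is hypothetical.

```python
from typing import List


def surviving_checkpoints(save_results: List[bool], top_k: int) -> List[str]:
    """Return the usable checkpoints left after a sequence of saves.

    save_results[i] is True if save i eventually succeeded. Eviction happens
    when a save is started, regardless of whether it later succeeds
    (simplifying assumption).
    """
    tracked = []  # (name, succeeded) pairs, oldest first
    for step, succeeded in enumerate(save_results):
        tracked.append((f"step{step}", succeeded))
        if len(tracked) > top_k:
            tracked.pop(0)  # evict the oldest without checking the new save
    return [name for name, succeeded in tracked if succeeded]


# A good save followed by a failed one.
print(surviving_checkpoints([True, False], top_k=1))
print(surviving_checkpoints([True, False], top_k=2))
```

In this model, ``top_k=1`` leaves no usable checkpoint after the failed save, while ``top_k=2`` still retains the earlier good one.
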
S3Utils and Dependencies
========================

This utility class is used by S3CheckpointIO and the exp_manager to perform S3-related operations.
It depends on:

1. boto3[crt]

2. s3fs==0.4.2

3. tenacity

If any of these is missing, the class cannot be used.

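Before pointing dirpath at S3, it can be useful to confirm that the dependencies listed above are importable. The following is a minimal sketch using only the standard library; the helper name is hypothetical and is not part of NeMo, and it checks importability only, not the pinned version of s3fs.

```python
import importlib.util
from typing import List, Sequence

# Top-level module names for the packages listed above.
S3_MODULES = ("boto3", "s3fs", "tenacity")


def missing_s3_dependencies(modules: Sequence[str] = S3_MODULES) -> List[str]:
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = missing_s3_dependencies()
if missing:
    print("Missing S3 dependencies:", ", ".join(missing))
else:
    print("All S3 dependencies are importable.")
```
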
s3_dirpath_utils
================

These helpers operate on strings: they check whether a string is an S3 dirpath, or convert a bucket and key into an S3 dirpath.
They do not rely on the S3Utils utility class and can be used without any new dependencies.

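The two string operations described here need nothing beyond the standard library. The sketch below shows one way they could look; the function names and the ``s3://`` prefix constant are assumptions for illustration (the prefix matches the dirpath in the config example) and may differ from the actual helpers in s3_dirpath_utils.

```python
from typing import Optional

S3_PREFIX = "s3://"  # assumed scheme, as in the config example's dirpath


def looks_like_s3_dirpath(path: Optional[str]) -> bool:
    """True if path is a string that starts with the S3 prefix."""
    return path is not None and path.strip().startswith(S3_PREFIX)


def join_s3_dirpath(bucket: str, key: str) -> str:
    """Combine a bucket and key into an S3 dirpath."""
    return f"{S3_PREFIX}{bucket}/{key.lstrip('/')}"


print(looks_like_s3_dirpath("s3://example-bucket/checkpoints"))
print(looks_like_s3_dirpath("/tmp/checkpoints"))
print(join_s3_dirpath("example-bucket", "run1/checkpoints"))
```

Because these checks are pure string operations, they can run even when boto3, s3fs, and tenacity are not installed.
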
S3 Demands and ExpManager Details When Running at Scale
=======================================================

Typically, in the ExpManager, every rank looks for the checkpoint file to load from. At large scale, thousands of ranks may query S3 for dirpaths, which can cause slowdowns or throttling errors.

To avoid overloading S3 when resuming from a checkpoint, only rank 0 needs to identify the checkpoint path and find the correct resumption file. Rank 0 then broadcasts the checkpoint path to the other ranks.

.. code-block:: python

    from nemo.utils.exp_manager import NeMoCheckpointConnector

    trainer._checkpoint_connector = NeMoCheckpointConnector(trainer)

The NeMoModelCheckpoint setup() method automatically broadcasts the checkpoint path.

The NeMoCheckpointConnector is defined in the exp_manager.py file. When resuming training from an existing checkpoint, it uses, on all ranks, the checkpoint path that rank 0 found and broadcast.

trainer._checkpoint_connector must be set before the ExpManager call, because the ExpManager updates the trainer's checkpoint connector.
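
The rank 0 lookup and broadcast described in this section can be illustrated with a single-process simulation. This is a sketch of the pattern only, not NeMo's implementation: the function names and the example path are hypothetical, and the list comprehension stands in for a real collective (a distributed run would use something like ``torch.distributed.broadcast_object_list``).

```python
from typing import Callable, List, Optional


def resolve_resume_paths(
    world_size: int, find_latest_checkpoint: Callable[[], Optional[str]]
) -> List[Optional[str]]:
    """Simulate rank 0 finding the checkpoint and broadcasting it to all ranks."""
    # Only rank 0 queries storage.
    path_from_rank0 = find_latest_checkpoint()
    # Stand-in for a collective broadcast: every rank receives rank 0's value.
    return [path_from_rank0 for _ in range(world_size)]


storage_lookups = []  # records each (simulated) storage query


def fake_lookup() -> str:
    storage_lookups.append(1)
    return "s3://example-bucket/checkpoints/last.ckpt"


paths = resolve_resume_paths(world_size=4, find_latest_checkpoint=fake_lookup)
print("storage lookups:", len(storage_lookups))
print("all ranks agree:", len(set(paths)) == 1)
```

The number of storage queries stays at one regardless of ``world_size``, which is the property that keeps thousands of ranks from hitting S3 at once.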