Commit

Update S3 checkpointing doc and fix visibility on website. Update the nlp_overrides DDP initializer to properly assign updated checkpoint io to base class.

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
alxzhang-amazon committed Jun 14, 2024
1 parent 0d8b86a commit cc30150
Showing 3 changed files with 15 additions and 8 deletions.
1 change: 1 addition & 0 deletions docs/source/common/intro.rst
@@ -11,3 +11,4 @@ The common collection contains things that could be used across all collections.
metrics
tokenizers
data
s3_checkpointing
20 changes: 13 additions & 7 deletions docs/source/common/s3_checkpointing.rst
@@ -1,6 +1,6 @@
*********
Callbacks
*********
****************
S3 Checkpointing
****************

S3CheckpointIO
==============
@@ -60,8 +60,11 @@ S3Utils and Dependencies

This utility class is used by the S3CheckpoinIO and the exp_manager to do S3-related operations.
It has dependencies on

1. boto3[crt]

2. s3fs==0.4.2

3. tenacity

If any of these are missing, this class can't be used.
@@ -72,19 +75,22 @@ s3_dirpath_utils
================

Used to operate on strings by checking if they are S3 dirpaths, or convert a bucket and key into an s3 dirpath.
This has NO dependencies on what's required for the S3Utils class, and can be used with without any new dependencies.
This has no reliance on the S3Utils utility class, and can be used without any new dependencies.


S3 Demands and ExpManager Details When Running at Scale
=======================================================

When there are many ranks loading from S3, there can be slowdown or throttling errors.
To avoid overloading S3, when resuming from a checkpoint only rank 0 needs to identify the checkpoint path and find the correct resumption file.
Typically, in the ExpManager, every rank looks for the checkpoint file to load from. At large scale, there can be thousands of ranks querying S3 for dirpaths which can cause slowdown or throttling errors.

To avoid overloading S3 when resuming from a checkpoint only rank 0 needs to identify the checkpoint path and find the correct resumption file. Rank 0 will broadcast the checkpoint path to the other ranks.

.. code-block:: bash
trainer._checkpoint_connector = NeMoCheckpointConnector(trainer)
The NeMoModelCheckpoint setup() method will automatically broadcast the checkpoint path.

The NeMoCheckpointConnector is defined in the exp_manager.py file, and uses the broadcasted checkpoint path founds by rank 0 on all ranks when resuming training from an existing checkpoint.

The NeMoModelCheckpoint setup() method broadcasts the checkpoint path.
The setting of the trainer._checkpoint_connector needs to happen before the ExpManager call as the ExpManager updates the trainer's checkpoint connector.
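
Reviewer note (not part of the diff): the rank 0 lookup and broadcast described in the doc text above can be sketched roughly as follows. This is a minimal illustration, not the NeMo implementation. The names ``resolve_checkpoint_path``, ``find_checkpoint``, ``broadcast``, and the example bucket path are all hypothetical. In a real job the ``broadcast`` callable would be backed by a collective such as ``torch.distributed.broadcast_object_list``; here it is simulated in a single process so the sketch runs without a cluster.

```python
from typing import Callable, List, Optional, Tuple


def resolve_checkpoint_path(
    rank: int,
    find_checkpoint: Callable[[], Optional[str]],
    broadcast: Callable[[List[Optional[str]]], None],
) -> Optional[str]:
    """Only rank 0 queries storage; every rank returns the broadcast result."""
    # One-element list used as the broadcast buffer.
    payload: List[Optional[str]] = [None]
    if rank == 0:
        payload[0] = find_checkpoint()  # the only storage lookup in the job
    broadcast(payload)  # hypothetical collective: fills payload on ranks != 0
    return payload[0]


def simulate(world_size: int) -> Tuple[List[Optional[str]], int]:
    """Run the ranks one after another in-process and count storage lookups."""
    lookups: List[int] = []
    shared: dict = {}

    def find_checkpoint() -> str:
        lookups.append(1)
        return "s3://example-bucket/run/checkpoints/last.ckpt"  # made-up path

    def make_broadcast(rank: int) -> Callable[[List[Optional[str]]], None]:
        def broadcast(payload: List[Optional[str]]) -> None:
            if rank == 0:
                shared["value"] = payload[0]  # source rank publishes
            else:
                payload[0] = shared["value"]  # other ranks receive
        return broadcast

    paths = [
        resolve_checkpoint_path(r, find_checkpoint, make_broadcast(r))
        for r in range(world_size)
    ]
    return paths, len(lookups)
```

Calling ``simulate(4)`` returns the same path for all four simulated ranks while performing a single lookup, which is the property the doc relies on to avoid S3 throttling.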
2 changes: 1 addition & 1 deletion nemo/collections/nlp/parts/nlp_overrides.py
@@ -195,7 +195,7 @@ def __init__(
raise ImportError(
"megatron-core was not found. Please see the NeMo README for installation instructions: https://github.com/NVIDIA/NeMo#megatron-gpt."
)
super().__init__(parallel_devices, cluster_environment, checkpoint_io, **kwargs)
super().__init__(parallel_devices=parallel_devices, cluster_environment=cluster_environment, checkpoint_io=checkpoint_io, **kwargs)

self.no_ddp_communication_hook = no_ddp_communication_hook
self.nccl_communicator_config_path = nccl_communicator_config_path
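
Reviewer note (not part of the diff): the switch from positional to keyword arguments in the `super().__init__` call matters whenever the base class declares its parameters in a different order than the subclass passes them. The sketch below assumes a base class with one extra leading parameter. `BaseStrategy`, `PositionalStrategy`, and `KeywordStrategy` are hypothetical stand-ins for illustration, not the Lightning or NeMo classes.

```python
class BaseStrategy:
    """Hypothetical base class with an extra parameter ahead of parallel_devices."""

    def __init__(self, accelerator=None, parallel_devices=None,
                 cluster_environment=None, checkpoint_io=None):
        self.accelerator = accelerator
        self.parallel_devices = parallel_devices
        self.cluster_environment = cluster_environment
        self.checkpoint_io = checkpoint_io


class PositionalStrategy(BaseStrategy):
    def __init__(self, parallel_devices=None, cluster_environment=None, checkpoint_io=None):
        # Positional call: every argument shifts one slot to the left,
        # so checkpoint_io ends up stored as cluster_environment.
        super().__init__(parallel_devices, cluster_environment, checkpoint_io)


class KeywordStrategy(BaseStrategy):
    def __init__(self, parallel_devices=None, cluster_environment=None, checkpoint_io=None):
        # Keyword call: each argument is bound by name, independent of order.
        super().__init__(
            parallel_devices=parallel_devices,
            cluster_environment=cluster_environment,
            checkpoint_io=checkpoint_io,
        )
```

Under this assumption, `PositionalStrategy(["gpu0"], "env", "s3_io")` leaves `checkpoint_io` unset on the base class, while `KeywordStrategy` with the same arguments assigns it as intended.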