NeMo dev doc restructure #8896

Merged
merged 38 commits on Apr 25, 2024

Changes from 35 commits

38 commits
d18547c
Update intro and why nemo in dev doc
yaoyu-33 Apr 11, 2024
76dd297
Categorize tutorials
yaoyu-33 Apr 11, 2024
69aef9e
Update tutorials link
yaoyu-33 Apr 15, 2024
955546a
update index
yaoyu-33 Apr 15, 2024
b0d69c6
Restructure
yaoyu-33 Apr 15, 2024
153cfaf
Restructure
yaoyu-33 Apr 15, 2024
d786227
Restructure
yaoyu-33 Apr 15, 2024
85fcfd8
Restructure
yaoyu-33 Apr 15, 2024
7f3bfba
Restructure
yaoyu-33 Apr 15, 2024
3d87835
Restructure
yaoyu-33 Apr 16, 2024
544b514
Restructure
yaoyu-33 Apr 17, 2024
9b80dfa
Restructure
yaoyu-33 Apr 17, 2024
c2a3032
Update flash attention
yaoyu-33 Apr 17, 2024
b1b69c4
Update flash attention
yaoyu-33 Apr 17, 2024
425c007
Fix few structure issue
yaoyu-33 Apr 17, 2024
a4da882
Fix migration
yaoyu-33 Apr 17, 2024
e8595ed
Fix structure
yaoyu-33 Apr 17, 2024
5a60afe
Fix structure
yaoyu-33 Apr 17, 2024
31b4ff4
Few updates
yaoyu-33 Apr 17, 2024
f1c9cd7
Add few more scripts
yaoyu-33 Apr 17, 2024
ac713d0
Fix scripts
yaoyu-33 Apr 17, 2024
2c41acb
Fix few things
yaoyu-33 Apr 17, 2024
0d25d04
Fix tutorial table
yaoyu-33 Apr 17, 2024
029541c
Restructure
yaoyu-33 Apr 17, 2024
8f0c10f
Rename
yaoyu-33 Apr 17, 2024
2425aeb
Few fixes and moves
yaoyu-33 Apr 19, 2024
c4096ef
Move sections
yaoyu-33 Apr 19, 2024
6291286
Merge branch 'main' into yuya/dev_doc_update
yaoyu-33 Apr 19, 2024
98cdaa5
Fix bib
yaoyu-33 Apr 19, 2024
0a70dbe
Refactor files
yaoyu-33 Apr 19, 2024
be01ef7
Merge branch 'main' into yuya/dev_doc_update
pablo-garay Apr 20, 2024
d8ad941
Fixes
yaoyu-33 Apr 23, 2024
371b98f
Merge remote-tracking branch 'origin/yuya/dev_doc_update' into yuya/d…
yaoyu-33 Apr 23, 2024
44fe116
Merge branch 'main' into yuya/dev_doc_update
yaoyu-33 Apr 24, 2024
0afd998
Fix
yaoyu-33 Apr 24, 2024
5bfbc1b
Fix few issues
yaoyu-33 Apr 24, 2024
e4f1d8e
remove scripts
yaoyu-33 Apr 24, 2024
6405a3a
Update docs
yaoyu-33 Apr 25, 2024
32 changes: 32 additions & 0 deletions docs/source/ckpt_converters/convert_mlm.rst
@@ -0,0 +1,32 @@
Converting from Megatron-LM
===========================

NVIDIA NeMo and NVIDIA Megatron-LM share many underlying technologies. This document provides guidance for migrating your project from Megatron-LM to NVIDIA NeMo.

Converting Checkpoints
----------------------

You can convert your GPT-style model checkpoints trained with Megatron-LM into the NeMo format using the provided example script.

.. code-block:: bash

    <NeMo_ROOT_FOLDER>/examples/nlp/language_modeling/megatron_lm_ckpt_to_nemo.py \
        --checkpoint_folder <path_to_PTL_checkpoints_folder> \
        --checkpoint_name megatron_gpt--val_loss=99.99-step={steps}-consumed_samples={consumed}.0 \
        --nemo_file_path <path_to_output_nemo_file> \
        --model_type <megatron_model_type> \
        --tensor_model_parallel_size <tensor_model_parallel_size> \
        --pipeline_model_parallel_size <pipeline_model_parallel_size> \
        --gpus_per_node <gpus_per_node>

Resuming Training
-----------------

To resume training from a converted Megatron-LM checkpoint, it is crucial to correctly set up the training parameters to match the previous learning rate schedule. Use the following setting for the `trainer.max_steps` parameter in your NeMo training configuration:

.. code-block:: none

    trainer.max_steps=round(lr-warmup-fraction * lr-decay-iters + lr-decay-iters)

This configuration ensures that the learning rate scheduler in NeMo continues from where it left off in Megatron-LM, using the `lr-warmup-fraction` and `lr-decay-iters` arguments from the original Megatron-LM training setup.
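
For example, with hypothetical Megatron-LM settings of ``lr-warmup-fraction = 0.01`` and ``lr-decay-iters = 300000`` (illustrative values, not taken from any real run), the resulting override would be:

.. code-block:: none

    # round(0.01 * 300000 + 300000) = 303000
    trainer.max_steps=303000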

22 changes: 22 additions & 0 deletions docs/source/ckpt_converters/intro.rst
@@ -0,0 +1,22 @@
Community Checkpoint Converter
==============================

We provide easy-to-use tools that enable users to convert community checkpoints into the NeMo format. These tools facilitate various operations, including resuming training, Supervised Fine-Tuning (SFT), Parameter-Efficient Fine-Tuning (PEFT), and deployment. For detailed instructions and guidelines, please refer to our documentation.

We offer comprehensive guides to assist both end-users and developers:

- **User Guide**: Detailed steps on how to convert community model checkpoints for further training or deployment within NeMo. For more information, please see our :doc:`user_guide`.

- **Developer Guide**: Instructions for developers on how to implement converters for community model checkpoints, allowing for broader compatibility and integration within the NeMo ecosystem. For development details, refer to our :doc:`dev_guide`.

- **Megatron-LM Checkpoint Conversion**: NVIDIA NeMo and NVIDIA Megatron-LM share several foundational technologies. You can convert your GPT-style model checkpoints trained with Megatron-LM into the NeMo framework using our scripts, see our :doc:`convert_mlm`.

Access the user and developer guides directly through the links below:

.. toctree::
:maxdepth: 1
:caption: Conversion Guides

user_guide
dev_guide
convert_mlm
70 changes: 70 additions & 0 deletions docs/source/collections.rst
@@ -0,0 +1,70 @@
================
NeMo Collections
================

Documentation for the individual collections

.. toctree::
:maxdepth: 1
:caption: Large Language Models (LLMs)
:name: Large Language Models
:titlesonly:

nlp/nemo_megatron/intro
nlp/models
nlp/machine_translation/machine_translation
nlp/megatron_onnx_export
nlp/quantization
nlp/api


.. toctree::
:maxdepth: 1
:caption: Speech AI
:name: Speech AI
:titlesonly:

asr/intro
asr/speech_classification/intro
asr/speaker_recognition/intro
asr/speaker_diarization/intro
asr/ssl/intro
asr/speech_intent_slot/intro


.. toctree::
:maxdepth: 1
:caption: Multimodal (MM)
:name: Multimodal
:titlesonly:

multimodal/mllm/intro
multimodal/vlm/intro
multimodal/text2img/intro
multimodal/nerf/intro
multimodal/api


.. toctree::
:maxdepth: 1
:caption: Text To Speech (TTS)
:name: Text To Speech
:titlesonly:

tts/intro

.. toctree::
:maxdepth: 1
:caption: Vision (CV)
:name: vision
:titlesonly:

vision/intro

.. toctree::
:maxdepth: 1
:caption: Common
:name: Common
:titlesonly:

common/intro
3 changes: 2 additions & 1 deletion docs/source/core/core_index.rst
@@ -1,5 +1,5 @@
=========
NeMo Core
NeMo APIs
Collaborator: I think we agreed we should call this "NeMo Core APIs"? cc @ericharper
=========

You can learn more about the underlying principles of the NeMo codebase in this section.
@@ -16,6 +16,7 @@ You can learn more about aspects of the NeMo "core" by following the links below

core
neural_modules
positional_embeddings
exp_manager
neural_types
export
48 changes: 48 additions & 0 deletions docs/source/features/memory_optimizations.rst
@@ -0,0 +1,48 @@
Memory Optimizations
====================

Parallelism
-----------
Refer to :doc:`Parallelism <./parallelism>`.

Flash Attention
---------------

Overview
^^^^^^^^

Flash Attention is a method designed to improve the efficiency of Transformer models, which are widely used in applications such as natural language processing. Standard attention is slow and memory-hungry on long sequences because self-attention has quadratic time and memory complexity. FlashAttention is an IO-aware exact attention algorithm that uses tiling to minimize the number of memory reads and writes between the GPU's high-bandwidth memory (HBM) and on-chip SRAM, making it more IO-efficient than standard attention implementations.

Enable and Disable Flash Attention
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the NeMo framework, Flash Attention is supported through the Transformer Engine with the inclusion of Flash Attention 2. By default, Flash Attention is enabled, but the Transformer Engine may switch to a different kernel if the tensor dimensions are not optimal for Flash Attention. Users can completely disable Flash Attention by setting the environment variable ``NVTE_FLASH_ATTN=0``.
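
For example, a minimal sketch of turning it off for a single run (export the variable before, or prepend it to, any NeMo training command):

.. code-block:: bash

    # Disable Flash Attention in Transformer Engine; a different attention
    # kernel will be selected instead.
    export NVTE_FLASH_ATTN=0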

For more details on the supported Dot Attention backend, please refer to the Transformer Engine source code available at `Transformer Engine's Attention Mechanism <https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/attention.py>`_.


Activation Recomputation
------------------------

Overview
^^^^^^^^

Full Activation Recomputation
"""""""""""""""""""""""""""""
Full activation recomputation recalculates all intermediate activations during the backward pass instead of storing them during the forward pass. This technique maximizes memory efficiency at the cost of computational overhead, as each activation is recomputed when needed.

Partial Activation Recomputation
""""""""""""""""""""""""""""""""
This method recomputes only a subset of layers during the backward pass. It is a trade-off between full recomputation and no recomputation, balancing memory savings against computational efficiency.

Selective Activation Recomputation
""""""""""""""""""""""""""""""""""
Selective activation recomputation significantly reduces the memory footprint of activations via smart activation checkpointing: only the crucial activations are stored, and the others are recomputed as needed. It is particularly useful in large models for minimizing memory usage while controlling the computational cost.

Refer to "Reducing Activation Recomputation in Large Transformer Models" for more details: https://arxiv.org/abs/2205.05198
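
As a rough sketch (the exact option names can vary between NeMo models and versions, and the script name is a placeholder, so treat all of the following as illustrative), activation recomputation is typically selected through model overrides along these lines:

.. code-block:: bash

    # Full recomputation, checkpointing blocks of 4 layers (illustrative):
    python <your_training_script>.py \
        model.activations_checkpoint_granularity=full \
        model.activations_checkpoint_method=block \
        model.activations_checkpoint_num_layers=4

    # Selective recomputation (illustrative):
    python <your_training_script>.py \
        model.activations_checkpoint_granularity=selective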

.. bibliography:: ./nlp_all.bib
:style: plain
:labelprefix: nlp-megatron
:keyprefix: nlp-megatron-
6 changes: 6 additions & 0 deletions docs/source/features/mixed_precision.rst
@@ -0,0 +1,6 @@
.. _mix_precision:

Mixed Precision Training
------------------------

Mixed precision training significantly improves computational efficiency by performing operations in half-precision (fp16/bf16) and fp8 formats, while selectively keeping a minimal amount of data in single precision to preserve critical information in key parts of the network. NeMo supports fp16, bf16, and fp8 (via Transformer Engine) across most models. Further details will be provided shortly.
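
A minimal sketch of enabling it through trainer overrides (illustrative command and values; check your model's config for the precision settings accepted by your NeMo version, and note that fp8 additionally requires Transformer Engine-specific model options):

.. code-block:: bash

    # bf16 mixed precision (illustrative):
    python <your_training_script>.py trainer.precision=bf16

    # fp16 mixed precision (illustrative):
    python <your_training_script>.py trainer.precision=16
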
@@ -3,13 +3,13 @@
Parallelisms
------------

NeMo Megatron supports 5 types of parallelisms (which can be mixed together arbitraritly):
NeMo Megatron supports 5 types of parallelisms (which can be mixed together arbitrarily):

Distributed Data parallelism
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Distributed Data parallelism (DDP) creates identical copies of the model across multiple GPUs.

.. image:: images/ddp.gif
.. image:: ../nlp/nemo_megatron/images/ddp.gif
:align: center
:width: 800px
:alt: Distributed Data Parallel
@@ -20,7 +20,7 @@ Tensor Parallelism
With Tensor Parallelism (TP) a tensor is split into non-overlapping pieces and
different parts are distributed and processed on separate GPUs.

.. image:: images/tp.gif
.. image:: ../nlp/nemo_megatron/images/tp.gif
:align: center
:width: 800px
:alt: Tensor Parallel
@@ -29,15 +29,15 @@ Pipeline Parallelism
^^^^^^^^^^^^^^^^^^^^
With Pipeline Parallelism (PP) consecutive layer chunks are assigned to different GPUs.

.. image:: images/pp.gif
.. image:: ../nlp/nemo_megatron/images/pp.gif
:align: center
:width: 800px
:alt: Pipeline Parallel

Sequence Parallelism
^^^^^^^^^^^^^^^^^^^^

.. image:: images/sp.gif
.. image:: ../nlp/nemo_megatron/images/sp.gif
:align: center
:width: 800px
:alt: Sequence Parallel
@@ -47,7 +47,7 @@ Expert Parallelism
Expert Parallelism (EP) distributes experts across GPUs.


.. image:: images/ep.png
.. image:: ../nlp/nemo_megatron/images/ep.png
:align: center
:width: 800px
:alt: Expert Parallelism
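
As an illustrative sketch (the option names follow the common NeMo Megatron config layout and the script name is a placeholder; verify both against your model's config), these parallelism degrees are combined through overrides such as:

.. code-block:: bash

    # 2-way tensor parallelism x 2-way pipeline parallelism on 4 GPUs (illustrative):
    python <your_training_script>.py \
        trainer.devices=4 \
        model.tensor_model_parallel_size=2 \
        model.pipeline_model_parallel_size=2
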
@@ -57,7 +57,7 @@ Parallelism nomenclature

When reading and modifying NeMo Megatron code you will encounter the following terms.

.. image:: images/pnom.gif
.. image:: ../nlp/nemo_megatron/images/pnom.gif
:align: center
:width: 800px
:alt: Parallelism nomenclature
@@ -1,7 +1,9 @@
Throughput Optimizations
========================

Sequence Packing for SFT/PEFT
-----------------------------


Overview
^^^^^^^^

@@ -133,6 +135,10 @@ To train with packed sequences, you need to change four items in the SFT/PEFT co
Now you are all set to finetune your model with much improved throughput!

Communication Overlap
---------------------
NeMo leverages Megatron-Core's optimizations to enhance bandwidth utilization and effectively overlap computation with communication. Additional details will be provided soon.


.. rubric:: Footnotes
