diff --git a/README.rst b/README.rst index c93c48e355d8..66b3a5806c2d 100644 --- a/README.rst +++ b/README.rst @@ -46,7 +46,7 @@ Latest News
Large Language Models and Multimodal
- Accelerate your generative AI journey with NVIDIA NeMo framework on GKE (2024/03/16) + Accelerate your generative AI journey with NVIDIA NeMo Framework on GKE (2024/03/16) An end-to-end walkthrough to train generative AI models on the Google Kubernetes Engine (GKE) using the NVIDIA NeMo Framework is available at https://github.com/GoogleCloudPlatform/nvidia-nemo-on-gke. The walkthrough includes detailed instructions on how to set up a Google Cloud Project and pre-train a GPT model using the NeMo Framework.

@@ -71,7 +71,7 @@ Latest News
NVIDIA now powers training for Amazon Titan Foundation models (2023/11/28) - NVIDIA NeMo framework now empowers the Amazon Titan foundation models (FM) with efficient training of large language models (LLMs). The Titan FMs form the basis of Amazon’s generative AI service, Amazon Bedrock. The NeMo Framework provides a versatile framework for building, customizing, and running LLMs. + NVIDIA NeMo Framework now empowers the Amazon Titan foundation models (FM) with efficient training of large language models (LLMs). The Titan FMs form the basis of Amazon’s generative AI service, Amazon Bedrock. The NeMo Framework provides a versatile framework for building, customizing, and running LLMs.

@@ -486,7 +486,7 @@ We welcome community contributions! Please refer to `CONTRIBUTING.md `_ that utilize the NeMo framework. +We provide an ever-growing list of `publications `_ that utilize the NeMo Framework. If you would like to add your own article to the list, you are welcome to do so via a pull request to this repository's ``gh-pages-src`` branch. Please refer to the instructions in the `README of that branch `_. diff --git a/docs/source/ckpt_converters/convert_mlm.rst b/docs/source/ckpt_converters/convert_mlm.rst new file mode 100644 index 000000000000..61b5b2802e8a --- /dev/null +++ b/docs/source/ckpt_converters/convert_mlm.rst @@ -0,0 +1,32 @@ +Converting from Megatron-LM +=========================== + +NVIDIA NeMo and NVIDIA Megatron-LM share many underlying technologies. This document provides guidance for migrating your project from Megatron-LM to NVIDIA NeMo. + +Converting Checkpoints +---------------------- + +You can convert your GPT-style model checkpoints trained with Megatron-LM into the NeMo Framework using the provided example script. This script facilitates the conversion of Megatron-LM checkpoints to NeMo-compatible formats. + +.. code-block:: bash + + /examples/nlp/language_modeling/megatron_lm_ckpt_to_nemo.py \ + --checkpoint_folder \ + --checkpoint_name megatron_gpt--val_loss=99.99-step={steps}-consumed_samples={consumed}.0 \ + --nemo_file_path \ + --model_type \ + --tensor_model_parallel_size \ + --pipeline_model_parallel_size \ + --gpus_per_node + +Resuming Training +----------------- + +To resume training from a converted Megatron-LM checkpoint, it is crucial to correctly set up the training parameters to match the previous learning rate schedule. Use the following setting for the ``trainer.max_steps`` parameter in your NeMo training configuration: + +.. code-block:: none + + trainer.max_steps=round(lr-warmup-fraction * lr-decay-iters + lr-decay-iters) + +This configuration ensures that the learning rate scheduler in NeMo continues from where it left off in Megatron-LM, using the ``lr-warmup-fraction`` and ``lr-decay-iters`` arguments from the original Megatron-LM training setup. + diff --git a/docs/source/ckpt_converters/intro.rst b/docs/source/ckpt_converters/intro.rst new file mode 100644 index 000000000000..6d4da83499fa --- /dev/null +++ b/docs/source/ckpt_converters/intro.rst @@ -0,0 +1,22 @@ +Community Checkpoint Converter +============================== + +We provide easy-to-use tools that enable users to convert community checkpoints into the NeMo format. These tools facilitate various operations, including resuming training, Supervised Fine-Tuning (SFT), Parameter-Efficient Fine-Tuning (PEFT), and deployment. For detailed instructions and guidelines, please refer to our documentation. + +We offer comprehensive guides to assist both end users and developers: + +- **User Guide**: Detailed steps on how to convert community model checkpoints for further training or deployment within NeMo. For more information, please see our :doc:`user_guide`. + +- **Developer Guide**: Instructions for developers on how to implement converters for community model checkpoints, allowing for broader compatibility and integration within the NeMo ecosystem. For development details, refer to our :doc:`dev_guide`. + +- **Megatron-LM Checkpoint Conversion**: NVIDIA NeMo and NVIDIA Megatron-LM share several foundational technologies. You can convert your GPT-style model checkpoints trained with Megatron-LM into the NeMo Framework using our scripts; see our :doc:`convert_mlm`. 
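As a quick, hedged illustration of the ``trainer.max_steps`` formula in the Megatron-LM conversion guide above: the values used below for ``lr-warmup-fraction`` and ``lr-decay-iters`` are hypothetical stand-ins for the arguments of your original Megatron-LM launch command, not values taken from any real configuration.

.. code-block:: bash

   # Hypothetical schedule taken from the original Megatron-LM launch command
   LR_WARMUP_FRACTION=0.01
   LR_DECAY_ITERS=300000

   # trainer.max_steps = round(lr-warmup-fraction * lr-decay-iters + lr-decay-iters)
   MAX_STEPS=$(python3 -c "print(round($LR_WARMUP_FRACTION * $LR_DECAY_ITERS + $LR_DECAY_ITERS))")
   echo "trainer.max_steps=$MAX_STEPS"   # -> trainer.max_steps=303000

Pass the resulting value as the ``trainer.max_steps`` override when relaunching training from the converted checkpoint, so the NeMo scheduler continues the same learning rate curve.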
+ +Access the user and developer guides directly through the links below: + +.. toctree:: + :maxdepth: 1 + :caption: Conversion Guides + + user_guide + dev_guide + convert_mlm diff --git a/docs/source/collections.rst b/docs/source/collections.rst new file mode 100644 index 000000000000..1cc7a654b9c1 --- /dev/null +++ b/docs/source/collections.rst @@ -0,0 +1,70 @@ +================ +NeMo Collections +================ + +Documentation for the individual collections. + +.. toctree:: + :maxdepth: 1 + :caption: Large Language Models (LLMs) + :name: Large Language Models + :titlesonly: + + nlp/nemo_megatron/intro + nlp/models + nlp/machine_translation/machine_translation + nlp/megatron_onnx_export + nlp/quantization + nlp/api + + +.. toctree:: + :maxdepth: 1 + :caption: Speech AI + :name: Speech AI + :titlesonly: + + asr/intro + asr/speech_classification/intro + asr/speaker_recognition/intro + asr/speaker_diarization/intro + asr/ssl/intro + asr/speech_intent_slot/intro + + +.. toctree:: + :maxdepth: 1 + :caption: Multimodal Models (MMs) + :name: Multimodal + :titlesonly: + + multimodal/mllm/intro + multimodal/vlm/intro + multimodal/text2img/intro + multimodal/nerf/intro + multimodal/api + + +.. toctree:: + :maxdepth: 1 + :caption: Text To Speech (TTS) + :name: Text To Speech + :titlesonly: + + tts/intro + +.. toctree:: + :maxdepth: 1 + :caption: Vision (CV) + :name: vision + :titlesonly: + + vision/intro + +.. toctree:: + :maxdepth: 1 + :caption: Common + :name: Common + :titlesonly: + + common/intro \ No newline at end of file diff --git a/docs/source/core/core_index.rst b/docs/source/core/core_index.rst index 28cd149bdcb5..01977c1b5101 100644 --- a/docs/source/core/core_index.rst +++ b/docs/source/core/core_index.rst @@ -1,5 +1,5 @@ ========= -NeMo Core +NeMo APIs ========= You can learn more about the underlying principles of the NeMo codebase in this section. @@ -30,7 +30,7 @@ Alternatively, you can jump straight to the documentation for the individual col * :doc:`Automatic Speech Recognition (ASR) <../asr/intro>` -* :doc:`Multimodal (MM) Models <../multimodal/mllm/intro>` +* :doc:`Multimodal Models (MMs) <../multimodal/mllm/intro>` * :doc:`Text-to-Speech (TTS) <../tts/intro>` diff --git a/docs/source/features/memory_optimizations.rst b/docs/source/features/memory_optimizations.rst new file mode 100644 index 000000000000..0e0b3ad84402 --- /dev/null +++ b/docs/source/features/memory_optimizations.rst @@ -0,0 +1,48 @@ +Memory Optimizations +==================== + +Parallelism +----------- +Refer to :doc:`Parallelism <./parallelisms>`. + +Flash Attention +--------------- + +Overview +^^^^^^^^ + +Flash Attention is a method designed to enhance the efficiency of Transformer models, which are widely utilized in applications such as Natural Language Processing (NLP). Traditional Transformers are slow and consume a lot of memory, especially with long sequences, due to the quadratic time and memory complexity of self-attention. FlashAttention is an IO-aware exact attention algorithm that leverages tiling to minimize the number of memory reads/writes between the GPU's high bandwidth memory (HBM) and on-chip SRAM. This approach is designed to be more efficient in terms of IO complexity compared to standard attention mechanisms. + +Turn Flash Attention On and Off +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +In the NeMo Framework, Flash Attention is supported through the Transformer Engine with the inclusion of Flash Attention 2. 
By default, Flash Attention is enabled, but the Transformer Engine may switch to a different kernel if the tensor dimensions are not optimal for Flash Attention. Users can completely disable Flash Attention by setting the environment variable ``NVTE_FLASH_ATTN=0``. + +For more details on the supported Dot Product Attention backends, please refer to the Transformer Engine source code available at `Transformer Engine's Attention Mechanism `_. + +.. bibliography:: ./nlp_all.bib + :style: plain + :labelprefix: nlp-megatron + :keyprefix: nlp-megatron- + +Activation Recomputation +------------------------ + +Full Activation Recomputation +""""""""""""""""""""""""""""" +This method recalculates all the intermediate activations during the backward pass of a model's training, instead of storing them during the forward pass. This technique maximizes memory efficiency at the cost of computational overhead, as each activation is recomputed when needed. + +Partial Activation Recomputation +"""""""""""""""""""""""""""""""" +This method recomputes only a subset of layers during the backward phase. It is a trade-off between full recomputation and no recomputation, balancing memory savings with computational efficiency. + +Selective Activation Recomputation +"""""""""""""""""""""""""""""""""" +This method significantly reduces the memory footprint of activations via smart activation checkpointing. This approach involves selectively storing only crucial activations and recomputing the others as needed. It is particularly useful in large models to minimize memory usage while controlling the computational cost. + +Refer to "Reducing Activation Recomputation in Large Transformer Models" for more details: https://arxiv.org/abs/2205.05198 + +.. bibliography:: ./nlp_all.bib + :style: plain + :labelprefix: nlp-megatron + :keyprefix: nlp-megatron- \ No newline at end of file diff --git a/docs/source/features/mixed_precision.rst b/docs/source/features/mixed_precision.rst new file mode 100644 index 000000000000..d193752e5475 --- /dev/null +++ b/docs/source/features/mixed_precision.rst @@ -0,0 +1,6 @@ +.. _mix_precision: + +Mixed Precision Training +------------------------ + +Mixed precision training significantly enhances computational efficiency by conducting operations in half-precision and FP8 formats, while selectively maintaining minimal data in single-precision to preserve critical information throughout key areas of the network. NeMo now supports FP16, BF16, and FP8 (via Transformer Engine) across most models. Further details will be provided shortly. diff --git a/docs/source/nlp/nemo_megatron/parallelisms.rst b/docs/source/features/parallelisms.rst similarity index 74% rename from docs/source/nlp/nemo_megatron/parallelisms.rst rename to docs/source/features/parallelisms.rst index 9129963ef021..b10477e4232c 100644 --- a/docs/source/nlp/nemo_megatron/parallelisms.rst +++ b/docs/source/features/parallelisms.rst @@ -3,13 +3,13 @@ Parallelisms ------------ -NeMo Megatron supports 5 types of parallelisms (which can be mixed together arbitraritly): +NeMo Megatron supports 5 types of parallelisms (which can be mixed together arbitrarily): -Distributed Data parallelism +Distributed Data Parallelism ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Distributed Data parallelism (DDP) creates idential copies of the model across multiple GPUs. +Distributed Data Parallelism (DDP) creates identical copies of the model across multiple GPUs. -.. image:: images/ddp.gif +.. 
image:: ../nlp/nemo_megatron/images/ddp.gif :align: center :width: 800px :alt: Distributed Data Parallel @@ -20,7 +20,7 @@ Tensor Parallelism With Tensor Paralellism (TP) a tensor is split into non-overlapping pieces and different parts are distributed and processed on separate GPUs. -.. image:: images/tp.gif +.. image:: ../nlp/nemo_megatron/images/tp.gif :align: center :width: 800px :alt: Tensor Parallel @@ -29,7 +29,7 @@ Pipeline Parallelism ^^^^^^^^^^^^^^^^^^^^ With Pipeline Paralellism (PP) consecutive layer chunks are assigned to different GPUs. -.. image:: images/pp.gif +.. image:: ../nlp/nemo_megatron/images/pp.gif :align: center :width: 800px :alt: Pipeline Parallel @@ -37,7 +37,7 @@ With Pipeline Paralellism (PP) consecutive layer chunks are assigned to differen Sequence Parallelism ^^^^^^^^^^^^^^^^^^^^ -.. image:: images/sp.gif +.. image:: ../nlp/nemo_megatron/images/sp.gif :align: center :width: 800px :alt: Sequence Parallel @@ -47,7 +47,7 @@ Expert Parallelism Expert Paralellim (EP) distributes experts across GPUs. -.. image:: images/ep.png +.. image:: ../nlp/nemo_megatron/images/ep.png :align: center :width: 800px :alt: Expert Parallelism @@ -57,7 +57,7 @@ Parallelism nomenclature When reading and modifying NeMo Megatron code you will encounter the following terms. -.. image:: images/pnom.gif +.. image:: ../nlp/nemo_megatron/images/pnom.gif :align: center :width: 800px :alt: Parallelism nomenclature diff --git a/docs/source/nlp/nemo_megatron/packed_sequence.rst b/docs/source/features/throughput_optimizations.rst similarity index 96% rename from docs/source/nlp/nemo_megatron/packed_sequence.rst rename to docs/source/features/throughput_optimizations.rst index e31444fe1e60..825c3add5dfb 100644 --- a/docs/source/nlp/nemo_megatron/packed_sequence.rst +++ b/docs/source/features/throughput_optimizations.rst @@ -1,7 +1,9 @@ +Throughput Optimizations +======================== + Sequence Packing for SFT/PEFT ----------------------------- - Overview ^^^^^^^^ @@ -133,6 +135,10 @@ To train with packed sequences, you need to change four items in the SFT/PEFT co Now you are all set to finetune your model with a much improved throughput! +Communication Overlap +--------------------- +NeMo leverages Megatron-Core's optimizations to enhance bandwidth utilization and effectively overlap computation with communication. Additional details will be provided soon. + .. rubric:: Footnotes diff --git a/docs/source/index.rst b/docs/source/index.rst index 8dc74ecc771d..82d3359480ca 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -1,7 +1,19 @@ NVIDIA NeMo Framework Developer Docs ==================================== -NVIDIA NeMo Framework is an end-to-end, cloud-native framework to build, customize, and deploy generative AI models anywhere. +NVIDIA NeMo Framework is an end-to-end, cloud-native framework designed to build, customize, and deploy generative AI models anywhere. 
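Most of the large-scale training features introduced in the feature pages above (and summarized in the list that follows) are exercised through environment variables and Hydra-style overrides on a launch command. The sketch below is a minimal, hedged example: the ``megatron_gpt_pretraining.py`` script path and the override keys are assumptions based on typical NeMo Megatron GPT configurations and may differ in your release; ``NVTE_FLASH_ATTN`` is the toggle documented in the memory optimizations page above.

.. code-block:: bash

   # Optional: Flash Attention is on by default when the tensor shapes allow it;
   # setting this variable (documented above) turns the Transformer Engine flash kernels off.
   export NVTE_FLASH_ATTN=0

   # Assumed example script and config keys -- adjust to your own setup.
   python examples/nlp/language_modeling/megatron_gpt_pretraining.py \
       trainer.devices=8 \
       trainer.precision=bf16 \
       model.tensor_model_parallel_size=2 \
       model.pipeline_model_parallel_size=1

Combining such options (precision, tensor/pipeline parallelism, attention kernels) is exactly what the feature list that follows summarizes.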
+ +`NVIDIA NeMo Framework `_ supports large-scale training features, including: + +- Mixed Precision Training +- Parallelism +- Distributed Optimizer +- Fully Sharded Data Parallel (FSDP) +- Flash Attention +- Activation Recomputation +- Positional Embeddings and Positional Interpolation +- Post-Training Quantization (PTQ) with Ammo +- Sequence Packing `NVIDIA NeMo Framework `_ has separate collections for: @@ -9,7 +21,7 @@ NVIDIA NeMo Framework is an end-to-end, cloud-native framework to build, customi * :doc:`Automatic Speech Recognition (ASR) ` -* :doc:`Multimodal (MM) Models ` +* :doc:`Multimodal Models (MMs) ` * :doc:`Text-to-Speech (TTS) ` @@ -29,105 +41,49 @@ For quick guides and tutorials, see the "Getting started" section below. :titlesonly: starthere/intro - starthere/tutorials starthere/best-practices + starthere/tutorials For more information, browse the developer docs for your area of interest in the contents section below or on the left sidebar. + .. toctree:: :maxdepth: 1 - :caption: NeMo Core - :name: core - :titlesonly: + :caption: Key Optimizations + :name: Key Optimizations - core/core_index + features/mixed_precision + features/parallelisms + features/memory_optimizations + features/throughput_optimizations .. toctree:: - :maxdepth: 2 + :maxdepth: 1 :caption: Community Model Converters :name: CheckpointConverters - ckpt_converters/user_guide - ckpt_converters/dev_guide - -.. toctree:: - :maxdepth: 1 - :caption: Large Language Models (LLMs) - :name: Large Language Models - :titlesonly: - - nlp/nemo_megatron/intro - nlp/models - nlp/machine_translation/machine_translation - nlp/megatron_onnx_export - nlp/quantization - nlp/api - + ckpt_converters/intro .. toctree:: :maxdepth: 1 - :caption: Speech AI - :name: Speech AI + :caption: APIs + :name: APIs :titlesonly: - asr/intro - asr/speech_classification/intro - asr/speaker_recognition/intro - asr/speaker_diarization/intro - asr/ssl/intro - asr/speech_intent_slot/intro - + core/core_index .. toctree:: :maxdepth: 1 - :caption: Multimodal (MM) - :name: Multimodal + :caption: Collections + :name: Collections :titlesonly: - multimodal/mllm/intro - multimodal/vlm/intro - multimodal/text2img/intro - multimodal/nerf/intro - multimodal/api - + collections .. toctree:: :maxdepth: 1 - :caption: Text To Speech (TTS) - :name: Text To Speech - :titlesonly: - - tts/intro - -.. toctree:: - :maxdepth: 2 - :caption: Vision (CV) - :name: vision - :titlesonly: - - vision/intro - -.. toctree:: - :maxdepth: 2 - :caption: Common - :name: Common - :titlesonly: - - common/intro - - -.. toctree:: - :maxdepth: 2 - :caption: Speech Tools - :name: Speech Tools - :titlesonly: - - tools/intro - -.. toctree:: - :maxdepth: 2 - :caption: Upgrade Guide - :name: Upgrade Guide + :caption: Speech AI Tools + :name: Speech AI Tools :titlesonly: - starthere/migration-guide \ No newline at end of file + tools/intro \ No newline at end of file diff --git a/docs/source/multimodal/mllm/neva.rst b/docs/source/multimodal/mllm/neva.rst index 83fb6b681e29..5484ab358c2f 100644 --- a/docs/source/multimodal/mllm/neva.rst +++ b/docs/source/multimodal/mllm/neva.rst @@ -25,7 +25,7 @@ In NeMo, the text encoder is anchored in the :class:`~nemo.collections.nlp.model Vision Model ^^^^^^^^^^^^ -For visual interpretation, NeVA harnesses the power of the pre-trained CLIP visual encoder, ViT-L/14, recognized for its visual comprehension acumen. Images are first partitioned into standardized patches, for instance, 16x16 pixels. 
These patches are linearly embedded, forming a flattened vector that subsequently feeds into the transformer. The culmination of the transformer's processing is a unified image representation. In the NeMo framework, the NeVA vision model, anchored on the CLIP visual encoder ViT-L/14, can either be instantiated via the :class:`~nemo.collections.multimodal.models.multimodal_llm.clip.megatron_clip_models.CLIPVisionTransformer` class or initiated through the `transformers` package from Hugging Face. +For visual interpretation, NeVA harnesses the power of the pre-trained CLIP visual encoder, ViT-L/14, recognized for its visual comprehension acumen. Images are first partitioned into standardized patches, for instance, 16x16 pixels. These patches are linearly embedded, forming a flattened vector that subsequently feeds into the transformer. The culmination of the transformer's processing is a unified image representation. In the NeMo Framework, the NeVA vision model, anchored on the CLIP visual encoder ViT-L/14, can either be instantiated via the :class:`~nemo.collections.multimodal.models.multimodal_llm.clip.megatron_clip_models.CLIPVisionTransformer` class or initiated through the `transformers` package from Hugging Face. Projection and Integration ^^^^^^^^^^^^^^^^^^^^^^^^^^ diff --git a/docs/source/multimodal/text2img/sd.rst b/docs/source/multimodal/text2img/sd.rst index 11ccfd010058..6f5092f93f5f 100644 --- a/docs/source/multimodal/text2img/sd.rst +++ b/docs/source/multimodal/text2img/sd.rst @@ -1,7 +1,7 @@ Stable Diffusion ================ -This section gives a brief overview of the stable diffusion model in NeMo framework. +This section gives a brief overview of the stable diffusion model in NeMo Framework. Model Introduction -------------------- diff --git a/docs/source/nlp/nemo_megatron/flash_attention.rst b/docs/source/nlp/nemo_megatron/flash_attention.rst deleted file mode 100644 index b00b7a38d63a..000000000000 --- a/docs/source/nlp/nemo_megatron/flash_attention.rst +++ /dev/null @@ -1,28 +0,0 @@ -Flash attention ---------------- -Flash Attention :cite:`nlp-megatron-dao2022flashattention` is a method designed to enhance the efficiency of Transformer models, which are widely utilized in applications such as natural language processing. Traditional Transformers are slow and consume a lot of memory, especially with long sequences, due to the quadratic time and memory complexity of self-attention. FlashAttention, an IO-aware exact attention algorithm that leverages tiling to minimize the number of memory reads/writes between the GPU's high bandwidth memory (HBM) and on-chip SRAM. This approach is designed to be more efficient in terms of IO complexity compared to standard attention mechanisms. - -GPT -^^^ -To enable Flash Attention while Megatron GPT model training or fine-tuning, modify the following configuration: - -.. code:: - - model.use_flash_attention=True - -T5 -^^ -To enable Flash Attention while Megatron T5 model training, modify the following configuration: - -.. code:: - - model.encoder.use_flash_attention=True - model.decoder.use_flash_attention=True - -References ----------- - -.. 
bibliography:: ../nlp_all.bib - :style: plain - :labelprefix: nlp-megatron - :keyprefix: nlp-megatron- diff --git a/docs/source/nlp/nemo_megatron/intro.rst b/docs/source/nlp/nemo_megatron/intro.rst index c582edbffd61..fab448f3d4f2 100644 --- a/docs/source/nlp/nemo_megatron/intro.rst +++ b/docs/source/nlp/nemo_megatron/intro.rst @@ -12,18 +12,14 @@ To learn more about using NeMo to train Large Language Models at scale, please r .. toctree:: :maxdepth: 1 - mlm_migration gpt/gpt_training batching - parallelisms prompt_learning retro/retro_model hiddens/hiddens_module peft/landing_page - flash_attention positional_embeddings mcore_customization - packed_sequence References diff --git a/docs/source/nlp/nemo_megatron/mlm_migration.rst b/docs/source/nlp/nemo_megatron/mlm_migration.rst deleted file mode 100644 index ffe9764615b5..000000000000 --- a/docs/source/nlp/nemo_megatron/mlm_migration.rst +++ /dev/null @@ -1,24 +0,0 @@ -Migrating from Megatron-LM --------------------------- - -NeMo Megatron and Megatron-LM share many underlying technology. You should be able to convert your GPT model checkpoints trained with Megatron-LM into NeMo Megatron. -Example conversion script: - -.. code-block:: bash - - /examples/nlp/language_modeling/megatron_lm_ckpt_to_nemo.py \ - --checkpoint_folder \ - --checkpoint_name megatron_gpt--val_loss=99.99-step={steps}-consumed_samples={consumed}.0 \ - --nemo_file_path \ - --model_type \ - --tensor_model_parallel_size \ - --pipeline_model_parallel_size \ - --gpus_per_node - - - -To resume the training from converted MegatronLM checkpoint, make sure to set the -`trainer.max_steps=round(lr-warmup-fraction * lr-decay-iters + lr-decay-iters)` -where `lr-warmup-fraction` and `lr-decay-iters` are arguments from MegatronLM training -so the learning rate scheduler will follow the same curve. - diff --git a/docs/source/nlp/nemo_megatron/positional_embeddings.rst b/docs/source/nlp/nemo_megatron/positional_embeddings.rst index b8dea5280c28..332ce304049d 100644 --- a/docs/source/nlp/nemo_megatron/positional_embeddings.rst +++ b/docs/source/nlp/nemo_megatron/positional_embeddings.rst @@ -18,26 +18,26 @@ GPT - .. code:: model.position_embedding_type='learned_absolute' - - Absolute Position Encodings :cite:`nlp-megatron-vaswani2023attention` are position embeddings used in Transformer-based models, added to input embeddings in the encoder and decoder sections. These encodings match the dimension of embeddings and are created using sine and cosine functions of various frequencies. Each dimension in the encoding corresponds to a sinusoid with wavelengths forming a geometric progression. + - Absolute Position Encodings :cite:`nlp-megatron-vaswani2023attention` are position embeddings used in Transformer-based models, added to input embeddings in the encoder and decoder sections. These encodings match the dimension of embeddings and are created using sine and cosine functions of various frequencies. Each dimension in the encoding corresponds to a sinusoid with wavelengths forming a geometric progression. * - **rope** - .. 
code:: - + model.position_embedding_type='rope' model.rotary_percentage=1.0 - - Rotary Position Embedding (RoPE) :cite:`nlp-megatron-su2022roformer` incorporates positional information by utilizing a rotation matrix to encode the absolute positions of tokens while maintaining relative positional relationships in self-attention formulations by leveraging the geometric properties of vectors and complex numbers, applying a rotation based on a preset non-zero constant and the relative positions of the tokens to the word embeddings. - + - Rotary Position Embedding (RoPE) :cite:`nlp-megatron-su2022roformer` incorporates positional information by utilizing a rotation matrix to encode the absolute positions of tokens while maintaining relative positional relationships in self-attention formulations. It achieves this by leveraging the geometric properties of vectors and complex numbers and applying a rotation based on a preset non-zero constant and the relative positions of the tokens to the word embeddings. + * - **alibi** - .. code:: - + model.position_embedding_type='alibi' - - Attention with Linear Biases (ALiBi) :cite:`nlp-megatron-press2022train` modifies the way attention scores are computed in the attention sublayer of the network. ALiBi introduces a static, non-learned bias after the query-key dot product during the computation of attention scores. This bias is added in the form of a head-specific slope that is determined before training, creating a geometric sequence of slopes for the different heads in the model. The method has an inductive bias towards recency, penalizing attention scores between distant query-key pairs with the penalty increasing as the distance grows, and it leverages different rates of penalty increase across different heads based on the slope magnitude. + - Attention with Linear Biases (ALiBi) :cite:`nlp-megatron-press2022train` modifies the way attention scores are computed in the attention sublayer of the network. ALiBi introduces a static, non-learned bias after the query-key dot product during the computation of attention scores. This bias is added in the form of a head-specific slope that is determined before training, creating a geometric sequence of slopes for the different heads in the model. The method has an inductive bias towards recency, penalizing attention scores between distant query-key pairs with the penalty increasing as the distance grows, and it leverages different rates of penalty increase across different heads based on the slope magnitude. * - **kerple** - .. code:: model.position_embedding_type='kerple' - - Kernelized Relative Positional Embedding for Length Extrapolation (KERPLE) :cite:`nlp-megatron-chi2022kerple` generalizes relative positional embeddings (RPE) by kernelizing positional differences using conditionally positive definite (CPD) kernels known for generalizing distance metrics. They transform CPD kernels into positive definite (PD) kernels by adding a constant offset, which is absorbed during softmax normalization in the self-attention mechanism of transformers. This approach allows for a variety of RPEs that facilitate length extrapolation in a principled manner. + - Kernelized Relative Positional Embedding for Length Extrapolation (KERPLE) :cite:`nlp-megatron-chi2022kerple` generalizes relative positional embeddings (RPE) by kernelizing positional differences using Conditionally Positive Definite (CPD) kernels known for generalizing distance metrics. 
They transform CPD kernels into positive definite (PD) kernels by adding a constant offset, which is absorbed during softmax normalization in the self-attention mechanism of transformers. This approach allows for a variety of RPEs that facilitate length extrapolation in a principled manner. * - **xpos** - .. code:: @@ -64,43 +64,43 @@ T5 * - **learned_absolute** - .. code:: - + model.encoder.position_embedding_type='learned_absolute' model.decoder.position_embedding_type='learned_absolute' - - Absolute Position Encodings :cite:`nlp-megatron-vaswani2023attention` are position embeddings used in Transformer-based models, added to input embeddings in the encoder and decoder sections. These encodings match the dimension of embeddings and are created using sine and cosine functions of various frequencies. Each dimension in the encoding corresponds to a sinusoid with wavelengths forming a geometric progression. + - Absolute Position Encodings :cite:`nlp-megatron-vaswani2023attention` are position embeddings used in Transformer-based models, added to input embeddings in the encoder and decoder sections. These encodings match the dimension of embeddings and are created using sine and cosine functions of various frequencies. Each dimension in the encoding corresponds to a sinusoid with wavelengths forming a geometric progression. * - **relative** - .. code:: - + model.encoder.position_embedding_type='relative' model.decoder.position_embedding_type='relative' - Relative Position Representations :cite:`nlp-megatron-shaw2018selfattention` * - **alibi** - .. code:: - + model.encoder.position_embedding_type='alibi' model.decoder.position_embedding_type='alibi' - - Attention with Linear Biases (ALiBi) :cite:`nlp-megatron-press2022train` modifies the way attention scores are computed in the attention sublayer of the network. ALiBi introduces a static, non-learned bias after the query-key dot product during the computation of attention scores. This bias is added in the form of a head-specific slope that is determined before training, creating a geometric sequence of slopes for the different heads in the model. The method has an inductive bias towards recency, penalizing attention scores between distant query-key pairs with the penalty increasing as the distance grows, and it leverages different rates of penalty increase across different heads based on the slope magnitude. + - Attention with Linear Biases (ALiBi) :cite:`nlp-megatron-press2022train` modifies the way attention scores are computed in the attention sublayer of the network. ALiBi introduces a static, non-learned bias after the query-key dot product during the computation of attention scores. This bias is added in the form of a head-specific slope that is determined before training, creating a geometric sequence of slopes for the different heads in the model. The method has an inductive bias towards recency, penalizing attention scores between distant query-key pairs with the penalty increasing as the distance grows, and it leverages different rates of penalty increase across different heads based on the slope magnitude. * - **kerple** - .. code:: - + model.encoder.position_embedding_type='kerple' model.decoder.position_embedding_type='kerple' - - Kernelized Relative Positional Embedding for Length Extrapolation (KERPLE) :cite:`nlp-megatron-chi2022kerple` generalizes relative positional embeddings (RPE) by kernelizing positional differences using conditionally positive definite (CPD) kernels known for generalizing distance metrics. 
They transform CPD kernels into positive definite (PD) kernels by adding a constant offset, which is absorbed during softmax normalization in the self-attention mechanism of transformers. This approach allows for a variety of RPEs that facilitate length extrapolation in a principled manner. + - Kernelized Relative Positional Embedding for Length Extrapolation (KERPLE) :cite:`nlp-megatron-chi2022kerple` generalizes relative positional embeddings (RPE) by kernelizing positional differences using Conditionally Positive Definite (CPD) kernels known for generalizing distance metrics. They transform CPD kernels into positive definite (PD) kernels by adding a constant offset, which is absorbed during softmax normalization in the self-attention mechanism of transformers. This approach allows for a variety of RPEs that facilitate length extrapolation in a principled manner. Positional interpolation ------------------------ Position Interpolation (PI) :cite:`nlp-megatron-chen2023extending` is a method introduced to extend the context window sizes of Rotary Position Embedding (RoPE)-based pretrained large language models (LLMs). The central principle of PI is to reduce the position indices so that they align with the initial context window size through interpolation. -Positional Interpolation is supported in Megatron GPT SFT models. Set RoPE Interpolation factor for sequence length :code:`seq_len_interpolation_factor` to enable it. +Positional Interpolation is supported in Megatron GPT SFT models. Set RoPE Interpolation factor for sequence length :code:`seq_len_interpolation_factor` to enable it. .. code:: - + model.position_embedding_type='rope' model.rotary_percentage=1.0 - model.seq_len_interpolation_factor: 2 + model.seq_len_interpolation_factor: 2 References ---------- diff --git a/docs/source/starthere/best-practices.rst b/docs/source/starthere/best-practices.rst index 5e2f5db23cfb..ec0fea1985cc 100644 --- a/docs/source/starthere/best-practices.rst +++ b/docs/source/starthere/best-practices.rst @@ -1,299 +1,72 @@ .. _best-practices: -Best Practices -============== - -The NVIDIA NeMo Toolkit is available on GitHub as `open source `_ as well as -a `Docker container on NGC `_. It's assumed the user has -already installed NeMo by following the :ref:`quick_start_guide` instructions. - -The conversational AI pipeline consists of three major stages: - -- Automatic Speech Recognition (ASR) -- Natural Language Processing (NLP) or Natural Language Understanding (NLU) -- Text-to-Speech (TTS) Synthesis - -As you talk to a computer, the ASR phase converts the audio signal into text, the NLP stage interprets the question -and generates a smart response, and finally the TTS phase converts the text into speech signals to generate audio for -the user. The toolkit enables development and training of deep learning models involved in conversational AI and easily -chain them together. - Why NeMo? ---------- - -Deep learning model development for conversational AI is complex. It involves defining, building, and training several -models in specific domains; experimenting several times to get high accuracy, fine tuning on multiple tasks and domain -specific data, ensuring training performance and making sure the models are ready for deployment to inference applications. -Neural modules are logical blocks of AI applications which take some typed inputs and produce certain typed outputs. 
By -separating a model into its essential components in a building block manner, NeMo helps researchers develop state-of-the-art -accuracy models for domain specific data faster and easier. - -Collections of modules for core tasks as well as specific to speech recognition, natural language, speech synthesis help -develop modular, flexible, and reusable pipelines. - -A neural module’s inputs/outputs have a neural type, that describes the semantics, the axis order and meaning, and the dimensions -of the input/output tensors. This typing allows neural modules to be safely chained together to build models for applications. - -NeMo can be used to train new models or perform transfer learning on existing pre-trained models. Pre-trained weights per module -(such as encoder, decoder) help accelerate model training for domain specific data. - -ASR, NLP and TTS pre-trained models are trained on multiple datasets (including some languages such as Mandarin) and optimized -for high accuracy. They can be used for transfer learning as well. - -NeMo supports developing models that work with Mandarin Chinese data. Tutorials help users train or fine tune models for -conversational AI with the Mandarin Chinese language. The export method provided in NeMo makes it easy to transform a trained -model into inference ready format for deployment. - -A key area of development in the toolkit is interoperability with other tools used by speech researchers. Data layer for Kaldi -compatibility is one such example. +========= -NeMo, PyTorch Lightning, And Hydra ----------------------------------- +Developing deep learning models for Gen AI is a complex process, encompassing the design, construction, and training of models across specific domains. Achieving high accuracy requires extensive experimentation, fine-tuning for diverse tasks and domain-specific datasets, ensuring optimal training performance, and preparing models for deployment. -Conversational AI architectures are typically very large and require a lot of data and compute for training. NeMo uses -`Pytorch Lightning `_ for easy and performant multi-GPU/multi-node -mixed precision training. +NeMo simplifies this intricate development landscape through its modular approach. It introduces neural modules—logical blocks of AI applications with typed inputs and outputs—facilitating the seamless construction of models by chaining these blocks based on neural types. This methodology accelerates development, improves model accuracy on domain-specific data, and promotes modularity, flexibility, and reusability within AI workflows. -Pytorch Lightning is a high-performance PyTorch wrapper that organizes PyTorch code, scales model training, and reduces -boilerplate. PyTorch Lightning has two main components, the ``LightningModule`` and the Trainer. The ``LightningModule`` is -used to organize PyTorch code so that deep learning experiments can be easily understood and reproduced. The Pytorch Lightning -Trainer is then able to take the ``LightningModule`` and automate everything needed for deep learning training. +Further enhancing its utility, NeMo provides collections of modules designed for core tasks in speech recognition, natural language processing, and speech synthesis. It supports the training of new models or fine-tuning of existing pre-trained modules, leveraging pre-trained weights to expedite the training process. -NeMo models are LightningModules that come equipped with all supporting infrastructure for training and reproducibility. 
This -includes the deep learning model architecture, data preprocessing, optimizer, check-pointing and experiment logging. NeMo -models, like LightningModules, are also PyTorch modules and are fully compatible with the broader PyTorch ecosystem. Any NeMo -model can be taken and plugged into any PyTorch workflow. +The framework encompasses models trained and optimized for multiple languages, including Mandarin, and offers extensive tutorials for conversational AI development across these languages. NeMo's emphasis on interoperability with other research tools broadens its applicability and ease of use. -Configuring conversational AI applications is difficult due to the need to bring together many different Python libraries into -one end-to-end system. NeMo uses Hydra for configuring both NeMo models and the PyTorch Lightning Trainer. `Hydra `_ -is a flexible solution that makes it easy to configure all of these libraries from a configuration file or from the command-line. +Large Language Models & Multimodal (LLM & MM) +--------------------------------------------- -Every NeMo model has an example configuration file and a corresponding script that contains all configurations needed for training -to state-of-the-art accuracy. NeMo models have the same look and feel so that it is easy to do conversational AI research across -multiple domains. +NeMo excels in training large-scale LLM & MM, utilizing optimizations from Megatron-LM and Transformer Engine to deliver state-of-the-art performance. It includes a comprehensive feature set for large-scale training: -Using Optimized Pretrained Models With NeMo -------------------------------------------- +- Supports Multi-GPU and Multi-Node computing to enable scalability. +- Precision options including FP32/TF32, FP16, BF16, and TransformerEngine/FP8. +- Parallelism strategies: Data parallelism, Tensor parallelism, Pipeline parallelism, Interleaved Pipeline parallelism, Sequence parallelism and Context parallelism, Distributed Optimizer, and Fully Shared Data Parallel. +- Optimized utilities such as Flash Attention, Activation Recomputation, and Communication Overlap. +- Advanced checkpointing through the Distributed Checkpoint Format. -`NVIDIA GPU Cloud (NGC) `_ is a software repository that has containers and models optimized -for deep learning. NGC hosts many conversational AI models developed with NeMo that have been trained to state-of-the-art accuracy -on large datasets. NeMo models on NGC can be automatically downloaded and used for transfer learning tasks. Pretrained models -are the quickest way to get started with conversational AI on your own data. NeMo has many `example scripts `_ -and `Jupyter Notebook tutorials `_ showing step-by-step how to fine-tune pretrained NeMo -models on your own domain-specific datasets. - -For BERT based models, the model weights provided are ready for -downstream NLU tasks. For speech models, it can be helpful to start with a pretrained model and then continue pretraining on your -own domain-specific data. Jasper and QuartzNet base model pretrained weights have been known to be very efficient when used as -base models. For an easy to follow guide on transfer learning and building domain specific ASR models, you can follow this `blog `_. -All pre-trained NeMo models can be found on the `NGC NeMo Collection `_. Everything needed to quickly get started -with NeMo ASR, NLP, and TTS models is there. 
- -Pre-trained models are packaged as a ``.nemo`` file and contain the PyTorch checkpoint along with everything needed to use the model. -NeMo models are trained to state-of-the-art accuracy and trained on multiple datasets so that they are robust to small differences -in data. NeMo contains a large variety of models such as speaker identification and Megatron BERT and the best models in speech and -language are constantly being added as they become available. NeMo is the premier toolkit for conversational AI model building and -training. - -For a list of supported models, refer to the :ref:`tutorials` section. - -ASR Guidance ------------- - -This section is to help guide your decision making by answering our most asked ASR questions. - -**Q: Is there a way to add domain specific vocabulary in NeMo? If so, how do I do that?** -A: QuartzNet and Jasper models are character-based. So pretrained models we provide for these two output lowercase English -letters and ‘. Users can re-retrain them on vocabulary with upper case letters and punctuation symbols. - -**Q: When training, there are “Reference” lines and “Decoded” lines that are printed out. It seems like the reference line should -be the “truth” line and the decoded line should be what the ASR is transcribing. Why do I see that even the reference lines do not -appear to be correct?** -A: Because our pre-trained models can only output lowercase letters and apostrophe, everything else is dropped. So the model will -transcribe 10 as ten. The best way forward is to prepare the training data first by transforming everything to lowercase and convert -the numbers from digit representation to word representation using a simple library such as `inflect `_. Then, add the uppercase letters -and punctuation back using the NLP punctuation model. Here is an example of how this is incorporated: `NeMo voice swap demo `_. - -**Q: What languages are supported in NeMo currently?** -A: Along with English, we provide pre-trained models for Zh, Es, Fr, De, Ru, It, Ca and Pl languages. -For more information, see `NeMo Speech Models `_. +Speech AI +-------- Data Augmentation ------------------ - -Data augmentation in ASR is invaluable. It comes at the cost of increased training time if samples are augmented during training -time. To save training time, it is recommended to pre-process the dataset offline for a one time preprocessing cost and then train -the dataset on this augmented training set. +~~~~~~~~~~~~~~~~~ -For example, processing a single sample involves: - -- Speed perturbation -- Time stretch perturbation (sample level) -- Noise perturbation -- Impulse perturbation -- Time stretch augmentation (batch level, neural module) - -A simple tutorial guides users on how to use these utilities provided in `GitHub: NeMo `_. +Augmenting ASR data is essential but can be time-consuming during training. NeMo advocates for offline dataset preprocessing to conserve training time, illustrated in a tutorial covering speed perturbation and noise augmentation techniques. Speech Data Explorer --------------------- - -Speech data explorer is a `Dash-based tool `_ for interactive exploration of ASR/TTS datasets. 
+~~~~~~~~~~~~~~~~~~~~ -Speech data explorer collects: - -- dataset statistics (alphabet, vocabulary, and duration-based histograms) -- navigation across datasets (sorting and filtering) -- inspections of individual utterances (waveform, spectrogram, and audio player) -- errors analysis (word error rate, character error rate, word match rate, mean word accuracy, and diff) - -In order to use the tool, it needs to be installed separately. Perform the steps `here `_ to install speech data explorer. +A Dash-based tool for interactive exploration of ASR/TTS datasets, providing insights into dataset statistics, utterance inspections, and error analysis. Installation instructions for this tool are available in NeMo’s GitHub repository. Using Kaldi Formatted Data --------------------------- - -The `Kaldi Speech Recognition Toolkit `_ project began in 2009 at `Johns Hopkins University `. It is a toolkit written in C++. If -researchers have used Kaldi and have datasets that are formatted to be used with the toolkit; they can use NeMo to develop models -based on that data. - -To load Kaldi-formatted data, you can simply use ``KaldiFeatureDataLayer`` instead of ``AudioToTextDataLayer``. The ``KaldiFeatureDataLayer`` -takes in the argument ``kaldi_dir`` instead of a ``manifest_filepath``. The ``manifest_filepath`` argument should be set to the directory -that contains the files ``feats.scp`` and ``text``. - -Using Speech Command Recognition Task For ASR Models ----------------------------------------------------- - -Speech Command Recognition is the task of classifying an input audio pattern into a set of discrete classes. It is a subset of ASR, -sometimes referred to as Key Word Spotting, in which a model is constantly analyzing speech patterns to detect certain ``action`` classes. - -Upon detection of these commands, a specific action can be taken. An example Jupyter notebook provided in NeMo shows how to train a -QuartzNet model with a modified decoder head trained on a speech commands dataset. - -.. note:: It is preferred that you use absolute paths to ``data_dir`` when preprocessing the dataset. - -NLP Fine-Tuning BERT --------------------- - -BERT, or Bidirectional Encoder Representations from Transformers, is a neural approach to pre-train language representations which -obtains near state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks, including the GLUE benchmark and -SQuAD Question & Answering dataset. - -BERT model checkpoints (`BERT-large-uncased `_ and `BERT-base-uncased `_) are provided can be used for either fine tuning BERT on your custom -dataset, or fine tuning downstream tasks, including GLUE benchmark tasks, Question & Answering tasks, Joint Intent & Slot detection, -Punctuation and Capitalization, Named Entity Recognition, and Speech Recognition post processing model to correct mistakes. - -.. note:: Almost all NLP examples also support RoBERTa and ALBERT models for downstream fine-tuning tasks (see the list of all supported models by calling ``nemo.collections.nlp.modules.common.lm_utils.get_pretrained_lm_models_list()``). The user needs to specify the name of the model desired while running the example scripts. +~~~~~~~~~~~~~~~~~~~~~~~~~~ -BioMegatron Medical BERT ------------------------- +NeMo supports Kaldi-formatted datasets, enabling the development of models with existing Kaldi data by substituting the AudioToTextDataLayer with the KaldiFeatureDataLayer. 
-BioMegatron is a large language model (Megatron-LM) trained on larger domain text corpus (PubMed abstract + full-text-commercial). -It achieves state-of-the-art results for certain tasks such as Relationship Extraction, Named Entity Recognition and Question & -Answering. Follow these tutorials to learn how to train and fine tune BioMegatron; pretrained models are provided on NGC: +Speech Command Recognition +~~~~~~~~~~~~~~~~~~~~~~~~~~ -- `Relation Extraction BioMegatron `_ -- `Token Classification BioMegatron `_ +Specialized training for speech command recognition is covered in a dedicated NeMo Jupyter notebook, guiding users through the process of training a QuartzNet model on a speech commands dataset. -Efficient Training With NeMo ----------------------------- +General Optimizations +--------------------- -Using Mixed Precision -^^^^^^^^^^^^^^^^^^^^^ +Mixed Precision Training +~~~~~~~~~~~~~~~~~~~~~~~~ -Mixed precision accelerates training speed while protecting against noticeable loss. Tensor Cores is a specific hardware unit that -comes starting with the Volta and Turing architectures to accelerate large matrix to matrix multiply-add operations by operating them -on half precision inputs and returning the result in full precision. - -Neural networks which usually use massive matrix multiplications can be significantly sped up with mixed precision and Tensor Cores. -However, some neural network layers are numerically more sensitive than others. Apex AMP is an NVIDIA library that maximizes the -benefit of mixed precision and Tensor Cores usage for a given network. +Utilizing NVIDIA’s Apex AMP, mixed precision training enhances training speeds with minimal precision loss, especially on hardware equipped with Tensor Cores. Multi-GPU Training -^^^^^^^^^^^^^^^^^^ - -This section is to help guide your decision making by answering our most asked multi-GPU training questions. - -**Q: Why is multi-GPU training preferred over other types of training?** -A: Multi-GPU training can reduce the total training time by distributing the workload onto multiple compute instances. This is -particularly important for large neural networks which would otherwise take weeks to train until convergence. Since NeMo supports -multi-GPU training, no code change is needed to move from single to multi-GPU training, only a slight change in your launch command -is required. - -**Q: What are the advantages of mixed precision training?** -A: Mixed precision accelerates training speed while protecting against noticeable loss in precision. Tensor Cores is a specific -hardware unit that comes starting with the Volta and Turing architectures to accelerate large matrix multiply-add operations by -operating on half precision inputs and returning the result in full precision in order to prevent loss in precision. Neural -networks which usually use massive matrix multiplications can be significantly sped up with mixed precision and Tensor Cores. -However, some neural network layers are numerically more sensitive than others. Apex AMP is a NVIDIA library that maximizes the -benefit of mixed precision and Tensor Core usage for a given network. - -**Q: What is the difference between multi-GPU and multi-node training?** -A: Multi-node is an abstraction of multi-GPU training, which requires a distributed compute cluster, where each node can have multiple -GPUs. Multi-node training is needed to scale training beyond a single node to large amounts of GPUs. 
- -From the framework perspective, nothing changes from moving to multi-node training. However, a master address and port needs to be set -up for inter-node communication. Multi-GPU training will then be launched on each node with passed information. You might also consider -the underlying inter-node network topology and type to achieve full performance, such as HPC-style hardware such as NVLink, InfiniBand -networking, or Ethernet. - - -Recommendations For Optimization And FAQs ------------------------------------------ - -This section is to help guide your decision making by answering our most asked NeMo questions. - -**Q: Are there areas where performance can be increased?** -A: You should try using mixed precision for improved performance. Note that typically when using mixed precision, memory consumption -is decreased and larger batch sizes could be used to further improve the performance. - -When fine-tuning ASR models on your data, it is almost always possible to take advantage of NeMo's pre-trained modules. Even if you -have a different target vocabulary, or even a different language; you can still try starting with pre-trained weights from Jasper or -QuartzNet ``encoder`` and only adjust the ``decoder`` for your needs. - -**Q: What is the recommended sampling rate for ASR?** -A: The released models are based on 16 KHz audio, therefore, ensure you use models with 16 KHz audio. Reduced performance should be -expected for any audio that is up-sampled from a sampling frequency less than 16 KHz data. +~~~~~~~~~~~~~~~~~~ -**Q: How do we use this toolkit for audio with different types of compression and frequency than the training domain for ASR?** -A: You have to match the compression and frequency. +NeMo enables multi-GPU training, substantially reducing training durations for large models. This section clarifies the advantages of mixed precision and the distinctions between multi-GPU and multi-node training. -**Q: How do you replace the 6-gram out of the ASR model with a custom language model? What is the language format supported in NeMo?** -A: NeMo’s Beam Search decoder with Levenberg-Marquardt (LM) neural module supports the KenLM language model. +NeMo, PyTorch Lightning, and Hydra +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -- You should retrain the KenLM language model on your own dataset. Refer to `KenLM’s documentation `_. -- If you want to use a different language model, other than KenLM, you will need to implement a corresponding decoder module. -- Transformer-XL example is present in OS2S. It would need to be updated to work with NeMo. `Here is the code `_. +Integrating PyTorch Lightning for training efficiency and Hydra for configuration management, NeMo streamlines conversational AI research by organizing PyTorch code and automating training workflows. -**Q: How do I use text-to-speech (TTS) synthesis?** -A: +Optimized Pretrained Models +~~~~~~~~~~~~~~~~~~~~~~~~~~~ -- Obtain speech data ideally at 22050 Hz or alternatively at a higher sample rate and then down sample to 22050 Hz. - - If less than 22050 Hz and at least 16000 Hz: - - Retrain WaveGlow on your own dataset. - - Tweak the spectrogram generation parameters, namely the ``window_size`` and the ``window_stride`` for their fourier transforms. - - For below 16000 Hz, look into obtaining new data. -- In terms of bitrate/quantization, the general advice is the higher the better. We have not experimented enough to state how much - this impacts quality. 
-- For the amount of data, again the more the better, and the more diverse in terms of phonemes the better. Aim for around 20 hours - of speech after filtering for silences and non-speech audio. -- Most open speech datasets are in ~10 second format so training spectrogram generators on audio on the order of 10s - 20s per sample is known - to work. Additionally, the longer the speech samples, the more difficult it will be to train them. -- Audio files should be clean. There should be little background noise or music. Data recorded from a studio mic is likely to be easier - to train compared to data captured using a phone. -- To ensure pronunciation of words are accurate; the technical challenge is related to the dataset, text to phonetic spelling is - required, use phonetic alphabet (notation) that has the name correctly pronounced. -- Here are some example parameters you can use to train spectrogram generators: - - use single speaker dataset - - Use AMP level O0 - - Trim long silences in the beginning and end - - ``optimizer="adam"`` - - ``beta1 = 0.9`` - - ``beta2 = 0.999`` - - ``lr=0.001 (constant)`` - - ``amp_opt_level="O0"`` - - ``weight_decay=1e-6`` - - ``batch_size=48 (per GPU)`` - - ``trim_silence=True`` +Through NVIDIA GPU Cloud (NGC), NeMo offers a collection of optimized, pre-trained models for various conversational AI applications, facilitating easy integration into research projects and providing a head start in conversational AI development. Resources --------- diff --git a/docs/source/starthere/intro.rst b/docs/source/starthere/intro.rst index eaeab3c212d0..63fdcfb0406e 100644 --- a/docs/source/starthere/intro.rst +++ b/docs/source/starthere/intro.rst @@ -8,42 +8,125 @@ Introduction .. _dummy_header: -NVIDIA NeMo Framework is an end-to-end, cloud-native framework to build, customize, and deploy generative AI models anywhere. -To learn more about using NeMo in generative AI workflows, please refer to the `NeMo Framework User Guide `_. +NVIDIA NeMo Framework is an end-to-end, cloud-native framework for building, customizing, and deploying generative AI models anywhere. It allows for the creation of state-of-the-art models across a wide array of domains, including speech, language, and vision. For detailed information on utilizing NeMo in your generative AI workflows, refer to the `NeMo Framework User Guide `_. -`NVIDIA NeMo Framework `_ has separate collections for Large Language Models (LLMs), -Multimodal (MM), Computer Vision (CV), Automatic Speech Recognition (ASR), -and Text-to-Speech (TTS) models. Each collection consists of -prebuilt modules that include everything needed to train on your data. -Every module can easily be customized, extended, and composed to create new generative AI -model architectures. +Training generative AI architectures typically requires significant data and computing resources. NeMo utilizes `PyTorch Lightning `_ for efficient and performant multi-GPU/multi-node mixed-precision training. +NeMo is built on top of NVIDIA's powerful Megatron-LM and Transformer Engine for its Large Language Models (LLMs) and Multimodal Models (MMs), leveraging cutting-edge advancements in model training and optimization. For Speech AI applications, Automatic Speech Recognition (ASR) and Text-to-Speech (TTS), NeMo is developed with native PyTorch and PyTorch Lightning, ensuring seamless integration and ease of use. Future updates are planned to align Speech AI models with the Megatron framework, enhancing training efficiency and model performance. 
-Generative AI architectures are typically large and require a lot of data and compute
-for training. NeMo uses `PyTorch Lightning `_ for easy and performant multi-GPU/multi-node
-mixed-precision training.

-`Pre-trained NeMo models `_ are available
-in 14+ languages.
+`NVIDIA NeMo Framework `_ features separate collections for Large Language Models (LLMs), Multimodal Models (MMs), Computer Vision (CV), Automatic Speech Recognition (ASR), and Text-to-Speech (TTS) models. Each collection comprises prebuilt modules that include everything needed to train on your data. These modules can be easily customized, extended, and composed to create new generative AI model architectures.
+
+`Pre-trained NeMo models `_ are available in 14+ languages.

Prerequisites
-------------

-Before you begin using NeMo, it's assumed you meet the following prerequisites.
+Before using NeMo, make sure you meet the following prerequisites:
+
+#. Python version 3.10 or above.
+
+#. PyTorch version 1.13.1 or 2.0+.
+
+#. Access to an NVIDIA GPU for model training.
+
+Installation
+------------
+
+**Using NVIDIA PyTorch Container**
+
+To leverage all optimizations for LLM training, including 3D model parallelism, fused kernels, FP8, and more, we recommend using the NVIDIA PyTorch container.
+
+.. code-block:: bash
+
+    docker pull nvcr.io/nvidia/pytorch:24.01-py3
+    docker run --gpus all -it nvcr.io/nvidia/pytorch:24.01-py3
+
+Within the container, you can install NeMo and its dependencies as follows:
+
+NeMo Installation
+
+.. code-block:: bash
+
+    apt-get update && apt-get install -y libsndfile1 ffmpeg
+    pip install Cython
+    pip install nemo_toolkit['all']
+
+Transformer Engine Installation
+
+This step clones the Transformer Engine repository, checks out a pinned commit, and installs it with the required build flags.

-#. You have Python version 3.10 or above.
+.. code-block:: bash
+
+    git clone https://github.com/NVIDIA/TransformerEngine.git && \
+    cd TransformerEngine && \
+    git fetch origin 8c9abbb80dba196f086b8b602a7cf1bce0040a6a && \
+    git checkout FETCH_HEAD && \
+    git submodule init && git submodule update && \
+    NVTE_FRAMEWORK=pytorch NVTE_WITH_USERBUFFERS=1 MPI_HOME=/usr/local/mpi pip install .
+
+Apex Installation
+
+This step includes a bug fix for Apex in the PyTorch 23.11 container.
+
+.. code-block:: bash
+
+    git clone https://github.com/NVIDIA/apex.git && \
+    cd apex && \
+    git checkout c07a4cf67102b9cd3f97d1ba36690f985bae4227 && \
+    cp -R apex /usr/local/lib/python3.10/dist-packages
+
+PyTorch Lightning Installation
+
+This step installs a bug-fixed version of PyTorch Lightning from a specific branch.
+
+.. code-block:: bash

-#. You have Pytorch version 1.13.1 or 2.0+.
+    git clone -b bug_fix https://github.com/athitten/pytorch-lightning.git && \
+    cd pytorch-lightning && \
+    PACKAGE_NAME=pytorch pip install -e .

-#. You have access to an NVIDIA GPU, if you intend to do model training.
+Megatron Core Installation

-.. _quick_start_guide:
+This step clones and installs Megatron Core.
+
+.. code-block:: bash
+
+    git clone https://github.com/NVIDIA/Megatron-LM.git && \
+    cd Megatron-LM && \
+    git checkout a5415fcfacef2a37416259bd38b7c4b673583675 && \
+    pip install .
+
+AMMO Installation
+
+This final step installs the AMMO package.
+
+.. code-block:: bash
+
+    pip install nvidia-ammo~=0.7.0 --extra-index-url https://pypi.nvidia.com --no-cache-dir
+
+
+.. 
code-block:: bash + + apt-get update && apt-get install -y libsndfile1 ffmpeg + pip install Cython + pip install nemo_toolkit['all'] + +**Conda Installation** + +If you do not use the NVIDIA PyTorch container, we recommend installing NeMo in a clean Conda environment. + +.. code-block:: bash + + conda create --name nemo python==3.10.12 + conda activate nemo + +Refer to the PyTorch configurator for instructions on installing PyTorch. `configurator `_ Quick Start Guide ----------------- -You can try out NeMo's ASR, LLM and TTS functionality with the example below, which is based on the `Audio Translation `_ tutorial. +To explore NeMo's capabilities in LLM, ASR, and TTS, follow the example below based on the `Audio Translation `_ tutorial. Ensure NeMo is :ref:`installed ` before proceeding. -Once you have :ref:`installed NeMo `, then you can run the code below: .. code-block:: python @@ -66,7 +149,7 @@ Once you have :ref:`installed NeMo `, then you can run the code be english_text = nmt_model.translate(mandarin_text) print(english_text) - # Instantiate a spectrogram generator (which converts text -> spectrogram) + # Instantiate a spectrogram generator (which converts text -> spectrogram) # and vocoder model (which converts spectrogram -> audio waveform) spectrogram_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name="tts_en_fastpitch") vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name="tts_en_hifigan") @@ -80,67 +163,19 @@ Once you have :ref:`installed NeMo `, then you can run the code be import soundfile as sf sf.write("output_audio.wav", audio.to('cpu').detach().numpy()[0], 22050) -You can learn more by about specific tasks you are interested in by checking out the NeMo :doc:`tutorials <./tutorials>`, or documentation (e.g. read :doc:`here <../asr/intro>` to learn more about ASR). - -You can also learn more about NeMo in the `NeMo Primer `_ tutorial, which introduces NeMo, PyTorch Lightning, and OmegaConf, and shows how to use, modify, save, and restore NeMo models. Additionally, the `NeMo Models `__ tutorial explains the fundamentals of how NeMo models are created. These concepts are also explained in detail in the :doc:`NeMo Core <../core/core>` documentation. - - -Introductory videos -------------------- - -See the two introductory videos below for a high level overview of NeMo. - -**Developing State-Of-The-Art Conversational AI Models in Three Lines of Code** +For detailed tutorials and documentation on specific tasks or to learn more about NeMo, check out the NeMo :doc:`tutorials <./tutorials>` or dive deeper into the documentation, such as learning about ASR in :doc:`here <../asr/intro>`. -.. raw:: html - -
- -
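+If the example above does not run, a quick way to diagnose the problem is to confirm that your environment satisfies the prerequisites listed earlier (Python 3.10 or above, a supported PyTorch, and a visible NVIDIA GPU). The check below is a convenience sketch, not part of the official installation steps.
+
+.. code-block:: python
+
+    import sys
+    import torch
+    import nemo
+
+    assert sys.version_info >= (3, 10), "NeMo requires Python 3.10 or above"
+    print("PyTorch version:", torch.__version__)
+    print("CUDA available:", torch.cuda.is_available())
+    print("NeMo version:", nemo.__version__)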
- -.. _installation: - -Installation ------------- - -The simplest way to install NeMo is via pip, see info below. - -.. note:: Full NeMo installation instructions (with more ways to install NeMo, and how to handle optional dependencies) can be found in the `GitHub README `_. - -Conda -~~~~~ - -We recommend installing NeMo in a fresh Conda environment. - -.. code-block:: bash - - conda create --name nemo python==3.10.12 - conda activate nemo - -Install PyTorch using their `configurator `_. - -Pip -~~~ -Use this installation mode if you want the latest released version. - -.. code-block:: bash - - apt-get update && apt-get install -y libsndfile1 ffmpeg - pip install Cython - pip install nemo_toolkit['all'] - -Depending on the shell used, you may need to use ``"nemo_toolkit[all]"`` instead in the above command. - -Discussion board +Discussion Board ---------------- -For more information and questions, visit the `NVIDIA NeMo Discussion Board `_. -Contributing ------------- +For additional information and questions, visit the `NVIDIA NeMo Discussion Board `_. + +Contribute to NeMo +------------------ -We welcome community contributions! Refer to the `CONTRIBUTING.md `_ file for the process. +Community contributions are welcome! See the `CONTRIBUTING.md `_ file for how to contribute. License ------- -NeMo is released under an `Apache 2.0 license `_. \ No newline at end of file +NeMo is released under the `Apache 2.0 license `_. diff --git a/docs/source/starthere/tutorials.rst b/docs/source/starthere/tutorials.rst index a61c078175f5..5ca48904ed9b 100644 --- a/docs/source/starthere/tutorials.rst +++ b/docs/source/starthere/tutorials.rst @@ -3,40 +3,74 @@ Tutorials ========= -The best way to get started with NeMo is to start with one of our tutorials. +The best way to get started with NeMo is to start with one of our tutorials. These tutorials cover various domains and provide both introductory and advanced topics. They are designed to help you understand and use the NeMo toolkit effectively. + +Running Tutorials on Colab +-------------------------- Most NeMo tutorials can be run on `Google's Colab `_. To run a tutorial: -#. Click the **Colab** link (see table below). -#. Connect to an instance with a GPU. For example, click **Runtime** > **Change runtime type** and select **GPU** for the hardware accelerator. +1. Click the **Colab** link associated with the tutorial you are interested in from the table below. +2. Once in Colab, connect to an instance with a GPU by clicking **Runtime** > **Change runtime type** and selecting **GPU** as the hardware accelerator. + +Tutorial Overview +----------------- -.. list-table:: **Tutorials** - :widths: 15 25 25 +.. list-table:: **General Tutorials** + :widths: 15 25 60 :header-rows: 1 * - Domain - Title - GitHub URL * - General - - Getting Started: Exploring Nemo Fundamentals + - Getting Started: NeMo Fundamentals - `NeMo Fundamentals `_ * - General - - Getting Started: Sample Conversational AI application + - Getting Started: Audio translator example - `Audio translator example `_ * - General - - Getting Started: Voice swap application + - Getting Started: Voice swap example - `Voice swap example `_ * - General - - Exploring NeMo Model Construction + - Getting Started: NeMo Models - `NeMo Models `_ * - General - - Exploring NeMo Adapters + - Getting Started: NeMo Adapters - `NeMo Adapters `_ * - General - - Publishing NeMo models on Hugging Face Hub + - Getting Started: NeMo Models on Hugging Face Hub - `NeMo Models on HF Hub `_ + +.. 
list-table:: **Multimodal Tutorials** + :widths: 20 25 55 + :header-rows: 1 + + * - Domain + - Title + - GitHub URL + * - Multimodal + - Preparations and Advanced Applications: Multimodal Data Preparation + - `Multimodal Data Preparation `_ + * - Multimodal + - Preparations and Advanced Applications: NeVA (LLaVA) Tutorial + - `NeVA (LLaVA) Tutorial `_ + * - Multimodal + - Preparations and Advanced Applications: Stable Diffusion Tutorial + - `Stable Diffusion Tutorial `_ + * - Multimodal + - Preparations and Advanced Applications: DreamBooth Tutorial + - `DreamBooth Tutorial `_ + +.. list-table:: **Automatic Speech Recognition (ASR) Tutorials** + :widths: 15 30 55 + :header-rows: 1 + + * - Domain + - Title + - GitHub URL * - ASR - ASR with NeMo - `ASR with NeMo `_ @@ -44,16 +78,16 @@ To run a tutorial: - ASR with Subword Tokenization - `ASR with Subword Tokenization `_ * - ASR - - Offline ASR Inference with Beam Search and External Language Model Rescoring + - Offline ASR - `Offline ASR `_ * - ASR - - Online ASR inference with Microphone (Cache-Aware Streaming) + - Online ASR Microphone Cache Aware Streaming - `Online ASR Microphone Cache Aware Streaming `_ * - ASR - - Online ASR inference with Microphone (Buffered Streaming) + - Online ASR Microphone Buffered Streaming - `Online ASR Microphone Buffered Streaming `_ * - ASR - - Fine-tuning CTC Models on New Languages + - ASR CTC Language Fine-Tuning - `ASR CTC Language Fine-Tuning `_ * - ASR - Intro to Transducers @@ -68,13 +102,13 @@ To run a tutorial: - Speech Commands - `Speech Commands `_ * - ASR - - Online and Offline Speech Commands Inference + - Online Offline Microphone Speech Commands - `Online Offline Microphone Speech Commands `_ * - ASR - - Voice Activity Detection (VAD) + - Voice Activity Detection - `Voice Activity Detection `_ * - ASR - - Online and Offline VAD Inference + - Online Offline Microphone VAD - `Online Offline Microphone VAD `_ * - ASR - Speaker Recognition and Verification @@ -92,19 +126,19 @@ To run a tutorial: - ASR for Telephony Speech - `ASR for Telephony Speech `_ * - ASR - - Streaming inference for ASR + - Streaming inference - `Streaming inference `_ * - ASR - - Buffered Transducer inference for ASR + - Buffered Transducer inference - `Buffered Transducer inference `_ * - ASR - - Buffered Transducer inference with LCS Merge Algorithm + - Buffered Transducer inference with LCS Merge - `Buffered Transducer inference with LCS Merge `_ * - ASR - Offline ASR with VAD for CTC models - `Offline ASR with VAD for CTC models `_ * - ASR - - Self-supervised pre-training for ASR + - Self-supervised Pre-training for ASR - `Self-supervised Pre-training for ASR `_ * - ASR - Multi-lingual ASR @@ -118,105 +152,75 @@ To run a tutorial: * - ASR - Confidence-based Ensembles - `Confidence-based Ensembles `_ - * - NLP - - Using Pretrained Language Models for Downstream Tasks - - `Pretrained Language Models for Downstream Tasks `_ - * - NLP - - Exploring NeMo NLP Tokenizers - - `NLP Tokenizers `_ - * - NLP - - Text Classification (Sentiment Analysis) with BERT - - `Text Classification (Sentiment Analysis) `_ - * - NLP - - Question Answering - - `Question Answering `_ - * - NLP - - Token Classification (Named Entity Recognition) - - `Token Classification: Named Entity Recognition `_ - * - NLP - - Joint Intent Classification and Slot Filling - - `Joint Intent and Slot Classification `_ - * - NLP - - GLUE Benchmark - - `GLUE Benchmark `_ - * - NLP - - Punctuation and Capitalization - - `Punctuation and Capitalization `_ - * - NLP - 
- Spellchecking ASR Customization - SpellMapper - - `Spellchecking ASR Customization - SpellMapper `_ - * - NLP - - Entity Linking - - `Entity Linking `_ - * - NLP - - Named Entity Recognition - BioMegatron - - `Named Entity Recognition - BioMegatron `_ - * - NLP - - Relation Extraction - BioMegatron - - `Relation Extraction - BioMegatron `_ - * - NLP - - P-Tuning/Prompt-Tuning - - `P-Tuning/Prompt-Tuning `_ - * - NLP - - Synthetic Tabular Data Generation - - `Synthetic Tabular Data Generation `_ - * - Multimodal - - Multimodal Data Preparation - - `Multimodal Data Preparation `_ - * - Multimodal - - NeVA (LLaVA) Tutorial - - `NeVA (LLaVA) Tutorial `_ - * - Multimodal - - Stable Diffusion Tutorial - - `Stable Diffusion Tutorial `_ - * - Multimodal - - DreamBooth Tutorial - - `DreamBooth Tutorial `_ + +.. list-table:: **Text-to-Speech (TTS) Tutorials** + :widths: 15 35 50 + :header-rows: 1 + + * - Domain + - Title + - GitHub URL * - TTS - - NeMo TTS Primer + - Basic and Advanced: NeMo TTS Primer - `NeMo TTS Primer `_ * - TTS - - TTS Speech/Text Aligner Inference + - Basic and Advanced: TTS Speech/Text Aligner Inference - `TTS Speech/Text Aligner Inference `_ * - TTS - - FastPitch and MixerTTS Model Training + - Basic and Advanced: FastPitch and MixerTTS Model Training - `FastPitch and MixerTTS Model Training `_ * - TTS - - FastPitch Finetuning + - Basic and Advanced: FastPitch Finetuning - `FastPitch Finetuning `_ * - TTS - - FastPitch and HiFiGAN Model Training for German + - Basic and Advanced: FastPitch and HiFiGAN Model Training for German - `FastPitch and HiFiGAN Model Training for German `_ * - TTS - - Tacotron2 Model Training + - Basic and Advanced: Tacotron2 Model Training - `Tacotron2 Model Training `_ * - TTS - - FastPitch Duration and Pitch Control + - Basic and Advanced: FastPitch Duration and Pitch Control - `FastPitch Duration and Pitch Control `_ * - TTS - - FastPitch Speaker Interpolation + - Basic and Advanced: FastPitch Speaker Interpolation - `FastPitch Speaker Interpolation `_ * - TTS - - Inference and Model Selection + - Basic and Advanced: TTS Inference and Model Selection - `TTS Inference and Model Selection `_ * - TTS - - Pronunciation_customization - - `TTS Pronunciation_customization `_ - * - Tools - - NeMo Forced Aligner + - Basic and Advanced: TTS Pronunciation Customization + - `TTS Pronunciation Customization `_ + +.. list-table:: **Tools and Utilities** + :widths: 15 25 60 + :header-rows: 1 + + * - Domain + - Title + - GitHub URL + * - Utility Tools + - Utility Tools for Speech and Text: NeMo Forced Aligner - `NeMo Forced Aligner `_ - * - Tools - - Speech Data Explorer - - `Speech Data Explorer `_ - * - Tools - - CTC Segmentation + * - Utility Tools + - Utility Tools for Speech and Text: Speech Data Explorer + - `Speech Data Explorer `_ + * - Utility Tools + - Utility Tools for Speech and Text: CTC Segmentation - `CTC Segmentation `_ - * - Text Processing (TN/ITN) - - Text Normalization and Inverse Normalization for ASR and TTS + +.. 
list-table:: **Text Processing (TN/ITN) Tutorials** + :widths: 25 35 60 + :header-rows: 1 + + * - Domain + - Title + - GitHub URL + * - Text Processing + - Text Normalization Techniques: Text Normalization - `Text Normalization `_ - * - Text Processing (TN/ITN) - - Inverse Text Normalization for ASR - Thutmose Tagger + * - Text Processing + - Text Normalization Techniques: Inverse Text Normalization with Thutmose Tagger - `Inverse Text Normalization with Thutmose Tagger `_ - * - Text Processing (TN/ITN) - - Constructing Normalization Grammars with WFSTs + * - Text Processing + - Text Normalization Techniques: WFST Tutorial - `WFST Tutorial `_ diff --git a/docs/source/tools/intro.rst b/docs/source/tools/intro.rst index 9e1b19f83b9e..5a08d05f3405 100644 --- a/docs/source/tools/intro.rst +++ b/docs/source/tools/intro.rst @@ -1,5 +1,5 @@ -Tools -===== +Speech AI Tools +=============== NeMo provides a set of tools useful for developing Automatic Speech Recognitions (ASR) and Text-to-Speech (TTS) synthesis models: \ `https://github.com/NVIDIA/NeMo/tree/stable/tools `__ . diff --git a/tutorials/asr/ASR_Context_Biasing.ipynb b/tutorials/asr/ASR_Context_Biasing.ipynb index f001ce3d65a2..bca4585e45cb 100644 --- a/tutorials/asr/ASR_Context_Biasing.ipynb +++ b/tutorials/asr/ASR_Context_Biasing.ipynb @@ -13,7 +13,7 @@ "id": "1156d1d1", "metadata": {}, "source": [ - "This tutorial aims to show how to improve the recognition accuracy of specific words in NeMo framework\n", + "This tutorial aims to show how to improve the recognition accuracy of specific words in NeMo Framework\n", "for CTC and Trasducer (RNN-T) ASR models by using the fast context-biasing method with CTC-based Word Spotter.\n", "\n", "## Tutorial content:\n", diff --git a/tutorials/asr/ASR_with_NeMo.ipynb b/tutorials/asr/ASR_with_NeMo.ipynb index f88dc7bbd8c1..bd95c7194655 100644 --- a/tutorials/asr/ASR_with_NeMo.ipynb +++ b/tutorials/asr/ASR_with_NeMo.ipynb @@ -75,7 +75,7 @@ "source": [ "# Introduction to End-To-End Automatic Speech Recognition\n", "\n", - "This notebook contains a basic tutorial of Automatic Speech Recognition (ASR) concepts, introduced with code snippets using the [NeMo framework](https://github.com/NVIDIA/NeMo).\n", + "This notebook contains a basic tutorial of Automatic Speech Recognition (ASR) concepts, introduced with code snippets using the [NeMo Framework](https://github.com/NVIDIA/NeMo).\n", "We will first introduce the basics of the main concepts behind speech recognition, then explore concrete examples of what the data looks like and walk through putting together a simple end-to-end ASR pipeline.\n", "\n", "We assume that you are familiar with general machine learning concepts and can follow Python code, and we'll be using the [AN4 dataset from CMU](http://www.speech.cs.cmu.edu/databases/an4/) (with processing using `sox`)." diff --git a/tutorials/asr/README.md b/tutorials/asr/README.md index 138e13f58a08..565e9eafd9d3 100644 --- a/tutorials/asr/README.md +++ b/tutorials/asr/README.md @@ -34,7 +34,7 @@ In this repository, you will find several tutorials discussing what is Automatic 13) `ASR_Example_CommonVoice_Finetuning`: Learn how to fine-tune an ASR model using CommonVoice to a new alphabet, Esperanto. We walk through the data processing steps of MCV data using HuggingFace Datasets, preparation of the tokenizer, model and then setup fine-tuning. 
-14) `ASR_Context_Biasing`: This tutorial aims to show how to improve the recognition accuracy of specific words in NeMo framework for CTC and Trasducer (RNN-T) ASR models by using the fast context-biasing method with CTC-based Word Spotter. +14) `ASR_Context_Biasing`: This tutorial aims to show how to improve the recognition accuracy of specific words in NeMo Framework for CTC and Trasducer (RNN-T) ASR models by using the fast context-biasing method with CTC-based Word Spotter. ---------------- diff --git a/tutorials/multimodal/NeVA Tutorial.ipynb b/tutorials/multimodal/NeVA Tutorial.ipynb index 7a9a1a3a7b4b..20b5e5a1c82c 100644 --- a/tutorials/multimodal/NeVA Tutorial.ipynb +++ b/tutorials/multimodal/NeVA Tutorial.ipynb @@ -18,7 +18,7 @@ "\n", "## Introduction\n", "\n", - "This notebook illustrates how to train and perform inference using NeVA with the NeMo Toolkit. NeVA originates from [LLaVA](https://github.com/haotian-liu/LLaVA) (Large Language and Vision Assistant) and is a powerful multimodal image-text instruction tuned model optimized within the NeMo framework. \n", + "This notebook illustrates how to train and perform inference using NeVA with the NeMo Toolkit. NeVA originates from [LLaVA](https://github.com/haotian-liu/LLaVA) (Large Language and Vision Assistant) and is a powerful multimodal image-text instruction tuned model optimized within the NeMo Framework. \n", "\n", "\n", "This tutorial will guide you through the following topics:\n", @@ -270,7 +270,7 @@ "source": [ "### Running Inference\n", "\n", - "NeVA inference via the NeMo framework can be quickly spun up via the NeMo Launcher and a few modifications to use the default NeVA inference config file.\n", + "NeVA inference via the NeMo Framework can be quickly spun up via the NeMo Launcher and a few modifications to use the default NeVA inference config file.\n", "\n", "Inference can be run with a similar command leveraging the provided inference script `neva_evaluation.py` within the container.\n", "\n", diff --git a/tutorials/multimodal/Stable Diffusion Tutorial.ipynb b/tutorials/multimodal/Stable Diffusion Tutorial.ipynb index 48da90dcb23d..8df695a994ef 100644 --- a/tutorials/multimodal/Stable Diffusion Tutorial.ipynb +++ b/tutorials/multimodal/Stable Diffusion Tutorial.ipynb @@ -86,7 +86,7 @@ "\n", "**Note**: if you want to customize the saved location, make sure it is also reflected in your training config.\n", "#### B. Prepare Text Encoder\n", - "For the text encoder used in Stable Diffusion, you can either use [HuggingFace CLIP ViT-L/14 model](https://huggingface.co/openai/clip-vit-large-patch14) or use NeMo's CLIP-ViT. NeMo Stable Diffusion also supports native CLIP ViT model trained in NeMo framework.\n", + "For the text encoder used in Stable Diffusion, you can either use [HuggingFace CLIP ViT-L/14 model](https://huggingface.co/openai/clip-vit-large-patch14) or use NeMo's CLIP-ViT. 
NeMo Stable Diffusion also supports native CLIP ViT model trained in NeMo Framework.\n", "\n", "Make sure the following settings are used in `cond_stage_config`:\n", "\n", diff --git a/tutorials/nlp/Data_Preprocessing_and_Cleaning_for_NMT.ipynb b/tutorials/nlp/Data_Preprocessing_and_Cleaning_for_NMT.ipynb index 323bfa1c49b8..df5ac458dc9c 100644 --- a/tutorials/nlp/Data_Preprocessing_and_Cleaning_for_NMT.ipynb +++ b/tutorials/nlp/Data_Preprocessing_and_Cleaning_for_NMT.ipynb @@ -20,7 +20,7 @@ "source": [ "# Data Preprocessing & Cleaning for NMT\n", "\n", - "This notebook contains a tutorial of data processing and cleaning for NMT (Neural Machine Translation) to train translation models with the [NeMo framework](https://github.com/NVIDIA/NeMo).\n", + "This notebook contains a tutorial of data processing and cleaning for NMT (Neural Machine Translation) to train translation models with the [NeMo Framework](https://github.com/NVIDIA/NeMo).\n", "\n", "A pre-requisite to train supervised neural machine translation systems is the availability of *parallel corpora* of reasonable quality.\n", "\n",