Update readme (#8440)

* update Signed-off-by: eharper <eharper@nvidia.com> * udpate Signed-off-by: eharper <eharper@nvidia.com> * update Signed-off-by: eharper <eharper@nvidia.com> * update Signed-off-by: eharper <eharper@nvidia.com> * update Signed-off-by: eharper <eharper@nvidia.com> * landing pages added * landing page added for vision * landing pages updated * some minor changes to the main readme * update Signed-off-by: eharper <eharper@nvidia.com> * update Signed-off-by: eharper <eharper@nvidia.com> * update Signed-off-by: eharper <eharper@nvidia.com> * update Signed-off-by: eharper <eharper@nvidia.com> * update Signed-off-by: eharper <eharper@nvidia.com> * update Signed-off-by: eharper <eharper@nvidia.com> * update Signed-off-by: eharper <eharper@nvidia.com> * update Signed-off-by: eharper <eharper@nvidia.com> * update Signed-off-by: eharper <eharper@nvidia.com> * update Signed-off-by: eharper <eharper@nvidia.com> * update Signed-off-by: eharper <eharper@nvidia.com> * update Signed-off-by: eharper <eharper@nvidia.com> * update Signed-off-by: eharper <eharper@nvidia.com> * update Signed-off-by: eharper <eharper@nvidia.com> * update Signed-off-by: eharper <eharper@nvidia.com> * update Signed-off-by: eharper <eharper@nvidia.com> * update Signed-off-by: eharper <eharper@nvidia.com> * typo fixed * update Signed-off-by: eharper <eharper@nvidia.com> --------- Signed-off-by: eharper <eharper@nvidia.com> Co-authored-by: ntajbakhsh <ntajbakhsh@nvidia.com>
NVIDIA · Feb 17, 2024 · df5a395
1 parent 8222634
commit df5a395

Showing 12 changed files with 196 additions and 147 deletions.
diff --git a/README.rst b/README.rst
@@ -35,7 +35,7 @@
 
 .. _main-readme:
 
-**NVIDIA NeMo**
+**NVIDIA NeMo Framework**
 ===============
 
 Latest News
@@ -57,92 +57,66 @@ such as FSDP, Mixture-of-Experts, and RLHF with TensorRT-LLM to provide speedups
 Introduction
 ------------
 
-NVIDIA NeMo is a conversational AI toolkit built for researchers working on automatic speech recognition (ASR),
-text-to-speech synthesis (TTS), large language models (LLMs), and
-natural language processing (NLP).
-The primary objective of NeMo is to help researchers from industry and academia to reuse prior work (code and pretrained models)
-and make it easier to create new `conversational AI models <https://developer.nvidia.com/conversational-ai#started>`_.
+NVIDIA NeMo Framework is a generative AI framework built for researchers and PyTorch developers
+working on large language models (LLMs), multimodal models (MM), automatic speech recognition (ASR),
+and text-to-speech synthesis (TTS).
+The primary objective of NeMo is to provide a scalable framework for researchers and developers from industry and academia 
+to more easily implement and design new generative AI models by being able to leverage existing code and pretrained models.
+
+For technical documentation, please see the `NeMo Framework User Guide <https://docs.nvidia.com/nemo-framework/user-guide/latest/playbooks/index.html>`_.
 
 All NeMo models are trained with `Lightning <https://github.com/Lightning-AI/lightning>`_ and
 training is automatically scalable to 1000s of GPUs.
-Additionally, NeMo Megatron LLM models can be trained up to 1 trillion parameters using tensor and pipeline model parallelism.
-NeMo models can be optimized for inference and deployed for production use-cases with `NVIDIA Riva <https://developer.nvidia.com/riva>`_.
+
+When applicable, NeMo models take advantage of the latest possible distributed training techniques, 
+including parallelism strategies such as 
+
+* data parallelism
+* tensor parallelism
+* pipeline model parallelism
+* fully sharded data parallelism (FSDP)
+* sequence parallelism
+* context parallelism
+* mixture-of-experts (MoE)
+
+and mixed precision training recipes with bfloat16 and FP8 training.
+
+NeMo's Transformer-based LLM and Multimodal models leverage `NVIDIA Transformer Engine <https://github.com/NVIDIA/TransformerEngine>`_ for FP8 training on NVIDIA Hopper GPUs
+and `NVIDIA Megatron Core <https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core>`_ for scaling transformer model training.
+
+NeMo LLMs can be aligned with state-of-the-art methods such as SteerLM, DPO, and Reinforcement Learning from Human Feedback (RLHF).
+See `NVIDIA NeMo Aligner <https://github.com/NVIDIA/NeMo-Aligner>`_ for more details.
+
+NeMo LLM and Multimodal models can be deployed and optimized with `NVIDIA Inference Microservices (Early Access) <https://developer.nvidia.com/nemo-microservices-early-access>`_.
+
+NeMo ASR and TTS models can be optimized for inference and deployed for production use-cases with `NVIDIA Riva <https://developer.nvidia.com/riva>`_.
+
+For scaling NeMo LLM and Multimodal training on Slurm clusters or public clouds, please see the `NVIDIA Framework Launcher <https://github.com/NVIDIA/NeMo-Megatron-Launcher>`_.
+The NeMo Framework Launcher has extensive recipes, scripts, utilities, and documentation for training NeMo LLMs and Multimodal models and also has an `Autoconfigurator <https://github.com/NVIDIA/NeMo-Megatron-Launcher#53-using-autoconfigurator-to-find-the-optimal-configuration>`_
+which can be used to find the optimal model parallel configuration for training on a specific cluster.
+To get started quickly with the NeMo Framework Launcher, please see the `NeMo Framework Playbooks <https://docs.nvidia.com/nemo-framework/user-guide/latest/playbooks/index.html>`_.
+The NeMo Framework Launcher does not currently support ASR and TTS training but will soon.
 
 Getting started with NeMo is simple.
 State of the Art pretrained NeMo models are freely available on `HuggingFace Hub <https://huggingface.co/models?library=nemo&sort=downloads&search=nvidia>`_ and
 `NVIDIA NGC <https://catalog.ngc.nvidia.com/models?query=nemo&orderBy=weightPopularDESC>`_.
-These models can be used to transcribe audio, synthesize speech, or translate text in just a few lines of code.
+These models can be used to generate text or images, transcribe audio, and synthesize speech in just a few lines of code.
 
 We have extensive `tutorials <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/starthere/tutorials.html>`_ that
-can be run on `Google Colab <https://colab.research.google.com>`_.
+can be run on `Google Colab <https://colab.research.google.com>`_ or with our `NGC NeMo Framework Container <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo>`_,
+and we have `playbooks <https://docs.nvidia.com/nemo-framework/user-guide/latest/playbooks/index.html>`_ for users that want to train NeMo models with the NeMo Framework Launcher.
 
 For advanced users that want to train NeMo models from scratch or finetune existing NeMo models
 we have a full suite of `example scripts <https://github.com/NVIDIA/NeMo/tree/main/examples>`_ that support multi-GPU/multi-node training.
 
-For scaling NeMo LLM training on Slurm clusters or public clouds, please see the `NVIDIA NeMo Megatron Launcher <https://github.com/NVIDIA/NeMo-Megatron-Launcher>`_.
-The NM launcher has extensive recipes, scripts, utilities, and documentation for training NeMo LLMs and also has an `Autoconfigurator <https://github.com/NVIDIA/NeMo-Megatron-Launcher#53-using-autoconfigurator-to-find-the-optimal-configuration>`_
-which can be used to find the optimal model parallel configuration for training on a specific cluster.
-
 Key Features
 ------------
 
-* Speech processing
-    * `HuggingFace Space for Audio Transcription (File, Microphone and YouTube) <https://huggingface.co/spaces/smajumdar/nemo_multilingual_language_id>`_
-    * `Pretrained models <https://ngc.nvidia.com/catalog/collections/nvidia:nemo_asr>`_ available in 14+ languages
-    * `Automatic Speech Recognition (ASR) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/intro.html>`_
-        * Supported ASR `models <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html>`_:
-            * Jasper, QuartzNet, CitriNet, ContextNet
-            * Conformer-CTC, Conformer-Transducer, FastConformer-CTC, FastConformer-Transducer
-            * Squeezeformer-CTC and Squeezeformer-Transducer
-            * LSTM-Transducer (RNNT) and LSTM-CTC
-        * Supports the following decoders/losses:
-            * CTC
-            * Transducer/RNNT
-            * Hybrid Transducer/CTC
-            * NeMo Original `Multi-blank Transducers <https://arxiv.org/abs/2211.03541>`_ and `Token-and-Duration Transducers (TDT) <https://arxiv.org/abs/2304.06795>`_
-        * Streaming/Buffered ASR (CTC/Transducer) - `Chunked Inference Examples <https://github.com/NVIDIA/NeMo/tree/stable/examples/asr/asr_chunked_inference>`_
-        * `Cache-aware Streaming Conformer <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#cache-aware-streaming-conformer>`_ with multiple lookaheads (including microphone streaming `tutorial <https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Online_ASR_Microphone_Demo_Cache_Aware_Streaming.ipynb>`_).
-        * Beam Search decoding
-        * `Language Modelling for ASR (CTC and RNNT) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html>`_: N-gram LM in fusion with Beam Search decoding, Neural Rescoring with Transformer
-        * `Support of long audios for Conformer with memory efficient local attention <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/results.html#inference-on-long-audio>`_
-    * `Speech Classification, Speech Command Recognition and Language Identification <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speech_classification/intro.html>`_: MatchboxNet (Command Recognition), AmberNet (LangID)
-    * `Voice activity Detection (VAD) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/speech_classification/models.html#marblenet-vad>`_: MarbleNet
-        * ASR with VAD Inference - `Example <https://github.com/NVIDIA/NeMo/tree/stable/examples/asr/asr_vad>`_
-    * `Speaker Recognition <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_recognition/intro.html>`_: TitaNet, ECAPA_TDNN, SpeakerNet
-    * `Speaker Diarization <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_diarization/intro.html>`_
-        * Clustering Diarizer: TitaNet, ECAPA_TDNN, SpeakerNet
-        * Neural Diarizer: MSDD (Multi-scale Diarization Decoder)
-    * `Speech Intent Detection and Slot Filling <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speech_intent_slot/intro.html>`_: Conformer-Transformer
-* Natural Language Processing
-    * `NeMo Megatron pre-training of Large Language Models <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/nemo_megatron/intro.html>`_
-    * `Neural Machine Translation (NMT) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/machine_translation/machine_translation.html>`_
-    * `Punctuation and Capitalization <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/punctuation_and_capitalization.html>`_
-    * `Token classification (named entity recognition) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/token_classification.html>`_
-    * `Text classification <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/text_classification.html>`_
-    * `Joint Intent and Slot Classification <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/joint_intent_slot.html>`_
-    * `Question answering <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/question_answering.html>`_
-    * `GLUE benchmark <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/glue_benchmark.html>`_
-    * `Information retrieval <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/information_retrieval.html>`_
-    * `Entity Linking <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/entity_linking.html>`_
-    * `Dialogue State Tracking <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/dialogue.html>`_
-    * `Prompt Learning <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/nemo_megatron/prompt_learning.html>`_
-    * `NGC collection of pre-trained NLP models. <https://ngc.nvidia.com/catalog/collections/nvidia:nemo_nlp>`_
-    * `Synthetic Tabular Data Generation <https://developer.nvidia.com/blog/generating-synthetic-data-with-transformers-a-solution-for-enterprise-data-challenges/>`_
-* Text-to-Speech Synthesis (TTS):
-    * `Documentation <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/tts/intro.html#>`_
-    * Mel-Spectrogram generators: FastPitch, SSL FastPitch, Mixer-TTS/Mixer-TTS-X, RAD-TTS, Tacotron2
-    * Vocoders: HiFiGAN, UnivNet, WaveGlow
-    * End-to-End Models: VITS
-    * `Pre-trained Model Checkpoints in NVIDIA GPU Cloud (NGC) <https://ngc.nvidia.com/catalog/collections/nvidia:nemo_tts>`_
-* `Tools <https://github.com/NVIDIA/NeMo/tree/stable/tools>`_
-    * `Text Processing (text normalization and inverse text normalization) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/text_normalization/intro.html>`_
-    * `NeMo Forced Aligner <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/tools/nemo_forced_aligner.html>`_
-    * `CTC-Segmentation tool <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/tools/ctc_segmentation.html>`_
-    * `Speech Data Explorer <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/tools/speech_data_explorer.html>`_: a dash-based tool for interactive exploration of ASR/TTS datasets
-    * `Speech Data Processor <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/tools/speech_data_processor.html>`_
-
-
-Built for speed, NeMo can utilize NVIDIA's Tensor Cores and scale out training to multiple GPUs and multiple nodes.
+* `Large Language Models <nemo/collections/nlp/README.md>`_
+* `Multimodal <nemo/collections/multimodal/README.md>`_
+* `Automatic Speech Recognition <nemo/collections/asr/README.md>`_
+* `Text to Speech <nemo/collections/tts/README.md>`_
+* `Computer Vision <nemo/collections/vision/README.md>`_
 
 Requirements
 ------------
@@ -151,8 +125,8 @@ Requirements
 2) Pytorch 1.13.1 or above
 3) NVIDIA GPU, if you intend to do model training
 
-Documentation
--------------
+Developer Documentation
+-----------------------
 
 .. |main| image:: https://readthedocs.com/projects/nvidia-nemo/badge/?version=main
   :alt: Documentation Status
@@ -172,18 +146,6 @@ Documentation
 | Stable  | |stable|    | `Documentation of the stable (i.e. most recent release) branch. <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/>`_ |
 +---------+-------------+------------------------------------------------------------------------------------------------------------------------------------------+
 
-Tutorials
----------
-A great way to start with NeMo is by checking `one of our tutorials <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/starthere/tutorials.html>`_.
-
-You can also get a high-level overview of NeMo by watching the talk *NVIDIA NeMo: Toolkit for Conversational AI*, presented at PyData Yerevan 2022:
-
-|pydata|
-
-.. |pydata| image:: https://img.youtube.com/vi/J-P6Sczmas8/maxres3.jpg
-    :target: https://www.youtube.com/embed/J-P6Sczmas8?mute=0&start=14&autoplay=0
-    :width: 600
-    :alt: NeMo presentation at PyData@Yerevan 2022
 
 Getting help with NeMo
 ----------------------

diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -1,5 +1,5 @@
-NVIDIA NeMo User Guide
-======================
+NVIDIA NeMo Framework Developer Docs
+====================================
 
 .. toctree::
    :maxdepth: 2
@@ -12,18 +12,28 @@ NVIDIA NeMo User Guide
    starthere/migration-guide
 
 .. toctree::
-   :maxdepth: 2
-   :caption: NeMo Core
-   :name: core
+   :maxdepth: 3
+   :caption: Multimodal (MM)
+   :name: Multimodal
 
-   core/core
-   core/exp_manager
-   core/neural_types
-   core/export
-   core/adapters/intro
-   core/api
+   multimodal/mllm/intro
+   multimodal/vlm/intro
+   multimodal/text2img/intro
+   multimodal/nerf/intro
+   multimodal/api
 
 
+.. toctree::
+   :maxdepth: 3
+   :caption: Large Language Models (LLMs)
+   :name: Large Language Models
+
+   nlp/nemo_megatron/intro
+   nlp/models
+   nlp/machine_translation/machine_translation
+   nlp/megatron_onnx_export
+   nlp/api
+
 .. toctree::
    :maxdepth: 2
    :caption: Speech Processing
@@ -36,26 +46,33 @@ NVIDIA NeMo User Guide
    asr/ssl/intro
    asr/speech_intent_slot/intro
 
-.. toctree::
-   :maxdepth: 3
-   :caption: Natural Language Processing
-   :name: Natural Language Processing
-
-   nlp/nemo_megatron/intro
-   nlp/machine_translation/machine_translation
-   nlp/text_normalization/intro
-   nlp/api
-   nlp/megatron_onnx_export
-   nlp/models
-
-
 .. toctree::
    :maxdepth: 1
    :caption: Text To Speech (TTS)
    :name: Text To Speech
 
    tts/intro
 
+.. toctree::
+   :maxdepth: 2
+   :caption: Vision
+   :name: vision
+
+   vision/intro
+
+
+.. toctree::
+   :maxdepth: 2
+   :caption: NeMo Core
+   :name: core
+
+   core/core
+   core/exp_manager
+   core/neural_types
+   core/export
+   core/adapters/intro
+   core/api
+
 .. toctree::
    :maxdepth: 2
    :caption: Common
@@ -71,27 +88,10 @@ NVIDIA NeMo User Guide
    text_processing/g2p/g2p
    common/intro
 
-.. toctree::
-   :maxdepth: 3
-   :caption: Multimodal (MM)
-   :name: Multimodal
-
-   multimodal/mllm/intro
-   multimodal/vlm/intro
-   multimodal/text2img/intro
-   multimodal/nerf/intro
-   multimodal/api
-
-.. toctree::
-   :maxdepth: 2
-   :caption: Vision
-   :name: vision
-
-   vision/intro
 
 .. toctree::
    :maxdepth: 3
-   :caption: Tools
-   :name: Tools
+   :caption: Speech Tools
+   :name: Speech Tools
 
    tools/intro
diff --git a/docs/source/multimodal/api.rst b/docs/source/multimodal/api.rst
@@ -1,4 +1,4 @@
-NeMo Megatron API
+Multimodal API
 =======================
 
 Model Classes

diff --git a/docs/source/nlp/api.rst b/docs/source/nlp/api.rst
@@ -1,5 +1,5 @@
-NeMo Megatron API
-=======================
+Large Language Model API
+========================
 
 Pretraining Model Classes
 -------------------------

diff --git a/docs/source/nlp/information_retrieval.rst b/docs/source/nlp/information_retrieval.rst
@@ -8,7 +8,7 @@ The model architecture and pre-training process are detailed in the `Sentence-BE
 Sentence-BERT utilizes a BERT-based architecture, but it is trained using a siamese and triplet network structure to derive fixed-sized sentence embeddings that capture semantic information. 
 Sentence-BERT is commonly used to generate high-quality sentence embeddings for various downstream natural language processing tasks, such as semantic textual similarity, clustering, and information retrieval
 
-Data Input for the Senntence-BERT model
+Data Input for the Sentence-BERT model
 ---------------------------------------
 
 The fine-tuning data for the Sentence-BERT (SBERT) model should consist of data instances, 

diff --git a/docs/source/nlp/nemo_megatron/intro.rst b/docs/source/nlp/nemo_megatron/intro.rst
@@ -1,20 +1,14 @@
-NeMo Megatron
-=============
+Large Language Models
+=====================
 
-Megatron :cite:`nlp-megatron-shoeybi2019megatron` is a large, powerful transformer developed by the Applied Deep Learning Research 
-team at NVIDIA. NeMo Megatron supports several types of models:
+To learn more about using NeMo to train Large Language Models at scale, please refer to the `NeMo Framework User Guide! <https://docs.nvidia.com/nemo-framework/user-guide/latest/index.html>`_.
 
 * GPT-style models (decoder only)
 * T5/BART/UL2-style models (encoder-decoder)
 * BERT-style models (encoder only)
 * RETRO model (decoder only)
 
 
-
-.. note::
-    NeMo Megatron has an Enterprise edition which contains tools for data preprocessing, hyperparameter tuning, container, scripts for various clouds and more. With Enterprise edition you also get deployment tools. Apply for `early access here <https://developer.nvidia.com/nemo-megatron-early-access>`_ .
-
-
 .. toctree::
    :maxdepth: 1