diff --git a/README.rst b/README.rst index acc30a68e610..d05a9b63c642 100644 --- a/README.rst +++ b/README.rst @@ -58,7 +58,7 @@ State of the Art pretrained NeMo models are freely available on `HuggingFace Hub These models can be used to transcribe audio, synthesize speech, or translate text in just a few lines of code. We have extensive `tutorials `_ that -can all be run on `Google Colab `_. +can be run on `Google Colab `_. For advanced users that want to train NeMo models from scratch or finetune existing NeMo models we have a full suite of `example scripts `_ that support multi-GPU/multi-node training. @@ -67,30 +67,14 @@ For scaling NeMo LLM training on Slurm clusters or public clouds, please see the The NM launcher has extensive recipes, scripts, utilities, and documentation for training NeMo LLMs and also has an `Autoconfigurator `_ which can be used to find the optimal model parallel configuration for training on a specific cluster. -Also see the two introductory videos below for a high level overview of NeMo. - -* Developing State-Of-The-Art Conversational AI Models in Three Lines of Code. -* NVIDIA NeMo: Toolkit for Conversational AI at PyData Yerevan 2022. - -|three_lines| |pydata| - -.. |pydata| image:: https://img.youtube.com/vi/J-P6Sczmas8/maxres3.jpg - :target: https://www.youtube.com/embed/J-P6Sczmas8?mute=0&start=14&autoplay=0 - :width: 600 - :alt: Develop Conversational AI Models in 3 Lines - -.. |three_lines| image:: https://img.youtube.com/vi/wBgpMf_KQVw/maxresdefault.jpg - :target: https://www.youtube.com/embed/wBgpMf_KQVw?mute=0&start=0&autoplay=0 - :width: 600 - :alt: Introduction at PyData@Yerevan 2022 - Key Features ------------ * Speech processing * `HuggingFace Space for Audio Transcription (File, Microphone and YouTube) `_ + * `Pretrained models `_ available in 14+ languages * `Automatic Speech Recognition (ASR) `_ - * Supported ASR models: ``_ + * Supported ASR `models `_: * Jasper, QuartzNet, CitriNet, ContextNet * Conformer-CTC, Conformer-Transducer, FastConformer-CTC, FastConformer-Transducer * Squeezeformer-CTC and Squeezeformer-Transducer @@ -101,7 +85,7 @@ Key Features * Hybrid Transducer/CTC * NeMo Original `Multi-blank Transducers `_ and `Token-and-Duration Transducers (TDT) `_ * Streaming/Buffered ASR (CTC/Transducer) - `Chunked Inference Examples `_ - * Cache-aware Streaming Conformer with multiple lookaheads - ``_ + * `Cache-aware Streaming Conformer `_ with multiple lookaheads. * Beam Search decoding * `Language Modelling for ASR (CTC and RNNT) `_: N-gram LM in fusion with Beam Search decoding, Neural Rescoring with Transformer * `Support of long audios for Conformer with memory efficient local attention `_ @@ -113,8 +97,6 @@ Key Features * Clustering Diarizer: TitaNet, ECAPA_TDNN, SpeakerNet * Neural Diarizer: MSDD (Multi-scale Diarization Decoder) * `Speech Intent Detection and Slot Filling `_: Conformer-Transformer - * `Pretrained models on different languages. `_: English, Spanish, German, Russian, Chinese, French, Italian, Polish, ... - * `NGC collection of pre-trained speech processing models. `_ * Natural Language Processing * `NeMo Megatron pre-training of Large Language Models `_ * `Neural Machine Translation (NMT) `_ @@ -151,7 +133,7 @@ Requirements 1) Python 3.10 or above 2) Pytorch 1.13.1 or above -3) NVIDIA GPU for training +3) NVIDIA GPU, if you intend to do model training Documentation ------------- @@ -178,6 +160,15 @@ Tutorials --------- A great way to start with NeMo is by checking `one of our tutorials `_. 
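+If you want a quick taste before diving into the tutorials, the snippet below is a minimal sketch of the "few lines of code" workflow described above. It assumes ``nemo_toolkit`` is already installed, and it reuses the FastConformer checkpoint named elsewhere in these docs; substitute any local audio path:
+
+.. code-block:: python
+
+    import nemo.collections.asr as nemo_asr
+
+    # fetch a pretrained ASR checkpoint from NGC and transcribe a local file
+    asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large")
+    print(asr_model.transcribe(["path/to/audio_file.wav"]))
+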
+You can also get a high-level overview of NeMo by watching the talk *NVIDIA NeMo: Toolkit for Conversational AI*, presented at PyData Yerevan 2022: + +|pydata| + +.. |pydata| image:: https://img.youtube.com/vi/J-P6Sczmas8/maxres3.jpg + :target: https://www.youtube.com/embed/J-P6Sczmas8?mute=0&start=14&autoplay=0 + :width: 600 + :alt: NeMo presentation at PyData@Yerevan 2022 + Getting help with NeMo ---------------------- FAQ can be found on NeMo's `Discussions board `_. You are welcome to ask questions or start discussions there. @@ -185,7 +176,6 @@ FAQ can be found on NeMo's `Discussions board `_ CONTRIBUTING.md for the process. +We welcome community contributions! Please refer to `CONTRIBUTING.md `_ for the process. Publications ------------ @@ -367,4 +384,4 @@ Please refer to the instructions in the `README of that branch `_. +NeMo is released under an `Apache 2.0 license `_. diff --git a/docs/source/asr/asr_language_modeling.rst b/docs/source/asr/asr_language_modeling.rst index a0d46ca795b1..bb823cb252c0 100644 --- a/docs/source/asr/asr_language_modeling.rst +++ b/docs/source/asr/asr_language_modeling.rst @@ -39,6 +39,7 @@ penalty term to consider the sequence length in the scores. Larger alpha means m importance on the acoustic model. Negative values for beta will give penalty to longer sequences and make the decoder to prefer shorter predictions, while positive values would result in longer candidates. +.. _train-ngram-lm: Train N-gram LM =============== diff --git a/docs/source/asr/datasets.rst b/docs/source/asr/datasets.rst index 05278ecb2437..61c34014c809 100644 --- a/docs/source/asr/datasets.rst +++ b/docs/source/asr/datasets.rst @@ -162,6 +162,8 @@ these files using ``--dest_folder``. In order to generate files in the supported After the script finishes, the ``train.json``, ``dev.json``, ``test.json``, and ``vocab.txt`` files can be found in the ``dest_folder`` directory. +.. _section-with-manifest-format-explanation: + Preparing Custom ASR Data ------------------------- diff --git a/docs/source/asr/intro.rst b/docs/source/asr/intro.rst index 46a192c546a2..7066c2989393 100644 --- a/docs/source/asr/intro.rst +++ b/docs/source/asr/intro.rst @@ -1,34 +1,156 @@ Automatic Speech Recognition (ASR) ================================== -ASR, or Automatic Speech Recognition, refers to the problem of getting a program to automatically transcribe spoken language -(speech-to-text). Our goal is usually to have a model that minimizes the Word Error Rate (WER) metric when transcribing speech input. -In other words, given some audio file (e.g. a WAV file) containing speech, how do we transform this into the corresponding text with -as few errors as possible? +Automatic Speech Recognition (ASR), also known as Speech To Text (STT), refers to the problem of automatically transcribing spoken language. +You can use NeMo to transcribe speech using open-sourced pretrained models in :ref:`14+ languages <asr-checkpoint-list-by-language>`, or :doc:`train your own<./examples/kinyarwanda_asr>` ASR models. -Traditional speech recognition takes a generative approach, modeling the full pipeline of how speech sounds are produced in order to -evaluate a speech sample. We would start from a language model that encapsulates the most likely orderings of words that are generated -(e.g. an n-gram model), to a pronunciation model for each word in that ordering (e.g. a pronunciation table), to an acoustic model that -translates those pronunciations to audio waveforms (e.g. a Gaussian Mixture Model).
-Then, if we receive some spoken input, our goal would be to find the most likely sequence of text that would result in the given audio -according to our generative pipeline of models. Overall, with traditional speech recognition, we try to model ``Pr(audio|transcript)*Pr(transcript)``, -and take the argmax of this over possible transcripts. -Over time, neural nets advanced to the point where each component of the traditional speech recognition model could be replaced by a -neural model that had better performance and that had a greater potential for generalization. For example, we could replace an n-gram -model with a neural language model, and replace a pronunciation table with a neural pronunciation model, and so on. However, each of -these neural models need to be trained individually on different tasks, and errors in any model in the pipeline could throw off the -whole prediction. +Transcribe speech with 3 lines of code +---------------------------------------- +After :ref:`installing NeMo <installation>`, you can transcribe an audio file as follows: -Thus, we can see the appeal of end-to-end ASR architectures: discriminative models that simply take an audio input and give a textual -output, and in which all components of the architecture are trained together towards the same goal. The model's encoder would be -akin to an acoustic model for extracting speech features, which can then be directly piped to a decoder which outputs text. If desired, -we could integrate a language model that would improve our predictions, as well. +.. code-block:: python -And the entire end-to-end ASR model can be trained at once--a much easier pipeline to handle! + import nemo.collections.asr as nemo_asr + asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large") + transcript = asr_model.transcribe(["path/to/audio_file.wav"]) -A demo below allows evaluation of NeMo ASR models in multiple langauges from the browser: +Obtain word timestamps +^^^^^^^^^^^^^^^^^^^^^^^^^ + +You can also obtain timestamps for each word in the transcription as follows: + +.. 
code-block:: python + + # import nemo_asr and instantiate asr_model as above + import nemo.collections.asr as nemo_asr + asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large") + + # update decoding config to preserve alignments and compute timestamps + from omegaconf import OmegaConf, open_dict + decoding_cfg = asr_model.cfg.decoding + with open_dict(decoding_cfg): + decoding_cfg.preserve_alignments = True + decoding_cfg.compute_timestamps = True + asr_model.change_decoding_strategy(decoding_cfg) + + # specify the flag ``return_hypotheses=True`` + hypotheses = asr_model.transcribe(["path/to/audio_file.wav"], return_hypotheses=True) + + # if hypotheses form a tuple (from RNNT), extract just "best" hypotheses + if isinstance(hypotheses, tuple) and len(hypotheses) == 2: + hypotheses = hypotheses[0] + + timestamp_dict = hypotheses[0].timestep # extract timesteps from hypothesis of first (and only) audio file + print("Hypothesis contains the following timestep information:", list(timestamp_dict.keys())) + + # For a FastConformer model, you can display the word timestamps as follows: + # 80 ms is the duration of a timestep at the output of the Conformer + time_stride = 8 * asr_model.cfg.preprocessor.window_stride + + word_timestamps = timestamp_dict['word'] + + for stamp in word_timestamps: + start = stamp['start_offset'] * time_stride + end = stamp['end_offset'] * time_stride + word = stamp['char'] if 'char' in stamp else stamp['word'] + + print(f"Time : {start:0.2f} - {end:0.2f} - {word}") +Transcribe speech via command line ---------------------------------- You can also transcribe speech via the command line using the following `script `_, for example: .. code-block:: bash python <path_to_NeMo_repo>/examples/asr/transcribe_speech.py \ pretrained_name="stt_en_fastconformer_transducer_large" \ audio_dir=<path_to_audio_dir> # path to dir containing audio files to transcribe +The script will save all transcriptions in a JSONL file where each line corresponds to an audio file in ``<path_to_audio_dir>``. +This file will correspond to a format that NeMo commonly uses for saving model predictions, and also for storing +input data for training and evaluation. You can learn more about the format that NeMo uses for these files +(which we refer to as "manifest files") :ref:`here <section-with-manifest-format-explanation>`. + +You can also specify the files to be transcribed inside a manifest file, and pass that in using the argument +``dataset_manifest=<path_to_manifest>`` instead of ``audio_dir``. + + +Incorporate a language model (LM) to improve ASR transcriptions +--------------------------------------------------------------- + +You can often get a boost to transcription accuracy by using a Language Model to help choose words that are more likely +to be spoken in a sentence. + +You can get a good improvement in transcription accuracy even when using a simple N-gram LM. + +After :ref:`training <train-ngram-lm>` an N-gram LM, you can use it for transcribing audio as follows: + +1. Install the OpenSeq2Seq beam search decoding and KenLM libraries using `this script `_. +2. Perform transcription using `this script `_: + +.. code-block:: bash + + python eval_beamsearch_ngram.py nemo_model_file=<path_to_nemo_model_file> \ + input_manifest=<path_to_evaluation_manifest> \ + beam_width=[<list_of_beam_widths>] \ + beam_alpha=[<list_of_beam_alphas>] \ + beam_beta=[<list_of_beam_betas>] \ + preds_output_folder=<output_folder_for_predictions> \ + probs_cache_file=null \ + decoding_mode=beamsearch_ngram \ + decoding_strategy="<decoding_strategy>" + +See more information about LM decoding :doc:`here <./asr_language_modeling>`. + +Use real-time transcription +--------------------------- + +It is possible to use NeMo to transcribe speech in real-time. 
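+The following is a minimal sketch of a chunked, pseudo-streaming loop rather than NeMo's dedicated streaming API: it assumes the third-party ``sounddevice`` and ``soundfile`` packages are installed, records fixed-length chunks from the default microphone, and transcribes each chunk independently (so words that straddle a chunk boundary may be garbled):
+
+.. code-block:: python
+
+    import sounddevice as sd
+    import soundfile as sf
+    import nemo.collections.asr as nemo_asr
+
+    asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large")
+
+    SAMPLE_RATE = 16000   # NeMo ASR models expect 16 kHz mono audio
+    CHUNK_SECONDS = 4     # hypothetical chunk length; tune for latency vs. accuracy
+
+    for _ in range(5):  # transcribe five consecutive chunks, then stop
+        # record one chunk from the default microphone and wait for it to finish
+        chunk = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
+        sd.wait()
+        # write the chunk to disk and transcribe it like any other audio file
+        sf.write("chunk.wav", chunk, SAMPLE_RATE)
+        print(asr_model.transcribe(["chunk.wav"]))
+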
You can find an example of how to do +this in the following `notebook tutorial `_. + + +Try different ASR models +------------------------ + +NeMo offers a variety of open-sourced pretrained ASR models that vary by model architecture: + +* **encoder architecture** (FastConformer, Conformer, Citrinet, etc.), +* **decoder architecture** (Transducer, CTC & hybrid of the two), +* **size** of the model (small, medium, large, etc.). + +The pretrained models also vary by: + +* **language** (English, Spanish, etc., including some **multilingual** and **code-switching** models), +* whether the output text contains **punctuation & capitalization** or not. + +The NeMo ASR checkpoints can be found on `HuggingFace `_ or on `NGC `_. All models released by the NeMo team can be found on NGC, and some of those are also available on HuggingFace. + +All NeMo ASR checkpoints open-sourced by the NeMo team follow this naming convention: +``stt_{language}_{encoder name}_{decoder name}_{model size}{_optional descriptor}``. + +You can load the checkpoints automatically using the ``ASRModel.from_pretrained()`` class method, for example: + +.. code-block:: python + + import nemo.collections.asr as nemo_asr + # model will be fetched from NGC + asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large") + # if model name is prepended with "nvidia/", the model will be fetched from huggingface + asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/stt_en_fastconformer_transducer_large") + # you can also load open-sourced NeMo models released by other HF users using: + # asr_model = nemo_asr.models.ASRModel.from_pretrained("<HF_username>/<HF_model_name>") + +See further documentation about :doc:`loading checkpoints <./results>`, a full :ref:`list <asr-checkpoint-list-by-language>` of models, and their :doc:`benchmark scores <./score>`. + +There is also more information about the ASR model architectures available in NeMo :doc:`here <./models>`. + + +Try out NeMo ASR transcription in your browser +---------------------------------------------- +You can try out transcription with NeMo ASR models without leaving your browser by using the HuggingFace Space embedded below. .. raw:: html -The full documentation tree is as follows: +ASR tutorial notebooks +---------------------- +Hands-on speech recognition tutorial notebooks can be found under `the ASR tutorials folder `_. +If you are a beginner to NeMo, consider trying out the `ASR with NeMo `_ tutorial. +This and most other tutorials can be run on Google Colab by specifying the link to the notebooks' GitHub pages on Colab. + +ASR model configuration +----------------------- +Documentation regarding the configuration files specific to the ``nemo_asr`` models can be found in the :doc:`Configuration Files <./configs>` section. + +Preparing ASR datasets +---------------------- +NeMo includes preprocessing scripts for several common ASR datasets. The :doc:`Datasets <./datasets>` section contains instructions on +running those scripts. It also includes guidance for creating your own NeMo-compatible dataset, if you have your own data. + +Further information +------------------- +For more information, see additional sections in the ASR docs on the left-hand-side menu or in the list below: .. toctree:: - :maxdepth: 8 + :maxdepth: 1 models datasets @@ -54,5 +193,3 @@ The full documentation tree is as follows: api resources examples/kinyarwanda_asr.rst - -.. 
include:: resources.rst diff --git a/docs/source/asr/resources.rst b/docs/source/asr/resources.rst deleted file mode 100644 index e192f5fbe83d..000000000000 --- a/docs/source/asr/resources.rst +++ /dev/null @@ -1,17 +0,0 @@ -Resources and Documentation ---------------------------- - -Hands-on speech recognition tutorial notebooks can be found under `the ASR tutorials folder `_. -If you are a beginner to NeMo, consider trying out the `ASR with NeMo `_ tutorial. -This and most other tutorials can be run on Google Colab by specifying the link to the notebooks' GitHub pages on Colab. - -If you are looking for information about a particular ASR model, or would like to find out more about the model -architectures available in the `nemo_asr` collection, refer to the :doc:`Models <./models>` section. - -NeMo includes preprocessing scripts for several common ASR datasets. The :doc:`Datasets <./datasets>` section contains instructions on -running those scripts. It also includes guidance for creating your own NeMo-compatible dataset, if you have your own data. - -Information about how to load model checkpoints (either local files or pretrained ones from NGC), as well as a list of the checkpoints -available on NGC are located on the :doc:`Checkpoints <./results>` section. - -Documentation regarding the configuration files specific to the ``nemo_asr`` models can be found on the :doc:`Configuration Files <./configs>` section. diff --git a/docs/source/asr/results.rst b/docs/source/asr/results.rst index 466393e9a55a..3dca110d89e4 100644 --- a/docs/source/asr/results.rst +++ b/docs/source/asr/results.rst @@ -157,6 +157,9 @@ Language Models for ASR | + +.. _asr-checkpoint-list-by-language: + Speech Recognition (Languages) ------------------------------ diff --git a/docs/source/conf.py b/docs/source/conf.py index c54defb59ce8..952e25332ca4 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -149,7 +149,7 @@ # General information about the project. project = "NVIDIA NeMo" -copyright = "© 2021-2022 NVIDIA Corporation & Affiliates. All rights reserved." +copyright = "© 2021-2023 NVIDIA Corporation & Affiliates. All rights reserved." author = "NVIDIA CORPORATION" # The version info for the project you're documenting, acts as replacement for diff --git a/docs/source/starthere/intro.rst b/docs/source/starthere/intro.rst index 9297b7ef53b3..e6a59b0832ab 100644 --- a/docs/source/starthere/intro.rst +++ b/docs/source/starthere/intro.rst @@ -10,7 +10,7 @@ Introduction `NVIDIA NeMo `_, part of the NVIDIA AI platform, is a toolkit for building new state-of-the-art conversational AI models. NeMo has separate collections for Automatic Speech Recognition (ASR), -Natural Language Processing (NLP), and Text-to-Speech (TTS) synthesis models. Each collection consists of +Natural Language Processing (NLP), and Text-to-Speech (TTS) models. Each collection consists of prebuilt modules that include everything needed to train on your data. Every module can easily be customized, extended, and composed to create new conversational AI model architectures. @@ -19,181 +19,126 @@ Conversational AI architectures are typically large and require a lot of data an for training. NeMo uses `PyTorch Lightning `_ for easy and performant multi-GPU/multi-node mixed-precision training. -`Pre-trained NeMo models. `_ - -Also see the two introductory videos below for a high level overview of NeMo. - -* Developing State-Of-The-Art Conversational AI Models in Three Lines of Code. -.. raw:: html - -
-    <iframe width="560" height="315" src="https://www.youtube.com/embed/wBgpMf_KQVw?mute=0&start=0&autoplay=0" frameborder="0" allowfullscreen></iframe> -
- -* NVIDIA NeMo: Toolkit for Conversational AI at PyData Yerevan 2022. -.. image:: https://img.youtube.com/vi/J-P6Sczmas8/maxres3.jpg - :target: https://www.youtube.com/embed/J-P6Sczmas8?mute=0&start=14&autoplay=0 - :width: 560 - :alt: Develop Conversational AI Models in 3 Lines - -For more information and questions, visit the `NVIDIA NeMo Discussion Board `_. +`Pre-trained NeMo models `_ are available +in 14+ languages. Prerequisites ------------- Before you begin using NeMo, it's assumed you meet the following prerequisites. -#. You have Python version 3.9, 3.10. +#. You have Python version 3.10 or above. #. You have Pytorch version 1.13.1 or 2.0+. -#. You have access to an NVIDIA GPU for training. +#. You have access to an NVIDIA GPU, if you intend to do model training. .. _quick_start_guide: Quick Start Guide ----------------- -This NeMo Quick Start Guide is a starting point for users who want to try out NeMo; specifically, this guide enables users to quickly get started with the NeMo fundamentals by walking you through an example audio translator and voice swap. +You can try out NeMo's ASR, NLP and TTS functionality with the example below, which is based on the `Audio Translation `_ tutorial. -If you're new to NeMo, the best way to get started is to take a look at the following tutorials: - -* `Text Classification (Sentiment Analysis) `__ - demonstrates the Text Classification model using the NeMo NLP collection. -* `NeMo Primer `__ - introduces NeMo, PyTorch Lightning, and OmegaConf, and shows how to use, modify, save, and restore NeMo models. -* `NeMo Models `__ - explains the fundamental concepts of the NeMo model. -* `NeMo voice swap demo `__ - demonstrates how to swap a voice in the audio fragment with a computer generated one using NeMo. - -Below we is the code snippet of Audio Translator application. +Once you have :ref:`installed NeMo <installation>`, you can run the code below: .. code-block:: python - # Import NeMo and it's ASR, NLP and TTS collections - import nemo - # Import Speech Recognition collection + # Import NeMo's ASR, NLP and TTS collections import nemo.collections.asr as nemo_asr - # Import Natural Language Processing colleciton import nemo.collections.nlp as nemo_nlp - # Import Speech Synthesis collection import nemo.collections.tts as nemo_tts - # Next, we instantiate all the necessary models directly from NVIDIA NGC - # Speech Recognition model - QuartzNet trained on Russian part of MCV 6.0 - quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="stt_ru_quartznet15x5").cuda() - # Neural Machine Translation model - nmt_model = nemo_nlp.models.MTEncDecModel.from_pretrained(model_name='nmt_ru_en_transformer6x6').cuda() - # Spectrogram generator which takes text as an input and produces spectrogram - spectrogram_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name="tts_en_fastpitch").cuda() - # Vocoder model which takes spectrogram and produces actual audio - vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name="tts_en_hifigan").cuda() - # Transcribe an audio file - # IMPORTANT: The audio must be mono with 16Khz sampling rate - # Get example from: https://nemo-public.s3.us-east-2.amazonaws.com/mcv-samples-ru/common_voice_ru_19034087.wav - russian_text = quartznet.transcribe(['Path_to_audio_file']) - print(russian_text) - # You should see russian text here. 
Let's translate it to English - english_text = nmt_model.translate(russian_text) - print(english_text) - # After this you should see English translation - # Let's convert it into audio - # A helper function which combines FastPitch and HiFiGAN to go directly from - # text to audio - def text_to_audio(text): - parsed = spectrogram_generator.parse(text) - spectrogram = spectrogram_generator.generate_spectrogram(tokens=parsed) - audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram) - return audio.to('cpu').numpy() - audio = text_to_audio(english_text[0]) + # Download an audio file that we will transcribe, translate, and convert the written translation to speech + import wget + wget.download("https://nemo-public.s3.us-east-2.amazonaws.com/zh-samples/common_voice_zh-CN_21347786.mp3") + # Instantiate a Mandarin speech recognition model and transcribe an audio file. + asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="stt_zh_citrinet_1024_gamma_0_25") + mandarin_text = asr_model.transcribe(['common_voice_zh-CN_21347786.mp3']) + print(mandarin_text) -Installation ------------ # Instantiate Neural Machine Translation model and translate the text -Pip -~~~ -Use this installation mode if you want the latest released version. + nmt_model = nemo_nlp.models.MTEncDecModel.from_pretrained(model_name="nmt_zh_en_transformer24x6") + english_text = nmt_model.translate(mandarin_text) + print(english_text) -.. code-block:: bash + # Instantiate a spectrogram generator (which converts text -> spectrogram) + # and vocoder model (which converts spectrogram -> audio waveform) + spectrogram_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name="tts_en_fastpitch") + vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name="tts_en_hifigan") - apt-get update && apt-get install -y libsndfile1 ffmpeg - pip install Cython - pip install nemo_toolkit[all] + # Parse the text input, generate the spectrogram, and convert it to audio + parsed_text = spectrogram_generator.parse(english_text[0]) + spectrogram = spectrogram_generator.generate_spectrogram(tokens=parsed_text) + audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram) -Pip from source -~~~~~~~~~~~~~~~ -Use this installation mode if you want the version from a particular GitHub branch (for example, ``main``). + # Save the audio to a file + import soundfile as sf + sf.write("output_audio.wav", audio.to('cpu').detach().numpy()[0], 22050) -.. code-block:: bash - apt-get update && apt-get install -y libsndfile1 ffmpeg - pip install Cython - python -m pip install git+https://github.com/NVIDIA/NeMo.git@{BRANCH}#egg=nemo_toolkit[all] - # For v1.0.2, replace {BRANCH} with v1.0.2 like so: - # python -m pip install git+https://github.com/NVIDIA/NeMo.git@v1.0.2#egg=nemo_toolkit[all] +You can learn more about specific tasks you are interested in by checking out the NeMo :doc:`tutorials <./tutorials>` or documentation (e.g. read :doc:`here <../asr/intro>` to learn more about ASR). +You can also learn more about NeMo in the `NeMo Primer `_ tutorial, which introduces NeMo, PyTorch Lightning, and OmegaConf, and shows how to use, modify, save, and restore NeMo models. Additionally, the `NeMo Models `__ tutorial explains the fundamentals of how NeMo models are created. These concepts are also explained in detail in the :doc:`NeMo Core <../core/core>` documentation. -From source -~~~~~~~~~~~ -Use this installation mode if you are contributing to NeMo. +Introductory videos +------------------- -.. 
code-block:: bash - - apt-get update && apt-get install -y libsndfile1 ffmpeg - git clone https://github.com/NVIDIA/NeMo - cd NeMo - ./reinstall.sh - -Docker containers -~~~~~~~~~~~~~~~~~ -To build a nemo container with Dockerfile from a branch, please run - -.. code-block:: bash +See the two introductory videos below for a high level overview of NeMo. - DOCKER_BUILDKIT=1 docker build -f Dockerfile -t nemo:latest. +**Developing State-Of-The-Art Conversational AI Models in Three Lines of Code** +.. raw:: html -If you chose to work with the ``main`` branch, we recommend using `NVIDIA's PyTorch container version 21.05-py3 `_, then install from GitHub. +
+    <iframe width="560" height="315" src="https://www.youtube.com/embed/wBgpMf_KQVw?mute=0&start=0&autoplay=0" frameborder="0" allowfullscreen></iframe> +
-.. code-block:: bash +**NVIDIA NeMo: Toolkit for Conversational AI at PyData Yerevan 2022** - docker run --gpus all -it --rm -v :/NeMo --shm-size=8g \ - -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit \ - stack=67108864 --device=/dev/snd nvcr.io/nvidia/pytorch:21.05-py3 +.. raw:: html -.. _mac-installation: +
+    <iframe width="560" height="315" src="https://www.youtube.com/embed/J-P6Sczmas8?mute=0&start=14&autoplay=0" frameborder="0" allowfullscreen></iframe> +
-Mac computers with Apple silicon -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -To install NeMo on Mac with Apple M-Series GPU: .. _installation: -- install `Homebrew `_ package manager +Installation +------------ -- create a new Conda environment +The simplest way to install NeMo is via pip; see the info below. -- install PyTorch 2.0 or higher +.. note:: Full NeMo installation instructions (with more ways to install NeMo, and how to handle optional dependencies) can be found in the `GitHub README `_. -- run the following code: +Conda +~~~~~ -.. code-block:: shell +We recommend installing NeMo in a fresh Conda environment. - # install mecab using Homebrew, required for sacrebleu for NLP collection - brew install mecab +.. code-block:: bash - # install pynini using Conda, required for text normalization - conda install -c conda-forge pynini + conda create --name nemo python==3.10.12 + conda activate nemo - # install Cython manually - pip install cython +Install PyTorch using their `configurator `_. - # clone the repo and install in development mode - git clone https://github.com/NVIDIA/NeMo - cd NeMo - ./reinstall.sh +Pip +~~~ +Use this installation mode if you want the latest released version. .. code-block:: bash apt-get update && apt-get install -y libsndfile1 ffmpeg pip install Cython pip install nemo_toolkit['all'] -`FAQ `_ --------------------------------------------------- -Have a look at our `discussions board `_ and feel free to post a question or start a discussion. +Depending on the shell used, you may need to use ``"nemo_toolkit[all]"`` instead in the above command. Discussion board ---------------- For more information and questions, visit the `NVIDIA NeMo Discussion Board `_. Contributing ------------ @@ -203,4 +148,4 @@ We welcome community contributions! Refer to the `CONTRIBUTING.md `_. \ No newline at end of file +NeMo is released under an `Apache 2.0 license `_. \ No newline at end of file