Update link to yaml file in ASR_with_Transducers.ipynb (NVIDIA#8014)
changed link from `../../examples/asr/conf/contextnet_rnnt/contextnet_rnnt.yaml` to `/content/configs/contextnet_rnnt.yaml` because the former did not exist and threw an error when trying to load the yaml config file.

Signed-off-by: Faith Wenyi Nchifor <52848633+Faith-Nchifor@users.noreply.github.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Faith-Nchifor authored and pzelasko committed Jan 3, 2024
1 parent 2443111 commit 1211fc8
Showing 1 changed file with 29 additions and 25 deletions.
54 changes: 29 additions & 25 deletions tutorials/asr/ASR_with_Transducers.ipynb
@@ -17,7 +17,9 @@
"3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n",
"4. Run this cell to set up dependencies.\n",
"5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect\n",
"\n\nNOTE: User is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.\n",
"\n",
"\n",
"NOTE: User is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.\n",
"\"\"\"\n",
"# If you're using Google Colab and not running locally, run this cell.\n",
"import os\n",
@@ -29,7 +31,7 @@
"!pip install matplotlib>=3.3.2\n",
"\n",
"## Install NeMo\n",
"BRANCH = 'main'\n",
"BRANCH = 'r1.21.0'\n",
"!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n",
"\n",
"## Grab the config we'll use in this example\n",
@@ -83,7 +85,7 @@
"source": [
"# Preparing the dataset\n",
"\n",
"In this tutorial, we will be utilizing the `AN4`dataset - also known as the Alphanumeric dataset, which was collected and published by Carnegie Mellon University. It consists of recordings of people spelling out addresses, names, telephone numbers, etc., one letter or number at a time and their corresponding transcripts. We choose to use AN4 for this tutorial because it is relatively small, with 948 training and 130 test utterances, and so it trains quickly. \n",
"In this tutorial, we will be utilizing the `AN4`dataset - also known as the Alphanumeric dataset, which was collected and published by Carnegie Mellon University. It consists of recordings of people spelling out addresses, names, telephone numbers, etc., one letter or number at a time and their corresponding transcripts. We choose to use AN4 for this tutorial because it is relatively small, with 948 training and 130 test utterances, and so it trains quickly.\n",
"\n",
"Let's first download the preparation script from NeMo's scripts directory -"
]
@@ -125,8 +127,8 @@
"outputs": [],
"source": [
"import wget\n",
"import tarfile \n",
"import subprocess \n",
"import tarfile\n",
"import subprocess\n",
"import glob\n",
"\n",
"data_dir = \"datasets\"\n",
@@ -163,7 +165,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"id": "q2OVIxAXfTu_"
},
"outputs": [],
"source": [
"# --- Building Manifest Files --- #\n",
@@ -211,7 +215,7 @@
"if not os.path.isfile(test_manifest):\n",
" build_manifest(test_transcripts, test_manifest, 'an4/wav/an4test_clstk')\n",
" print(\"Test manifest created.\")\n",
"print(\"***Done***\") \n",
"print(\"***Done***\")\n",
"# Manifest filepaths\n",
"TRAIN_MANIFEST = train_manifest\n",
"TEST_MANIFEST = test_manifest"
@@ -334,7 +338,7 @@
"source": [
"from omegaconf import OmegaConf, open_dict\n",
"\n",
"config = OmegaConf.load(\"../../examples/asr/conf/contextnet_rnnt/contextnet_rnnt.yaml\")"
"config = OmegaConf.load(\"/content/configs/contextnet_rnnt.yaml\")"
]
},
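Once loaded, the config is a plain OmegaConf tree and can be inspected or overridden before the model is built. A small illustrative snippet (field names follow the usual NeMo config layout and are assumptions here, not part of the diff):

```python
# Inspect the defaults the rest of the tutorial refers to (enc_hidden, pred_hidden, ...).
print(OmegaConf.to_yaml(config.model.model_defaults))

# Point the data sections at the manifests created earlier.
config.model.train_ds.manifest_filepath = TRAIN_MANIFEST
config.model.validation_ds.manifest_filepath = TEST_MANIFEST
```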
{
@@ -347,7 +351,7 @@
"\n",
"Here, we will slice off the first five blocks from the Jasper block (used to build ContextNet). Setting the config with this subset will create a stride 2x model with just five blocks.\n",
"\n",
"We will also explicitly state that the last block dimension must be obtained from `model.model_defaults.enc_hidden` inside the config. "
"We will also explicitly state that the last block dimension must be obtained from `model.model_defaults.enc_hidden` inside the config."
]
},
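A minimal sketch of that slicing, assuming the ContextNet config keeps its encoder blocks under `model.encoder.jasper` and interpolates the final width from `model.model_defaults.enc_hidden` (illustrative, not the verbatim tutorial cell):

```python
from omegaconf import open_dict

with open_dict(config):
    # Keep only the first five encoder blocks -> a small, stride-2x model.
    config.model.encoder.jasper = config.model.encoder.jasper[:5]
    # The last block must project to the hidden size the decoder and joint expect.
    config.model.encoder.jasper[-1].filters = '${model.model_defaults.enc_hidden}'
```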
{
@@ -557,7 +561,7 @@
"### Diving deeper into the memory costs of Transducer Joint\n",
"-------\n",
"\n",
"One of the significant limitations of Transducers is the exorbitant memory cost of computing the Joint module. The Joint module is comprised of two steps. \n",
"One of the significant limitations of Transducers is the exorbitant memory cost of computing the Joint module. The Joint module is comprised of two steps.\n",
"\n",
"1) Projecting the Acoustic and Transcription feature dimensions to some standard hidden dimension (specified by `model.model_defaults.joint_hidden`)\n",
"\n",
@@ -567,7 +571,7 @@
"\n",
"**BS**=32 ; **T** (after **2x** stride) = 800, **U** (with character encoding) = 400-450 tokens, Vocabulary size **V** = 28 (26 alphabet chars, space and apostrophe). Let the hidden dimension of the Joint model be 640 (Most Google Transducer papers use hidden dimension of 640).\n",
"\n",
"$ Memory \\, (Hidden, \\, gb) = 32 \\times 800 \\times 450 \\times 640 \\times 4 = 29.49 $ gigabytes (4 bytes per float). \n",
"$ Memory \\, (Hidden, \\, gb) = 32 \\times 800 \\times 450 \\times 640 \\times 4 = 29.49 $ gigabytes (4 bytes per float).\n",
"\n",
"$ Memory \\, (Joint, \\, gb) = 32 \\times 800 \\times 450 \\times 28 \\times 4 = 1.290 $ gigabytes (4 bytes per float)\n",
"\n",
@@ -623,27 +627,27 @@
"\n",
"------\n",
"\n",
"The fused operation goes as follows : \n",
"The fused operation goes as follows :\n",
"\n",
"1) Forward the entire acoustic model in a single pass. (Use global batch size here for acoustic model - found in `model.*_ds.batch_size`)\n",
"\n",
"2) Split the Acoustic Model's logits by `fused_batch_size` and loop over these sub-batches.\n",
"\n",
"3) Construct a sub-batch of same `fused_batch_size` for the Prediction model. Now the target sequence length is $U_{sub-batch} < U$. \n",
"3) Construct a sub-batch of same `fused_batch_size` for the Prediction model. Now the target sequence length is $U_{sub-batch} < U$.\n",
"\n",
"4) Feed this $U_{sub-batch}$ into the Joint model, along with a sub-batch from the Acoustic model (with $T_{sub-batch} < T$). Remember, we only have to slice off a part of the acoustic model here since we have the full batch of samples $(B, T, D)$ from the acoustic model.\n",
"\n",
"5) Performing steps (3) and (4) yields $T_{sub-batch}$ and $U_{sub-batch}$. Perform sub-batch joint step - costing an intermediate $(B, T_{sub-batch}, U_{sub-batch}, V)$ in memory.\n",
"\n",
"6) Compute loss on sub-batch and preserve in a list to be later concatenated. \n",
"6) Compute loss on sub-batch and preserve in a list to be later concatenated.\n",
"\n",
"7) Compute sub-batch metrics (such as Character / Word Error Rate) using the above Joint tensor and sub-batch of ground truth labels. Preserve the scores to be averaged across the entire batch later.\n",
"\n",
"8) Delete the sub-batch joint matrix $(B, T_{sub-batch}, U_{sub-batch}, V)$. Only gradients from .backward() are preserved now in the computation graph.\n",
"\n",
"9) Repeat steps (3) - (8) until all sub-batches are consumed.\n",
"\n",
"10) Cleanup step. Compute full batch WER and log. Concatenate loss list and pass to PTL to compute the equivalent of the original (full batch) Joint step. Delete ancillary objects necessary for sub-batching. \n"
"10) Cleanup step. Compute full batch WER and log. Concatenate loss list and pass to PTL to compute the equivalent of the original (full batch) Joint step. Delete ancillary objects necessary for sub-batching.\n"
]
},
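The toy snippet below sketches that sub-batching control flow with tiny tensors and stand-in linear layers; it is schematic only and is not NeMo's actual `training_step`:

```python
import torch

B, T, U, D, H, V = 8, 50, 20, 64, 32, 28
fused_batch_size = 2

enc = torch.randn(B, T, D)                        # step 1: full-batch acoustic output
targets = torch.randint(0, V - 1, (B, U))
enc_proj = torch.nn.Linear(D, H)
pred_proj = torch.nn.Embedding(V, H)              # stand-in for the prediction network
out_proj = torch.nn.Linear(H, V)

losses = []
for start in range(0, B, fused_batch_size):       # step 2: loop over sub-batches
    sl = slice(start, start + fused_batch_size)
    pred = pred_proj(targets[sl])                 # step 3: prediction net on the sub-batch
    # Steps 4-5: joint computed only for this sub-batch -> (B', T, U, V) intermediate.
    joint = out_proj(torch.tanh(enc_proj(enc[sl]).unsqueeze(2) + pred.unsqueeze(1)))
    losses.append(joint.log_softmax(-1).mean())   # step 6: stand-in for the RNNT loss
    del joint                                     # step 8: free the large tensor
loss = torch.stack(losses).mean()                 # step 10: combine and backprop as usual
loss.backward()
```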
{
@@ -1088,9 +1092,9 @@
"id": "dpUoqG_G6DII"
},
"source": [
"# (Extra) Extracting Transducer Model Alignments \n",
"# (Extra) Extracting Transducer Model Alignments\n",
"\n",
"Transducers are unique in the sense that for each timestep $t \\le T$, they can emit multiple target tokens $u_t$. During training, this is represented as the $T \\times U$ joint that maps to the vocabulary $V$. \n",
"Transducers are unique in the sense that for each timestep $t \\le T$, they can emit multiple target tokens $u_t$. During training, this is represented as the $T \\times U$ joint that maps to the vocabulary $V$.\n",
"\n",
"During inference, there is no need to compute the full joint $T \\times U$. Instead, after the model predicts the `Transducer Blank` token at the current timestep $t$ while predicting the target token $u_t$, the model will move onto the next acoustic timestep $t + 1$. As such, we can obtain the diagonal alignment of the Transducer model per sample relatively simply.\n",
"\n",
@@ -1140,7 +1144,7 @@
"source": [
"-------\n",
"\n",
"Set up a test data loader that we will use to obtain the alignments for a single batch. "
"Set up a test data loader that we will use to obtain the alignments for a single batch."
]
},
{
@@ -1177,15 +1181,15 @@
" encoded, encoded_len = model.forward(\n",
" input_signal=batch[0].to(device), input_signal_length=batch[1].to(device)\n",
" )\n",
" \n",
"\n",
" current_hypotheses = model.decoding.rnnt_decoder_predictions_tensor(\n",
" encoded, encoded_len, return_hypotheses=True\n",
" )\n",
" \n",
"\n",
" del encoded, encoded_len\n",
" \n",
" # current hypothesis is a tuple of \n",
" # 1) best hypothesis \n",
"\n",
" # current hypothesis is a tuple of\n",
" # 1) best hypothesis\n",
" # 2) Sorted list of hypothesis (if using beam search); None otherwise\n",
" return current_hypotheses"
]
@@ -1321,7 +1325,7 @@
"\n",
"Finally, let us calculate the alignment grid. We will de-tokenize the sub-word token if it is a valid index in the vocabulary and use `''` as a placeholder for the `Transducer Blank` token.\n",
"\n",
"Note that each `timestep` here is (roughly) $timestep * total\\_stride\\_of\\_model * preprocessor.window\\_stride$ seconds timestamp. \n",
"Note that each `timestep` here is (roughly) $timestep * total\\_stride\\_of\\_model * preprocessor.window\\_stride$ seconds timestamp.\n",
"\n",
"**Note**: You can modify the value of `config.model.loss.warprnnt_numba_kwargs.fastemit_lambda` prior to training and see an impact on final alignment latency!"
]
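As a worked example of that timestamp formula, for the 2x-stride model built earlier and the default 0.01 s preprocessor `window_stride` (both values are assumptions for illustration):

```python
total_stride = 2      # the 5-block encoder configured above is a stride-2x model
window_stride = 0.01  # seconds per feature frame (preprocessor default, assumed)

timestep = 25
print(f"timestep {timestep} ~ {timestep * total_stride * window_stride:.2f} s into the audio")  # ~0.50 s
```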
@@ -1342,7 +1346,7 @@
" token = token.to('cpu').numpy().tolist()\n",
" decoded_token = model.decoding.decode_ids_to_tokens([token])[0] if token != model.decoding.blank_id else '' # token at index len(vocab) == RNNT blank token\n",
" t_u.append(decoded_token)\n",
" \n",
"\n",
" print(f\"Tokens at timestep {ti} = {t_u}\")"
]
},
