Added the full data preparation docs
jpc committed Mar 1, 2024
1 parent 1ab7ad1 commit fb8129d
Showing 1 changed file with 190 additions and 0 deletions.
190 changes: 190 additions & 0 deletions nbs/Dataset preparation.ipynb
@@ -48,6 +48,196 @@
"These three steps are all independent since they require different chunking of speech data. For quantizing Whisper and S2A training we greedily merge the VAD segments from the same speaker into (at most) 30 second chunks to improve training performance (more uniform chunks mean less computation time is spent on padding). For T2S we randomly truncate when merging the VAD segments so the model also learns how to work with shorter texts. The code to perform this is in [1C. VAD merging](https://github.com/collabora/WhisperSpeech/blob/main/nbs/1C.%20VAD%20merging.ipynb)."
]
},
{
"cell_type": "markdown",
"id": "705b6807",
"metadata": {},
"source": [
"## TL;DR example – give me the codes!\n",
"\n",
"In this example we will convert a single split from the Multilingual Libri Speech dataset. \n",
"\n",
"#### Prepare the webdataset shards\n",
"\n",
"The first, most time-consuming, step is to convert the data from it's original form into the webdataset format. If you want to skip this section and still follow along, the results can be downloaded from Hugging Face at [datasets/collabora/multilingual-librispeech-webdataset](https://huggingface.co/datasets/collabora/multilingual-librispeech-webdataset/tree/main).\n",
"\n",
"First we need `tarp` which is a tool that helps create and manipulate the webdataset `tar` files more effectively. You can check out more about it in the [official tarp README](https://github.com/webdataset/tarp#examples)\n",
"\n",
"```bash\n",
"go install -v github.com/collabora/tarp/tarp@latest\n",
"```\n",
"\n",
"Afterwards, we download and unpack the original dataset files:\n",
"```\n",
"aria2c -x10 https://dl.fbaipublicfiles.com/mls/mls_french_opus.tar.gz\n",
"tar -xf mls_french_opus.tar.gz\n",
"```\n",
"\n",
"Next, we'll need to convert each line in the `transcripts.txt` file:\n",
"\n",
"```\n",
"10065_10039_000000 ses vêtements devinrent tout brillants de lumière et blancs comme la neige en sorte qu'il n'y a point de foulon sur la terre qui puisse en faire d'aussi blancs\n",
"```\n",
"\n",
"into a `tarp` script:\n",
"\n",
"```\n",
"train/10065_10039_000000.opus file:mls_french_opus/train/audio/10065/10039/10065_10039_000000.opus\n",
"train/10065_10039_000000.txt text:ses vêtements devinrent tout brillants de lumière et blancs comme la neige en sorte qu'il n'y a point de foulon sur la terre qui puisse en faire d'aussi blancs\n",
"```\n",
"\n",
"We can achieve this using a short Python script (saved as `make-script.py`):\n",
"\n",
"```python\n",
"import sys\n",
"\n",
"fname = sys.argv[1]\n",
"dir, split, _ = fname.rsplit(\"/\", 2)\n",
"\n",
"for ln in open(fname):\n",
" id, txt = ln.split(\"\\t\")\n",
" a,b,c = id.split(\"_\")\n",
" txt = txt.replace(\"\\n\", \"\")\n",
" print(f\"\"\"{split}/{id}.opus file:{dir}/{split}/audio/{a}/{b}/{id}.opus\n",
"{split}/{id}.txt text:{txt}\"\"\")\n",
"```\n",
"\n",
"Once we have this, we can run the conversion process. The python script outputs data sample descriptions which are fed to `tarp create` that archives them into a tar stream (a bit similar to `tar -T -`). The `tarp split` will then cut the incoming stream into 2GB shards and save them to separate files, making sure to split on sample boundaries.\n",
"\n",
"The 2GB size was chosen as a good compromise between the shard count and shard transcription time for `mp3`/`opus` files with mutlilingual speech. For LibriLight (English compressed with `FLAC`) the magic number was `5GB` because we FLAC compresses less and we can also use a smaller model for transcribing English speech.\n",
"\n",
"```bash\n",
"python3 make-script.py mls_french_opus/train/transcripts.txt \\\n",
" | /root/go/bin/tarp create -o - - \\\n",
" | /root/go/bin/tarp split -s 2e9 -o 'mls_french_train-audio-%06d.tar' -\n",
"```\n",
"\n",
"We'll have to repeat the same command two times replacing `train` with `test` and `dev` and afterwards we can upload everything to Hugging Face:\n",
"\n",
"```bash\n",
"huggingface-cli login\n",
"huggingface-cli upload --repo-type dataset collabora/multilingual-librispeech-webdataset .\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "c3db10ca",
"metadata": {},
"source": [
"#### Process the shards on a single GPU machine\n",
"\n",
"We do the sharding mainly to be able to effectively process data on many GPUs but for the sake of simplicity we will use a single GPU here. The process stays the same, but different tools would be used to schedule the jobs. For reference, below the commands, we have specified their approximate runtimes on a RTX 4090 for the French subset of MLS.\n",
"\n",
"Perform voice activity detection:\n",
"```bash\n",
"parallel --eta -j3 python -m whisperspeech.vad {} ::: ./*.tar\n",
"# 50min\n",
"```\n",
"\n",
"Extract speaker embeddings for each fragment:\n",
"```bash\n",
"parallel --eta -j2 python -m whisperspeech.extract_spk_emb --batch_size 16 {} ::: ./*.tar\n",
"# 1h 10min\n",
"```\n",
"\n",
"We perform VAD segment merging (we do it as a separate step here to remove all randomness and get reproducibility for later steps):\n",
"\n",
"```bash\n",
"parallel --eta -j16 python -m whisperspeech.vad_merge --eqvad {} ::: *.tar\n",
"parallel --eta -j16 python -m whisperspeech.vad_merge {} ::: *.tar\n",
"```\n",
"\n",
"With that covered we can start the heavy lifting with the transcripts:\n",
"\n",
"```bash\n",
"parallel --eta -j1 python -m whisperspeech.prepare_t2s_txts --transcription_model medium --language fr --batch_size 32 {} ::: *.tar\n",
"# 6h 48min\n",
"```\n",
"\n",
"Afterwards comes Encodec compression:\n",
"\n",
"```bash\n",
"parallel --eta -j2 python -m whisperspeech.prepare_s2a_atoks --batch_size 4 {} ::: *.tar\n",
"# 2h\n",
"```\n",
"\n",
"Now we can extract the semantic tokens for both the T2S (`eqvad`) and S2A (`maxvad`) training:\n",
"\n",
"```bash\n",
"parallel --eta -j1 python -m whisperspeech.extract_stoks --batch_size 16 --vq_model ../nbs/vqmodel-medium-en+pl-512c-dim64.model {} ::: *.tar\n",
"parallel --eta -j1 python -m whisperspeech.extract_stoks --kind eqvad --batch_size 16 --vq_model ../nbs/vqmodel-medium-en+pl-512c-dim64.model {} ::: *.tar\n",
"# 3h 45min\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "e92f3255",
"metadata": {},
"source": [
"#### Splitting out the validation set(s)\n",
"\n",
"After we have all the samples we may want to extract some validation sets. There are many ways to do it but here we'll manually choose some speakers we'll later skip completely during training.\n",
"\n",
"We start by dumping all the sample ids:\n",
"\n",
"```bash\n",
"parallel tar tf {} ::: stoks/*-atoks-3kbps-*.tar.gz | sed -e 's/\\.atoks\\.npy//' > all-samples-maxvad\n",
"parallel tar tf {} ::: stoks/*-small.en-txt-*.tar.gz | sed -e 's/\\.txt//' > all-samples-eqvad\n",
"wc -l all-samples-maxvad\n",
"```\n",
"\n",
"Because the sample ids (which are the original file paths) have speaker ids in them we can make a quick histogram:\n",
"\n",
"```bash\n",
"< all-samples-maxvad awk -F_ '{ print $1; }'|sort|uniq -c|sort -n|less\n",
"```\n",
"\n",
"From the result we can copy and paste 10 speaker ids of around 50 samples each to get 512 validation samples. We'll exclude them from the training set because we want to validate on unseen speakers. We have to repeat this process for both splits (`maxvad` and `eqvad` since they have'll different sample counts and ids):\n",
"\n",
"```bash\n",
"< all-samples-maxvad grep 'train/1579\\|train/2033\\|train/3182\\|train/12981\\|train/2284\\|train/2297\\|train/6348\\|train/7200\\|train/7679\\|train/1989' >\n",
"unseen-speakers-maxvad\n",
"< all-samples-eq grep 'train/1579\\|train/2033\\|train/3182\\|train/12981\\|train/2284\\|train/2297\\|train/6348\\|train/7200\\|train/7679\\|train/1989' > unseen-speakers-eqvad\n",
"```\n",
"\n",
"Once we have all the ids we can rescan the whole dataset once and split out the validation samples to separate webdataset shards to make validation fast:\n",
"\n",
"```bash\n",
"python -m whisperspeech.split_out_val_datasets *-atoks-* unseen-speakers-maxvad\n",
"python -m whisperspeech.split_out_val_datasets '*-txt-*' unseen-speakers-eqvad\n",
"cd stoks && python -m whisperspeech.split_out_val_datasets '*-maxvad-stoks-*' ../unseen-speakers-maxvad\n",
"cd stoks && python -m whisperspeech.split_out_val_datasets '*-eqvad-stoks-*' ../unseen-speakers-eqvad\n",
"```\n",
"\n",
"We can use `wc -l all-samples-maxvad` to find out how many samples we have."
]
},
{
"cell_type": "markdown",
"id": "a8dc0429",
"metadata": {},
"source": [
"#### Creating the dataset configuration files for training\n",
"\n",
"Finally we create the configuration files for the training script:\n",
"```bash\n",
"cat > mls-fr-t2s-train.dataset <<EOF\n",
"multilingual-librispeech-webdataset/*-medium-txt-*.tar.gz multilingual-librispeech-webdataset/vq-en+pl/ 390203 --txt_kind='medium-txt' --language=fr --exclude_files multilingual-librispeech-webdataset/unseen-speakers-eqvad\n",
"EOF\n",
"cat > mls-fr-s2a-train.dataset <<EOF\n",
"multilingual-librispeech-webdataset/*-atoks-*.tar.gz multilingual-librispeech-webdataset/vq-en+pl/ 338362 --language=fr --exclude_files multilingual-librispeech-webdataset/unseen-speakers-maxvad\n",
"EOF\n",
"cat > mls-fr-s2a-val-unseen-speakers.dataset <<EOF\n",
"multilingual-librispeech-webdataset/unseen-speakers-maxvad.tar.gz multilingual-librispeech-webdataset/vq-en+pl/ 512 --language fr\n",
"EOF\n",
"cat > mls-fr-t2s-val-unseen-speakers.dataset <<EOF\n",
"multilingual-librispeech-webdataset/unseen-speakers-eqvad.tar.gz multilingual-librispeech-webdataset/vq-en+pl/ 512 --txt_kind 'medium-txt' --language fr\n",
"EOF\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "544fe8d2",