Showing 9 changed files with 305 additions and 11 deletions.
@@ -0,0 +1,9 @@
name: Deploy to GitHub Pages
on:
  push:
    branches: [ "main", "master" ]
  workflow_dispatch:
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps: [uses: fastai/workflows/quarto-ghp@master]
@@ -0,0 +1,222 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "f91750d5", | ||
"metadata": {}, | ||
"source": [ | ||
"# I can has speech? What data WhisperSpeech needs?" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "961f74a4", | ||
"metadata": {}, | ||
"source": [ | ||
"**WhisperSpeech** is trained on heavily preprocessed speech data generated from several models:\n", | ||
"\n", | ||
"- acoustic tokens generated by [Encodec](https://github.com/facebookresearch/encodec)\n", | ||
"- semantic tokens generated by [the quantized Whisper model](https://github.com/collabora/WhisperSpeech/blob/main/nbs/2B.%20Whisper%20quantization%20(semantic%20token)%20model.ipynb)\n", | ||
"- automatic transcriptions made with [Whisper](https://github.com/openai/whisper)\n", | ||
"\n", | ||
"![WhisperSpeech TTS overview diagram](https://github.com/collabora/WhisperSpeech/blob/main/whisperspeech-diagram.png?raw=true)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "6877e40f", | ||
"metadata": {}, | ||
"source": [ | ||
"## Who is who? A high-level overview\n", | ||
"\n", | ||
"To get these 3 data representations we have to run the audio data through several models. The first two steps are always the same, the rest depend on the model we want to run.\n", | ||
"\n", | ||
"1. We start by downloading the speech audio files into a sharded webdataset (e.g. [A3. Download Project Guttenberg audiobooks]()). \n", | ||
"We released webdatasetified versions of two important public domain speech datasets – [LibriLight](https://huggingface.co/datasets/collabora/librilight-webdataset/tree/main) and [Project Gutenberg Audiobooks](https://huggingface.co/datasets/collabora/the-project-gutenberg-open-audiobook-collection-wds/tree/main).\n", | ||
"\n", | ||
"2. All subsequent steps rely on voice activity detection (VAD) so we always generate segment lists for all audio files (see [1B. Voice activity detection](https://github.com/collabora/WhisperSpeech/blob/main/nbs/1B.%20Voice%20activity%20detection.ipynb) for source code). \n", | ||
"The results of this step were also released on HuggingFace – [LibriLight](https://huggingface.co/datasets/collabora/librilight-processed-webdataset/tree/main) and [Project Gutenberg Audiobooks](https://huggingface.co/datasets/collabora/project-gutenberg-wds-preprocessed/tree/main).\n", | ||
"\n", | ||
"The next steps depend on which model we want to train of fine-tune.\n", | ||
"\n", | ||
"3. To re-train the *quantized Whisper model* we need to transcribe the audio with `base.en` ([2A. Whisper quantization dataset preparation](https://github.com/collabora/WhisperSpeech/blob/main/nbs/2A.%20Whisper%20quantization%20dataset%20preparation.ipynb)). A model pretrained on 60k hours of LibriLight is available from HuggingFace [whisper-vq-stoks-v2.model](https://huggingface.co/collabora/whisperspeech/blob/main/whisper-vq-stoks-v2.model).\n", | ||
"4. To train the text to semantic token model we need to transcribe the audio with Whisper `small.en` and extract the semantic tokens ([5A. T2S dataset preparation](https://github.com/collabora/WhisperSpeech/blob/main/nbs/5A.%20T2S%20dataset%20preparation.ipynb)).\n", | ||
"5. To train the semantic to acoustic model we need to extract the semantic tokens and compress the audio with Encodec for the semantic to acoustic model ([4A. S2A dataset preparation](https://github.com/collabora/WhisperSpeech/blob/main/nbs/4A.%20S2A%20dataset%20preparation.ipynb)).\n", | ||
"\n", | ||
"These three steps are all independent since they require different chunking of speech data. For quantizing Whisper and S2A training we greedily merge the VAD segments into (at most) 30 second chunks to improve training performance (more uniform chunks mean less computation time is spent on padding). For T2S we randomly truncate when merging the VAD segments so the model also learns how to work with shorter texts." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "544fe8d2", | ||
"metadata": {}, | ||
"source": [ | ||
"## Why WebDataset?\n", | ||
"\n", | ||
"All WhisperSpeech training and preproc code got reorganized around webdatasets. [Webdatasets](https://github.com/webdataset/webdataset) are just simple tar files that store all our data samples (files) but they are great for working with very large datasets. Inside these tar files we can store multiple files per sample in any format we want (e.g. the speech mp3/flac/wav files, the text transcripts, tokens in numpy arrays). For example from the data used to train the `S2A` model we have:\n", | ||
"\n", | ||
"```bash\n", | ||
"$ tar tf whisperspeech-s2a-512c-dim64/librilight-small-000.tar.gz |head -6\n", | ||
"small/1874/shortlifelincoln_0809_librivox_64kb_mp3/shortlifeoflincoln_10_nicolay_64kb_021.atoks.npy\n", | ||
"small/1874/shortlifelincoln_0809_librivox_64kb_mp3/shortlifeoflincoln_10_nicolay_64kb_021.stoks.npy\n", | ||
"small/28/amateur_cracksman_librivox_64kb_mp3/amateur_cracksman_04_hornung_64kb_004.atoks.npy\n", | ||
"small/28/amateur_cracksman_librivox_64kb_mp3/amateur_cracksman_04_hornung_64kb_004.stoks.npy\n", | ||
"small/1874/shortlifelincoln_0809_librivox_64kb_mp3/shortlifeoflincoln_10_nicolay_64kb_052.atoks.npy\n", | ||
"small/1874/shortlifelincoln_0809_librivox_64kb_mp3/shortlifeoflincoln_10_nicolay_64kb_052.stoks.npy\n", | ||
"```\n", | ||
"\n", | ||
"The name of the file is the same as the file name of the original dataset sample and the extensions tell us what kind of value they hold and in which format.\n", | ||
"\n", | ||
"Furthermore we can split the whole dataset into fixed-size tar files called shards and load them on demand without unpacking. It turns out that this is exactly what we need for both AI training and data preprocessing:\n", | ||
"\n", | ||
"- for **training** we start a multiple CPU workers in parallel, open different shards in each, stream the data sequentially from disk (fast), decode it independently and them shuffle the samples we receive from each worker to create varied training batches\n", | ||
"- for **preprocessing** we independently send each shard to a worker and save all the results in a new webdataset shard\n", | ||
"\n", | ||
"Reading samples sequentialy allows us to simply compress the whole file with `gzip` and offers best performance even on spinning or network disks.\n", | ||
"\n", | ||
":::{.callout-note} \n", | ||
"For the Juwels cluster there is another crucial benefit. There is a pretty low limit on the total number of files on network disks (`inodes` to be precise) so there is a strong preference to keep data in a few large files. The network file system performance is also better if we don't have to open too many files. \n", | ||
":::\n", | ||
"\n", | ||
"Keeping each shard around 5GB seems to work great (the processed shards will likely be a lot smaller but it's a lot easier to keep a 1-to-1 shard mapping). For the almost 4TB LibriLight dataset this translates to 625 files.\n", | ||
"\n", | ||
"We found it quite useful to also keep all the data in some splits. This is data dependent but for LibriLight we followed the original split (`small`, `medium`, `large`) but also extracted the `6454` speaker from the `large` split because it is was the largest single speaker dataset and it allowed us to use it during development without downloading the full 4TB.\n", | ||
"\n", | ||
":::{.callout-caution} \n", | ||
"The sample file names should not have dots in them, otherwise the WebDataset code gets confused which files go together into one sample. This can be worked around later but it's easiest if we just do `.replace('.', '_')` when storing the initial raw dataset. \n", | ||
":::" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "65d78287", | ||
"metadata": {}, | ||
"source": [ | ||
"## Joins on WebDatasets\n", | ||
"\n", | ||
"One novel functionality we developed for this project is the capability to join multiple preprocessed webdatasets. This mechanism relies on keeping a constant ordering of samples in a shard and ensuring 1-to-1 correspondence between the input and output shards during preprocessing.\n", | ||
"\n", | ||
"Example usage:\n", | ||
"\n", | ||
"```python\n", | ||
"ds = wds.WebDataset([str(x) for x in Path('librilight/').glob('*.tar')]).compose( # load all audio shards\n", | ||
" wds.decode(wds.torch_audio), # decode the audio data\n", | ||
" vq_stoks.merge_in( # merge another WebDataset\n", | ||
" # for each audio (`raw`) shard, find the path and name of a corresponding `vad` shard\n", | ||
" vq_stoks.derived_dataset('librilight-processed/', 'vad')\n", | ||
" ),\n", | ||
")\n", | ||
"```\n", | ||
"\n", | ||
"`derived_dataset` creates for us a helper function that returns an opened derived dataset given the original shard file name:\n", | ||
"```python\n", | ||
"def derived_dataset(path, kind):\n", | ||
" def deriver(url):\n", | ||
" url = str(Path(path)/(Path(url).name.replace(\"raw\", kind) + \".gz\"))\n", | ||
" return wds.WebDataset(wds.SimpleShardList([url])).decode()\n", | ||
" return deriver\n", | ||
"```\n", | ||
"\n", | ||
"This feature is experimental and the API may change as we develop more experience with this merging style." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "6e4013a0", | ||
"metadata": {}, | ||
"source": [ | ||
"## Examples of preprocessing runs\n", | ||
"\n", | ||
"An example of running a preprocessing step locally on a single file:\n", | ||
"\n", | ||
"```bash\n", | ||
"mkdir -p guttenberg-preproc && cd guttenberg-preproc\n", | ||
"python -m whisperspeech.vad ../guttenberg-audiobooks/guttenberg-audiobooks-raw-000010.tar\n", | ||
"```\n", | ||
" \n", | ||
"This will generate a file named `guttenberg-audiobooks-vad-000000.tar.gz` in the `guttenberg-preproc` directory.\n", | ||
"\n", | ||
"On the cluster we can run multiple jobs in parallel (`24` in this case), each processing one input shard. Since each job is pretty short (around 30 minutes) it's easier for the scheduler to squeeze these between longer and higher-priority jobs.\n", | ||
"\n", | ||
"\n", | ||
"```bash\n", | ||
"mkdir -p whisperspeech-s2a-512c-dim64 && cd whisperspeech-s2a-512c-dim64\n", | ||
"find ../librilight/ -name 'librilight-small-*.tar'| ~/clapa1/run-batch 24 \\\n", | ||
" 'python -m whisperspeech.prepare_s2a_dataset $FILE ../librilight-preproc\n", | ||
" --vq_model ~/clapa1/scratch/vqmodel-512c-dim64-4e-hyptuned-32gpu.model\n", | ||
" --batch_size 8'\n", | ||
"```\n", | ||
" \n", | ||
"The `prepare_s2a_dataset` script is taking raw audio data from the input file, automatically finding corresponding shards with VAD results in `../librilight-preproc` and writing the results to the `whisperspeech-s2a-512c-dim64` directory." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "5553c5de", | ||
"metadata": {}, | ||
"source": [ | ||
"## Voice activity detection\n", | ||
"\n", | ||
"Code: [1B. Voice activity detection](https://github.com/collabora/WhisperSpeech/blob/main/nbs/1B.%20Voice%20activity%20detection.ipynb)\n", | ||
"\n", | ||
"Right now we are using the VAD model from WhisperX that is enough to avoid cutting audio in the middle of a word which would hurt automated transcriptions quite a lot. For more fancy datasets with multiple speakers we could use pyannote for it's detection of multiple people speaking at once and diarization capability.\n", | ||
"\n", | ||
"We later merge the VAD segments into longer chunks for more efficient training (less padding == higher efficiency). The code and histogram plots can be found in [2A. Whisper quantization dataset preparation](https://github.com/collabora/WhisperSpeech/blob/main/nbs/2A.%20Whisper%20quantization%20dataset%20preparation.ipynb)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "0fdefced", | ||
"metadata": {}, | ||
"source": [ | ||
"## Transcription\n", | ||
"\n", | ||
"Code: [5A. T2S dataset preparation](https://github.com/collabora/WhisperSpeech/blob/main/nbs/5A.%20T2S%20dataset%20preparation.ipynb)\n", | ||
"\n", | ||
"For training the TTS model (T2S) we are using running batches of chunked speech segments though FasterWhisper. We use the `small.en` model since there seems to be little benefit from using the larger models on English speech. For multilingual TTS we would probably want to switch to `large-v2`.\n", | ||
"\n", | ||
":::{.callout-note} \n", | ||
"Right now we extract both semantic tokens and transcriptions in one go. Doing the transcriptions is very time consuming are the result is unlikely to change. OTOH we may want to regenerate the semantic tokens if we train different quantized Whisper models. Because of that we may want to split this into two separate steps and only merge the results just before we generate the training dataset. \n", | ||
":::" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "f1681a56", | ||
"metadata": {}, | ||
"source": [ | ||
"## Acoustic token extraction\n", | ||
"\n", | ||
"Code: [4A. S2A dataset preparation](https://github.com/collabora/WhisperSpeech/blob/main/nbs/4A.%20S2A%20dataset%20preparation.ipynb)\n", | ||
" \n", | ||
"This is basically the same as T2S above but with Encodec instead of Whisper." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "2bcc413c", | ||
"metadata": {}, | ||
"source": [ | ||
"## Temporary training dataset\n", | ||
"\n", | ||
"Code: NFY\n", | ||
"\n", | ||
"We noticed that it is beneficial to reshard the dataset just before training the models. This allows us:\n", | ||
"\n", | ||
"- make sensible train/validation splits,\n", | ||
"- shuffle the samples (which stay in their original ordering to the last moment to enable effortless joins).\n", | ||
"\n", | ||
"We are developing a script to tackle both of these problems at the same time and generate a configurable number of output shards (you want to have more shards than you have data loader threads * GPUs, otherwise you may end up with the same sample repeated several times in a single batch)." | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "python3", | ||
"language": "python", | ||
"name": "python3" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
@@ -0,0 +1,19 @@
/*-- scss:defaults --*/
$primary: #5c2983;
$link-color: #40ba2f;

$code-block-bg: #eff6ef;
$code-block-border-color: #bbcdbb;

.navbar-title {
  font-weight: bold;
  color: #e0d6e7;
}

.navbar-logo {
  min-width: 60px;
}

a {
  font-weight: bold;
}
@@ -0,0 +1,12 @@
format:
  html:
    theme:
      - cosmo
      - collabora.scss
    code-block-background: true

website:
  navbar:
    background: primary
    search: true
    logo: logo.svg
@@ -0,0 +1,37 @@
.cell {
  margin-bottom: 1rem;
}

.cell > .sourceCode {
  margin-bottom: 0;
}

.cell-output > pre {
  margin-bottom: 0;
}

.cell-output > pre, .cell-output > .sourceCode > pre, .cell-output-stdout > pre {
  margin-left: 0.8rem;
  margin-top: 0;
  background: none;
  border-left: 2px solid lightsalmon;
  border-top-left-radius: 0;
  border-top-right-radius: 0;
}

.cell-output > .sourceCode {
  border: none;
}

.cell-output > .sourceCode {
  background: none;
  margin-top: 0;
}

div.description {
  padding-left: 2px;
  padding-top: 5px;
  font-style: italic;
  font-size: 135%;
  opacity: 70%;
}