U2 updates #25

Closed
wants to merge 3 commits
32 changes: 17 additions & 15 deletions chapters/en/chapter2/asr_pipeline.mdx
@@ -1,14 +1,18 @@
# Automatic speech recognition with a pipeline

-Automatic Speech Recognition (ASR) is a task that involves transcribing speech audio recording into text.
-This task has has numerous practical applications, from creating closed captions for videos to enabling voice commands
+Automatic Speech Recognition (ASR) is the task of transcribing a speech audio recording into text.
+This task has numerous practical applications, from creating closed captions for videos, to enabling voice commands
for virtual assistants like Siri and Alexa.

In this section, we'll use the `automatic-speech-recognition` pipeline to transcribe an audio recording of a person
-asking a question about paying a bill using the same MINDS-14 dataset as before.
+asking a question about paying a bill using the same [MINDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset as before.

-To get started, load the dataset and upsample it to 16kHz as described in [Audio classification with a pipeline](introduction.mdx),
-if you haven't done that yet.
+To get started, load the `en-AU` subset of the data as before:
+```py
+from datasets import load_dataset
+
+minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
+```

To transcribe an audio recording, we can use the `automatic-speech-recognition` pipeline from 🤗 Transformers. Let's
instantiate the pipeline:
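The instantiation code itself is collapsed in this view. As a rough sketch of what the `automatic-speech-recognition` pipeline call looks like with the default (English) checkpoint, not necessarily the exact lines in the file:

```py
from transformers import pipeline

# Create an ASR pipeline; with no `model` argument, 🤗 Transformers downloads
# a default English speech recognition checkpoint.
asr = pipeline("automatic-speech-recognition")
```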
@@ -23,7 +27,7 @@ Next, we'll take an example from the dataset and pass its raw data to the pipeline

```py
example = minds[0]
-asr(example["audio"]["array"])
+asr(example["audio"])
{"text": "I WOULD LIKE TO PAY MY ELECTRICITY BILL USING MY COD CAN YOU PLEASE ASSIST"}
```

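For context on the change above: `example["audio"]` from 🤗 Datasets is a dictionary bundling the waveform with its sampling rate, which recent versions of the pipeline accept directly. A sketch of the equivalent explicit call, not part of the diff, with key names following the 🤗 Datasets audio format:

```py
# `example["audio"]` carries both the waveform and its sampling rate, so the
# pipeline knows whether it needs to resample before transcribing.
audio = example["audio"]
asr({"array": audio["array"], "sampling_rate": audio["sampling_rate"]})
```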
@@ -41,28 +45,27 @@ is often silent. Having said that, I wouldn't recommend trying to pay your next
By default, this pipeline uses a model trained for automatic speech recognition for English language, which is fine in
this example. If you'd like to try transcribing other subsets of MINDS-14 in different language, you can find a pre-trained
ASR model [on the 🤗 Hub](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&language=fr&sort=downloads).
-You can filter the models list by task first, then by language. Once you have found the model you like, pass it's name as
+You can filter the models list by task first, then by language. Once you have found the model you like, pass its name as
the `model` argument to the pipeline.

-Let's try this for the German split of the MINDS-14. Load the "de-DE" subset:
+Let's try this for the German split of the MINDS-14. First, we load the "de-DE" subset:

```py
from datasets import load_dataset
-from datasets import Audio

minds = load_dataset("PolyAI/minds14", name="de-DE", split="train")
-minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
Member (review comment):

Personal preference: It makes sense to explicitly resample here, just to reinforce the idea of sampling rates to the attendee.

We can later on explicitly write that sampling_rate handles different rates automagically.

WDYT?
```

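If you prefer the reviewer's suggestion of keeping the resampling step visible, the removed lines above already sketch how to do it explicitly with 🤗 Datasets:

```py
from datasets import Audio

# Explicitly resample the audio column to 16 kHz; the pipeline would otherwise
# handle mismatched sampling rates on its own.
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
```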
-Get an example and see what the transcription is supposed to be:
+Then get an example and see what the transcription is supposed to be:

```py
example = minds[0]
example["transcription"]
"ich möchte gerne Geld auf mein Konto einzahlen"
```

-Find a pre-trained ASR model for German language on the 🤗 Hub, instantiate a pipeline, and transcribe the example:
+Next, we can find a pre-trained ASR model for German language on the 🤗 Hub, instantiate a pipeline, and transcribe the example.
+Here, we'll use the checkpoint [maxidl/wav2vec2-large-xlsr-german](https://huggingface.co/maxidl/wav2vec2-large-xlsr-german):
Contributor Author (review comment):

I personally like having links to the checkpoints on the Hub so that I can look at the model cards

Member (review comment):

Lovely, how about we swap the community checkpoint to an official DE checkpoint for XLSR: https://huggingface.co/facebook/wav2vec2-large-xlsr-53-german
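Putting the thread above into code: a sketch of instantiating the German ASR pipeline with the checkpoint linked in the diff. The collapsed lines below may differ, and the reviewer proposes swapping in facebook/wav2vec2-large-xlsr-53-german instead:

```py
from transformers import pipeline

# Checkpoint linked in the diff; swap in "facebook/wav2vec2-large-xlsr-53-german"
# if you follow the reviewer's suggestion. Assumes `example` from the de-DE subset above.
asr = pipeline("automatic-speech-recognition", model="maxidl/wav2vec2-large-xlsr-german")
asr(example["audio"])
```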
```py
from transformers import pipeline
@@ -77,9 +80,8 @@ Also, stimmt's!
When working on solving your own task, starting with a simple pipeline like the ones we've shown in this unit is a valuable
tool that offers several benefits:
- a pre-trained model may exist that already solves your task really well, saving you plenty of time
-- pipeline() takes care of all the pre/post-processing for you, so you don't have to worry about getting the data into
+- `pipeline()` takes care of all the pre/post-processing for you, so you don't have to worry about getting the data into
the right format for a model
- if the result isn't ideal, this still gives you a quick baseline for future fine-tuning
- once you fine-tune a model on your custom data and share it on Hub, the whole community will be able to use it quickly
-and effortlessly via the `pipeline()` method making AI more accessible.
-
+and effortlessly via the `pipeline()` method, making AI more accessible
Contributor Author (review comment):

Consistency with previous bullet points
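To illustrate the final bullet point in the diff above: once a fine-tuned checkpoint is shared on the Hub, anyone can load it by name through `pipeline()`. The repository id below is a placeholder, not a real model:

```py
from transformers import pipeline

# "your-username/your-finetuned-asr-model" is a hypothetical repository id;
# replace it with the name of a checkpoint you have pushed to the Hub.
asr = pipeline("automatic-speech-recognition", model="your-username/your-finetuned-asr-model")
```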
35 changes: 23 additions & 12 deletions chapters/en/chapter2/audio_classification_pipeline.mdx
@@ -2,29 +2,30 @@

Audio classification involves assigning one or more labels to an audio recording based on its content. The labels
could correspond to different sound categories, such as music, speech, or noise, or more specific categories like
-bird song or car engine sounds.
+bird song or car engine sounds. We can even go so far as labelling a music sample based on its genre, which we'll see more
+of in Unit 4.

Before diving into details on how the most popular audio transformers work, and before fine-tuning a custom model, let's
see how you can use an off-the-shelf pre-trained model for audio classification with only a few lines of code with 🤗 Transformers.

Let's go ahead and use the same [MINDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset that you have explored
in the previous unit. If you recall, MINDS-14 contains recordings of people asking an e-banking system questions in several
languages and dialects, and has the `intent_class` for each recording. We can classify the recordings by intent of the call.
Such a system could be used as the first stage of an automated call-centre, to put a customer through to the correct department
based on what they've said.

-Just as before, let's start by loading the `en-AU` subset of the data to try out the pipeline, and upsample it to 16kHz
-sampling rate which is what most speech models require.
+Just as before, let's start by loading the `en-AU` subset of the data to try out the pipeline:

```py
from datasets import load_dataset
-from datasets import Audio

minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
-minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
Contributor Author (review comment):

Think it's cleaner if we don't have to do any data pre-/post-processing and let the pipeline handle this

Member (review comment):

Same as above, IMO it is good to reinforce the idea of sampling rates.
```

To classify an audio recording into a set of classes, we can use the `audio-classification` pipeline from 🤗 Transformers.
In our case, we need a model that's been fine-tuned for intent classification, and specifically on
-the MINDS-14 dataset. Luckily for us, the Hub has a model that does just that! Let's load it by using the `pipeline()` function:
+the MINDS-14 dataset. Luckily for us, the Hub has a [model](https://huggingface.co/anton-l/xtreme_s_xlsr_300m_minds14)
+that does just that! Let's load it by using the `pipeline()` function:

```py
from transformers import pipeline
@@ -35,18 +36,28 @@ classifier = pipeline(
)
```

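The arguments of the `pipeline()` call are partly collapsed above; presumably it uses the checkpoint linked in the diff, roughly like this (a sketch, not necessarily the exact file contents):

```py
from transformers import pipeline

# Intent-classification checkpoint fine-tuned on MINDS-14, linked in the diff above.
classifier = pipeline(
    "audio-classification",
    model="anton-l/xtreme_s_xlsr_300m_minds14",
)
```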
-This pipeline expects the audio data as a NumPy array. All the preprocessing of the raw audio data will be conveniently
-handled for us by the pipeline. Let's pick an example to try it out:
+All the preprocessing of the raw audio data will be conveniently handled for us by the pipeline, including any resampling.
+Let's pick an example to try it out:

```py
example = minds[0]
```

-If you recall the structure of the dataset, the raw audio data is stored in a NumPy array under `["audio"]["array"]`, let's
-pass is straight to the `classifier`:
+If you recall the structure of the dataset, the audio data is stored as a dictionary array under the key `["audio"]`:
+
+```python
+example["audio"]
+```
+
+The resulting dictionary has two further keys:
+* `["array"]`: the 1-dimensional audio array
+* `["sampling_rate"]`: the sampling rate of the audio sample
+
+This is exactly the format of input data that the pipeline expects. Thus, we can pass the `["audio"]` sample straight
+to the `classifier`:

```py
-classifier(example["audio"]["array"])
+classifier(example["audio"])
[
{"score": 0.9631525278091431, "label": "pay_bill"},
{"score": 0.02819698303937912, "label": "freeze"},
@@ -66,7 +77,7 @@ id2label(example["intent_class"])
```

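The definition of `id2label` falls outside the lines shown in this hunk; a plausible reconstruction using the dataset's `ClassLabel` feature from 🤗 Datasets (an assumption, since the diff does not show it):

```py
# Map the integer intent class back to its human-readable label.
# Assumes `minds` and `example` from the loading code earlier in the chapter.
id2label = minds.features["intent_class"].int2str
id2label(example["intent_class"])
```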
Hooray! The predicted label was correct! Here we were lucky to find a model that can classify the exact labels that we need.
-A lot of the times, when dealing with a classification task, a pre-trained model's set of classes is not exactly the same
+A lot of the time, when dealing with a classification task, a pre-trained model's set of classes is not exactly the same
as the classes you need the model to distinguish. In this case, you can fine-tune a pre-trained model to "calibrate" it to
your exact set of class labels. We'll learn how to do this in the upcoming units. Now, let's take a look at another very
common task in speech processing, _automatic speech recognition_.