U2 updates #25

Closed
wants to merge 3 commits
32 changes: 17 additions & 15 deletions chapters/en/chapter2/asr_pipeline.mdx
@@ -1,14 +1,18 @@
# Automatic speech recognition with a pipeline

-Automatic Speech Recognition (ASR) is a task that involves transcribing speech audio recording into text.
-This task has has numerous practical applications, from creating closed captions for videos to enabling voice commands
+Automatic Speech Recognition (ASR) is the task of transcribing a speech audio recording into text.
+This task has numerous practical applications, from creating closed captions for videos, to enabling voice commands
for virtual assistants like Siri and Alexa.

In this section, we'll use the `automatic-speech-recognition` pipeline to transcribe an audio recording of a person
-asking a question about paying a bill using the same MINDS-14 dataset as before.
+asking a question about paying a bill using the same [MINDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset as before.

-To get started, load the dataset and upsample it to 16kHz as described in [Audio classification with a pipeline](introduction.mdx),
-if you haven't done that yet.
+To get started, load the `en-AU` subset of the data as before:
+```py
+from datasets import load_dataset
+
+minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
+```

To transcribe an audio recording, we can use the `automatic-speech-recognition` pipeline from 🤗 Transformers. Let's
instantiate the pipeline:
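The instantiation code itself is collapsed in this view. As a rough sketch of what the `automatic-speech-recognition` pipeline call looks like with the default (English) checkpoint, not necessarily the exact lines in the file:

```py
from transformers import pipeline

# Create an ASR pipeline; with no `model` argument, 🤗 Transformers downloads
# a default English speech recognition checkpoint.
asr = pipeline("automatic-speech-recognition")
```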
@@ -23,7 +27,7 @@ Next, we'll take an example from the dataset and pass its raw data to the pipeline

```py
example = minds[0]
-asr(example["audio"]["array"])
+asr(example["audio"])
{"text": "I WOULD LIKE TO PAY MY ELECTRICITY BILL USING MY COD CAN YOU PLEASE ASSIST"}
```

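For context on the change above: `example["audio"]` from 🤗 Datasets is a dictionary bundling the waveform with its sampling rate, which recent versions of the pipeline accept directly. A sketch of the equivalent explicit call, not part of the diff, with key names following the 🤗 Datasets audio format:

```py
# `example["audio"]` carries both the waveform and its sampling rate, so the
# pipeline knows whether it needs to resample before transcribing.
audio = example["audio"]
asr({"array": audio["array"], "sampling_rate": audio["sampling_rate"]})
```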
@@ -41,28 +45,27 @@ is often silent. Having said that, I wouldn't recommend trying to pay your next
By default, this pipeline uses a model trained for automatic speech recognition for English language, which is fine in
this example. If you'd like to try transcribing other subsets of MINDS-14 in different language, you can find a pre-trained
ASR model [on the 🤗 Hub](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&language=fr&sort=downloads).
-You can filter the models list by task first, then by language. Once you have found the model you like, pass it's name as
+You can filter the models list by task first, then by language. Once you have found the model you like, pass its name as
the `model` argument to the pipeline.

-Let's try this for the German split of the MINDS-14. Load the "de-DE" subset:
+Let's try this for the German split of the MINDS-14. First, we load the "de-DE" subset:

```py
from datasets import load_dataset
-from datasets import Audio

minds = load_dataset("PolyAI/minds14", name="de-DE", split="train")
-minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
Member (review comment):

Personal preference: It makes sense to explicitly resample here, just to reinforce the idea of sampling rates to the attendee.

We can later on explicitly write that sampling_rate handles different rates automagically.

WDYT?
```

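If you prefer the reviewer's suggestion of keeping the resampling step visible, the removed lines above already sketch how to do it explicitly with 🤗 Datasets:

```py
from datasets import Audio

# Explicitly resample the audio column to 16 kHz; the pipeline would otherwise
# handle mismatched sampling rates on its own.
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
```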
-Get an example and see what the transcription is supposed to be:
+Then get an example and see what the transcription is supposed to be:

```py
example = minds[0]
example["transcription"]
"ich möchte gerne Geld auf mein Konto einzahlen"
```

-Find a pre-trained ASR model for German language on the 🤗 Hub, instantiate a pipeline, and transcribe the example:
+Next, we can find a pre-trained ASR model for German language on the 🤗 Hub, instantiate a pipeline, and transcribe the example.
+Here, we'll use the checkpoint [maxidl/wav2vec2-large-xlsr-german](https://huggingface.co/maxidl/wav2vec2-large-xlsr-german):
Contributor Author (review comment):

I personally like having links to the checkpoints on the Hub so that I can look at the model cards

Member (review comment):

Lovely, how about we swap the community checkpoint to an official DE checkpoint for XLSR: https://huggingface.co/facebook/wav2vec2-large-xlsr-53-german
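Putting the thread above into code: a sketch of instantiating the German ASR pipeline with the checkpoint linked in the diff. The collapsed lines below may differ, and the reviewer proposes swapping in facebook/wav2vec2-large-xlsr-53-german instead:

```py
from transformers import pipeline

# Checkpoint linked in the diff; swap in "facebook/wav2vec2-large-xlsr-53-german"
# if you follow the reviewer's suggestion. Assumes `example` from the de-DE subset above.
asr = pipeline("automatic-speech-recognition", model="maxidl/wav2vec2-large-xlsr-german")
asr(example["audio"])
```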
```py
from transformers import pipeline
@@ -77,9 +80,8 @@ Also, stimmt's!
When working on solving your own task, starting with a simple pipeline like the ones we've shown in this unit is a valuable
tool that offers several benefits:
- a pre-trained model may exist that already solves your task really well, saving you plenty of time
-- pipeline() takes care of all the pre/post-processing for you, so you don't have to worry about getting the data into
+- `pipeline()` takes care of all the pre/post-processing for you, so you don't have to worry about getting the data into
the right format for a model
- if the result isn't ideal, this still gives you a quick baseline for future fine-tuning
- once you fine-tune a model on your custom data and share it on Hub, the whole community will be able to use it quickly
-and effortlessly via the `pipeline()` method making AI more accessible.
-
+and effortlessly via the `pipeline()` method, making AI more accessible
Contributor Author (review comment):

Consistency with previous bullet points
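To illustrate the final bullet point in the diff above: once a fine-tuned checkpoint is shared on the Hub, anyone can load it by name through `pipeline()`. The repository id below is a placeholder, not a real model:

```py
from transformers import pipeline

# "your-username/your-finetuned-asr-model" is a hypothetical repository id;
# replace it with the name of a checkpoint you have pushed to the Hub.
asr = pipeline("automatic-speech-recognition", model="your-username/your-finetuned-asr-model")
```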
35 changes: 23 additions & 12 deletions chapters/en/chapter2/audio_classification_pipeline.mdx
@@ -2,29 +2,30 @@

Audio classification involves assigning one or more labels to an audio recording based on its content. The labels
could correspond to different sound categories, such as music, speech, or noise, or more specific categories like
-bird song or car engine sounds.
+bird song or car engine sounds. We can even go so far as labelling a music sample based on its genre, which we'll see more
+of in Unit 4.

Before diving into details on how the most popular audio transformers work, and before fine-tuning a custom model, let's
see how you can use an off-the-shelf pre-trained model for audio classification with only a few lines of code with 🤗 Transformers.

Let's go ahead and use the same [MINDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset that you have explored
in the previous unit. If you recall, MINDS-14 contains recordings of people asking an e-banking system questions in several
languages and dialects, and has the `intent_class` for each recording. We can classify the recordings by intent of the call.
Such a system could be used as the first stage of an automated call-centre, to put a customer through to the correct department
based on what they've said.

-Just as before, let's start by loading the `en-AU` subset of the data to try out the pipeline, and upsample it to 16kHz
-sampling rate which is what most speech models require.
+Just as before, let's start by loading the `en-AU` subset of the data to try out the pipeline:

```py
from datasets import load_dataset
-from datasets import Audio

minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
-minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
Contributor Author (review comment):

Think it's cleaner if we don't have to do any data pre-/post-processing and let the pipeline handle this

Member (review comment):

Same as above, IMO it is good to reinforce the idea of sampling rates.
```

To classify an audio recording into a set of classes, we can use the `audio-classification` pipeline from 🤗 Transformers.
In our case, we need a model that's been fine-tuned for intent classification, and specifically on
-the MINDS-14 dataset. Luckily for us, the Hub has a model that does just that! Let's load it by using the `pipeline()` function:
+the MINDS-14 dataset. Luckily for us, the Hub has a [model](https://huggingface.co/anton-l/xtreme_s_xlsr_300m_minds14)
+that does just that! Let's load it by using the `pipeline()` function:

```py
from transformers import pipeline
@@ -35,18 +36,28 @@ classifier = pipeline(
)
```

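The arguments of the `pipeline()` call are partly collapsed above; presumably it uses the checkpoint linked in the diff, roughly like this (a sketch, not necessarily the exact file contents):

```py
from transformers import pipeline

# Intent-classification checkpoint fine-tuned on MINDS-14, linked in the diff above.
classifier = pipeline(
    "audio-classification",
    model="anton-l/xtreme_s_xlsr_300m_minds14",
)
```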
-This pipeline expects the audio data as a NumPy array. All the preprocessing of the raw audio data will be conveniently
-handled for us by the pipeline. Let's pick an example to try it out:
+All the preprocessing of the raw audio data will be conveniently handled for us by the pipeline, including any resampling.
+Let's pick an example to try it out:

```py
example = minds[0]
```

-If you recall the structure of the dataset, the raw audio data is stored in a NumPy array under `["audio"]["array"]`, let's
-pass is straight to the `classifier`:
+If you recall the structure of the dataset, the audio data is stored as a dictionary array under the key `["audio"]`:
+
+```python
+example["audio"]
+```
+
+The resulting dictionary has two further keys:
+* `["array"]`: the 1-dimensional audio array
+* `["sampling_rate"]`: the sampling rate of the audio sample
+
+This is exactly the format of input data that the pipeline expects. Thus, we can pass the `["audio"]` sample straight
+to the `classifier`:

```py
-classifier(example["audio"]["array"])
+classifier(example["audio"])
[
{"score": 0.9631525278091431, "label": "pay_bill"},
{"score": 0.02819698303937912, "label": "freeze"},
@@ -66,7 +77,7 @@ id2label(example["intent_class"])
```

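The definition of `id2label` falls outside the lines shown in this hunk; a plausible reconstruction using the dataset's `ClassLabel` feature from 🤗 Datasets (an assumption, since the diff does not show it):

```py
# Map the integer intent class back to its human-readable label.
# Assumes `minds` and `example` from the loading code earlier in the chapter.
id2label = minds.features["intent_class"].int2str
id2label(example["intent_class"])
```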
Hooray! The predicted label was correct! Here we were lucky to find a model that can classify the exact labels that we need.
-A lot of the times, when dealing with a classification task, a pre-trained model's set of classes is not exactly the same
+A lot of the time, when dealing with a classification task, a pre-trained model's set of classes is not exactly the same
as the classes you need the model to distinguish. In this case, you can fine-tune a pre-trained model to "calibrate" it to
your exact set of class labels. We'll learn how to do this in the upcoming units. Now, let's take a look at another very
common task in speech processing, _automatic speech recognition_.