U2 updates #25
@@ -1,14 +1,18 @@
 # Automatic speech recognition with a pipeline

-Automatic Speech Recognition (ASR) is a task that involves transcribing speech audio recording into text.
-This task has has numerous practical applications, from creating closed captions for videos to enabling voice commands
+Automatic Speech Recognition (ASR) is the task of transcribing a speech audio recording into text.
+This task has numerous practical applications, from creating closed captions for videos, to enabling voice commands
 for virtual assistants like Siri and Alexa.

 In this section, we'll use the `automatic-speech-recognition` pipeline to transcribe an audio recording of a person
-asking a question about paying a bill using the same MINDS-14 dataset as before.
+asking a question about paying a bill using the same [MINDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset as before.

-To get started, load the dataset and upsample it to 16kHz as described in [Audio classification with a pipeline](introduction.mdx),
-if you haven't done that yet.
+To get started, load the `en-AU` subset of the data as before:
+```py
+from datasets import load_dataset
+
+minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
+```

 To transcribe an audio recording, we can use the `automatic-speech-recognition` pipeline from 🤗 Transformers. Let's
 instantiate the pipeline:

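The pipeline instantiation itself falls outside the hunks shown in this diff. A rough sketch of what it presumably looks like (the exact checkpoint the course uses is not visible here, so the pipeline default is assumed):

```py
from transformers import pipeline

# With no model specified, 🤗 Transformers falls back to a default
# English ASR checkpoint for this task.
asr = pipeline("automatic-speech-recognition")
```
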
@@ -23,7 +27,7 @@ Next, we'll take an example from the dataset and pass its raw data to the pipeli

 ```py
 example = minds[0]
-asr(example["audio"]["array"])
+asr(example["audio"])
 {"text": "I WOULD LIKE TO PAY MY ELECTRICITY BILL USING MY COD CAN YOU PLEASE ASSIST"}
 ```

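The surrounding course text compares this output with the dataset's own reference transcript. As a quick sanity check, one could print the `transcription` column of the same example (a sketch based on the column used later in this diff):

```py
# Reference transcript that MINDS-14 stores for this recording
example["transcription"]
```
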
@@ -41,28 +45,27 @@ is often silent. Having said that, I wouldn't recommend trying to pay your next
 By default, this pipeline uses a model trained for automatic speech recognition for English language, which is fine in
 this example. If you'd like to try transcribing other subsets of MINDS-14 in different language, you can find a pre-trained
 ASR model [on the 🤗 Hub](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&language=fr&sort=downloads).
-You can filter the models list by task first, then by language. Once you have found the model you like, pass it's name as
+You can filter the models list by task first, then by language. Once you have found the model you like, pass its name as
 the `model` argument to the pipeline.

-Let's try this for the German split of the MINDS-14. Load the "de-DE" subset:
+Let's try this for the German split of the MINDS-14. First, we load the "de-DE" subset:

 ```py
 from datasets import load_dataset
-from datasets import Audio

 minds = load_dataset("PolyAI/minds14", name="de-DE", split="train")
-minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
 ```

-Get an example and see what the transcription is supposed to be:
+Then get an example and see what the transcription is supposed to be:

 ```py
 example = minds[0]
 example["transcription"]
 "ich möchte gerne Geld auf mein Konto einzahlen"
 ```

-Find a pre-trained ASR model for German language on the 🤗 Hub, instantiate a pipeline, and transcribe the example:
+Next, we can find a pre-trained ASR model for German language on the 🤗 Hub, instantiate a pipeline, and transcribe the example.
+Here, we'll use the checkpoint [maxidl/wav2vec2-large-xlsr-german](https://huggingface.co/maxidl/wav2vec2-large-xlsr-german):
> **Review comment:** I personally like having links to the checkpoints on the Hub so that I can look at the model cards
>
> **Review comment:** Lovely, how about we swap the community checkpoint to an official DE checkpoint for XLSR: https://huggingface.co/facebook/wav2vec2-large-xlsr-53-german

 ```py
 from transformers import pipeline

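The code block above is cut off by the hunk boundary. Presumably it continues by instantiating the pipeline with the German checkpoint named in the prose; a sketch of what that likely looks like (the checkpoint name comes from the text above, not from the visible diff):

```py
from transformers import pipeline

# ASR pipeline with the German checkpoint referenced above
asr = pipeline("automatic-speech-recognition", model="maxidl/wav2vec2-large-xlsr-german")

# Pass the audio dict straight to the pipeline, as in the updated example above
asr(example["audio"])
```
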
@@ -77,9 +80,8 @@ Also, stimmt's!
 When working on solving your own task, starting with a simple pipeline like the ones we've shown in this unit is a valuable
 tool that offers several benefits:
 - a pre-trained model may exist that already solves your task really well, saving you plenty of time
-- pipeline() takes care of all the pre/post-processing for you, so you don't have to worry about getting the data into
+- `pipeline()` takes care of all the pre/post-processing for you, so you don't have to worry about getting the data into
 the right format for a model
 - if the result isn't ideal, this still gives you a quick baseline for future fine-tuning
 - once you fine-tune a model on your custom data and share it on Hub, the whole community will be able to use it quickly
-and effortlessly via the `pipeline()` method making AI more accessible.
+and effortlessly via the `pipeline()` method, making AI more accessible
> **Review comment:** Consistency with previous bullet points
---

@@ -2,29 +2,30 @@

 Audio classification involves assigning one or more labels to an audio recording based on its content. The labels
 could correspond to different sound categories, such as music, speech, or noise, or more specific categories like
-bird song or car engine sounds.
+bird song or car engine sounds. We can even go so far as labelling a music sample based on its genre, which we'll see more
+of in Unit 4.

 Before diving into details on how the most popular audio transformers work, and before fine-tuning a custom model, let's
 see how you can use an off-the-shelf pre-trained model for audio classification with only a few lines of code with 🤗 Transformers.

 Let's go ahead and use the same [MINDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset that you have explored
 in the previous unit. If you recall, MINDS-14 contains recordings of people asking an e-banking system questions in several
 languages and dialects, and has the `intent_class` for each recording. We can classify the recordings by intent of the call.
 Such a system could be used as the first stage of an automated call-centre, to put a customer through to the correct department
 based on what they've said.

-Just as before, let's start by loading the `en-AU` subset of the data to try out the pipeline, and upsample it to 16kHz
-sampling rate which is what most speech models require.
+Just as before, let's start by loading the `en-AU` subset of the data to try out the pipeline:

 ```py
 from datasets import load_dataset
-from datasets import Audio

 minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
-minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
> **Review comment:** Think it's cleaner if we don't have to do any data pre-/post-processing and let the pipeline handle this
>
> **Review comment:** Same as above, IMO it is good to reinforce the idea of sampling rates.
 ```

 To classify an audio recording into a set of classes, we can use the `audio-classification` pipeline from 🤗 Transformers.
 In our case, we need a model that's been fine-tuned for intent classification, and specifically on
-the MINDS-14 dataset. Luckily for us, the Hub has a model that does just that! Let's load it by using the `pipeline()` function:
+the MINDS-14 dataset. Luckily for us, the Hub has a [model](https://huggingface.co/anton-l/xtreme_s_xlsr_300m_minds14)
+that does just that! Let's load it by using the `pipeline()` function:

 ```py
 from transformers import pipeline

@@ -35,18 +36,28 @@ classifier = pipeline(
 )
 ```
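Assembled across the hunk boundary above, the instantiation presumably reads as follows (the checkpoint is the one linked in the prose; the exact argument layout is an assumption):

```py
from transformers import pipeline

# Intent-classification model fine-tuned on MINDS-14
classifier = pipeline(
    "audio-classification",
    model="anton-l/xtreme_s_xlsr_300m_minds14",
)
```
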

-This pipeline expects the audio data as a NumPy array. All the preprocessing of the raw audio data will be conveniently
-handled for us by the pipeline. Let's pick an example to try it out:
+All the preprocessing of the raw audio data will be conveniently handled for us by the pipeline, including any resampling.
> **Review comment:** (Once huggingface/transformers#23445 is merged)

+Let's pick an example to try it out:

 ```py
 example = minds[0]
 ```

-If you recall the structure of the dataset, the raw audio data is stored in a NumPy array under `["audio"]["array"]`, let's
-pass is straight to the `classifier`:
+If you recall the structure of the dataset, the audio data is stored as a dictionary array under the key `["audio"]`:
+
+```python
+example["audio"]
+```
+
+The resulting dictionary has two further keys:
+* `["array"]`: the 1-dimensional audio array
+* `["sampling_rate"]`: the sampling rate of the audio sample

+This is exactly the format of input data that the pipeline expects. Thus, we can pass the `["audio"]` sample straight
+to the `classifier`:

 ```py
-classifier(example["audio"]["array"])
+classifier(example["audio"])
 [
 {"score": 0.9631525278091431, "label": "pay_bill"},
 {"score": 0.02819698303937912, "label": "freeze"},

@@ -66,7 +77,7 @@ id2label(example["intent_class"])
 ```
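`id2label` is defined outside the lines shown here. A plausible definition, using the `ClassLabel` feature from 🤗 Datasets (an assumption; the actual helper in the course may differ):

```py
# Map the integer intent class to its human-readable label
id2label = minds.features["intent_class"].int2str
id2label(example["intent_class"])
```
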

 Hooray! The predicted label was correct! Here we were lucky to find a model that can classify the exact labels that we need.
-A lot of the times, when dealing with a classification task, a pre-trained model's set of classes is not exactly the same
+A lot of the time, when dealing with a classification task, a pre-trained model's set of classes is not exactly the same
 as the classes you need the model to distinguish. In this case, you can fine-tune a pre-trained model to "calibrate" it to
 your exact set of class labels. We'll learn how to do this in the upcoming units. Now, let's take a look at another very
 common task in speech processing, _automatic speech recognition_.
> **Review comment:** Personal preference: It makes sense to explicitly resample here, just to reinforce the idea of sampling rates to the attendee. We can later on explicitly write that `sampling_rate` handles different rates automagically. WDYT?
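
For reference, the explicit resampling step the reviewer is describing (the lines this PR removes) looks like this:

```py
from datasets import Audio

# Cast the audio column so that examples are decoded at 16 kHz on access
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
```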