diff --git a/docs/source/en/task_summary.mdx b/docs/source/en/task_summary.mdx
index 95c2d9c201a586..55e4a230a1b65e 100644
--- a/docs/source/en/task_summary.mdx
+++ b/docs/source/en/task_summary.mdx
@@ -967,3 +967,158 @@ Here is an example of doing translation using a model and a tokenizer. The proce
 We get the same translation as with the pipeline example.
+
+## Audio classification
+
+Audio classification assigns a class to an audio signal. The Keyword Spotting dataset from the [SUPERB](https://huggingface.co/datasets/superb) benchmark is an example dataset that can be used for audio classification fine-tuning. This dataset contains ten classes of keywords for classification. If you'd like to fine-tune a model for audio classification, take a look at the [run_audio_classification.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/audio-classification/run_audio_classification.py) script or this [how-to guide](./tasks/audio_classification).
+
+The following examples demonstrate how to use a [`pipeline`] and a model and feature extractor for audio classification inference:
+
+```py
+>>> from transformers import pipeline
+
+>>> audio_classifier = pipeline(
+...     task="audio-classification", model="ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
+... )
+>>> audio_classifier("jfk_moon_speech.wav")
+[{'label': 'calm', 'score': 0.13856211304664612},
+ {'label': 'disgust', 'score': 0.13148026168346405},
+ {'label': 'happy', 'score': 0.12635163962841034},
+ {'label': 'angry', 'score': 0.12439591437578201},
+ {'label': 'fearful', 'score': 0.12404385954141617}]
+```
+
+The general process for using a model and feature extractor for audio classification is:
+
+1. Instantiate a feature extractor and a model from the checkpoint name.
+2. Process the audio signal to be classified with the feature extractor.
+3. Pass the input through the model and take the `argmax` to retrieve the most likely class.
+4. Convert the class id to a class name with `id2label` to return an interpretable result.
+
+```py
+>>> from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
+>>> from datasets import load_dataset
+>>> import torch
+
+>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
+>>> dataset = dataset.sort("id")
+>>> sampling_rate = dataset.features["audio"].sampling_rate
+
+>>> feature_extractor = AutoFeatureExtractor.from_pretrained("superb/wav2vec2-base-superb-ks")
+>>> model = AutoModelForAudioClassification.from_pretrained("superb/wav2vec2-base-superb-ks")
+
+>>> inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
+
+>>> with torch.no_grad():
+...     logits = model(**inputs).logits
+
+>>> predicted_class_ids = torch.argmax(logits, dim=-1).item()
+>>> predicted_label = model.config.id2label[predicted_class_ids]
+>>> predicted_label
+```
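+
+The classifier head returns a score for every class, so you are not limited to the single `argmax` label. As a small optional extension of the snippet above (it reuses the `logits` and `model` variables defined there; the exact scores depend on the checkpoint and the audio sample), one way to rank all classes is to apply a softmax to the logits:
+
+```py
+>>> # Reuses `logits` and `model` from the previous example.
+>>> probabilities = torch.nn.functional.softmax(logits, dim=-1)[0]
+>>> top5 = torch.topk(probabilities, k=5)
+>>> [(model.config.id2label[idx.item()], round(score.item(), 4)) for score, idx in zip(top5.values, top5.indices)]
+```
+
+This mirrors the label/score pairs returned by the pipeline above.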
+
+## Automatic speech recognition
+
+Automatic speech recognition transcribes an audio signal to text. The [Common Voice](https://huggingface.co/datasets/common_voice) dataset is an example dataset that can be used for automatic speech recognition fine-tuning. It contains an audio file of a speaker and the corresponding sentence. If you'd like to fine-tune a model for automatic speech recognition, take a look at the [run_speech_recognition_ctc.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py) or [run_speech_recognition_seq2seq.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py) scripts or this [how-to guide](./tasks/asr).
+
+The following examples demonstrate how to use a [`pipeline`] and a model and processor for automatic speech recognition inference:
+
+```py
+>>> from transformers import pipeline
+
+>>> speech_recognizer = pipeline(
+...     task="automatic-speech-recognition", model="facebook/wav2vec2-base-960h"
+... )
+>>> speech_recognizer("jfk_moon_speech.wav")
+{'text': "PRESENTETE MISTER VICE PRESIDENT GOVERNOR CONGRESSMEN THOMAS SAN O TE WILAN CONGRESSMAN MILLA MISTER WEBB MSTBELL SCIENIS DISTINGUISHED GUESS AT LADIES AND GENTLEMAN I APPRECIATE TO YOUR PRESIDENT HAVING MADE ME AN HONORARY VISITING PROFESSOR AND I WILL ASSURE YOU THAT MY FIRST LECTURE WILL BE A VERY BRIEF I AM DELIGHTED TO BE HERE AND I'M PARTICULARLY DELIGHTED TO BE HERE ON THIS OCCASION WE MEED AT A COLLEGE NOTED FOR KNOWLEGE IN A CITY NOTED FOR PROGRESS IN A STATE NOTED FOR STRAINTH AN WE STAND IN NEED OF ALL THREE"}
+```
+
+The general process for using a model and processor for automatic speech recognition is:
+
+1. Instantiate a processor (which combines a feature extractor for input processing and a tokenizer for decoding) and a model from the checkpoint name.
+2. Process the audio signal to be transcribed with the processor.
+3. Pass the input through the model and take the `argmax` to retrieve the predicted token ids.
+4. Decode the predicted token ids with the processor to obtain the transcription.
+
+```py
+>>> from transformers import AutoProcessor, AutoModelForCTC
+>>> from datasets import load_dataset
+>>> import torch
+
+>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
+>>> dataset = dataset.sort("id")
+>>> sampling_rate = dataset.features["audio"].sampling_rate
+
+>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
+>>> model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h")
+
+>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
+>>> with torch.no_grad():
+...     logits = model(**inputs).logits
+>>> predicted_ids = torch.argmax(logits, dim=-1)
+
+>>> transcription = processor.batch_decode(predicted_ids)
+>>> transcription[0]
+```
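+
+Several clips can also be transcribed in a single forward pass by letting the processor pad them to the same length. The sketch below is one possible way to do this; it reuses the `processor`, `model`, `dataset`, and `sampling_rate` variables from the previous example:
+
+```py
+>>> # Reuses `processor`, `model`, `dataset` and `sampling_rate` from the previous example.
+>>> audio_arrays = [audio["array"] for audio in dataset[:2]["audio"]]
+>>> inputs = processor(audio_arrays, sampling_rate=sampling_rate, return_tensors="pt", padding=True)
+>>> with torch.no_grad():
+...     logits = model(**inputs).logits
+>>> predicted_ids = torch.argmax(logits, dim=-1)
+>>> processor.batch_decode(predicted_ids)
+```
+
+`batch_decode` returns one transcription per clip in the batch.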
+
+## Image classification
+
+Like text and audio classification, image classification assigns a class to an image. The [CIFAR-100](https://huggingface.co/datasets/cifar100) dataset is an example dataset that can be used for image classification fine-tuning. It contains an image and the corresponding class. If you'd like to fine-tune a model for image classification, take a look at the [run_image_classification.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/image-classification/run_image_classification.py) script or this [how-to guide](./tasks/image_classification).
+
+The following examples demonstrate how to use a [`pipeline`] and a model and feature extractor for image classification inference:
+
+```py
+>>> from transformers import pipeline
+
+>>> vision_classifier = pipeline(task="image-classification")
+>>> vision_classifier(
+...     images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
+... )
+[{'label': 'lynx, catamount', 'score': 0.4403027892112732},
+ {'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor',
+  'score': 0.03433405980467796},
+ {'label': 'snow leopard, ounce, Panthera uncia',
+  'score': 0.032148055732250214},
+ {'label': 'Egyptian cat', 'score': 0.02353910356760025},
+ {'label': 'tiger cat', 'score': 0.023034192621707916}]
+```
+
+The general process for using a model and feature extractor for image classification is:
+
+1. Instantiate a feature extractor and a model from the checkpoint name.
+2. Process the image to be classified with the feature extractor.
+3. Pass the input through the model and take the `argmax` to retrieve the predicted class.
+4. Convert the class id to a class name with `id2label` to return an interpretable result.
+
+```py
+>>> from transformers import AutoFeatureExtractor, AutoModelForImageClassification
+>>> import torch
+>>> from datasets import load_dataset
+
+>>> dataset = load_dataset("huggingface/cats-image")
+>>> image = dataset["test"]["image"][0]
+
+>>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
+>>> model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")
+
+>>> inputs = feature_extractor(image, return_tensors="pt")
+
+>>> with torch.no_grad():
+...     logits = model(**inputs).logits
+
+>>> predicted_label = logits.argmax(-1).item()
+>>> print(model.config.id2label[predicted_label])
+Egyptian cat
+```
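+
+The same model and feature extractor flow can be applied to the image the pipeline classified above. As a rough sketch (it assumes `requests` and `PIL` are installed and reuses the `feature_extractor`, `model`, and `torch` import from the previous example), you can download the image from its URL and classify it directly:
+
+```py
+>>> # Reuses `feature_extractor`, `model` and the `torch` import from the previous example.
+>>> import requests
+>>> from PIL import Image
+
+>>> url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
+>>> image = Image.open(requests.get(url, stream=True).raw)
+
+>>> inputs = feature_extractor(image, return_tensors="pt")
+>>> with torch.no_grad():
+...     logits = model(**inputs).logits
+>>> model.config.id2label[logits.argmax(-1).item()]
+```
+
+This lets you compare the raw model prediction with the pipeline output above.
+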