
Add SpeechEncoderDecoder & Speech2Text2 #13186

Merged

Conversation


@patrickvonplaten patrickvonplaten commented Aug 19, 2021

This PR adds Facebook's new Speech Translation models (see the linked paper), which are based on a pretrained Wav2Vec2 and achieve SOTA on CoVoST-2 (cc @kahne).

Since those checkpoints are based on Wav2Vec2, we can use this PR to create the SpeechEncoderDecoder class, which essentially allows one to combine any pretrained speech encoder with any text decoder model. The Speech Translation models are converted to fit the format of SpeechEncoderDecoderModel and should be used as follows:

import soundfile as sf
from datasets import load_dataset
from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel

model = SpeechEncoderDecoderModel.from_pretrained("facebook/s2t-wav2vec2-large-en-de")
processor = Speech2Text2Processor.from_pretrained("facebook/s2t-wav2vec2-large-en-de")

def map_to_array(batch):
    # read the raw waveform from disk
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
ds = ds.map(map_to_array)

# the processor wraps a Wav2Vec2 feature extractor, so it returns "input_values"
inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")
generated_ids = model.generate(inputs["input_values"], attention_mask=inputs["attention_mask"])
transcription = processor.batch_decode(generated_ids)
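
To build a new composite model from two independent checkpoints instead of loading an already-converted one, a minimal sketch (the two checkpoint names are illustrative placeholders, not checkpoints from this PR):

from transformers import SpeechEncoderDecoderModel

# Combine a pretrained speech encoder with a pretrained text decoder;
# the decoder's cross-attention weights are newly initialized.
model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
    "facebook/wav2vec2-base-960h",  # illustrative speech encoder
    "bert-base-uncased",            # illustrative text decoder
)
model.save_pretrained("./wav2vec2-bert")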

Since the decoder and tokenizer differ from the previous Speech2Text model (https://github.com/huggingface/transformers/tree/master/src/transformers/models/speech_to_text), a new model folder speech_to_text_2 is created.
Currently, the tokenizer only supports decoding and not encoding (which is only needed for training), because the tokenizer's merges files are not published (cc @kahne).
The model can only be used in combination with SpeechEncoderDecoderModel.

The SpeechEncoderDecoderModel is also fully added in this PR, with tests for the Wav2Vec2Bert, Speech2TextBert, and Wav2Vec2SpeechToText2 combinations.

The ASR pipeline is slightly adapted to make it work with SpeechEncoderDecoder.

@LysandreJik @anton-l - it would be great if you could take a look at the general model architecture

@Narsil - it would be very nice if you could check the changes to the pipeline

All models are uploaded and can be accessed here: https://huggingface.co/models?other=speech2text2

Future TODO:

  • Currently the tokenizer supports only decoding, not training. If the community is interested in tokenizer training support for Speech2Text2 in the future, please ping @patrickvonplaten

Review thread on the ASR pipeline changes:

input_ids = processed["input_features"]
tokens = self.model.generate(input_ids=input_ids)
if name.endswith("ForConditionalGeneration") or name.endswith("DecoderModel"):
    encoder = self.model.get_encoder()
patrickvonplaten (Contributor, Author) commented:
This is more general IMO -> not all models have "input_features" as the actual input

A reviewer (Contributor) commented:
Why did GitHub eat my comment? I don't understand it sometimes.

  1. It feels odd to get an encoder from a DecoderModel. It might exist, but it's a bit odd; at the very least we need a small comment to explain why it's valid.
  2. I really don't like mixing processed and encoder_outputs within the generate call. processed is data coming from the feature_extractor; it should contain everything necessary to run the model by itself. Having two staged calls (encoder, then generate) could be fine, but now we're mixing processed and encoder_outputs, which have no reason to coexist, so we create a dependency between them that we have to maintain.

patrickvonplaten (Contributor, Author) commented Aug 27, 2021:

  1. Fair point - I can change it to name.endswith("EncoderDecoderModel") to make it more readable. Note that this if statement only works for encoder-decoder models.

  2. To break it down a bit: we are starting to get more and more exotic speech seq2seq models, which means we can already have both input_values and input_features as possible input formats/names. generate() usually expects input_ids, so it's not really made for generating from input_values or input_features. However, all generate functionality (whether speech-features-to-text or speech-values-to-text) has one thing in common: it takes encoded hidden states and starts auto-regressively generating text (it's actually the same for image captioning and text generation). So, to make the pipeline more general, I think it makes a lot of sense to call the encoder (every seq2seq model has an encoder) and first produce the hidden states to be fed to the cross-attention layer. The decoder also needs the original attention mask along with the hidden states (all our seq2seq models pass an (encoder_)attention_mask to the decoder).
    => This is exactly what we are doing here, and I honestly don't see why it's a problem. We just can't leave it as it is now if we want it to work for both input_values and input_features.

patrickvonplaten (Contributor, Author) commented Aug 27, 2021:

Also, I'm having trouble understanding what is meant by "having to maintain a dependency". That's what generate does under the hood for seq2seq:

input_ids, attention_mask
    -> encoder_outputs = encoder(input_ids), attention_mask
    -> generate(encoder_outputs=encoder_outputs, attention_mask=attention_mask)

All I'm doing here is skipping the input_ids, attention_mask -> encoder_outputs = encoder(input_ids) part, because generate() from encoder inputs is not general enough (and shouldn't be!!!) to handle multiple different inputs. The whole idea of the big generate refactor a while back (https://discuss.huggingface.co/t/big-generate-refactor/1857) was exactly to enable use cases such as these.
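
Concretely, a minimal sketch of that equivalence, reusing the model and inputs from the example in the PR description (assuming generate() accepts precomputed encoder_outputs, as the refactor above enables):

# One-stage call: generate() runs the encoder internally.
tokens_a = model.generate(inputs["input_values"], attention_mask=inputs["attention_mask"])

# Equivalent two-stage call: run the encoder explicitly, then generate from
# the precomputed hidden states together with the original attention mask.
encoder_outputs = model.get_encoder()(
    inputs["input_values"], attention_mask=inputs["attention_mask"]
)
tokens_b = model.generate(
    encoder_outputs=encoder_outputs, attention_mask=inputs["attention_mask"]
)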

@patrickvonplaten patrickvonplaten changed the title [WIP][Speech2Text] Add s2t wav2vec2 Add SpeechEncoderDecoder & Speech2Text2 Aug 26, 2021

kahne commented Aug 27, 2021

@patrickvonplaten I've updated the tarball with the fastBPE codes file (bpe.10k). Please re-download and let me know if you have questions :)

sgugger (Collaborator) left a comment:

Thanks for adding those models!

@@ -233,7 +234,7 @@ def tokenizer_class_from_name(class_name: str):
        module_name = "openai"

    module = importlib.import_module(f".{module_name.replace('-', '_')}", "transformers.models")
-   return getattr(module, class_name)
+   return getattr(module, class_name, None)
sgugger (Collaborator) commented:

Yes, but it should at this stage: module should have class_name as an attribute. With this change you will silence an error that should be raised.
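
For illustration, the plain-Python getattr semantics at issue here (the module and class names are just examples):

import importlib

bert = importlib.import_module(".bert", "transformers.models")

# Without a default, a missing attribute fails loudly:
tok = getattr(bert, "BertTokenizer")  # raises AttributeError if the name is wrong
# With None as the default, the same failure is silenced:
maybe_tok = getattr(bert, "BertTokenizer", None)  # returns None instead of raising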

LysandreJik (Member) left a comment:

Great work on this! Played with it locally, and it seems to run well. As with the other encoder-decoders, a notebook would be incredibly useful for knowing how to get started.

>>> model = SpeechEncoderDecoderModel.from_pretrained('my-model', config=encoder_decoder_config)
"""
model_type = "speech-encoder-decoder"
is_composition = True
A maintainer (Member) commented:
Should this be uniformized across text-2-text encoder-decoders as well? (Replacing the flag is_encoder_decoder with is_composition)

patrickvonplaten (Contributor, Author) commented:

Actually those are two separate things (see the sketch below):

  • is_encoder_decoder is usually an instance variable that is True for all text-2-text models
  • is_composition is a class variable that is only True for EncoderDecoderModel, RAG, and now this one
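
A minimal sketch of the distinction (BartConfig is picked here purely as an example of a text-2-text config; the is_composition default on the base config is an assumption):

from transformers import BartConfig, SpeechEncoderDecoderConfig

# Instance variable, set per config instance for text-2-text models:
print(BartConfig().is_encoder_decoder)  # True

# Class variable, set only on composite configs:
print(SpeechEncoderDecoderConfig.is_composition)  # True
print(BartConfig.is_composition)  # False (inherited default)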

@patrickvonplaten patrickvonplaten merged commit 0b8c84e into huggingface:master Sep 1, 2021
@patrickvonplaten patrickvonplaten deleted the add_s2t_wav2vec2 branch September 1, 2021 11:33
Diff context from the docs model list:

Enhanced Transformer with Rotary Position Embedding <https://arxiv.org/pdf/2104.09864v1.pdf>`__ by Jianlin Su and
Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
56. :doc:`SpeechToTextTransformer <model_doc/speech_to_text>` (from Facebook), released together with the paper
57. `SpeechEncoderDecoder <https://huggingface.co/transformers/master/model_doc/speechencoderdecoder.html>`__
A contributor commented:

@patrickvonplaten This line seems broken as it has no paper reference.

The same contributor followed up:

Is it intended to be?

patrickvonplaten (Contributor, Author) replied:

There is actually no paper for this architecture (we should do one ;-))

The contributor replied:

Got it, thanks :-)

alabrashJr commented:

Hi @patrickvonplaten,
I was trying out https://huggingface.co/facebook/s2t-wav2vec2-large-en-tr, but I'm getting an error when I try to use the model. There is no error message or stack trace available that I can share. I also tried to run it as a Python script, but that didn't work either:

from transformers import SpeechEncoderDecoderConfig
Traceback (most recent call last):
  File "main.py", line 1, in <module>
    from transformers import SpeechEncoderDecoderConfig
ImportError: cannot import name 'SpeechEncoderDecoderConfig' from 'transformers' (/venv/lib/python3.7/site-packages/transformers/__init__.py)

However, PyCharm also shows an error that this package is not available.

transformers __version__ = "4.10.2"
python 3.7.4
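
(A plausible explanation given the timeline above: this PR was merged on Sep 1, 2021, after the 4.10 release, so the new classes would only ship in a later release. A minimal version check, where the exact minimum version is an assumption:)

from packaging import version
import transformers

# SpeechEncoderDecoderConfig landed with this PR; assume it first
# shipped in transformers 4.11.0.
if version.parse(transformers.__version__) < version.parse("4.11.0"):
    raise RuntimeError("upgrade transformers (pip install -U transformers) or install from source")

from transformers import SpeechEncoderDecoderConfig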
