
Add SpeechEncoderDecoder & Speech2Text2 #13186

Merged

Conversation


@patrickvonplaten patrickvonplaten commented Aug 19, 2021

This PR adds Facebook's new Speech Translation models (see the linked paper), which are based on a pretrained Wav2Vec2 and achieve SOTA on CoVoST-2 (cc @kahne).

Since those checkpoints are based on Wav2Vec2, we can use this PR to create the SpeechEncoderDecoder class, which essentially allows one to combine any pretrained speech encoder with any text decoder model. The Speech Translation models are converted to fit the format of SpeechEncoderDecoderModel and should be used as follows:

import soundfile as sf
from datasets import load_dataset
from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel

model = SpeechEncoderDecoderModel.from_pretrained("facebook/s2t-wav2vec2-large-en-de")
processor = Speech2Text2Processor.from_pretrained("facebook/s2t-wav2vec2-large-en-de")

def map_to_array(batch):
    # read the raw waveform from disk
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
ds = ds.map(map_to_array)

# the processor wraps a Wav2Vec2 feature extractor, so it returns "input_values"
inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")
generated_ids = model.generate(inputs["input_values"], attention_mask=inputs["attention_mask"])
transcription = processor.batch_decode(generated_ids)
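
To build a new composite model from two independent checkpoints instead of loading an already-converted one, a minimal sketch (the two checkpoint names are illustrative placeholders, not checkpoints from this PR):

from transformers import SpeechEncoderDecoderModel

# Combine a pretrained speech encoder with a pretrained text decoder;
# the decoder's cross-attention weights are newly initialized.
model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
    "facebook/wav2vec2-base-960h",  # illustrative speech encoder
    "bert-base-uncased",            # illustrative text decoder
)
model.save_pretrained("./wav2vec2-bert")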

Since the decoder and tokenizer differ from the previous Speech2Text model (https://github.com/huggingface/transformers/tree/master/src/transformers/models/speech_to_text), a new model folder speech_to_text_2 is created.
Currently, the tokenizer only supports decoding and not encoding (which is only needed for training), because the tokenizer's merges files are not published (cc @kahne).
The model can only be used in combination with SpeechEncoderDecoderModel.

The SpeechEncoderDecoderModel is also fully added in this PR, with tests for the Wav2Vec2Bert, Speech2TextBert, and Wav2Vec2SpeechToText2 combinations.

The ASR pipeline is slightly adapted to make it work with SpeechEncoderDecoder.

@LysandreJik @anton-l - it would be great if you could take a look at the general model architecture

@Narsil - it would be very nice if you could check the changes to the pipeline

All models are uploaded and can be accessed here: https://huggingface.co/models?other=speech2text2

Future TODO:

  • Currently the tokenizer supports only decoding, not training. If the community is interested in tokenizer training support for Speech2Text2 in the future, please ping @patrickvonplaten

Review thread on the ASR pipeline changes:

input_ids = processed["input_features"]
tokens = self.model.generate(input_ids=input_ids)
if name.endswith("ForConditionalGeneration") or name.endswith("DecoderModel"):
    encoder = self.model.get_encoder()
patrickvonplaten (Contributor, Author) commented:
This is more general IMO -> not all models have "input_features" as the actual input

A reviewer (Contributor) commented:
Why did GitHub eat my comment? I don't understand it sometimes.

  1. It feels odd to get an encoder from a DecoderModel. It might exist, but it's a bit odd; at the very least we need a small comment to explain why it's valid.
  2. I really don't like mixing processed and encoder_outputs within the generate call. processed is data coming from the feature_extractor; it should contain everything necessary to run the model by itself. Having two staged calls (encoder, then generate) could be fine, but now we're mixing processed and encoder_outputs, which have no reason to coexist, so we create a dependency between them that we have to maintain.

patrickvonplaten (Contributor, Author) commented Aug 27, 2021:

  1. Fair point - I can change it to name.endswith("EncoderDecoderModel") to make it more readable. Note that this if statement only works for encoder-decoder models.

  2. To break it down a bit: we are starting to get more and more exotic speech seq2seq models, which means we can already have both input_values and input_features as possible input formats/names. generate() usually expects input_ids, so it's not really made for generating from input_values or input_features. However, all generate functionality (whether speech-features-to-text or speech-values-to-text) has one thing in common: it takes encoded hidden states and starts auto-regressively generating text (it's actually the same for image captioning and text generation). So, to make the pipeline more general, I think it makes a lot of sense to call the encoder (every seq2seq model has an encoder) and first produce the hidden states to be fed to the cross-attention layer. The decoder also needs the original attention mask along with the hidden states (all our seq2seq models pass an (encoder_)attention_mask to the decoder).
    => This is exactly what we are doing here, and I honestly don't see why it's a problem. We just can't leave it as it is now if we want it to work for both input_values and input_features.

patrickvonplaten (Contributor, Author) commented Aug 27, 2021:

Also, I'm having trouble understanding what is meant by "having to maintain a dependency". That's what generate does under the hood for seq2seq:

input_ids, attention_mask
    -> encoder_outputs = encoder(input_ids), attention_mask
    -> generate(encoder_outputs=encoder_outputs, attention_mask=attention_mask)

All I'm doing here is skipping the input_ids, attention_mask -> encoder_outputs = encoder(input_ids) part, because generate() from encoder inputs is not general enough (and shouldn't be!!!) to handle multiple different inputs. The whole idea of the big generate refactor a while back (https://discuss.huggingface.co/t/big-generate-refactor/1857) was exactly to enable use cases such as these.
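
Concretely, a minimal sketch of that equivalence, reusing the model and inputs from the example in the PR description (assuming generate() accepts precomputed encoder_outputs, as the refactor above enables):

# One-stage call: generate() runs the encoder internally.
tokens_a = model.generate(inputs["input_values"], attention_mask=inputs["attention_mask"])

# Equivalent two-stage call: run the encoder explicitly, then generate from
# the precomputed hidden states together with the original attention mask.
encoder_outputs = model.get_encoder()(
    inputs["input_values"], attention_mask=inputs["attention_mask"]
)
tokens_b = model.generate(
    encoder_outputs=encoder_outputs, attention_mask=inputs["attention_mask"]
)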

@patrickvonplaten patrickvonplaten changed the title [WIP][Speech2Text] Add s2t wav2vec2 Add SpeechEncoderDecoder & Speech2Text2 Aug 26, 2021

kahne commented Aug 27, 2021

@patrickvonplaten I've updated the tarball with the fastBPE codes file (bpe.10k). Please re-download and let me know if you have questions :)

sgugger (Collaborator) left a comment:

Thanks for adding those models!

@@ -233,7 +234,7 @@ def tokenizer_class_from_name(class_name: str):
        module_name = "openai"

    module = importlib.import_module(f".{module_name.replace('-', '_')}", "transformers.models")
-   return getattr(module, class_name)
+   return getattr(module, class_name, None)
sgugger (Collaborator) commented:

Yes, but it should at this stage: module should have class_name as an attribute. With this change you will silence an error that should be raised.
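
For illustration, the plain-Python getattr semantics at issue here (the module and class names are just examples):

import importlib

bert = importlib.import_module(".bert", "transformers.models")

# Without a default, a missing attribute fails loudly:
tok = getattr(bert, "BertTokenizer")  # raises AttributeError if the name is wrong
# With None as the default, the same failure is silenced:
maybe_tok = getattr(bert, "BertTokenizer", None)  # returns None instead of raising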

LysandreJik (Member) left a comment:

Great work on this! Played with it locally, and it seems to run well. As with the other encoder-decoders, a notebook would be incredibly useful for knowing how to get started.

>>> model = SpeechEncoderDecoderModel.from_pretrained('my-model', config=encoder_decoder_config)
"""
model_type = "speech-encoder-decoder"
is_composition = True
A maintainer (Member) commented:
Should this be uniformized across text-2-text encoder-decoders as well? (Replacing the flag is_encoder_decoder with is_composition)

patrickvonplaten (Contributor, Author) commented:

Actually those are two separate things (see the sketch below):

  • is_encoder_decoder is usually an instance variable that is True for all text-2-text models
  • is_composition is a class variable that is only True for EncoderDecoderModel, RAG, and now this one
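
A minimal sketch of the distinction (BartConfig is picked here purely as an example of a text-2-text config; the is_composition default on the base config is an assumption):

from transformers import BartConfig, SpeechEncoderDecoderConfig

# Instance variable, set per config instance for text-2-text models:
print(BartConfig().is_encoder_decoder)  # True

# Class variable, set only on composite configs:
print(SpeechEncoderDecoderConfig.is_composition)  # True
print(BartConfig.is_composition)  # False (inherited default)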

@patrickvonplaten patrickvonplaten merged commit 0b8c84e into huggingface:master Sep 1, 2021
@patrickvonplaten patrickvonplaten deleted the add_s2t_wav2vec2 branch September 1, 2021 11:33
Diff context from the docs model list:

Enhanced Transformer with Rotary Position Embedding <https://arxiv.org/pdf/2104.09864v1.pdf>`__ by Jianlin Su and
Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
56. :doc:`SpeechToTextTransformer <model_doc/speech_to_text>` (from Facebook), released together with the paper
57. `SpeechEncoderDecoder <https://huggingface.co/transformers/master/model_doc/speechencoderdecoder.html>`__
A contributor commented:

@patrickvonplaten This line seems broken as it has no paper reference.

The same contributor followed up:

Is it intended to be?

patrickvonplaten (Contributor, Author) replied:

There is actually no paper for this architecture (we should do one ;-))

The contributor replied:

Got it, thanks :-)

alabrashJr commented:

Hi @patrickvonplaten,
I was trying out https://huggingface.co/facebook/s2t-wav2vec2-large-en-tr, but I'm getting an error when I try to use the model. There is no error message or stack trace available that I can share. I also tried to run it as a Python script, but that didn't work either:

from transformers import SpeechEncoderDecoderConfig
Traceback (most recent call last):
  File "main.py", line 1, in <module>
    from transformers import SpeechEncoderDecoderConfig
ImportError: cannot import name 'SpeechEncoderDecoderConfig' from 'transformers' (/venv/lib/python3.7/site-packages/transformers/__init__.py)

However, PyCharm also shows an error that this package is not available.

transformers __version__ = "4.10.2"
python 3.7.4
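
(A plausible explanation given the timeline above: this PR was merged on Sep 1, 2021, after the 4.10 release, so the new classes would only ship in a later release. A minimal version check, where the exact minimum version is an assumption:)

from packaging import version
import transformers

# SpeechEncoderDecoderConfig landed with this PR; assume it first
# shipped in transformers 4.11.0.
if version.parse(transformers.__version__) < version.parse("4.11.0"):
    raise RuntimeError("upgrade transformers (pip install -U transformers) or install from source")

from transformers import SpeechEncoderDecoderConfig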
