Add SpeechEncoderDecoder & Speech2Text2 #13186
Conversation
Merge branch 'master' of https://github.com/patrickvonplaten/transformers
input_ids = processed["input_features"]
tokens = self.model.generate(input_ids=input_ids)
if name.endswith("ForConditionalGeneration") or name.endswith("DecoderModel"):
    encoder = self.model.get_encoder()
This is more general IMO -> not all models have "input_features" as the actual input.
Why did GitHub eat my comment away, I don't understand sometimes.

- It feels odd to get an `encoder` from a `DecoderModel`. It might exist, but it's a bit odd; at the very least we need a small comment to explain why it's valid.
- I really don't like mixing `processed` and `encoder_outputs` within the `generate` call. `processed` is data coming from the feature_extractor; it should contain everything necessary to run the model by itself. Having two staged calls (encoder, then generate) could be fine, but now we're mixing `processed` and `encoder_outputs`, which have no reason to coexist, so we create a dependency between them which we have to maintain.
Fair point - I can change it to `name.endswith("EncoderDecoderModel")` to make it more readable. Note that this `if` statement only works for encoder-decoder models.

To break it down a bit: we are starting to get more and more exotic speech seq2seq models, which means we can now already have `input_values` and `input_features` as possible input formats/names. `generate()` usually expects `input_ids`, so it's not really made for generating from `input_values` or `input_features`. However, all generate functionality (whether speech-features-to-text or speech-values-to-text) has one thing in common: it takes encoded hidden states and starts auto-regressively generating text (it's actually the same for image captioning and text generation). So to make the pipeline more general, I think it makes a lot of sense to call the encoder (every seq2seq model has an encoder) and first produce the hidden_states to be fed to the cross-attention layer. The decoder also needs the original attention mask along with the hidden states (all our seq2seq models pass an `(encoder_)attention_mask` to the decoder).

=> This is exactly what we are doing here, and I honestly don't see why this is a problem. We just can't leave it as it is now if we want it to work for both `input_values` and `input_features`.
Also, I'm having trouble understanding what is meant by "having to maintain a dependency". That's what `generate` does under the hood for seq2seq:

input_ids, attention_mask -> encoder_outputs = encoder(input_ids) -> generate(encoder_outputs=encoder_outputs, attention_mask=attention_mask)

All I'm doing here is skipping the `encoder_outputs = encoder(input_ids)` part, because `generate()` from encoder inputs is not general enough (and shouldn't be!) to handle multiple different inputs. The whole idea of the big generate refactor a while back (https://discuss.huggingface.co/t/big-generate-refactor/1857) was exactly to enable use cases such as those.
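The two-stage flow being debated (run the encoder once, then decode from the cached outputs plus the original attention mask) can be illustrated with a minimal, self-contained sketch. These are toy stand-ins for the encoder and `generate`, not the real transformers API:

```python
# Toy illustration of pre-computing encoder outputs before decoding.
# The real pipeline calls self.model.get_encoder() and then passes the
# result (plus the original attention_mask) into model.generate();
# here we mimic that control flow with plain functions.

def encoder(input_features):
    # Stand-in for the speech encoder: map raw features to hidden states.
    return [x * 2 for x in input_features]

def generate(encoder_outputs, attention_mask, max_new_tokens=3):
    # Stand-in for auto-regressive decoding from cross-attention states:
    # emit one "token" per step, derived from the (masked) hidden states.
    context = sum(h for h, m in zip(encoder_outputs, attention_mask) if m)
    return [context + step for step in range(max_new_tokens)]

# Stage 1: run the encoder once, whatever the input name/format was
# (input_values, input_features, ...).
features = [1, 2, 3]
attention_mask = [1, 1, 0]  # last frame is padding
encoder_outputs = encoder(features)

# Stage 2: decode from the cached encoder outputs + original mask.
tokens = generate(encoder_outputs=encoder_outputs, attention_mask=attention_mask)
print(tokens)  # [6, 7, 8]
```

The point of the sketch is only that stage 2 never sees the raw input: once the hidden states exist, decoding is the same regardless of which input name the model accepts.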
@patrickvonplaten I've updated the tarball with the fastBPE codes file.
Thanks for adding those models!
@@ -233,7 +234,7 @@ def tokenizer_class_from_name(class_name: str):
         module_name = "openai"

     module = importlib.import_module(f".{module_name.replace('-', '_')}", "transformers.models")
-    return getattr(module, class_name)
+    return getattr(module, class_name, None)
Yes, but at this stage it should: `module` should have `class_name` as an attribute. With this change you will silence an error that should be raised.
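The behavioral difference at issue is plain-Python `getattr` semantics, nothing transformers-specific: the two-argument form fails loudly at the lookup site, while the three-argument form with a `None` default defers the failure to wherever the `None` is eventually used.

```python
import types

# A stand-in module that does NOT define the class we look up.
module = types.ModuleType("fake_models")

# Two-argument getattr: a missing attribute raises AttributeError,
# surfacing the bug at the lookup site.
try:
    getattr(module, "MissingTokenizer")
    raised = False
except AttributeError:
    raised = True
print(raised)  # True

# Three-argument getattr: the same missing attribute silently
# becomes None, and the error only shows up later (if at all).
cls = getattr(module, "MissingTokenizer", None)
print(cls)  # None
```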
Great work on this! Played with it locally, seems to run well. Same as for the other encoder-decoders, a notebook would be incredibly useful to know how to get started.
>>> model = SpeechEncoderDecoderModel.from_pretrained('my-model', config=encoder_decoder_config)
"""

model_type = "speech-encoder-decoder"
is_composition = True
Should this be uniformized across text-2-text encoder-decoders as well? (Replacing the flag `is_encoder_decoder` with `is_composition`.)
Actually, those are two separate things:

- `is_encoder_decoder` is usually an instance variable that is true for all text-2-text models.
- `is_composition` is a class variable that is only true for `EncoderDecoderModel`, `RAG`, and now this one.
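The instance-variable vs. class-variable distinction drawn above can be sketched with hypothetical config classes (illustrative names, not the actual transformers definitions):

```python
class PretrainedConfigSketch:
    # Class variable: a property of the config *class* itself.
    # Composite configs override it at class level.
    is_composition = False

    def __init__(self, is_encoder_decoder=False):
        # Instance variable: set per config object; True for any
        # text-2-text model regardless of its config class.
        self.is_encoder_decoder = is_encoder_decoder


class SpeechEncoderDecoderConfigSketch(PretrainedConfigSketch):
    # Only composite configs (EncoderDecoder, RAG, ...) flip this.
    is_composition = True


t5_like = PretrainedConfigSketch(is_encoder_decoder=True)
composite = SpeechEncoderDecoderConfigSketch(is_encoder_decoder=True)

print(t5_like.is_encoder_decoder, t5_like.is_composition)      # True False
print(composite.is_encoder_decoder, composite.is_composition)  # True True
```

Both configs describe encoder-decoder models at the instance level, but only the composite one carries `is_composition = True` on its class, which is why the two flags cannot simply be merged.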
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Enhanced Transformer with Rotary Position Embedding <https://arxiv.org/pdf/2104.09864v1.pdf>`__ by Jianlin Su and
Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
56. :doc:`SpeechToTextTransformer <model_doc/speech_to_text>` (from Facebook), released together with the paper
57. `SpeechEncoderDecoder <https://huggingface.co/transformers/master/model_doc/speechencoderdecoder.html>`__
@patrickvonplaten This line seems broken as it has no paper reference.
Is it intended to be?
There is actually no paper for this architecture (we should do one ;-))
Got it, thanks :-)
Hi @patrickvonplaten, however, PyCharm gives me an error that this package is not available: `transformers`.
This PR adds Facebook's new Speech Translation models - see paper here - that are based on a pretrained Wav2Vec2 and achieve SOTA on CoVoST-2 (cc @kahne).

Since those checkpoints are based on `Wav2Vec2`, we can use this PR to create the `SpeechEncoderDecoder` class, which essentially allows one to use any pretrained speech encoder with any text decoder model. The Speech Translation models are converted to fit the format of `SpeechEncoderDecoderModel`.

Since the decoder and tokenizer differ from the previous `Speech2Text` model (https://github.com/huggingface/transformers/tree/master/src/transformers/models/speech_to_text), a new model folder speech_to_text_2 is created. Currently, the tokenizer only supports decoding and not encoding (which is only needed for training) because the tokenizer merges files are not published (cc @kahne).
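The composition idea ("any pretrained speech encoder with any text decoder") can be sketched with toy classes. These are conceptual stand-ins, not the real transformers classes, and the arithmetic is only there to make the sketch runnable:

```python
class ToySpeechEncoder:
    def __call__(self, input_values):
        # Downsample raw audio frames into "hidden states"
        # (real encoders like Wav2Vec2 do this with conv + transformer layers).
        return [sum(input_values[i:i + 2]) for i in range(0, len(input_values), 2)]


class ToyTextDecoder:
    def __call__(self, hidden_states):
        # Cross-attend to (here: just read) the hidden states and emit token ids.
        return [h % 10 for h in hidden_states]


class ToySpeechEncoderDecoder:
    # Mirrors the SpeechEncoderDecoderModel idea: glue an arbitrary
    # speech encoder to an arbitrary text decoder behind one interface.
    def __init__(self, encoder, decoder):
        self.encoder = encoder
        self.decoder = decoder

    def generate(self, input_values):
        return self.decoder(self.encoder(input_values))


model = ToySpeechEncoderDecoder(ToySpeechEncoder(), ToyTextDecoder())
print(model.generate([3, 4, 10, 9]))  # [7, 9]
```

Because encoder and decoder only meet at the hidden-state interface, either half can be swapped independently, which is exactly what makes the converted Speech Translation checkpoints fit this format.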
The model can only be used in combination with `SpeechEncoderDecoderModel`.

The `SpeechEncoderDecoderModel` is also fully added in this PR, and tests for `Wav2Vec2Bert`, `Speech2TextBert`, and `Wav2Vec2SpeechToText2` are added.

The ASR pipeline is slightly adapted to make it work with `SpeechEncoderDecoder`.

@LysandreJik @anton-l - it would be great if you could take a look at the general model architecture.
@Narsil - it would be very nice if you could check the changes to the pipeline.

All models are uploaded and can be accessed here: https://huggingface.co/models?other=speech2text2

Future TODO: `Speech2Text2` - in the future, please ping @patrickvonplaten.