Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add EnCodec model #23655

Merged
merged 139 commits into from
Jun 14, 2023
Merged

Add EnCodec model #23655

merged 139 commits into from
Jun 14, 2023

Conversation

hollance
Copy link
Contributor

What does this PR do?

Adds the EnCodec neural codec from the High Fidelity Neural Audio Compression paper.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker ArthurZucker requested a review from amyeroberts June 14, 2023 09:37
Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Thanks for iterating so quickly on this :)

There's just one main behaviour regarding the handling of padding and truncation flags, and returned outputs which I think needs to be addressed before merging, otherwise just a few nits and is good to go.

tests/models/encodec/test_modeling_encodec.py Outdated Show resolved Hide resolved

original_checkpoint = torch.load(checkpoint_path)
recursively_load_weights(original_checkpoint, model, model_name)
model.save_pretrained(pytorch_dump_folder_path)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to check that the outputs of the original and converted model are the same (to some tolerance) before pushing to the hub here

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, people can have their own fine-tuned/modified checkpoints and we don't import the model from the original library so the outputs are not fixed.

Comment on lines 184 to 187
return_attention_mask=True,
)
if padding:
padded_inputs["padding_mask"] = padded_inputs.pop("attention_mask")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, when you say 'we ignore attention_mask', what does 'ignore' mean?

I don't believe this is what is being returned now - there's a padding_mask when truncation=True

self.assertTrue(torch.allclose(input_values[0, 0, :30], EXPECTED_INPUT_VALUES, atol=1e-6))
self.assertTrue(torch.allclose(input_values[0, 1, :30], EXPECTED_INPUT_VALUES * 0.5, atol=1e-6))

def test_kwargs(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This hasn't been resolved. In reality, we shouldn't allow both of these arguments to be passed - and we should add a test that checks an exception is raised if that happens

input_values (`torch.FloatTensor` of shape `(batch_size, channels, sequence_length)`, *optional*):
Raw audio input converted to Float and padded to the approriate length in order to be encoded using chunks
of length self.chunk_length and a stride of `config.chunk_stride`.
padding_mask (`torch.BoolTensor` of shape `(batch_size, sequence_length)`, *optional*):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, is the shape of input_values correct?

AFAICT, if padding_mask is None, then it's created as:

padding_mask = torch.ones_like(input_values).bool()

which would imply it is the same dimensions as input_values. This also doesn't match the shape in the encode, and decode docstrings.

@ArthurZucker
Copy link
Collaborator

Ok, addressed everything! feel free to merge if it's good with you @amyeroberts

@ArthurZucker ArthurZucker requested a review from amyeroberts June 14, 2023 14:23
Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautiful - thanks again for adding this model and iterating!

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
@ArthurZucker ArthurZucker merged commit 0c3fdcc into huggingface:main Jun 14, 2023
@sgugger sgugger changed the title [WIP] add EnCodec model Add EnCodec model Jun 14, 2023
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023
* boilerplate stuff

* messing around with the feature extractor

* fix feature extractor

* unit tests for feature extractor

* rename speech to audio

* quick-and-dirty import of Meta's code

* import weights (sort of)

* cleaning up

* more cleaning up

* move encoder/decoder args into config

* cleanup model

* rename EnCodec -> Encodec

* RVQ parameters in config

* add slow test

* add lstm init and test_init

* Add save & load

* finish EncodecModel

* remove decoder_input_values as they are ont used anywhere (not removed from doc yet)

* fix test feature extraction model name

* Add better slow test

* Fix tests

* some fixup and cleaning

* Improve further

* cleaning up quantizer

* fix up conversion script

* test don't pass, _encode_fram does not work

* update tests with output per encode and decode

* more cleanup

* rename _codebook

* remove old config cruft

* ratios & hop_length

* use ModuleList instead of Sequential

* clean up resnet block

* update types

* update tests

* fixup

* quick cleanup

* fix padding

* more styl,ing

* add patrick feedback

* fix copies

* fixup

* fix lstm

* fix shape issues

* fixup

* rename conv layers

* fixup

* fix decoding

* small conv refactoring

* remove norm_params

* simplify conv layers

* rename conv layers

* stuff

* Clean up

* Add padding logic

use padding mask

small conv refactoring

remove norm_params

simplify conv layers

rename conv layers

stuff

add batched test

update

Clean up

merge and update for padding

fix padding

fixup

* clean up more

* clean up more

* More clean ups

* cleanup convolutions

* typo

* fix typos

* fixup

* build PR doc?

* start refactoring docstring

* fix don't pad when no strid and chunk

* update docstring

* update docstring

* nits

* update going to lunch

* update config and model

* fix broken testse (becaue of the config changes)

* fix scale computation

* fixu[

* only return dict if speciefied or if config returns it

* remove todos

* update defaults in config

* update conversion script

* fix doctest

* more docstring + fixup

* nits on batched_tests

* more nits

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* update basxed on review

* fix update

* updaet tests

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fixup

* add overlap and chunl_length_s

* cleanup feature extraction

* teste edge cases truncation and padding

* correct processor values

* update config encodec, nits

* fix tests

* fixup

* fix 24Hz test

* elle tests are green

* fix fixup

* Apply suggestions from code review

* revert readme changes

* fixup

* add example

* use facebook checkpoints

* fix typo

* no pipeline tests

* use slef.pad everywhere we can

* Apply suggestions from code review

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* update based on review

* update

* update mdx

* fix bug and tests

* fixup

* fix doctest

* remove comment

* more nits

* add more coverage for `test_truncation_and_padding`

* fixup

* add last test

* fix text

* nits

* Update tests/models/encodec/test_modeling_encodec.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* take care of the last comments

* typo

* fix test

* nits

* fixup

* Update src/transformers/models/encodec/feature_extraction_encodec.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: arthur.zucker@gmail.com <arthur.zucker@gmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Comment on lines +569 to +571
if self.config.normalize:
# if the padding is non zero
input_values = input_values * padding_mask
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering why padding_mask is not used in case normalize=False?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants