
Remove static pretrained maps from the library's internals #29112

Merged · 13 commits into main · Mar 25, 2024

Conversation

LysandreJik (Member)

No description provided.

@LysandreJik (Member Author) commented Feb 19, 2024

Before this ungodly PR gets merged, I need to check that every checkpoint referenced here behaves the same once its pretrained map has been removed.

I'll link the PRs opened as a result in this comment.
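
For context, the static maps this PR removes are module-level constants hard-coded in each model's tokenizer and configuration modules, roughly of the following shape (an illustrative sketch only; the URL and size below are assumptions rather than values copied from the library):

# Hard-coded checkpoint -> vocab-file map formerly defined next to each tokenizer class
PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {
        "google-bert/bert-base-uncased": "https://huggingface.co/google-bert/bert-base-uncased/resolve/main/vocab.txt",
    }
}

# Hard-coded maximum input size per checkpoint
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
    "google-bert/bert-base-uncased": 512,
}

Once these constants are gone, the same information has to come from each checkpoint's own files on the Hub, which is why every repo listed below needed a PR of its own.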

🟣: merged
🟢: open
🔴: closed
🟡: not open yet

Repos with PRs opened

ALBERT

🟣 albert/albert-base-v1#2
🟣 albert/albert-base-v2#6
🟣 albert/albert-large-v1#2
🟣 albert/albert-large-v2#3
🟣 albert/albert-xlarge-v1#2
🟣 albert/albert-xlarge-v2#2
🟣 albert/albert-xxlarge-v1#3
🟣 albert/albert-xxlarge-v2#3

BERT

🟣 google-bert/bert-base-cased#10
🟣 google-bert/bert-base-cased-finetuned-mrpc#2
🟣 google-bert/bert-base-chinese#16
🟣 google-bert/bert-base-german-cased#5
🟣 google-bert/bert-base-german-dbmdz-cased#4
🟣 google-bert/bert-base-german-dbmdz-uncased#4
🟣 google-bert/bert-base-multilingual-cased#5
🟣 google-bert/bert-base-multilingual-uncased#5
🟣 google-bert/bert-base-uncased#62
🟣 google-bert/bert-large-cased#3
🟣 google-bert/bert-large-cased-whole-word-masking#2
🟣 google-bert/bert-large-cased-whole-word-masking-finetuned-squad#2
🟣 google-bert/bert-large-uncased#3
🟣 google-bert/bert-large-uncased-whole-word-masking#3
🟣 google-bert/bert-large-uncased-whole-word-masking-finetuned-squad#4

CamemBERT

🟢 almanach/camembert-base#7

CTRL

🟣 Salesforce/ctrl#4

DistilBERT

🟣 distilbert/distilgpt2#11
🟣 distilbert/distilroberta-base#4

GPT-2

🟣 openai-community/gpt2#80
🟣 openai-community/gpt2-large#7
🟣 openai-community/gpt2-medium#13
🟣 openai-community/gpt2-xl#9

OpenAI GPT

🟣 openai-community/openai-gpt#6

RoBERTa

🟣 FacebookAI/roberta-base#12
🟣 openai-community/roberta-base-openai-detector#17
🟣 FacebookAI/roberta-large#6
🟣 FacebookAI/roberta-large-mnli#8
🟣 openai-community/roberta-large-openai-detector#5
🟣 FacebookAI/xlm-roberta-base#27
🟣 FacebookAI/xlm-roberta-large#17
🟣 FacebookAI/xlm-roberta-large-finetuned-conll02-dutch#3
🟣 FacebookAI/xlm-roberta-large-finetuned-conll02-spanish#3
🟣 FacebookAI/xlm-roberta-large-finetuned-conll03-english#11
🟣 FacebookAI/xlm-roberta-large-finetuned-conll03-german#4

Non-canonical models

🟣 facebook/m2m100_418M#16
🟣 openai/clip-vit-base-patch32#13
🟣 google/bigbird-roberta-large#2
🟣 google/bigbird-base-trivia-itc#2
🟢 google/rembert#2
🟢 YituTech/conv-bert-base#4
🟢 YituTech/conv-bert-medium-small#2
🟢 YituTech/conv-bert-small#2
🔴 facebook/wav2vec2-lv-60-espeak-cv-ft#5
🟣 facebook/blenderbot_small-90M#5
🟣 funnel-transformer/small#2
🟣 funnel-transformer/small-base#2
🟣 funnel-transformer/medium#2
🟣 funnel-transformer/medium-base#1
🟣 funnel-transformer/intermediate#2
🟣 funnel-transformer/intermediate-base#2
🟣 funnel-transformer/large#3
🟣 funnel-transformer/large-base#1
🟣 funnel-transformer/xlarge#2
🟣 funnel-transformer/xlarge-base#2
🟣 flaubert/flaubert_small_cased#2
🟣 flaubert/flaubert_base_uncased#2
🟣 flaubert/flaubert_base_cased#2
🟣 flaubert/flaubert_large_cased#2
🟣 google/realm-cc-news-pretrained-embedder#1
🟣 google/realm-cc-news-pretrained-encoder#1
🟣 google/realm-cc-news-pretrained-scorer#1
🟣 google/realm-cc-news-pretrained-openqa#1
🟣 google/realm-orqa-nq-openqa#1
🟣 google/realm-orqa-nq-reader#1
🟣 google/realm-orqa-wq-openqa#1
🟣 google/realm-orqa-wq-reader#1
🟣 google/fnet-base#1
🟣 google/fnet-large#1
🟣 microsoft/mpnet-base#4
🟣 google/reformer-crime-and-punishment#2
🟢 facebook/s2t-wav2vec2-large-en-de#3
🟢 allenai/longformer-base-4096#6
🟢 allenai/longformer-large-4096#3
🟢 allenai/longformer-large-4096-finetuned-triviaqa#4
🟢 allenai/longformer-base-4096-extra.pos.embd.only#2
🟢 allenai/longformer-large-4096-extra.pos.embd.only#2
🟣 cl-tohoku/bert-base-japanese#2
🟣 cl-tohoku/bert-base-japanese-whole-word-masking#3
🟣 cl-tohoku/bert-base-japanese-char#2
🟣 cl-tohoku/bert-base-japanese-char-whole-word-masking#2
🟣 google/electra-small-generator#4
🟣 google/electra-base-generator#2
🟣 google/electra-large-generator#2
🟣 google/electra-small-discriminator#1
🟣 google/electra-base-discriminator#3
🟣 google/electra-large-discriminator#1
🟢 microsoft/layoutlmv2-base-uncased#5
🟢 microsoft/layoutlmv2-large-uncased#3
🟢 microsoft/deberta-v2-xlarge#3
🟢 microsoft/deberta-v2-xxlarge#4
🟣 microsoft/deberta-v2-xlarge-mnli#2
🟢 microsoft/deberta-v2-xxlarge-mnli#2
🟢 vinai/bartpho-syllable#3
🟢 Helsinki-NLP/opus-mt-en-de#6
🟢 facebook/bart-base#4
🟢 facebook/bart-large-cnn#71
🟢 yjernite/bart_eli5#2
🟢 microsoft/layoutlm-base-uncased#3
🟢 microsoft/layoutlm-large-uncased#1
🟢 junnyu/roformer_chinese_char_small#1
🟢 junnyu/roformer_chinese_char_base#1
🟢 junnyu/roformer_small_discriminator#2
🟢 junnyu/roformer_small_generator#2
🟢 uclanlp/plbart-base#6
🟢 uclanlp/plbart-c-cpp-defect-detection#4
🟢 uclanlp/plbart-cs-java#1
🟢 uclanlp/plbart-en_XX-java#2
🟢 uclanlp/plbart-go-en_XX#1
🟢 uclanlp/plbart-java-clone-detection#2
🟢 uclanlp/plbart-java-cs#3
🟢 uclanlp/plbart-java-en_XX#2
🟢 uclanlp/plbart-javascript-en_XX#2
🟢 uclanlp/plbart-php-en_XX#2
🟢 uclanlp/plbart-python-en_XX#2
🟢 uclanlp/plbart-refine-java-medium#1
🟢 uclanlp/plbart-refine-java-small#2
🟢 uclanlp/plbart-ruby-en_XX#1
🟢 openbmb/cpm-ant-10b#2
🟢 microsoft/deberta-base#5
🟢 microsoft/deberta-large#4
🟢 microsoft/deberta-xlarge#2
🟢 microsoft/deberta-base-mnli#1
🟢 microsoft/deberta-large-mnli#1
🟢 microsoft/deberta-xlarge-mnli#5
🟢 vinai/bertweet-base#5
🟢 AI-Sweden-Models/gpt-sw3-126m#5
🟢 AI-Sweden-Models/gpt-sw3-356m#4
🟢 AI-Sweden-Models/gpt-sw3-1.3b#4
🟢 AI-Sweden-Models/gpt-sw3-6.7b#5
🟢 AI-Sweden-Models/gpt-sw3-6.7b-v2#4
🟢 AI-Sweden-Models/gpt-sw3-20b#6
🟣 AI-Sweden-Models/gpt-sw3-40b#4
🟢 facebook/mbart-large-50-one-to-many-mmt#7
🟢 facebook/xglm-564M#7
🟢 facebook/blenderbot-3B#7
🟢 microsoft/prophetnet-large-uncased#2
🟢 microsoft/xprophetnet-large-wiki100-cased#1
🟢 squeezebert/squeezebert-uncased#3
🟢 squeezebert/squeezebert-mnli#2
🟢 squeezebert/squeezebert-mnli-headless#2
🟢 abeja/gpt-neox-japanese-2.7b#5
🟢 microsoft/biogpt#26
🔴 facebook/wav2vec2-base-960h#11
🟣 moussaKam/mbarthez#2
🟣 moussaKam/barthez#3
🟣 moussaKam/barthez-orangesum-title#2
🟢 vinai/phobert-base#5
🟢 vinai/phobert-large#2
🟢 facebook/mbart-large-en-ro#3
🟢 facebook/mbart-large-cc25#5

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@@ -52,8 +43,6 @@ class {{cookiecutter.camelcase_modelname}}Tokenizer(BertTokenizer):

vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION

{%- elif cookiecutter.tokenizer_type == "Based on BART" %}
from ...utils import logging
Collaborator

The line with

PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
    "{{cookiecutter.checkpoint_identifier}}": 1024,
}

should also be removed!

Member Author

Correct! This needs to be adapted, as does the "new model addition" script that relies on these. Also, there are 5k failing tests that I need to fix before I un-draft it :)

Collaborator

Good luck!

@@ -53,8 +44,6 @@ class {{cookiecutter.camelcase_modelname}}TokenizerFast(BertTokenizerFast):

vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
slow_tokenizer_class = {{cookiecutter.camelcase_modelname}}Tokenizer

{%- elif cookiecutter.tokenizer_type == "Based on BART" %}
Collaborator

let's also clean up the BART-based template

Collaborator

and the Standalone one

pcuenca added a commit to huggingface/swift-transformers that referenced this pull request Feb 29, 2024
Fixes distilgpt2 tokenization.

Previously, we only used the fallback configuration if there was no
`tokenizer_config.json` in the model repo. These files are now being
added to some repos in the context of removing dependencies on
transformers' internals, like this PR:
huggingface/transformers#29112. But only keys
removed from the hardcoded rules are being added to minimize potential
breaking changes.

We now use the fallback config if tokenizer_config.json exists, no
tokenizer class is specified, and we do have a fallback config for this
architecture.
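
A minimal Python sketch of the selection logic described in that commit message (swift-transformers itself is written in Swift; the function and key names below are assumptions for illustration, not the library's API):

from typing import Optional

def resolve_tokenizer_config(
    repo_config: Optional[dict], fallback_config: Optional[dict]
) -> Optional[dict]:
    # No tokenizer_config.json in the model repo: use the fallback, if any.
    if repo_config is None:
        return fallback_config
    # tokenizer_config.json exists but names no tokenizer class:
    # prefer the known-good fallback config for this architecture, if available.
    if repo_config.get("tokenizer_class") is None and fallback_config is not None:
        return fallback_config
    # Otherwise trust the repo's own configuration.
    return repo_config

Under this rule, distilgpt2 (whose newly added tokenizer_config.json carries only the keys removed from the hardcoded maps, and therefore no tokenizer class) presumably picks up the fallback GPT-2 configuration again, which is the fix the commit describes.
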
@julien-c (Member) left a comment

casual G.O.A.T. = @LysandreJik

@LysandreJik LysandreJik force-pushed the remove-maps branch 2 times, most recently from 2ce0712 to dbedf2b on March 8, 2024 at 09:59
@LysandreJik LysandreJik changed the base branch from main to add_pretrained_ids_in_tests March 8, 2024 10:04
@LysandreJik (Member Author)

Updated the base to move some of the refactor to that PR: #29534

Base automatically changed from add_pretrained_ids_in_tests to main March 13, 2024 13:53
@LysandreJik LysandreJik force-pushed the remove-maps branch 3 times, most recently from 5f314cf to fbfb0e9 on March 14, 2024 at 16:13
@julien-c (Member)

why deprecate vs remove?

@LysandreJik (Member Author) commented Mar 14, 2024

Deprecate for 2 months, and then removing them is simply a matter of deleting a file.
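
As an illustration of what that deprecation window could look like, a wrapper of roughly this kind could live in src/transformers/models/deprecated/_archive_maps.py and warn on access (a sketch only; the class name, warning text, and example entry are assumptions, not necessarily what this PR implements):

import warnings
from collections import UserList

class DeprecatedList(UserList):
    """List that emits a FutureWarning whenever a deprecated archive list is read."""

    def __getitem__(self, index):
        warnings.warn(
            "Archive maps and lists are deprecated and will be removed from the "
            "library's internals; load checkpoints by name with `from_pretrained` instead.",
            FutureWarning,
        )
        return super().__getitem__(index)

# Example re-export kept only for backward compatibility during the deprecation window.
BERT_PRETRAINED_MODEL_ARCHIVE_LIST = DeprecatedList(["google-bert/bert-base-uncased"])

Once the window ends, removal amounts to deleting that file along with the re-imports that still reference it.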

@LysandreJik LysandreJik marked this pull request as ready for review March 14, 2024 16:58
@LysandreJik LysandreJik force-pushed the remove-maps branch 2 times, most recently from 9e70d94 to 3be48a5 on March 15, 2024 at 08:39
@LysandreJik (Member Author)

It should be good for review for the brave ones, @amyeroberts @ArthurZucker.

Failing tests are unrelated and also failing on main.

@ArthurZucker (Collaborator) left a comment

First part

src/transformers/__init__.py (review thread resolved)
Collaborator

TODO for me: we need to make sure the tests add some pre-trained models to test!
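
As a rough sketch of what that could mean for a tokenizer test (the mixin import path and the from_pretrained_id attribute are assumptions, loosely following the add_pretrained_ids_in_tests base branch referenced above, #29534):

import unittest

from transformers import BertTokenizer

# In the transformers repo the shared mixin lives in tests/test_tokenization_common.py;
# this import assumes the test file sits alongside it.
from test_tokenization_common import TokenizerTesterMixin


class BertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
    tokenizer_class = BertTokenizer
    # With the static maps gone, the checkpoint the common tests exercise
    # is declared explicitly on the test class.
    from_pretrained_id = "google-bert/bert-base-uncased"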

src/transformers/models/deprecated/_archive_maps.py (review threads outdated and resolved)
Comment on lines +43 to +47
from ..deprecated._archive_maps import ( # noqa: F401, E402
DPR_CONTEXT_ENCODER_PRETRAINED_MODEL_ARCHIVE_LIST, # noqa: F401, E402
DPR_QUESTION_ENCODER_PRETRAINED_MODEL_ARCHIVE_LIST, # noqa: F401, E402
DPR_READER_PRETRAINED_MODEL_ARCHIVE_LIST, # noqa: F401, E402
)
Collaborator

Suggested change
from ..deprecated._archive_maps import ( # noqa: F401, E402
DPR_CONTEXT_ENCODER_PRETRAINED_MODEL_ARCHIVE_LIST, # noqa: F401, E402
DPR_QUESTION_ENCODER_PRETRAINED_MODEL_ARCHIVE_LIST, # noqa: F401, E402
DPR_READER_PRETRAINED_MODEL_ARCHIVE_LIST, # noqa: F401, E402
)
from ..deprecated._archive_maps import DPR_CONTEXT_ENCODER_PRETRAINED_MODEL_ARCHIVE_LIST, DPR_QUESTION_ENCODER_PRETRAINED_MODEL_ARCHIVE_LIST, DPR_READER_PRETRAINED_MODEL_ARCHIVE_LIST #fmt skip # noqa: F401, E402

if that works

Member Author

I prefer to keep it as is, since I'll be looking for the suffix when removing these imports.

@LysandreJik LysandreJik merged commit 39114c0 into main Mar 25, 2024
21 checks passed
@LysandreJik LysandreJik deleted the remove-maps branch March 25, 2024 09:33
hovnatan pushed a commit to hovnatan/transformers that referenced this pull request Mar 27, 2024
Remove static pretrained maps from the library's internals (huggingface#29112)

* [test_all] Remove static pretrained maps from the library's internals

* Deprecate archive maps instead of removing them

* Revert init changes

* [test_all] Deprecate instead of removing

* [test_all] PVT v2 support

* [test_all] Tests should all pass

* [test_all] Style

* Address review comments

* Update src/transformers/models/deprecated/_archive_maps.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/models/deprecated/_archive_maps.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* [test_all] trigger tests

* [test_all] LLAVA

* [test_all] Bad rebase

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
@Rocketknight1 Rocketknight1 mentioned this pull request Mar 28, 2024
@ydshieh ydshieh mentioned this pull request Apr 5, 2024
itazap pushed a commit that referenced this pull request May 14, 2024