Tokenizers: ability to load from model subfolder #8586

Merged 5 commits on Nov 17, 2020
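The feature named in the title can be pictured with a small, self-contained sketch. The helper below is hypothetical (it is not the `transformers` API); it only illustrates how a `subfolder` argument slots into the resolved file location, mirroring the huggingface.co URLs that this PR substitutes for S3 in the help texts:

```python
import os

def resolve_pretrained_file(model_id_or_path: str, filename: str, subfolder: str = "") -> str:
    """Hypothetical sketch: locate a tokenizer file, optionally nested in a subfolder."""
    if os.path.isdir(model_id_or_path):
        # Local checkout: join the path components on disk.
        parts = [model_id_or_path] + ([subfolder] if subfolder else []) + [filename]
        return os.path.join(*parts)
    # Remote model id: build a huggingface.co download URL.
    remote = "/".join(([subfolder] if subfolder else []) + [filename])
    return f"https://huggingface.co/{model_id_or_path}/resolve/main/{remote}"

print(resolve_pretrained_file("facebook/rag-token-nq", "vocab.json",
                              subfolder="question_encoder_tokenizer"))
# https://huggingface.co/facebook/rag-token-nq/resolve/main/question_encoder_tokenizer/vocab.json
```

The RAG model id and file names above are illustrative; the point is that the same entry point can serve both a flat model repo and one whose tokenizer files live in a nested folder.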
4 changes: 2 additions & 2 deletions docs/source/pretrained_models.rst
@@ -3,11 +3,11 @@ Pretrained models

Here is the full list of the currently provided pretrained models together with a short presentation of each model.

-For a list that includes community-uploaded models, refer to `https://huggingface.co/models
+For a list that includes all community-uploaded models, refer to `https://huggingface.co/models
<https://huggingface.co/models>`__.

+--------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-| Architecture       | Shortcut name                                              | Details of the model                                                                                                                  |
+| Architecture       | Model id                                                   | Details of the model                                                                                                                  |
+====================+============================================================+=======================================================================================================================================+
| BERT | ``bert-base-uncased`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. |
| | | | Trained on lower-cased English text. |
3 changes: 2 additions & 1 deletion examples/adversarial/run_hans.py
@@ -57,7 +57,8 @@ class ModelArguments:
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
-default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
+default=None,
+metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)


2 changes: 1 addition & 1 deletion examples/bert-loses-patience/run_glue_with_pabee.py
@@ -476,7 +476,7 @@ def main():
"--cache_dir",
default="",
type=str,
-help="Where do you want to store the pre-trained models downloaded from s3",
+help="Where do you want to store the pre-trained models downloaded from huggingface.co",
)
parser.add_argument(
"--max_seq_length",
2 changes: 1 addition & 1 deletion examples/bertology/run_bertology.py
@@ -298,7 +298,7 @@ def main():
"--cache_dir",
default=None,
type=str,
-help="Where do you want to store the pre-trained models downloaded from s3",
+help="Where do you want to store the pre-trained models downloaded from huggingface.co",
)
parser.add_argument(
"--data_subset", type=int, default=-1, help="If > 0: limit the data to a subset of data_subset instances."
3 changes: 2 additions & 1 deletion examples/contrib/legacy/run_language_modeling.py
@@ -81,7 +81,8 @@ class ModelArguments:
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
-default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
+default=None,
+metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)


2 changes: 1 addition & 1 deletion examples/contrib/mm-imdb/run_mmimdb.py
@@ -350,7 +350,7 @@ def main():
"--cache_dir",
default=None,
type=str,
-help="Where do you want to store the pre-trained models downloaded from s3",
+help="Where do you want to store the pre-trained models downloaded from huggingface.co",
)
parser.add_argument(
"--max_seq_length",
2 changes: 1 addition & 1 deletion examples/deebert/run_glue_deebert.py
@@ -452,7 +452,7 @@ def main():
"--cache_dir",
default="",
type=str,
-help="Where do you want to store the pre-trained models downloaded from s3",
+help="Where do you want to store the pre-trained models downloaded from huggingface.co",
)
parser.add_argument(
"--max_seq_length",
2 changes: 1 addition & 1 deletion examples/distillation/run_squad_w_distillation.py
@@ -578,7 +578,7 @@ def main():
"--cache_dir",
default="",
type=str,
-help="Where do you want to store the pre-trained models downloaded from s3",
+help="Where do you want to store the pre-trained models downloaded from huggingface.co",
)

parser.add_argument(
3 changes: 2 additions & 1 deletion examples/language-modeling/run_clm.py
@@ -76,7 +76,8 @@ class ModelArguments:
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
-default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
+default=None,
+metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)
use_fast_tokenizer: bool = field(
default=True,
3 changes: 2 additions & 1 deletion examples/language-modeling/run_mlm.py
@@ -74,7 +74,8 @@ class ModelArguments:
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
-default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
+default=None,
+metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)
use_fast_tokenizer: bool = field(
default=True,
3 changes: 2 additions & 1 deletion examples/language-modeling/run_mlm_wwm.py
@@ -76,7 +76,8 @@ class ModelArguments:
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
-default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
+default=None,
+metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)
use_fast_tokenizer: bool = field(
default=True,
3 changes: 2 additions & 1 deletion examples/language-modeling/run_plm.py
@@ -64,7 +64,8 @@ class ModelArguments:
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
-default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
+default=None,
+metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)
use_fast_tokenizer: bool = field(
default=True,
2 changes: 1 addition & 1 deletion examples/lightning_base.py
@@ -236,7 +236,7 @@ def add_model_specific_args(parser, root_dir):
"--cache_dir",
default="",
type=str,
-help="Where do you want to store the pre-trained models downloaded from s3",
+help="Where do you want to store the pre-trained models downloaded from huggingface.co",
)
parser.add_argument(
"--encoder_layerdrop",
2 changes: 1 addition & 1 deletion examples/movement-pruning/masked_run_glue.py
@@ -620,7 +620,7 @@ def main():
"--cache_dir",
default="",
type=str,
-help="Where do you want to store the pre-trained models downloaded from s3",
+help="Where do you want to store the pre-trained models downloaded from huggingface.co",
)
parser.add_argument(
"--max_seq_length",
2 changes: 1 addition & 1 deletion examples/movement-pruning/masked_run_squad.py
@@ -725,7 +725,7 @@ def main():
"--cache_dir",
default="",
type=str,
-help="Where do you want to store the pre-trained models downloaded from s3",
+help="Where do you want to store the pre-trained models downloaded from huggingface.co",
)

parser.add_argument(
3 changes: 2 additions & 1 deletion examples/multiple-choice/run_multiple_choice.py
@@ -61,7 +61,8 @@ class ModelArguments:
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
-default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
+default=None,
+metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)


3 changes: 2 additions & 1 deletion examples/multiple-choice/run_tf_multiple_choice.py
@@ -65,7 +65,8 @@ class ModelArguments:
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
-default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
+default=None,
+metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)


2 changes: 1 addition & 1 deletion examples/question-answering/run_squad.py
@@ -532,7 +532,7 @@ def main():
"--cache_dir",
default="",
type=str,
-help="Where do you want to store the pre-trained models downloaded from s3",
+help="Where do you want to store the pre-trained models downloaded from huggingface.co",
)

parser.add_argument(
3 changes: 2 additions & 1 deletion examples/question-answering/run_squad_trainer.py
@@ -51,7 +51,8 @@ class ModelArguments:
# If you want to tweak more attributes on your tokenizer, you should do it in a distinct script,
# or just modify its tokenizer_config.json.
cache_dir: Optional[str] = field(
-default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
+default=None,
+metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)


3 changes: 2 additions & 1 deletion examples/question-answering/run_tf_squad.py
@@ -63,7 +63,8 @@ class ModelArguments:
# If you want to tweak more attributes on your tokenizer, you should do it in a distinct script,
# or just modify its tokenizer_config.json.
cache_dir: Optional[str] = field(
-default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
+default=None,
+metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)


2 changes: 1 addition & 1 deletion examples/rag/finetune.sh
@@ -7,7 +7,7 @@ export PYTHONPATH="../":"${PYTHONPATH}"
python examples/rag/finetune.py \
--data_dir $DATA_DIR \
--output_dir $OUTPUT_DIR \
---model_name_or_path $MODLE_NAME_OR_PATH \
+--model_name_or_path $MODEL_NAME_OR_PATH \
Collaborator: Nice catch!

--model_type rag_sequence \
--fp16 \
--gpus 8 \
3 changes: 2 additions & 1 deletion examples/seq2seq/finetune_trainer.py
@@ -43,7 +43,8 @@ class ModelArguments:
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
-default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
+default=None,
+metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)
freeze_encoder: bool = field(default=False, metadata={"help": "Whether tp freeze the encoder."})
freeze_embeds: bool = field(default=False, metadata={"help": "Whether to freeze the embeddings."})
3 changes: 2 additions & 1 deletion examples/text-classification/run_glue.py
@@ -124,7 +124,8 @@ class ModelArguments:
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
-default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
+default=None,
+metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)
use_fast_tokenizer: bool = field(
default=True,
3 changes: 2 additions & 1 deletion examples/text-classification/run_tf_glue.py
@@ -117,7 +117,8 @@ class ModelArguments:
# If you want to tweak more attributes on your tokenizer, you should do it in a distinct script,
# or just modify its tokenizer_config.json.
cache_dir: Optional[str] = field(
-default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
+default=None,
+metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)


3 changes: 2 additions & 1 deletion examples/text-classification/run_tf_text_classification.py
@@ -182,7 +182,8 @@ class ModelArguments:
# If you want to tweak more attributes on your tokenizer, you should do it in a distinct script,
# or just modify its tokenizer_config.json.
cache_dir: Optional[str] = field(
-default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
+default=None,
+metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)


2 changes: 1 addition & 1 deletion examples/text-classification/run_xnli.py
@@ -406,7 +406,7 @@ def main():
"--cache_dir",
default=None,
type=str,
-help="Where do you want to store the pre-trained models downloaded from s3",
+help="Where do you want to store the pre-trained models downloaded from huggingface.co",
)
parser.add_argument(
"--max_seq_length",
3 changes: 2 additions & 1 deletion examples/token-classification/run_ner.py
@@ -60,7 +60,8 @@ class ModelArguments:
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
-default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
+default=None,
+metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)


3 changes: 2 additions & 1 deletion examples/token-classification/run_ner_old.py
@@ -65,7 +65,8 @@ class ModelArguments:
# If you want to tweak more attributes on your tokenizer, you should do it in a distinct script,
# or just modify its tokenizer_config.json.
cache_dir: Optional[str] = field(
-default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
+default=None,
+metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)


3 changes: 2 additions & 1 deletion examples/token-classification/run_tf_ner.py
@@ -67,7 +67,8 @@ class ModelArguments:
# If you want to tweak more attributes on your tokenizer, you should do it in a distinct script,
# or just modify its tokenizer_config.json.
cache_dir: Optional[str] = field(
-default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
+default=None,
+metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)


12 changes: 6 additions & 6 deletions hubconf.py
@@ -25,7 +25,7 @@ def config(*args, **kwargs):
# Using torch.hub !
import torch

-config = torch.hub.load('huggingface/transformers', 'config', 'bert-base-uncased')  # Download configuration from S3 and cache.
+config = torch.hub.load('huggingface/transformers', 'config', 'bert-base-uncased')  # Download configuration from huggingface.co and cache.
config = torch.hub.load('huggingface/transformers', 'config', './test/bert_saved_model/') # E.g. config (or model) was saved using `save_pretrained('./test/saved_model/')`
config = torch.hub.load('huggingface/transformers', 'config', './test/bert_saved_model/my_configuration.json')
config = torch.hub.load('huggingface/transformers', 'config', 'bert-base-uncased', output_attentions=True, foo=False)
@@ -45,7 +45,7 @@ def tokenizer(*args, **kwargs):
# Using torch.hub !
import torch

-tokenizer = torch.hub.load('huggingface/transformers', 'tokenizer', 'bert-base-uncased')  # Download vocabulary from S3 and cache.
+tokenizer = torch.hub.load('huggingface/transformers', 'tokenizer', 'bert-base-uncased')  # Download vocabulary from huggingface.co and cache.
tokenizer = torch.hub.load('huggingface/transformers', 'tokenizer', './test/bert_saved_model/') # E.g. tokenizer was saved using `save_pretrained('./test/saved_model/')`

"""
@@ -59,7 +59,7 @@ def model(*args, **kwargs):
# Using torch.hub !
import torch

-model = torch.hub.load('huggingface/transformers', 'model', 'bert-base-uncased')  # Download model and configuration from S3 and cache.
+model = torch.hub.load('huggingface/transformers', 'model', 'bert-base-uncased')  # Download model and configuration from huggingface.co and cache.
model = torch.hub.load('huggingface/transformers', 'model', './test/bert_model/') # E.g. model was saved using `save_pretrained('./test/saved_model/')`
model = torch.hub.load('huggingface/transformers', 'model', 'bert-base-uncased', output_attentions=True) # Update configuration during loading
assert model.config.output_attentions == True
@@ -78,7 +78,7 @@ def modelWithLMHead(*args, **kwargs):
# Using torch.hub !
import torch

-model = torch.hub.load('huggingface/transformers', 'modelWithLMHead', 'bert-base-uncased')  # Download model and configuration from S3 and cache.
+model = torch.hub.load('huggingface/transformers', 'modelWithLMHead', 'bert-base-uncased')  # Download model and configuration from huggingface.co and cache.
model = torch.hub.load('huggingface/transformers', 'modelWithLMHead', './test/bert_model/') # E.g. model was saved using `save_pretrained('./test/saved_model/')`
model = torch.hub.load('huggingface/transformers', 'modelWithLMHead', 'bert-base-uncased', output_attentions=True) # Update configuration during loading
assert model.config.output_attentions == True
@@ -96,7 +96,7 @@ def modelForSequenceClassification(*args, **kwargs):
# Using torch.hub !
import torch

-model = torch.hub.load('huggingface/transformers', 'modelForSequenceClassification', 'bert-base-uncased')  # Download model and configuration from S3 and cache.
+model = torch.hub.load('huggingface/transformers', 'modelForSequenceClassification', 'bert-base-uncased')  # Download model and configuration from huggingface.co and cache.
model = torch.hub.load('huggingface/transformers', 'modelForSequenceClassification', './test/bert_model/') # E.g. model was saved using `save_pretrained('./test/saved_model/')`
model = torch.hub.load('huggingface/transformers', 'modelForSequenceClassification', 'bert-base-uncased', output_attentions=True) # Update configuration during loading
assert model.config.output_attentions == True
@@ -115,7 +115,7 @@ def modelForQuestionAnswering(*args, **kwargs):
# Using torch.hub !
import torch

-model = torch.hub.load('huggingface/transformers', 'modelForQuestionAnswering', 'bert-base-uncased')  # Download model and configuration from S3 and cache.
+model = torch.hub.load('huggingface/transformers', 'modelForQuestionAnswering', 'bert-base-uncased')  # Download model and configuration from huggingface.co and cache.
model = torch.hub.load('huggingface/transformers', 'modelForQuestionAnswering', './test/bert_model/') # E.g. model was saved using `save_pretrained('./test/saved_model/')`
model = torch.hub.load('huggingface/transformers', 'modelForQuestionAnswering', 'bert-base-uncased', output_attentions=True) # Update configuration during loading
assert model.config.output_attentions == True
2 changes: 1 addition & 1 deletion src/transformers/commands/user.py
@@ -31,7 +31,7 @@ def register_subcommand(parser: ArgumentParser):
ls_parser.add_argument("--organization", type=str, help="Optional: organization namespace.")
ls_parser.set_defaults(func=lambda args: ListObjsCommand(args))
rm_parser = s3_subparsers.add_parser("rm")
-rm_parser.add_argument("filename", type=str, help="individual object filename to delete from S3.")
+rm_parser.add_argument("filename", type=str, help="individual object filename to delete from huggingface.co.")
rm_parser.add_argument("--organization", type=str, help="Optional: organization namespace.")
rm_parser.set_defaults(func=lambda args: DeleteObjCommand(args))
upload_parser = s3_subparsers.add_parser("upload", help="Upload a file to S3.")