
[Bug] tokenizer.model_max_length is different when loading model from shortcut or local path #14561

@ShaneTian

Description

To reproduce

When loading the model from a local path, tokenizer.model_max_length is VERY_LARGE_INTEGER, whereas loading from the shortcut name gives the expected 1024:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
print(tokenizer.model_max_length)
# 1024

tokenizer = GPT2Tokenizer.from_pretrained("path/to/local/gpt2")
print(tokenizer.model_max_length)
# 1000000000000000019884624838656

Related code

# Set max length if needed
if pretrained_model_name_or_path in cls.max_model_input_sizes:
    # if we're using a pretrained model, ensure the tokenizer
    # won't index sequences longer than the number of positional embeddings
    model_max_length = cls.max_model_input_sizes[pretrained_model_name_or_path]
    if model_max_length is not None and isinstance(model_max_length, (int, float)):
        init_kwargs["model_max_length"] = min(init_kwargs.get("model_max_length", int(1e30)), model_max_length)

max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES

PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
    "gpt2": 1024,
    "gpt2-medium": 1024,
    "gpt2-large": 1024,
    "gpt2-xl": 1024,
    "distilgpt2": 1024,
}
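
Since a local path like path/to/local/gpt2 is never a key in PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES, the branch above is skipped and model_max_length keeps the int(1e30) default. Until this is fixed, one possible workaround (a sketch, assuming the local copy was saved from the gpt2 checkpoint, whose context size is 1024) is to pass model_max_length explicitly, since from_pretrained forwards extra kwargs to the tokenizer's __init__:

from transformers import GPT2Tokenizer

# Workaround sketch: override the default explicitly when loading from a
# local path. 1024 is the known context size of the original gpt2 checkpoint.
tokenizer = GPT2Tokenizer.from_pretrained("path/to/local/gpt2", model_max_length=1024)
print(tokenizer.model_max_length)
# 1024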

Expected behavior

The correct model_max_length should be assigned when loading the model from a local path:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("path/to/local/gpt2")
print(tokenizer.model_max_length)
# 1024
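
Another way to get this behavior today (a sketch, assuming a writable local copy; "path/to/local/gpt2" is a placeholder for the actual saved tokenizer directory) is to record model_max_length in the local tokenizer_config.json, since that file's contents are loaded into init_kwargs by from_pretrained:

import json
import os

# Placeholder local path; adjust to the actual saved tokenizer directory.
config_path = os.path.join("path/to/local/gpt2", "tokenizer_config.json")

# Load the existing config (or start fresh if the file is missing).
config = {}
if os.path.isfile(config_path):
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)

# Persist the known gpt2 context size so from_pretrained picks it up.
config["model_max_length"] = 1024
with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)

After this one-time patch, GPT2Tokenizer.from_pretrained("path/to/local/gpt2") reports model_max_length == 1024 without any extra kwargs.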
