run_glue_no_trainer.py script crashes on Mistral model due to tokenizer issue #28534

Closed
rosario-purple opened this issue Jan 16, 2024 · 5 comments · Fixed by #30234
Labels
Good Second Issue Issues that are more difficult to do than "Good First" issues - give it a try if you want!

Comments

rosario-purple commented Jan 16, 2024

System Info

  • transformers version: 4.36.2
  • Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
  • Python version: 3.10.13
  • Huggingface_hub version: 0.19.4
  • Safetensors version: 0.4.0
  • Accelerate version: 0.25.0
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: DEEPSPEED
    - mixed_precision: bf16
    - use_cpu: False
    - debug: False
    - num_processes: 8
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero3_save_16bit_model': False, 'zero_stage': 3}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.1.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): 0.7.5 (cpu)
  • Jax version: 0.4.21
  • JaxLib version: 0.4.21
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes

Who can help?

@ArthurZucker @younesbelkada @pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Check out the transformers repo and run this command (on a large server with accelerate configured appropriately, so it won't OOM):

python run_glue_no_trainer.py --model_name_or_path mistralai/Mistral-7B-v0.1 --task_name sst2 --per_device_train_batch_size 4 --learning_rate 2e-5 --num_train_epochs 3 --output_dir /tmp/sst2

It will crash with this error and stack trace:

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Traceback (most recent call last):
  File "/scratch/brr/run_glue.py", line 662, in <module>
    main()
  File "/scratch/brr/run_glue.py", line 545, in main
    for step, batch in enumerate(active_dataloader):
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/accelerate/data_loader.py", line 448, in __iter__
    current_batch = next(dataloader_iter)
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/transformers/data/data_collator.py", line 249, in __call__
    batch = self.tokenizer.pad(
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3259, in pad
    padding_strategy, _, max_length, _ = self._get_padding_truncation_strategies(
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2707, in _get_padding_truncation_strategies
    raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
/scratch/miniconda3/envs/brr/lib/python3.10/tempfile.py:860: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmprbynkmzk'>
  _warnings.warn(warn_message, ResourceWarning)

Expected behavior

It should train without crashing.
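
For what it's worth, the root cause can be reproduced without the training script: the Mistral checkpoint ships a LlamaTokenizerFast with no pad token defined, so any call to tokenizer.pad() fails. A minimal sketch (assumes network access to the Hub; the example input_ids are arbitrary):

from transformers import AutoTokenizer

# Mistral-7B uses LlamaTokenizerFast, which defines no pad token by default.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
print(tokenizer.pad_token)  # None

# DataCollatorWithPadding ultimately calls tokenizer.pad(), which raises the
# ValueError above because there is no pad token to pad with.
tokenizer.pad([{"input_ids": [1, 2, 3]}, {"input_ids": [1, 2]}], padding=True)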

rosario-purple (Author) commented:

Adding these lines seems to fix it, though I'm not sure whether this is the best or most general solution:

    tokenizer = AutoTokenizer.from_pretrained(
        args.model_name_or_path, use_fast=not args.use_slow_tokenizer, trust_remote_code=args.trust_remote_code
    )
    # New: reuse the EOS token for padding and propagate its id to the model
    # config, so MistralForSequenceClassification can locate the last
    # non-padding token in each sequence.
    tokenizer.pad_token = tokenizer.eos_token
    config.pad_token_id = tokenizer.pad_token_id
    model = AutoModelForSequenceClassification.from_pretrained(
        args.model_name_or_path,
        from_tf=bool(".ckpt" in args.model_name_or_path),
        config=config,
        ignore_mismatched_sizes=args.ignore_mismatched_sizes,
        trust_remote_code=args.trust_remote_code,
    )
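
An alternative, if reusing EOS is undesirable, would be to register a dedicated pad token and resize the model embeddings to match. A sketch (the '[PAD]' string is arbitrary, and this has to run after the model is loaded):

# Add a brand-new pad token instead of reusing EOS. The embedding matrix
# grows by one row, so it must be resized to match the new vocab size.
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id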

amyeroberts (Collaborator) commented:

Hi @rosario-purple, thanks for raising this issue!

The proposed fix is the recommended way to address this. Would you like to open a PR to add it to the script? That way you get the GitHub contribution.

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@ArthurZucker added the Good Second Issue label on Feb 19, 2024
JINO-ROHIT (Contributor) commented:

@amyeroberts can I take this up?

amyeroberts (Collaborator) commented:

@JINO-ROHIT Sure!
