run_glue_no_trainer.py script crashes on Mistral model due to tokenizer issue #28534

Closed
rosario-purple opened this issue Jan 16, 2024 · 5 comments · Fixed by #30234
Labels
Good Second Issue Issues that are more difficult to do than "Good First" issues - give it a try if you want!

Comments

rosario-purple commented Jan 16, 2024

System Info

  • transformers version: 4.36.2
  • Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
  • Python version: 3.10.13
  • Huggingface_hub version: 0.19.4
  • Safetensors version: 0.4.0
  • Accelerate version: 0.25.0
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: DEEPSPEED
    - mixed_precision: bf16
    - use_cpu: False
    - debug: False
    - num_processes: 8
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero3_save_16bit_model': False, 'zero_stage': 3}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.1.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): 0.7.5 (cpu)
  • Jax version: 0.4.21
  • JaxLib version: 0.4.21
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes

Who can help?

@ArthurZucker @younesbelkada @pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Check out the transformers repo and run this command (on a large server with accelerate configured appropriately, so it won't OOM):

python run_glue_no_trainer.py --model_name_or_path mistralai/Mistral-7B-v0.1 --task_name sst2 --per_device_train_batch_size 4 --learning_rate 2e-5 --num_train_epochs 3 --output_dir /tmp/sst2

It will crash with this error and stack trace:

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Traceback (most recent call last):
  File "/scratch/brr/run_glue.py", line 662, in <module>
    main()
  File "/scratch/brr/run_glue.py", line 545, in main
    for step, batch in enumerate(active_dataloader):
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/accelerate/data_loader.py", line 448, in __iter__
    current_batch = next(dataloader_iter)
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/transformers/data/data_collator.py", line 249, in __call__
    batch = self.tokenizer.pad(
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3259, in pad
    padding_strategy, _, max_length, _ = self._get_padding_truncation_strategies(
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2707, in _get_padding_truncation_strategies
    raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
/scratch/miniconda3/envs/brr/lib/python3.10/tempfile.py:860: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmprbynkmzk'>
  _warnings.warn(warn_message, ResourceWarning)

Expected behavior

It should train without crashing.
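
For what it's worth, the root cause can be reproduced without the training script: the Mistral checkpoint ships a LlamaTokenizerFast with no pad token defined, so any call to tokenizer.pad() fails. A minimal sketch (assumes network access to the Hub; the example input_ids are arbitrary):

from transformers import AutoTokenizer

# Mistral-7B uses LlamaTokenizerFast, which defines no pad token by default.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
print(tokenizer.pad_token)  # None

# DataCollatorWithPadding ultimately calls tokenizer.pad(), which raises the
# ValueError above because there is no pad token to pad with.
tokenizer.pad([{"input_ids": [1, 2, 3]}, {"input_ids": [1, 2]}], padding=True)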

rosario-purple (Author) commented:

Adding these lines seems to fix it, though I'm not sure whether this is the best or most general solution:

    tokenizer = AutoTokenizer.from_pretrained(
        args.model_name_or_path, use_fast=not args.use_slow_tokenizer, trust_remote_code=args.trust_remote_code
    )
    # New: reuse the EOS token for padding and propagate its id to the model
    # config, so MistralForSequenceClassification can locate the last
    # non-padding token in each sequence.
    tokenizer.pad_token = tokenizer.eos_token
    config.pad_token_id = tokenizer.pad_token_id
    model = AutoModelForSequenceClassification.from_pretrained(
        args.model_name_or_path,
        from_tf=bool(".ckpt" in args.model_name_or_path),
        config=config,
        ignore_mismatched_sizes=args.ignore_mismatched_sizes,
        trust_remote_code=args.trust_remote_code,
    )
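
An alternative, if reusing EOS is undesirable, would be to register a dedicated pad token and resize the model embeddings to match. A sketch (the '[PAD]' string is arbitrary, and this has to run after the model is loaded):

# Add a brand-new pad token instead of reusing EOS. The embedding matrix
# grows by one row, so it must be resized to match the new vocab size.
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id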

amyeroberts (Collaborator) commented:

Hi @rosario-purple, thanks for raising this issue!

The proposed fix is the recommended way to address this. Would you like to open a PR to add it to the script? That way you get the GitHub contribution.

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@ArthurZucker added the Good Second Issue label on Feb 19, 2024
JINO-ROHIT (Contributor) commented:

@amyeroberts can I take this up?

amyeroberts (Collaborator) commented:

@JINO-ROHIT Sure!
