
Error finetuning Whisper using new tokenizer #25503

Closed · 2 of 4 tasks
PeterBagnegaard opened this issue Aug 14, 2023 · 17 comments

@PeterBagnegaard

System Info

  • transformers version: 4.28.0.dev0
  • Platform: Linux-6.2.15-100.fc36.x86_64-x86_64-with-glibc2.35
  • Python version: 3.10.9
  • Huggingface_hub version: 0.13.4
  • Safetensors version: not installed
  • PyTorch version (GPU?): 2.0.0+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker

Information

I am using whisper-medium-da,

and I've based my code on the tutorials
Training a new tokenizer from an old one
https://huggingface.co/learn/nlp-course/chapter6/2
and
Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers
https://huggingface.co/blog/fine-tune-whisper

I'm trying to fine-tune Whisper using a tokenizer other than the one provided by Whisper (but based on it).

This gives the following error:

You're using a WhisperTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [102,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [102,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[... the same assertion repeated for the remaining threads of blocks [102,0,0] and [103,0,0] ...]

---------------------------------------------------------------------------
You're using a WhisperTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [112,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [112,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[... the same assertion repeated for the remaining threads of blocks [112,0,0] and [3,0,0] ...]

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[8], line 2
      1 ### print("Start training")
----> 2 trainer.train()
      3 #trainer.evaluate()
      4 print("Done training")

File ~/anaconda3/lib/python3.10/site-packages/transformers/trainer.py:1662, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1657     self.model_wrapped = self.model
   1659 inner_training_loop = find_executable_batch_size(
   1660     self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
   1661 )
-> 1662 return inner_training_loop(
   1663     args=args,
   1664     resume_from_checkpoint=resume_from_checkpoint,
   1665     trial=trial,
   1666     ignore_keys_for_eval=ignore_keys_for_eval,
   1667 )

File ~/anaconda3/lib/python3.10/site-packages/transformers/trainer.py:1929, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1927         tr_loss_step = self.training_step(model, inputs)
   1928 else:
-> 1929     tr_loss_step = self.training_step(model, inputs)
   1931 if (
   1932     args.logging_nan_inf_filter
   1933     and not is_torch_tpu_available()
   1934     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   1935 ):
   1936     # if loss is nan or inf simply add the average of previous logged losses
   1937     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File ~/anaconda3/lib/python3.10/site-packages/transformers/trainer.py:2699, in Trainer.training_step(self, model, inputs)
   2696     return loss_mb.reduce_mean().detach().to(self.args.device)
   2698 with self.compute_loss_context_manager():
-> 2699     loss = self.compute_loss(model, inputs)
   2701 if self.args.n_gpu > 1:
   2702     loss = loss.mean()  # mean() to average on multi-gpu parallel training

File ~/anaconda3/lib/python3.10/site-packages/transformers/trainer.py:2731, in Trainer.compute_loss(self, model, inputs, return_outputs)
   2729 else:
   2730     labels = None
-> 2731 outputs = model(**inputs)
   2732 # Save past state if it exists
   2733 # TODO: this needs to be fixed and made cleaner later.
   2734 if self.args.past_index >= 0:

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/anaconda3/lib/python3.10/site-packages/transformers/models/whisper/modeling_whisper.py:1414, in WhisperForConditionalGeneration.forward(self, input_features, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   1409     if decoder_input_ids is None and decoder_inputs_embeds is None:
   1410         decoder_input_ids = shift_tokens_right(
   1411             labels, self.config.pad_token_id, self.config.decoder_start_token_id
   1412         )
-> 1414 outputs = self.model(
   1415     input_features,
   1416     attention_mask=attention_mask,
   1417     decoder_input_ids=decoder_input_ids,
   1418     encoder_outputs=encoder_outputs,
   1419     decoder_attention_mask=decoder_attention_mask,
   1420     head_mask=head_mask,
   1421     decoder_head_mask=decoder_head_mask,
   1422     cross_attn_head_mask=cross_attn_head_mask,
   1423     past_key_values=past_key_values,
   1424     decoder_inputs_embeds=decoder_inputs_embeds,
   1425     use_cache=use_cache,
   1426     output_attentions=output_attentions,
   1427     output_hidden_states=output_hidden_states,
   1428     return_dict=return_dict,
   1429 )
   1430 lm_logits = self.proj_out(outputs[0])
   1432 loss = None

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/anaconda3/lib/python3.10/site-packages/transformers/models/whisper/modeling_whisper.py:1279, in WhisperModel.forward(self, input_features, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, decoder_inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
   1272     encoder_outputs = BaseModelOutput(
   1273         last_hidden_state=encoder_outputs[0],
   1274         hidden_states=encoder_outputs[1] if len(encoder_outputs) > 1 else None,
   1275         attentions=encoder_outputs[2] if len(encoder_outputs) > 2 else None,
   1276     )
   1278 # decoder outputs consists of (dec_features, past_key_value, dec_hidden, dec_attn)
-> 1279 decoder_outputs = self.decoder(
   1280     input_ids=decoder_input_ids,
   1281     attention_mask=decoder_attention_mask,
   1282     encoder_hidden_states=encoder_outputs[0],
   1283     head_mask=decoder_head_mask,
   1284     cross_attn_head_mask=cross_attn_head_mask,
   1285     past_key_values=past_key_values,
   1286     inputs_embeds=decoder_inputs_embeds,
   1287     use_cache=use_cache,
   1288     output_attentions=output_attentions,
   1289     output_hidden_states=output_hidden_states,
   1290     return_dict=return_dict,
   1291 )
   1293 if not return_dict:
   1294     return decoder_outputs + encoder_outputs

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/anaconda3/lib/python3.10/site-packages/transformers/models/whisper/modeling_whisper.py:1032, in WhisperDecoder.forward(self, input_ids, attention_mask, encoder_hidden_states, head_mask, cross_attn_head_mask, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
   1029 if inputs_embeds is None:
   1030     inputs_embeds = self.embed_tokens(input_ids)
-> 1032 attention_mask = self._prepare_decoder_attention_mask(
   1033     attention_mask, input_shape, inputs_embeds, past_key_values_length
   1034 )
   1036 # embed positions
   1037 if input_ids is not None:

File ~/anaconda3/lib/python3.10/site-packages/transformers/models/whisper/modeling_whisper.py:921, in WhisperDecoder._prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length)
    918 combined_attention_mask = None
    920 if input_shape[-1] > 1:
--> 921     combined_attention_mask = _make_causal_mask(
    922         input_shape,
    923         inputs_embeds.dtype,
    924         device=inputs_embeds.device,
    925         past_key_values_length=past_key_values_length,
    926     )
    928 if attention_mask is not None:
    929     # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
    930     expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1])

File ~/anaconda3/lib/python3.10/site-packages/transformers/models/whisper/modeling_whisper.py:79, in _make_causal_mask(input_ids_shape, dtype, device, past_key_values_length)
     75 """
     76 Make causal mask used for bi-directional self-attention.
     77 """
     78 bsz, tgt_len = input_ids_shape
---> 79 mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min, device=device), device=device)
     80 mask_cond = torch.arange(mask.size(-1), device=device)
     81 mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

The tokenizer from whisper-medium-da has its special tokens added at the very end of the vocab dict (with indices around 50000), whereas new_tokenizer has its special tokens at the very beginning (with indices around 0).
I suspect the error arises because tokens like <|endoftext|> and <|startoftranscript|> don't have the same index in the two tokenizers.
It seems that whenever I train my own tokenizer, even when using train_new_from_iterator, the special tokens move to the beginning of the vocabulary dict.
I'm under the impression that I don't have to retrain Whisper from scratch when retraining the tokenizer, and that I can simply set the new_tokenizer as explained above and fine-tune whisper-medium-da on my own data.
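
A quick way to confirm the mismatch is to compare the ids the two tokenizers assign to the same special tokens (a minimal sketch; "whisper_new" is the directory the retrained tokenizer is saved to in the reproduction below):

from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("openai/whisper-medium")
new_tokenizer = AutoTokenizer.from_pretrained("whisper_new")  # the retrained tokenizer from the reproduction

# The same special token maps to very different ids in the two vocabularies
for token in ["<|endoftext|>", "<|startoftranscript|>"]:
    print(token,
          old_tokenizer.convert_tokens_to_ids(token),  # around 50257+ in the original vocab
          new_tokenizer.convert_tokens_to_ids(token))  # near 0 in the retrained vocab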

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, WhisperProcessor, WhisperForConditionalGeneration, AutoTokenizer
from datasets import Audio, load_dataset, DatasetDict, Dataset
from typing import Any, Dict, List, Union
from dataclasses import dataclass
import evaluate
import torch

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]
        batch["labels"] = labels
        return batch

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)
    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}

def prepare_dataset(batch):
    audio = batch["audio"]
    batch["input_features"] = processor.feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

processor_checkpoint = "openai/whisper-medium"
tokenizer_checkpoint = "whisper_new"
model_checkpoint = "openai/whisper-medium"

# Retrain the tokenizer. This is what I'm unable to do
from datasets import load_dataset

dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")

def get_training_corpus():
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]

old_tokenizer = AutoTokenizer.from_pretrained(processor_checkpoint)
new_tokenizer = old_tokenizer.train_new_from_iterator(get_training_corpus(), old_tokenizer.vocab_size)
new_tokenizer.save_pretrained(tokenizer_checkpoint)

# Create data_collator
processor = WhisperProcessor.from_pretrained(processor_checkpoint, language='Danish', task='transcribe')
processor.tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

# Load data
dataset_dict = DatasetDict()
dataset_dict["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "da", split="train+validation", use_auth_token=True)
dataset_dict["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "da", split="test", use_auth_token=True)
dataset_dict = dataset_dict.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"])
dataset_dict = dataset_dict.cast_column("audio", Audio(sampling_rate=16000))
dataset_dict = dataset_dict.map(prepare_dataset, remove_columns=dataset_dict.column_names["train"], num_proc=4)

# Load model
model = WhisperForConditionalGeneration.from_pretrained(model_checkpoint)
model.config.forced_decoder_ids = None # ToDo Is this right?
model.config.suppress_tokens = []
model.resize_token_embeddings(len(processor.tokenizer))

# Train
metric = evaluate.load("wer")

training_args = Seq2SeqTrainingArguments(
    output_dir="home",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=8*1e-6,
    warmup_steps=500,
    max_steps=10000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=1,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=50,
    eval_steps=50,
    logging_steps=25,
    report_to="none", #["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
    optim="adafactor"
)

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dataset_dict["train"],
    eval_dataset=dataset_dict["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

trainer.train()

Expected behavior

trainer.train() should run smoothly without errors, just as it does when using the tokenizer provided by Whisper.

@ArthurZucker
Collaborator

This error most probably indicates that the embedding layer received indices outside of its range. Did you properly resize the embedding layer to match the tokenizer's length? (Running on CPU will let you see the actual source of the error.)
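
A minimal sketch of that check and resize, assuming the model and processor from the reproduction above:

# len(tokenizer) counts added special tokens; tokenizer.vocab_size does not,
# so the embedding table has to cover len(tokenizer) ids.
embedding_size = model.get_input_embeddings().num_embeddings
print(len(processor.tokenizer), embedding_size)

if len(processor.tokenizer) > embedding_size:
    model.resize_token_embeddings(len(processor.tokenizer))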

@PeterBagnegaard
Author

Thank you so much for the quick reply. This is a show-stopper for me.
I think you're right, but I don't know how to fix it.

When training with no_cuda=True I get the following error:

You're using a WhisperTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[13], line 2
      1 ### print("Start training")
----> 2 trainer.train()
      3 #trainer.evaluate()
      4 print("Done training")

File ~/anaconda3/lib/python3.10/site-packages/transformers/trainer.py:1662, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1657     self.model_wrapped = self.model
   1659 inner_training_loop = find_executable_batch_size(
   1660     self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
   1661 )
-> 1662 return inner_training_loop(
   1663     args=args,
   1664     resume_from_checkpoint=resume_from_checkpoint,
   1665     trial=trial,
   1666     ignore_keys_for_eval=ignore_keys_for_eval,
   1667 )

File ~/anaconda3/lib/python3.10/site-packages/transformers/trainer.py:1929, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1927         tr_loss_step = self.training_step(model, inputs)
   1928 else:
-> 1929     tr_loss_step = self.training_step(model, inputs)
   1931 if (
   1932     args.logging_nan_inf_filter
   1933     and not is_torch_tpu_available()
   1934     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   1935 ):
   1936     # if loss is nan or inf simply add the average of previous logged losses
   1937     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File ~/anaconda3/lib/python3.10/site-packages/transformers/trainer.py:2699, in Trainer.training_step(self, model, inputs)
   2696     return loss_mb.reduce_mean().detach().to(self.args.device)
   2698 with self.compute_loss_context_manager():
-> 2699     loss = self.compute_loss(model, inputs)
   2701 if self.args.n_gpu > 1:
   2702     loss = loss.mean()  # mean() to average on multi-gpu parallel training

File ~/anaconda3/lib/python3.10/site-packages/transformers/trainer.py:2731, in Trainer.compute_loss(self, model, inputs, return_outputs)
   2729 else:
   2730     labels = None
-> 2731 outputs = model(**inputs)
   2732 # Save past state if it exists
   2733 # TODO: this needs to be fixed and made cleaner later.
   2734 if self.args.past_index >= 0:

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/anaconda3/lib/python3.10/site-packages/transformers/models/whisper/modeling_whisper.py:1414, in WhisperForConditionalGeneration.forward(self, input_features, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   1409     if decoder_input_ids is None and decoder_inputs_embeds is None:
   1410         decoder_input_ids = shift_tokens_right(
   1411             labels, self.config.pad_token_id, self.config.decoder_start_token_id
   1412         )
-> 1414 outputs = self.model(
   1415     input_features,
   1416     attention_mask=attention_mask,
   1417     decoder_input_ids=decoder_input_ids,
   1418     encoder_outputs=encoder_outputs,
   1419     decoder_attention_mask=decoder_attention_mask,
   1420     head_mask=head_mask,
   1421     decoder_head_mask=decoder_head_mask,
   1422     cross_attn_head_mask=cross_attn_head_mask,
   1423     past_key_values=past_key_values,
   1424     decoder_inputs_embeds=decoder_inputs_embeds,
   1425     use_cache=use_cache,
   1426     output_attentions=output_attentions,
   1427     output_hidden_states=output_hidden_states,
   1428     return_dict=return_dict,
   1429 )
   1430 lm_logits = self.proj_out(outputs[0])
   1432 loss = None

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/anaconda3/lib/python3.10/site-packages/transformers/models/whisper/modeling_whisper.py:1279, in WhisperModel.forward(self, input_features, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, decoder_inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
   1272     encoder_outputs = BaseModelOutput(
   1273         last_hidden_state=encoder_outputs[0],
   1274         hidden_states=encoder_outputs[1] if len(encoder_outputs) > 1 else None,
   1275         attentions=encoder_outputs[2] if len(encoder_outputs) > 2 else None,
   1276     )
   1278 # decoder outputs consists of (dec_features, past_key_value, dec_hidden, dec_attn)
-> 1279 decoder_outputs = self.decoder(
   1280     input_ids=decoder_input_ids,
   1281     attention_mask=decoder_attention_mask,
   1282     encoder_hidden_states=encoder_outputs[0],
   1283     head_mask=decoder_head_mask,
   1284     cross_attn_head_mask=cross_attn_head_mask,
   1285     past_key_values=past_key_values,
   1286     inputs_embeds=decoder_inputs_embeds,
   1287     use_cache=use_cache,
   1288     output_attentions=output_attentions,
   1289     output_hidden_states=output_hidden_states,
   1290     return_dict=return_dict,
   1291 )
   1293 if not return_dict:
   1294     return decoder_outputs + encoder_outputs

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/anaconda3/lib/python3.10/site-packages/transformers/models/whisper/modeling_whisper.py:1030, in WhisperDecoder.forward(self, input_ids, attention_mask, encoder_hidden_states, head_mask, cross_attn_head_mask, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
   1027 past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0
   1029 if inputs_embeds is None:
-> 1030     inputs_embeds = self.embed_tokens(input_ids)
   1032 attention_mask = self._prepare_decoder_attention_mask(
   1033     attention_mask, input_shape, inputs_embeds, past_key_values_length
   1034 )
   1036 # embed positions

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/sparse.py:162, in Embedding.forward(self, input)
    161 def forward(self, input: Tensor) -> Tensor:
--> 162     return F.embedding(
    163         input, self.weight, self.padding_idx, self.max_norm,
    164         self.norm_type, self.scale_grad_by_freq, self.sparse)

File ~/.local/lib/python3.10/site-packages/torch/nn/functional.py:2210, in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2204     # Note [embedding_renorm set_grad_enabled]
   2205     # XXX: equivalent to
   2206     # with torch.no_grad():
   2207     #   torch.embedding_renorm_
   2208     # remove once script supports set_grad_enabled
   2209     _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2210 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

IndexError: index out of range in self

This confuses me because I'm training the new tokenizer like this:

new_tokenizer = old_tokenizer.train_new_from_iterator(
                get_training_corpus(), 
                old_tokenizer.vocab_size,
                special_tokens_map=old_tokenizer.special_tokens_map,
                new_special_tokens=old_tokenizer.all_special_tokens)

specifying that its vocab_size should be the same as the old one's. The commands

print(old_tokenizer.vocab_size) # 50257
print(len(old_tokenizer.vocab)) # 50364

tell me that the old tokenizer has appended the 107 special tokens at the end of its vocab, whereas the commands

print(new_tokenizer.vocab_size) # 50257
print(len(new_tokenizer.vocab)) # 50257

tell me that the new tokenizer has prepended(?) them.
So in the old tokenizer I have
vocab = [token1, token2, ..., special_token1, special_token2...] # length 50257 + 107
and in the new
vocab = [special_token1, special_token2..., token1, token2, ...] # length 50257

@ArthurZucker
Collaborator

Okay, you might find help in huggingface/tokenizers#1277.
The tokenizer's length with additional special tokens is len(tokenizer), not tokenizer.vocab_size. You are probably using a fast tokenizer, which works a bit differently from a slow one. You need to debug which inputs produced tokens outside the range of the embedding layer and check the maximum index of the embedding layer!
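
One way to do that is to scan the prepared labels for ids that fall outside the embedding table (a sketch, reusing model and dataset_dict from the reproduction above):

embedding_size = model.get_input_embeddings().num_embeddings  # must exceed every label id

for i, example in enumerate(dataset_dict["train"]):
    bad = [t for t in example["labels"] if t >= embedding_size]
    if bad:
        print(f"example {i} has label ids outside the embedding range: {bad[:10]}")
        break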

@PeterBagnegaard
Author

I've been trying to understand how issue 1277 can help, but without success. The problem seems too different from what I'm trying to achieve.
I've run some tests to see how the ids and tokens fit together. In the old model the special token ids start right after the normal token ids at 50257 and continue all the way up to len(tokenizer). The first two special tokens after the normal tokens are bos and eos.
train_new_from_iterator seems to move all the special tokens to the beginning of the vocab dict.

def test_tokenizer(tokenizer):
    idxs = [tokenizer.vocab[special_token] for special_token in tokenizer.all_special_tokens]
    is_wrong = all([idx < tokenizer.vocab_size for idx in idxs])
    print(f"Are special tokens after normal tokens? {not is_wrong}")
    print(f"bos_token: {tokenizer.vocab['<|startoftranscript|>']} eos_token: {tokenizer.vocab['<|endoftext|>']}")
    print("Special token ids: " + ", ".join([str(idx) for idx in idxs]))

def max_key_val(tokenizer):
    d = tokenizer.vocab
    key = max(d, key=d.get)
    return key, d[key]

def min_key_val(tokenizer):
    d = tokenizer.vocab
    key = min(d, key=d.get)
    return key, d[key]

print(f"Old tokenizer: \n{len(old_tokenizer)=} | {old_tokenizer.vocab_size=} | {min_key_val(old_tokenizer)=} | {max_key_val(old_tokenizer)=}")
test_tokenizer(old_tokenizer)

print(f"\nNew tokenizer: \n{len(new_tokenizer)=} | {new_tokenizer.vocab_size=} | {min_key_val(new_tokenizer)=} | {max_key_val(new_tokenizer)=}")
test_tokenizer(new_tokenizer)
Old tokenizer: 
len(old_tokenizer)=50364 | old_tokenizer.vocab_size=50257 | min_key_val(old_tokenizer)=('!', 0) | max_key_val(old_tokenizer)=('<|notimestamps|>', 50363)
Are special tokens after normal tokens? True
bos_token: 50258 eos_token: 50257
Special token ids: 50257, 50256, 50257, 50258, 50259, 50260, 50261, 50262, 50263, 50264, 50265, 50266, 50267, 50268, 50269, 50270, 50271, 50272, 50273, 50274, 50275, 50276, 50277, 50278, 50279, 50280, 50281, 50282, 50283, 50284, 50285, 50286, 50287, 50288, 50289, 50290, 50291, 50292, 50293, 50294, 50295, 50296, 50297, 50298, 50299, 50300, 50301, 50302, 50303, 50304, 50305, 50306, 50307, 50308, 50309, 50310, 50311, 50312, 50313, 50314, 50315, 50316, 50317, 50318, 50319, 50320, 50321, 50322, 50323, 50324, 50325, 50326, 50327, 50328, 50329, 50330, 50331, 50332, 50333, 50334, 50335, 50336, 50337, 50338, 50339, 50340, 50341, 50342, 50343, 50344, 50345, 50346, 50347, 50348, 50349, 50350, 50351, 50352, 50353, 50354, 50355, 50356, 50357, 50358, 50359, 50360, 50361, 50362, 50363

New tokenizer: 
len(new_tokenizer)=50257 | new_tokenizer.vocab_size=50257 | min_key_val(new_tokenizer)=('<|endoftext|>', 0) | max_key_val(new_tokenizer)=('sebiopsi', 50256)
Are special tokens after normal tokens? False
bos_token: 1 eos_token: 0
Special token ids: 0, 107, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107

The model expects the bos and eos at indices 50258 and 50257, but after using train_new_from_iterator these indices are wrong.

model.config
WhisperConfig {
  "_name_or_path": "openai/whisper-medium",
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "apply_spec_augment": false,
  "architectures": [
    "WhisperForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "begin_suppress_tokens": [
    220,
    50257
  ],
  "bos_token_id": 50257, <==========
  "classifier_proj_size": 256,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 24,
  "decoder_start_token_id": 50258,
  "dropout": 0.0,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 24,
  eos_token_id": 50257, <==========
  "forced_decoder_ids": null,
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "mask_feature_length": 10,
  "mask_feature_min_masks": 0,
  "mask_feature_prob": 0.0,
  "mask_time_length": 10,
  "mask_time_min_masks": 2,
  "mask_time_prob": 0.05,
  "max_length": 448,
  "max_source_positions": 1500,
  "max_target_positions": 448,
  "model_type": "whisper",
  "num_hidden_layers": 24,
  "num_mel_bins": 80,
  "pad_token_id": 50257,
  "scale_embedding": false,
  "suppress_tokens": [],
  "torch_dtype": "float32",
  "transformers_version": "4.28.0.dev0",
  "use_cache": true,
  "use_weighted_layer_sum": false,
  vocab_size": 50364, <==========
}

I can make the error go away by setting vocab_size = len(old_tokenizer), but the ids still won't line up.
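A quick way to see the mismatch, reusing model and new_tokenizer from the reproducer above (the expected values come from the dumps earlier in this comment):

print(model.config.decoder_start_token_id,
      new_tokenizer.convert_tokens_to_ids("<|startoftranscript|>"))  # 50258 vs 1
print(model.config.eos_token_id,
      new_tokenizer.convert_tokens_to_ids("<|endoftext|>"))          # 50257 vs 0
# Changing vocab_size only changes the size of the embedding matrix; the ids the
# new tokenizer produces still point at rows the pretrained model learned for
# completely different tokens.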

Maybe I should use a SentencePiece tokenizer to create a vocab file, but there are some problems with this too.
In my tokenizer folder I have both vocab.json and tokenizer.json, both of which contain the full vocab (for some reason?).
tokenizer.json also contains information about the special tokens, which I'm interested in.
I'm considering replacing tokenizer.json => 'model' => 'vocab' and vocab.json with the corrected vocab, but because the special tokens have been added to these in the new tokenizer, I'd have to find the indices of all normal tokens and shift them back by the number of special tokens, roughly as sketched below.
There has to be a simpler way of doing this. This seems like an obvious error in train_new_from_iterator?
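A rough sketch of that remapping, assuming the special tokens really do occupy the first ids of tokenizer.json => 'model' => 'vocab' as the dumps above suggest (the ids in the added_tokens section, and the fast tokenizer built from the file, would need the same treatment):

import json

with open("tokenizer.json") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]                      # token -> id from the trained BPE model
specials = [t["content"] for t in tok["added_tokens"]]
special_set = set(specials)
n_special = len(specials)

# Shift the normal tokens down by the number of specials ...
remapped = {token: idx - n_special
            for token, idx in vocab.items()
            if token not in special_set}

# ... and append the specials after them, matching the old tokenizer's layout
base = len(remapped)
for offset, token in enumerate(specials):
    remapped[token] = base + offset

tok["model"]["vocab"] = remapped
with open("tokenizer_remapped.json", "w") as f:
    json.dump(tok, f, ensure_ascii=False)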

@ArthurZucker
Copy link
Collaborator

I'll try to have a look 😉

@ArthurZucker
Collaborator

Okay, let's just take this step by step as the reproducer is huge and involved.

  1. What are you trying to achieve by training a new tokenizer? Do you have a new language?

  2. What could be wrong here:

new_tokenizer = old_tokenizer.train_new_from_iterator(
                get_training_corpus(), 
                old_tokenizer.vocab_size,
                special_tokens_map=old_tokenizer.special_tokens_map,
                new_special_tokens=old_tokenizer.all_special_tokens)

For me this is problematic, because the content of old_tokenizer.special_tokens_map is also in old_tokenizer.all_special_tokens. I would heavily suggest removing it.
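A quick check that shows the overlap, using old_tokenizer from the reproducer:

map_tokens = set()
for value in old_tokenizer.special_tokens_map.values():
    map_tokens.update(value if isinstance(value, list) else [value])

# Every entry of special_tokens_map already appears in all_special_tokens,
# so passing both effectively declares the same specials twice
print(map_tokens <= set(old_tokenizer.all_special_tokens))  # True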

Also, this was not in the training example provided, so I'm not really sure why you are adding it?

@ArthurZucker
Collaborator

Could you share a pushed version of the tokenizers?

@PeterBagnegaard
Author

PeterBagnegaard commented Sep 12, 2023

  1. I have a dataset using specialized language. There is a lot of technical jargon that the standard Whisper tokenizer doesn't handle well.
  2. This might very well be wrong; I added it to check whether it would solve my problem. When training a new tokenizer as
new_tokenizer = old_tokenizer.train_new_from_iterator(
                get_training_corpus(), 
                old_tokenizer.vocab_size,
                special_tokens_map=old_tokenizer.special_tokens_map,
                new_special_tokens=old_tokenizer.all_special_tokens)

and

new_tokenizer = old_tokenizer.train_new_from_iterator(
                get_training_corpus(), 
                old_tokenizer.vocab_size,
                special_tokens_map=old_tokenizer.special_tokens_map)

and

new_tokenizer = old_tokenizer.train_new_from_iterator(
                get_training_corpus(), 
                old_tokenizer.vocab_size)

I get the same error. In all cases, the special tokens are placed at the beginning of new_tokenizer.vocab rather than at the end, as in old_tokenizer.vocab.

Could you share a pushed version of the tokenizers?

Do you need me to share the folder containing vocab.json, tokenizer.json, merges.txt, etc.?

@ArthurZucker
Collaborator

Yes, push the tokenizer to the hub and I'll be able to have a look at the internal state 😉

@PeterBagnegaard
Author

PeterBagnegaard commented Sep 12, 2023

This is my first time using this feature. It should be available at peterBagnegaard/new_tokenizer.

I made it using the following lines

whisper = WhisperTokenizerFast.from_pretrained("openai/whisper-medium", language="danish")

whisper_new = whisper.train_new_from_iterator(
    get_training_corpus(),
    whisper.vocab_size)

whisper_new.push_to_hub("new_tokenizer")
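Loading it back from the Hub reproduces the same internal state (reusing test_tokenizer from my earlier comment):

pushed = WhisperTokenizerFast.from_pretrained("peterBagnegaard/new_tokenizer")
test_tokenizer(pushed)  # special tokens again end up at the start of the vocab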

@ArthurZucker
Collaborator

Thanks! We actually have a few tests on our CI that should ensure that we can train a tokenizer from an old tokenizer, so if this is indeed a bug we'll have to fix it!

@PeterBagnegaard
Author

PeterBagnegaard commented Sep 13, 2023

This might confuse more than it helps, but I've tried training my own tokenizer using the BpeTrainer, inspired by huggingface/tokenizers#1277.

from tokenizers import AddedToken, trainers
from transformers import WhisperTokenizerFast

# Based either on jstoone or openai
old_tokenizer = WhisperTokenizerFast.from_pretrained("jstoone/whisper-medium-da", language="danish")
# old_tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-medium", language="danish")

tokenizer = old_tokenizer.backend_tokenizer

# Either adding special tokens to the trainer or not
trainer = trainers.BpeTrainer(vocab_size=old_tokenizer.vocab_size)  # , special_tokens=old_tokenizer.all_special_tokens)

tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

tokenizer.save("tokenizer.json")

fast_tokenizer = WhisperTokenizerFast(
    tokenizer_file="tokenizer.json",
    model_max_length=old_tokenizer.model_max_length,
    language='danish',
    task='transcribe',
    predict_timestamps=True)

# Carry the old tokenizer's special tokens over to the new one
special_tokens = {"bos_token": AddedToken(old_tokenizer.bos_token or "", normalized=True),
                  "eos_token": AddedToken(old_tokenizer.eos_token or "", normalized=True),
                  "unk_token": AddedToken(old_tokenizer.unk_token or "[UNK]", normalized=True),
                  "sep_token": old_tokenizer.sep_token or "",
                  "pad_token": old_tokenizer.pad_token or "",
                  "cls_token": old_tokenizer.cls_token or "",
                  "mask_token": old_tokenizer.mask_token or "",
                  "additional_special_tokens": old_tokenizer.additional_special_tokens}

fast_tokenizer.add_special_tokens(special_tokens)

fast_tokenizer.set_prefix_tokens(task='transcribe', language='danish')

I've been experimenting with both OpenAI's tokenizer and the tokenizer made by jstoone (the one I'm fine-tuning further).
I've also tried both adding and not adding the special tokens to the trainer. This gives four possibilities:

OpenAI + added special tokens: [FAILS] special tokens are placed first in the vocab
jstoone + added special tokens: [FAILS] special tokens are placed first in the vocab
OpenAI + no added special tokens: [PANICS] train_from_iterator throws PanicException: Missing additional token
jstoone + no added special tokens: [WORKS] special tokens are placed last in the vocab

So while I can technically continue, this seems like a problem (I am so confused!)
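For the combination that works, fast_tokenizer from the snippet above can be checked with test_tokenizer from my earlier comment:

test_tokenizer(fast_tokenizer)
# With the jstoone tokenizer and no specials passed to the trainer, this prints
# "Are special tokens after normal tokens? True", matching the old layout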

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@ArthurZucker
Collaborator

Glad to know that this worked. A few major changes were recently pushed to the transformers library regarding added tokens, which might also have fixed some of the issues you were facing!


github-actions bot commented Nov 7, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@P-Sood

P-Sood commented Nov 20, 2023

@PeterBagnegaard Did you ever get this to work? I am doing the same thing as you, but my model is predicting gibberish at the end.

Were you able to get Whisper to correctly learn a new tokenizer, and if you could, how did you?

@ArthurZucker
Collaborator

If you train a new tokenizer, the model will have to be trained from scratch, as you are learning a new mapping from tokens to ids that is miles away from the one the model was trained on.
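If the goal is mainly to cover extra domain jargon rather than to replace the whole mapping, a common middle ground is to add the new tokens to the existing tokenizer and resize the embeddings, so every pretrained id keeps its meaning. A minimal sketch, where domain_terms is a made-up list:

from transformers import WhisperForConditionalGeneration, WhisperTokenizerFast

tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-medium", language="danish")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")

domain_terms = ["sebiopsi"]              # hypothetical jargon collected from the corpus
num_added = tokenizer.add_tokens(domain_terms)

# New rows are appended to the embedding matrix; all pretrained ids keep their weights,
# but the appended rows are randomly initialised, so some fine-tuning is still needed.
model.resize_token_embeddings(len(tokenizer))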
