ASR example doesn't save tokenizer settings #23222

RobertBaruch · 2023-05-08T23:45:16Z

System Info

transformers version: 4.28.1
Platform: Windows-10-10.0.22621-SP0
Python version: 3.11.2
Huggingface_hub version: 0.14.1
Safetensors version: not installed
PyTorch version (GPU?): 2.0.1+cu117 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: NO
Using distributed or parallel set-up in script?: NO

Who can help?

@sgugger

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

Run training using run_speech_recognition_ctc.py and the included JSON file.

train.json.zip

Next, attempt to infer using the trained model:

import os.path

from datasets import load_dataset
from datasets import Audio
from transformers import pipeline, AutomaticSpeechRecognitionPipeline

cv13 = load_dataset(
    "mozilla-foundation/common_voice_13_0",
    "eo",
    split="train[:10]",
    )
print(cv13[0])
cv13 = cv13.cast_column("audio", Audio(sampling_rate=16000))
sampling_rate = cv13.features["audio"].sampling_rate
audio_file = cv13[0]["audio"]["path"]
d, n = os.path.split(audio_file)
audio_file = os.path.join(d, "eo_train_0", n)
print(audio_file)

transcriber: AutomaticSpeechRecognitionPipeline = pipeline(
    "automatic-speech-recognition",
    model="xekri/wav2vec2-common_voice_13_0-eo-demo2",
)
print(transcriber(audio_file))

Output:

Found cached dataset common_voice_13_0 (C:/Users/rober/.cache/huggingface/datasets/mozilla-foundation___common_voice_13_0/eo/13.0.0/22809012aac1fc9803eaffc44122e4149043748e93933935d5ea19898587e4d7)
{'client_id': 'b8c51543fe043c8f27d0de0428e060e309d9d824ac9ad33e40aba7062dafd99e2e87bbedc671007e31973afb599b1c290dbd922637b79132727b5f37bc1ee88e', 'path': 'C:\\Users\\rober\\.cache\\huggingface\\datasets\\downloads\\extracted\\1dea8f044902d398c6cb09bfb5629dc2fbd80a6309ddd435c4554fa38f730472\\common_voice_eo_20453647.mp3', 'audio': {'path': 'C:\\Users\\rober\\.cache\\huggingface\\datasets\\downloads\\extracted\\1dea8f044902d398c6cb09bfb5629dc2fbd80a6309ddd435c4554fa38f730472\\common_voice_eo_20453647.mp3', 'array': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
       -1.16407300e-11,  1.07661449e-12, -1.71219774e-11]), 'sampling_rate': 48000}, 'sentence': 'Ĉu ili tiel plaĉas al vi?', 'up_votes': 2, 'down_votes': 0, 'age': 'twenties', 'gender': 'male', 'accent': 'Internacia', 'locale': 'eo', 'segment': '', 'variant': ''}
C:\Users\rober\.cache\huggingface\datasets\downloads\extracted\1dea8f044902d398c6cb09bfb5629dc2fbd80a6309ddd435c4554fa38f730472\eo_train_0\common_voice_eo_20453647.mp3
Downloading (…)lve/main/config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.27k/2.27k [00:00<?, ?B/s]
F:\eo-reco\.env\Lib\site-packages\huggingface_hub\file_download.py:133: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\rober\.cache\huggingface\hub. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  warnings.warn(message)
Downloading pytorch_model.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.26G/1.26G [01:56<00:00, 10.8MB/s]
Traceback (most recent call last):
  File "F:\eo-reco\infer.py", line 20, in <module>
    transcriber: AutomaticSpeechRecognitionPipeline = pipeline(
                                                      ^^^^^^^^^
  File "F:\eo-reco\.env\Lib\site-packages\transformers\pipelines\__init__.py", line 876, in pipeline
    tokenizer = AutoTokenizer.from_pretrained(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\eo-reco\.env\Lib\site-packages\transformers\models\auto\tokenization_auto.py", line 723, in from_pretrained
    return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\eo-reco\.env\Lib\site-packages\transformers\tokenization_utils_base.py", line 1795, in from_pretrained
    raise EnvironmentError(
OSError: Can't load tokenizer for 'xekri/wav2vec2-common_voice_13_0-eo-demo2'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'xekri/wav2vec2-common_voice_13_0-eo-demo2' is the correct path to a directory containing all relevant files for a Wav2Vec2CTCTokenizer tokenizer.

Checking the uploaded repo, it seems that no tokenizer-related files (e.g. vocab.json, tokenizer_config.json, etc) were pushed.
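
A quick way to confirm which files are missing is to compare a repo's file list against the files a tokenizer normally writes. The following is only a minimal sketch: `missing_tokenizer_files` and `EXPECTED_TOKENIZER_FILES` are hypothetical names, the expected-file tuple is taken from the files mentioned in this issue rather than from an exhaustive spec, and the sample file list is illustrative, not the actual contents of the repo.

```python
# Hypothetical helper; not part of transformers or huggingface_hub.
# The expected names are the tokenizer files mentioned in this issue.
EXPECTED_TOKENIZER_FILES = (
    "vocab.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
)


def missing_tokenizer_files(repo_files):
    """Return the expected tokenizer files that are absent from repo_files."""
    present = set(repo_files)
    return [name for name in EXPECTED_TOKENIZER_FILES if name not in present]


# Illustrative file list for a repo that only received model files.
sample_files = ["config.json", "pytorch_model.bin", "preprocessor_config.json"]
print(missing_tokenizer_files(sample_files))
```

For a real check, the file list could come from `huggingface_hub.list_repo_files(repo_id)`, which needs network access.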

I added some debug output to run_speech_recognition_ctc.py and found that these files were generated locally, but were deleted during step 7, when the Trainer was initialized (line 701).

The output from run_speech_recognition_ctc.py at that point was:

loading file vocab.json
loading file tokenizer_config.json
loading file added_tokens.json
loading file special_tokens_map.json
Adding <s> to the vocabulary
Adding </s> to the vocabulary
Cloning https://huggingface.co/xekri/wav2vec2-common_voice_13_0-eo-demo into local empty directory.
05/08/2023 15:06:23 - WARNING - huggingface_hub.repository - Cloning https://huggingface.co/xekri/wav2vec2-common_voice_13_0-eo-demo into local empty directory.
max_steps is given, it will override any value given in num_train_epochs

It seems that instantiating Trainer with push_to_hub=true creates a new repo and then empties the local output directory so that it can clone the repo into it. This deletes any files already written to that directory, including the tokenizer configs.

Expected behavior

No error.


RobertBaruch · 2023-05-09T00:26:34Z

The docstring on Trainer.push_to_hub does say "Upload *self.model* and *self.tokenizer* to the 🤗 model hub", and it does in fact call save_pretrained on the trainer's tokenizer. However, in run_speech_recognition_ctc.py, the Trainer's tokenizer argument is set to feature_extractor at initialization, and Wav2Vec2FeatureExtractor.save_pretrained does not save tokenizer settings.

RobertBaruch · 2023-05-09T01:48:53Z

When I replace these lines at the end of run_speech_recognition_ctc.py, changing them from this:

    if training_args.push_to_hub:
        trainer.push_to_hub(**kwargs)
    else:
        trainer.create_model_card(**kwargs)

to this:

    tokenizer.save_pretrained(training_args.output_dir)
    trainer.create_model_card(**kwargs)
    if training_args.push_to_hub:
        trainer.push_to_hub(**kwargs)

we do get tokenizer files. Also, may as well write the model card in any case.

amyeroberts · 2023-05-09T11:44:54Z

cc @sanchit-gandhi

hollance · 2023-05-09T13:44:31Z

The code in the run_speech_recognition_ctc.py script as well as the instructions from the ASR guide that you used in issue #23188 do the following:

trainer = Trainer(
    ...
    tokenizer=processor.feature_extractor,
    ...
)

The "processor" combines the feature extractor and tokenizer into a single class, but because we only pass the feature extractor to the Trainer, the tokenizer doesn't get saved. So that's clearly a mistake on our end.

The following fix should work:

trainer = Trainer(
    ...
    tokenizer=processor,
    ...
)

We're updating the docs to fix this. (It's a bit confusing that this argument from Trainer is called tokenizer but that's what's responsible for saving the non-model stuff.)
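
To illustrate the mechanism described above, here is a minimal sketch using stand-in classes rather than the real transformers objects. Every class and function in it (`FakeFeatureExtractor`, `FakeTokenizer`, `FakeProcessor`, `FakeTrainer`, `saved_files`) is hypothetical, and the file names are simplified. It only assumes what this thread states: the Trainer calls save_pretrained on whatever was passed as tokenizer, and the processor saves both of its components.

```python
import os
import tempfile


def _touch(save_dir, name):
    # Create an empty file to stand in for a saved config.
    open(os.path.join(save_dir, name), "w").close()


class FakeFeatureExtractor:
    def save_pretrained(self, save_dir):
        _touch(save_dir, "preprocessor_config.json")


class FakeTokenizer:
    def save_pretrained(self, save_dir):
        _touch(save_dir, "vocab.json")
        _touch(save_dir, "tokenizer_config.json")


class FakeProcessor:
    """Stand-in for a processor wrapping a feature extractor and a tokenizer."""

    def __init__(self, feature_extractor, tokenizer):
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer

    def save_pretrained(self, save_dir):
        self.feature_extractor.save_pretrained(save_dir)
        self.tokenizer.save_pretrained(save_dir)


class FakeTrainer:
    """Stand-in Trainer: saves only the object passed as `tokenizer`."""

    def __init__(self, tokenizer=None):
        self.tokenizer = tokenizer

    def save(self, output_dir):
        if self.tokenizer is not None:
            self.tokenizer.save_pretrained(output_dir)


def saved_files(preprocessor):
    # Save into a throwaway directory and report what was written.
    with tempfile.TemporaryDirectory() as tmp:
        FakeTrainer(tokenizer=preprocessor).save(tmp)
        return sorted(os.listdir(tmp))


processor = FakeProcessor(FakeFeatureExtractor(), FakeTokenizer())
print(saved_files(processor.feature_extractor))  # feature extractor config only
print(saved_files(processor))                    # tokenizer files as well
```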

sanchit-gandhi · 2023-05-09T16:17:13Z

Probably we can directly add a new argument to the Trainer for the processor @hollance? This would stop all confusion IMO:

trainer = Trainer(
    ...
    processor=processor,
    ...
)

Here we could expect the user to pass either one of tokenizer or processor to the Trainer. Within the Trainer we only use the tokenizer to get the model input name, which after #20117 we can now get directly from the processor.
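
As a rough sketch of what that argument handling might look like (purely hypothetical, not the Trainer's actual code): `resolve_preprocessor` and `main_input_name` are made-up helper names, and the `"input_ids"` fallback is an assumption. The only thing taken from the library is the `model_input_names` attribute.

```python
from types import SimpleNamespace


def resolve_preprocessor(tokenizer=None, processor=None):
    """Accept either `tokenizer` or `processor`, but not both (hypothetical)."""
    if tokenizer is not None and processor is not None:
        raise ValueError("Pass either `tokenizer` or `processor`, not both.")
    return processor if processor is not None else tokenizer


def main_input_name(preprocessor, default="input_ids"):
    """Read the first model input name from the preprocessor, if it has one."""
    names = getattr(preprocessor, "model_input_names", None)
    return names[0] if names else default


# Stand-in for a speech processor exposing model_input_names.
fake_processor = SimpleNamespace(model_input_names=["input_values", "attention_mask"])
chosen = resolve_preprocessor(processor=fake_processor)
print(main_input_name(chosen))
```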

RobertBaruch · 2023-05-10T00:08:22Z

Can confirm, setting tokenizer=processor in run_speech_recognition_ctc.py works. Agree that tokenizer is a bit of a misleading keyword then.

sanchit-gandhi · 2023-06-12T16:34:30Z

Keeping this open since we really should update the Trainer to take processor as an argument over tokenizer=processor

github-actions · 2023-08-01T15:03:11Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

RobertBaruch mentioned this issue May 8, 2023

Running inference from ASR documentation, pipeline errors with "Can't load tokenizer" #23188

Closed


MKhalusova mentioned this issue May 9, 2023

[docs] Audio task guides fixes #23239

Merged

huggingface deleted a comment from github-actions bot Jun 12, 2023

huggingface deleted a comment from github-actions bot Jul 7, 2023

github-actions bot closed this as completed Aug 9, 2023

sanchit-gandhi reopened this May 16, 2024

sanchit-gandhi added the Audio and Good Second Issue labels May 16, 2024

sanchit-gandhi self-assigned this May 16, 2024

sanchit-gandhi mentioned this issue May 16, 2024

[trainer] allow processor instead of tokenizer #30864

Closed
