ASR example doesn't save tokenizer settings #23222

Open
2 of 4 tasks
RobertBaruch opened this issue May 8, 2023 · 8 comments

@RobertBaruch
Contributor

RobertBaruch commented May 8, 2023

System Info

  • transformers version: 4.28.1
  • Platform: Windows-10-10.0.22621-SP0
  • Python version: 3.11.2
  • Huggingface_hub version: 0.14.1
  • Safetensors version: not installed
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: NO
  • Using distributed or parallel set-up in script?: NO

Who can help?

@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run training using run_speech_recognition_ctc.py and the included json file.

train.json.zip

Next, attempt to infer using the trained model:

import os.path

from datasets import load_dataset
from datasets import Audio
from transformers import pipeline, AutomaticSpeechRecognitionPipeline

cv13 = load_dataset(
    "mozilla-foundation/common_voice_13_0",
    "eo",
    split="train[:10]",
    )
print(cv13[0])
cv13 = cv13.cast_column("audio", Audio(sampling_rate=16000))
sampling_rate = cv13.features["audio"].sampling_rate
audio_file = cv13[0]["audio"]["path"]
d, n = os.path.split(audio_file)
audio_file = os.path.join(d, "eo_train_0", n)
print(audio_file)

transcriber: AutomaticSpeechRecognitionPipeline = pipeline(
    "automatic-speech-recognition",
    model="xekri/wav2vec2-common_voice_13_0-eo-demo2",
)
print(transcriber(audio_file))

Output:

Found cached dataset common_voice_13_0 (C:/Users/rober/.cache/huggingface/datasets/mozilla-foundation___common_voice_13_0/eo/13.0.0/22809012aac1fc9803eaffc44122e4149043748e93933935d5ea19898587e4d7)
{'client_id': 'b8c51543fe043c8f27d0de0428e060e309d9d824ac9ad33e40aba7062dafd99e2e87bbedc671007e31973afb599b1c290dbd922637b79132727b5f37bc1ee88e', 'path': 'C:\\Users\\rober\\.cache\\huggingface\\datasets\\downloads\\extracted\\1dea8f044902d398c6cb09bfb5629dc2fbd80a6309ddd435c4554fa38f730472\\common_voice_eo_20453647.mp3', 'audio': {'path': 'C:\\Users\\rober\\.cache\\huggingface\\datasets\\downloads\\extracted\\1dea8f044902d398c6cb09bfb5629dc2fbd80a6309ddd435c4554fa38f730472\\common_voice_eo_20453647.mp3', 'array': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
       -1.16407300e-11,  1.07661449e-12, -1.71219774e-11]), 'sampling_rate': 48000}, 'sentence': 'Ĉu ili tiel plaĉas al vi?', 'up_votes': 2, 'down_votes': 0, 'age': 'twenties', 'gender': 'male', 'accent': 'Internacia', 'locale': 'eo', 'segment': '', 'variant': ''}
C:\Users\rober\.cache\huggingface\datasets\downloads\extracted\1dea8f044902d398c6cb09bfb5629dc2fbd80a6309ddd435c4554fa38f730472\eo_train_0\common_voice_eo_20453647.mp3
Downloading (…)lve/main/config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.27k/2.27k [00:00<?, ?B/s]
F:\eo-reco\.env\Lib\site-packages\huggingface_hub\file_download.py:133: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\rober\.cache\huggingface\hub. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  warnings.warn(message)
Downloading pytorch_model.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.26G/1.26G [01:56<00:00, 10.8MB/s]
Traceback (most recent call last):
  File "F:\eo-reco\infer.py", line 20, in <module>
    transcriber: AutomaticSpeechRecognitionPipeline = pipeline(
                                                      ^^^^^^^^^
  File "F:\eo-reco\.env\Lib\site-packages\transformers\pipelines\__init__.py", line 876, in pipeline
    tokenizer = AutoTokenizer.from_pretrained(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\eo-reco\.env\Lib\site-packages\transformers\models\auto\tokenization_auto.py", line 723, in from_pretrained
    return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\eo-reco\.env\Lib\site-packages\transformers\tokenization_utils_base.py", line 1795, in from_pretrained
    raise EnvironmentError(
OSError: Can't load tokenizer for 'xekri/wav2vec2-common_voice_13_0-eo-demo2'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'xekri/wav2vec2-common_voice_13_0-eo-demo2' is the correct path to a directory containing all relevant files for a Wav2Vec2CTCTokenizer tokenizer.

Checking the uploaded repo, it seems that no tokenizer-related files (e.g. vocab.json, tokenizer_config.json, etc) were pushed.
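A quick way to make that check repeatable is to compare a file listing against the tokenizer files one expects. The following is a minimal sketch, not part of the example script: the helper name `missing_tokenizer_files` and the choice of `EXPECTED_TOKENIZER_FILES` are assumptions based on the file names mentioned in this issue.

```python
from typing import Iterable, List

# Assumption: these two files (named in this issue) are the minimum a saved
# Wav2Vec2CTCTokenizer needs; the exact set may differ between versions.
EXPECTED_TOKENIZER_FILES = ("vocab.json", "tokenizer_config.json")


def missing_tokenizer_files(filenames: Iterable[str]) -> List[str]:
    """Return the expected tokenizer files that are absent from `filenames`."""
    present = set(filenames)
    return [name for name in EXPECTED_TOKENIZER_FILES if name not in present]


if __name__ == "__main__":
    # Hypothetical listing of a repo that only received model files.
    repo_files = ["config.json", "preprocessor_config.json", "pytorch_model.bin"]
    print(missing_tokenizer_files(repo_files))
```

The listing can come from `os.listdir(output_dir)` for a local directory, or from `huggingface_hub.list_repo_files(repo_id)` for an uploaded repo.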

I added some debug output to run_speech_recognition_ctc.py and found that these files were generated locally, but were then deleted during step 7, when the Trainer was initialized (line 701).

The output from run_speech_recognition_ctc.py at that point was:

loading file vocab.json
loading file tokenizer_config.json
loading file added_tokens.json
loading file special_tokens_map.json
Adding <s> to the vocabulary
Adding </s> to the vocabulary
Cloning https://huggingface.co/xekri/wav2vec2-common_voice_13_0-eo-demo into local empty directory.
05/08/2023 15:06:23 - WARNING - huggingface_hub.repository - Cloning https://huggingface.co/xekri/wav2vec2-common_voice_13_0-eo-demo into local empty directory.
max_steps is given, it will override any value given in num_train_epochs

It seems that instantiating Trainer with push_to_hub=true creates a new repo and then empties the local directory so that it can clone the repo into it. This deletes any files already written to the local directory, including the tokenizer configs.
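The kind of debugging described above can be sketched as a before/after snapshot of the output directory around a suspect step. This is only an illustration using the standard library; `snapshot`, `files_changed_by`, and the `fake_step` stand-in are hypothetical names, and nothing here calls the real Trainer.

```python
import os
import tempfile
from typing import Callable, Set, Tuple


def snapshot(directory: str) -> Set[str]:
    """Return the set of file paths under `directory`, relative to it."""
    found = set()
    for root, _dirs, files in os.walk(directory):
        for name in files:
            full = os.path.join(root, name)
            found.add(os.path.relpath(full, directory))
    return found


def files_changed_by(directory: str, step: Callable[[], None]) -> Tuple[Set[str], Set[str]]:
    """Run `step` and return (deleted, created) file sets for `directory`."""
    before = snapshot(directory)
    step()
    after = snapshot(directory)
    return before - after, after - before


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        # Simulate tokenizer files written before the Trainer is created.
        for name in ("vocab.json", "tokenizer_config.json"):
            with open(os.path.join(out_dir, name), "w", encoding="utf-8") as fh:
                fh.write("{}")

        def fake_step() -> None:
            # Hypothetical stand-in for the step that empties the directory.
            for name in os.listdir(out_dir):
                os.remove(os.path.join(out_dir, name))

        deleted, created = files_changed_by(out_dir, fake_step)
        print(sorted(deleted), sorted(created))
```

In the real script, `step` would wrap the Trainer construction, so any tokenizer files that vanish would show up in the `deleted` set.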

Expected behavior

No error.

@RobertBaruch
Contributor Author

The comment on Trainer.push_to_hub does say "Upload *self.model* and *self.tokenizer* to the 🤗 model hub", and it does in fact call the trainer's tokenizer.save_pretrained function. However, in run_speech_recognition_ctc.py, the Trainer's tokenizer argument is set to feature_extractor at initialization, and Wav2Vec2FeatureExtractor.save_pretrained does not save tokenizer settings.

@RobertBaruch
Contributor Author

When I replace these lines at the end of run_speech_recognition_ctc.py from this:

    if training_args.push_to_hub:
        trainer.push_to_hub(**kwargs)
    else:
        trainer.create_model_card(**kwargs)

to this:

    tokenizer.save_pretrained(training_args.output_dir)
    trainer.create_model_card(**kwargs)
    if training_args.push_to_hub:
        trainer.push_to_hub(**kwargs)

we do get tokenizer files. Also, may as well write the model card in any case.

@amyeroberts
Collaborator

cc @sanchit-gandhi

@hollance
Contributor

hollance commented May 9, 2023

The code in the run_speech_recognition_ctc.py script as well as the instructions from the ASR guide that you used in issue #23188 do the following:

trainer = Trainer(
    ...
    tokenizer=processor.feature_extractor,
    ...
)

The "processor" combines the feature extractor and tokenizer into a single class, but because we only pass the feature extractor to the Trainer, the tokenizer doesn't get saved. So that's clearly a mistake on our end.

The following fix should work:

trainer = Trainer(
    ...
    tokenizer=processor,
    ...
)

We're updating the docs to fix this. (It's a bit confusing that this argument from Trainer is called tokenizer but that's what's responsible for saving the non-model stuff.)
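To make the difference concrete, here is a toy model of the saving behaviour described in this comment. The classes are hypothetical stand-ins, not the real transformers classes: each one only writes placeholder files, and the toy processor delegates to both of its components, which is the behaviour this comment relies on.

```python
import json
import os
import tempfile


def _write_json(directory: str, name: str) -> None:
    """Write an empty JSON placeholder file into `directory`."""
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, name), "w", encoding="utf-8") as fh:
        json.dump({}, fh)


class ToyFeatureExtractor:
    """Stand-in: saves only its own preprocessing config."""

    def save_pretrained(self, save_directory: str) -> None:
        _write_json(save_directory, "preprocessor_config.json")


class ToyTokenizer:
    """Stand-in: saves the vocabulary and tokenizer config."""

    def save_pretrained(self, save_directory: str) -> None:
        _write_json(save_directory, "vocab.json")
        _write_json(save_directory, "tokenizer_config.json")


class ToyProcessor:
    """Stand-in: wraps both components and saves each of them."""

    def __init__(self, feature_extractor: ToyFeatureExtractor, tokenizer: ToyTokenizer) -> None:
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer

    def save_pretrained(self, save_directory: str) -> None:
        self.feature_extractor.save_pretrained(save_directory)
        self.tokenizer.save_pretrained(save_directory)


def save_non_model_files(saveable, save_directory: str) -> list:
    """Save whatever object was passed in, then list the resulting files."""
    saveable.save_pretrained(save_directory)
    return sorted(os.listdir(save_directory))


if __name__ == "__main__":
    processor = ToyProcessor(ToyFeatureExtractor(), ToyTokenizer())
    with tempfile.TemporaryDirectory() as only_fe:
        print(save_non_model_files(processor.feature_extractor, only_fe))
    with tempfile.TemporaryDirectory() as whole:
        print(save_non_model_files(processor, whole))
```

Passing only the feature extractor leaves the vocabulary and tokenizer config unwritten, which matches the missing files reported at the top of this issue.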

@sanchit-gandhi
Contributor

Probably we can directly add a new argument to the Trainer for the processor @hollance? This would stop all confusion IMO:

trainer = Trainer(
    ...
    processor=processor,
    ...
)

Here we could expect the user to pass either one of tokenizer or processor to the Trainer. Within the Trainer we only use the tokenizer to get the model input name, which after #20117 we can now get directly from the processor.
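One possible shape for the "either one of tokenizer or processor" rule is sketched below. This is purely hypothetical: `resolve_saveable` is an invented helper for illustration, not an existing or planned Trainer API, and rejecting the case where both are passed is an assumption about how the ambiguity might be handled.

```python
from typing import Any, Optional


def resolve_saveable(tokenizer: Optional[Any] = None, processor: Optional[Any] = None) -> Optional[Any]:
    """Pick the single object responsible for saving the non-model files.

    Hypothetical rule: accept `tokenizer` or `processor`, but not both,
    so it is never ambiguous which one gets saved and pushed.
    """
    if tokenizer is not None and processor is not None:
        raise ValueError("Pass either `tokenizer` or `processor`, not both.")
    return processor if processor is not None else tokenizer


if __name__ == "__main__":
    # Strings stand in for real tokenizer/processor objects.
    print(resolve_saveable(processor="my_processor"))
    print(resolve_saveable(tokenizer="my_tokenizer"))
```

With a rule like this, existing code that passes tokenizer keeps working, while audio examples could pass processor without overloading the tokenizer keyword.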

@RobertBaruch
Contributor Author

Can confirm, setting tokenizer=processor in run_speech_recognition_ctc.py works. Agree that tokenizer is a bit of a misleading keyword then.

@huggingface huggingface deleted a comment from github-actions bot Jun 12, 2023
@sanchit-gandhi
Contributor

Keeping this open since we really should update the Trainer to take processor as an argument over tokenizer=processor

@huggingface huggingface deleted a comment from github-actions bot Jul 7, 2023
@github-actions

github-actions bot commented Aug 1, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Aug 9, 2023
@sanchit-gandhi sanchit-gandhi added Audio Good Second Issue Issues that are more difficult to do than "Good First" issues - give it a try if you want! labels May 16, 2024
@sanchit-gandhi sanchit-gandhi self-assigned this May 16, 2024