
Running inference from ASR documentation, pipeline errors with "Can't load tokenizer" #23188

Closed
1 of 4 tasks
RobertBaruch opened this issue May 7, 2023 · 6 comments

Comments

@RobertBaruch
Contributor

System Info

  • transformers version: 4.28.1
  • Platform: Windows-10-10.0.22621-SP0
  • Python version: 3.11.2
  • Huggingface_hub version: 0.14.1
  • Safetensors version: not installed
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: NO
  • Using distributed or parallel set-up in script?: NO

Who can help?

@Narsil
@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Put together the script from the Automatic Speech Recognition guide into a file main.py, up to but not including the Inference section.

Run under Windows. Training succeeds.

Put together the Inference section into a file infer.py.

Run under Windows.

Output:

Downloading pytorch_model.bin: 100%|██████████████████████████████████████████████████████████████████████████████████| 378M/378M [00:35<00:00, 10.6MB/s]
Traceback (most recent call last):
  File "f:\eo-reco\infer.py", line 10, in <module>
    transcriber = pipeline("automatic-speech-recognition", model="xekri/my_awesome_asr_model")
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "f:\eo-reco\.env\Lib\site-packages\transformers\pipelines\__init__.py", line 876, in pipeline
    tokenizer = AutoTokenizer.from_pretrained(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "f:\eo-reco\.env\Lib\site-packages\transformers\models\auto\tokenization_auto.py", line 723, in from_pretrained
    return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "f:\eo-reco\.env\Lib\site-packages\transformers\tokenization_utils_base.py", line 1795, in from_pretrained
    raise EnvironmentError(
OSError: Can't load tokenizer for 'xekri/my_awesome_asr_model'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'xekri/my_awesome_asr_model' is the correct path to a directory containing all relevant files for a Wav2Vec2CTCTokenizer tokenizer.

main.py.zip
infer.py.zip

Expected behavior

No error.
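For context on the OSError above: it is raised when neither the Hub repo nor a local directory provides the files the tokenizer needs. A minimal stdlib sketch of that kind of check — the helper name and required-file list here are illustrative, not part of transformers:

```python
from pathlib import Path
import tempfile

# Files a Wav2Vec2CTCTokenizer load typically expects; list is illustrative
REQUIRED_FILES = ["vocab.json", "tokenizer_config.json"]

def missing_tokenizer_files(model_dir: str) -> list[str]:
    """Return the required tokenizer files absent from a local model directory."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]

# A directory holding only vocab.json still cannot back a tokenizer load
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "vocab.json").write_text("{}")
    missing = missing_tokenizer_files(d)
    print(missing)  # → ['tokenizer_config.json']
```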

@hollance
Contributor

hollance commented May 8, 2023

Your code works fine for me on macOS (I tried with the main branch of Transformers, which is version 4.29.0.dev0). It also looks like the tokenizer_config.json is present in your model repo, so all the required files are present.

Are you sure you don't have a F:\eo-reco\xekri\my_awesome_asr_model directory that would be interfering with this?

@RobertBaruch
Contributor Author

RobertBaruch commented May 8, 2023

The problem happens even if I delete the local directory.

So the problem appears to be that there is a missing step in the docs:

processor.save_pretrained(save_directory="my_awesome_asr_mind_model")

Without this, there is no tokenizer_config.json.

The reason tokenizer_config.json was present in my repo is that I added the line and then ran the program again.

If you look at main.py.zip above, you can see where I had the line commented out. With that line commented out, the error happens.
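The missing step can be verified end to end without touching the Hub: build a processor from a toy vocab and confirm that `save_pretrained` is what writes tokenizer_config.json. The vocab contents and directory names below are made up for illustration; a real script builds the vocab from its training data:

```python
import json
import os
import tempfile

from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
)

work = tempfile.mkdtemp()

# Toy vocab purely for illustration
vocab_path = os.path.join(work, "vocab.json")
with open(vocab_path, "w") as f:
    json.dump({"<pad>": 0, "<unk>": 1, "|": 2, "a": 3, "b": 4}, f)

processor = Wav2Vec2Processor(
    feature_extractor=Wav2Vec2FeatureExtractor(),
    tokenizer=Wav2Vec2CTCTokenizer(
        vocab_path, unk_token="<unk>", pad_token="<pad>", word_delimiter_token="|"
    ),
)

# The step the docs were missing: this writes tokenizer_config.json
# (alongside vocab.json, special_tokens_map.json, preprocessor_config.json)
save_dir = os.path.join(work, "my_awesome_asr_model")
processor.save_pretrained(save_dir)
print(sorted(os.listdir(save_dir)))
```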

@hollance
Contributor

hollance commented May 8, 2023

It does look like those instructions are missing from the docs, I'll ping someone from the docs team to have a look. Thanks for reporting!

@RobertBaruch
Contributor Author

Possibly related: #23222

@MKhalusova
Contributor

Thanks for reporting this! If you pass the processor to the Trainer, it will save both the tokenizer and the feature_extractor, and push them both to the Hub. I'll update the docs. #23239

@github-actions

github-actions bot commented Jun 6, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
