
Running inference from ASR documentation, pipeline errors with "Can't load tokenizer" #23188

Closed
1 of 4 tasks
RobertBaruch opened this issue May 7, 2023 · 6 comments

Comments

@RobertBaruch
Contributor

System Info

  • transformers version: 4.28.1
  • Platform: Windows-10-10.0.22621-SP0
  • Python version: 3.11.2
  • Huggingface_hub version: 0.14.1
  • Safetensors version: not installed
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: NO
  • Using distributed or parallel set-up in script?: NO

Who can help?

@Narsil
@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Put together the script from the Automatic Speech Recognition guide into a file main.py, up to but not including the Inference section.

Run under Windows. Training succeeds.

Put together the Inference section into a file infer.py.

Run under Windows.

Output:

Downloading pytorch_model.bin: 100%|██████████████████████████████████████████████████████████████████████████████████| 378M/378M [00:35<00:00, 10.6MB/s]
Traceback (most recent call last):
  File "f:\eo-reco\infer.py", line 10, in <module>
    transcriber = pipeline("automatic-speech-recognition", model="xekri/my_awesome_asr_model")
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "f:\eo-reco\.env\Lib\site-packages\transformers\pipelines\__init__.py", line 876, in pipeline
    tokenizer = AutoTokenizer.from_pretrained(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "f:\eo-reco\.env\Lib\site-packages\transformers\models\auto\tokenization_auto.py", line 723, in from_pretrained
    return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "f:\eo-reco\.env\Lib\site-packages\transformers\tokenization_utils_base.py", line 1795, in from_pretrained
    raise EnvironmentError(
OSError: Can't load tokenizer for 'xekri/my_awesome_asr_model'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'xekri/my_awesome_asr_model' is the correct path to a directory containing all relevant files for a Wav2Vec2CTCTokenizer tokenizer.

main.py.zip
infer.py.zip

Expected behavior

No error.
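For context on the OSError above: it is raised when neither the Hub repo nor a local directory provides the files the tokenizer needs. A minimal stdlib sketch of that kind of check — the helper name and required-file list here are illustrative, not part of transformers:

```python
from pathlib import Path
import tempfile

# Files a Wav2Vec2CTCTokenizer load typically expects; list is illustrative
REQUIRED_FILES = ["vocab.json", "tokenizer_config.json"]

def missing_tokenizer_files(model_dir: str) -> list[str]:
    """Return the required tokenizer files absent from a local model directory."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]

# A directory holding only vocab.json still cannot back a tokenizer load
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "vocab.json").write_text("{}")
    missing = missing_tokenizer_files(d)
    print(missing)  # → ['tokenizer_config.json']
```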

@hollance
Contributor

hollance commented May 8, 2023

Your code works fine for me on macOS (I tried with the main branch of Transformers, which is version 4.29.0.dev0). It also looks like the tokenizer_config.json is present in your model repo, so all the required files are present.

Are you sure you don't have a F:\eo-reco\xekri\my_awesome_asr_model directory that would be interfering with this?

@RobertBaruch
Contributor Author

RobertBaruch commented May 8, 2023

The problem happens even if I delete the local directory.

So the problem appears to be that there is a missing step in the docs:

processor.save_pretrained(save_directory="my_awesome_asr_mind_model")

Without this, there is no tokenizer_config.json.

The reason tokenizer_config.json was present in my repo is that I added the line and then ran the program again.

If you look at main.py.zip above, you can see where I had the line commented out. With that line commented out, the error happens.
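The missing step can be verified end to end without touching the Hub: build a processor from a toy vocab and confirm that `save_pretrained` is what writes tokenizer_config.json. The vocab contents and directory names below are made up for illustration; a real script builds the vocab from its training data:

```python
import json
import os
import tempfile

from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
)

work = tempfile.mkdtemp()

# Toy vocab purely for illustration
vocab_path = os.path.join(work, "vocab.json")
with open(vocab_path, "w") as f:
    json.dump({"<pad>": 0, "<unk>": 1, "|": 2, "a": 3, "b": 4}, f)

processor = Wav2Vec2Processor(
    feature_extractor=Wav2Vec2FeatureExtractor(),
    tokenizer=Wav2Vec2CTCTokenizer(
        vocab_path, unk_token="<unk>", pad_token="<pad>", word_delimiter_token="|"
    ),
)

# The step the docs were missing: this writes tokenizer_config.json
# (alongside vocab.json, special_tokens_map.json, preprocessor_config.json)
save_dir = os.path.join(work, "my_awesome_asr_model")
processor.save_pretrained(save_dir)
print(sorted(os.listdir(save_dir)))
```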

@hollance
Contributor

hollance commented May 8, 2023

It does look like those instructions are missing from the docs, I'll ping someone from the docs team to have a look. Thanks for reporting!

@RobertBaruch
Contributor Author

Possibly related: #23222

@MKhalusova
Contributor

Thanks for reporting this! If you pass the processor to the Trainer, it will save both the tokenizer and the feature_extractor, and push them both to the Hub. I'll update the docs. #23239

@github-actions

github-actions bot commented Jun 6, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
