
Add audio to text pipeline #103

Merged — 19 commits into livepeer:main on Jul 16, 2024

Conversation

@eliteprox (Collaborator) commented Jun 12, 2024

This change adds the audio-to-text pipeline to the AI Runner

rickstaa and others added 3 commits April 27, 2024 08:54
This commit contains a quick proof of concept to showcase how easy it is
to add a new pipeline.
@eliteprox changed the title from "Speech to text pipeline" to "Add speech to text pipeline" on Jun 12, 2024
@eliteprox eliteprox marked this pull request as ready for review June 19, 2024 07:33
@eliteprox eliteprox requested a review from rickstaa as a code owner June 19, 2024 07:33
@ad-astra-video (Collaborator) commented Jul 2, 2024

I did a review of this PR. Initial comments below.

My tests included: speech-mp3-lowbitrate.mp3 (worked), speech-aac-lowbitrate.m4a (worked), vp8-opus.webm (worked), vp8-vorbis.webm (worked), vp9-vorbis.webm (worked). The h264 and h265 variants in mp4 files did not work for some reason; audio extracted from those mp4 files was processed successfully.

  • Communicate that processing was not successful when the input is in the wrong format or a failure occurs
    • I was getting an error on some files that were intentionally transcoded to a different input format. The result right now is a null/empty text response: {"chunks":null,"text":""}. I think this should be an error response code, or at least an error field in the response (error details, or "ok" when there is none). The last line in the traceback below is where transformers tries to use ffmpeg in the container to parse the file, I believe.

    • I checked and found that FFmpeg is installed in the runner container, but it is an ancient version (4.2.7).

      File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/transformers/pipelines/pt_utils.py", line 186, in __next__
        processed = next(self.subiterator)
      File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 362, in preprocess
        inputs = ffmpeg_read(inputs, self.feature_extractor.sampling_rate)
      File "/root/.pyenv/versions/3.11.9/lib/python3.11/site-packages/transformers/pipelines/audio_utils.py", line 41, in ffmpeg_read
        raise ValueError(
      ValueError: Soundfile is either not in the correct format or is malformed. Ensure that the soundfile has a valid audio file extension (e.g. wav, flac or mp3) and is not corrupted. If reading from a remote URL, ensure that the URL is the full address to **download** the audio file.

  • Is seed a parameter for the speech-to-text pipeline? I was not seeing it used by Whisper; maybe keep it in case another model would use it?
  • Add the download cmd to dl_checkpoints.sh
    • huggingface-cli download openai/whisper-large-v3 --include "*.safetensors" "*.json" --cache-dir models
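The first review point above — returning an error instead of a silent {"chunks":null,"text":""} body — can be sketched as a thin wrapper around the pipeline call. This is an illustrative sketch only: `safe_transcribe`, the callable `asr_pipeline` argument, and the response shape are hypothetical and not the AI Runner's actual API; the only grounded detail is that transformers' `ffmpeg_read` raises `ValueError` on malformed input, as shown in the traceback.

```python
def safe_transcribe(asr_pipeline, audio_bytes):
    """Run an ASR pipeline and surface decode failures explicitly,
    instead of letting a failed parse come back as an empty
    {"chunks": None, "text": ""} body with a 200 status.

    `asr_pipeline` is any callable taking raw audio bytes; the names
    and response shape here are illustrative, not the AI Runner's API.
    """
    try:
        return {"ok": True, "result": asr_pipeline(audio_bytes)}
    except ValueError as exc:
        # transformers' ffmpeg_read raises ValueError for malformed or
        # unsupported soundfiles; report it as a client error.
        return {
            "ok": False,
            "status": 400,
            "error": f"Error processing audio file: {exc}",
        }
```

The caller can then branch on `ok` and return the `status` code to the client rather than an empty transcription.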

@eliteprox (Collaborator, Author) commented
I've updated the runner to handle these errors.

The ai-runner responds with a 400 Bad Request and logs:
2024-07-02 15:52:18,181 INFO: 172.17.0.1:38806 - "POST /speech-to-text HTTP/1.1" 400 Bad Request

The gateway logs:
I0702 11:50:40.632077 2527064 ai_process.go:561] clientIP=192.168.10.71 request_id=dd12552c Error submitting request cap=31 modelID=openai/whisper-large-v3 try=6 orch=https://0.0.0.0:8936 err=speech-to-text container returned 400

The orchestrator logs:
2024/07/02 11:50:40 ERROR speech-to-text container returned 400 err="{\"detail\":{\"msg\":\"Error processing audio file: Soundfile is either not in the correct format or is malformed. Ensure that the soundfile has a valid audio file extension (e.g. wav, flac or mp3) and is not corrupted. If reading from a remote URL, ensure that the URL is the full address to **download** the audio file.\"}}"

A screenshot of the 400 Bad Request response from Swagger UI is attached.

@eliteprox changed the title from "Add speech to text pipeline" to "Add audio to text pipeline" on Jul 5, 2024
@eliteprox (Collaborator, Author) commented
  • I was getting an error on some files that were intentionally transcoded to a different input format. The result right now is a null/empty text response: {"chunks":null,"text":""}. I think this should be an error response code, or at least an error field in the response (error details, or "ok" when there is none). The last line in the traceback below is where transformers tries to use ffmpeg in the container to parse the file, I believe.

I've added error handling to the AI Runner for any error the model raises while processing the file. It specifically checks for "invalid soundfile" errors and returns them as a 400 Bad Request.
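The error-to-status mapping described here could look something like the sketch below. The function name and the exact substring checks are illustrative assumptions, not the AI Runner's actual code; it only reflects the behavior stated in the comment (malformed audio becomes a 400, other failures stay server-side errors).

```python
def http_status_for_error(exc: Exception) -> int:
    """Map a pipeline exception to an HTTP status code (sketch only).

    Malformed or invalid audio is treated as a client error (400);
    anything else is treated as a server-side failure (500). The
    substring checks are illustrative, modeled on the ValueError
    message transformers raises for bad soundfiles.
    """
    message = str(exc).lower()
    if "soundfile" in message or "invalid" in message:
        return 400
    return 500
```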

  • Is seed a parameter for the speech-to-text pipeline? I was not seeing it used by Whisper; maybe keep it in case another model would use it?

That's correct; I don't see a purpose for the seed parameter with this model, and I think it's unlikely this pipeline will need it.

  • Add the download cmd to dl_checkpoints.sh

    • huggingface-cli download openai/whisper-large-v3 --include "*.safetensors" "*.json" --cache-dir models

I've added this to dl_checkpoints.sh.

rickstaa and others added 2 commits July 15, 2024 11:31
This commit introduces support for the Stable Diffusion 3 Medium model
from Hugging Face:
https://huggingface.co/stabilityai/stable-diffusion-3-medium.

Please be aware that this model has restrictive licensing at the time of
writing and is not yet advised for public use. Ensure you read and
understand the [licensing
terms](https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/LICENSE)
before enabling this model on your orchestrator.
@rickstaa rickstaa force-pushed the main branch 2 times, most recently from cd1feb4 to 0d03040 Compare July 16, 2024 13:10
eliteprox and others added 4 commits July 16, 2024 09:37
This commit applies several code improvements to the audio-to-text
codebase. It also restructures the utility functions in the pipelines
module.
This commit ensures that both audio-to-text routes have known responses.
@eliteprox eliteprox merged commit 9fc476e into livepeer:main Jul 16, 2024
1 check passed
@rickstaa rickstaa deleted the speech_to_text_pipeline_poc_n branch July 16, 2024 14:13
eliteprox added a commit to eliteprox/ai-worker that referenced this pull request Jul 26, 2024
Add audio to text pipeline
---------
Co-authored-by: Rick Staa