
Feature Update [added initial_prompt support for automatic-speech-recognition whisper pipeline] #28556

Open
wants to merge 37 commits into main

Conversation


@Biswajit2902 Biswajit2902 commented Jan 17, 2024

What does this PR do?

Fixes # (feature)

  • initial_prompt support for whisper Pipeline (automatic-speech-recognition)

Before submitting

  • Added initial_prompt as an option for the Whisper model
  • The processor is treated as an optional pipeline parameter in order to handle the initial prompt
  • The current implementation supports only the Torch decoding path
  • How to use the initial prompt:
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-small"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
    processor=processor
)


dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
audio = sample["array"]
sampling_rate = sample["sampling_rate"]

# including timestamps
print(pipe(audio, initial_prompt="Biswajit, Whisper", return_timestamps=True))

# without timestamps
print(pipe(audio, initial_prompt="Biswajit, Whisper"))

Who can review?

Anyone in the community is free to review the PR once the tests have passed. @sanchit-gandhi, @Narsil, could either of you help take this PR forward? Let me know if anything is needed.

fixes #27317

@Biswajit2902 Biswajit2902 changed the title Feature Update [added support for initial_prompt for automatic-speech-recognition whisper pipeline] Feature Update [added initial_prompt support for automatic-speech-recognition whisper pipeline] Jan 17, 2024
@Biswajit2902 Biswajit2902 marked this pull request as ready for review January 17, 2024 14:10
@kaminwong

Hi, thank you, your code saved my day! I think line 535 needs to be modified a bit to prompt_tensor = torch.tensor(generate_kwargs["prompt_ids"], dtype=out["tokens"].dtype).cuda() if is_torch_cuda_available() else torch.tensor(generate_kwargs["prompt_ids"], dtype=out["tokens"].dtype), and is_torch_cuda_available needs to be added to the imports on line 22. Without CUDA it will run on the CPU, which is a lot slower.

@Biswajit2902
Author

@kaminwong, this check is just there to modify the output sequence so that the initial_prompt does not show up in the transcription.

The actual generation handles the device in the line below:

            tokens = self.model.generate(
                attention_mask=attention_mask,
                **generate_kwargs,
            )

Apart from this, the token-decoding part is a sequential implementation on which the device has no effect, so moving it to the GPU would just be a misuse of the GPU.
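
For context, a rough sketch of the idea being discussed (hedged: prompt_tensor and nprompt_token match the names in the traceback below, the rest is illustrative; building the tensor on the tokens' device avoids the mismatch reported next):

# Hedged sketch, not the exact PR code: strip the initial_prompt token ids
# from a generated sequence so the prompt does not appear in the transcription.
# Assumes out["tokens"] is a single 1-D sequence of token ids.
prompt_ids = generate_kwargs["prompt_ids"]
nprompt_token = len(prompt_ids)
# Build the comparison tensor on the same device as the generated tokens.
prompt_tensor = torch.tensor(prompt_ids, dtype=out["tokens"].dtype, device=out["tokens"].device)
tokens = out["tokens"]
if (tokens[0:nprompt_token] == prompt_tensor).sum() == nprompt_token:
    out["tokens"] = tokens[nprompt_token:]  # drop the prompt ids from the output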

@kaminwong

kaminwong commented Jan 27, 2024

Thanks for the reply! But if I don't make that change I get the following error, so I assume prompt_tensor needs to be on CUDA if the device is CUDA? Or is there another way to fix the error? Thank you for your time.

File "/.../python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 538, in _forward if (tmp_tokens[0:nprompt_token] == prompt_tensor).sum() == nprompt_token: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

I followed the code you posted:


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-small"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
    processor=processor
)

@Biswajit2902
Author

@kaminwong, thank you for flagging this. I understand the issue now; let me verify and resolve it.

@Biswajit2902
Author

@kaminwong, you can pull the latest commit and install it; it should work now. It's fixed.

@thomasmol

@Biswajit2902 any new updates? Let me know if you need help.

@Biswajit2902
Author

@thomasmol I will update this soon; I have been busy for the past two weeks. Thank you for the reminder.

@amyeroberts amyeroberts added the Core: Pipeline and Audio labels Apr 24, 2024
@Biswajit2902
Author

Biswajit2902 commented Apr 29, 2024

@thomasmol @sanchit-gandhi, I see the conflict below in AutomaticSpeechRecognitionPipeline._sanitize_parameters:

<<<<<<< main
            forward_params["generate_kwargs"]["max_new_tokens"] = max_new_tokens
        if initial_prompt is not None:
            forward_params["generate_kwargs"]["initial_prompt"] = initial_prompt
=======
            forward_params["max_new_tokens"] = max_new_tokens
>>>>>>> main

I want to understand why generate_kwargs was removed from forward_params, and likewise initial_prompt.

My earlier changes were working fine, but after this change there seems to be a bug. I am working on resolving it, so I need your input on this.

@sanchit-gandhi
Contributor

sanchit-gandhi commented May 20, 2024

Hey @Biswajit2902 - you can read the motivation for this change here. Essentially, we're unifying the forward_params and generate_kwargs in _sanitize_parameters. However, for the purposes of your feature, you should strive to put the initial_prompt under preprocess_params:

preprocess_params["initial_prompt"] = initial_prompt

And then convert the text prompt to token ids in the preprocess method, which will then be passed to _forward.
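
A minimal, self-contained sketch of that flow (hedged: WhisperTokenizer.get_prompt_ids is the existing helper, everything else here is illustrative of how the pipeline could hand the ids to _forward, not the PR's actual code):

from transformers import WhisperTokenizer

# Keep initial_prompt as a plain string in preprocess_params ...
preprocess_params = {"initial_prompt": "Biswajit, Whisper"}

# ... and convert it to prompt token ids in preprocess, so _forward can pass
# them to model.generate as prompt_ids.
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small")
prompt_ids = tokenizer.get_prompt_ids(preprocess_params["initial_prompt"], return_tensors="pt")
print(prompt_ids)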

@sanchit-gandhi sanchit-gandhi linked an issue May 22, 2024 that may be closed by this pull request
@Biswajit2902
Author

@sanchit-gandhi, thanks for the pointer. Sorry, I got super busy and couldn't get back to review it. I will do it soon and close this out.

@Biswajit2902
Author

@sanchit-gandhi, just an update: I have made the changes for this issue as suggested, but the output is no longer correct as it was before. It seems generate has an issue; it is adding the initial prompt to every chunk. I will check and update on this. Also, let me know if you know of any existing issue about this.

@huggingface huggingface deleted a comment from github-actions bot Jul 16, 2024
@amyeroberts
Collaborator

cc @kamilakesbi as @sanchit-gandhi is off

@basicblueberrry136

Are there any updates on this? Or are there other ways you know of to push the model to more easily detect certain words using this pipeline?

@amyeroberts
Collaborator

cc @ylacombe

@ylacombe
Contributor

ylacombe commented Sep 2, 2024

Hey @basicblueberrry136, thanks for your comment!
@sanchit-gandhi's review still has to be addressed before the next steps. Once it's done, I'll make another review! Hopefully it'll move fast!

@JacobLinCool
Contributor

I believe this is very helpful when used with the serverless inference API.

It seems that the serverless inference API uses the Transformers library to run models, and we cannot pass any parameter that is a tensor, as shown below:

const fs = require('fs');

// Read the audio file and base64-encode it for the request body.
const data = fs.readFileSync(filename);
const b64 = data.toString('base64');

const body = JSON.stringify({
    inputs: b64,
    parameters: {
        return_timestamps: true,
        generate_kwargs: {
            num_beams: 1,
            prompt_ids: [50362, 27338, 3763, 48022, 2257, 48022, 6784, 118, 25157, 1546, 15789, 23987, 5975, 17174, 28472, 25750, 6062, 1543],
        }
    }
});

It results in the following error:

{
  "error": "unknown error",
  "warnings": [
    "There was an inference error: unknown error: list indices must be integers or slices, not NoneType"
  ]
}

If initial_prompt is added, we can pass the prompt as a string to the serverless inference API.
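
For illustration, a hypothetical request once such a string parameter existed (hedged: initial_prompt is only the parameter proposed in this PR, not something the inference API accepts today; the endpoint, model id, token, and payload shape mirror the JSON above and are assumptions):

# Hypothetical sketch: pass the proposed initial_prompt as a plain string
# instead of a prompt_ids tensor. This parameter does not exist yet.
import base64
import requests

with open("sample.flac", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "https://api-inference.huggingface.co/models/openai/whisper-large-v3",
    headers={"Authorization": "Bearer hf_xxx"},  # placeholder token
    json={
        "inputs": audio_b64,
        "parameters": {
            "return_timestamps": True,
            "generate_kwargs": {"num_beams": 1, "initial_prompt": "Biswajit, Whisper"},
        },
    },
)
print(response.json())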

@jollyfish-cjy

Hi, thanks for your work! Are there any updates on this?

Successfully merging this pull request may close these issues.

  • audio pipeline support for initial_prompt?
  • openai/whisper-large-v2 prompt