
Add AudioQuestionAnswering pipeline #33782

Open
cdreetz opened this issue Sep 28, 2024 · 4 comments
Labels
Feature request: Request for a new feature

Comments


cdreetz commented Sep 28, 2024

Feature request

A new AudioQuestionAnswering (AQA) pipeline, analogous to DocumentQuestionAnswering (DQA): instead of taking a document, applying OCR, and doing QA over the result, it takes an audio file, applies speech-to-text (STT), and does QA over the transcript. An advanced version would combine diarization with STT, since speaker annotations provide important context and would improve QA quality.
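The proposed flow can be sketched as a simple composition of two stages. This is a minimal illustration, not the actual implementation: the function name and signature are assumptions, and the two callables stand in for, e.g., transformers ASR and QA pipelines.

```python
from typing import Any, Callable, Dict

def audio_question_answering(
    audio: str,
    question: str,
    asr: Callable[[str], Dict[str, Any]],  # e.g. pipeline("automatic-speech-recognition")
    qa: Callable[..., Dict[str, Any]],     # e.g. pipeline("question-answering")
) -> Dict[str, Any]:
    """Transcribe the audio, then answer the question over the transcript."""
    # Stage 1: speech-to-text; ASR pipelines return a dict with a "text" key.
    transcript = asr(audio)["text"]
    # Stage 2: extractive QA over the transcript.
    return qa(question=question, context=transcript)
```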

Motivation

This is a pipeline I have had to build on multiple occasions for processing audio, specifically phone call recordings. Just as the existing pipelines make applied ML workflows quick and easy to use, this one would do the same for a modality that is not currently covered.

Your contribution

I plan to contribute the entire pipeline. My inspiration, and much of what I plan to base the PR on, comes from #18414.

I'm mostly posting this issue to get feedback from the HF team. Tagging @Narsil @NielsRogge, as they also provided feedback on the DQA PR.

@cdreetz added the Feature request label Sep 28, 2024
@LysandreJik
Member

cc @ylacombe @eustlb @Rocketknight1

@Rocketknight1
Member

I think this is quite an interesting idea, and I'd support it as a pipeline (even though we don't have a matching Hub spec for it yet). cc @sanchit-gandhi who I think worked on diarization as well.

Overall though, I'd be happy to accept and review the PR, unless anyone else has objections!


cdreetz commented Oct 14, 2024

Hey @Rocketknight1, thanks for the willingness to help! I've implemented a working version and iterated on it a bunch, but I'm at the point where it would be best to get maintainers' opinions. A few undecided points I'd love input on:

  • All current SUPPORTED_TASKS entries define a single default model in the init, but since AQA requires two models, I'm unsure whether to extend SUPPORTED_TASKS to support two models, or to define only the ASR model as the default and load the QA model inside the pipeline class.
  • Should there be a maximum size for default models? For accessibility a small default makes sense, but for the pipeline to work as well as I'd like, it needs slightly larger models (whisper tiny vs. whisper large turbo).
  • Is it important to use models from the question-answering task, or can general text-generation models be used instead? That would mean passing the question and context key-values that QA expects as a single string to an instruct-tuned text-gen model. This is again a question of how small we want the defaults; llama3.2 1b is pretty small and would enable more generalized answers to questions about the context.
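For the third point, the question/context key-values could be flattened into a single prompt string for an instruct-tuned model. This is a hypothetical prompt template, just to illustrate the idea, not a final design:

```python
def qa_prompt(question: str, context: str) -> str:
    """Flatten QA-style question/context kwargs into a single prompt string
    for an instruct-tuned text-generation model (illustrative template)."""
    return (
        "Answer the question using only the context below.\n\n"
        f"Context: {context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```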

@Rocketknight1
Member

Hmm, I see! I didn't realize when you first proposed this that it combines two separate models that weren't trained together. That's unusual for pipelines. Is there a reason to use a single pipeline for this task, instead of just calling an STT pipeline and passing its output to an instruct-tuned model?
