
Add AudioQuestionAnswering pipeline #33782

Open
cdreetz opened this issue Sep 28, 2024 · 4 comments
Labels
Feature request: Request for a new feature

Comments


cdreetz commented Sep 28, 2024

Feature request

A new AudioQuestionAnswering (AQA) pipeline, analogous to DocumentQuestionAnswering (DQA): instead of taking a document, applying OCR, and doing QA over the result, it takes an audio file, applies speech-to-text (STT), and does QA over the transcript. An advanced version would combine diarization with STT, since speaker annotations provide important context and would improve QA quality.
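The proposed flow can be sketched as a simple composition of two stages. This is a minimal illustration, not the actual implementation: the function name and signature are assumptions, and the two callables stand in for, e.g., transformers ASR and QA pipelines.

```python
from typing import Any, Callable, Dict

def audio_question_answering(
    audio: str,
    question: str,
    asr: Callable[[str], Dict[str, Any]],  # e.g. pipeline("automatic-speech-recognition")
    qa: Callable[..., Dict[str, Any]],     # e.g. pipeline("question-answering")
) -> Dict[str, Any]:
    """Transcribe the audio, then answer the question over the transcript."""
    # Stage 1: speech-to-text; ASR pipelines return a dict with a "text" key.
    transcript = asr(audio)["text"]
    # Stage 2: extractive QA over the transcript.
    return qa(question=question, context=transcript)
```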

Motivation

This is a pipeline I have had to build on multiple occasions for processing audio, specifically phone call recordings. Just as the existing pipelines make applied ML workflows quick and easy to use, this one would do the same for a modality that is not currently covered.

Your contribution

I plan to contribute the entire pipeline. My inspiration, and much of what I plan to base the PR on, comes from #18414.

I'm mostly posting this issue to get feedback from the HF team. Tagging @Narsil @NielsRogge, as they also provided feedback on the DQA PR.

@cdreetz added the Feature request label Sep 28, 2024
@LysandreJik
Member

cc @ylacombe @eustlb @Rocketknight1

@Rocketknight1
Member

I think this is quite an interesting idea, and I'd support it as a pipeline (even though we don't have a matching Hub spec for it yet). cc @sanchit-gandhi who I think worked on diarization as well.

Overall though, I'd be happy to accept and review the PR, unless anyone else has objections!


cdreetz commented Oct 14, 2024

Hey @Rocketknight1, thanks for the willingness to help! I've implemented a working version and iterated on it a bunch, but I'm at the point where it would be best to get maintainers' opinions. A few undecided points I'd love input on:

  • All current SUPPORTED_TASKS entries define a single default model in the init, but since AQA requires two models, I'm unsure whether to extend SUPPORTED_TASKS to support two models, or to define only the ASR model as the default and load the QA model inside the pipeline class.
  • Should there be a maximum size for default models? For accessibility a small default makes sense, but for the pipeline to work as well as I'd like, it needs slightly larger models (whisper tiny vs. whisper large turbo).
  • Is it important to use models from the question-answering task, or can general text-generation models be used instead? That would mean passing the question and context key-values that QA expects as a single string to an instruct-tuned text-gen model. This is again a question of how small we want the defaults; llama3.2 1b is pretty small and would enable more generalized answers to questions about the context.
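For the third point, the question/context key-values could be flattened into a single prompt string for an instruct-tuned model. This is a hypothetical prompt template, just to illustrate the idea, not a final design:

```python
def qa_prompt(question: str, context: str) -> str:
    """Flatten QA-style question/context kwargs into a single prompt string
    for an instruct-tuned text-generation model (illustrative template)."""
    return (
        "Answer the question using only the context below.\n\n"
        f"Context: {context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```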

@Rocketknight1
Member

Hmm, I see! I didn't realize when you first proposed this that it combines two separate models that weren't trained together. That's unusual for pipelines. Is there a reason to use a single pipeline for this task, instead of just calling an STT pipeline and passing its output to an instruct-tuned model?
