llms: Add support for using the whisper model to transcribe audio #696
Conversation
llms/openai/multicontent_test.go (outdated)

_, err := os.Stat(audioFilePath)
require.NoError(t, err)

rsp, err := llm.TranscribeAudio(context.Background(), audioFilePath)
Does it make sense to think of "transcribe audio" in the context of LLMs? AFAIU Whisper is a distinct model from LLMs like the GPT family.
Is this intended as a one-off method only for openai, or as some general audio transcription interface?
Only for OpenAI for now, but I think it's worth including in the general context, since other models also do this; at the moment, though, it's only implemented for OpenAI.
I haven't tried it yet, but this seems interesting as a locally running alternative: https://github.com/JigsawStack/insanely-fast-whisper-api
It would be cool if we could support something like that, so you can combine it with Ollama to build local-only tools.
(So far I've been using https://github.com/Purfview/whisper-standalone-win locally, which is a single-binary wrapper around https://github.com/SYSTRAN/faster-whisper)
We could implement it for other LLM providers; the problem is that so far I have only found the standalone version for Windows, but I am looking into other alternatives.
I agree with @eliben's intuition here; I'm not sure audio transcription as a concept fits into our llm namespace. I'm open to exposing this and generalizing over providers, but I think it belongs in a different namespace.
What would be the implementation idea for this functionality? Maybe expose it as openai.TranscribeAudio, keeping it only within the openai package and out of the LLM namespace? In terms of how it would be used, it could work as a loader.
@tmc any update?
PR Checklist
- PR title follows the "package: concise description" convention, e.g. memory: add interfaces for X, Y or util: add whizzbang helpers
- References the issue it fixes, if applicable (Fixes #123)
- Passes golangci-lint checks