30-Dec-24
A few notes for implementing a voice app for working with large language models (LLMs).
Perhaps the most underaddressed aspect of working with large language models to date is the question of what to do with their outputs (completions, chats, threads).
Some of the information generated by large language models is highly useful and in some cases also valuable. This goes for both personal and business users.
While some attention has been paid to managing a prompt inventory (with the advent of dedicated prompting IDEs, systems for prompt versioning, etc.), far less attention has been paid to the question of what to do with outputs. This is quite perplexing! The object of prompt engineering is to achieve better outputs from AI tools, so the question of what to do with those outputs should be a central concern.
Parallel to the growth of large language models have been tremendous advances in automatic speech recognition (ASR), which shares an underlying architecture with LLMs. The integration of voice functions into large language models enhances their utility even further, making it effortless to interact with them on the go.
Accessing LLMs by voice has so far taken two main approaches (perhaps they could be called Gen 1 and Gen 2!):
Gen 1: Voice interactions are achieved by combining two separate technologies: speech-to-text (STT) and text-to-speech (TTS). In this model, the user's prompts are converted to text, sent to the LLM for inference, and returned as text that is then synthesized with an artificial voice. In a sense, this is simply the classic model for interacting with large language models, with the voice elements bolted on at either end.
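As a rough illustration, a minimal Gen 1 loop can be stitched together from off-the-shelf STT, chat, and TTS endpoints. The sketch below assumes the OpenAI Python SDK purely for convenience; the model names are illustrative, and any STT/LLM/TTS trio would follow the same shape.

```python
# Minimal "Gen 1" voice turn: STT -> LLM -> TTS.
# Assumes the OpenAI Python SDK (`pip install openai`) with OPENAI_API_KEY set;
# model names are placeholders.
from openai import OpenAI

client = OpenAI()

def voice_turn(audio_path: str, history: list[dict]) -> str:
    # 1. Speech-to-text: transcribe the user's recorded prompt.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
    history.append({"role": "user", "content": transcript.text})

    # 2. Inference: an ordinary text completion over the running conversation.
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    answer = reply.choices[0].message.content
    history.append({"role": "assistant", "content": answer})

    # 3. Text-to-speech: synthesize the answer for playback.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
    with open("reply.mp3", "wb") as out:
        out.write(speech.content)
    return answer
```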
Gen 2: Speech-to-speech is fast emerging as a more streamlined architecture. In this approach, the API itself manages the voice elements, and inference is tuned for the low latency that voice requires. Additionally, models are tuned to produce output that better matches the patterns expected in spoken interactions. OpenAI's Realtime API and the forthcoming additions to Gemini, currently available for preview in Google AI Studio, are examples of this.
However voice interaction is achieved, and in spite of the benefits that it brings, the same open-ended question about output storage remains.
A transcription is the normative method for recording a speech interaction. This can be achieved with both approaches. But the question then becomes, what should one do with this transcription?
Having used what I've called "Gen 1" and "Gen 2" tools, I think that there are some important distinctions that haven't yet fully been addressed in the tools.
I believe that what might be called "long prompting" is one of the huge advantages of voice capture for LLMs. It's much easier to write detailed, context-rich prompts by using speech-to-text than by typing them out.
While the real-time models are interesting, it's more challenging to control the interaction between you and the LLM. It's easy to use "Gen 1" tools in an implementation that mimics push-to-talk: the user can hold down a microphone button until they're finished with their prompt, release it when done, and then send it. Alternatively, because STT and TTS are decoupled in Gen 1 approaches, the frontend implementation can allow the user some degree of control over pause tolerance.
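To make the distinction concrete, here is a small, framework-agnostic sketch of configurable pause tolerance in a Gen 1 frontend: audio chunks are accumulated, and the utterance only ends after a user-defined silence window (or when the user releases the button), rather than whatever threshold a realtime API enforces. The energy threshold and chunk interface are assumptions for illustration.

```python
import time

class PauseTolerantRecorder:
    """Accumulates audio chunks and ends the utterance only after
    `pause_tolerance` seconds of continuous silence. Values are illustrative;
    a real frontend would tune them or expose them to the user."""

    def __init__(self, pause_tolerance: float = 2.0, silence_threshold: float = 0.01):
        self.pause_tolerance = pause_tolerance      # user-controlled, unlike a realtime API
        self.silence_threshold = silence_threshold  # RMS energy below this counts as silence
        self.chunks: list[bytes] = []
        self._last_speech = time.monotonic()

    def feed(self, chunk: bytes, rms_energy: float) -> bool:
        """Feed one audio chunk; returns True when the utterance is complete."""
        self.chunks.append(chunk)
        if rms_energy >= self.silence_threshold:
            self._last_speech = time.monotonic()
        return (time.monotonic() - self._last_speech) >= self.pause_tolerance

    def finish(self) -> bytes:
        """Push-to-talk path: the user released the button, so send what we have."""
        audio, self.chunks = b"".join(self.chunks), []
        return audio
```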
Implementing these kinds of important UI distinctions is harder when all the logic is dictated by the API, including what kind of pauses will be accepted and how the user can stop the real-time response if it's not helpful (i.e., interrupt words). The choice between the two is not a simple matter.
The idea for the voice LLM app described in this note is a system that supports voice-first use whether the user is accessing the tool through a web UI, a desktop interface, or a smartphone app.
The objective is to take the functionalities already widely available for voice interaction with large language models and add onto them robust logic for output routing and storage. This would require adding voice command support onto the baseline interaction logic of the LLM.
The objective is to allow the voice app to serve as an integral part of business workflows. Instead of recording interactions and transcripts that remain siloed within the platform, as is the most common implementation currently, the idea is to connect the voice app seamlessly into integrated business systems (or personal ones). This could be achieved using the fast-evolving MCP landscape.
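As a sketch of what that routing layer might look like, the snippet below registers named output destinations behind a common interface. The `Destination` protocol and the example names are hypothetical placeholders; in practice each destination would be backed by an MCP server or a conventional API integration.

```python
from typing import Protocol

class Destination(Protocol):
    """Hypothetical interface; real destinations would wrap MCP servers or APIs."""
    def save(self, title: str, body: str) -> str: ...

class OutputRouter:
    """Maps destination names spoken by the user to preconfigured integrations."""

    def __init__(self) -> None:
        self._destinations: dict[str, Destination] = {}

    def register(self, name: str, destination: Destination) -> None:
        self._destinations[name] = destination

    def route(self, destination_name: str, title: str, body: str) -> str:
        if destination_name not in self._destinations:
            raise ValueError(f"No integration configured for '{destination_name}'")
        return self._destinations[destination_name].save(title, body)

# Usage (names hypothetical):
#   router.register("google_drive", GoogleDriveDestination(folder="LLM outputs"))
#   router.route("google_drive", "Meeting ideas", summary_text)
```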
The challenge would be in determining the most effective way to implement the saving logic.
During the course of a lengthy conversation with an LLM, the user might only want to record a few specific details.
One approach to this might be transcript summarization, achieved by the LLM or by a second connected one. A system prompt could be fed to a summarization LLM stating something like: "Please create a summarized version of the transcripts according to the user's instructions. Remove the prompts."
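A minimal version of that summarization pass might look like the following, again assuming the OpenAI SDK purely for illustration; the system prompt is the one suggested above and the model name is a placeholder.

```python
# Sketch: a second-pass summarization of the raw voice transcript.
from openai import OpenAI

client = OpenAI()

SUMMARIZER_SYSTEM_PROMPT = (
    "Please create a summarized version of the transcript according to the "
    "user's instructions. Remove the prompts."
)

def summarize_transcript(transcript: str, user_instructions: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": SUMMARIZER_SYSTEM_PROMPT},
            {"role": "user", "content": f"{user_instructions}\n\nTranscript:\n{transcript}"},
        ],
    )
    return response.choices[0].message.content
```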
Ad hoc routing would also be highly useful. That is to say, during a conversation the user might say something like "that was really great. Can you create a Google Doc with that summary?" The logic would need to recognize this as a voice command, and the user would need to have the Google Drive integration preconfigured.
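One way to treat "can you create a Google Doc with that summary?" as a command rather than another prompt is to expose routing actions as tools and let the model decide when to call them. The sketch below uses chat-completions tool calling; the `create_google_doc` tool and its parameters are hypothetical and would map onto the preconfigured Drive integration.

```python
# Sketch: detect ad hoc routing commands via tool calling.
from openai import OpenAI

client = OpenAI()

TOOLS = [{
    "type": "function",
    "function": {
        "name": "create_google_doc",  # hypothetical tool name
        "description": "Save a summary of the current conversation to a new Google Doc.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "content": {"type": "string"},
            },
            "required": ["title", "content"],
        },
    },
}]

def handle_utterance(utterance: str, conversation: list[dict]) -> None:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=conversation + [{"role": "user", "content": utterance}],
        tools=TOOLS,
    )
    message = response.choices[0].message
    if message.tool_calls:
        # The model judged this to be a voice command, not an ordinary prompt.
        for call in message.tool_calls:
            print("Routing action requested:", call.function.name, call.function.arguments)
    else:
        print("Ordinary prompt; reply:", message.content)
```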
A second function that might be particularly useful for professional AI users would be the ability to differentiate between prompts, AI outputs, and artifacts like code generations when they are interspersed, as they would usually be, within a chat transcript.
For example, the user could pass a voice command along the lines of "extract the prompts from this conversation and save them to the prompt library." Implementation could involve the user designating a specific prompt library in the app where prompts extracted from the body of the transcript are saved.
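If the transcript is stored with role labels, the extraction itself is straightforward; the harder part is the command handling described above. A trivial sketch, assuming a role-tagged transcript and a hypothetical prompt-library directory:

```python
from pathlib import Path

def extract_prompts(transcript: list[dict], library_dir: str = "prompt-library") -> list[str]:
    """Pull the user's prompts out of a role-tagged transcript and save them
    to the designated prompt library folder (the path is a hypothetical default)."""
    prompts = [turn["content"] for turn in transcript if turn["role"] == "user"]
    library = Path(library_dir)
    library.mkdir(parents=True, exist_ok=True)
    for i, prompt in enumerate(prompts, start=1):
        (library / f"prompt-{i:03d}.md").write_text(prompt, encoding="utf-8")
    return prompts
```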
Finally, the app might need to have some logic for more complex voice commands that demand higher levels of reasoning and multi-step workflows. For example (user voice commands; a rough sketch of how such commands might decompose into steps follows the examples):
- I like those suggested ideas. Could you format them into a document and then send them to X? (extraction, document formatting, email integration)
- Could you take that summary of the CRM options, but for each option that you found to add the monthly subscription cost? When you've done that, go ahead and save that to the pricing research folder in the Business Drive. Now I've got to go!
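A simple way to think about executing commands like these is to have the app decompose each one into an ordered plan of named actions, each backed by a preconfigured integration. The data structure and action names below are hypothetical illustrations, not a prescribed design.

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowStep:
    action: str                                  # e.g. "extract", "format_document", "send_email"
    params: dict = field(default_factory=dict)

@dataclass
class Workflow:
    steps: list[WorkflowStep] = field(default_factory=list)

    def run(self, executors: dict) -> None:
        # Each action name maps to a preconfigured integration (hypothetical registry).
        for step in self.steps:
            executors[step.action](**step.params)

# The first example command above might decompose into:
plan = Workflow(steps=[
    WorkflowStep("extract", {"what": "suggested ideas"}),
    WorkflowStep("format_document", {"style": "document"}),
    WorkflowStep("send_email", {"recipient": "X"}),
])
```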
A key goal for this idea would be supporting truly seamless and hands-free operation. In this respect, on-the-go voice interactions have to be distinguished from those conducted in a more stationary setting, like a home office or a conventional office.
A user might wish to hold a voice LLM interaction while working at a standing desk, or perhaps doing housework. In these contexts, the user could still use conventional interaction methods (i.e., their hands!) to fix transcription errors, interrupt outputs, etc.
A user interacting with an LLM with their hands full (literally!) would not have these options. In this second, more demanding setting, the reliability of things like interrupt commands becomes essential, as do voice commands for refining or overriding incorrect prompt capture.
But it's in this second setting that the idea of capturing voice interactions becomes much more useful!
While a user may wish to engage with a large language model by voice at their office, there they can simply copy the transcript into Google Drive for retention.
A user on the go has to either expect that this will happen automatically (which could be a less intelligent saving method!) or hope that the voice commands work reliably. But if they did, it would support more mobile and healthier workflows, especially for remote workers collaborating with colleagues who are working at a desk.
In this (second) use case, the mobile remote worker might interact with an LLM by voice to brainstorm ideas for a meeting and use voice commands to route those outputs into a shared Drive, while colleagues in the office could immediately receive that output and develop it from their stationary workstations. This use case would be perfect for just about any modern team that has embraced remote and hybrid working!
Daniel Rosehill
(public at danielrosehill dot com)
This repository is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
The Creative Commons Attribution 4.0 International (CC BY 4.0) license allows others to:
- Share: Copy and redistribute the material in any medium or format.
- Adapt: Remix, transform, and build upon the material for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.

Under the following terms:
- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- No additional restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
For the full legal code, please visit the Creative Commons website.