Add example of using Gemini Live API with Mesop. #1191
Conversation
Demonstrates the following:
- Setting up a web socket connection to the Gemini Live API via the client
- Playing audio responses from the Gemini Live API
- Sending audio input from the user's microphone to the Gemini Live API
- Sending video frames from the user's webcam to the Gemini Live API
- Using a custom tool to interact with the Mesop UI
Thanks for creating this! I pulled it down and played with it; pretty neat :) LGTM - left a couple of brainstorming questions. It feels to me like quite a lot of work to do multimodal stuff in Mesop, which makes sense since the framework wasn't focused on that. I'm curious if you have thoughts about making it easier to build multimodal apps in Mesop.
import mesop.labs as mel
This seems like it would be good functionality to have inside Mesop.
Sure, I can add it. Do you have any suggestions for how to expose the functionality API-wise?
In general, I think we could add more tooling to make it easier to write web components. I find setting up the data and events pretty tedious - it's a slightly better experience than building a core Mesop component, though. Ideally, whatever AI tool could just generate most of the boilerplate for us at some point.
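To illustrate the wiring, here's roughly what the JS side of a minimal component looks like (a sketch adapted from the Lit counter pattern in the Mesop web component docs; the names are illustrative). Every event needs a handler-ID string property that has to be declared here and then wired up again in Python via `events={...}`, which is where the tedium accumulates:

```js
import {
  LitElement,
  html,
} from 'https://cdn.jsdelivr.net/gh/lit/dist@3/core/lit-core.min.js';

class CounterComponent extends LitElement {
  static properties = {
    value: {type: Number},
    // Handler ID passed in from Mesop; one of these per event.
    decrementEvent: {type: String},
  };

  constructor() {
    super();
    this.value = 0;
    this.decrementEvent = '';
  }

  render() {
    return html`
      <span>Value: ${this.value}</span>
      <button @click="${this._onDecrement}">Decrement</button>
    `;
  }

  _onDecrement() {
    // MesopEvent routes the payload back to the Python event handler.
    this.dispatchEvent(
      new MesopEvent(this.decrementEvent, {value: this.value - 1}),
    );
  }
}

customElements.define('counter-component', CounterComponent);
```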
I think basically it would be updating mesop/component_helpers/helper.py (line 468 at 07c3190):

event_handler = events[event]
Agree web components are a bit tedious, although I have found that pointing an LLM to an existing web component file and saying "make something similar to foo.py and foo.js, except do X" gets you most of the boilerplate.
connectedCallback() {
  super.connectedCallback();
  window.addEventListener('audio-input-received', this.onAudioInputReceived);
This definitely works, but it's a bit of a strange pattern since it wouldn't work with multiple instances. I'm wondering what your thoughts are about:
- having Gemini Live be a master component and basically have everything else be a child; this way, instead of listening to window, you would listen to this part of the DOM tree. It does seem like, logically, this is the master/orchestrator component (see the sketch after this list).
- having everything be one web component. It would simplify the cross-component communication, but then you would need to do pretty much all of the UI in JS instead of Python - although I wonder if named slots support for web_components (#1185) would solve this issue.
(don't need any changes to this example, but just food for thought)
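A rough sketch of the orchestrator idea, assuming Lit and a hypothetical gemini-live-orchestrator element (all names illustrative): children dispatch bubbling, composed events, and the orchestrator listens on itself rather than on window, so multiple instances can coexist:

```js
import {
  LitElement,
  html,
} from 'https://cdn.jsdelivr.net/gh/lit/dist@3/core/lit-core.min.js';

class GeminiLiveOrchestrator extends LitElement {
  connectedCallback() {
    super.connectedCallback();
    // Listen on the component itself rather than window, so only
    // events from this instance's subtree are handled.
    this.addEventListener('audio-input-received', this._onAudioInput);
  }

  disconnectedCallback() {
    this.removeEventListener('audio-input-received', this._onAudioInput);
    super.disconnectedCallback();
  }

  _onAudioInput = (e) => {
    // Forward the audio chunk to the Gemini Live web socket here.
    console.log('audio chunk received', e.detail);
  };

  render() {
    // Recorder, player, video preview, etc. are slotted in as children.
    return html`<slot></slot>`;
  }
}

customElements.define('gemini-live-orchestrator', GeminiLiveOrchestrator);

// A child like the audio recorder would dispatch with bubbles + composed
// so the event escapes its shadow root and reaches the orchestrator:
//   this.dispatchEvent(new CustomEvent('audio-input-received', {
//     detail: {chunk},
//     bubbles: true,
//     composed: true,
//   }));
```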
Yeah, I agree this isn't a great pattern. It was mostly a quick hack to avoid having to send data to Mesop and then on to the Gemini Live web component. I did consider some other ideas here, such as dynamically naming the event listeners, but figured I'd punt - because, yeah, it took much more work than I would have liked to get it running.
- The orchestrator idea sounds interesting; I think that could work.
- Yeah, I did consider that option, but I agree it would kind of defeat the purpose of using Mesop and limit the configurability of the UI. It definitely would have made things easier to implement, though. I think named slots for web components would partly solve the problem, but not entirely, since control over the UI would be limited to where the slots are. For demo purposes that may be enough, but allowing custom control over how the UI elements look ends up adding more effort.
I think one option is to include Gemini Live, Audio Player, and Audio Recorder in a single web component, since those technically don't need any UI elements. The buttons could be done as Mesop components; their click events would update state that flows into the centralized web component, such as disabling/enabling the microphone (see the sketch below). I think that could work.
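Something like this minimal sketch, assuming a headless Lit component with a micEnabled property driven by Mesop state (all names hypothetical):

```js
import {LitElement} from 'https://cdn.jsdelivr.net/gh/lit/dist@3/core/lit-core.min.js';

class GeminiLiveComponent extends LitElement {
  static properties = {
    // Toggled by a Mesop button via state -> property binding.
    micEnabled: {type: Boolean},
  };

  constructor() {
    super();
    this.micEnabled = false;
    this.mediaStream = null;
  }

  async updated(changedProperties) {
    // React to property changes pushed down from Mesop state.
    if (changedProperties.has('micEnabled')) {
      if (this.micEnabled) {
        // Start capturing; chunks would be streamed to the Gemini Live
        // web socket from here.
        this.mediaStream = await navigator.mediaDevices.getUserMedia({
          audio: true,
        });
      } else if (this.mediaStream) {
        this.mediaStream.getTracks().forEach((track) => track.stop());
        this.mediaStream = null;
      }
    }
  }

  // No render(): this component is headless; the buttons live in Mesop.
}

customElements.define('gemini-live-component', GeminiLiveComponent);
```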
I guess Video Player could technically also be controlled in the consolidated web component, since the video preview isn't necessary. So displaying the video preview could potentially be a separate web component.
But yeah, I think this was definitely a lot more work to get going - definitely more work than someone building a demo on Mesop would want to do, and definitely too much JavaScript required. So it would be cool if we could get a web component that would be easy to use and integrate.
I think one drawback of this is that the direct web socket connection on the client side isn't that secure (the API key ends up exposed to the client), which means people would likely need a proxy for the web socket connection. That would mean more setup work for Mesop devs.
I also have a version that runs the web socket in the backend Mesop server: https://github.com/richard-to/mesop-gemini-2-experiments. But the problem with this is that we likely don't want to couple the web socket connections to the Mesop server - probably not too scalable, even for demos. I didn't include those demos here since they don't work on Python 3.10; some async feature I'm using kept throwing errors.
I think in the end, for more realistic use cases involving video, we'd want some kind of WebRTC server rather than a web socket connection.
I don't have any concrete ideas yet for how to make it easier to build multimodal Mesop apps. One purpose of trying to do this has been to see where the potential weak points are.
I think my initial thoughts are:
- Web socket support seems more and more important (so it's good we have that now).
- At least for the Gemini Live API, it seems good to separate out the web socket server for that API, though setting that up is probably on the user; it could be helpful to point people to existing proxy implementations. Initially, they could use the direct JS connection to the Gemini Live API for development.
- It's good that I was able to get things working with web components, but to make them shareable and usable, they'd definitely need a lot of cleanup.
- I think a lot of it comes down to working with web components: making it easier to build them and share them - especially the latter, since ultimately we don't want the majority of Mesop devs digging into the JavaScript. Unfortunately, building up that ecosystem of web components is a lot of work.
- In terms of web components, there's definitely a good number of use cases where the web component itself has no UI. So I'll either proxy the event from a UI element (Video Player) or use a slot (Audio Recorder). Getting this set up is pretty tedious, and I'm not sure there's an established pattern yet.
- I think web component to web component communication can be a useful pattern, but as discussed, it needs a more established design pattern.
- I think there are still a lot of unknowns when working with larger amounts of data.
I still have a couple more things I want to test out. I want to see if we can do something similar to AI Studio, where each audio response can be replayed. I think that scenario is mostly about how we store the audio. Do we store it in state? Would that make the state too large after too many turns? Or do we use direct web component to web component communication, since the audio data is already on the client? (A rough sketch of the client-side option is below.)
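A rough sketch of the client-side option, assuming a hypothetical per-turn cache kept inside the audio player web component so replay never round-trips through Mesop state (all names illustrative):

```js
// Per-turn audio cache (hypothetical); lives entirely on the client.
class AudioReplayCache {
  constructor(maxTurns = 20) {
    this.maxTurns = maxTurns;
    this.turns = new Map(); // turnId -> array of ArrayBuffer chunks
  }

  addChunk(turnId, chunk) {
    if (!this.turns.has(turnId)) {
      this.turns.set(turnId, []);
    }
    this.turns.get(turnId).push(chunk);
    // Evict the oldest turn once the cap is hit to bound memory use.
    if (this.turns.size > this.maxTurns) {
      const oldestTurnId = this.turns.keys().next().value;
      this.turns.delete(oldestTurnId);
    }
  }

  // Returns the cached chunks for a turn so the player can schedule
  // them for replay via the Web Audio API.
  getTurnChunks(turnId) {
    return this.turns.get(turnId) ?? [];
  }
}
```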