feat: Implement core real-time interaction capabilities for Gemini API #32

Open
wants to merge 1 commit into
base: main
155 changes: 155 additions & 0 deletions samples/stream_realtime/README.md
@@ -0,0 +1,155 @@
# Gemini Client in Go

This Go package provides a client for the Gemini **Live API**, enabling real-time multimodal interaction over text, audio, and video streams. It is inspired by the functionality demonstrated in "Gemini 2.0 - Multimodal live API: Streaming", including bidirectional audio streaming and text-based conversation.

This client allows you to build applications that can:

* Send and receive text messages.
* Stream audio input to the Gemini API and receive audio responses.
* Stream video input (from camera or screen capture) to the Gemini API.

## Features

* **Real-time Communication:** Leverages WebSockets for low-latency, bidirectional communication with the Gemini Live API.
* **Text Input:** Send textual prompts to the Gemini API for conversational interactions.
* **Audio Input & Output:**
    * Record audio from your microphone and stream it to the API.
    * Receive and play back audio responses from the API in real-time.
* **Video Input (Camera & Screen Capture):**
    * Capture frames from your camera or screen and stream them to the API for multimodal prompts.
    * Supports configuration to switch between camera and screen capture modes.
* **Automatic Reconnection:** Attempts to reconnect to the API if the WebSocket connection is interrupted.
* **Configurable:** Allows for customization of the Gemini model to be used.
* **Error Handling & Logging:** Provides robust error handling and logging using `logrus`.

## Prerequisites

* **Go:** Version 1.23 or higher (the sample's `go.mod` declares `go 1.23.0`).
* **PortAudio:** Required for audio input/output. Installation instructions vary by operating system.
    * **Linux (Debian/Ubuntu):** `sudo apt-get install libportaudio2 libportaudiocpp0 portaudio19-dev`
* **FFmpeg:** Required for camera capture and image conversion.
    * You can usually install this using your system's package manager (e.g., `sudo apt-get install ffmpeg` on Debian/Ubuntu, `brew install ffmpeg` on macOS).
* **GEMINI_API_KEY:** You need a valid API key from Google AI Studio, which you can obtain by signing up for the Gemini API.

## Installation

1. **Clone the repository:**
```bash
git clone <repository_url>
cd <repository_directory>
```

2. **Build the application:**
```bash
go build -o gemini-client .
```

## Configuration

1. **Set the API Key:**
You need to set the `GEMINI_API_KEY` environment variable. You can do this by running the following command:

```bash
export GEMINI_API_KEY="YOUR_API_KEY"
```

Replace `YOUR_API_KEY` with your actual Gemini API key.

## Usage

The client can be run in different modes. Camera and screen capture are selected with the `MODE` environment variable, while the response modality (text or audio) is set through `responseModalities` in the setup message.

### Output Mode

```go
setupMsg := map[string]interface{}{
	"setup": map[string]interface{}{
		"model": "models/gemini-2.0-flash-exp",
		"generationConfig": map[string]interface{}{
			"responseModalities": []string{"TEXT"}, // "TEXT", "AUDIO"
		},
	},
}
```
The client defaults to text-based interaction, similar to the basic chat demonstrated in the "Live API - Websockets Quickstart" notebook. You can type messages in the terminal and send them to the Gemini API.

```bash
./gemini-client
```

You will be prompted with `message>: ` to enter your text.
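
The setup message at the top of this section can also be built from typed structs, which makes the modality easier to validate before the message is sent. The sketch below is an alternative to the map literal, not the PR's implementation: the type names and `newSetupMessage` are assumptions, while the JSON field names and model string are copied from the snippet above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// These types mirror the JSON shape of the map-based setup message.
// The type names are illustrative and do not come from the PR.
type generationConfig struct {
	ResponseModalities []string `json:"responseModalities"`
}

type setup struct {
	Model            string           `json:"model"`
	GenerationConfig generationConfig `json:"generationConfig"`
}

type setupMessage struct {
	Setup setup `json:"setup"`
}

// newSetupMessage returns the JSON-encoded setup message for one modality.
// Only "TEXT" and "AUDIO" are accepted, matching the README's comment.
func newSetupMessage(model, modality string) ([]byte, error) {
	if modality != "TEXT" && modality != "AUDIO" {
		return nil, fmt.Errorf("unsupported response modality %q", modality)
	}
	msg := setupMessage{Setup: setup{
		Model:            model,
		GenerationConfig: generationConfig{ResponseModalities: []string{modality}},
	}}
	return json.Marshal(msg)
}

func main() {
	payload, err := newSetupMessage("models/gemini-2.0-flash-exp", "TEXT")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(string(payload))
}
```

The resulting bytes would be sent as the first WebSocket message after connecting.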

### Audio Mode

To enable real-time bidirectional audio input and output, similar to "Gemini 2.0 - Multimodal live API: Streaming", set `responseModalities` to `"AUDIO"` in the setup message and run the client without setting `MODE`.
```go
"responseModalities": []string{"AUDIO"}, // "TEXT", "AUDIO"
```
The client will automatically start recording audio from your microphone and playing back audio responses.

```bash
./gemini-client
```
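
Microphone samples have to be serialized before they can travel inside a JSON WebSocket message. The sketch below shows one common approach: packing 16-bit samples as little-endian bytes and base64-encoding the result. The sample format, the byte order, and the helper name `encodePCM16` are assumptions; this README does not state the audio format the PR sends, so check the Live API documentation before relying on it.

```go
package main

import (
	"encoding/base64"
	"encoding/binary"
	"fmt"
)

// encodePCM16 packs signed 16-bit samples as little-endian bytes and
// returns them base64-encoded (standard alphabet, with padding).
// Hypothetical helper; the PR's actual audio encoding is not shown.
func encodePCM16(samples []int16) string {
	buf := make([]byte, 2*len(samples))
	for i, s := range samples {
		binary.LittleEndian.PutUint16(buf[2*i:], uint16(s))
	}
	return base64.StdEncoding.EncodeToString(buf)
}

func main() {
	// Three samples: silence, +1, and -1.
	chunk := []int16{0, 1, -1}
	fmt.Println(encodePCM16(chunk))
}
```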

### Camera Mode

To send camera frames to the API, enabling multimodal prompts as explored in the examples, set the `MODE` environment variable to `camera`.

```bash
export MODE="camera"
./gemini-client
```

**Note:** Ensure your camera is properly configured and accessible by the system. The code currently targets `/dev/video1`. You might need to adjust this based on your camera setup.
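```go
// openVideoCapture builds (but does not start) an ffmpeg command that reads a
// single raw RGB24 frame from the V4L2 device and writes it to stdout.
func (c *Client) openVideoCapture() (*exec.Cmd, error) {
	cmd := exec.Command("ffmpeg", "-f", "v4l2", "-i", "/dev/video1", "-f", "rawvideo", "-pix_fmt", "rgb24", "-vframes", "1", "-")
	return cmd, nil
}
```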
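
Because the command above emits packed `rgb24` bytes, the frame has to be converted to an image before it can be resized or compressed. The following sketch shows one way to do that with the standard library. The helper name `rgb24ToImage`, the JPEG quality, and the assumption that the caller already knows the frame's width and height are illustrative and not taken from the PR.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/jpeg"
)

// rgb24ToImage copies a packed RGB24 frame (3 bytes per pixel, row-major)
// into an *image.RGBA with full opacity.
// Hypothetical helper; width and height must match the capture resolution.
func rgb24ToImage(frame []byte, width, height int) (*image.RGBA, error) {
	if width <= 0 || height <= 0 {
		return nil, fmt.Errorf("invalid frame size %dx%d", width, height)
	}
	if len(frame) != width*height*3 {
		return nil, fmt.Errorf("frame has %d bytes, want %d", len(frame), width*height*3)
	}
	img := image.NewRGBA(image.Rect(0, 0, width, height))
	for i, j := 0, 0; i < len(frame); i, j = i+3, j+4 {
		img.Pix[j] = frame[i]     // red
		img.Pix[j+1] = frame[i+1] // green
		img.Pix[j+2] = frame[i+2] // blue
		img.Pix[j+3] = 0xff       // alpha
	}
	return img, nil
}

func main() {
	// A synthetic 2x1 frame: one red pixel followed by one blue pixel.
	frame := []byte{255, 0, 0, 0, 0, 255}
	img, err := rgb24ToImage(frame, 2, 1)
	if err != nil {
		fmt.Println(err)
		return
	}
	var buf bytes.Buffer
	if err := jpeg.Encode(&buf, img, &jpeg.Options{Quality: 80}); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("encoded %d-byte JPEG\n", buf.Len())
}
```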

### Screen Capture Mode

To send screen captures to the API, providing visual context to your prompts, set the `MODE` environment variable to `screen`.

```bash
export MODE="screen"
./gemini-client
```

The client will periodically capture your screen and send it to the API.
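
The `MODE` values used in the camera and screen sections could be parsed along the following lines. This is a sketch under stated assumptions: the type `captureMode`, the function `parseMode`, the case-insensitive matching, and the rejection of unknown values are not confirmed by the PR.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// captureMode identifies which video source, if any, is streamed.
type captureMode int

const (
	modeNone   captureMode = iota // no video; text or audio only
	modeCamera                    // MODE=camera
	modeScreen                    // MODE=screen
)

// parseMode maps the MODE environment value to a captureMode.
// Hypothetical helper: treating an empty value as "no video" and
// rejecting unknown values are assumptions, not the PR's behavior.
func parseMode(value string) (captureMode, error) {
	switch strings.ToLower(strings.TrimSpace(value)) {
	case "":
		return modeNone, nil
	case "camera":
		return modeCamera, nil
	case "screen":
		return modeScreen, nil
	default:
		return modeNone, fmt.Errorf("unknown MODE %q (want \"camera\" or \"screen\")", value)
	}
}

func main() {
	mode, err := parseMode(os.Getenv("MODE"))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("capture mode:", mode)
}
```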

## Example Interaction

1. **Run the client (e.g., in text mode):**
```bash
./gemini-client
```
2. **You will see the prompt:**
```
message>:
```
3. **Type your message and press Enter:**
```
message>: What is the weather like today?
```
4. **The response from the Gemini API will be printed in the terminal.**

If running in audio or video modes, the client will continuously stream audio or video data to the API, enabling more dynamic and interactive conversations.

Refer to the Gemini API documentation for the available tools and their configurations.

## Future Improvements

1. **Windows Support:** Add support for Windows, including audio input/output and camera/screen capture.
2. **macOS Support:** Add support for camera and screen capture on macOS.
3. **Tool Integrations:** Add support for tools such as Google Search.
4. **Function Calling:** Add support for function calling in the Gemini API.

## Contributing

Contributions are welcome! Please feel free to submit pull requests with improvements or bug fixes.

## Ported from

A Go-based implementation inspired by the functionality of [live_api_starter.ipynb](https://github.com/google-gemini/cookbook/blob/main/gemini-2/websockets/live_api_starter.ipynb) from [https://github.com/google-gemini/cookbook](https://github.com/google-gemini/cookbook).
23 changes: 23 additions & 0 deletions samples/stream_realtime/go.mod
@@ -0,0 +1,23 @@
module google.golang.org/genai/samples/stream_realtime

go 1.23.0

require (
github.com/gordonklaus/portaudio v0.0.0-20230709114228-aafa478834f5
github.com/gorilla/websocket v1.5.3
github.com/hajimehoshi/oto v1.0.1
github.com/kbinani/screenshot v0.0.0-20240820160931-a8a2c5d0e191
github.com/nfnt/resize v0.0.0-20180221191011-83c6a9932646
github.com/sirupsen/logrus v1.9.3
)

require (
github.com/gen2brain/shm v0.1.0 // indirect
github.com/godbus/dbus/v5 v5.1.0 // indirect
github.com/jezek/xgb v1.1.1 // indirect
github.com/lxn/win v0.0.0-20210218163916-a377121e959e // indirect
golang.org/x/exp v0.0.0-20190306152737-a1d7652674e8 // indirect
golang.org/x/image v0.0.0-20190227222117-0694c2d4d067 // indirect
golang.org/x/mobile v0.0.0-20190415191353-3e0bab5405d6 // indirect
golang.org/x/sys v0.24.0 // indirect
)
44 changes: 44 additions & 0 deletions samples/stream_realtime/go.sum
@@ -0,0 +1,44 @@
github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/gen2brain/shm v0.1.0 h1:MwPeg+zJQXN0RM9o+HqaSFypNoNEcNpeoGp0BTSx2YY=
github.com/gen2brain/shm v0.1.0/go.mod h1:UgIcVtvmOu+aCJpqJX7GOtiN7X2ct+TKLg4RTxwPIUA=
github.com/godbus/dbus/v5 v5.1.0 h1:4KLkAxT3aOY8Li4FRJe/KvhoNFFxo0m6fNuFUO8QJUk=
github.com/godbus/dbus/v5 v5.1.0/go.mod h1:xhWf0FNVPg57R7Z0UbKHbJfkEywrmjJnf7w5xrFpKfA=
github.com/gordonklaus/portaudio v0.0.0-20230709114228-aafa478834f5 h1:5AlozfqaVjGYGhms2OsdUyfdJME76E6rx5MdGpjzZpc=
github.com/gordonklaus/portaudio v0.0.0-20230709114228-aafa478834f5/go.mod h1:WY8R6YKlI2ZI3UyzFk7P6yGSuS+hFwNtEzrexRyD7Es=
github.com/gorilla/websocket v1.5.3 h1:saDtZ6Pbx/0u+bgYQ3q96pZgCzfhKXGPqt7kZ72aNNg=
github.com/gorilla/websocket v1.5.3/go.mod h1:YR8l580nyteQvAITg2hZ9XVh4b55+EU/adAjf1fMHhE=
github.com/hajimehoshi/oto v1.0.1 h1:8AMnq0Yr2YmzaiqTg/k1Yzd6IygUGk2we9nmjgbgPn4=
github.com/hajimehoshi/oto v1.0.1/go.mod h1:wovJ8WWMfFKvP587mhHgot/MBr4DnNy9m6EepeVGnos=
github.com/jezek/xgb v1.1.1 h1:bE/r8ZZtSv7l9gk6nU0mYx51aXrvnyb44892TwSaqS4=
github.com/jezek/xgb v1.1.1/go.mod h1:nrhwO0FX/enq75I7Y7G8iN1ubpSGZEiA3v9e9GyRFlk=
github.com/kbinani/screenshot v0.0.0-20240820160931-a8a2c5d0e191 h1:5UHVWNX1qrIbNw7OpKbxe5bHkhHRk3xRKztMjERuCsU=
github.com/kbinani/screenshot v0.0.0-20240820160931-a8a2c5d0e191/go.mod h1:Pmpz2BLf55auQZ67u3rvyI2vAQvNetkK/4zYUmpauZQ=
github.com/lxn/win v0.0.0-20210218163916-a377121e959e h1:H+t6A/QJMbhCSEH5rAuRxh+CtW96g0Or0Fxa9IKr4uc=
github.com/lxn/win v0.0.0-20210218163916-a377121e959e/go.mod h1:KxxjdtRkfNoYDCUP5ryK7XJJNTnpC8atvtmTheChOtk=
github.com/nfnt/resize v0.0.0-20180221191011-83c6a9932646 h1:zYyBkD/k9seD2A7fsi6Oo2LfFZAehjjQMERAvZLEDnQ=
github.com/nfnt/resize v0.0.0-20180221191011-83c6a9932646/go.mod h1:jpp1/29i3P1S/RLdc7JQKbRpFeM1dOBd8T9ki5s+AY8=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/sirupsen/logrus v1.9.3 h1:dueUQJ1C2q9oE3F7wvmSGAaVtTmUizReu6fjN8uqzbQ=
github.com/sirupsen/logrus v1.9.3/go.mod h1:naHLuLoDiP4jHNo9R0sCBMtWGeIprob74mVsIT4qYEQ=
github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
github.com/stretchr/testify v1.7.0 h1:nwc3DEeHmmLAfoZucVR881uASk0Mfjw8xYJ99tb5CcY=
github.com/stretchr/testify v1.7.0/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
golang.org/x/exp v0.0.0-20190306152737-a1d7652674e8 h1:idBdZTd9UioThJp8KpM/rTSinK/ChZFBE43/WtIy8zg=
golang.org/x/exp v0.0.0-20190306152737-a1d7652674e8/go.mod h1:CJ0aWSM057203Lf6IL+f9T1iT9GByDxfZKAQTCR3kQA=
golang.org/x/image v0.0.0-20190227222117-0694c2d4d067 h1:KYGJGHOQy8oSi1fDlSpcZF0+juKwk/hEMv5SiwHogR0=
golang.org/x/image v0.0.0-20190227222117-0694c2d4d067/go.mod h1:kZ7UVZpmo3dzQBMxlp+ypCbDeSB+sBbTgSJuh5dn5js=
golang.org/x/mobile v0.0.0-20190415191353-3e0bab5405d6 h1:vyLBGJPIl9ZYbcQFM2USFmJBK6KI+t+z6jL0lbwjrnc=
golang.org/x/mobile v0.0.0-20190415191353-3e0bab5405d6/go.mod h1:E/iHnbuqvinMTCcRqshq8CkpyQDoeVncDDYHnLhea+o=
golang.org/x/sys v0.0.0-20190312061237-fead79001313/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20190429190828-d89cdac9e872/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20201018230417-eeed37f84f13/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20220715151400-c0bba94af5f8/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.24.0 h1:Twjiwq9dn6R1fQcyiK+wQyHWfaz/BJB+YIpzU/Cv3Xg=
golang.org/x/sys v0.24.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=
golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c h1:dUUwHk2QECo/6vqA44rthZ8ie2QXMNeKRTHCNY2nXvo=
gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=