Describe is an app that generates customizable audiovisual descriptions of videos. It uses a combination of visual language models (VLMs) and language models (LMs) to generate a summary of the video content. The app is designed to be highly customizable, allowing users to control the level of visual detail, conciseness, and the influence of spoken context on the final summary.
To run this on your own videos quickly, check out the app on Sieve here. Below is an example output for an interview with Boris Johnson:
In this video, a blonde man in a blue Xchanging shirt, accompanied by another person in similar attire, engages with the press outdoors. As the reporters incessantly question him about his regrets regarding earlier comments, he deflects by offering tea, emphasizing the gesture's humanitarian nature and expressing sympathy for the reporters' long wait. Despite repeated inquiries, he maintains a polite demeanor, focusing on the act of offering tea rather than providing direct answers. The visual content captures this exchange against an outdoor backdrop, highlighting the interaction's dynamic and the man's approach in handling the situation.
Note: This app uses functions that run on the Sieve platform. You will need a Sieve account to use this. However, all the logic is open-source and pretty simple to replicate in your own environment as well. The functions used are:
- Speech Transcriber for transcriptions
- MoonDream (Low visual detail but fastest)
- InternLM-X (Medium visual detail and speed)
- CogVLM (High visual detail but slower)
- VILA (Ultra high visual detail but slowest)
Summaries are generated using an LLM backend that combines various pieces of context. We have two options for this: openai and mixtral.
- If openai is selected, we use GPT-4 by OpenAI.
- If mixtral is selected, we use the mistralai/Mixtral-8x7B-Instruct-v0.1 model hosted on Together AI.
To use the app in your account, you will need to add an OPENAI_API_KEY or a TOGETHERAI_API_KEY secret in your Sieve account settings and specify the backend accordingly.
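The pairing of backend and secret can be sketched as a small lookup. This is a minimal illustration, not the app's actual code: the table layout and the `required_secret` helper are hypothetical, and only the backend names, model names, and secret names come from the text above.

```python
# Hypothetical lookup table. Only the backend names, model names, and
# secret names are taken from the README; the structure is illustrative.
BACKENDS = {
    "openai": {"model": "GPT-4", "secret": "OPENAI_API_KEY"},
    "mixtral": {
        "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
        "secret": "TOGETHERAI_API_KEY",
    },
}


def required_secret(backend: str) -> str:
    """Return the name of the secret that the chosen backend needs."""
    if backend not in BACKENDS:
        raise ValueError(
            f"unknown backend {backend!r}; expected one of {sorted(BACKENDS)}"
        )
    return BACKENDS[backend]["secret"]
```

For example, `required_secret("mixtral")` names the Together AI secret, so a missing key can be caught before a job is submitted.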
The app offers presets for conciseness (three levels) and visual_detail (four levels), as well as a boolean for spoken_context.
The visual_detail preset influences how much visual detail is present in the final summary by using a different VLM for each level.
- low uses MoonDream, a lightweight and very fast VLM that is ideal for scenarios that don't require much visual information. It has the fastest response time but the least detail.
- medium uses InternLM-X, a SOTA VLM that outperforms GPT-4V in many benchmarks. This is the 4-bit quantized version, and it falls between low and high in both speed and detail.
- high uses CogVLM, a rich VLM paired with a full-sized LLM, which makes it capable of highly detailed captions. It is slower to respond but high in detail.
- ultra uses VILA, a VLM pretrained with video understanding as a core focus. It can work with multiple frames, and therefore longer sections of a video, at once, which captures more detail and produces better captions.
The default is high.
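As a rough sketch, choosing a captioning model from the preset amounts to a dictionary lookup with a default. The `select_vlm` helper and the constant names below are hypothetical; the preset names, model names, and the default of high are taken from the description above.

```python
# Preset-to-model mapping as described in the text. The helper and the
# constant names are illustrative, not the app's real API.
VLM_BY_DETAIL = {
    "low": "MoonDream",
    "medium": "InternLM-X",
    "high": "CogVLM",
    "ultra": "VILA",
}
DEFAULT_VISUAL_DETAIL = "high"


def select_vlm(visual_detail: str = DEFAULT_VISUAL_DETAIL) -> str:
    """Return the VLM name for a visual_detail preset (case-insensitive)."""
    key = visual_detail.strip().lower()
    if key not in VLM_BY_DETAIL:
        raise ValueError(
            f"unknown visual_detail {visual_detail!r}; "
            f"expected one of {list(VLM_BY_DETAIL)}"
        )
    return VLM_BY_DETAIL[key]
```

Rejecting unknown values early keeps a typo such as "hgih" from silently falling back to a slower or less detailed model.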
The conciseness preset influences how verbose or concise the final summary is by prompting GPT-3.5 and GPT-4 with constraints on length and level of detail.
- concise mode tries to keep the summary to 4 sentences and produces the shortest output.
- medium mode tries to keep the summary to 8 sentences.
- detailed mode tries to keep the summary to at most 12 sentences and is the most verbose.
The default is concise.
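One way such a length constraint could be injected into a prompt is sketched below. The sentence limits come from the list above; the wording of the instruction and the `length_instruction` helper are assumptions made for illustration, since the actual prompts are not shown here.

```python
# Sentence limits per preset, as listed in the text above.
SENTENCE_LIMITS = {"concise": 4, "medium": 8, "detailed": 12}


def length_instruction(conciseness: str = "concise") -> str:
    """Build a (hypothetical) prompt fragment that caps summary length."""
    if conciseness not in SENTENCE_LIMITS:
        raise ValueError(
            f"unknown conciseness {conciseness!r}; "
            f"expected one of {list(SENTENCE_LIMITS)}"
        )
    limit = SENTENCE_LIMITS[conciseness]
    # The phrasing is an assumption; only the numeric limit is sourced.
    return f"Keep the summary to at most {limit} sentences."
```

The returned string would typically be appended to the system or user prompt sent to the summarizing model.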
The spoken_context option dictates whether the speech in the video influences the final summary. If set to False, the summary is generated from only the visuals seen across the video. If set to True, GPT-3.5 is first used to create shorter summaries, which are then combined at the end to create the final summary.
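The two-stage flow described here (short summaries first, then one combined summary) can be sketched in a model-agnostic way. Everything below is an assumption for illustration: the function name, the idea that each chunk is a piece of text mixing captions and transcript, and the two callables standing in for the LLM calls.

```python
from typing import Callable, Sequence


def summarize_in_stages(
    chunks: Sequence[str],
    summarize: Callable[[str], str],
    combine: Callable[[str], str],
) -> str:
    """Summarize each chunk on its own, then merge the partial summaries.

    `summarize` stands in for the first-pass model call and `combine`
    for the final call; both are placeholders, not real API bindings.
    """
    partials = [summarize(chunk) for chunk in chunks]
    # Hand all partial summaries to the final step as one newline-joined text.
    return combine("\n".join(partials))
```

Passing the model calls in as plain functions keeps the flow testable without network access and lets the two stages use different models.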
This option provides sentence-level references for each sentence in the summary. Each reference points to the audio transcripts, visual captions, and objects in the scenes, with start_time and end_time indicating the span of that sentence in the video. It enables chunk_by_scene by default for better context.
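To make the shape of such a reference concrete, here is a hypothetical record for one summary sentence. The class and its field names, apart from end_time and the timing pair it belongs to, are illustrative guesses and not the app's actual output schema.

```python
from dataclasses import dataclass, field


@dataclass
class SentenceReference:
    """Hypothetical container for the evidence behind one summary sentence."""

    sentence: str
    start_time: float  # seconds into the video where the span begins
    end_time: float    # seconds into the video where the span ends
    transcripts: list[str] = field(default_factory=list)  # spoken evidence
    captions: list[str] = field(default_factory=list)     # visual captions
    objects: list[str] = field(default_factory=list)      # objects in scene

    def duration(self) -> float:
        """Length of the referenced span in seconds."""
        return self.end_time - self.start_time


ref = SentenceReference(
    sentence="He deflects by offering tea.",
    start_time=2.0,
    end_time=7.5,
    captions=["a man holding a tray of mugs outdoors"],
)
```

A consumer could use the timing fields to jump a video player to the span that supports a given sentence.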