A fundamental question #3
Comments
Hello, and thank you for your interest! Yes, Lyra's vision capabilities are primarily powered by Qwen2-VL. Based on our experimental results and the Qwen2-VL report, the number of vision tokens is capped at 16,384, so in practice the frame count can go well beyond 32, and the hallucination issue you describe is unlikely to occur at only 16 frames. We suspect that your video preprocessing does not align with the official preprocessing steps of Qwen2-VL (see the video-inference section of its GitHub repository; a minimal sketch is included after this reply). In contrast, our long-video framework diverges from conventional VLM approaches:
For long videos, the frame count can become substantial (more than 10K frames), resulting in an excessive number of vision tokens. Conventional VLMs address this with temporal frame sampling (e.g., 0.1 FPS, 0.02 FPS, or even lower), which inevitably discards a large amount of input information. Lyra's long-speech model, by contrast, applies no temporal sampling to the audio modality and therefore retains nearly all of the video's audio information, which complements and strengthens the MLLM's ability to handle long video sequences. Compared with a naive VLM, Lyra's omni-understanding is significantly more accurate.
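For reference, here is a minimal sketch of Qwen2-VL-style video preprocessing, adapted from the video-inference section of the Qwen2-VL README; the checkpoint name, video path, fps, and max_pixels values are illustrative placeholders rather than Lyra's settings:

```python
# Hedged sketch of Qwen2-VL's official video preprocessing (see the "video inference"
# section of the Qwen2-VL GitHub README). Model name, video path, fps, and max_pixels
# below are placeholders for illustration.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/video.mp4",
         "max_pixels": 360 * 420, "fps": 1.0},
        {"type": "text", "text": "Describe this video."},
    ],
}]

# apply_chat_template builds the prompt; process_vision_info samples and resizes the
# video frames the way the model expects, which is the step that is easy to get wrong.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:],
                             skip_special_tokens=True)[0])
```

The key point is to let process_vision_info handle frame sampling and resizing rather than feeding raw frames to the processor directly.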
Hi, thank you for your reply. Regarding the audio part, the audio encoder employed is Whisper. How does it manage to capture long-range audio context? As far as I know, Whisper can only handle audio of a few minutes at most.
Thank you for your question! Whisper is generally limited to processing audio segments under 30 seconds. To address this, as explained in detail in Section 3.4 of our paper, we use a strategy similar to how CLIP processes high-resolution images: we split long audio into smaller chunks by time, extract features from each chunk with Whisper, and concatenate all of the features into a single long token sequence that is fed into the LLM (a minimal sketch follows below). In our experiments, this approach supports effective understanding of audio up to roughly 5-7 minutes in length; for audio exceeding 7 minutes, issues such as hallucination and repetitive content may occur. To strengthen the model's ability to handle longer audio, we fine-tuned it on the Lyra-LongSpeech-12K dataset. We also explored different splitting strategies and found that non-overlapping chunks currently yield the best results. Of course, this area still warrants further experimentation, including optimizing the length of the Whisper-extracted features for better processing efficiency. We hope this clarifies your question, and we appreciate your interest!
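To make the chunk-and-concatenate idea concrete, here is a minimal sketch, not Lyra's actual implementation, assuming the Hugging Face Whisper encoder; the checkpoint name and the 30-second chunk length are assumptions, and the projection that maps these features into the LLM's embedding space is omitted:

```python
# Minimal sketch of the chunk-and-concatenate strategy described above -- not Lyra's
# actual code. Assumes the Hugging Face Whisper implementation; the checkpoint name
# and 30-second chunk length are illustrative.
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
encoder = WhisperModel.from_pretrained("openai/whisper-large-v3").get_encoder()

def encode_long_audio(waveform, sr=16000, chunk_seconds=30):
    """Split audio into non-overlapping chunks, encode each chunk with the Whisper
    encoder, and concatenate the per-chunk features into one long sequence."""
    chunk_size = chunk_seconds * sr
    features = []
    for start in range(0, len(waveform), chunk_size):
        chunk = waveform[start:start + chunk_size]
        # The feature extractor pads each chunk to 30 s of log-mel features.
        inputs = feature_extractor(chunk, sampling_rate=sr, return_tensors="pt")
        with torch.no_grad():
            out = encoder(inputs.input_features)   # (1, T, d) per chunk
        features.append(out.last_hidden_state)
    return torch.cat(features, dim=1)              # (1, total_T, d), fed to the LLM
```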
Hi, how many tokens would be required if a 10-minute audio clip is split into chunks? (I haven't run the code yet, but would this easily exceed the LLM's context length?)
Hi, Lyra encodes audio at approximately 10 tokens per second. A 10-minute clip therefore requires about 6,000 tokens, which is far below the context length of most current advanced LLMs.
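As a quick sanity check of that arithmetic (the roughly 10-tokens-per-second rate is the figure quoted above):

```python
# Back-of-the-envelope audio-token budget at Lyra's stated ~10 tokens/second rate.
tokens_per_second = 10
duration_seconds = 10 * 60                     # a 10-minute clip
audio_tokens = tokens_per_second * duration_seconds
print(audio_tokens)                            # 6000 -- well within a typical 32K+ context window
```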
Hi, I’m going to close this issue as it seems to be resolved. If you have any further questions or concerns, feel free to reopen the discussion or create a new issue. Thanks!
Hello, this is remarkable work. I would like to ask a couple of in-depth questions:
What is the main source of the long-video understanding ability? Does it primarily stem from Qwen2-VL? I have experimented with Qwen2-VL, and it tends to hallucinate heavily once the number of frames exceeds roughly 16.
According to the report and your demo video, it appears to handle videos of several minutes or even up to half an hour. What other techniques were employed here?