A fundamental question #3
Comments
Hello, and thank you for your interest! Yes, Lyra's vision capabilities are primarily powered by Qwen2-VL. Based on our experimental results and the Qwen2-VL report, the number of vision tokens is capped at 16,384, so in practice the frame count can go well beyond 32, and the hallucination issue you describe is unlikely to occur at only 16 frames. We suspect that your video preprocessing does not align with the official preprocessing steps of Qwen2-VL (see the video-inference section of its GitHub repository; a minimal sketch is included after this reply). In contrast, our long-video framework diverges from conventional VLM approaches:
For long videos, the frame count can become substantial (more than 10K frames), resulting in an excessive number of vision tokens. Conventional VLMs address this with temporal frame sampling (e.g., 0.1 FPS, 0.02 FPS, or even lower), which inevitably discards a large amount of input information. Lyra's long-speech model, by contrast, applies no temporal sampling to the audio modality and therefore retains nearly all of the video's audio information, which complements and strengthens the MLLM's ability to handle long video sequences. Compared with a naive VLM, Lyra's omni-understanding is significantly more accurate.
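For reference, here is a minimal sketch of Qwen2-VL-style video preprocessing, adapted from the video-inference section of the Qwen2-VL README; the checkpoint name, video path, fps, and max_pixels values are illustrative placeholders rather than Lyra's settings:

```python
# Hedged sketch of Qwen2-VL's official video preprocessing (see the "video inference"
# section of the Qwen2-VL GitHub README). Model name, video path, fps, and max_pixels
# below are placeholders for illustration.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/video.mp4",
         "max_pixels": 360 * 420, "fps": 1.0},
        {"type": "text", "text": "Describe this video."},
    ],
}]

# apply_chat_template builds the prompt; process_vision_info samples and resizes the
# video frames the way the model expects, which is the step that is easy to get wrong.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:],
                             skip_special_tokens=True)[0])
```

The key point is to let process_vision_info handle frame sampling and resizing rather than feeding raw frames to the processor directly.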
Hi, thank you for your reply. Regarding the audio part, the audio encoder employed is Whisper. How does it manage to capture long-range audio context? As far as I know, Whisper can only handle audio of a few minutes at most.
Thank you for your question! Whisper is generally limited to processing audio segments under 30 seconds. To address this, as explained in detail in Section 3.4 of our paper, we use a strategy similar to how CLIP processes high-resolution images: we split long audio into smaller chunks by time, extract features from each chunk with Whisper, and concatenate all of the features into a single long token sequence that is fed into the LLM (a minimal sketch follows below). In our experiments, this approach supports effective understanding of audio up to roughly 5-7 minutes in length; for audio exceeding 7 minutes, issues such as hallucination and repetitive content may occur. To strengthen the model's ability to handle longer audio, we fine-tuned it on the Lyra-LongSpeech-12K dataset. We also explored different splitting strategies and found that non-overlapping chunks currently yield the best results. Of course, this area still warrants further experimentation, including optimizing the length of the Whisper-extracted features for better processing efficiency. We hope this clarifies your question, and we appreciate your interest!
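To make the chunk-and-concatenate idea concrete, here is a minimal sketch, not Lyra's actual implementation, assuming the Hugging Face Whisper encoder; the checkpoint name and the 30-second chunk length are assumptions, and the projection that maps these features into the LLM's embedding space is omitted:

```python
# Minimal sketch of the chunk-and-concatenate strategy described above -- not Lyra's
# actual code. Assumes the Hugging Face Whisper implementation; the checkpoint name
# and 30-second chunk length are illustrative.
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
encoder = WhisperModel.from_pretrained("openai/whisper-large-v3").get_encoder()

def encode_long_audio(waveform, sr=16000, chunk_seconds=30):
    """Split audio into non-overlapping chunks, encode each chunk with the Whisper
    encoder, and concatenate the per-chunk features into one long sequence."""
    chunk_size = chunk_seconds * sr
    features = []
    for start in range(0, len(waveform), chunk_size):
        chunk = waveform[start:start + chunk_size]
        # The feature extractor pads each chunk to 30 s of log-mel features.
        inputs = feature_extractor(chunk, sampling_rate=sr, return_tensors="pt")
        with torch.no_grad():
            out = encoder(inputs.input_features)   # (1, T, d) per chunk
        features.append(out.last_hidden_state)
    return torch.cat(features, dim=1)              # (1, total_T, d), fed to the LLM
```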
Hi, how many tokens would be required if a 10-minute audio clip is split into chunks? (I haven't run the code yet, but would this easily exceed the LLM's context length?)
Hi, Lyra encodes audio at approximately 10 tokens per second. A 10-minute clip therefore requires about 6,000 tokens, which is far below the context length of most current advanced LLMs.
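As a quick sanity check of that arithmetic (the roughly 10-tokens-per-second rate is the figure quoted above):

```python
# Back-of-the-envelope audio-token budget at Lyra's stated ~10 tokens/second rate.
tokens_per_second = 10
duration_seconds = 10 * 60                     # a 10-minute clip
audio_tokens = tokens_per_second * duration_seconds
print(audio_tokens)                            # 6000 -- well within a typical 32K+ context window
```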
Hi, I’m going to close this issue as it seems to be resolved. If you have any further questions or concerns, feel free to reopen the discussion or create a new issue. Thanks!
Hello, this is remarkable work. I would like to ask a couple of in-depth questions:
What is the main source of the long-video understanding ability? Does it primarily stem from Qwen2-VL? I have experimented with Qwen2-VL, and it tends to hallucinate heavily once the number of frames exceeds roughly 16.
According to the report and your demo video, it appears to handle videos of several minutes or even up to half an hour. What other techniques were employed here?