Long-context capability is critical for multi-modal foundation models. We introduce LongVILA, a full-stack solution for long-context vision-language models that spans system design, model training, and dataset development. On the system side, we introduce the first long-context Multi-Modal Sequence Parallelism (MM-SP) system, which enables long-context training and inference and supports 2M context length training on 256 GPUs. MM-SP is also efficient: it is 2.1x - 5.7x faster than Ring-Style Sequence Parallelism and 1.1x - 1.4x faster than Megatron-LM in text-only settings, and it integrates seamlessly with Hugging Face Transformers. For model training, we propose a five-stage pipeline comprising alignment, pre-training, short supervised fine-tuning, context extension, and long supervised fine-tuning. Regarding datasets, we meticulously construct large-scale visual language pre-training datasets and long video instruction-following datasets to support our multi-stage training process. This full-stack solution extends the feasible number of frames in VILA by a factor of 128 (from 8 to 1024 frames) and improves the long video captioning score from 2.00 to 3.26 (1.6x), achieving 99.5% accuracy on a 1400-frame video (274k context length) needle-in-a-haystack task. LongVILA-8B also demonstrates consistent accuracy improvements on long videos in the VideoMME benchmark as the number of video frames increases.
./environment_setup.sh vila
Please refer to `scripts/v1_5/eval/needle.sh`, `scripts/v1_5/eval/video_chatgpt/run_vila_benchmark.sh`, and `llava/eval/video_mme/eval.sh` for the needle-in-a-haystack, LongVILA-Caption, and Video MME evaluations, respectively.
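For reference, a minimal sketch of running the setup and evaluations from the repository root; the invocations below assume the scripts take no extra arguments, so check each script for required checkpoint paths or options before running:

```bash
# Set up the environment (as documented above).
./environment_setup.sh vila

# Needle-in-a-haystack evaluation on long videos.
bash scripts/v1_5/eval/needle.sh

# LongVILA-Caption evaluation.
bash scripts/v1_5/eval/video_chatgpt/run_vila_benchmark.sh

# Video MME evaluation.
bash llava/eval/video_mme/eval.sh
```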
Note
💡 Sequence Parallelism Configuration
To enable sequence parallelism, set the following parameters in the training script:
- `seq_parallel_size`: The degree of sequence parallelism (SP). SP is disabled by default (value: -1).
- `seq_parallel_ring_size`: The size of the communication process group that uses the optimized Ring Attention approach within SP. Ring Attention is disabled by default.
- `seq_parallel_ring_type`: The Ring Attention implementation; `ring_varlen` and `zigzag_ring_varlen` are supported in 2D attention. Only takes effect when `seq_parallel_ring_size > 1`.
Please note that when SP is enabled, each group of `seq_parallel_size` GPUs is treated as a single device, and the global batch size is the product of the per-device batch size and the data-parallelism size.
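For illustration, here is a hedged sketch of how these flags might be passed when launching training; the script path, the `torchrun` topology, and the `--per_device_train_batch_size` flag are assumptions made for the example, not the repository's documented training entry point:

```bash
# Hypothetical launch: the script path, node/GPU counts, and batch-size flag below are
# placeholders for illustration only.
#
# 32 GPUs total with seq_parallel_size=8: each group of 8 GPUs is treated as one device,
# so the data-parallel size is 32 / 8 = 4 and the global batch size is
# per-device batch size (1) x DP size (4) = 4.
# seq_parallel_ring_size=4: 4 of the 8 SP ranks form the Ring Attention group
# (presumably the remaining factor, 8 / 4 = 2, is covered by the all-to-all dimension
# of the 2D attention).
# seq_parallel_ring_type only takes effect because seq_parallel_ring_size > 1.
torchrun --nnodes=4 --nproc_per_node=8 llava/train/train_mem.py \
    --per_device_train_batch_size 1 \
    --seq_parallel_size 8 \
    --seq_parallel_ring_size 4 \
    --seq_parallel_ring_type zigzag_ring_varlen
```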
- The code is released under the Apache 2.0 license as found in the LICENSE file.
- The pretrained weights are released under the CC-BY-NC-SA-4.0 license.
- The service is a research preview intended for non-commercial use only, and is subject to the following licenses and terms:
  - Model License of LLaMA. For the terms of use of LLAMA3-VILA checkpoints, please refer to the LLAMA3 License for additional details.
  - Terms of Use of the data generated by OpenAI.
  - Dataset Licenses for each dataset used during training.
@article{longvila,
title={LongVILA: Scaling Long-Context Visual Language Models for Long Videos},
author={Fuzhao Xue and Yukang Chen and Dacheng Li and Qinghao Hu and Ligeng Zhu and Xiuyu Li and Yunhao Fang and Haotian Tang and Shang Yang and Zhijian Liu and Yihui He and Hongxu Yin and Pavlo Molchanov and Jan Kautz and Linxi Fan and Yuke Zhu and Yao Lu and Song Han},
year={2024},
eprint={2408.10188},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
- LLaVA: the codebase we built upon. Thanks for their wonderful work.
- LongVA: we borrowed the long-video needle-in-a-haystack evaluation script from this repository.
- LongLoRA: we modified the low-rank long-context fine-tuning code from this repository.
- USP (YunChang): we adopted the 2D attention implementation from this repository.
- RingFlashAttention: we adopted the ring flash attention implementation from this repository.
- DeepSpeed Ulysses: we adopted the all-to-all implementation from this repository.
- Video-ChatGPT: we borrowed the video evaluation script from this repository.
- MMC4, COYO-700M, M3IT, OpenORCA/FLAN, ShareGPT4V, WIT, GSM8K-ScRel, VisualGenome, VCR, ScienceQA, Shot2Story, Youcook2, Vatex, ShareGPT-Video, ShareGPT4o for providing datasets used in this research.