Chenming Zhu
Tai Wang*
Wenwei Zhang
Jiangmiao Pang
Xihui Liu*
The University of Hong Kong, Shanghai AI Laboratory
- [2024-11-29] We update the custom data instruction tuning tutorial; you can now train the model on your own dataset!
- [2024-10-19] We release the inference code and checkpoints, along with the image and 3D scene demos. You can chat with LLaVA-3D on your own machine.
- [2024-09-28] We release the paper of LLaVA-3D. 🎉
- 🔍 Model Architecture
- 🔨 Install
- 📦 Model Zoo
- 🤖 Demo
- 📝 TODO List
- 🔗 Citation
- 📄 License
- 👏 Acknowledgements
We test our code under the following environment:
- Python 3.10
- PyTorch 2.1.0
- CUDA Version 11.8
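As a quick sanity check before installing anything else, the versions listed above can be compared against the active environment. The following is a minimal sketch and not part of the repository; the helper names `installed_versions` and `version_mismatches` are hypothetical. It only checks Python and the `torch` package so that it runs even when PyTorch is not installed yet.

```python
import sys
from importlib import metadata

# Versions listed above, treated here as the reference environment.
EXPECTED = {"python": "3.10", "torch": "2.1.0"}


def installed_versions():
    """Collect the Python version and, if installed, the torch package version."""
    found = {"python": f"{sys.version_info.major}.{sys.version_info.minor}"}
    try:
        found["torch"] = metadata.version("torch")
    except metadata.PackageNotFoundError:
        found["torch"] = None  # torch is not installed in this environment
    return found


def version_mismatches(found, expected=EXPECTED):
    """Return one message per component whose version differs from `expected`."""
    problems = []
    for name, want in expected.items():
        have = found.get(name)
        # Accept an exact match, a longer release ("3.10.4"), or a local
        # build tag ("2.1.0+cu118").
        ok = have is not None and (
            have == want
            or have.startswith(want + ".")
            or have.startswith(want + "+")
        )
        if not ok:
            problems.append(f"{name}: expected {want}, found {have}")
    return problems


if __name__ == "__main__":
    for line in version_mismatches(installed_versions()) or ["environment matches"]:
        print(line)
```

Once PyTorch is installed, `torch.version.cuda` reports the CUDA version the wheel was built against, which can be compared with 11.8 in the same way.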
To start:
- Clone this repository.
git clone https://github.com/ZCMax/LLaVA-3D.git
cd LLaVA-3D
- Install Packages
conda create -n llava-3d python=3.10 -y
conda activate llava-3d
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
pip install torch-scatter -f https://data.pyg.org/whl/torch-2.1.0+cu118.html
pip install -e .
- Download the Camera Parameters File and put the JSON file under ./playground/data/annotations.
- Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
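After these steps, it can be useful to confirm that the camera parameters file actually landed in ./playground/data/annotations and was not truncated during download. The sketch below is an illustration under assumptions: the file name is not stated here, so it simply looks for any `.json` file in that directory that parses, and the helper `find_valid_json` is hypothetical rather than something the repository provides.

```python
import json
from pathlib import Path


def find_valid_json(directory):
    """Return the names of .json files in `directory` that parse as JSON."""
    directory = Path(directory)
    if not directory.is_dir():
        return []
    valid = []
    for path in sorted(directory.glob("*.json")):
        try:
            with path.open() as f:
                json.load(f)
        except (OSError, ValueError):
            continue  # unreadable or truncated file; skip it
        valid.append(path.name)
    return valid


if __name__ == "__main__":
    annotations = Path("./playground/data/annotations")
    files = find_valid_json(annotations)
    print(files if files else f"no readable JSON file found under {annotations}")
```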
The trained model checkpoints are available here. Currently we only provide the 7B model, and we will continue to update the model zoo.
We currently support a single image as input for 2D tasks and posed RGB-D images as input for 3D tasks. You can run the demo with the script llava/eval/run_llava_3d.py. For 2D tasks, use the --image-file parameter; for 3D tasks, use the --video-path parameter to provide the corresponding data. Here are some example demos:
python llava/eval/run_llava_3d.py \
--model-path ChaimZhu/LLaVA-3D-7B \
--image-file https://llava-vl.github.io/static/images/view.jpg \
--query "What are the things I should be cautious about when I visit here?"
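When running many queries, it may be convenient to assemble the command line programmatically. The sketch below is a hypothetical helper, `build_demo_command`, which is not part of the repository. It uses only the flags shown in this README (`--model-path`, `--image-file`, `--video-path`, `--query`) and assumes that a run takes exactly one of the two input kinds, which the examples suggest but the README does not state outright.

```python
import shlex

SCRIPT = "llava/eval/run_llava_3d.py"


def build_demo_command(model_path, query, image_file=None, video_path=None):
    """Build the argv list for one demo run with either a 2D or a 3D input."""
    # Assumption: each run takes one input kind, as in the README examples.
    if (image_file is None) == (video_path is None):
        raise ValueError("pass exactly one of image_file or video_path")
    cmd = ["python", SCRIPT, "--model-path", model_path]
    if image_file is not None:
        cmd += ["--image-file", image_file]   # 2D task: a single image
    else:
        cmd += ["--video-path", video_path]   # 3D task: posed RGB-D scene directory
    cmd += ["--query", query]
    return cmd


if __name__ == "__main__":
    cmd = build_demo_command(
        "ChaimZhu/LLaVA-3D-7B",
        "What are the things I should be cautious about when I visit here?",
        image_file="https://llava-vl.github.io/static/images/view.jpg",
    )
    # Print a shell-safe string; the list itself can be passed to subprocess.run.
    print(shlex.join(cmd))
```

Returning a list rather than a string avoids quoting problems when the query contains spaces or brackets.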
We provide the demo scene here. Download the demo data and put it under ./demo.
- 3D Question Answering
python llava/eval/run_llava_3d.py \
--model-path ChaimZhu/LLaVA-3D-7B \
--video-path ./demo/scannet/scene0356_00 \
--query "Tell me the only object that I could see from the other room and describe the object."
- 3D Dense Captioning
python llava/eval/run_llava_3d.py \
--model-path ChaimZhu/LLaVA-3D-7B \
--video-path ./demo/scannet/scene0566_00 \
--query "The related object is located at [0.981, 1.606, 0.430]. Describe the object in detail."
- 3D Localization
python llava/eval/run_llava_3d.py \
--model-path ChaimZhu/LLaVA-3D-7B \
--video-path ./demo/scannet/scene0382_01 \
--query "The related object is located at [-0.085,1.598,1.310]. Please output the 3D bounding box of the object and then describe the object."
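The dense captioning and localization queries above embed a 3D point in the prompt text. The following sketch shows one way to generate such query strings from coordinates. The helper names are hypothetical, and the three-decimal format is simply copied from the examples. The two examples space the coordinates differently ("[0.981, 1.606, 0.430]" versus "[-0.085,1.598,1.310]"); this sketch uses the first style, and whether the model is sensitive to that difference is not stated here.

```python
def format_point(xyz):
    """Format a 3D point as "[x, y, z]" with three decimals, as in the demo queries."""
    if len(xyz) != 3:
        raise ValueError("expected exactly three coordinates")
    return "[" + ", ".join(f"{float(v):.3f}" for v in xyz) + "]"


def caption_query(xyz):
    """Query text in the style of the 3D dense captioning example."""
    return (
        f"The related object is located at {format_point(xyz)}. "
        "Describe the object in detail."
    )


def localization_query(xyz):
    """Query text in the style of the 3D localization example."""
    return (
        f"The related object is located at {format_point(xyz)}. "
        "Please output the 3D bounding box of the object and then describe the object."
    )


if __name__ == "__main__":
    print(caption_query([0.981, 1.606, 0.430]))
    print(localization_query([-0.085, 1.598, 1.310]))
```

The resulting string can be passed as the value of `--query` together with the matching `--video-path`.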
- Release the training and inference code.
- Release the checkpoint, demo data and script.
- Release gradio demo.
- Release the evaluation script.
- Release the training and evaluation datasets.
If you find our work and this codebase helpful, please consider starring this repo 🌟 and citing:
@article{zhu2024llava,
title={LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness},
author={Zhu, Chenming and Wang, Tai and Zhang, Wenwei and Pang, Jiangmiao and Liu, Xihui},
journal={arXiv preprint arXiv:2409.18125},
year={2024}
}
This work is under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.