Here we give an overview of the data format. For details, please check the data loading code: `data/pretrain_dataset.py` and `data/fintune_dataset.py`.
All the data except IMU are organized in `.csv` format. Each `.csv` file has two columns, `caption` and `url`, with `\t` as the delimiter. For example:

```
caption	url
Woman receiving a foot massage at health spa Stock Photo	cluster_p_ssd:s3://laion400m_mmg_ssd/29347/293477138.jpg
Long injury list troubles Paul Hart as Portsmouth search for some Cup form	cluster_p_ssd:s3://laion400m_mmg_ssd/43069/430692001.jpg
...	...
```
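For reference, a minimal sketch of how such a file can be read with pandas (the file name is hypothetical; the column names and delimiter follow the description above):

```python
import pandas as pd

# Hypothetical file name; each pretraining .csv has a header row and
# two tab-separated columns: caption and url.
df = pd.read_csv("laion400m_part0.csv", sep="\t")

for caption, url in zip(df["caption"], df["url"]):
    print(caption, url)
```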
All finetuning data are converted into a multi-turn conversation format. The `.json` file contains a list of training samples, where each sample contains the following keys: `id`, `image` and `conversations`. For example:

```json
{
    "id": "000000033471",
    "image": "InstructionTuning/image/coco/train2017/000000033471.jpg",
    "conversations": [
        {"from": "human", "value": "What are the colors of the bus in the image?"},
        {"from": "gpt", "value": "The bus in the image is white and red."},
        {"from": "human", "value": "What feature can be seen on the back of the bus?"},
        {"from": "gpt", "value": "The back of the bus features an advertisement."}
    ]
}
```
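As a sketch, the samples can be loaded and walked through like this (the file name is illustrative; the actual annotation files are listed in the directory layout below and share the same schema):

```python
import json

# Illustrative file name; any instruction-tuning .json in the layout
# below is a list of samples with keys: id, image, conversations.
with open("llava_v1_5_mix665k_image.json") as f:
    samples = json.load(f)

sample = samples[0]
print(sample["id"], sample["image"])

# Conversations alternate between "human" and "gpt" turns.
for turn in sample["conversations"]:
    print(f'{turn["from"]}: {turn["value"]}')
```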
| Modality | Pretraining Dataset | Download | Instruction Tuning Dataset | Download |
| --- | --- | --- | --- | --- |
| Image | LAION-400M | link | LLaVA-mix665K | link |
| Image | LAION-COCO | link | COCO Caption | link |
| Video | WebVid-2.5M | link | MSRVTT Caption | link |
| Video | | | MSRVTT-QA | link |
| Video | | | Video Conversation | link |
| Audio | WavCaps | link | AudioCaps | link |
| Audio | | | Audio Conversation | link |
| Point | Cap3D | link | Point Conversation | link |
| Depth | CC3M | link | LLaVA-150K | link |
| Normal | CC3M | link | LLaVA-150K | link |
| IMU | Ego4D | link | Ego4D | link |
| fMRI | NSD | link | NSD | link |
**Notes**
- The depth and normal maps are generated from CC3M and a 50K random subset of LLaVA-150K using a pretrained DPT; see the sketch after this list.
- The IMU data is preprocessed with this script.
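The exact generation pipeline is not included here. As an assumption-laden sketch, depth maps can be produced with the DPT models available in Hugging Face `transformers`; the checkpoint name and output handling below are our choices, not necessarily the ones used for this dataset:

```python
import torch
from PIL import Image
from transformers import DPTImageProcessor, DPTForDepthEstimation

# Assumed checkpoint; the actual DPT weights used for the dataset may differ.
processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")

image = Image.open("example.jpg")  # hypothetical CC3M image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    depth = model(**inputs).predicted_depth  # (1, H', W') relative depth

# Resize the prediction back to the input resolution.
depth = torch.nn.functional.interpolate(
    depth.unsqueeze(1), size=image.size[::-1], mode="bicubic"
).squeeze()
```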
**Annotation Download:** Please download the annotations from this link and put them under `datasets/InstructionTuning`.

Then download the original datasets from the table above and put them under the corresponding folders. The file structure should be:
```
datasets
└── InstructionTuning
    ├── audio
    │   ├── audioset2
    │   ├── audiocap_train.json
    │   ├── audiocap_val.json
    │   └── audio_conversation.json
    ├── depth_normal
    │   ├── depth
    │   ├── normal
    │   ├── llava_instruct_50k_depth.json
    │   └── llava_instruct_50k_normal.json
    ├── fmri
    │   ├── NSD
    │   └── fmri_fixed_train.json
    ├── image
    │   ├── coco
    │   ├── gqa
    │   ├── ocr_vqa
    │   ├── vg
    │   ├── cococap_train.json
    │   ├── llava_v1_5_mix665k_image.json
    │   └── llava_v1_5_mix665k_text.json
    ├── imu
    │   ├── ego4d
    │   └── imu_fixed_50k.json
    ├── point
    │   ├── pointllm/8192_npy
    │   └── pointllm_70k.json
    └── video
        ├── msr-vtt/MSR-VTT
        ├── msrvtt_cap_test.json
        ├── msrvtt_cap_trainval.json
        ├── msrvtt_vqa_test.json
        ├── msrvtt_vqa_train.json
        ├── msrvtt_vqa_val.json
        ├── video_complex_reasoning_10k.json
        ├── video_conversation_10k.json
        └── video_detail_10k.json
```
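Once everything is in place, a quick sanity check over the expected modality folders can catch missing downloads early (a minimal sketch; the list only covers the top-level directories shown above):

```python
from pathlib import Path

ROOT = Path("datasets/InstructionTuning")

# Top-level modality folders from the layout above.
expected = ["audio", "depth_normal", "fmri", "image", "imu", "point", "video"]

missing = [d for d in expected if not (ROOT / d).is_dir()]
if missing:
    print("Missing folders:", ", ".join(missing))
else:
    print("All modality folders are present.")
```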