Here we give an overview of the data format. For details, please check the data loading code: `data/pretrain_dataset.py` and `data/fintune_dataset.py`.
All the data except IMU are organized in `.csv` format. Each `.csv` file has two columns, `caption` and `url`, with `\t` as the delimiter. For example:

```
caption	url
Woman receiving a foot massage at health spa Stock Photo	cluster_p_ssd:s3://laion400m_mmg_ssd/29347/293477138.jpg
Long injury list troubles Paul Hart as Portsmouth search for some Cup form	cluster_p_ssd:s3://laion400m_mmg_ssd/43069/430692001.jpg
...	...
```
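For reference, a minimal sketch of how such a file can be read with pandas (the file name is hypothetical; the column names and delimiter follow the description above):

```python
import pandas as pd

# Hypothetical file name; each pretraining .csv has a header row and
# two tab-separated columns: caption and url.
df = pd.read_csv("laion400m_part0.csv", sep="\t")

for caption, url in zip(df["caption"], df["url"]):
    print(caption, url)
```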
All finetuning data are converted into a multi-turn conversation format. The `.json` file contains a list of training samples, where each sample contains the following keys: `id`, `image` and `conversations`. For example:

```json
{
    "id": "000000033471",
    "image": "InstructionTuning/image/coco/train2017/000000033471.jpg",
    "conversations": [
        {"from": "human", "value": "What are the colors of the bus in the image?"},
        {"from": "gpt", "value": "The bus in the image is white and red."},
        {"from": "human", "value": "What feature can be seen on the back of the bus?"},
        {"from": "gpt", "value": "The back of the bus features an advertisement."}
    ]
}
```
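As a sketch, the samples can be loaded and walked through like this (the file name is illustrative; the actual annotation files are listed in the directory layout below and share the same schema):

```python
import json

# Illustrative file name; any instruction-tuning .json in the layout
# below is a list of samples with keys: id, image, conversations.
with open("llava_v1_5_mix665k_image.json") as f:
    samples = json.load(f)

sample = samples[0]
print(sample["id"], sample["image"])

# Conversations alternate between "human" and "gpt" turns.
for turn in sample["conversations"]:
    print(f'{turn["from"]}: {turn["value"]}')
```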
| Modality | Pretraining Dataset | Download | Instruction Tuning Dataset | Download |
| --- | --- | --- | --- | --- |
| Image | LAION-400M | link | LLaVA-mix665K | link |
| Image | LAION-COCO | link | COCO Caption | link |
| Video | WebVid-2.5M | link | MSRVTT Caption | link |
| Video | | | MSRVTT-QA | link |
| Video | | | Video Conversation | link |
| Audio | WavCaps | link | AudioCaps | link |
| Audio | | | Audio Conversation | link |
| Point | Cap3D | link | Point Conversation | link |
| Depth | CC3M | link | LLaVA-150K | link |
| Normal | CC3M | link | LLaVA-150K | link |
| IMU | Ego4D | link | Ego4D | link |
| fMRI | NSD | link | NSD | link |
**Notes**
- The depth and normal maps are generated from CC3M and a 50K random subset of LLaVA-150K using a pretrained DPT; see the sketch after this list.
- The IMU data is preprocessed with this script.
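The exact generation pipeline is not included here. As an assumption-laden sketch, depth maps can be produced with the DPT models available in Hugging Face `transformers`; the checkpoint name and output handling below are our choices, not necessarily the ones used for this dataset:

```python
import torch
from PIL import Image
from transformers import DPTImageProcessor, DPTForDepthEstimation

# Assumed checkpoint; the actual DPT weights used for the dataset may differ.
processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")

image = Image.open("example.jpg")  # hypothetical CC3M image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    depth = model(**inputs).predicted_depth  # (1, H', W') relative depth

# Resize the prediction back to the input resolution.
depth = torch.nn.functional.interpolate(
    depth.unsqueeze(1), size=image.size[::-1], mode="bicubic"
).squeeze()
```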
**Annotation Download:** Please download the annotations from this link and put them under `datasets/InstructionTuning`.

Then download the original datasets from the table above and put them under the corresponding folders. The file structure should be:
```
datasets
└── InstructionTuning
    ├── audio
    │   ├── audioset2
    │   ├── audiocap_train.json
    │   ├── audiocap_val.json
    │   └── audio_conversation.json
    ├── depth_normal
    │   ├── depth
    │   ├── normal
    │   ├── llava_instruct_50k_depth.json
    │   └── llava_instruct_50k_normal.json
    ├── fmri
    │   ├── NSD
    │   └── fmri_fixed_train.json
    ├── image
    │   ├── coco
    │   ├── gqa
    │   ├── ocr_vqa
    │   ├── vg
    │   ├── cococap_train.json
    │   ├── llava_v1_5_mix665k_image.json
    │   └── llava_v1_5_mix665k_text.json
    ├── imu
    │   ├── ego4d
    │   └── imu_fixed_50k.json
    ├── point
    │   ├── pointllm/8192_npy
    │   └── pointllm_70k.json
    └── video
        ├── msr-vtt/MSR-VTT
        ├── msrvtt_cap_test.json
        ├── msrvtt_cap_trainval.json
        ├── msrvtt_vqa_test.json
        ├── msrvtt_vqa_train.json
        ├── msrvtt_vqa_val.json
        ├── video_complex_reasoning_10k.json
        ├── video_conversation_10k.json
        └── video_detail_10k.json
```
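Once everything is in place, a quick sanity check over the expected modality folders can catch missing downloads early (a minimal sketch; the list only covers the top-level directories shown above):

```python
from pathlib import Path

ROOT = Path("datasets/InstructionTuning")

# Top-level modality folders from the layout above.
expected = ["audio", "depth_normal", "fmri", "image", "imu", "point", "video"]

missing = [d for d in expected if not (ROOT / d).is_dir()]
if missing:
    print("Missing folders:", ", ".join(missing))
else:
    print("All modality folders are present.")
```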