# Data

## Data Format

Here we give an overview of the data format. For details, please check the data loading code: `data/pretrain_dataset.py` and `data/fintune_dataset.py`.

### Pretraining Data

All data except IMU are organized in `.csv` format. Each `.csv` file has two columns, `caption` and `url`, with `\t` as the delimiter. For example,

```
caption	url
Woman receiving a foot massage at health spa Stock Photo	cluster_p_ssd:s3://laion400m_mmg_ssd/29347/293477138.jpg
Long injury list troubles Paul Hart as Portsmouth search for some Cup form	cluster_p_ssd:s3://laion400m_mmg_ssd/43069/430692001.jpg
...	...
```
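
As a minimal illustration of reading such a file (the actual loading logic lives in `data/pretrain_dataset.py`; the filename below is a placeholder, not a file shipped with the release):

```python
import pandas as pd

# Minimal sketch: read one pretraining .csv (tab-delimited, columns: caption, url).
# "laion400m_part0.csv" is a placeholder name for illustration only.
df = pd.read_csv("laion400m_part0.csv", sep="\t")

for caption, url in zip(df["caption"], df["url"]):
    # Each row pairs a caption with the storage URL of its media file.
    print(caption, "->", url)
```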

### Instruction Tuning Data

All fine-tuning data are converted into a multi-turn conversation format. Each `.json` file contains a list of training samples, where each sample has the following keys: `id`, `image`, and `conversations`. For example,

```json
{
  "id": "000000033471",
  "image": "InstructionTuning/image/coco/train2017/000000033471.jpg",
  "conversations": [
    {"from": "human", "value": "What are the colors of the bus in the image?"},
    {"from": "gpt", "value": "The bus in the image is white and red."},
    {"from": "human", "value": "What feature can be seen on the back of the bus?"},
    {"from": "gpt", "value": "The back of the bus features an advertisement."}
  ]
}
```
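
A rough sketch of iterating over such a file (this is not the repo's loader, which is `data/fintune_dataset.py`; the filename below is hypothetical):

```python
import json

# Illustrative walk over the multi-turn samples described above.
# "finetune_anno.json" is a placeholder; use the annotation file you downloaded.
with open("finetune_anno.json") as f:
    samples = json.load(f)

for sample in samples:
    image_path = sample.get("image")  # may be absent for text-only samples
    for turn in sample["conversations"]:
        speaker = turn["from"]        # alternates between 'human' and 'gpt'
        print(f"[{sample['id']}] {speaker}: {turn['value']}")
```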

## Download Links

| Modality | Pretraining Dataset | Download | Instruction Tuning Dataset | Download |
|----------|---------------------|----------|----------------------------|----------|
| Image    | LAION-400M          | link     | LLaVA-mix665K              | link     |
| Image    | LAION-COCO          | link     | COCO Caption               | link     |
| Video    | WebVid-2.5M         | link     | MSRVTT Caption             | link     |
| Video    |                     |          | MSRVTT-QA                  | link     |
| Video    |                     |          | Video Conversation         | link     |
| Audio    | WavCaps             | link     | AudioCaps                  | link     |
| Audio    |                     |          | Audio Conversation         | link     |
| Point    | Cap3D               | link     | Point Conversation         | link     |
| Depth    | CC3M                | link     | LLaVA-150K                 | link     |
| Normal   | CC3M                | link     | LLaVA-150K                 | link     |
| IMU      | Ego4D               | link     | Ego4D                      | link     |
| fMRI     | NSD                 | link     | NSD                        | link     |

### Notes

- The depth/normal maps are generated from CC3M and a 50K random subset of LLaVA-150K using a pretrained DPT (see the sketch after this list).
- The IMU data is preprocessed with this script.
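
The exact DPT pipeline is not specified here; as one hedged example, depth maps can be produced from a pretrained DPT checkpoint via Hugging Face `transformers`. The `Intel/dpt-large` checkpoint and the input filename are assumptions for illustration, not the repo's actual setup:

```python
import torch
from PIL import Image
from transformers import DPTForDepthEstimation, DPTImageProcessor

# Assumption: any pretrained DPT depth checkpoint serves for illustration;
# the repo does not state which DPT weights were used.
processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large").eval()

image = Image.open("example.jpg")  # placeholder input image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    depth = model(**inputs).predicted_depth  # (1, H', W') relative depth map
```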

### Instruction Tuning Data

**Annotation Download:** please download the annotations from this link and put them under `datasets/InstructionTuning`.

Then download the original datasets from the table above and place them under the corresponding folders. The file structure should be:

```
datasets
└── InstructionTuning
    ├── audio
    │   ├── audioset2
    │   ├── audiocap_train.json
    │   ├── audiocap_val.json
    │   └── audio_conversation.json
    ├── depth_normal
    │   ├── depth
    │   ├── normal
    │   ├── llava_instruct_50k_depth.json
    │   └── llava_instruct_50k_normal.json
    ├── fmri
    │   ├── NSD
    │   └── fmri_fixed_train.json
    ├── image
    │   ├── coco
    │   ├── gqa
    │   ├── ocr_vqa
    │   ├── vg
    │   ├── cococap_train.json
    │   ├── llava_v1_5_mix665k_image.json
    │   └── llava_v1_5_mix665k_text.json
    ├── imu
    │   ├── ego4d
    │   └── imu_fixed_50k.json
    ├── point
    │   ├── pointllm/8192_npy
    │   └── pointllm_70k.json
    └── video
        ├── msr-vtt/MSR-VTT
        ├── msrvtt_cap_test.json
        ├── msrvtt_cap_trainval.json
        ├── msrvtt_vqa_test.json
        ├── msrvtt_vqa_train.json
        ├── msrvtt_vqa_val.json
        ├── video_complex_reasoning_10k.json
        ├── video_conversation_10k.json
        └── video_detail_10k.json
```
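
Before training, it can help to sanity-check that the annotation files are in place. A small illustrative check (the file list is a subset of the tree above, and the root path assumes the layout shown):

```python
from pathlib import Path

# Spot-check a few annotation files from the directory tree above.
root = Path("datasets/InstructionTuning")
expected = [
    "audio/audiocap_train.json",
    "depth_normal/llava_instruct_50k_depth.json",
    "fmri/fmri_fixed_train.json",
    "image/llava_v1_5_mix665k_image.json",
    "point/pointllm_70k.json",
    "video/msrvtt_cap_trainval.json",
]
for rel in expected:
    status = "ok" if (root / rel).exists() else "MISSING"
    print(f"{status:>7}  {root / rel}")
```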