| Data file name | Size |
| --- | --- |
| sharegpt4v_instruct_gpt4-vision_cap100k.json | 134 MB |
| share-captioner_coco_lcs_sam_1246k_1107.json | 1.5 GB |
| sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json | 1.2 GB |
This dataset is curated from LAION, CC, SBU, SAM, COCO, web-landmark, web-celebrity, wikiart, etc., resulting in a total of 102K high-quality image-text pairs generated with the help of the powerful GPT4-Vision.
The pretraining dataset used in this release is a mixture of the LAION, CC, SBU, SAM, and COCO datasets, resulting in a total of 1246K image-text pairs generated with the help of our general ShareCaptioner.
We replace the 23K image-text pairs related to the image captioning task in LLaVA-mix-665K with an equivalent subset of our collected GPT4V-generated high-quality image-text pairs.
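To sanity-check an annotation file after downloading, you can load it and inspect a record. The sketch below is illustrative only: it assumes the JSON files follow the LLaVA-style annotation format (a list of records with `image` and `conversations` fields) and uses a hypothetical local path.

```python
import json

# Hypothetical local path; adjust to wherever you saved the annotation file.
ANNOTATION_PATH = "data/sharegpt4v/sharegpt4v_instruct_gpt4-vision_cap100k.json"

with open(ANNOTATION_PATH, "r", encoding="utf-8") as f:
    records = json.load(f)  # assumed to be a list of dicts

print(f"Loaded {len(records)} records")

# Inspect the first record; the field names ("image", "conversations")
# are assumed to follow the LLaVA annotation convention.
sample = records[0]
print("Keys:", sorted(sample.keys()))
print("Image path:", sample.get("image"))
for turn in sample.get("conversations", [])[:2]:
    print(turn.get("from"), ":", turn.get("value", "")[:120])
```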
First, download all the images we used.
- LAION-CC-SBU-558K: images.zip
- COCO: train2017
- WebData: images. For academic usage only.
- SAM: images. We only use 000000~000050.tar for now. If you just want to use ShareGPT4V for SFT, you can quickly download 9K images from here.
- GQA: images
- OCR-VQA: download script. We save all files as `.jpg` (see the conversion sketch after this list).
- TextVQA: trainvalimages
- VisualGenome: part1, part2
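Since the OCR-VQA download script may leave files with mixed extensions (e.g., `.png`, `.gif`), one way to normalize everything to `.jpg` is a small conversion pass like the sketch below. The directory path and the in-place conversion are illustrative assumptions, not part of the official tooling.

```python
import os
from PIL import Image  # pip install pillow

# Hypothetical location of the raw OCR-VQA downloads; adjust as needed.
OCR_VQA_DIR = "data/ocr_vqa/images"

for name in os.listdir(OCR_VQA_DIR):
    stem, ext = os.path.splitext(name)
    if ext.lower() == ".jpg":
        continue  # already in the expected format
    src = os.path.join(OCR_VQA_DIR, name)
    dst = os.path.join(OCR_VQA_DIR, stem + ".jpg")
    # Convert to RGB so palette/alpha formats (PNG, GIF) save cleanly as JPEG.
    Image.open(src).convert("RGB").save(dst, "JPEG")
    os.remove(src)  # keep only the .jpg copy
```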
Then, organize the data as follows in `projects/ShareGPT4V/data`:
```
ShareGPT4V
├── ...
├── data
│ ├── llava
│ │ ├── llava_pretrain
│ │ │ ├── images
│ ├── coco
│ │ ├── train2017
│ ├── sam
│ │ ├── images
│ ├── gqa
│ │ ├── images
│ ├── ocr_vqa
│ │ ├── images
│ ├── textvqa
│ │ ├── train_images
│ ├── vg
│ │ ├── VG_100K
│ │ ├── VG_100K_2
│ ├── sharegpt4v
│ │ ├── share-captioner_coco_lcs_sam_1246k_1107.json
│ │ ├── sharegpt4v_instruct_gpt4-vision_cap100k.json
│ │ ├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
│ ├── share_textvqa
│ │ ├── images
│ ├── web-celebrity
│ │ ├── images
│ ├── web-landmark
│ │ ├── images
│ ├── wikiart
│ │ ├── images
├── ...
```
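To catch missing downloads early, you can run a quick layout check from `projects/ShareGPT4V`. The directory list below simply mirrors the tree above; the script itself is an illustrative sketch, not part of the official tooling.

```python
import os

# Expected directories relative to projects/ShareGPT4V, mirroring the tree above.
EXPECTED_DIRS = [
    "data/llava/llava_pretrain/images",
    "data/coco/train2017",
    "data/sam/images",
    "data/gqa/images",
    "data/ocr_vqa/images",
    "data/textvqa/train_images",
    "data/vg/VG_100K",
    "data/vg/VG_100K_2",
    "data/sharegpt4v",
    "data/share_textvqa/images",
    "data/web-celebrity/images",
    "data/web-landmark/images",
    "data/wikiart/images",
]

missing = [d for d in EXPECTED_DIRS if not os.path.isdir(d)]
if missing:
    print("Missing directories:")
    for d in missing:
        print("  -", d)
else:
    print("All expected directories are present.")
```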
Important notice: For convenience, we provide a zip file for the web data. These images must be used for academic purposes only.