| Data file name | Size |
| --- | --- |
| sharegpt4v_instruct_gpt4-vision_cap100k.json | 134 MB |
| share-captioner_coco_lcs_sam_1246k_1107.json | 1.5 GB |
| sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json | 1.2 GB |
This dataset is curated from LAION, CC, SBU, SAM, COCO, web-landmark, web-celebrity, wikiart, etc., resulting in a total of 102K high-quality image-text pairs generated with the help of the powerful GPT4-Vision.
The pretraining dataset used in this release is a mixture of the LAION, CC, SBU, SAM, and COCO datasets, resulting in a total of 1246K image-text pairs generated with the help of our general ShareCaptioner.
We replace the 23K image-text pairs related to the image captioning task in LLaVA-mix-665K with an equivalent subset of our collected GPT4V-generated high-quality image-text pairs.
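To sanity-check an annotation file after downloading, you can load it and inspect a record. The sketch below is illustrative only: it assumes the JSON files follow the LLaVA-style annotation format (a list of records with `image` and `conversations` fields) and uses a hypothetical local path.

```python
import json

# Hypothetical local path; adjust to wherever you saved the annotation file.
ANNOTATION_PATH = "data/sharegpt4v/sharegpt4v_instruct_gpt4-vision_cap100k.json"

with open(ANNOTATION_PATH, "r", encoding="utf-8") as f:
    records = json.load(f)  # assumed to be a list of dicts

print(f"Loaded {len(records)} records")

# Inspect the first record; the field names ("image", "conversations")
# are assumed to follow the LLaVA annotation convention.
sample = records[0]
print("Keys:", sorted(sample.keys()))
print("Image path:", sample.get("image"))
for turn in sample.get("conversations", [])[:2]:
    print(turn.get("from"), ":", turn.get("value", "")[:120])
```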
First, download all the images we used.
- LAION-CC-SBU-558K: images.zip
- COCO: train2017
- WebData: images. For academic usage only.
- SAM: images. We only use 000000~000050.tar for now. If you just want to use ShareGPT4V for SFT, you can quickly download 9K images from here.
- GQA: images
- OCR-VQA: download script. We save all files as `.jpg` (see the conversion sketch after this list).
- TextVQA: trainvalimages
- VisualGenome: part1, part2
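Since the OCR-VQA download script may leave files with mixed extensions (e.g., `.png`, `.gif`), one way to normalize everything to `.jpg` is a small conversion pass like the sketch below. The directory path and the in-place conversion are illustrative assumptions, not part of the official tooling.

```python
import os
from PIL import Image  # pip install pillow

# Hypothetical location of the raw OCR-VQA downloads; adjust as needed.
OCR_VQA_DIR = "data/ocr_vqa/images"

for name in os.listdir(OCR_VQA_DIR):
    stem, ext = os.path.splitext(name)
    if ext.lower() == ".jpg":
        continue  # already in the expected format
    src = os.path.join(OCR_VQA_DIR, name)
    dst = os.path.join(OCR_VQA_DIR, stem + ".jpg")
    # Convert to RGB so palette/alpha formats (PNG, GIF) save cleanly as JPEG.
    Image.open(src).convert("RGB").save(dst, "JPEG")
    os.remove(src)  # keep only the .jpg copy
```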
Then, organize the data as follows in `projects/ShareGPT4V/data`:
```
ShareGPT4V
├── ...
├── data
│ ├── llava
│ │ ├── llava_pretrain
│ │ │ ├── images
│ ├── coco
│ │ ├── train2017
│ ├── sam
│ │ ├── images
│ ├── gqa
│ │ ├── images
│ ├── ocr_vqa
│ │ ├── images
│ ├── textvqa
│ │ ├── train_images
│ ├── vg
│ │ ├── VG_100K
│ │ ├── VG_100K_2
│ ├── sharegpt4v
│ │ ├── share-captioner_coco_lcs_sam_1246k_1107.json
│ │ ├── sharegpt4v_instruct_gpt4-vision_cap100k.json
│ │ ├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
│ ├── share_textvqa
│ │ ├── images
│ ├── web-celebrity
│ │ ├── images
│ ├── web-landmark
│ │ ├── images
│ ├── wikiart
│ │ ├── images
├── ...
```
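To catch missing downloads early, you can run a quick layout check from `projects/ShareGPT4V`. The directory list below simply mirrors the tree above; the script itself is an illustrative sketch, not part of the official tooling.

```python
import os

# Expected directories relative to projects/ShareGPT4V, mirroring the tree above.
EXPECTED_DIRS = [
    "data/llava/llava_pretrain/images",
    "data/coco/train2017",
    "data/sam/images",
    "data/gqa/images",
    "data/ocr_vqa/images",
    "data/textvqa/train_images",
    "data/vg/VG_100K",
    "data/vg/VG_100K_2",
    "data/sharegpt4v",
    "data/share_textvqa/images",
    "data/web-celebrity/images",
    "data/web-landmark/images",
    "data/wikiart/images",
]

missing = [d for d in EXPECTED_DIRS if not os.path.isdir(d)]
if missing:
    print("Missing directories:")
    for d in missing:
        print("  -", d)
else:
    print("All expected directories are present.")
```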
Important notice: For convenience, we provide a zip file for the web data. These images must be used for academic purposes only.