- Phantom-0.5B|1.8B|3.8B|7B has been released on 🤗 Huggingface Models.
- The preprint of Phantom has been uploaded to arXiv.
- The Phantom Triples dataset for the DPO-like concept has been released on 🤗 Huggingface Datasets.
- The demo code of Phantom-0.5B|1.8B|3.8B|7B has been updated in this repository.
- The online demo of Phantom-0.5B|1.8B|3.8B|7B has been released on 🤗 Huggingface Spaces.
- The code for fine-tuning Phantom-0.5B|1.8B|3.8B|7B will be updated in this repository soon.
Official PyTorch implementation of the technical part of Phantom of Latent, which improves numerous vision-language benchmarks with an efficient model size. This code is developed from scratch, with the model architecture and all configurations inspired by InternVL. I have been trying to improve the readability and simplicity of the code compared with LLaVA, whose code is structured in a relatively complex way.
import torch
from config import *
from PIL import Image
from utils.utils import *
from model.load_model import load_model
from torchvision.transforms.functional import pil_to_tensor

# Model selection
size = '7b' # [Select One] '0.5b' (needs a more recent transformers) | '1.8b' | '3.8b' (transformers==4.37.2) | '7b'

# User prompt
prompt_type = "with_image" # [Select One] "text_only" | "with_image"
img_path = 'figures/demo.png'
question = "Describe the image in detail"

# Load the model and tokenizer
model, tokenizer = load_model(size=size)

# Build the input prompt according to the prompt type
if prompt_type == 'with_image':
    # Load the image as an RGB tensor
    image = pil_to_tensor(Image.open(img_path).convert("RGB"))
    inputs = [{'image': image, 'question': question}]
elif prompt_type == 'text_only':
    inputs = [{'question': question}]

# Move the model parameters from CPU to GPU
for param in model.parameters():
    if not param.is_cuda:
        param.data = param.cuda()

# Generate the answer
with torch.inference_mode():
    # Preprocess the inputs and run greedy decoding
    _inputs = model.eval_process(inputs=inputs,
                                 data='demo',
                                 tokenizer=tokenizer,
                                 device='cuda:0')
    generate_ids = model.generate(**_inputs, do_sample=False, max_new_tokens=256)
answer = tokenizer.batch_decode(generate_ids, skip_special_tokens=True)[0]
print(answer)
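Once loaded, the model and tokenizer can be reused for further prompts without reloading. The short sketch below continues from the demo above (the model is assumed to already be on the GPU) and uses only the calls already shown; the question strings are illustrative placeholders.

# Continuing from the demo above: the model is already loaded and on the GPU.
# The questions below are illustrative placeholders.
questions = ["What objects are visible?", "Summarize the image in one sentence."]
image = pil_to_tensor(Image.open('figures/demo.png').convert("RGB"))
with torch.inference_mode():
    for q in questions:
        _inputs = model.eval_process(inputs=[{'image': image, 'question': q}],
                                     data='demo',
                                     tokenizer=tokenizer,
                                     device='cuda:0')
        generate_ids = model.generate(**_inputs, do_sample=False, max_new_tokens=256)
        print(tokenizer.batch_decode(generate_ids, skip_special_tokens=True)[0])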
Dataset Description (Total: 2852771, 2.8M)
------------------------------
* Real-World Image: 1218630, 1.2M
* Real-World Text: 143000, 143k
* Document & Chart & Diagram & Sign & Symbol: 743850, 744k
* Math: 747291, 747k
- Math with Vision: 180497, 180k
- Math with Text only: 566794, 566k
------------------------------
- ShareGPT4O-Images (57289, 57k)
- ShareGPT4V-Caption [without SAM] (91021, 91k)
- ShareGPT4V-Instruction [Without few samples of OCR-VQA] (664703, 664k)
- ALLAVA4V-VFLAN based on MiniGemini-Pretrain/Instruct (405617, 405k)
- ALLAVA4V-Text (143000, 143k)
- MiniGemini-Instruction [DocVQA, ChartQA, DVQA, AI2D] (27670, 27k)
- SMR [ArXivQA, TextbookQA] (116035, 116k)
- DocDownstream (574268, 574k)
- DocReason (25877, 25k)
- GLLaVA-Align (60252, 60k)
- GLLaVA-QA (117205, 117k)
- MathVision (3040, 3k)
- MathInstruct [TextOnlyDataset] (262040, 262k)
- MathPlus [TextOnlyDataset] (304754, 304k)
Dataset Description (Total: 2040186, 2.0M)
--------------------------------------------
* Real-World Image: 871160, 871k
* Real-World Text: 102389, 102k
* Document & Chart & Diagram & Sign & Symbol: 529709, 529k
* Math: 536928, 536k
- Math with Vision: 129694, 129k
- Math with Text only: 407234, 407k
--------------------------------------------
- ShareGPT4O-Images (40106, 40k)
- ShareGPT4V-Caption [without SAM] (64925, 64k)
- ShareGPT4V-Instruction [Without few samples of OCR-VQA] (475669, 475k)
- ALLAVA4V-VFLAN based on MiniGemini-Pretrain/Instruct (290460, 290k)
- ALLAVA4V-Text (102389, 102k)
- MiniGemini-Instruction [DocVQA, ChartQA, DVQA, AI2D] (19363, 19k)
- SMR [ArXivQA, TextbookQA] (82843, 82k)
- DocDownstream (409140, 409k)
- DocReason (18363, 18k)
- GLLaVA (127484, 127k)
- MathVision (2210, 2k)
- MathInstruct [TextOnlyDataset] (188288, 188k)
- MathPlus [TextOnlyDataset] (218946, 218k)
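As a quick arithmetic check, the per-dataset sample counts in the two lists above sum exactly to the stated totals. A minimal sketch with the counts copied from the lists:

# Per-dataset sample counts copied from the two lists above, in listed order
counts_2p8m = [57289, 91021, 664703, 405617, 143000, 27670, 116035,
               574268, 25877, 60252, 117205, 3040, 262040, 304754]
counts_2p0m = [40106, 64925, 475669, 290460, 102389, 19363, 82843,
               409140, 18363, 127484, 2210, 188288, 218946]
assert sum(counts_2p8m) == 2852771   # matches "Total: 2852771, 2.8M"
assert sum(counts_2p0m) == 2040186   # matches "Total: 2040186, 2.0M"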
We collect the following datasets. For MiniGemini, we selectively use only the data samples for DocVQA, ChartQA, DVQA, and AI2D, so there is no need to download all of the MiniGemini data samples.
- ShareGPT4V [link]
- ALLAVA4V-VFLAN [link]
- ALLAVA4V-Text [link]
- MiniGemini [link]
- SMR [link]
- DocDownstream [link]
- DocReason [link]
- GLLaVA [link]
- MathVision [link]
- MathInstruct [link]
- MathPlus [link]
Gathered Dataset Layout
Phantom_Dataset_Path
├── llava                              # ShareGPT4V
│   └── llava_pretrain
│       └── images
├── coco                               # ShareGPT4V
│   └── train2017
├── sam                                # ShareGPT4V
│   └── images
├── gqa                                # ShareGPT4V
│   └── images
├── ocr_vqa                            # ShareGPT4V
│   └── images
├── textvqa                            # ShareGPT4V
│   └── train_images
├── vg                                 # ShareGPT4V
│   ├── VG_100K
│   └── VG_100K_2
├── share_textvqa                      # ShareGPT4V
│   └── images
├── web-celebrity                      # ShareGPT4V
│   └── images
├── web-landmark                       # ShareGPT4V
│   └── images
├── wikiart                            # ShareGPT4V
│   └── images
├── docvqa                             # MiniGemini
│   └── images
├── chartqa                            # MiniGemini
│   └── train
│       └── images
├── dvqa                               # MiniGemini
│   └── images
├── ai2d                               # MiniGemini
│   └── images
├── ALLaVA-4V                          # MiniGemini (ALLAVA-VFLAN)
│   └── allava_vflan
│       └── images
├── arxivqa                            # SMR (ArXivQA)
│   └── images
├── TextbookQA                         # SMR (TextbookQA)
│   ├── train
│   └── val
├── imgs                               # DocDownstream & DocReason
│   ├── ChartQA
│   ├── DUE_Benchmark
│   │   ├── DeepForm
│   │   ├── DocVQA
│   │   ├── InfographicsVQA
│   │   ├── KleisterCharity
│   │   ├── TabFact
│   │   └── WikiTableQuestions
│   ├── TextCaps
│   ├── TextVQA
│   └── VisualMRC
├── geo3k                              # GLLaVA
│   └── train
├── geoqa_plus                         # GLLaVA
├── images                             # MathVision
│
├── sharegpt4v_instruct_gpt4-vision_cap100k.json                 # ShareGPT4V-Caption
├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json   # ShareGPT4V-Instruction
├── Evol-Instruct-GPT4-Turbo-143K.json                           # ALLAVA4V-Text
├── SMR.json                                                     # SMR
├── train.jsonl                                                  # DocDownstream
├── detailed_explanation.jsonl                                   # DocReason
├── minigemini_pretrain.json                                     # MiniGemini-Pretrain
├── minigemini_instruction.json                                  # MiniGemini-Instruction
├── gllava_align.parquet                                         # GLLaVA-Align
├── gllava_qa.parquet                                            # GLLaVA-QA
├── mathvision.parquet                                           # MathVision
├── MathInstruct.json                                            # MathInstruct
└── mathplus.parquet                                             # MathPlus
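The following is a minimal sketch for sanity-checking a downloaded copy against this layout; DATASET_ROOT is a placeholder for your own path, and only a few representative entries from the tree above are checked.

import os

# Placeholder root; point this at your own download location
DATASET_ROOT = '/path/to/Phantom_Dataset_Path'

# A few representative entries taken from the layout above (not exhaustive)
expected = [
    'llava/llava_pretrain/images', 'coco/train2017', 'imgs/DUE_Benchmark',
    'sharegpt4v_instruct_gpt4-vision_cap100k.json', 'train.jsonl',
    'MathInstruct.json', 'mathplus.parquet',
]
missing = [e for e in expected if not os.path.exists(os.path.join(DATASET_ROOT, e))]
print('All expected entries found.' if not missing else f'Missing entries: {missing}')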
The following is the list of evaluation datasets. Once you have downloaded all of them, they should be placed in a folder according to the directory layout below.
- SQA-IMG [link]
- AI2D [link]
- ChartQA [link]
- SEED [link]
- SEED-Bench-2-Plus [link]
- POPE [link]
- HallusionBench [link]
- MME [link]
- MathVista [link]
- MMB [link]
- MM-Vet [link]
- MM-Vet-v2 [link]
- LLaVA-W [link]
- LLaVA-Wilder [link]
- BLINK [link]
- CV-Bench [link]
- VisualWebBench [link]
- MMStar [link]
- MathVerse [link]
Evaluation Dataset Directory Layout
Evaluation_Dataset_Path
├── ScienceQA                          # SQA-IMG
├── ai2d                               # AI2D
├── chartqa                            # ChartQA
├── SEED-Bench                         # SEED-IMG
├── SEED-Bench-2-plus                  # SEED-Bench-2-Plus
├── POPE                               # POPE
├── HallusionBench                     # HallusionBench
├── MME_Benchmark_release_version      # MME
├── MathVista                          # MathVista
├── MMBench                            # MMB
├── mm-vet                             # MM-Vet
├── mm-vet-v2                          # MM-Vet-v2
├── llava-bench-in-the-wild            # LLaVA Bench in the Wild
├── LLaVA-Bench-Wilder                 # LLaVA Wilder
├── BLINK                              # BLINK
├── CV-Bench                           # CV-Bench
├── VisualWebBench                     # VisualWebBench
├── MMStar                             # MMStar
└── MathVerse                          # MathVerse
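A similar quick check can be run on the evaluation folder. This is a minimal sketch; EVAL_ROOT is a placeholder for your own path, and the directory names are those from the layout above.

import os

# Placeholder root; replace with your own evaluation data location
EVAL_ROOT = '/path/to/Evaluation_Dataset_Path'

# Benchmark directory names from the layout above
benchmarks = ['ScienceQA', 'ai2d', 'chartqa', 'SEED-Bench', 'SEED-Bench-2-plus', 'POPE',
              'HallusionBench', 'MME_Benchmark_release_version', 'MathVista', 'MMBench',
              'mm-vet', 'mm-vet-v2', 'llava-bench-in-the-wild', 'LLaVA-Bench-Wilder',
              'BLINK', 'CV-Bench', 'VisualWebBench', 'MMStar', 'MathVerse']
for name in benchmarks:
    status = 'ok' if os.path.isdir(os.path.join(EVAL_ROOT, name)) else 'MISSING'
    print(f'{name:35s} {status}')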