TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens
TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models
UI-Hawk: Unleashing the Screen Stream Understanding for GUI Agents
The TextHawk series is a family of Large Vision-Language Models (LVLMs) designed for highly efficient fine-grained perception. Notably, TextHawk is the first LVLM to achieve a 16x token compression ratio. This is made possible by integrating four key components (a rough sketch of the compression idea follows the list):
- Scalable Positional Embeddings (SPEs)
- Query Proposal Network (QPN)
- ReSampling and ReArrangement (ReSA)
- Multi-Level Cross-Attention (MLCA)
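As a rough illustration of where the 16x factor can come from, here is a minimal sketch that pairs a cross-attention resampling step (4x fewer tokens) with a 2x2 rearrangement step (another 4x). It is not the official implementation; the module name, grid sizes, and dimensions below are assumptions for illustration.

```python
# Minimal sketch (not the official TextHawk code) of a 16x visual token
# compression: cross-attention resampling (4x fewer tokens) followed by a
# 2x2 spatial rearrangement (another 4x). All shapes are illustrative.
import torch
import torch.nn as nn

class ReSASketch(nn.Module):
    def __init__(self, dim=1024, num_heads=8):
        super().__init__()
        # resampling: learnable queries attend to the full set of ViT features
        self.resample = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # rearrangement: merge each 2x2 group of resampled tokens into one token
        self.merge = nn.Linear(dim * 4, dim)

    def forward(self, feats, queries):
        # feats:   (B, H*W, C)   ViT features, e.g. a 24x24 grid = 576 tokens
        # queries: (B, H*W/4, C) one query per 2x2 patch group = 144 tokens
        x, _ = self.resample(queries, feats, feats)        # (B, 144, C)
        B, N, C = x.shape
        h = w = int(N ** 0.5)                              # 12x12 token grid
        x = x.reshape(B, h // 2, 2, w // 2, 2, C)          # split into 2x2 blocks
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, N // 4, 4 * C)
        return self.merge(x)                               # (B, 36, C): 576 -> 36 tokens

feats = torch.randn(2, 576, 1024)
queries = torch.randn(2, 144, 1024)
print(ReSASketch()(feats, queries).shape)  # torch.Size([2, 36, 1024]), i.e. 16x fewer tokens
```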
Building on the same architecture, TextHawk2 improves performance through greater data diversity and a reinforced visual encoder. It achieves state-of-the-art results across multiple benchmarks covering general multimodal understanding, Optical Character Recognition (OCR), and visual grounding.
For instance, TextHawk2 scores 78.4% accuracy on OCRBench, 81.4% accuracy on ChartQA, 89.6% ANLS on DocVQA, and 88.1% accuracy@0.5 on RefCOCOg-test.
The TextHawk series can compress the words shown in a small image, each character under 8 pixels, into far fewer tokens while still recovering them accurately, which is reminiscent of the futuristic gadgets in the Doraemon anime.
We create a new instruction-tuning dataset, DocGemini, for document-oriented tasks by enriching multimodal document data with Gemini Pro. Each data sample contains:
- A brief summary of the document topics.
- Up to 10 short QA pairs.
- The insight behind each answer.
- [Optional] An imaginary conversation between two researchers.
DocGemini consists of 30K images and 195K QA pairs with insights.
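For illustration, a single sample might be laid out as follows; the field names, paths, and values below are hypothetical and may not match the released files.

```python
# Hypothetical layout of one DocGemini sample; field names, paths, and values
# are made up for illustration and may differ from the released files.
sample = {
    "image": "docvqa/images/example_page.png",  # source document image (made-up path)
    "summary": "A one-page memo outlining the Q3 advertising budget.",
    "qa_pairs": [                               # up to 10 short QA pairs
        {
            "question": "What is the total advertising budget?",
            "answer": "$1.2 million",
            "insight": "The total appears in the last row of the budget table.",
        },
    ],
    "conversation": None,  # optional imaginary dialogue between two researchers
}
```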
| Dataset | QA | Conversation |
|---|---|---|
| DocVQA | link | link |
| ChartQA | link | link |
| InfoVQA | link | link |
Note: Alternatively, you can produce data on your own using the scripts we provide.
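If you do generate data yourself, the sketch below shows roughly how a document page could be sent to Gemini to obtain a summary and QA pairs. It assumes the google-generativeai SDK; the model name, prompt, and image path are illustrative only, and this is not the script shipped with this repository.

```python
# Minimal sketch of generating DocGemini-style annotations yourself; this is
# NOT the repository's script. It assumes the google-generativeai SDK, and the
# model name, prompt, and image path are illustrative only.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-pro-vision")

prompt = (
    "Summarize the topic of this document, then write up to 10 short "
    "question-answer pairs about it, and give the insight behind each answer."
)
page = Image.open("path/to/document_page.png")  # replace with a real document image

response = model.generate_content([prompt, page])
print(response.text)
```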
TextHawk
| Model | ViT (Params.) | MME (perception) | MMB (dev) | SEED (image) | GQA | DocVQA | ChartQA | InfoVQA | TabFact | WTQ | RefCOCO (val) | RefCOCO (test-A) | RefCOCO (test-B) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0.1B | - | - | - | - | 67.5 | 41.8 | 11.6 | 54.6 | 18.8 | - | - | - |
| | - | - | - | - | - | 76.6 | 58.6 | 40.0 | - | - | - | - | - |
| | 1B | 1528.4 | 74.8 | 66.1 | - | - | - | - | - | - | - | - | - |
| | 0.3B | 1510.7 | 65.2 | - | 62.0 | - | - | - | - | - | - | - | - |
| | 0.3B | - | 58.8 | - | - | - | - | - | - | - | 87.0 | 91.1 | 81.8 |
| | 2B | 1487.6 | 60.6 | 65.4 | 57.5 | 62.6 | 66.3 | - | - | - | 88.6 | 92.3 | 84.5 |
| | 2B | - | 59.3 | - | 60.7 | 66.5 | 65.1 | 36.1 | - | 25.3 | - | - | - |
| | 0.3B | - | - | - | - | 65.4 | 59.3 | 42.2 | 67.6 | 29.4 | - | - | - |
| | 2B | - | - | - | - | 73.0 | 66.9 | - | - | 31.9 | - | - | - |
| TextHawk* | 0.4B | 1520.9 | 73.0 | 69.2 | 64.7 | 73.6 | 64.0 | 47.3 | 70.7 | 33.5 | 87.3 | 90.9 | 83.3 |
| TextHawk | 0.4B | 1500.0 | 74.6 | 69.2 | 64.6 | 76.4 | 66.6 | 50.6 | 71.1 | 34.7 | 87.2 | 90.8 | 82.5 |
Note: TextHawk* is fine-tuned without the DocGemini data.
@article{yu24texthawk2,
author = {Ya{-}Qi Yu and Minghui Liao and Jiwen Zhang and Jihao Wu},
title = {TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens},
journal = {CoRR},
volume = {abs/2410.05261},
year = {2024}
}
@article{yu24texthawk,
author = {Ya{-}Qi Yu and Minghui Liao and Jihao Wu and Yongxin Liao and Xiaoyu Zheng and Wei Zeng},
title = {TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models},
journal = {CoRR},
volume = {abs/2404.09204},
year = {2024}
}
@article{zhang24uihawk,
  author = {Jiwen Zhang and Ya{-}Qi Yu and Minghui Liao and Wentao Li and Jihao Wu and Zhongyu Wei},
  title = {{UI-Hawk}: Unleashing the Screen Stream Understanding for GUI Agents},
  journal = {Preprints},
  volume = {manuscript/202408.2137},
  year = {2024}
}