MINT-1T:
Scaling Open-Source Multimodal Data by 10x:
A Multimodal Dataset with One Trillion Tokens

🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with one trillion text tokens and 3.4 billion images, a ~10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers.

We release all subsets of MINT-1T, including:

- 🌐 HTML Data
- 📚 PDF Data
  - We provide shards of MINT-1T PDFs for each CommonCrawl snapshot.
- 🔬 ArXiv Data

Updates

[7/24] 🎉 We open-sourced the 🍃 MINT-1T dataset!
[6/17] We released our technical report.

Citation

If you find our work useful, please consider citing:

@article{awadalla2024mint1t,
      title={MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens}, 
      author={Anas Awadalla and Le Xue and Oscar Lo and Manli Shu and Hannah Lee and Etash Kumar Guha and Matt Jordan and Sheng Shen and Mohamed Awadalla and Silvio Savarese and Caiming Xiong and Ran Xu and Yejin Choi and Ludwig Schmidt},
      year={2024}
}
