For pre-training YOLO-World, we adopt several datasets as listed in the table below:
| Data | Samples | Type | Boxes |
| --- | --- | --- | --- |
| Objects365v1 | 609k | detection | 9,621k |
| GQA | 621k | grounding | 3,681k |
| Flickr | 149k | grounding | 641k |
| CC3M-Lite | 245k | image-text | 821k |
We put all data into the `data` directory, such as:
```text
├── coco
│   ├── annotations
│   ├── lvis
│   ├── train2017
│   ├── val2017
├── flickr
│   ├── annotations
│   └── images
├── mixed_grounding
│   ├── annotations
│   ├── images
├── objects365v1
│   ├── annotations
│   ├── train
│   ├── val
```
NOTE: We strongly suggest that you check the directories or paths in the dataset part of the config file, especially for the values of `ann_file`, `data_root`, and `data_prefix`.
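For reference, a dataset entry with these fields typically looks like the sketch below (the dataset type name and file layout are assumptions based on the directory structure above; match them to your actual config):

```python
# Hypothetical Objects365 dataset entry (mmdet/mmyolo-style config dict).
# Adjust `data_root`, `ann_file`, and `data_prefix` to your directory layout;
# the dataset type name here is an assumption.
obj365v1_dataset = dict(
    type='YOLOv5Objects365V1Dataset',
    data_root='data/objects365v1/',
    ann_file='annotations/objects365_train.json',  # relative to data_root
    data_prefix=dict(img='train/'),                # image folder under data_root
)
```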
We provide the annotations of the pre-training data in the table below:
| Data | Images | Annotation File |
| --- | --- | --- |
| Objects365v1 | Objects365 train | objects365_train.json |
| MixedGrounding | GQA | final_mixed_train_no_coco.json |
| Flickr30k | Flickr30k | final_flickr_separateGT_train.json |
| LVIS-minival | COCO val2017 | lvis_v1_minival_inserted_image_name.json |
Acknowledgement: We sincerely thank GLIP and mdetr for providing the annotation files for pre-training.
For training YOLO-World, we mainly adopt two kinds of dataset classes:
`MultiModalDataset` is a simple wrapper for a pre-defined dataset class, such as `Objects365` or `COCO`, which adds the texts (category texts) into the dataset instance for formatting input texts.
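For example, wrapping a detection dataset with `MultiModalDataset` might look like the sketch below; the wrapped dataset type and the text json path are assumptions and should follow your own config:

```python
# Hypothetical MultiModalDataset entry: wraps a standard detection dataset
# and attaches a text json so category texts are available as model input.
obj365v1_train_dataset = dict(
    type='MultiModalDataset',
    dataset=dict(
        type='YOLOv5Objects365V1Dataset',  # assumed wrapped dataset type
        data_root='data/objects365v1/',
        ann_file='annotations/objects365_train.json',
        data_prefix=dict(img='train/'),
    ),
    class_text_path='data/texts/obj365v1_class_texts.json',  # assumed path to the text json
    # pipeline=...  # training pipeline defined elsewhere in the config
)
```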
Text JSON
The json file is formatted as follows:
```json
[
    ["A_1", "A_2"],
    ["B"],
    ["C_1", "C_2", "C_3"],
    ...
]
```
We have provided the text json for `LVIS`, `COCO`, and `Objects365`.
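If you need a text json for your own categories, it is simply a list of per-class text lists; a minimal sketch for writing one (the output path and category names are placeholders):

```python
import json

# Each entry corresponds to one class; an entry may contain several
# synonymous texts for that class.
custom_texts = [
    ["person"],
    ["bicycle", "bike"],
    ["traffic light"],
]

# Assumed output path; point your config's text json option to this file.
with open("data/texts/custom_class_texts.json", "w") as f:
    json.dump(custom_texts, f)
```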
The `YOLOv5MixedGroundingDataset` extends the `COCO` dataset by supporting loading texts/captions from the json file. It's designed for `MixedGrounding` or `Flickr30K` with text tokens for each object.
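To illustrate how captions and text tokens are attached to objects, here is a small, purely hypothetical entry in such a grounding annotation file (the field values are made up; `tokens_positive` is assumed to hold character spans into the image caption, following the MDETR-style convention):

```python
# Hypothetical grounding annotation (COCO-like layout).
# The caption lives on the image entry; each object refers back to it
# via `tokens_positive`, a list of [start, end) character spans.
image_entry = {
    "id": 42,
    "file_name": "000042.jpg",
    "caption": "a man riding a red bicycle",
}
annotation_entry = {
    "id": 7,
    "image_id": 42,
    "bbox": [48, 30, 220, 310],   # [x, y, width, height]
    "tokens_positive": [[2, 5]],  # span of "man" in the caption above
}
```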
For custom datasets, we suggest that users convert the annotation files according to the usage. Note that converting the annotations to the standard COCO format is basically required; a minimal COCO-format skeleton is sketched after the list below.
- Large vocabulary, grounding, referring: you can follow the annotation format of the `MixedGrounding` dataset, which adds `caption` and `tokens_positive` for assigning the text to each object. The texts can be a category or noun phrases.
- Custom vocabulary (fixed): you can adopt the `MultiModalDataset` wrapper as for `Objects365` and create a text json for your custom categories.
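For reference when converting a custom dataset, a minimal COCO-format skeleton looks like the sketch below (all ids, file names, and categories are placeholders):

```python
import json

# Minimal COCO-format structure for a custom detection dataset.
coco_style = {
    "images": [
        {"id": 1, "file_name": "000001.jpg", "width": 640, "height": 480},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            "bbox": [100, 120, 50, 80],  # [x, y, width, height]
            "area": 4000,
            "iscrowd": 0,
        },
    ],
    "categories": [
        {"id": 1, "name": "my_category"},
    ],
}

with open("custom_train_coco.json", "w") as f:  # placeholder file name
    json.dump(coco_style, f)
```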
The following annotations are generated according to the automatic labeling process described in our paper. We report the results based on these annotations.
To use the CC3M annotations, you need to prepare the `CC3M` images first.
| Data | Images | Boxes | File |
| --- | --- | --- | --- |
| CC3M-246K | 246,363 | 820,629 | Download 🤗 |
| CC3M-500K | 536,405 | 1,784,405 | Download 🤗 |
| CC3M-750K | 750,000 | 4,504,805 | Download 🤗 |