Custom dataset using custom_dataset_script.py #52
Custom detection dataset

As detection dataset formats are more varied, here we take the COCO dataset as an example. For other specific formats, you may use this as a template and create your own.

```
coco  # dataset base
├── images  # images
│   ├── train2017  # train images folder
│   │   ├── 100.jpg  # image
│   │   ├── 101.jpg  # image
│   │   └── 102.jpg  # image
│   ├── val2017  # val images folder
│   │   ├── 211.jpg
│   │   ├── 212.jpg
│   │   └── 213.jpg
...
├── labels  # labels
│   ├── train2017  # train labels folder
│   │   ├── 100.txt  # label + bbox info
│   │   ├── 101.txt  # label + bbox info
│   │   └── 102.txt  # label + bbox info
│   ├── val2017  # val labels folder
│   │   ├── 211.txt
│   │   └── 213.txt
```

Each label `.txt` file holds one object per line, in format `label center_x center_y width height`, with values normalized to `[0, 1]`:

```sh
! cat 100.txt
0 0.643867 0.404833 0.050891 0.117708
31 0.659750 0.451573 0.021875 0.056521
```
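For reference, such a label file parses into the `label` / `bbox` lists used in the json format below with a few lines of plain Python (a minimal sketch; `read_label_file` is an illustrative name, not part of the repo):

```py
import numpy as np

def read_label_file(label_path):
    """Parse a label .txt file: one object per line, `label cx cy w h`, normalized to [0, 1]."""
    labels, bboxes = [], []
    with open(label_path) as ff:
        for line in ff:
            values = line.split()
            if len(values) < 5:
                continue  # skip empty or malformed lines
            labels.append(int(values[0]))
            bboxes.append([float(ii) for ii in values[1:5]])
    return np.array(labels), np.array(bboxes)

labels, bboxes = read_label_file("../coco/labels/train2017/100.txt")
print(labels)  # -> [ 0 31] for the sample file shown above
```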
Another dataset structure uses COCO format annotation json files:

```
coco  # dataset base
├── images  # images
│   ├── train2017  # train images folder
│   │   ├── 100.jpg  # image
│   ├── val2017  # val images folder
│   │   ├── 211.jpg
...
└── annotations  # annotations
    ├── instances_train2017.json  # COCO format train annotations
    └── instances_val2017.json  # COCO format test annotations
```

Create dataset json file using:

```sh
# --bbox_source_format cxcywh means source bbox format `[center_x, center_y, width, height]`
python3 custom_dataset_script.py --train_images ../coco/images/train2017/ --train_labels ../coco/labels/train2017/ \
--test_images ../coco/images/val2017/ --test_labels ../coco/labels/val2017 \
--bbox_source_format cxcywh -s coco
```
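For reference, converting between the two bbox source formats is just a reordering of the four values (a minimal numpy sketch; the function name is illustrative, not part of custom_dataset_script.py):

```py
import numpy as np

def cxcywh_to_yxyx(bboxes):
    """Convert normalized [center_x, center_y, width, height] to [top, left, bottom, right]."""
    bboxes = np.asarray(bboxes, dtype="float32")
    cx, cy, ww, hh = bboxes[:, 0], bboxes[:, 1], bboxes[:, 2], bboxes[:, 3]
    return np.stack([cy - hh / 2, cx - ww / 2, cy + hh / 2, cx + ww / 2], axis=-1)

# cxcywh_to_yxyx([[0.643867, 0.404833, 0.050891, 0.117708]])  # values from 100.txt above
```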
Or using the default:

```sh
# Default --bbox_source_format is yxyx, meaning source bbox format `[top, left, bottom, right]`
python3 custom_dataset_script.py --train_images ../coco/images/train2017/ --train_labels ../coco/labels/train2017/ \
--test_split 0.1 -s dodo
```

Or providing the annotation json files directly:

```sh
python3 custom_dataset_script.py --train_images ../coco/images/train2017/ --train_labels ../coco/annotations/instances_train2017.json \
--test_images ../coco/images/val2017/ --test_labels ../coco/annotations/instances_val2017.json -s fofo
```

Then this json file can be used as training dataset:

```sh
CUDA_VISIBLE_DEVICES='1' python3 coco_train_script.py --data_name dodo.json
```

Required json format detail

It's a json file containing at least 2 keys:

```
{
"info": {'num_classes': 80, "base_path": "/datasets"}, # optional
"train": [
{"image": "/dataset/coco/images/train2017/100.jpg",
"objects": {
"label": [65, 65, 49],
"bbox":[[0.548703, 0.476851, 0.321469, 0.523592], [0.264453, 0.457306, 0.380063, 0.478647], [0.498773, 0.489612, 0.997547, 0.979224]]
}
},
{"image": "/dataset/coco/images/train2017/101.jpg",
"objects": {
"label": [0, 21],
"bbox":[[0.643867, 0.404833, 0.050891, 0.117708], [0.65975, 0.451573, 0.021875, 0.056521]]
}
},
],
"test": [
{"image": "/dataset/coco/images/val2017/211.jpg",
"objects": {
"label": [0, 27],
"bbox":[[0.580031, 0.355855, 0.2855, 0.682436], [0.408117, 0.646054, 0.379734, 0.172857]]
}
},
],
"indices_2_labels": {0: "cat", 1: "dog"}, # optional
}
```

Check dataset

```py
from keras_cv_attention_models.coco import data
# Setting `anchors_mode="anchor_free"` will just return the original bbox
tt = data.init_dataset('coco.json', batch_size=16, anchors_mode="anchor_free")[0]
indices_2_labels = None  # For labels different from COCO, specify a map dict like {0: "foo", 1: "goo"} for better display
ax = data.show_batch_sample(tt, anchors_mode="anchor_free", indices_2_labels=indices_2_labels)
```

Example Usage
Custom caption dataset

As caption dataset formats are more varied, here we take the flickr30k dataset as an example. For other specific formats, you may use this as a template and create your own.

```
flickr30k  # dataset base
├── flickr30k-images  # images
│   ├── 100.jpg  # image
│   ├── 101.jpg  # image
│   └── 102.jpg  # image
...
└── results_20130124.token  # caption table, or coco caption annotation json file
```

The caption table file contains the image name to caption mapping info. It could be a tsv or json file, or a COCO caption annotation json format file. A tsv format one could be with 2 columns:

```sh
$ head -n 2 flickr30k/results_20130124.token
1000092795.jpg#0 Two young guys with shaggy hair look at their hands while hanging out in the yard .
1000092795.jpg#1 Two young , White males are outside near many bushes .
```

Or a json file containing a list, where each element is a dict with keys `image` and `caption`:

```
[
{"image": "flickr30k/flickr30k-images/3391453209.jpg", "caption": "A woman in a black coat stands on a curb outside a market ."},
{"image": "flickr30k/flickr30k-images/44904567.jpg", "caption": "A man using an electric razor shaves someone 's head ."},
]
```

Create dataset json file by:

```sh
python3 custom_dataset_script.py --train_images flickr30k/flickr30k-images/ \
--train_captions flickr30k/results_20130124.token --test_split 0.1
```

Or provide standalone test images and caption annotation files:

```sh
python3 custom_dataset_script.py --train_images coco_dog_cat/train2017/images/ --test_images coco_dog_cat/val2017/images/ \
--train_captions annotations/captions_train2017.json --test_captions annotations/captions_val2017.json \
-s coco_captions
```

Target saving format can also be tsv by adding `--save_format tsv`:

```sh
python3 custom_dataset_script.py --train_images flickr30k/flickr30k-images/ \
--train_captions flickr30k/results_20130124.token --test_split 0.1 --save_format tsv
# >>>> total_train_samples: 143023, total_test_samples: 15892
# >>>> Saved to: flickr30k.tsv
```

Then this file can be used as training dataset:

```sh
CUDA_VISIBLE_DEVICES='1' python3 train_script.py --data_name flickr30k.tsv --model BeitBasePatch16 --text_model GPT2_Base \
--optimizer adam --disable_positional_related_ops --random_crop_min 1
```

Required json / tsv format detail

It's a json file containing at least 2 keys:

```
{
"info": {"base_path": "/datasets"}, # optional
"train": [
{"image": "flickr30k/flickr30k-images/2224450291.jpg", "caption": "The man is outdoors , holding a camera ."},
{"image": "flickr30k/flickr30k-images/3643175169.jpg", "caption": "A man stands on a ladder propped up against a brick building ."}
],
"test": [
{"image": "flickr30k/flickr30k-images/374538975.jpg", "caption": "A choir performing for an audience ."}
]
}
```

Or a tsv file, where an optional first `base_path` row sets the images base path, and a `TEST TEST` row separates the train and test splits:

```
base_path /datasets
flickr30k/flickr30k-images/5403974296.jpg A woman pulling a slingshot back .
flickr30k/flickr30k-images/1263801010.jpg A person in a red coat looking out at a snowy landscape .
TEST TEST
flickr30k/flickr30k-images/3192005501.jpg woman in the hospital sticking her tongue out
flickr30k/flickr30k-images/8189395281.jpg A woman wearing fishnet stockings is practicing her skating while her coach watches her
```
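Reading this tsv format back is straightforward; a minimal sketch (assuming tab-separated columns, the optional `base_path` header row, and the `TEST TEST` separator shown above; the function name is illustrative):

```py
def load_caption_tsv(tsv_path):
    """Split a caption tsv into train / test (image, caption) lists, honoring base_path."""
    base_path, splits, current = "", {"train": [], "test": []}, "train"
    with open(tsv_path) as ff:
        for line in ff:
            if "\t" not in line:
                continue  # skip blank or malformed rows
            image, caption = line.rstrip("\n").split("\t", 1)
            if image == "base_path":
                base_path = caption  # optional header row with the images base path
            elif image == "TEST" and caption == "TEST":
                current = "test"  # rows after the `TEST TEST` separator are the test split
            else:
                splits[current].append((base_path + "/" + image if base_path else image, caption))
    return splits["train"], splits["test"]

# train, test = load_caption_tsv("flickr30k.tsv")
```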
Check dataset

```py
from keras_cv_attention_models import clip

caption_tokenizer = clip.GPT2Tokenizer('gpt2')
tt = clip.init_dataset('flickr30k.tsv', batch_size=16, caption_tokenizer=caption_tokenizer)[0]
ax = clip.show_batch_sample(tt, caption_tokenizer=caption_tokenizer, rescale_mode='torch')
```
Custom recognition dataset
For a data folder in a format like the following, with one sub-folder per class (the tree below is illustrative; folder and class names are placeholders):
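```
goo  # dataset base, name is a placeholder
├── train  # train images folder
│   ├── cat  # class name folder
│   │   ├── 100.jpg
│   │   └── 101.jpg
│   └── dog
│       ├── 102.jpg
│       └── 103.jpg
└── test  # test images folder, same layout
    ├── cat
    │   └── 211.jpg
    └── dog
        └── 212.jpg
```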
Create dataset json file by:
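A sketch reusing the flags shown in the detection and caption examples above (`goo` is a placeholder name, matching the training command below):

```sh
python3 custom_dataset_script.py --train_images goo/train/ --test_images goo/test/ -s goo
```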
Or use `--test_split` for a dataset not having a standalone `test` folder, as in the sketch below:
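Again with placeholder names:

```sh
# --test_split 0.1 splits 10% out of train as the test set, as in the caption example above
python3 custom_dataset_script.py --train_images goo/train/ --test_split 0.1 -s goo
```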
Then this json file can be used as training dataset:

```sh
CUDA_VISIBLE_DEVICES='1' python3 train_script.py --data_name goo.json
```
Required json format detail

It's a json file containing at least 2 keys `['train', 'test']` or `['train', 'validation']`. Each of `'train'` / `'test'` / `'validation'` is a list of dicts, and each dict has 2 keys `'image'` and `'label'`. If both `'test'` and `'validation'` are provided, the `'validation'` one will be picked. Optional key `info` contains elements `"num_classes"` and `"base_path"`. `"base_path"` is the absolute path of `./`; you may change this value if the dataset is moved to a new path. `"num_classes"` is also optional; the max value from all labels will be used if not provided. `indices_2_labels` is used to map int indices to class names.

Check dataset
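By analogy with the detection and caption examples above, a quick visual check could look like the sketch below; it assumes `keras_cv_attention_models.imagenet.data` exposes matching `init_dataset` / `show_batch_sample` helpers.

```py
from keras_cv_attention_models.imagenet import data  # assumed module, by analogy with coco / clip above

# init_dataset returns a tuple; index [0] picks the train dataset, as in the detection example
tt = data.init_dataset('goo.json', batch_size=16)[0]
ax = data.show_batch_sample(tt)
```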