
Commit 7924b5a

sungchul2 and sungchul.kim authored
Add visual prompting documentation (#2354)
* (WIP) write docs
* Add visual prompting documentation
* Update CHANGELOG

Co-authored-by: sungchul.kim <sungchul@ikvensx010>
1 parent 06142b6 commit 7924b5a

File tree

3 files changed: +103 -0 lines changed


CHANGELOG.md (+1)

@@ -15,6 +15,7 @@ All notable changes to this project will be documented in this file.
 - Add new visual prompting task: train/eval (https://github.com/openvinotoolkit/training_extensions/pull/2203)
 - Add new visual prompting task: export (https://github.com/openvinotoolkit/training_extensions/pull/2274)
 - Add new visual prompting task: deploy (https://github.com/openvinotoolkit/training_extensions/pull/2311)
+- Add new visual prompting task: documentation (https://github.com/openvinotoolkit/training_extensions/pull/2354)
 - Add new visual prompting task: optimize (PTQ) (https://github.com/openvinotoolkit/training_extensions/pull/2318)
 - Add new object detector ResNeXt101-ATSS (<https://github.com/openvinotoolkit/training_extensions/pull/2309>)

docs/source/guide/explanation/algorithms/index.rst (+1)

@@ -31,3 +31,4 @@ Contents
    segmentation/index
    anomaly/index
    action/index
+   visual_prompting/index

docs/source/guide/explanation/algorithms/visual_prompting/index.rst (new file, +101)

@@ -0,0 +1,101 @@

Visual Prompting
================

Visual prompting is a computer vision task that uses a combination of an image and prompts, such as texts, bounding boxes, points, and so on, to solve problems.
Using these prompts, the main purpose of this task is to obtain labels from unlabeled datasets, and then to use the generated label information on particular domains or to develop a new model with it.

This section examines the solutions for visual prompting offered by the OpenVINO Training Extensions library.
`Segment Anything (SAM) <https://arxiv.org/abs/2304.02643>`_ is one of the best-known visual prompting methods, and this model will be used to adapt to a new dataset domain.
Because `SAM <https://arxiv.org/abs/2304.02643>`_ was trained on a web-scale dataset and has a huge backbone network, fine-tuning the whole network is difficult and requires a lot of resources.
Therefore, we fine-tune only the mask decoder, and only for several epochs, to increase performance on the new dataset domain.
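
As an illustration of this decoder-only strategy, here is a minimal PyTorch-style sketch. It assumes the submodule naming of the reference SAM implementation (``image_encoder``, ``prompt_encoder``, ``mask_decoder``); the learning rate is a placeholder, not a tuned OTX default:

.. code-block:: python

    import torch

    def freeze_all_but_decoder(sam: torch.nn.Module) -> torch.optim.Optimizer:
        """Freeze SAM except the mask decoder and build an optimizer for it."""
        for name, param in sam.named_parameters():
            # Train only the lightweight mask decoder; keep the huge image
            # encoder and the prompt encoder frozen.
            param.requires_grad = name.startswith("mask_decoder")
        trainable = [p for p in sam.parameters() if p.requires_grad]
        return torch.optim.Adam(trainable, lr=1e-5)  # placeholder learning rate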

For fine-tuning `SAM <https://arxiv.org/abs/2304.02643>`_, we use the following algorithm components:

.. _visual_prompting_finetuning_pipeline:

- ``Pre-processing``: Resize an image according to the longest axis and pad the rest with zeros.

- ``Optimizer``: We use the `Adam <https://arxiv.org/abs/1412.6980>`_ optimizer.

- ``Loss function``: We use the standard loss combination from `SAM <https://arxiv.org/abs/2304.02643>`_ as is: 20 * focal loss + dice loss + IoU loss (see the sketch after this list).

- ``Additional training techniques``:

  - ``Early stopping``: Automatically applied to add adaptability to the training pipeline and to prevent overfitting.
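
The loss combination can be sketched as follows (PyTorch-style; tensor shapes, function names, and the focal-loss hyper-parameters are assumptions for illustration, not the exact OTX implementation):

.. code-block:: python

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
        # Sigmoid focal loss over per-pixel mask logits.
        prob = logits.sigmoid()
        ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p_t = prob * targets + (1 - prob) * (1 - targets)
        loss = ce * (1 - p_t) ** gamma
        loss = (alpha * targets + (1 - alpha) * (1 - targets)) * loss
        return loss.mean()

    def dice_loss(logits, targets, eps=1.0):
        # Dice loss: 1 - 2|A∩B| / (|A| + |B|), computed per mask.
        prob = logits.sigmoid().flatten(1)
        targets = targets.flatten(1)
        inter = (prob * targets).sum(-1)
        return (1 - (2 * inter + eps) / (prob.sum(-1) + targets.sum(-1) + eps)).mean()

    def sam_loss(mask_logits, gt_masks, pred_iou, gt_iou):
        # 20 * focal + dice + IoU loss; the IoU term regresses the predicted
        # mask quality against the actual mask IoU, as in SAM.
        return (20.0 * focal_loss(mask_logits, gt_masks)
                + dice_loss(mask_logits, gt_masks)
                + F.mse_loss(pred_iou, gt_iou))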

.. note::

    Currently, only fine-tuning `SAM <https://arxiv.org/abs/2304.02643>`_ with bounding boxes is supported in OpenVINO Training Extensions.
    We will support fine-tuning with other prompts (points and texts) and continuous fine-tuning with predicted mask information in the near future.
30+
.. note::
31+
32+
Currently, Post-Training Quantization (PTQ) for `SAM <https://arxiv.org/abs/2304.02643>`_ is only supported, not Quantization Aware Training (QAT).
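
For context, PTQ in the OpenVINO ecosystem is typically driven by `NNCF <https://github.com/openvinotoolkit/nncf>`_. The sketch below shows that generic flow on an exported IR model; the model path, input shape, and calibration source are hypothetical placeholders, and this is not the exact code path inside OpenVINO Training Extensions:

.. code-block:: python

    import numpy as np
    import nncf
    import openvino.runtime as ov

    core = ov.Core()
    model = core.read_model("sam_encoder.xml")  # hypothetical exported IR path

    # Placeholder calibration source; in practice, feed real pre-processed
    # samples from the training dataset. The shape here is made up and must
    # match the exported model's input.
    items = [np.random.rand(1, 3, 1024, 1024).astype(np.float32) for _ in range(300)]
    calibration = nncf.Dataset(items, transform_func=lambda item: item)

    quantized_model = nncf.quantize(model, calibration)
    ov.serialize(quantized_model, "sam_encoder_int8.xml")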

**************
Dataset Format
**************

.. _visual_prompting_dataset:

For the dataset handling inside OpenVINO™ Training Extensions, we use the `Dataset Management Framework (Datumaro) <https://github.com/openvinotoolkit/datumaro>`_.

We support three dataset formats for visual prompting:

- `Common Semantic Segmentation <https://openvinotoolkit.github.io/datumaro/stable/docs/data-formats/formats/common_semantic_segmentation.html>`_ for semantic segmentation

- `COCO <https://openvinotoolkit.github.io/datumaro/stable/docs/data-formats/formats/coco.html>`_ for instance segmentation

- `Pascal VOC <https://openvinotoolkit.github.io/datumaro/stable/docs/data-formats/formats/pascal_voc.html>`_ for instance segmentation and semantic segmentation

If your dataset is organized in one of the supported formats, starting training is very simple. We just need to pass a path to the root folder and the desired model template to start training:

.. code-block::

    $ otx train <model_template> \
        --train-data-roots <path_to_data_root> \
        --val-data-roots <path_to_data_root>

.. note::

    During training, mDice on binary masks without label information is used as the train/validation metric.
    After training, when ``otx eval`` is used to evaluate performance, mDice on binary or multi-class masks with label information is used.
    As you can expect, the scores will differ between ``otx train`` and ``otx eval``, but if the unlabeled mask performance is high, the labeled mask performance will be high as well.
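
For reference, the underlying metric is a simple overlap ratio. A minimal NumPy sketch of the Dice score on a single binary mask (an illustration, not the exact OTX metric code):

.. code-block:: python

    import numpy as np

    def dice_score(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
        # Dice = 2|A ∩ B| / (|A| + |B|) for binary masks A (prediction) and B (ground truth).
        pred, gt = pred.astype(bool), gt.astype(bool)
        intersection = np.logical_and(pred, gt).sum()
        return float((2.0 * intersection + eps) / (pred.sum() + gt.sum() + eps))

    # Toy usage: the prediction and ground truth overlap in 2 of 3 pixels each.
    pred = np.zeros((4, 4)); pred[0, :3] = 1
    gt = np.zeros((4, 4)); gt[0, 1:] = 1
    print(round(dice_score(pred, gt), 3))  # 0.667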
******
Models
******

.. _visual_prompting_model:

We support the following model template in the experimental phase:

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+---------------------+-----------------+
| Template ID                                                                                                                                                                              | Name      | Complexity (GFLOPs) | Model size (MB) |
+==========================================================================================================================================================================================+===========+=====================+=================+
| `Visual_Prompting_SAM_ViT_B <https://github.com/openvinotoolkit/training_extensions/blob/develop/src/otx/algorithms/visual_prompting/configs/sam_vit_b/template_experimental.yaml>`_ | SAM_ViT_B | 487                 | 374             |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+---------------------+-----------------+
79+
To check feasibility of `SAM <https://arxiv.org/abs/2304.02643>`_, we did experiments using three public datasets with each other domains: `WGISD <https://github.com/thsant/wgisd>`_, `Trashcan <https://conservancy.umn.edu/handle/11299/214865>`_, and `FLARE22 <https://flare22.grand-challenge.org/>`_, and checked `Dice score <https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient>`_.
80+
We used sampled training data from `Trashcan <https://conservancy.umn.edu/handle/11299/214865>`_ and `FLARE22 <https://flare22.grand-challenge.org/>`_, and full training data (=110) from `WGISD <https://github.com/thsant/wgisd>`_. The below table shows performance improvement after fine-tuning.
81+

+---------------------------------------------------------------+--------------------+--------+-------------------+
| Dataset                                                       | #samples           | Before | After fine-tuning |
+===============================================================+====================+========+===================+
| `WGISD <https://github.com/thsant/wgisd>`_                    | 110                | 92.32  | 92.46 (+0.14)     |
+---------------------------------------------------------------+--------------------+--------+-------------------+
| `Trashcan <https://conservancy.umn.edu/handle/11299/214865>`_ | 100                | 79.61  | 83.92 (+4.31)     |
+---------------------------------------------------------------+--------------------+--------+-------------------+
| `FLARE22 <https://flare22.grand-challenge.org/>`_             | 1 CT (=100 slices) | 91.48  | 91.68 (+0.20)     |
+---------------------------------------------------------------+--------------------+--------+-------------------+

Depending on the dataset, the ``learning rate`` and ``batch size`` can be adjusted as below:

.. code-block::

    $ otx train <model_template> \
        --train-data-roots <path_to_data_root> \
        --val-data-roots <path_to_data_root> \
        params \
        --learning_parameters.dataset.train_batch_size <batch_size_to_be_updated> \
        --learning_parameters.optimizer.lr <learning_rate_to_be_updated>
