Fix early stopping in converter patching + fix lr warmup for all tasks #4131

Merged
sovrasov merged 14 commits into releases/2.2.0 from kp/update_converter
Dec 4, 2024

Conversation

kprokofi
Collaborator

@kprokofi kprokofi commented Nov 25, 2024

Summary

Fix bugs found in OTX 2.2 for the OTX 2.2.1 and Geti 2.6 releases.

  • Make the converter override EarlyStoppingWithWarmup instead of Lightning's early stopping callback. (I checked with a Geti config that early stopping is now overridden correctly.)

  • Add an interface for scheduler warmup to the configs of all models, so that warmup can be set for every model. Also, a warmup of 3 steps looks like a bug; this PR sets it to 0 for now. (We need to search for a better schedule later.)

  • Change the patience of ReduceLROnPlateau for the classification task from 1 epoch to 5 (this seems reasonable to me; 1 epoch looks very strange).

  • Increase the early stopping patience for classification and align this default across all current Geti tasks. (patience=3 is not reasonable and may hurt model convergence/performance due to random fluctuations.)
    Results from Geti clearly show that more epochs are sometimes required to converge.
    image

  • Template update
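To illustrate why a small patience can stop training on a random fluctuation, and how a warmup phase delays stopping, here is a minimal pure-Python sketch. It is not the OTX `EarlyStoppingWithWarmup` implementation: the class name, the `warmup_epochs` parameter (counted in epochs here for simplicity), and the accuracy curve are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EarlyStoppingWithWarmupSketch:
    """Hypothetical stand-in (not the OTX class): patience-based stopping
    that is not allowed to fire while the warmup phase is running."""

    patience: int = 10       # validation checks without improvement to tolerate
    warmup_epochs: int = 0   # assumed unit: epochs during which stopping is suppressed
    min_delta: float = 0.0   # minimum gain that counts as an improvement
    best: float = float("-inf")
    wait: int = 0

    def should_stop(self, epoch: int, metric: float) -> bool:
        """Record one validation result (higher is better); return True to stop."""
        if metric > self.best + self.min_delta:
            self.best = metric
            self.wait = 0
        else:
            self.wait += 1
        # Stopping is only permitted once the warmup phase is over.
        return epoch >= self.warmup_epochs and self.wait >= self.patience


def epochs_run(stopper, metrics):
    """Return how many epochs run before the stopper fires (or all of them)."""
    for epoch, metric in enumerate(metrics):
        if stopper.should_stop(epoch, metric):
            return epoch + 1
    return len(metrics)


# Made-up accuracy curve: a short dip, then further improvement.
curve = [0.50, 0.60, 0.59, 0.58, 0.59, 0.60, 0.65, 0.70, 0.70, 0.70, 0.70, 0.70]

stopped_p3 = epochs_run(EarlyStoppingWithWarmupSketch(patience=3), curve)
stopped_p5 = epochs_run(EarlyStoppingWithWarmupSketch(patience=5), curve)
stopped_warm = epochs_run(EarlyStoppingWithWarmupSketch(patience=1, warmup_epochs=4), curve)
print(stopped_p3, stopped_p5, stopped_warm)
```

On this toy curve, patience=3 stops during the dip and never sees the later gain, while patience=5 trains through it. With warmup, even patience=1 cannot stop before the warmup phase has ended.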

image

How to test

Checklist

  • I have added unit tests to cover my changes.
  • I have added integration tests to cover my changes.
  • I have run e2e tests and there are no issues.
  • I have added the description of my changes into CHANGELOG in my target branch (e.g., CHANGELOG in develop).
  • I have updated the documentation in my target branch accordingly (e.g., documentation in develop).
  • I have linked related issues.

License

  • I submit my code changes under the same Apache License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below).
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

@github-actions github-actions bot added the TEST Any changes in tests label Nov 27, 2024

codecov bot commented Nov 27, 2024

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Please upload report for BASE (releases/2.2.0@c6e2952). Learn more about missing BASE report.

Files with missing lines | Patch % | Lines
src/otx/core/model/base.py | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@                Coverage Diff                @@
##             releases/2.2.0    #4131   +/-   ##
=================================================
  Coverage                  ?   80.57%           
=================================================
  Files                     ?      276           
  Lines                     ?    27471           
  Branches                  ?        0           
=================================================
  Hits                      ?    22136           
  Misses                    ?     5335           
  Partials                  ?        0           
Flag | Coverage Δ
py310 | 80.48% <0.00%> (?)
py311 | 80.57% <0.00%> (?)


sovrasov
sovrasov previously approved these changes Nov 28, 2024
@sovrasov
Contributor

sovrasov commented Nov 28, 2024

@kprokofi do you have any experiments to compare the modified schedule for classification?

eugene123tw
eugene123tw previously approved these changes Nov 29, 2024
@kprokofi kprokofi dismissed stale reviews from eugene123tw and sovrasov via de0c0cb November 29, 2024 15:40
@kprokofi
Collaborator Author

kprokofi commented Dec 2, 2024

After benchmarking all models, I found that the updated schedule can indeed give up to a 2% accuracy gain. However, training time has grown drastically: for large datasets, I see 2-2.5x the training time. So I will decrease the patience for classification from 10 to 5 epochs and the scheduler patience from 5 to 3.

Unfortunately, we don't have small datasets for testing edge cases like those in Geti. We have only one small dataset, used for CI, and its accuracy reaches 1 after only a few epochs.
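The interplay between scheduler patience and early stopping patience can be sketched with a toy simulation. This is only an illustration under stated assumptions: the `simulate` helper and the plateau curve are made up, the learning rate is assumed to be reduced once the number of non-improving epochs exceeds the scheduler patience, and the numbers below do not reproduce the 2-2.5x measurement above.

```python
def simulate(metrics, sched_patience, stop_patience):
    """Toy model of a reduce-on-plateau scheduler combined with early stopping.

    Returns (epochs_run, lr_reductions). Assumption: the LR is reduced once the
    count of non-improving epochs exceeds sched_patience, and training stops
    once stop_patience epochs in a row bring no improvement.
    """
    best = float("-inf")
    sched_wait = 0
    stop_wait = 0
    reductions = 0
    for epoch, metric in enumerate(metrics, start=1):
        if metric > best:
            best = metric
            sched_wait = 0
            stop_wait = 0
        else:
            sched_wait += 1
            stop_wait += 1
        if sched_wait > sched_patience:
            reductions += 1   # the learning rate would be scaled down here
            sched_wait = 0
        if stop_wait >= stop_patience:
            return epoch, reductions
    return len(metrics), reductions


# Hypothetical curve: four improving epochs, then a flat plateau.
plateau = [0.6, 0.7, 0.8, 0.9] + [0.9] * 20

before = simulate(plateau, sched_patience=5, stop_patience=10)
after = simulate(plateau, sched_patience=3, stop_patience=5)
print(before, after)
```

In this sketch both settings still get one learning rate reduction before stopping, but the smaller patience values spend fewer epochs on the plateau, which is the training-time trade-off described above.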

@kprokofi kprokofi requested review from eugene123tw and sovrasov and removed request for eugene123tw December 2, 2024 12:40
@sovrasov
Contributor

sovrasov commented Dec 3, 2024

@kprokofi could you update the changelog + also mention Dino V2 PR there?

@github-actions github-actions bot added the DOC Improvements or additions to documentation label Dec 3, 2024
@kprokofi
Collaborator Author

kprokofi commented Dec 3, 2024

> @kprokofi could you update the changelog + also mention Dino V2 PR there?

Done

sovrasov
sovrasov previously approved these changes Dec 3, 2024
@kprokofi
Collaborator Author

kprokofi commented Dec 3, 2024

Classification tests:
image

eugene123tw
eugene123tw previously approved these changes Dec 4, 2024
Contributor

@eugene123tw eugene123tw left a comment

Thanks, Kirill. This PR increases the early stopping patience at project level, which will affect all tasks. Let’s keep in mind that this might lead to increased accuracy but also training times in Geti, so we should monitor this closely.

@kprokofi
Collaborator Author

kprokofi commented Dec 4, 2024

> Thanks, Kirill. This PR increases the early stopping patience at project level, which will affect all tasks. Let’s keep in mind that this might lead to increased accuracy but also training times in Geti, so we should monitor this closely.

In the end, it increases the early stopping patience for classification only.
patience=10 has always been the default in train.yaml.
I also didn't change it in the templates for other tasks.

image

@kprokofi kprokofi dismissed stale reviews from eugene123tw and sovrasov via cd14c7b December 4, 2024 10:44
@sovrasov sovrasov merged commit 5707bc5 into releases/2.2.0 Dec 4, 2024
3 checks passed
@sovrasov sovrasov deleted the kp/update_converter branch December 4, 2024 10:46
sovrasov added a commit that referenced this pull request Dec 19, 2024
* update for releases 2.2.0rc0

* Fix Classification explain forward issue (#3867)

Fix bug

* Fix e2e code error (#3871)

* Update test_cli.py

* Update tests/e2e/cli/test_cli.py

Co-authored-by: Eunwoo Shin <eunwoo.shin@intel.com>

* Update test_cli.py

* Update test_cli.py

---------

Co-authored-by: Eunwoo Shin <eunwoo.shin@intel.com>

* Add documentation about configurable input size (#3870)

* add docs about configurable input size

* update api usecase and fix bug

* Fix zero-shot e2e (#3876)

Fix

* Fix DeiT for multi-label classification (#3881)

Remove init_args

* Fix Semi-SL for ViT accuracy drop (#3883)

Remove init_args

* Update docs for 2.2 (#3884)

Update docs

* Fix mean and scale for segmentation task (#3885)

fix mean and scale

* Update MAPI in 2.2 (#3889)

* Bump MAPI

* Update exportable code requirements

* Improve Semi-SL for LiteHRNet (small-medium case) (#3891)

* change drop pixels value

* go safe, change only tested models

* minor

* Improve h-cls for eff models (#3893)

* Update step size for eff v2

* Update effb0 recipe

* Fix maskrcnn swin nncf acc drop (#3900)

update maskrcnn swimt model type to transformer

* Add keypoint detection recipe for single object cases (#3903)

* add rtmpose_tiny for single obj

* add rtmpose_tiny for single obj

* modify test subset name

* fix unit test

* update recipe with reset

* Improve acc drop of efficientnetv2 for h-label cls (#3907)

* Add warmup_iters for effv2

* Update max_epochs

* Fix pretrained weight cached dir for timm (#3909)

* Fix pretrained_weight for timm

* Fix unit-test

* Fix keypoint detection single obj recipe (#3915)

* add rtmpose_tiny for single obj

* modify test subset name

* fix unit test

* property for pck

* Fix cached dir for timm & hugging-face (#3914)

* Fix cached dir

* Pretrained weight download unit-test

* Fix pre-commit

* Fix wrong template id mapping for anomaly (#3916)

* Update script to allow setting otx version using env. variable (#3913)

* Fix Datamodule creation for OV in AutoConfigurator (#3920)

Fix datamodule for ov

* Update tpp file for 2.2.0 (#3921)

* Fix names for ignored scope [HOT-FIX, 2.2.0] (#3924)

fix names for ignored scope

* Fix classification rt_info (#3922)

* Restore output_raw_scores for classificaiton

* Add uts

* Fix linter

* Update label info (#3925)

add label info to init

Signed-off-by: Ashwin Vaidya <ashwinnitinvaidya@gmail.com>

* Fix binary classification metric task (#3928)

* Fix binary classification

* Add unit-tests

* Improve MaskRCNN SwinT NNCF (#3929)

* ignore heads and disable smooth quant

* add activations_range_estimator_params

* update changelog

* Fix get_item for Chained Tasks in Classification (#3931)

* Fix Task Chain

* Add multi-label case as well

* Add multi-label case as well2

* Add H-label case

* Correct Keyerror for h-label cls in label_groups for dm_label_categories using label's id/key (#3932)

Modify label_groups for dm_label_categories with id/key of label

* Remove datumaro attribute id from tiling, add subset names (#3933)

* remove datumaro attribute id from tiling

* add subset names

* Fix soft predictions for Semantic Segmentation (#3934)

fix soft preds

* Update STFPM config (#3935)

* Add missing pretrained weights when creating a docker image (#3938)

* Fix pre-trained weight downloader

* Remove if condition for pretrained wiehgt download

* Change default option 'full' to 'base' in otx install (#3937)

* Change option full to base for otx install

* Fix wrong code

* Fix issue

* Fix docs

* Fix auto adapt batch size in Converter (#3939)

* Enable auto adapt batch size into converter

* Fix wrong

* Fix hpo converter (#3940)

* save best hp after hpo

* add test

* Fix tiling XAI out of range (#3943)

- Fix tile merge XAI out of range

* enable model export (#3952)

Signed-off-by: Ashwin Vaidya <ashwinnitinvaidya@gmail.com>

* Move templates from OTX1.X to OTX2.X (#3951)

* add otx1.6 templates

* added new models

* delete entrypoints and nncf cfg

* updated some hyperparams

* fix for rtmdet_tiny

* updated converter

* Update classification templates

* Update det, r-det, vpm

* Update template.yaml

* changed warmaup value in train.yaml

---------

Co-authored-by: Kang, Harim <harim.kang@intel.com>
Co-authored-by: Kim, Sungchul <sungchul.kim@intel.com>

* Add missing tile recipes and various tile recipe changes  (#3942)

* add missing tile recipes

* Fix tiling XAI out of range (#3943)

- Fix tile merge XAI out of range

* update xai tile merge

* update rtdetr

* update tile recipes

* update rtdetr tile postprocess

* update rtdetr recipes and tile recipes

* update tile recipes

* fix rtdetr unittest

* update recipes

* refactor tile unit test

* address pr reviews

* remove unnecessary files

* update color channel

* fix image channel passing

* include tiling in cli integration test

* remove transform_bbox

---------

Co-authored-by: Vladislav Sovrasov <sovrasov.vlad@gmail.com>

* Support ImageFromBytes (#3948)

* add image_from_bytes

Signed-off-by: Ashwin Vaidya <ashwinnitinvaidya@gmail.com>

* refactor code

Signed-off-by: Ashwin Vaidya <ashwinnitinvaidya@gmail.com>

* allow empty anomalous masks

Signed-off-by: Ashwin Vaidya <ashwinnitinvaidya@gmail.com>

---------

Signed-off-by: Ashwin Vaidya <ashwinnitinvaidya@gmail.com>

* Change categories mapping logic (#3946)

* change pre-filtering logic

* Update src/otx/core/data/pre_filtering.py

Co-authored-by: Eunwoo Shin <eunwoo.shin@intel.com>

---------

Co-authored-by: Eunwoo Shin <eunwoo.shin@intel.com>

* Update for 2.2.0rc1 (#3956)

* Include Geti arrow dataset subset names (#3962)

* restrited number of output masks by tiling

* add geti subset name

* update num of max pred

* Include full image with anno in case there's no tile in tile dataset (#3964)

* include full image with anno incase there's no tile in dataset

* update test

* Add type checker in converter for callable functions (optimizer, scheduler) (#3968)

Fix converter callable functions (optimizer, scheduler)

* Update for 2.2.0rc2 (#3969)

update for 2.2.0rc2

* Fix config converter for tiling (#3973)

fix config converter for tiling

* Update for 2.2.0rc3 (#3975)

* Change sematic segmentation to consider bbox only annotations. (#3996)

* segmentation consider bbox only annotations

* add unit test

* add unit test

* update fixture

* use name attribute

* revert tox file

* update for 2.2.0rc4

---------

Co-authored-by: Yunchu Lee <yunchu.lee@intel.com>

* Relieve memory usage criteria on batch size 2 during adaptive_bs (#4009)

* release memory usage cirteria on batch size 2 during adpative_bs

* update unit test

* update unit test

* Remove background label from RT Info for segmentation task (#4011)

* remove background from rt_info

* provide another solution

* fix unit test

* Fix num_trials calculation on dataset length less than num_class (#4014)

Fix balanced sampler

* Fix out_features in HierarchicalCBAMClsHead (#4016)

Fix out_features

* Fix empty anno (#4010)

* Refactor mask_target_single function to handle unsupported ground truth mask types and provide warnings for missing ground truth masks

* Refactor bbox_overlaps function to handle unsupported ground truth mask types and provide warnings for missing ground truth masks

* Refactor export script to export multiple directories

* Refactor test_bbox_overlaps_2d to handle mismatched batch dimensions of bboxes

* Refactor bbox_overlaps function error exception

* update changelog

---------

Co-authored-by: Harim Kang <harim.kang@intel.com>

* Update for release 2.2.0rc5 (#4015)

* Prevent using too low confidence thresholds in detection (#4018)

Prevent writing too low confidence thresholds to MAPI configuration

* Update for release 2.2.0rc6 (#4027)

* Update pre-merge workflow (#4032)

* Update HPO interface (#4035)

* update hpo interface

* update unit test

* update CHANGELOG.md

* Enable keypoint detection training through config conversion (#4034)

enable keypoint det config converter

* Update for release 2.2.0rc7 (#4036)

update for release 2.2.0rc7

* Fix multilabel_accuracy of MixedHLabelAccuracy (#4042)

* Fix metric for multi-label

* Fix1

* Add CHANGELOG

* Update for release 2.2.0rc8 (#4043)

* Fix wrong indices setting in HLabelInfo (#4044)

* Fix wrong indices setting in label_info

* Add unit-test & update for releases

* Add legacy template LiteHRNet_18 template (#4049)

added legacy template

* Model templates: rename model_status value 'DISCONTINUED' to 'OBSOLETE' (#4051)

rename 'DISCONTINUED' to 'OBSOLETE' in model templates

* Enable export of feature vectors for semantic segmentation task (#4055)

* Upgrade MAPI in 2.2 (#4052)

* Update MRCNN model export to include feature vector and saliency map (#4056)

* Fix applying model's hparams when loading model from checkpoint (#4057)

* Update anomaly transforms (#4059)

* Update transforms

Signed-off-by: Ashwin Vaidya <ashwinnitinvaidya@gmail.com>

* Update transforms

Signed-off-by: Ashwin Vaidya <ashwinnitinvaidya@gmail.com>

* Update changelog

Signed-off-by: Ashwin Vaidya <ashwinnitinvaidya@gmail.com>

* Update __init__.py

---------

Signed-off-by: Ashwin Vaidya <ashwinnitinvaidya@gmail.com>
Co-authored-by: Emily Chun <emily.chun@intel.com>

* Bump onnx to 1.17.0 to omit CVE-2024-5187 (#4063)

* Fix incorrect all_groups order configuration in HLabelInfo (#4067)

* Fix all_labels

* Update CHAGELOG

* label_groups change

* Fix wrong model name in converter & template (#4082)

* Fix wrong

* Update CHAGELOG

* RTMDet Inst Seg Explain Mode for 2.2 (#4083)

* Explain mode for RTMDet Inst Seg

* Update changelog

* reformat changelog

* Fix rtdetr recipes (#4079)

* Fix recipes

* Update CHANGELOG

* Enable adaptive_bs with Efficientnet-V2-L model template (#4085)

Enable adaptive_bs with Efficientnet-V2-L model

* Add Keypoint Detection legacy template (#4094)

added rtmpose_template

* Revert the old workaround for detection confidence threshold (#4096)

Revert the old workaround

* OTX RC 2.2 version up (#4099)

* Update changelog

* OTX version up

* Fix linter

* Add dummy XAI to RTDETR (export mode) & disable strong aug (#4106)

* Implement warning for unsupported explain mode in DETR model and update transform probabilities to zero in RTDETR recipes

* update changelog

* Update photometric distortion probability in RTDETR recipes

* Fix task chain for Det -> Cls / Seg (#4105)

* fix linter

* return recipe back

* added roi extraction for multi cllass classification datasett

* fix linter

* add same logic to semantic seg

* added test for OTXDataset

* add clip and raise an error when coordinates are invalid.

* rewrite value error

* Disable tiling classifier toggle in configurable parameters (#4107)

* Disable tiling classifier toggle in configurable parameters

* Update changelog

* Update keypoint detection template (#4114)

* added default template

* update field

* Minor update of the Changelog for the releases/2.2 branch (#4116)

* minor update

* minor

* Version up for 2.2 release (#4120)

Version up

* Allow empty tile annotation (#4124)

* Add warnings for empty annotations in OTXTileDetTestDataset and OTXTileInstSegTestDataset

* Fix empty annotation handling in tiling

* Fix tensor type compatibility in dynamic soft label assigner and RTMDet head (#4140)

* Fix tensor type compatibility in dynamic soft label assigner and RTMDet head

* Update CHANGELOG

* Update Label Info handling (#4127)

* Update h-cls info

* Revert h-cls head to linear one

* Cosmetic changes

* Add arrow-specific labels management logic for cls

* Update export logic

* Update label info usage

* Update unit tests

* Fix linter

* Fix unit tests

* Fix linter

* Consider multilabel scenario in h-cls

* Update dataset docstring

* Add unit tests

* Don't preprocess h-cls dataset for arrow

* Fimussing labels in multilabel training

* Revert hcls head for effnet b0

* Update converter to pick up cls task

* Fix early stopping in converter patching + fix lr warmup for all tasks (#4131)

* fix converter and early stopping + fix warmup epochs

* fix linter

* fix linter2

* aligned default patience=10 for all tasks

* fix linte

* fix unit tests

* revert epoch to steps back, change templates

* fix cls templates

* fix unit test

* revert rotated det back.

* change schedule for classification

* fix linter

* update changelog

* Decouple DinoV2 for semantic segmentation (#4136)

* dinov2 decoupled. Perf tests

* added dino

* remove dinov2 backbone

* fix linter

* remove unit test

* fix integration tests

* revert perf test back

* Ensure target class indices are of type long in loss calculations (#4143)

* Ensure target class indices are of type long in loss calculations

* update changelog

* Fix arrow format reader for multiclass ROI case (#4145)

Fix arrow format reader for multiclass roi case

* Update classification in converter (#4146)

* Prepare 2.2.1 release (#4147)

prepare 2.2.1 release

* Update codeowners - releases/2.2.0 (#4155)

* Update codeowners

* newline

* Support Ellipse Shape for InstSeg algo (#4152)

* ellipse shape

* Update changelog

* update transform

* update

* Allow empty anno

* Update todo

* BC improvement (#4154)

* Add OTX version to exported models

* Forward data format to tiling dataset to fix arrow handling

* Workaround missing label_ids

* Version up

* Update release notes

* Update changelog

* merge conflicts

* resolve onflicts

* update all_label_ids

* fix the rest files

* update tests

* fix linter

* fix unit test

* fix unite tests 2

* Support Ellipse Shape for InstSeg algo (#4152)

* ellipse shape

* Update changelog

* update transform

* update

* Allow empty anno

* Update todo

* fix linter

* fix hungarian matcher

* fix augmentations tests

* fix dinov2 tiling

---------

Signed-off-by: Ashwin Vaidya <ashwinnitinvaidya@gmail.com>
Co-authored-by: Yunchu Lee <yunchu.lee@intel.com>
Co-authored-by: Harim Kang <harim.kang@intel.com>
Co-authored-by: Emily Chun <emily.chun@intel.com>
Co-authored-by: Eunwoo Shin <eunwoo.shin@intel.com>
Co-authored-by: Kim, Sungchul <sungchul.kim@intel.com>
Co-authored-by: Vladislav Sovrasov <sovrasov.vlad@gmail.com>
Co-authored-by: Sooah Lee <sooah.lee@intel.com>
Co-authored-by: Eugene Liu <eugene.liu@intel.com>
Co-authored-by: Wonju Lee <wonju.lee@intel.com>
Co-authored-by: Ashwin Vaidya <ashwin.vaidya@intel.com>
Co-authored-by: Leonardo Lai <leonardo.lai@intel.com>
Labels
DOC Improvements or additions to documentation OTX 2.0 TEST Any changes in tests
3 participants