Evaluation metrics task force
- Classification
- Binary / Probabilistic / Multi Label segmentation
- Regression / Generative tasks
- Object detection
-
Comparison possible across architectures -> network agnostic
-
Most of the metrics can be used in different settings with different input types (1D, 2D, 3D arrays), so the implementation needs to accommodate these different contexts.
-
Confusion matrix based evaluation measures should be available for both classification and segmentation tasks
-
Binary vs Probabilistic
- For segmentation - binary is priority
- Probabilistic output transformed into multiple binary outputs
- Measures at population level
- Volume correlations for instance
- Compare to baseline
-
Need for a report
- Include result image by image (ordered by metric with associated percentile performance over full set)
- Include aggregation of results (with a flag to aggregate specific groups of loads / characteristics)
- Choose a set of core metrics that are always put in the report; allow additional optional ones to be added
- Basic statistics for aggregation over population
- Need guidelines on aggregation of metrics for reporting
- Implementation of the "rank" analysis (cf. Medical Segmentation Decathlon)
-
Importance of the documentation
- The documentation should include the contexts in which the evaluation measure is appropriate
- Synonyms and direct transformations to other usual measures should be considered
- Indication of similar / highly correlated metrics
-
Thresholds for different subgroups of evaluation (small - medium - large)
- Current problem is that we don't know what the most appropriate thresholds are - need an option to set them and a suggested default
- Relevant for segmentation and object detection
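A minimal sketch of user-settable size thresholds with a default, assuming object size is measured in voxels. The names `size_group`, `group_scores` and `DEFAULT_SIZE_THRESHOLDS` are hypothetical, and the default cut-offs are placeholders only, not recommended values:

```python
from bisect import bisect_right

# Placeholder cut-offs in voxels (assumption, NOT recommended values):
# volume < 10 is "small", 10 <= volume < 100 is "medium", otherwise "large".
DEFAULT_SIZE_THRESHOLDS = (10, 100)
SIZE_NAMES = ("small", "medium", "large")


def size_group(volume, thresholds=DEFAULT_SIZE_THRESHOLDS):
    """Return the size subgroup of an object given its volume."""
    return SIZE_NAMES[bisect_right(thresholds, volume)]


def group_scores(cases, thresholds=DEFAULT_SIZE_THRESHOLDS):
    """Split (volume, score) pairs into per-subgroup lists of scores."""
    groups = {name: [] for name in SIZE_NAMES}
    for volume, score in cases:
        groups[size_group(volume, thresholds)].append(score)
    return groups


# Usage: default thresholds, then user-defined ones.
by_size = group_scores([(5, 0.4), (50, 0.7), (500, 0.9)])
custom = size_group(50, thresholds=(60, 200))
```

Keeping the thresholds as a plain argument lets each task (segmentation, object detection) override the default without changing the grouping logic.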
-
Definition of specific cases of probabilistic outputs / multi thresholds analysis and multi label aggregation
- Probabilistic outputs
- Fuzzy metrics
- Choosing a set of thresholds and applying the binary metrics at each
- Choosing different operating points on the ROC curve
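The multi-threshold option above could look roughly like the sketch below. It uses flat Python sequences to stay self-contained (numpy arrays would be flattened first), and the function names and the default threshold values are assumptions for illustration:

```python
def dice(ref, seg):
    """Dice overlap of two flat binary sequences; NaN when both are empty."""
    tp = sum(1 for r, s in zip(ref, seg) if r and s)
    total = sum(1 for r in ref if r) + sum(1 for s in seg if s)
    return 2.0 * tp / total if total else float("nan")


def dice_per_threshold(ref, prob, thresholds=(0.25, 0.5, 0.75)):
    """Binarize `prob` at each threshold and score each result against `ref`."""
    return {t: dice(ref, [p >= t for p in prob]) for t in thresholds}


# Usage: one binary reference, one probabilistic output.
ref = [1, 1, 0, 0]
prob = [0.9, 0.4, 0.6, 0.1]
scores = dice_per_threshold(ref, prob)
```

Any other binary metric can be substituted for `dice` here, since the probabilistic output is simply converted into one binary segmentation per threshold.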
-
Multilabel
- Specific multilabel metrics
- Cost according to distance -> needs additional input
- Cost of confusion -> providing typical ways of conveying the information
- Weighted / micro or macro metrics
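To make the weighted / micro / macro distinction concrete, here is a small sketch for a multi-label Dice score on flat label sequences. All function names are hypothetical, and the weighting by reference support (number of reference voxels per label) is one possible choice rather than an agreed definition:

```python
def label_counts(ref, seg, label):
    """Return (tp, fp, fn) for one label, treating it as the positive class."""
    tp = sum(1 for r, s in zip(ref, seg) if r == label and s == label)
    fp = sum(1 for r, s in zip(ref, seg) if r != label and s == label)
    fn = sum(1 for r, s in zip(ref, seg) if r == label and s != label)
    return tp, fp, fn


def dice_from_counts(tp, fp, fn):
    """Dice from confusion counts; NaN when the label is absent everywhere."""
    denom = 2 * tp + fp + fn
    return 2.0 * tp / denom if denom else float("nan")


def aggregate_dice(ref, seg, labels):
    """Macro, weighted and micro Dice (assumes each label occurs in ref)."""
    counts = {lab: label_counts(ref, seg, lab) for lab in labels}
    per_label = {lab: dice_from_counts(*c) for lab, c in counts.items()}
    support = {lab: c[0] + c[2] for lab, c in counts.items()}  # ref voxels
    macro = sum(per_label.values()) / len(labels)
    weighted = sum(per_label[lab] * support[lab] for lab in labels) / sum(support.values())
    micro = dice_from_counts(*(sum(c[i] for c in counts.values()) for i in range(3)))
    return {"per_label": per_label, "macro": macro, "weighted": weighted, "micro": micro}


# Usage on a toy two-label example.
result = aggregate_dice([1, 1, 1, 1, 2, 2], [1, 1, 1, 2, 2, 1], labels=(1, 2))
```

Macro averaging gives each label equal weight, while micro averaging pools the counts first and is therefore dominated by the larger labels.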
-
Investigation on strategies of aggregation at population level
- Volumetric correlation
- Correlation with clinical status/measure
- Aggregating in specific groups (e.g. performance across specific lesion loads)
- Statistics over results (mean / median, std / IQR, min, max, 5th and 95th percentiles) with associated case ID -> important to suggest publication of average, best and worst results
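A possible shape for the population-level statistics, sketched with the standard library only. The case IDs are made up, `summarize` and `nearest_rank` are hypothetical names, and the nearest-rank percentile is an assumption chosen because it always returns an actual case (so a case ID can be attached):

```python
import math
import statistics


def nearest_rank(sorted_items, pct):
    """Return the (case_id, value) pair at the nearest-rank percentile."""
    k = max(1, math.ceil(pct / 100 * len(sorted_items)))
    return sorted_items[k - 1]


def summarize(scores):
    """Summarize {case_id: value}; assumes >= 2 cases and no NaN values."""
    items = sorted(scores.items(), key=lambda kv: kv[1])
    values = [v for _, v in items]
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),
        "iqr": q3 - q1,
        "min": items[0],  # (case_id, value) of the worst case
        "p5": nearest_rank(items, 5),
        "p95": nearest_rank(items, 95),
        "max": items[-1],  # (case_id, value) of the best case
    }


# Usage with hypothetical per-subject Dice scores.
summary = summarize(
    {"sub-01": 0.91, "sub-02": 0.62, "sub-03": 0.85, "sub-04": 0.78, "sub-05": 0.88}
)
```

Returning the case ID alongside min, max and the percentiles makes it straightforward to pull out the best, worst and typical cases for publication.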
-
Literature on the different metrics
- https://bmcmedimaging.biomedcentral.com/articles/10.1186/s12880-015-0068-x
- https://www.sciencedirect.com/science/article/pii/S0169260709001424?casa_token=JV5okwf00GMAAAAA:aSceh5hWT3D05PPXD-RrlpW0Kdyp7ViR4N-AzbXET0QoEGimK2Ge6ol6eI4g2SLZjOntukogMQ
- https://ieeexplore.ieee.org/document/1616166?denied=
- https://d1wqtxts1xzle7.cloudfront.net/37219940/5215ijdkp01.pdf?1428314059=&response-content-disposition=inline%3B+filename%3DA_REVIEW_ON_EVALUATION_METRICS_FOR_DATA.pdf&Expires=1594367718&Signature=Ec~5dxmTRSFY5-6eIzrkW402x7cgn0FGUIm-bPlbxb0tZiRLWr-XoDfVsRdj5IKf4pOO7VoY9yeameruCL5jva9UZkOXvb5kzqnqHux3dShm-gTNIDmDgfIdqBsrCVTEJxgiRpMPHwl9z9tm73kYsZSVMK2nvgUuNiSzCv~7~67lACFaoah0CQxkavhC0WU7ADUT-Q6C9dHWzRZphz7VwgqsRM6SHkTaWskjD3MMpSaYvyw~qFuKsCZE3fgb2hGiCuiWwJIdqoc~GOD46heBKcy4y91MIodooq-ZhRUC09SJAySFzEZolnzu29soKXPcRGw-HccUg-fuLXgmkEc4QQ__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA
-
List of existing sources for implementation
- https://scikit-learn.org/stable/modules/model_evaluation.html
- https://medium.com/pytorch/pytorch-lightning-metrics-35cb5ab31857
- https://github.com/NifTK/NiftyNet/tree/dev/niftynet/evaluation
- https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/cocoeval.py
-
Still in progress
- Definition of what to do in edge cases (no definition of the metric)
- nan values
- Specific tasks with particular metrics
- Tractography
- Vessel segmentation
- Metrics for assessment of distribution / evaluation of uncertainty
- Allow the evaluation suite to take in a single pair of ref/seg images or a folder of matching pairs (by subject name)
- Use np arrays in memory for the different images (i.e. not forcing everything to be in folders)
- Develop util functions to allow folder / file loading to memory
- Computation should be on CPU - ensure torch tensors are converted back to numpy arrays
- Link classical metrics to their trainable counterparts (GPU-based if possible, with possible backpropagation)
- Allow for binary or probabilistic input
- For segmentation - provide results at different thresholds (potentially predefined by user)
- Allow for multi label input
- Produce a report csv file for the evaluation with aggregate statistics over the different metrics - use a pandas DataFrame to gather all results and save to csv/xls depending on the evaluation (multi label / mono label / probability thresholds…)
- Specify the output format as an option - suggest one according to the task
- Csv/xlsx/html for individual subjects
- Potentially html for aggregation, building on challengeR (going towards WebToolkit) - to discuss with the dev team the best way to integrate
- Implement Dice score metrics allowing for multiple options when the metric is not defined
- Add optional epsilon to handle nans if needed (both on numerator and denominator)
- Optional function to handle nans in aggregation
- Implement nan-handling functions
- For all metrics - 2 outputs - nan_handled / not nan_handled - To discuss further
- Implement Hausdorff distance using percentile as argument
- Implement binary based confusion matrix metrics
- Report on raw data from confusion matrix
- Implement GDSC
- Implement Surface dice
- Implement Average surface distance.
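As a starting point for the confusion-matrix and Dice items in the list above, the sketch below reports the raw confusion counts and computes a Dice score with selectable behaviour when the metric is undefined (both images empty). The `undefined` option names, the `epsilon` handling and `nanmean` are assumptions for discussion, not a settled API:

```python
import math


def confusion_counts(ref, seg):
    """Raw confusion-matrix counts for two flat binary sequences."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for r, s in zip(ref, seg):
        if r and s:
            counts["tp"] += 1
        elif s:
            counts["fp"] += 1
        elif r:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def dice(ref, seg, undefined="nan", epsilon=0.0):
    """Dice score; `undefined` picks the result when the denominator is 0."""
    c = confusion_counts(ref, seg)
    # Optional epsilon is added to both numerator and denominator.
    num = 2 * c["tp"] + epsilon
    den = 2 * c["tp"] + c["fp"] + c["fn"] + epsilon
    if den == 0:
        if undefined == "nan":
            return math.nan
        if undefined == "one":  # both empty counted as perfect agreement
            return 1.0
        if undefined == "zero":
            return 0.0
        raise ValueError(f"unknown option: {undefined!r}")
    return num / den


def nanmean(values):
    """Mean over non-NaN values; NaN if nothing is left."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else math.nan


# Usage: raw counts, a defined score, and aggregation that skips NaN cases.
counts = confusion_counts([1, 1, 0, 0], [1, 0, 1, 0])
mean_dice = nanmean([dice([1, 1, 0, 0], [1, 0, 1, 0]), dice([0, 0], [0, 0]), 1.0])
```

Returning NaN by default keeps the "not nan_handled" value visible, while `undefined`, `epsilon` and `nanmean` each give one route to a "nan_handled" result.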