[Bug]: Inconsistent pixel metrics between validation and testing on the same dataset #2301

jordanvaneetveldt · 2024-09-09T12:29:20Z

Describe the bug

I am encountering an inconsistency in the reported metrics between the validation and testing phases, despite using the same dataset (val_split_mode=ValSplitMode.SAME_AS_TEST). Specifically, the pixel_xxx metrics values vary significantly between the validation and testing stages, even though both phases are using the same data. Interestingly, the image_xxx metrics remain consistent across both phases.

I am using an EfficientAD model with a batch size of 1 for both training and validation/testing. Could there be a reason why the pixel_xxx metrics differ between validation and testing, despite the consistent dataset and batch size?

Dataset

MVTec

Model

Other (please specify in the field below)

Steps to reproduce the behavior

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from anomalib.data import MVTec
from anomalib.engine import Engine
from anomalib.models import EfficientAd
from anomalib.data.utils.split import ValSplitMode


model = EfficientAd()
data = MVTec(
    train_batch_size=1,
    eval_batch_size=1,
    val_split_mode=ValSplitMode.SAME_AS_TEST,
)

pixel_metrics = {
    "F0_5Score": {
        "class_path": "torchmetrics.FBetaScore",
        "init_args": {"task": "binary", "beta": 0.5},
    },
    "Recall": {
        "class_path": "torchmetrics.Recall",
        "init_args": {"task": "binary"},
    },
    "Precision": {
        "class_path": "torchmetrics.Precision",
        "init_args": {"task": "binary"},
    },
}
engine = Engine(
    pixel_metrics=pixel_metrics,
    threshold="ManualThreshold",
    accelerator="auto",  # \<"cpu", "gpu", "tpu", "ipu", "hpu", "auto">,
    devices=1,
    logger=None,
    max_epochs=1,
)

engine.fit(datamodule=data, model=model)
print("Training Done")

engine.validate(datamodule=data, model=model)
print("Validation Done")

engine.test(datamodule=data, model=model)
print("Testing Done")

OS information

OS information:

OS: Ubuntu 20.04.6 LTS
Python version: 3.10.14
Anomalib version: 1.2.0.dev0
PyTorch version: 2.4.0+cu121
CUDA/cuDNN version: 12.1
GPU models and configuration: 1x GeForce RTX 3090
Any other relevant information: tochmetrics: 1.4.1

Expected behavior

The metrics from the validation and testing phases should be consistent since the same dataset is used for both

Screenshots

Pip/GitHub

GitHub

What version/branch did you use?

1.2.0.dev0

Configuration YAML

API

Logs

Calculate Validation Dataset Quantiles: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 83/83 [00:00<00:00, 84.32it/s]
Epoch 0: 100%|█| 209/209 [00:15<00:00, 13.81it/s, train_st_step=7.730, train_ae_step=0.771, train_stae_step=0.0552, train_loss_step=8.560, pixel_F0_5Score=0.236, pixel_Precision=0.212, pixel_Recall=0.440, train_st_ep`Trainer.fit` stopped: `max_epochs=1` reached.                                                                                                                                                                          
Epoch 0: 100%|█| 209/209 [00:15<00:00, 13.61it/s, train_st_step=7.730, train_ae_step=0.771, train_stae_step=0.0552, train_loss_step=8.560, pixel_F0_5Score=0.236, pixel_Precision=0.212, pixel_Recall=0.440, train_st_ep
Training Done
F1Score class exists for backwards compatibility. It will be removed in v1.1. Please use BinaryF1Score from torchmetrics instead
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1]
Calculate Validation Dataset Quantiles: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 83/83 [00:00<00:00, 85.53it/s]
Validation DataLoader 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 83/83 [00:01<00:00, 68.01it/s]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃      Validate metric      ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│        image_AUROC        │    0.9642857313156128     │
│       image_F1Score       │    0.6451612710952759     │
│      pixel_F0_5Score      │    0.23632192611694336    │
│      pixel_Precision      │    0.21182596683502197    │
│       pixel_Recall        │    0.4397238790988922     │
└───────────────────────────┴───────────────────────────┘
Validation Done
F1Score class exists for backwards compatibility. It will be removed in v1.1. Please use BinaryF1Score from torchmetrics instead
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1]
Testing DataLoader 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 83/83 [00:16<00:00,  4.99it/s]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric        ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│        image_AUROC        │    0.9642857313156128     │
│       image_F1Score       │    0.6451612710952759     │
│      pixel_F0_5Score      │    0.3458831310272217     │
│      pixel_Precision      │    0.9593373537063599     │
│       pixel_Recall        │    0.09721758216619492    │
└───────────────────────────┴───────────────────────────┘
Testing Done

Code of Conduct

I agree to follow this project's Code of Conduct

The text was updated successfully, but these errors were encountered:

jordanvaneetveldt · 2024-09-10T15:54:42Z

After debugging, I believe normalization is applied only during testing but not during validation. Could that be the case? If so, it would explain the inconsistency. Setting normalization=NormalizationMethod.NONE resolves the inconsistency, but then the reported metrics become inaccurate, as the anomaly map values are no longer constrained between 0 and 1, and TorchMetrics automatically applies a sigmoid function to normalize these values.

Additionally, if normalization is only applied during testing, wouldn't the F1AdaptiveThreshold be computed on non-normalized anomaly maps during validation and then applied on normalized maps during testing?

Samyarrahimi · 2024-09-11T09:29:50Z

The MVTec class includes a val_split_ratio argument, which defaults to 0.5. This parameter specifies the fraction of either the training or test images that will be set aside for validation. It seems that since you set the validation data as the test set and didn't specify a val_split_ratio, Anomalib used 50% of the test data for validation. This could explain the discrepancy between your validation and test set results.

here is the source code:
https://github.com/openvinotoolkit/anomalib/blob/main/src/anomalib/data/image/mvtec.py

jordanvaneetveldt · 2024-09-11T09:52:33Z

When setting val_split_mode: same_as_test, the val_split_ratio is ignored, and both datasets contain the same images.

I believe the discrepancy occurs because normalization is not applied during validation. When I set normalization=NormalizationMethod.NONE, I obtained the same results for both validation and testing. However, in this case, torchmetrics are computed using the sigmoid function, which leads to incorrect results.

The issue is not specific to EfficientAD, as I encountered the same problem with PatchCore.

alexriedel1 · 2024-09-13T10:18:16Z

During validation the normalization metrics are calculated but normalization is not applied. During testing the previous calculated metrics are applied to the results.
Normalization is applied before thresholding during validation.

So you shouldn't trust the validation results too much for precision and recall.

To use the manual threshold, you need to pass it in a dict-object structure just like you pass your metrics.

config = {
            ...    "class_path": "ManualThreshold",
            ...    "init_args": {"default_value": 0.7}
            ... }

jordanvaneetveldt · 2024-09-13T11:42:44Z

Thank you for your answer! So this mean that early stopping isn’t feasible if the validation results are unreliable? For instance, I would like to monitor precision to keep the model with the fewest false positives.

alexriedel1 · 2024-09-13T12:35:42Z

During training it should not matter, because you validate each epoch without normalization. You then can compare all the non-normalized results.
After training if you test, you will get slightly different results however.
But testing on your validation data isn't even recommended because you then have contaminated test results that appear better than they actually are.

jordanvaneetveldt · 2024-09-13T14:39:37Z

Thank you for the clarification. However, it seems that normalization during validation might still have an effect on the metrics, potentially leading to incorrect values. Using the code below, I noticed that the metrics consistently drop to a constant value after the first epoch, and the model stops training early (at epoch 5, due to the patience parameter being set to 5).

from anomalib.data import MVTec
from anomalib.engine import Engine
from anomalib.models import EfficientAd
from anomalib.data.utils.split import ValSplitMode
from anomalib.utils.normalization import NormalizationMethod

from lightning.pytorch.callbacks import EarlyStopping
from anomalib.callbacks.checkpoint import ModelCheckpoint

model = EfficientAd()
data = MVTec(
    train_batch_size=1,
    val_split_mode=ValSplitMode.SAME_AS_TEST,
)

callbacks = [
    EarlyStopping(
        monitor="pixel_F1Score",
        mode="max",
        patience=5,
    ),
]

pixel_metrics = {
    "F1Score": {
        "class_path": "torchmetrics.F1Score",
        "init_args": {"task": "binary"},
    },
    "Recall": {
        "class_path": "torchmetrics.Recall",
        "init_args": {"task": "binary"},
    },
    "Precision": {
        "class_path": "torchmetrics.Precision",
        "init_args": {"task": "binary"},
    },
}
engine = Engine(
    callbacks=callbacks,
    pixel_metrics=pixel_metrics,
    accelerator="auto",  # \<"cpu", "gpu", "tpu", "ipu", "hpu", "auto">,
    devices=1,
    logger=None,
)

engine.fit(datamodule=data, model=model)
print("Training Done")

When you mention that normalization is applied before thresholding during validation, are you referring to the computation of the F1AdaptiveThreshold? During metric computation (such as F1Score, Precision, etc.), Torchmetrics automatically applies a sigmoid function to the data if it is outside the [0,1] range. However, if the F1AdaptiveThreshold is computed on the MinMax normalized data, this creates a mismatch, as the threshold will not align with the data after the sigmoid transformation.

Samyarrahimi · 2024-10-27T18:22:59Z

During validation the normalization metrics are calculated but normalization is not applied. During testing the previous calculated metrics are applied to the results. Normalization is applied before thresholding during validation.

@alexriedel1 @jordanvaneetveldt I'd appreciate it if you could help me understand this correctly.

This sentence feels ambiguous to me. Aren't the first and last sentences contradictory? I still don't fully understand the normalization and thresholding approach in the validation phase. If normalization is not applied (and only the metrics for normalization are calculated), how are classification labels determined during validation? Also, are we relying on the sigmoid function applied in torchmetrics to calculate metrics like precision and recall since the model's output is not in the [0,1] range? (#2337)

I understand that normalization is applied in the testing phase using the minimum and maximum scores from validation. However, when it comes to the threshold, the code in min_max_normalization.py doesn’t seem to use any threshold calculated during validation (if such a threshold is calculated at all, assuming I am using F1AdaptiveThreshold).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug]: Inconsistent pixel metrics between validation and testing on the same dataset #2301

[Bug]: Inconsistent pixel metrics between validation and testing on the same dataset #2301

jordanvaneetveldt commented Sep 9, 2024

jordanvaneetveldt commented Sep 10, 2024

Samyarrahimi commented Sep 11, 2024

jordanvaneetveldt commented Sep 11, 2024

alexriedel1 commented Sep 13, 2024

jordanvaneetveldt commented Sep 13, 2024

alexriedel1 commented Sep 13, 2024 •

edited

Loading

jordanvaneetveldt commented Sep 13, 2024

Samyarrahimi commented Oct 27, 2024

[Bug]: Inconsistent pixel metrics between validation and testing on the same dataset #2301

[Bug]: Inconsistent pixel metrics between validation and testing on the same dataset #2301

Comments

jordanvaneetveldt commented Sep 9, 2024

Describe the bug

Dataset

Model

Steps to reproduce the behavior

OS information

Expected behavior

Screenshots

Pip/GitHub

What version/branch did you use?

Configuration YAML

Logs

Code of Conduct

jordanvaneetveldt commented Sep 10, 2024

Samyarrahimi commented Sep 11, 2024

jordanvaneetveldt commented Sep 11, 2024

alexriedel1 commented Sep 13, 2024

jordanvaneetveldt commented Sep 13, 2024

alexriedel1 commented Sep 13, 2024 • edited Loading

jordanvaneetveldt commented Sep 13, 2024

Samyarrahimi commented Oct 27, 2024

alexriedel1 commented Sep 13, 2024 •

edited

Loading