[Bug]: Inconsistent pixel metrics between validation and testing on the same dataset #2301
Comments
After debugging, I believe normalization is applied only during testing but not during validation. Could that be the case? If so, it would explain the inconsistency. Additionally, if normalization is only applied during testing, wouldn't the F1AdaptiveThreshold be computed on non-normalized anomaly maps during validation and then applied to normalized maps during testing?
The MVTec class includes a val_split_ratio argument, which defaults to 0.5. This parameter specifies the fraction of the training or test images that will be set aside for validation. Since you set the validation data to the test set and didn't specify a val_split_ratio, it seems Anomalib used 50% of the test data for validation, which could explain the discrepancy between your validation and test set results; the relevant logic is in the MVTec datamodule's source code.
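For illustration, here is a minimal sketch of how these arguments could be made explicit when constructing the datamodule (the argument names are taken from the code later in this thread; the exact interaction of val_split_ratio with SAME_AS_TEST may vary by version):

```python
from anomalib.data import MVTec
from anomalib.data.utils.split import ValSplitMode

# Make the validation split explicit so there is no ambiguity about which
# images the validation metrics are computed on.
data = MVTec(
    val_split_mode=ValSplitMode.SAME_AS_TEST,  # reuse the test images for validation
    val_split_ratio=0.5,  # default mentioned above; controls how much data is held out
)
```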
I believe the discrepancy occurs because normalization is not applied during validation. I see the same behaviour with other settings, so the issue does not seem to be specific to this particular configuration.
During validation, the normalization metrics (the min/max statistics) are calculated but normalization itself is not applied. During testing, the previously calculated statistics are applied to the results. So you shouldn't trust the validation results too much for precision and recall. To use a manual threshold, you need to pass it in a dict structure, just like you pass your metrics.
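As an illustration of that dict structure, here is a rough sketch. The ManualThreshold class path, its default_value argument, and the Engine threshold parameter are assumptions about the API rather than verified names; check your anomalib version:

```python
from anomalib.engine import Engine

# Sketch: pass a manual threshold in the same class_path/init_args style used for
# the metrics. The class path and init_args key below are assumptions, not verified
# against a specific anomalib release.
engine = Engine(
    threshold={
        "class_path": "ManualThreshold",
        "init_args": {"default_value": 0.5},
    },
)
```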
Thank you for your answer! So this means that early stopping isn't feasible if the validation results are unreliable? For instance, I would like to monitor precision to keep the model with the fewest false positives.
During training it should not matter, because you validate each epoch without normalization. You can then compare all the non-normalized results with each other.
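For instance, monitoring validation precision for early stopping might look like the following sketch (the logged metric name "pixel_Precision" is an assumption based on the "pixel_" prefix used elsewhere in this thread; verify the name that is actually logged):

```python
from lightning.pytorch.callbacks import EarlyStopping

# Stop training when validation pixel-level precision stops improving.
# "pixel_Precision" is an assumed metric key; check the logged metric names.
early_stopping = EarlyStopping(monitor="pixel_Precision", mode="max", patience=5)
```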
Thank you for the clarification. However, it seems that normalization during validation might still have an effect on the metrics, potentially leading to incorrect values. Using the code below, I noticed that the metrics consistently drop to a constant value after the first epoch, and the model stops training early (at epoch 5, due to the patience parameter being set to 5).

```python
from anomalib.data import MVTec
from anomalib.engine import Engine
from anomalib.models import EfficientAd
from anomalib.data.utils.split import ValSplitMode
from anomalib.utils.normalization import NormalizationMethod
from lightning.pytorch.callbacks import EarlyStopping
from anomalib.callbacks.checkpoint import ModelCheckpoint
model = EfficientAd()

data = MVTec(
    train_batch_size=1,
    val_split_mode=ValSplitMode.SAME_AS_TEST,
)

callbacks = [
    EarlyStopping(
        monitor="pixel_F1Score",
        mode="max",
        patience=5,
    ),
]

pixel_metrics = {
    "F1Score": {
        "class_path": "torchmetrics.F1Score",
        "init_args": {"task": "binary"},
    },
    "Recall": {
        "class_path": "torchmetrics.Recall",
        "init_args": {"task": "binary"},
    },
    "Precision": {
        "class_path": "torchmetrics.Precision",
        "init_args": {"task": "binary"},
    },
}

engine = Engine(
    callbacks=callbacks,
    pixel_metrics=pixel_metrics,
    accelerator="auto",  # one of "cpu", "gpu", "tpu", "ipu", "hpu", "auto"
    devices=1,
    logger=None,
)

engine.fit(datamodule=data, model=model)
print("Training Done") When you mention that normalization is applied before thresholding during validation, are you referring to the computation of the |
@alexriedel1 @jordanvaneetveldt I'd appreciate it if you could help me understand this correctly. The explanation above feels ambiguous to me; aren't its first and last sentences contradictory? I still don't fully understand the normalization and thresholding approach in the validation phase. If normalization is not applied (and only the statistics needed for normalization are calculated), how are classification labels determined during validation? Also, are we relying on a sigmoid function applied elsewhere in the pipeline? I understand that normalization is applied in the testing phase using the minimum and maximum scores from validation. However, when it comes to the threshold, the code in min_max_normalization.py doesn't seem to use any threshold calculated during validation (if such a threshold is calculated at all in my setup).
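To make the discussion concrete, here is a rough sketch of threshold-centred min-max normalization, which maps the raw threshold to 0.5 on the normalized scale. This is an assumption about the scheme being discussed, not a verbatim copy of anomalib's min_max_normalization.py:

```python
import torch

def min_max_normalize(
    scores: torch.Tensor, threshold: float, min_val: float, max_val: float
) -> torch.Tensor:
    """Rescale raw anomaly scores with validation min/max, centring the threshold at 0.5."""
    normalized = (scores - threshold) / (max_val - min_val) + 0.5
    return normalized.clamp(min=0.0, max=1.0)
```

Under such a scheme, comparing the normalized map against 0.5 is intended to give the same labels as comparing the raw map against the raw threshold, which is why the threshold and the min/max statistics need to be computed consistently on the same (validation) scores.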
Describe the bug
I am encountering an inconsistency in the reported metrics between the validation and testing phases, despite using the same dataset (val_split_mode=ValSplitMode.SAME_AS_TEST). Specifically, the pixel_xxx metric values vary significantly between the validation and testing stages, even though both phases use the same data. Interestingly, the image_xxx metrics remain consistent across both phases. I am using an EfficientAD model with a batch size of 1 for both training and validation/testing. Could there be a reason why the pixel_xxx metrics differ between validation and testing, despite the consistent dataset and batch size?

Dataset
MVTec
Model
Other (please specify in the field below)
Steps to reproduce the behavior
OS information
Expected behavior
The metrics from the validation and testing phases should be consistent, since the same dataset is used for both.
Screenshots
Pip/GitHub
GitHub
What version/branch did you use?
1.2.0.dev0
Configuration YAML
API
Logs