
evaluate/metrics/mean_iou computes recall (sensitivity) instead of IoU #421

@FlorinAndrei

There are several issues with the mean_iou code here:

https://github.com/huggingface/evaluate/blob/c447fc8eda9c62af501bfdc6988919571050d950/metrics/mean_iou/mean_iou.py

The most important is that it actually computes recall (sensitivity) instead of IoU. The root cause appears to be that `mask` is computed from `label` only, but is then applied to both `pred_label` and `label` (lines 144-149):

```python
mask = label != ignore_index
mask = np.not_equal(label, ignore_index)
pred_label = pred_label[mask]
label = np.array(label)[mask]

intersect = pred_label[pred_label == label]
```

Because both `pred_label` and `label` are masked with pixels selected from `label` only, the function ends up computing the ratio of the intersection to the label area (recall), instead of the ratio of the intersection to the union of prediction and label (IoU).
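Here is a minimal numeric sketch of the effect (my own construction, not code from the library), assuming a single foreground class and `ignore_index` set to the background value 0: the mask derived from `label` silently drops the false-positive pixels, so the "union" collapses to the label area.

```python
import numpy as np

# Toy single-class example: foreground pixels are 1, background is 0,
# and ignore_index is assumed to be 0 (the background).
label      = np.array([0, 0, 1, 1, 1, 0, 0, 0])
pred_label = np.array([0, 1, 1, 1, 0, 1, 0, 0])

tp = np.sum((pred_label == 1) & (label == 1))  # 2
fp = np.sum((pred_label == 1) & (label == 0))  # 2
fn = np.sum((pred_label == 0) & (label == 1))  # 1

recall   = tp / (tp + fn)       # 2/3 ~= 0.667
true_iou = tp / (tp + fp + fn)  # 2/5  = 0.4

# What the masking above effectively does: the mask comes from `label` only,
# so the false-positive pixels (pred == 1 where label == 0) are discarded
# before the "union" is computed.
mask = label != 0                  # ignore_index = 0
p, l = pred_label[mask], label[mask]
intersect  = np.sum((p == 1) & (l == 1))    # 2
area_pred  = np.sum(p == 1)                 # 2 (the false positives are gone)
area_label = np.sum(l == 1)                 # 3
union = area_pred + area_label - intersect  # 3, i.e. just the label area

print(intersect / union)  # 0.667 -- matches recall, not the true IoU of 0.4
```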

It's a subtle error that is hard to discover because both IoU and recall take values between 0 and 1, and both behave similarly during training.

The problem is that recall is never lower than IoU, so the metric overestimates model performance. The unfortunate side effect is that I've wasted a lot of time training a SegFormer model based on wrong assumptions.

I only discovered this because I wrote my own metric functions, starting from TP / TN / FP / FN, and from those four values computed Sorensen-Dice (a.k.a. F1-score), precision, recall, and (on a whim) IoU. This is my code (it's not optimized and the function docstrings are wrong, but it works):

https://gist.github.com/FlorinAndrei/da9ab770b16bfc671075d04a030f548b
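For reference, the kind of computation the gist performs looks roughly like this (a simplified sketch of the standard per-image formulas for a binary mask, not the gist itself):

```python
import numpy as np

# Simplified sketch (not the linked gist): standard binary-segmentation metrics
# derived from TP / TN / FP / FN, with class pixels = 1 and background = 0.
def binary_seg_metrics(pred: np.ndarray, label: np.ndarray) -> dict:
    pred, label = pred.astype(bool), label.astype(bool)
    tp = np.sum(pred & label)
    tn = np.sum(~pred & ~label)  # counted for completeness; unused below
    fp = np.sum(pred & ~label)
    fn = np.sum(~pred & label)
    eps = 1e-12  # guard against division by zero on empty masks
    return {
        "precision": tp / (tp + fp + eps),
        "recall": tp / (tp + fn + eps),
        "dice": 2 * tp / (2 * tp + fp + fn + eps),  # Sorensen-Dice / F1
        "iou": tp / (tp + fp + fn + eps),           # Jaccard index
    }
```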

I was very confused initially when my IoU was different from evaluate/metrics/mean_iou. But then I noticed my recall was the same as "IoU" from evaluate/metrics/mean_iou. I've checked my code in a few different ways and I believe it is correct.

Here's a visual sample:

[Screenshot: training metric plots]

`eval/iou_lesion` is the result from evaluate/metrics/mean_iou. `eval/loss` is just the evaluation loss. The rest are computed by my code; `eval/niou_lesion` is the IoU from my code. Notice how the library's "IoU" is identical to the recall value from my code.

My code has only been tested with SegFormer, and only on datasets with a single class plus background, where the label pixels are 1 and the background is 0. I have not tested multiclass segmentation or `reduce_labels=True`.

@lvwerra @lhoestq @mariosasko @lewtun @dleve123 @NielsRogge
