
evaluate/metrics/mean_iou computes recall (sensitivity) instead of IoU #421

@FlorinAndrei

There are several issues with the mean_iou code here:

https://github.com/huggingface/evaluate/blob/c447fc8eda9c62af501bfdc6988919571050d950/metrics/mean_iou/mean_iou.py

The most important is that it actually computes recall (sensitivity) instead of IoU. The root cause appears to be that `mask` is computed from `label` only, but is then applied to both `pred_label` and `label` (lines 144-149):

```python
mask = label != ignore_index
mask = np.not_equal(label, ignore_index)
pred_label = pred_label[mask]
label = np.array(label)[mask]

intersect = pred_label[pred_label == label]
```

Because both `pred_label` and `label` are masked with pixels selected from `label` only, the function ends up computing the ratio of the intersection to the label area (recall), instead of the ratio of the intersection to the union of prediction and label (IoU).
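Here is a minimal numeric sketch of the effect (my own construction, not code from the library), assuming a single foreground class and `ignore_index` set to the background value 0: the mask derived from `label` silently drops the false-positive pixels, so the "union" collapses to the label area.

```python
import numpy as np

# Toy single-class example: foreground pixels are 1, background is 0,
# and ignore_index is assumed to be 0 (the background).
label      = np.array([0, 0, 1, 1, 1, 0, 0, 0])
pred_label = np.array([0, 1, 1, 1, 0, 1, 0, 0])

tp = np.sum((pred_label == 1) & (label == 1))  # 2
fp = np.sum((pred_label == 1) & (label == 0))  # 2
fn = np.sum((pred_label == 0) & (label == 1))  # 1

recall   = tp / (tp + fn)       # 2/3 ~= 0.667
true_iou = tp / (tp + fp + fn)  # 2/5  = 0.4

# What the masking above effectively does: the mask comes from `label` only,
# so the false-positive pixels (pred == 1 where label == 0) are discarded
# before the "union" is computed.
mask = label != 0                  # ignore_index = 0
p, l = pred_label[mask], label[mask]
intersect  = np.sum((p == 1) & (l == 1))    # 2
area_pred  = np.sum(p == 1)                 # 2 (the false positives are gone)
area_label = np.sum(l == 1)                 # 3
union = area_pred + area_label - intersect  # 3, i.e. just the label area

print(intersect / union)  # 0.667 -- matches recall, not the true IoU of 0.4
```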

It's a subtle error that is hard to discover because both IoU and recall take values between 0 and 1, and both behave similarly during training.

The problem is that recall is never lower than IoU, so the metric overestimates model performance. The unfortunate side effect is that I've wasted a lot of time training a SegFormer model based on wrong assumptions.

I only discovered this because I wrote my own metric functions, starting from TP / TN / FP / FN, and from those four values computed Sorensen-Dice (a.k.a. F1-score), precision, recall, and (on a whim) IoU. This is my code (it's not optimized and the function docstrings are wrong, but it works):

https://gist.github.com/FlorinAndrei/da9ab770b16bfc671075d04a030f548b
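For reference, the kind of computation the gist performs looks roughly like this (a simplified sketch of the standard per-image formulas for a binary mask, not the gist itself):

```python
import numpy as np

# Simplified sketch (not the linked gist): standard binary-segmentation metrics
# derived from TP / TN / FP / FN, with class pixels = 1 and background = 0.
def binary_seg_metrics(pred: np.ndarray, label: np.ndarray) -> dict:
    pred, label = pred.astype(bool), label.astype(bool)
    tp = np.sum(pred & label)
    tn = np.sum(~pred & ~label)  # counted for completeness; unused below
    fp = np.sum(pred & ~label)
    fn = np.sum(~pred & label)
    eps = 1e-12  # guard against division by zero on empty masks
    return {
        "precision": tp / (tp + fp + eps),
        "recall": tp / (tp + fn + eps),
        "dice": 2 * tp / (2 * tp + fp + fn + eps),  # Sorensen-Dice / F1
        "iou": tp / (tp + fp + fn + eps),           # Jaccard index
    }
```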

I was very confused initially when my IoU was different from evaluate/metrics/mean_iou. But then I noticed my recall was the same as "IoU" from evaluate/metrics/mean_iou. I've checked my code in a few different ways and I believe it is correct.

Here's a visual sample:

[Screenshot: training metric plots]

`eval/iou_lesion` is the result from evaluate/metrics/mean_iou. `eval/loss` is just the evaluation loss. The rest are computed by my code; `eval/niou_lesion` is the IoU from my code. Notice how the library's "IoU" is identical to the recall value from my code.

My code has only been tested with SegFormer, and only on datasets with a single class plus background, where the label pixels are 1 and the background is 0. I have not tested multiclass segmentation or `reduce_labels=True`.

@lvwerra @lhoestq @mariosasko @lewtun @dleve123 @NielsRogge
