Too slow and incorrect validation metrics #133

Adamusen · 2024-12-09T10:54:54Z

Describe the bug

There are two issues in the validation step, in ValidateModel in yolo/tools/solver.py (lines 47-62), which contains the following code:

    def validation_step(self, batch, batch_idx):
        batch_size, images, targets, rev_tensor, img_paths = batch
        H, W = images.shape[2:]
        predicts = self.post_process(self.ema(images), image_size=[W, H])
        batch_metrics = self.metric(
            [to_metrics_format(predict) for predict in predicts], [to_metrics_format(target) for target in targets]
        )

        self.log_dict(
            {
                "map": batch_metrics["map"],
                "map_50": batch_metrics["map_50"],
            },
            batch_size=batch_size,
        )
        return predicts

1. Calling self.metric() (a torchmetrics.detection.MeanAveragePrecision instance) at every validation step computes the results not only for the batch passed in at that step, but for everything currently in its accumulator, that is, every example accumulated over all previous steps. Validation therefore becomes progressively slower with each step. With more than a couple of hundred validation images in the dataset, validation takes extremely long.

2. The data loader's collate function pads the "targets" tensor with dummy all-zero boxes for images that have fewer ground-truth boxes than the image with the most labels, so that all labels can be batched into a single tensor. These dummy boxes are not filtered out and are still present when the validation metrics are calculated. This leads to incorrect (lower than they should be) metric values, since torchmetrics treats the dummy boxes as "missed" labels.
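To make the cost in item 1 concrete, here is a minimal sketch using a toy accumulator. `ToyMetric` and its `items_scanned` counter are hypothetical stand-ins that only mimic the behaviour described above; this is not the torchmetrics implementation.

```python
class ToyMetric:
    """Toy accumulating metric (hypothetical, for illustration only)."""

    def __init__(self):
        self.samples = []
        self.items_scanned = 0  # total items visited by compute()

    def update(self, batch):
        # Only store the new values; no evaluation happens here.
        self.samples.extend(batch)

    def compute(self):
        # Evaluation walks over everything accumulated so far.
        self.items_scanned += len(self.samples)
        return sum(self.samples) / len(self.samples)

    def __call__(self, batch):
        # Mimics the per-step pattern from the issue: update, then recompute.
        self.update(batch)
        return self.compute()


batches = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 1.0]]

# Pattern A: evaluate at every step.
per_step = ToyMetric()
for batch in batches:
    per_step(batch)

# Pattern B: accumulate at every step, evaluate once at the end.
deferred = ToyMetric()
for batch in batches:
    deferred.update(batch)
final = deferred.compute()

print(per_step.items_scanned, deferred.items_scanned, final)  # 20 8 0.75
```

With four batches of two items, the per-step pattern visits 2 + 4 + 6 + 8 = 20 items while the deferred pattern visits 8, and the gap grows quadratically with the number of steps.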

Proposed (and tested) solution

Change the above code snippet in ValidateModel to:

    def validation_step(self, batch, batch_idx):
        batch_size, images, targets, rev_tensor, img_paths = batch
        H, W = images.shape[2:]
        predicts = self.post_process(self.ema(images), image_size=[W, H])
        # Accumulate only; the results are computed in on_validation_epoch_end.
        # Target rows that sum to zero are collate padding and are dropped.
        self.metric.update(
            [to_metrics_format(predict) for predict in predicts],
            [to_metrics_format(target[target.sum(1) > 0]) for target in targets],
        )
        return predicts

1. Change the self.metric() call to self.metric.update(), which only adds the new values to the accumulator and does not calculate the results yet (that is done in "on_validation_epoch_end").
2. Drop the logging of per-batch metric values (which were incorrect anyway, as they were cumulative values).
3. Filter out the dummy boxes from the per-image "target" tensors before appending them to the metric's accumulator.
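The filtering in step 3 can be illustrated without PyTorch. The sketch below uses plain lists as a stand-in for the padded "targets" tensor; the row layout `[class_id, x1, y1, x2, y2]`, the sample values, and the helper name `drop_padding` are assumptions made for the example.

```python
# Hypothetical padded targets for a batch of two images.
# Each row is assumed to be [class_id, x1, y1, x2, y2].
targets = [
    [[1.0, 10.0, 20.0, 50.0, 60.0], [0.0, 5.0, 5.0, 30.0, 40.0]],
    [[2.0, 15.0, 25.0, 45.0, 65.0], [0.0, 0.0, 0.0, 0.0, 0.0]],  # last row is padding
]


def drop_padding(target):
    # List equivalent of target[target.sum(1) > 0]: keep rows with a positive sum.
    return [row for row in target if sum(row) > 0]


filtered = [drop_padding(target) for target in targets]
boxes_per_image = [len(target) for target in filtered]
print(boxes_per_image)  # [2, 1]
```

A real box with class id 0 survives the filter as long as its coordinates sum to a positive value, so only the all-zero padding rows are removed.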

Changed the self.metric() call to self.metric.update(). Removed the logging of per batch metric values. Filtered out the dummy boxes from the per image "target" tensors before appending them to the metrics accumulator. fixes WongKinYiu#133

Adamusen added the bug label Dec 9, 2024

Adamusen mentioned this issue Dec 9, 2024

🐛 [Fix] validation_step #134

Merged

henrytsui000 closed this as completed in #134 Jan 3, 2025

henrytsui000 closed this as completed in 8094323 Jan 3, 2025
