MeanAveragePrecision CPU overload when using multi-device ddp #866

Closed
aaronzs opened this issue Feb 28, 2022 · 1 comment · Fixed by #867
@aaronzs (Contributor) commented Feb 28, 2022

🐛 Bug

I was using pytorch_lightning.Trainer in ddp mode with 8 GPUs on an HPC. There are 5k images in my validation set, and I found that all CPU cores were fully loaded (100%) for an extremely long time (~2 hours) at the validation_epoch_end() stage.
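For context, a rough sketch of the run; the exact Trainer arguments are an assumption (pytorch-lightning 1.5.x style), not copied from the original job:

# Rough sketch of the setup described above; Trainer arguments are an assumption.
import pytorch_lightning as pl

model = Model()  # the LightningModule from the code sample below
trainer = pl.Trainer(gpus=8, strategy="ddp", max_epochs=10)
trainer.fit(model)  # every rank stalls with 100% CPU in validation_epoch_end()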

To Reproduce

  1. Use ddp mode with more than 2 devices.
  2. Use a large validation/test set.

Code sample

class Model(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.det = FasterRCNN(...)
        self.map = MeanAveragePrecision(compute_on_step=False)
        ...

    ...

    def validation_step(self, batch, batch_idx):
        images, target = batch
        preds = self.det(images)
        self.map.update(preds=preds, target=target)

    def validation_epoch_end(self, outs):
        val_map_out = self.map.compute()  # <-- CPU overload happens here
        self.map.reset()

Expected behavior

The same computation takes place in N separate sub-processes simultaneously. If the computations ran on the GPU, they should finish in roughly the same time regardless of N. But the current implementation moves all tensors to the CPU, so the CPU becomes overloaded as N increases (observed timings below; a per-rank timer sketch follows the list).

  • ddp with 1 GPU: 16/32 CPU cores fully loaded, takes ~6 mins.
  • ddp with 2 GPUs: 32/32 CPU cores fully loaded, takes ~6 mins.
  • ddp with >2 GPUs: 32/32 CPU cores fully loaded, takes > 1 hour.
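The per-rank timings above can be reproduced with a simple timer around compute(); a minimal sketch, assuming the LightningModule from the code sample (self.global_rank is provided by Lightning):

import time

# Drop-in variant of validation_epoch_end() from the code sample, used only to
# time compute() on each rank.
def validation_epoch_end(self, outs):
    start = time.perf_counter()
    val_map_out = self.map.compute()
    elapsed = time.perf_counter() - start
    print(f"rank {self.global_rank}: map.compute() took {elapsed:.1f}s")
    self.map.reset()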

Environment

  • OS (e.g., Linux): Linux
  • Python: 3.8
  • PyTorch: 1.8.1 and 1.10.2
  • pytorch-lightning: 1.5.10
  • torchmetrics: 0.7.2

Temp solution

My temporary solution is to re-enable the mAP computation on the GPU, even though the GPU version is slower than the CPU version (#677).

I commented out the following lines and moved the other relevant variables to the correct device.

    def compute(self) -> dict:
        # move everything to CPU, as we are faster here
        # self.detection_boxes = [box.cpu() for box in self.detection_boxes]
        # self.detection_labels = [label.cpu() for label in self.detection_labels]
        # self.detection_scores = [score.cpu() for score in self.detection_scores]
        # self.groundtruth_boxes = [box.cpu() for box in self.groundtruth_boxes]
        # self.groundtruth_labels = [label.cpu() for label in self.groundtruth_labels]

Should we make compute_on_cpu optional? Any suggestions on how this would fit TorchMetrics' API design?
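For discussion, one possible shape for such an option. This is an illustration only, not an existing torchmetrics API; the attribute names simply mirror the snippet above:

# Illustration only: a hypothetical compute_on_cpu flag that gates the device
# move, keeping the current CPU behaviour as the default.
class ExampleDetectionMetric:
    def __init__(self, compute_on_cpu: bool = True):
        self.compute_on_cpu = compute_on_cpu
        self.detection_boxes = []  # list of per-image box tensors

    def compute(self) -> dict:
        if self.compute_on_cpu:
            # move everything to CPU, as we are faster here (today's default)
            self.detection_boxes = [box.cpu() for box in self.detection_boxes]
        # ...the rest of the mAP computation then runs on whichever device
        # the state tensors live on...
        return {}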

I'm also keeping an eye on the work to improve mAP performance on both CPU and GPU (#742).

@aaronzs added the bug / fix and help wanted labels on Feb 28, 2022
@github-actions

Hi! Thanks for your contribution, great first issue!

@Borda Borda modified the milestones: v0.7, v0.8 Mar 14, 2022
@Borda Borda added this to the v0.8 milestone May 5, 2022
@Borda Borda added the v0.7.x label Aug 25, 2023