MeanAveragePrecision CPU overload when using multi-device ddp #866

Closed
aaronzs opened this issue Feb 28, 2022 · 1 comment · Fixed by #867
@aaronzs (Contributor) commented Feb 28, 2022

🐛 Bug

I was using pytorch_lightning.Trainer in ddp mode with 8 GPUs on an HPC. There are 5k images in my validation set, and I found that all CPU cores were fully loaded (100%) for an extremely long time (~2 hours) at the validation_epoch_end() stage.
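For context, a rough sketch of the run; the exact Trainer arguments are an assumption (pytorch-lightning 1.5.x style), not copied from the original job:

# Rough sketch of the setup described above; Trainer arguments are an assumption.
import pytorch_lightning as pl

model = Model()  # the LightningModule from the code sample below
trainer = pl.Trainer(gpus=8, strategy="ddp", max_epochs=10)
trainer.fit(model)  # every rank stalls with 100% CPU in validation_epoch_end()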

To Reproduce

  1. Use ddp mode with more than 2 devices.
  2. Use a large validation/test set.

Code sample

class Model(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.det = FasterRCNN(...)
        self.map = MeanAveragePrecision(compute_on_step=False)
        ...

    ...

    def validation_step(self, batch, batch_idx):
        images, target = batch
        preds = self.det(images)
        self.map.update(preds=preds, target=target)

    def validation_epoch_end(self, outs):
        val_map_out = self.map.compute()  # <-- CPU overload happens here
        self.map.reset()

Expected behavior

The same computation takes place in N separate sub-processes simultaneously. If the computations ran on the GPU, they should finish in roughly the same time regardless of N. But the current implementation moves all tensors to the CPU, so the CPU becomes overloaded as N increases (observed timings below; a per-rank timer sketch follows the list).

  • ddp with 1 GPU: 16/32 CPU cores fully loaded, takes ~6 mins.
  • ddp with 2 GPUs: 32/32 CPU cores fully loaded, takes ~6 mins.
  • ddp with >2 GPUs: 32/32 CPU cores fully loaded, takes > 1 hour.
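The per-rank timings above can be reproduced with a simple timer around compute(); a minimal sketch, assuming the LightningModule from the code sample (self.global_rank is provided by Lightning):

import time

# Drop-in variant of validation_epoch_end() from the code sample, used only to
# time compute() on each rank.
def validation_epoch_end(self, outs):
    start = time.perf_counter()
    val_map_out = self.map.compute()
    elapsed = time.perf_counter() - start
    print(f"rank {self.global_rank}: map.compute() took {elapsed:.1f}s")
    self.map.reset()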

Environment

  • OS (e.g., Linux): Linux
  • Python: 3.8
  • PyTorch: 1.8.1 and 1.10.2
  • pytorch-lightning: 1.5.10
  • torchmetrics: 0.7.2

Temp solution

My temporary solution is to re-enable the mAP computation on the GPU, even though the GPU version is slower than the CPU version (#677).

I commented out the following lines and moved the other relevant variables to the correct device.

    def compute(self) -> dict:
        # move everything to CPU, as we are faster here
        # self.detection_boxes = [box.cpu() for box in self.detection_boxes]
        # self.detection_labels = [label.cpu() for label in self.detection_labels]
        # self.detection_scores = [score.cpu() for score in self.detection_scores]
        # self.groundtruth_boxes = [box.cpu() for box in self.groundtruth_boxes]
        # self.groundtruth_labels = [label.cpu() for label in self.groundtruth_labels]

Should we make compute_on_cpu optional? Any suggestions on how this would fit TorchMetrics' API design?
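For discussion, one possible shape for such an option. This is an illustration only, not an existing torchmetrics API; the attribute names simply mirror the snippet above:

# Illustration only: a hypothetical compute_on_cpu flag that gates the device
# move, keeping the current CPU behaviour as the default.
class ExampleDetectionMetric:
    def __init__(self, compute_on_cpu: bool = True):
        self.compute_on_cpu = compute_on_cpu
        self.detection_boxes = []  # list of per-image box tensors

    def compute(self) -> dict:
        if self.compute_on_cpu:
            # move everything to CPU, as we are faster here (today's default)
            self.detection_boxes = [box.cpu() for box in self.detection_boxes]
        # ...the rest of the mAP computation then runs on whichever device
        # the state tensors live on...
        return {}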

I'm also keeping an eye on the work to improve mAP performance on both CPU and GPU (#742).

@aaronzs added the bug / fix and help wanted labels on Feb 28, 2022
@github-actions

Hi! Thanks for your contribution, great first issue!

@Borda Borda modified the milestones: v0.7, v0.8 Mar 14, 2022
@Borda Borda added this to the v0.8 milestone May 5, 2022
@Borda Borda added the v0.7.x label Aug 25, 2023