Performance issue for own mAP implementation #677
I've come across a profiling tool that measures time spent per line. I'll share a modified version of your code @tkupek:

```python
import time

import torch
import torch.distributed
from pytorch_lightning import LightningModule, Trainer
from torch.utils.data import Dataset, DataLoader
from torchmetrics.detection.map import MAP

BATCH_SIZE = 32


class RandomDataset(Dataset):
    def __init__(self, size, num_samples):
        self.len = num_samples
        self.data = torch.randn(num_samples, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


train = RandomDataset(32, 40 * BATCH_SIZE)
train = DataLoader(train, batch_size=BATCH_SIZE)
val = RandomDataset(32, 40 * BATCH_SIZE)
val = DataLoader(val, batch_size=BATCH_SIZE)

# mockups for MAP compatible data
mock_preds = [
    dict(
        boxes=torch.Tensor([[258.0, 41.0, 606.0, 285.0]]),
        scores=torch.Tensor([0.536]),
        labels=torch.IntTensor([0]),
    )
]
mock_target = [
    dict(
        boxes=torch.Tensor([[214.0, 41.0, 562.0, 285.0]]),
        labels=torch.IntTensor([0]),
    )
]


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        self.val_map = MAP(class_metrics=True, dist_sync_on_step=True)

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"loss": loss}

    def training_step_end(self, training_step_outputs):
        return training_step_outputs

    def training_epoch_end(self, outputs) -> None:
        torch.stack([x["loss"] for x in outputs]).mean()

    def validation_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        # ignore real outputs and add mockup preds to metric
        preds = []
        target = []
        for n in range(batch.size(0)):
            x = mock_preds[0]
            preds.append(
                {
                    "boxes": x["boxes"].to(self.device),
                    "labels": x["labels"].to(self.device),
                    "scores": x["scores"].to(self.device),
                }
            )
            x = mock_target[0]
            target.append({"boxes": x["boxes"].to(self.device), "labels": x["labels"].to(self.device)})
        self.val_map.update(preds=preds, target=target)
        return {"x": loss}

    def on_validation_epoch_start(self) -> None:
        self.val_map.reset()

    def on_validation_epoch_end(self) -> None:
        if self.trainer.global_step != 0:
            print(f"\nRunning val metric on {len(self.val_map.groundtruth_boxes)} samples")
            start = time.time()
            result = self.val_map.compute()  # GPUs get stuck here
            end = time.time()
            print(f"Total time: {end - start}")
            print(f"Time per sample: {(end - start) / len(self.val_map.groundtruth_boxes)}")

    def configure_optimizers(self):
        return [torch.optim.SGD(self.layer.parameters(), lr=0.1)]


# pip install line_profiler
# Line profiler is NOT accurate for CUDA code.
import contextlib
from typing import Callable, List

import line_profiler


class WrappedLineProfiler(line_profiler.LineProfiler):
    """Measures time for executing code in the specified profiling_functions.

    More info: https://github.com/pyutils/line_profiler
    Call the print_stats() method after profiling to get results.
    """

    def __init__(self, profiling_functions: List[Callable]):
        super().__init__(*profiling_functions)

    @contextlib.contextmanager
    def __call__(self):
        self.enable()  # Start measuring time
        yield  # profiling_functions are expected to run here
        self.disable()  # Stop measuring time


model = BoringModel()
trainer = Trainer(max_epochs=1, strategy="ddp", gpus=None)

profiling_functions = [
    MAP.compute,
    MAP._calculate,
    MAP._evaluate_image,
    MAP._find_best_gt_match,
]
profiler = WrappedLineProfiler(profiling_functions)
with profiler():
    trainer.fit(model, train, val)
profiler.print_stats()
```

This outputs:
I used this information to find that line 438 takes most of the time. I'm running an M1 MacBook so I haven't tried anything on the GPU.
@OlofHarrysson this is good insight! I can test this today on a GPU.
@tkupek Please do :) Note that profiling CUDA calls with this profiler can be incorrect. Actions performed after CUDA operations, e.g. moving a tensor from GPU to CPU with `tensor.cpu()`, will often incorrectly be attributed to the `.cpu()` line while Python is actually blocked waiting for the CUDA call to finish. You basically have to sprinkle in a bunch of `torch.cuda.synchronize()` calls to get accurate per-line timings.
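To illustrate the pitfall, here is a sketch of a small timing helper that brackets the measured call with `torch.cuda.synchronize()`; the helper name `timed` is mine, not part of any library, and it falls back to plain wall-clock timing when torch is unavailable:

```python
import time

try:  # torch is optional here; without it the helper is plain wall-clock timing
    import torch
except ImportError:
    torch = None


def timed(fn, device="cpu"):
    """Run fn() and return (result, elapsed_seconds).

    CUDA kernels launch asynchronously, so without a synchronize the cost is
    often billed to whatever later line first forces a sync (e.g. .cpu()).
    Synchronizing before and after brackets the real work.
    """
    if torch is not None and device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn()
    if torch is not None and device == "cuda":
        torch.cuda.synchronize()
    return result, time.perf_counter() - start
```

The same bracketing applies when reading line_profiler output: a line that merely triggers the first synchronization will look hot even if the work happened earlier.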
I ran the profiler on a GPU. Your suggestion helped a bit, but there is little overall effect:
Ok. By looking at the time spent on different code parts, there's a section that can be reworked to speed up the calculations by almost ~2x if it is computed only once:

```python
eval_imgs = [
    self._evaluate_image(img_id, class_id, area, max_detections, ious)
    for class_id in class_ids
    for area in area_ranges
    for img_id in img_ids
]
```

That code takes a lot of time and is computed twice, from two call sites:

```python
# Code is called here
overall, map, mar = self._calculate(self._get_classes())

# And also here
for class_id in self._get_classes():
    _, cls_map, cls_mar = self._calculate([class_id])
    map_per_class_list.append(cls_map.map)
    mar_max_dets_per_class_list.append(cls_mar[f"mar_{self.max_detection_thresholds[-1]}"])
```

Seems like an easy fix. But measuring time on this data can be a bit misleading, as there is always one pred and one GT box; results would differ for different data. It would be good to measure the time on e.g. coco-eval, which contains 5000 images with predictions from a standard model. At any rate, I think the code could be reworked to run faster:

```python
@staticmethod
def _find_best_gt_match(
    thr: int, nb_gt: int, gt_matches: Tensor, idx_iou: float, gt_ignore: Tensor, ious: Tensor, idx_det: int
) -> int:
    previously_matched = gt_matches[idx_iou]
    # Remove previously matched or ignored gts
    remove_mask = previously_matched | gt_ignore
    gt_ious = ious[idx_det] * ~remove_mask
    match_idx = gt_ious.argmax().item()
    if gt_ious[match_idx] > thr:
        return match_idx
    return -1
```
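One way to remove the duplication would be to evaluate each (image, class) pair once and let the overall pass and the per-class passes share the results. A minimal pure-Python sketch of the idea; `CachedEvaluator` and `evaluate_image` are hypothetical stand-ins, not the torchmetrics API:

```python
class CachedEvaluator:
    """Evaluate each (img_id, class_id) pair at most once, so an overall
    pass and subsequent per-class passes reuse the same results."""

    def __init__(self, img_ids, evaluate_image):
        self.img_ids = img_ids
        self.evaluate_image = evaluate_image  # the expensive per-pair call
        self._cache = {}

    def calculate(self, class_ids):
        results = []
        for class_id in class_ids:
            for img_id in self.img_ids:
                key = (img_id, class_id)
                if key not in self._cache:
                    self._cache[key] = self.evaluate_image(img_id, class_id)
                results.append(self._cache[key])
        return results
```

With this shape, the overall `calculate(all_classes)` call populates the cache and the later per-class `calculate([class_id])` calls hit it, so the expensive evaluation runs once per pair instead of twice.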
I do have a real-world detection model + data on hand where I can test your improvements. Will hopefully find some time in the next 1-2 weeks.
I performed the tests on a real-world dataset and the CUDA issue is confirmed:

CPU:

CUDA:

I will now test the performance impact of your suggestions and try to get insights from the profiler.
🐛 Bug
The new mAP implementation that replaced the pycocotools implementation seems to have a performance issue. In my measurements it is 10-15x slower. These are some performance measurements I did on a CPU and a single GPU (same machine):

New implementation

Previous implementation (pycocotools)
To Reproduce
Steps to reproduce the behavior:
Run mAP implementation as in the example.
Code sample
Find the measurement code with BoringModel here:
Expected behavior
Performance should at least be on par with the pycocotools implementation.
Best case, it should be faster, especially on GPU.
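To make "on par with pycocotools" concrete, both implementations could be timed with the same harness. A sketch assuming only that a metric exposes `reset`/`update`/`compute` (as the torchmetrics `MAP` does); the function name `benchmark_metric` is mine:

```python
import time


def benchmark_metric(metric, batches, repeats=3):
    """Best-of-N wall time for one full update-all-batches + compute cycle.

    Taking the best of several repeats reduces noise from warm-up and
    background load; both implementations should see identical batches.
    """
    best = float("inf")
    for _ in range(repeats):
        metric.reset()
        start = time.perf_counter()
        for preds, target in batches:
            metric.update(preds, target)
        metric.compute()
        best = min(best, time.perf_counter() - start)
    return best
```

Running this once over the new implementation and once over a pycocotools wrapper on the same prediction set would give a directly comparable number per backend.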
Environment
- Install method (`conda`, `pip`, source): pip

Additional context