[Torch] More graph rewrites for Faster RCNN / MaskRCNN #7346
Conversation
Nice, it looks good to me.
I have one question regarding the topk_after_batch_nms_pattern. It seems the topk slice will no longer get applied to the true branch. Before the rewrite, it was applied to the result of the if statement - both branches. After the rewrite, it is folded into NMS via max_output_size, but that happens only in the false branch. Would that cause problems?
No, the true branch corresponds to the case where there are zero boxes, so applying topk to an empty tensor is a no-op anyway.
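For illustration, a minimal check (not from the PR itself) showing that a topk slice on an empty tensor is a no-op in PyTorch:

```python
import torch

boxes = torch.zeros(0, 4)  # the zero-box case of the true branch
topk = boxes[:1000]        # the post-NMS topk slice
print(topk.shape)          # torch.Size([0, 4]) -- still empty, a no-op
```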
Got it, thanks! I guess the pattern does not guarantee that the true branch is for that zero-box case, but since this rewrite is only meant to be used for this particular model, that is fine.
Thanks for the effort. LGTM. Just a nitpick, feel free to ignore.
* add post nms topk to max_out_size rewrite
* add argsort conversion
* scatter pattern first cut
* matching seems to be working
* dup matching fixed
* add converter
* conversion seems working
* add reshape, use take
* remove pytorch argsort converter
* update test
* add doc
This PR adds two new graph rewrites to optimize Faster RCNN / MaskRCNN. Happy to split them into two PRs if preferred.
The first one is to exploit the fact that in PyTorch detection models, NMS is always followed by post NMS topk, as shown below.
https://github.com/pytorch/vision/blob/8ebfd2f5d5f1792ce2cf5a2329320f604530a68e/torchvision/models/detection/rpn.py#L272-L275
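For reference, the linked lines look roughly like this (paraphrased from torchvision's RPN proposal filtering; the exact code may differ between versions):

```python
# non-maximum suppression, done independently per feature level
keep = box_ops.batched_nms(boxes, scores, lvl, self.nms_thresh)

# keep only the top-k scoring predictions after NMS
keep = keep[: self.post_nms_top_n()]
```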
We can extract that topk parameter and use it as the max_output_size parameter in our NMS. This brings a good speedup, 4.51 ms -> 4.11 ms, and further speedup can easily be expected if we had a TIR while loop (cc @tqchen).

The second rewrite replaces the repeated scatter loop in
https://github.com/pytorch/vision/blob/6315358dd06e3a2bcbe9c1e8cdaa10898ac2b308/torchvision/ops/poolers.py#L20-L29
with something like this:
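Below is a minimal sketch of the replacement computation, assuming levels assigns each RoI to a pyramid level and unmerged_results is the list of per-level pooled tensors (names are illustrative, not the PR's actual code):

```python
import torch

def merge_levels(levels, unmerged_results):
    # Output positions that each level's rows should land in,
    # concatenated in level order.
    indices = torch.cat(
        [torch.where(levels == l)[0] for l in range(len(unmerged_results))]
    )
    # All per-level results stacked in the same level order.
    results = torch.cat(unmerged_results, dim=0)
    # indices is a permutation of 0..N-1, so argsort(indices)[k] is the
    # row of `results` that belongs at output position k: one batched
    # gather replaces torch.zeros + the repeated 4D scatters.
    return results[torch.argsort(indices)]
```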
i.e., we are able to remove torch.zeros (which turns out to be very expensive, due to too many any_dim generated by Relay) and the repeated 4D scatters (which are slow because scatters cannot be parallelized well). Instead, we can do a concat, an argsort, and a batched gather to get an equivalent result, which is much more efficient. This transformation is not at all obvious; I think it is a great example of the power of graph rewriting. It cuts more than 10 ms from MaskRCNN / FasterRCNN.

Unfortunately, I expect this PR to be hard to review; let me know if you have any questions. I tried to give detailed comments to aid understanding.
This concludes the series of PRs I did to optimize MaskRCNN on GPU + VM; here are the current numbers. Surprisingly, NVPTX generates much better code for the dynamic injective ops, which are one of the bottlenecks in MaskRCNN due to a certain limitation in Relay + TE (too many unnecessary any_dim generated). I hope we can discuss this performance result further in the forum.

Please review @zhiics @kevinthesun @mbrookhart @jwfromm @anijain2305 @trevor-m