[TOPI] Improve memory layout inside GPU NMS kernel #7257

masahi · 2021-01-12T03:26:57Z

Motivation

Currently, our NMS API expects input and output data to be in a format [batch, num_anchors, 6], where 6 is for 4 bbox coordinates, score and class id. This format is highly inefficient for class-aware NMS, because to check if two bboxes are the same class, we need to load other 5 values. More discussion in #7154 (comment) and other comments in that PR.

This PR improved the memory layout of buffers used in the NMS kernel. Specifically, it unpacks [batch, num_anchors, 6] input into one [batch, num_anchors, 4] buffer for bboxes, and two [batch, num_anchors] buffers for scores and class ids.

Speed up

This change is expected to give a good speed up for the cases when the number valid boxes is large and there are class ids involved. For example, although the change in #7154 brings only modest speed up, combined with this PR the speed up becomes good, as shown below:

PyTorch NMS workload from FasterRCNN / MaskRCNN (4500 boxes):

Current main : 5.78 milli sec
After rewrite in #7154 to restore class aware NMS: 5.23 milli sec
With memory layout improvement in this PR and the rewrite in #7154: 4.51 milli sec

Other consideration

For other cases, it is likely that there would be no speed up, but it should be no worse. For example, it doesn't improve NMS workload from Gluon SSD. When return_indices=False, which I think only applies to MXNet, there is additional concatenation of three buffers at the end. For Gluon SSD, this concat takes 40 micro sec on GTX 1070 ti, while the main NMS kernel takes 200 micro sec. If people think this additional concat is expensive, we really need to rethink our NMS API: The same concat is already happenning in all other frontends, because all frameworks other than MXNet gives NMS inputs in an unpacked form and concat is necessary to satisfy our NMS API.

please review @mbrookhart @Laurawly @vinx13 @trevor-m

mbrookhart

I'm glad this speeds up NMS, that confirms some of our suspicions from [Torch] Restore class-aware NMS for detection models by graph rewrite #7154
I don't love that after this PR, ever framework but MXNET will have an op of the form: concat->split->nms->concat->split. Can we talk about moving the split and concat steps out of this op and reworking the frameworks for a new API?

mbrookhart · 2021-01-12T16:32:13Z

Even after this, I think we will still need a loop over classes for ONNX and TF, since ONNX explicitly and TF implicitly need max_output_boxes_per_class, while this op even with class id will return max_output_boxes for all classes.

masahi · 2021-01-12T20:01:43Z

I don't quite follow, maybe you are missing something?

First, this PR doesn't change our NMS API, it only changes the buffer layouts used internally.

Second, the final concat is only required for MXNet, which uses return_indices=False. Our NMS returns something completely different depending on return_indices flag : If True, it returns a big output tensor packed with bboxes, scores and class ids, with invalid entries indicated by -1.

(The valid entries are supposed to move to the top, if invalid_to_bottom flag is True. But our GPU NMS kernel ignores this argument and the output is not reordered. This is another difference with CPU implementation, I think this is a bug)

tvm/python/tvm/topi/cuda/nms.py

Lines 762 to 763 in 4e8cc4f

    
               invalid_to_bottom : optional, boolean 
        
                   Whether to move all valid bounding boxes to the top.

If return_indices=True, which applies to TF, ONNX, and PyTorch, we only return survived box indices, so there is no need to concat bboxes, scores, and class ids. For this case, there is zero additional overhead after this PR, it is just that concat before NMS done by frontends are now completely pointless.

mbrookhart

I see, yep.

We don't seem to have a test for invalid_to_bottom, it's False by default:

tvm/python/tvm/relay/op/vision/nms.py

Line 70 in 4e8cc4f

invalid_to_bottom=False,

and the only test that mentions it sets it to False (in test_any). Grep-ing around the codebase, the only place it's ever set as True is in the MXNet frontend:

tvm/python/tvm/relay/frontend/mxnet.py

Line 1050 in 4e8cc4f

invalid_to_bottom=True,

and I don't see a unit test for that op.

It might be a bug because it never runs outside of the gluon ssd tutorial, which doesn't check box order:

tvm/tutorials/frontend/deploy_ssd_gluoncv.py

Lines 111 to 127 in 926a315

    
           for target in ["llvm", "cuda"]: 
        
               ctx = tvm.context(target, 0) 
        
               if ctx.exist: 
        
                   lib = build(target) 
        
                   class_IDs, scores, bounding_boxs = run(lib, ctx) 
        
           ###################################################################### 
        
           # Display result 
        
           ax = utils.viz.plot_bbox( 
        
               img, 
        
               bounding_boxs.asnumpy()[0], 
        
               scores.asnumpy()[0], 
        
               class_IDs.asnumpy()[0], 
        
               class_names=block.classes, 
        
           ) 
        
           plt.show()

We should probably open an issue for it, or add a unit test and fix the kernel.

I'd still like to see us change the NMS API to better reflect what's going on internally (i.e., remove the concat/split for TF/ONNX/Pytorch), but that doesn't need to be this PR.

masahi · 2021-01-12T20:47:30Z

Yes, ideally I want to update our NMS to be closer to TF/ONNX/PyTorch, and let MXNet frontend handle split and concat, rather than the other way around (what we have now). Current API is over complicated due to the need to support both styles. If we can assume that return_indices is always True, we can clean up our API a lot. For example, invalid_to_bottom argument only makes sense for MXNet. We don't need coord_start, score_index, and id_index arguments, if inputs are only unpacked.

Supporting max_output_boxes_per_class needs change in the implementation as well. We need to count the number of survived boxes per class. But that's the only change I think, it is definitely doable without writing another kernel.

commit fe8fda81774c2e1a4d434179f62e3a299e084cb7 Author: Masahiro Masuda <masahi129@gmail.com> Date: Wed Dec 30 20:31:29 2020 +0900 fix write by a single thread commit 0c21e36 Author: Masahiro Masuda <masahi129@gmail.com> Date: Tue Dec 29 04:32:18 2020 +0900 minor improvement when topk is available commit 68c6866 Author: Masahiro Masuda <masahi129@gmail.com> Date: Tue Dec 29 04:10:24 2020 +0900 finish concat output commit 37d7a19 Author: Masahiro Masuda <masahi129@gmail.com> Date: Tue Dec 29 03:59:28 2020 +0900 fixed topk handling commit 1913f97 Author: Masahiro Masuda <masahi129@gmail.com> Date: Mon Dec 28 21:34:24 2020 +0900 more refactoring commit 70c65f0 Author: Masahiro Masuda <masahi129@gmail.com> Date: Mon Dec 28 21:27:15 2020 +0900 unpack input data commit 3a27397 Author: Masahiro Masuda <masahi129@gmail.com> Date: Mon Dec 28 21:22:16 2020 +0900 slight change to initialization commit 9b42008 Author: Masahiro Masuda <masahi129@gmail.com> Date: Mon Dec 28 19:50:36 2020 +0900 add some comments, remove check the check on negative class id commit 0aa375d Author: Masahiro Masuda <masahi129@gmail.com> Date: Mon Dec 28 19:39:49 2020 +0900 leave a TODO on write by only one thread commit d75ee0a Author: Masahiro Masuda <masahi129@gmail.com> Date: Mon Dec 28 19:13:04 2020 +0900 temp disable write by only thread 0 commit 20b5630 Author: Masahiro Masuda <masahi129@gmail.com> Date: Sat Dec 26 10:06:43 2020 +0900 use one block two avoid global sync issue commit dd1e230 Author: Masahiro Masuda <masahi129@gmail.com> Date: Sat Dec 26 07:59:19 2020 +0900 make NMS inner loop parallel fix write by a single thread

masahi · 2021-01-13T10:18:54Z

Thanks @mbrookhart

commit fe8fda81774c2e1a4d434179f62e3a299e084cb7 Author: Masahiro Masuda <masahi129@gmail.com> Date: Wed Dec 30 20:31:29 2020 +0900 fix write by a single thread commit 0c21e36 Author: Masahiro Masuda <masahi129@gmail.com> Date: Tue Dec 29 04:32:18 2020 +0900 minor improvement when topk is available commit 68c6866 Author: Masahiro Masuda <masahi129@gmail.com> Date: Tue Dec 29 04:10:24 2020 +0900 finish concat output commit 37d7a19 Author: Masahiro Masuda <masahi129@gmail.com> Date: Tue Dec 29 03:59:28 2020 +0900 fixed topk handling commit 1913f97 Author: Masahiro Masuda <masahi129@gmail.com> Date: Mon Dec 28 21:34:24 2020 +0900 more refactoring commit 70c65f0 Author: Masahiro Masuda <masahi129@gmail.com> Date: Mon Dec 28 21:27:15 2020 +0900 unpack input data commit 3a27397 Author: Masahiro Masuda <masahi129@gmail.com> Date: Mon Dec 28 21:22:16 2020 +0900 slight change to initialization commit 9b42008 Author: Masahiro Masuda <masahi129@gmail.com> Date: Mon Dec 28 19:50:36 2020 +0900 add some comments, remove check the check on negative class id commit 0aa375d Author: Masahiro Masuda <masahi129@gmail.com> Date: Mon Dec 28 19:39:49 2020 +0900 leave a TODO on write by only one thread commit d75ee0a Author: Masahiro Masuda <masahi129@gmail.com> Date: Mon Dec 28 19:13:04 2020 +0900 temp disable write by only thread 0 commit 20b5630 Author: Masahiro Masuda <masahi129@gmail.com> Date: Sat Dec 26 10:06:43 2020 +0900 use one block two avoid global sync issue commit dd1e230 Author: Masahiro Masuda <masahi129@gmail.com> Date: Sat Dec 26 07:59:19 2020 +0900 make NMS inner loop parallel fix write by a single thread

masahi changed the title ~~Unpack NMS inputs into bbox, scores and class ids~~ [TOPI] Improve memory layout inside GPU NMS kernel Jan 12, 2021

masahi mentioned this pull request Jan 12, 2021

[Torch] Restore class-aware NMS for detection models by graph rewrite #7154

Merged

mbrookhart reviewed Jan 12, 2021

View reviewed changes

mbrookhart approved these changes Jan 12, 2021

View reviewed changes

masahi force-pushed the nms-cuda-data-unpack branch from 48beb10 to 03fd7ba Compare January 13, 2021 06:46

masahi merged commit 1f2b40f into apache:main Jan 13, 2021

junrushao mentioned this pull request Nov 1, 2021

Apache TVM v0.8 Release Note Candidate #9416

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[TOPI] Improve memory layout inside GPU NMS kernel #7257

[TOPI] Improve memory layout inside GPU NMS kernel #7257

masahi commented Jan 12, 2021 •

edited

Loading

mbrookhart left a comment

mbrookhart commented Jan 12, 2021

masahi commented Jan 12, 2021 •

edited

Loading

mbrookhart left a comment

masahi commented Jan 12, 2021 •

edited

Loading

masahi commented Jan 13, 2021

	for target in ["llvm", "cuda"]:
	ctx = tvm.context(target, 0)
	if ctx.exist:
	lib = build(target)
	class_IDs, scores, bounding_boxs = run(lib, ctx)

	######################################################################
	# Display result

	ax = utils.viz.plot_bbox(
	img,
	bounding_boxs.asnumpy()[0],
	scores.asnumpy()[0],
	class_IDs.asnumpy()[0],
	class_names=block.classes,
	)
	plt.show()

[TOPI] Improve memory layout inside GPU NMS kernel #7257

[TOPI] Improve memory layout inside GPU NMS kernel #7257

Conversation

masahi commented Jan 12, 2021 • edited Loading

Motivation

Speed up

Other consideration

mbrookhart left a comment

Choose a reason for hiding this comment

mbrookhart commented Jan 12, 2021

masahi commented Jan 12, 2021 • edited Loading

mbrookhart left a comment

Choose a reason for hiding this comment

masahi commented Jan 12, 2021 • edited Loading

masahi commented Jan 13, 2021

masahi commented Jan 12, 2021 •

edited

Loading

masahi commented Jan 12, 2021 •

edited

Loading

masahi commented Jan 12, 2021 •

edited

Loading