
[Feature] Enable exporting to ONNX for PointRend #4977

Closed

Conversation

@DmitriySidnev commented Apr 15, 2021

This PR, together with PR #953 in mmcv, enables exporting PointRend-based models to ONNX. Affected models:

@CLAassistant commented Apr 15, 2021

CLA assistant check: All committers have signed the CLA.

@codecov bot commented Apr 15, 2021

Codecov Report

Merging #4977 (602b06c) into master (6ef9605) will decrease coverage by 0.65%.
The diff coverage is 1.16%.

❗ Current head 602b06c differs from pull request most recent head 0bd53cb. Consider uploading reports for the commit 0bd53cb to get more accurate results

@@            Coverage Diff             @@
##           master    #4977      +/-   ##
==========================================
- Coverage   65.19%   64.54%   -0.66%     
==========================================
  Files         276      267       -9     
  Lines       21265    20618     -647     
  Branches     3534     3484      -50     
==========================================
- Hits        13864    13308     -556     
+ Misses       6647     6546     -101     
- Partials      754      764      +10     
| Flag | Coverage Δ |
|------|------------|
| unittests | 64.54% <1.16%> (-0.62%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|----------------|------------|
| ...det/models/roi_heads/mask_heads/mask_point_head.py | 31.53% <0.00%> (-1.18%) ⬇️ |
| mmdet/models/roi_heads/point_rend_roi_head.py | 13.66% <1.33%> (-5.97%) ⬇️ |
| mmdet/core/evaluation/eval_hooks.py | 51.70% <0.00%> (-20.53%) ⬇️ |
| mmdet/models/roi_heads/test_mixins.py | 50.58% <0.00%> (-9.81%) ⬇️ |
| mmdet/models/dense_heads/rpn_test_mixin.py | 77.41% <0.00%> (-6.46%) ⬇️ |
| mmdet/models/detectors/cornernet.py | 94.87% <0.00%> (-5.13%) ⬇️ |
| mmdet/models/roi_heads/mask_heads/fcn_mask_head.py | 65.88% <0.00%> (-3.89%) ⬇️ |
| mmdet/models/roi_heads/base_roi_head.py | 85.29% <0.00%> (-2.21%) ⬇️ |
| mmdet/core/bbox/coder/yolo_bbox_coder.py | 58.97% <0.00%> (-2.01%) ⬇️ |
| mmdet/core/export/onnx_helper.py | 30.76% <0.00%> (-1.49%) ⬇️ |
| ... and 71 more | |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6ef9605...0bd53cb.

@RunningLeon (Collaborator)

@DmitriySidnev Thanks for the contribution. Please list the models you have verified in the PR description.

@QingChuanWS (Contributor) commented Apr 19, 2021

Thank you for the contribution, but I ran into some problems when running your PR. Could you tell me which version of PyTorch you used to test these models?

@DmitriySidnev (Author)

@QingChuanWS, hi! I use PyTorch 1.8.1.

@ZwwWayne changed the base branch from master to onnx April 21, 2021 03:09
@QingChuanWS (Contributor)

Hi, @DmitriySidnev. Have you verified whether the visualized results of the modified point_rend under PyTorch are correct?

@ZwwWayne changed the base branch from onnx to master April 26, 2021 12:12
@DmitriySidnev (Author)

@QingChuanWS, I have verified both metrics and visualized results.

@RunningLeon (Collaborator)

@DmitriySidnev Could you merge with master and, if possible, test whether exporting PointRend models to ONNX works with batched inputs and dynamic shapes?

@DmitriySidnev (Author)

@RunningLeon, I have checked exporting to ONNX with dynamic shapes and batching. After fixing a bug in the mmcv/ops/point_sample/bilinear_grid_sample function, it works. However, the line `if torch.onnx.is_in_onnx_export() and num_imgs == 1:` in my code breaks the batch logic during export. This can be simply worked around by exporting with an input tensor whose first dimension (batch_size) is 2. The resulting ONNX graph is valid.
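A minimal sketch of the guard and the batch-size-2 workaround described above; the `PointRendLikeHead` module and its body are hypothetical, only `torch.onnx.is_in_onnx_export()` and the `num_imgs == 1` condition come from the comment:

```python
import torch


class PointRendLikeHead(torch.nn.Module):
    """Hypothetical stand-in for a head containing the export guard."""

    def forward(self, feats):
        num_imgs = feats.shape[0]
        # The guard from the comment above: when exporting with a single
        # image, the tracer records only this simplified branch, so the
        # resulting graph cannot handle batched inputs.
        if torch.onnx.is_in_onnx_export() and num_imgs == 1:
            return feats[0].sigmoid()
        # Generic batched path.
        return feats.sigmoid()


# Workaround: export with batch_size = 2 so the batched branch is the
# one recorded in the ONNX graph.
model = PointRendLikeHead().eval()
dummy = torch.randn(2, 3, 64, 64)  # first dimension (batch_size) = 2
torch.onnx.export(
    model, dummy, 'head.onnx',
    input_names=['input'],
    dynamic_axes={'input': {0: 'batch'}})
```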

@RunningLeon (Collaborator)

@DmitriySidnev Hi, please allow me to push to this PR.

ERROR: Permission to DmitriySidnev/mmdetection.git denied to RunningLeon.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

@DmitriySidnev (Author)

@RunningLeon, hi! Access permissions granted.

@RangiLyu (Member)

Hi~
I tried to test the point_rend_r50_caffe_fpn_mstrain_1x_coco ONNX model with onnxruntime, but it ran out of memory.

error message:


Non-zero status code returned while running ScatterElements node. Name:'ScatterElements_2702' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:305 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 1605632000

Traceback (most recent call last):
  File "tools/deployment/test.py", line 132, in <module>
    main()
  File "tools/deployment/test.py", line 111, in main
    args.show_score_thr)
  File "/home/rangilyu/Projects/mmdetection/mmdet/apis/test.py", line 27, in single_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/home/rangilyu/anaconda3/envs/mmonnx/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/rangilyu/Projects/mmcv/mmcv/parallel/data_parallel.py", line 42, in forward
    return super().forward(*inputs, **kwargs)
  File "/home/rangilyu/anaconda3/envs/mmonnx/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/rangilyu/anaconda3/envs/mmonnx/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/rangilyu/Projects/mmcv/mmcv/runner/fp16_utils.py", line 95, in new_func
    return old_func(*args, **kwargs)
  File "/home/rangilyu/Projects/mmdetection/mmdet/models/detectors/base.py", line 169, in forward
    return self.forward_test(img, img_metas, **kwargs)
  File "/home/rangilyu/Projects/mmdetection/mmdet/core/export/model_wrappers.py", line 74, in forward_test
    self.sess.run_with_iobinding(self.io_binding)
  File "/home/rangilyu/anaconda3/envs/mmonnx/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 229, in run_with_iobinding
    self._sess.run_with_iobinding(iobinding._iobinding, run_options)
RuntimeError: Error in execution: Non-zero status code returned while running ScatterElements node. Name:'ScatterElements_2702' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:305 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 1605632000

'ScatterElements_2702' failed to allocate memory for requested buffer of size 1605632000.

Why does this node ScatterElements_2702 need so much memory?

@RangiLyu (Member)

Found a similar issue in onnxruntime (microsoft/onnxruntime#7612). Switching to a lower version of onnxruntime solved the problem.

@RangiLyu (Member)

I have tested the performance of the ONNX model with onnxruntime. Here is the result:

| Model | Type | box mAP | box AP50 | box AP75 | mask mAP | mask AP50 | mask AP75 |
|-------|------|---------|----------|----------|----------|-----------|-----------|
| point_rend_r50_caffe_fpn_mstrain_1x_coco | PyTorch | 38.4 | 59.0 | 41.8 | 36.3 | 56.9 | 38.7 |
| point_rend_r50_caffe_fpn_mstrain_1x_coco | ONNX | 37.9 | 58.4 | 41.5 | 34.9 | 55.9 | 36.8 |

The performance degradation on masks is large.

@DmitriySidnev (Author)

@RangiLyu, could you please share the scripts and commands you used to export and test the model?

@RangiLyu (Member)

> @RangiLyu, could you please share the scripts and commands you used to export and test the model?

python tools/deployment/pytorch2onnx.py configs/point_rend/point_rend_r50_caffe_fpn_mstrain_1x_coco.py checkpoints/point_rend_r50_caffe_fpn_mstrain_1x_coco-1bcb5fb4.pth --output-file checkpoints/point_rend_r50_caffe_fpn_mstrain_1x_coco-1bcb5fb4.onnx --dynamic-export

and

python tools/deployment/test.py configs/point_rend/point_rend_r50_caffe_fpn_mstrain_1x_coco.py checkpoints/point_rend_r50_caffe_fpn_mstrain_1x_coco-1bcb5fb4.onnx --out work_dirs/point_rend_onnx/result.pkl --eval bbox segm

@DmitriySidnev (Author)

@RangiLyu, I could not find the reason for the metrics degradation. The only thing that, in my view, may be related to the problem is the constant spatial resolution of the output masks in the ONNX graph (800x1216 by default), but I am not sure about it. On the other hand, the significant drop in segmentation metrics may simply be due to lower detection quality.

@RunningLeon (Collaborator)

@DmitriySidnev @RangiLyu Please refer here for possible reasons:

> Mask AP of Mask R-CNN drops by 1% for ONNXRuntime. The main reason is that the predicted masks are directly interpolated to the original image in PyTorch, while in ONNXRuntime they are first interpolated to the preprocessed input image of the model and then to the original image.
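A minimal sketch of the two interpolation paths described in the quote; the tensor shapes and sizes below are illustrative assumptions, not the actual mmdet code:

```python
import torch
import torch.nn.functional as F

mask_logits = torch.randn(1, 1, 28, 28)  # predicted mask for one RoI
input_hw = (800, 1216)                   # preprocessed model input size
orig_hw = (480, 640)                     # original image size

# PyTorch path: interpolate the predicted mask directly to the
# original image resolution.
mask_direct = F.interpolate(mask_logits, size=orig_hw,
                            mode='bilinear', align_corners=False)

# ONNXRuntime path: first interpolate to the preprocessed input size,
# then resize to the original image. The extra resampling step is what
# causes the small mask AP drop.
mask_two_step = F.interpolate(mask_logits, size=input_hw,
                              mode='bilinear', align_corners=False)
mask_two_step = F.interpolate(mask_two_step, size=orig_hw,
                              mode='bilinear', align_corners=False)
```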

@RunningLeon (Collaborator)

@DmitriySidnev, @RangiLyu #5197 may be a possible reason.

@AronLin (Contributor) commented May 25, 2021

I merged this PR with the latest master branch, and tested the performance of the ONNX model with onnxruntime.

Here is the result:

| Model | Type | box mAP | box AP50 | box AP75 | mask mAP | mask AP50 | mask AP75 |
|-------|------|---------|----------|----------|----------|-----------|-----------|
| point_rend_r50_caffe_fpn_mstrain_1x_coco | PyTorch | 38.4 | 59.0 | 41.8 | 36.3 | 56.9 | 38.7 |
| point_rend_r50_caffe_fpn_mstrain_1x_coco | ONNX | 38.4 | 59.0 | 41.8 | 35.2 | 56.5 | 37.1 |

It seems all right.

@RunningLeon requested a review from ZwwWayne May 25, 2021 07:21
@RunningLeon (Collaborator)

Kindly ping @ZwwWayne

@RunningLeon (Collaborator)

@DmitriySidnev Hi, could you fix the conflicts and refactor the ONNX export according to #5205? Thanks a lot.

@ZwwWayne mentioned this pull request Jun 3, 2021
@RangiLyu self-requested a review June 9, 2021 05:32
@RangiLyu (Member) left a comment

LGTM

@ZwwWayne (Collaborator) commented Jun 16, 2021

I think the design can be better. Can we put the complete ONNX export logic into an onnx_export function and separate it from the original inference code?
See example here.
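A minimal sketch of the suggested separation; the class, methods, and bodies below are hypothetical, only the idea of a dedicated `onnx_export` function comes from the comment:

```python
import torch


class RoIHead(torch.nn.Module):
    """Hypothetical head with inference and export paths split."""

    def simple_test(self, feats):
        # Plain PyTorch inference path, free of export-only branches.
        return feats.sigmoid()

    def onnx_export(self, feats):
        # All ONNX-specific logic lives in one place, so the tracer
        # never has to walk through is_in_onnx_export() guards
        # scattered inside the inference code.
        masks = feats.sigmoid()
        return masks.unsqueeze(1)

    def forward(self, feats):
        # The only remaining branch point between the two paths.
        if torch.onnx.is_in_onnx_export():
            return self.onnx_export(feats)
        return self.simple_test(feats)
```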

@DmitriySidnev (Author)

@ZwwWayne, my focus was on the functionality, and right now I do not have enough free time to refactor the code. Can anyone else do this?

@ZwwWayne (Collaborator)

> @ZwwWayne, my focus was on the functionality, and right now I do not have enough free time to refactor the code. Can anyone else do this?

OK, we will put effort into that. Thanks for your efforts.

AronLin added a commit to AronLin/mmdetection that referenced this pull request Jun 23, 2021
@ZwwWayne (Collaborator)

Merged in #5440

@ZwwWayne closed this Jun 29, 2021