PostProcessCuda is very slow using my model #71
Have you found a solution to this? I'm having a similar issue using my own model with one class, except it just gets stuck on inference (after the find pillar_num line). I've also noticed one of the cores on the Xavier is being maxed out while this is happening.
Unfortunately I have no solution yet :(. If you find any lead, please let me know!
Hello,
@GuillaumeAnoufa I am experiencing the same issue. I suspect it is related to this line, as changing the values will still seemingly build the model correctly without errors: CUDA-PointPillars/tool/simplifier_onnx.py (line 29 in 092affc)
I am also using a single-class detector, but with a different pointcloud range and voxel size. I am going to train the model with 3 classes to verify whether this is an issue with the number of classes or with the pointcloud range.
@byte-deve Hi, do you know what each of these numbers ('496' and '432') is a product of?
Hi, I realised these are the dimensions of the feature grid.
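For anyone following along: 496 and 432 are the Y and X dimensions of the BEV feature grid, i.e. the pointcloud range extent divided by the voxel size. A quick sanity check, assuming the default KITTI pointpillar config values:

```python
def grid_dims(pc_range, voxel_size):
    """BEV feature-grid size: (range extent) / (voxel size) per axis."""
    x_min, y_min, _, x_max, y_max, _ = pc_range
    vx, vy, _ = voxel_size
    return round((x_max - x_min) / vx), round((y_max - y_min) / vy)

# Default KITTI values: POINT_CLOUD_RANGE and VOXEL_SIZE from the pointpillar config
nx, ny = grid_dims([0, -39.68, -3, 69.12, 39.68, 1], [0.16, 0.16, 4])
print(nx, ny)  # 432 496 — the two numbers hard-coded in simplifier_onnx.py
```

If you change the range or voxel size in your config, the hard-coded 496/432 in the exporter no longer match and the scatter step writes to the wrong grid.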
@GuillaumeAnoufa I now have this working with a fully custom model; if you still need support you can @ me :)
@rjwb1 hey, I'm having the same issues with setting up a custom model, would really appreciate some guidance :) My config (abridged):

    ################## MODEL CONFIG #####################
    MODEL:
        ...
    OPTIMIZATION:
        ...
    ################# DATASET CONFIG ###################
    DATA_SPLIT: { ... }
    INFO_PATH: { ... }
    TRAINING_CATEGORIES: { ... }
    FOV_POINTS_ONLY: False
    DATA_AUGMENTOR:
        ...
    POINT_FEATURE_ENCODING: { ... }
    DATA_PROCESSOR:
        ...
@mazm0002 hi there, does the model train successfully and work in PyTorch? What stage of the process are you having trouble with?
@rjwb1 Yea, so I can train successfully and get the outputs I expect. Then I use the ONNX exporter tool to convert the model and run it with the demo, feeding it custom test data (which works fine in PyTorch). The TensorRT engine generates fine, but when it actually runs detections they take a long time to process, and there are way too many bounding boxes, most of them incorrect. I think the issue is probably in the ONNX conversion; could you let me know what you had to change in the tool to get it working for 1 class and a custom data/model config? Thanks a lot for the help!
@mazm0002 Hi, I too experienced this, and it was due to some hard-coded parameters inside the exporter. I also used this useful tool to inspect my generated ONNX file and ensure it was similar to the default one: https://github.com/lutzroeder/netron. Can you show me what values you have here, or is it default? CUDA-PointPillars/tool/simplifier_onnx.py (lines 29 to 45 in 092affc)
I think maybe with your model it should look like this?
The size of the scatter plugin array should be equal to the dimensions of the voxel grid.
I will open a PR to parameterise these values properly :)
As you are using additional pointcloud attributes (5 instead of 4), this may require further parameter changes.
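Until such a PR lands, here is a hedged sketch of what "parameterising properly" could look like: derive every number the exporter currently hard-codes from the training config instead. The function and key names below are illustrative only, not the repo's actual API:

```python
def exporter_params(pc_range, voxel_size, num_point_features, num_classes):
    """Collect the values simplifier_onnx.py currently hard-codes.

    Illustrative sketch: in the real exporter these numbers are embedded
    directly into the rewritten ONNX graph rather than gathered in a dict.
    """
    nx = round((pc_range[3] - pc_range[0]) / voxel_size[0])  # X grid cells
    ny = round((pc_range[4] - pc_range[1]) / voxel_size[1])  # Y grid cells
    return {
        "scatter_shape": (ny, nx),                 # dense BEV grid for the scatter plugin
        "num_point_features": num_point_features,  # 4 by default, 5 with an extra attribute
        "num_classes": num_classes,
    }

# Single-class model with 5 point features, default range and voxel size
params = exporter_params([0, -39.68, -3, 69.12, 39.68, 1], [0.16, 0.16, 4], 5, 1)
print(params["scatter_shape"])  # (496, 432)
```

The point is simply that all of the magic numbers in the exporter are functions of the config, so a mismatched range, voxel size, feature count, or class count silently corrupts the graph.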
Hello @rjwb1, thanks for your input! My config only has a few changes from the default config. My exported model shape seems accurate, but I am still experiencing these very long post-processing times.
Hmmm, I am also using a single class... would you mind sending a copy of your cfg file? I will see if I can reproduce this.
Sure: pointpillar2.txt
@GuillaumeAnoufa looks almost identical to mine. Strange... I should mention I also have my score threshold set to 0.4 and my NMS threshold to 0.1 in my Params.h. Could this reduce post-processing latency?
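For context on why the score threshold matters here: NMS compares surviving boxes pairwise, so its cost grows roughly quadratically with the number of candidates that clear the threshold. A back-of-the-envelope illustration in pure Python (not the repo's CUDA kernel):

```python
def nms_pair_count(n):
    """Worst-case pairwise IoU comparisons NMS performs on n candidate boxes."""
    return n * (n - 1) // 2

# 4158 boxes, as in the timing log posted later in this thread,
# vs. a few hundred survivors after a stricter score threshold
print(nms_pair_count(4158))  # 8642403
print(nms_pair_count(300))   # 44850
```

So a misconfigured model that emits thousands of spurious boxes can easily turn post-processing into the dominant cost, independent of inference time.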
@rjwb1 It doesn't seem to change anything. I tried exporting the default "pointpillar_7728.pth" model with the default config, just reducing the number of classes from 3 to 1, and experience the same issue on the default data (load file: ../data/data_velo/000001.bin). Changing the number of classes in the config file results in an abnormally high number of predicted bounding boxes.
@rjwb1 If you try exporting the default model with this config file (the default one but with a single class): pointpillar_1class.txt, and infer on the default velodyne data, do you experience slow post-processing?
I forgot to copy the generated param.h and recompile after changing the model...
@GuillaumeAnoufa no worries, glad you found the solution 👍🏼
Hi, thanks for your work. I changed the files according to your PR #77, moved params.h, and also recompiled.
Could you tell me how you solved your problem?
I can export my custom model to ONNX, but the result seems incorrect. Can you give me some advice?
@rjwb1
Did you solve your problem? I also get incorrect results when I change the voxel size.
System:
Ubuntu 20.04
Latest version of OpenPCDet
GPU has cuda devices: 1
----device id: 0 info----
GPU : NVIDIA GeForce RTX 2080 with Max-Q Design
Capability: 7.5
Global memory: 7982MB
Const memory: 64KB
SM in a block: 48KB
warp size: 32
threads in a block: 1024
block dim: (1024,1024,64)
grid dim: (2147483647,65535,65535)
Hello,
I exported my pointpillar weights trained on custom data. The only parameter change compared to the example model is that it uses 1 class instead of 3.
I had to change a few things in tool/simplifier_onnx.py for the exporter to work with anything other than 3 classes:
Code changes to work with 1 class
I changed the signature of simplify_postprocess(onnx_model) to simplify_postprocess(onnx_model, num_classes) and changed 3 other lines.
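For reference, the class-dependent values in those lines follow from the anchor layout. Assuming (as in the default model) 2 rotation anchors per class and a 7-value box encoding, the head's per-location output channels can be computed like this (a sketch for illustration, not the repo's code):

```python
def head_channels(num_classes, anchors_per_class=2, box_code_size=7):
    """Per-location output channels of the PointPillars detection head."""
    num_anchors = num_classes * anchors_per_class
    return {
        "cls": num_anchors * num_classes,    # class scores
        "box": num_anchors * box_code_size,  # box regression targets
        "dir": num_anchors * 2,              # direction-classifier bins
    }

print(head_channels(3))  # {'cls': 18, 'box': 42, 'dir': 12}  (default 3-class model)
print(head_channels(1))  # {'cls': 2, 'box': 14, 'dir': 4}
```

If the exported graph keeps the 3-class shapes while the runtime Params.h expects 1 class (or vice versa), the score tensor is misread and nearly every anchor clears the threshold, which matches the symptoms below.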
The exporter works, but when testing the demo with this model:
---- RUN TIME ----
load file: ../data/data_velo/000001.bin
find points num: 18630
find pillar_num: 6815
TIME: generateVoxels: 0.038048 ms.
TIME: generateFeatures: 0.053024 ms.
TIME: doinfer: 30.2525 ms.
TIME: doPostprocessCuda: 57528.1 ms.
TIME: pointpillar: 57558.6 ms.
Bndbox objs: 4158
Saved prediction in: ../eval/kitti/object/pred_velo/000001.txt
This model works perfectly fine in pytorch.
As you can see, the post-processing step takes a long time and outputs thousands of bounding boxes.
Issue #43 references a similar problem, seemingly solved by an update, but I am currently using the most up-to-date version of this repo.
Do you have an idea what could cause this issue?
I can upload my .pth file or my onnx file if you want to try and reproduce this.
Best regards,