PostProcessCuda is very slow using my model #71

Open
GuillaumeAnoufa opened this issue Oct 6, 2022 · 29 comments

@GuillaumeAnoufa

GuillaumeAnoufa commented Oct 6, 2022

System:
Ubuntu 20.04
Latest version of OpenPCDet
GPU has cuda devices: 1
----device id: 0 info----
GPU : NVIDIA GeForce RTX 2080 with Max-Q Design
Capbility: 7.5
Global memory: 7982MB
Const memory: 64KB
SM in a block: 48KB
warp size: 32
threads in a block: 1024
block dim: (1024,1024,64)
grid dim: (2147483647,65535,65535)

Hello,

I exported my PointPillars weights, trained on custom data. The only parameter change compared to the example model is that it uses 1 class instead of 3.
I had to change a few things in tools/simplifier_onnx.py for the exporter to work with a class count other than 3:

Code changes to work with 1 class
I changed the signature of simplify_postprocess(onnx_model) to simplify_postprocess(onnx_model, num_classes)
and changed 3 other lines.

-  cls_preds = gs.Variable(name="cls_preds", dtype=np.float32, shape=(1, 248, 216, 18))
-  box_preds = gs.Variable(name="box_preds", dtype=np.float32, shape=(1, 248, 216, 42))
-  dir_cls_preds = gs.Variable(name="dir_cls_preds", dtype=np.float32, shape=(1, 248, 216, 12))
+  cls_preds = gs.Variable(name="cls_preds", dtype=np.float32, shape=(1, 248, 216, 2 * num_classes * num_classes))
+  box_preds = gs.Variable(name="box_preds", dtype=np.float32, shape=(1, 248, 216, 14 * num_classes))
+  dir_cls_preds = gs.Variable(name="dir_cls_preds", dtype=np.float32, shape=(1, 248, 216, 4 * num_classes))
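
The channel counts in the diff above follow from the anchor layout. Here is a minimal sketch of that arithmetic, assuming one anchor size and two rotations per class, seven box parameters per anchor and two direction bins. `head_channels` is a hypothetical helper for illustration, not part of the exporter:

```python
def head_channels(num_classes, rotations_per_class=2, box_code_size=7, num_dir_bins=2):
    """Return (cls, box, dir) channel counts for an anchor-based detection head.

    Assumption: one anchor size per class, so the number of anchors per
    feature-map location is num_classes * rotations_per_class.
    """
    anchors_per_location = num_classes * rotations_per_class
    cls_channels = anchors_per_location * num_classes      # one score per class per anchor
    box_channels = anchors_per_location * box_code_size    # x, y, z, dx, dy, dz, yaw
    dir_channels = anchors_per_location * num_dir_bins     # direction classifier bins
    return cls_channels, box_channels, dir_channels


if __name__ == "__main__":
    for n in (3, 1):
        print(n, head_channels(n))
```

Under these assumptions, three classes reproduce the hard-coded 18, 42 and 12, and one class gives 2, 14 and 4, matching the expressions in the diff.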

The exporter works but when testing the demo with this model:
---- RUN TIME ----
load file: ../data/data_velo/000001.bin
find points num: 18630
find pillar_num: 6815
TIME: generateVoxels: 0.038048 ms.
TIME: generateFeatures: 0.053024 ms.
TIME: doinfer: 30.2525 ms.
TIME: doPostprocessCuda: 57528.1 ms.
TIME: pointpillar: 57558.6 ms.
Bndbox objs: 4158
Saved prediction in: ../eval/kitti/object/pred_velo/000001.txt

This model works perfectly fine in PyTorch.

As you can see, the post-processing step takes a very long time and outputs thousands of bounding boxes.
Issue #43 describes a similar problem that was seemingly solved by an update, but I am already using the latest version of this repo.

Do you have an idea what could cause this issue?

I can upload my .pth file or my onnx file if you want to try and reproduce this.

Best regards,

@mazm0002

Have you found a solution to this? I'm having a similar issue using my own model with one class, except it just gets stuck on inference (after the find pillar_num line). I've also noticed one of the cores on the Xavier is being maxed out while this is happening.

@GuillaumeAnoufa
Author

GuillaumeAnoufa commented Oct 12, 2022

> Have you found a solution to this? I'm having a similar issue using my own model with one class, except it just gets stuck on inference (after the find pillar_num line). I've also noticed one of the cores on the Xavier is being maxed out while this is happening.

Unfortunately, I have no solution yet :(. If you find any lead, please tell me about it!
The problem happens both on my PC (Nvidia 2080) and on my Xavier NX.

@GuillaumeAnoufa
Author

Hello,
Can someone help with this, please?

@rjwb1

rjwb1 commented Nov 23, 2022

@GuillaumeAnoufa I am experiencing the same issue. I suspect it is related to this line, since changing the values still seemingly builds the model correctly, without errors.

op_attrs["dense_shape"] = np.array([496,432])

I am also using a single-class detector, but with a different point cloud range and voxel size. I am going to train the model with 3 classes to verify whether this is an issue with the number of classes or with the point cloud range.

@rjwb1

rjwb1 commented Nov 23, 2022

@byte-deve Hi, do you know what each of these numbers, '496' and '432', is derived from?

@rjwb1

rjwb1 commented Nov 23, 2022

Hi, I realised these are the size of the feature grid.

@rjwb1

rjwb1 commented Nov 27, 2022

@GuillaumeAnoufa I now have this working with a fully custom model, if you still need support you can @ me :)

@mazm0002

mazm0002 commented Nov 28, 2022

@rjwb1 hey, I'm having the same issues with setting up a custom model, would really appreciate some guidance :)
This is my model and dataset config for reference:

################## MODEL CONFIG #####################
DATA_CONFIG:
    BASE_CONFIG: cfgs/dataset_configs/mydata_dataset_only_cone.yaml
    POINT_CLOUD_RANGE: [0, -30.72, -3, 40.96, 30.72, 1]
    DATA_PROCESSOR:
        - NAME: mask_points_and_boxes_outside_range
          REMOVE_OUTSIDE_BOXES: True

        - NAME: shuffle_points
          SHUFFLE_ENABLED: {
            'train': True,
            'test': False
          }

        - NAME: transform_points_to_voxels
          VOXEL_SIZE: [0.16, 0.16, 4]
          MAX_POINTS_PER_VOXEL: 100
          MAX_NUMBER_OF_VOXELS: {
            'train': 20000,
            'test': 60000 #16000
          }
    DATA_AUGMENTOR:
        DISABLE_AUG_LIST: ['placeholder','gt_sampling']
        AUG_CONFIG_LIST:
            - NAME: gt_sampling
              USE_ROAD_PLANE: False
              DB_INFO_PATH:
                  - "data"
              PREPARE: {
                 filter_by_min_points: ['Cone:7'],
                 filter_by_difficulty: [-1],
              }

              SAMPLE_GROUPS: ['Cone:200']
              NUM_POINT_FEATURES: 5
              DATABASE_WITH_FAKELIDAR: False
              REMOVE_EXTRA_WIDTH: [0.0, 0.0, 0.0]
              LIMIT_WHOLE_SCENE: False

            - NAME: random_world_flip
              ALONG_AXIS_LIST: ['x']

            - NAME: random_world_rotation
              WORLD_ROT_ANGLE: [-0.78539816, 0.78539816]

            - NAME: random_world_scaling
              WORLD_SCALE_RANGE: [0.95, 1.05]

            - NAME: random_world_frustum_dropout
              INTENSITY_RANGE: [ 0, 0.2 ]
              DIRECTION: [ 'top' ]

            - NAME: random_local_frustum_dropout
              INTENSITY_RANGE: [ 0, 0.2 ]
              DIRECTION: [ 'top' ]

MODEL:
    NAME: PointPillar

    VFE:
        NAME: PillarVFE
        WITH_DISTANCE: False
        USE_ABSLOTE_XYZ: True
        USE_NORM: True
        NUM_FILTERS: [64]

    MAP_TO_BEV:
        NAME: PointPillarScatter
        NUM_BEV_FEATURES: 64

    BACKBONE_2D:
        NAME: BaseBEVBackbone
        LAYER_NUMS: [3, 5, 5]
        LAYER_STRIDES: [2, 2, 2]
        NUM_FILTERS: [64, 128, 256]
        UPSAMPLE_STRIDES: [1, 2, 4]
        NUM_UPSAMPLE_FILTERS: [128, 128, 128]

    DENSE_HEAD:
        NAME: AnchorHeadSingle
        CLASS_AGNOSTIC: False

        USE_DIRECTION_CLASSIFIER: True
        DIR_OFFSET: 0.78539
        DIR_LIMIT_OFFSET: 0.0
        NUM_DIR_BINS: 2

        ANCHOR_GENERATOR_CONFIG: [
            {
              'class_name': 'Cone',
              'anchor_sizes': [ [ 0.3, 0.3, 0.6 ] ],
              'anchor_rotations': [ 0, 1.57 ],
              'anchor_bottom_heights': [ -0.7 ],
              'align_center': False,
              'feature_map_stride': 2,
              'matched_threshold': 0.6,
              'unmatched_threshold': 0.4
            }
        ]

        TARGET_ASSIGNER_CONFIG:
            NAME: AxisAlignedTargetAssigner
            POS_FRACTION: -1.0
            SAMPLE_SIZE: 512
            NORM_BY_NUM_EXAMPLES: False
            MATCH_HEIGHT: False
            BOX_CODER: ResidualCoder

        LOSS_CONFIG:
            LOSS_WEIGHTS: {
                'cls_weight': 1.0,
                'loc_weight': 2.0,
                'dir_weight': 0.2,
                'code_weights': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
            }

    POST_PROCESSING:
        RECALL_THRESH_LIST: [0.3, 0.5, 0.7]
        SCORE_THRESH: 0.3
        OUTPUT_RAW_SCORE: False

        EVAL_METRIC: kitti

        NMS_CONFIG:
            MULTI_CLASSES_NMS: False
            NMS_TYPE: nms_gpu
            NMS_THRESH: 0.01
            NMS_PRE_MAXSIZE: 300
            NMS_POST_MAXSIZE: 100

OPTIMIZATION:
    BATCH_SIZE_PER_GPU: 3
    NUM_EPOCHS: 80

    OPTIMIZER: adam_onecycle
    LR: 0.003
    WEIGHT_DECAY: 0.01
    MOMENTUM: 0.9

    MOMS: [0.95, 0.85]
    PCT_START: 0.4
    DIV_FACTOR: 10
    DECAY_STEP_LIST: [35, 45]
    LR_DECAY: 0.1
    LR_CLIP: 0.0000001

    LR_WARMUP: False
    WARMUP_EPOCH: 1

########################## DATASET CONFIG ########################
FILTER_MIN_POINTS_IN_GT: 1
POINT_CLOUD_RANGE: [0, -30.72, -3, 40.96, 30.72, 1] # xmin, ymin, zmin, xmax, ymax, zmax

DATA_SPLIT: {
    'train': train,
    'test': val
}

INFO_PATH: {
    'train': [mydata_infos_train.pkl],
    'test': [mydata_infos_val.pkl],
}

TRAINING_CATEGORIES: {
    'Cone': 'Cone',
}

FOV_POINTS_ONLY: False

DATA_AUGMENTOR:
    DISABLE_AUG_LIST: ['placeholder','gt_sampling']
    AUG_CONFIG_LIST:
        - NAME: gt_sampling
          USE_ROAD_PLANE: False
          DB_INFO_PATH:
              - mydata_dbinfos_train.pkl
          PREPARE: {
             filter_by_min_points: ['Cone:20'],
             filter_by_difficulty: [-1],
          }

          SAMPLE_GROUPS: ['Cone:200']
          NUM_POINT_FEATURES: 5
          DATABASE_WITH_FAKELIDAR: False
          REMOVE_EXTRA_WIDTH: [0.0, 0.0, 0.0]
          LIMIT_WHOLE_SCENE: True

        - NAME: random_world_flip
          ALONG_AXIS_LIST: ['x', 'y']

        - NAME: random_world_rotation
          WORLD_ROT_ANGLE: [-3.14159265, 3.114159265]

        - NAME: random_world_scaling
          WORLD_SCALE_RANGE: [0.95, 1.05]

POINT_FEATURE_ENCODING: {
    encoding_type: absolute_coordinates_encoding,
    used_feature_list: ['x', 'y', 'z', 'intensity'],
    src_feature_list: ['x', 'y', 'z', 'intensity', 'timestamp'],
}

DATA_PROCESSOR:
    - NAME: mask_points_and_boxes_outside_range
      REMOVE_OUTSIDE_BOXES: True

    - NAME: shuffle_points
      SHUFFLE_ENABLED: {
        'train': True,
        'test': False
      }

    - NAME: transform_points_to_voxels
      VOXEL_SIZE: [0.16, 0.16, 4]
        #[0.05, 0.05, 0.06]
      MAX_POINTS_PER_VOXEL: 5
      MAX_NUMBER_OF_VOXELS: {
        'train': 16000,
        'test': 40000
      }

GRAD_NORM_CLIP: 10

@rjwb1

rjwb1 commented Nov 30, 2022

@mazm0002 hi there, does the model train successfully and work in PyTorch? What stage of the process are you having trouble with?

@mazm0002

@rjwb1 Yes, I can train successfully and get the outputs I expect. I then use the ONNX exporter tool to convert the model to ONNX and run it with the demo, feeding it custom test data (which works fine in PyTorch). The TensorRT engine is generated fine, but the detections take a long time to process, and there are far too many bounding boxes, most of them incorrect. I think the issue is probably in the ONNX conversion. Could you let me know what you had to change in the tool to get it working for 1 class and a custom data/model config? Thanks a lot for the help!

@rjwb1

rjwb1 commented Nov 30, 2022

@mazm0002 Hi, I experienced this too, and it was due to some hard-coded parameters inside the exporter. I also used this tool to inspect my generated ONNX file and make sure it was similar to the default one:

https://netron.app/

https://github.com/lutzroeder/netron

Can you show me what values you have here, or are they the defaults?

    op_attrs["dense_shape"] = np.array([496,432])

    return self.layer(name="PPScatter_0", op="PPScatterPlugin", inputs=inputs, outputs=outputs, attrs=op_attrs)

def loop_node(graph, current_node, loop_time=0):
  for i in range(loop_time):
    next_node = [node for node in graph.nodes if len(node.inputs) != 0 and len(current_node.outputs) != 0 and node.inputs[0] == current_node.outputs[0]][0]
    current_node = next_node
  return next_node

def simplify_postprocess(onnx_model):
  print("Use onnx_graphsurgeon to adjust postprocessing part in the onnx...")
  graph = gs.import_onnx(onnx_model)

  cls_preds = gs.Variable(name="cls_preds", dtype=np.float32, shape=(1, 248, 216, 18))
  box_preds = gs.Variable(name="box_preds", dtype=np.float32, shape=(1, 248, 216, 42))
  dir_cls_preds = gs.Variable(name="dir_cls_preds", dtype=np.float32, shape=(1, 248, 216, 12))

@rjwb1

rjwb1 commented Nov 30, 2022

I think maybe with your model it should look like this?

    op_attrs["dense_shape"] = np.array([384,256])

    return self.layer(name="PPScatter_0", op="PPScatterPlugin", inputs=inputs, outputs=outputs, attrs=op_attrs)

def loop_node(graph, current_node, loop_time=0):
  for i in range(loop_time):
    next_node = [node for node in graph.nodes if len(node.inputs) != 0 and len(current_node.outputs) != 0 and node.inputs[0] == current_node.outputs[0]][0]
    current_node = next_node
  return next_node

def simplify_postprocess(onnx_model):
  print("Use onnx_graphsurgeon to adjust postprocessing part in the onnx...")
  graph = gs.import_onnx(onnx_model)

  cls_preds = gs.Variable(name="cls_preds", dtype=np.float32, shape=(1, 192, 128, 2))
  box_preds = gs.Variable(name="box_preds", dtype=np.float32, shape=(1, 192, 128, 18))
  dir_cls_preds = gs.Variable(name="dir_cls_preds", dtype=np.float32, shape=(1, 192, 128, 4))

@rjwb1

rjwb1 commented Nov 30, 2022

The size of the scatter plugin array should be equal to the dimensions of the voxel grid
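
To make that relationship concrete, here is a rough sketch of how the grid dimensions can be derived from POINT_CLOUD_RANGE and VOXEL_SIZE. `grid_shapes` is a hypothetical helper, not part of the exporter. It assumes the dense shape is ordered (ny, nx) and that the head's feature map is the grid divided by a stride of 2, which is consistent with the values quoted in this thread:

```python
def grid_shapes(point_cloud_range, voxel_size, feature_map_stride=2):
    """Return (dense_shape, feature_map_shape), both ordered (ny, nx).

    point_cloud_range is [xmin, ymin, zmin, xmax, ymax, zmax] and
    voxel_size is [dx, dy, dz], as in the OpenPCDet configs above.
    """
    x_min, y_min, _, x_max, y_max, _ = point_cloud_range
    # round() guards against floating-point results such as 255.99999999999997
    nx = round((x_max - x_min) / voxel_size[0])
    ny = round((y_max - y_min) / voxel_size[1])
    dense_shape = (ny, nx)
    feature_map_shape = (ny // feature_map_stride, nx // feature_map_stride)
    return dense_shape, feature_map_shape


if __name__ == "__main__":
    # Default KITTI range and the custom range posted earlier in this thread
    print(grid_shapes([0, -39.68, -3, 69.12, 39.68, 1], [0.16, 0.16, 4]))
    print(grid_shapes([0, -30.72, -3, 40.96, 30.72, 1], [0.16, 0.16, 4]))
```

For the default range this gives a dense shape of 496 x 432 and a 248 x 216 feature map; for the custom range above it gives 384 x 256 and 192 x 128.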

@rjwb1

rjwb1 commented Nov 30, 2022

I will open a PR to parameterise these values properly :)

@rjwb1

rjwb1 commented Nov 30, 2022

@mazm0002 Can you try exporting with the changes I have made in #77?

@rjwb1

rjwb1 commented Dec 1, 2022

As you are using additional pointcloud attributes (5 instead of 4) this may require further parameters

@GuillaumeAnoufa
Author

Hello @rjwb1, thanks for your input!
I applied your changes, but unfortunately they did not solve my problem. The shapes in my model did not change with your PR because I was already using the default grid size.

My config only has a few changes from the default config:
POINT_CLOUD_RANGE: [0, -39.68, -3, 69.12, 39.68, 1] -> POINT_CLOUD_RANGE: [0, -39.68, -1, 69.12, 39.68, 7]
VOXEL_SIZE: [0.16, 0.16, 4] -> VOXEL_SIZE: [0.16, 0.16, 8]
The biggest change is the fact that I am using a single class instead of 3.

My exported model's shapes seem correct, but I am still experiencing very long post-processing times.
Below is a picture of the output shapes of the exported model (the rest of the model is exactly the same as the example one):
[Image: my_model_output]

@rjwb1

rjwb1 commented Dec 14, 2022

Hmmm, I am also using a single class... would you mind sending a copy of your cfg file and I will see if I can reproduce this

@GuillaumeAnoufa
Author

Sure: pointpillar2.txt
I changed the BASE_CONFIG to the default one. I don't think the BASE_CONFIG matters here since everything is redefined in the actual config.

@rjwb1

rjwb1 commented Dec 14, 2022

@GuillaumeAnoufa It looks almost identical to mine. Strange... I do have my score threshold set to 0.4 and my NMS threshold set to 0.1 in my Params.h. Could that reduce post-processing latency?

@GuillaumeAnoufa
Author

GuillaumeAnoufa commented Dec 14, 2022

@rjwb1 It doesn't seem to change anything.

I tried exporting the default "pointpillar_7728.pth" model with the default config, only reducing the number of classes from 3 to 1, and I experience the same issue on the default data.
Changing the number of classes from 3 to 1 seems to be what triggers the bug in my setup.

load file: ../data/data_velo/000001.bin
find points num: 18630
find pillar_num: 6815
TIME: generateVoxels: 0.03072 ms.
TIME: generateFeatures: 0.045824 ms.
TIME: doinfer: 15.7839 ms.
TIME: doPostprocessCuda: 64484.9 ms.
TIME: pointpillar: 64500.8 ms.
Bndbox objs: 4646
Saved prediction in: ../eval/kitti/object/pred_velo/000001.txt

Changing the number of classes in the config file results in an abnormally high number of predicted bounding boxes.

@GuillaumeAnoufa
Author

@rjwb1 If you export the default model with this config file (the default one, but with a single class): pointpillar_1class.txt, and run inference on the default Velodyne data, do you experience slow post-processing?
I know this exported model should not work anyway, since the model was trained for 3 classes, but I would like to know whether it is reproducible. Thanks a lot for your help :)

@GuillaumeAnoufa
Author

I forgot to copy the generated params.h and recompile after changing the model...
Post-processing time is back to normal, sorry for the inconvenience 😭
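
For anyone hitting the same thing, a small guard along these lines could catch a stale header before building. This is only a sketch: `header_is_stale` is a hypothetical helper and both paths in the usage are placeholders, so adjust them to wherever your exporter writes the header and wherever your build reads it.

```python
from pathlib import Path


def header_is_stale(generated, in_tree):
    """Return True if the header used by the build differs from the generated one.

    generated: header written by the export step (must exist).
    in_tree:   copy that the C++ build actually includes.
    """
    generated, in_tree = Path(generated), Path(in_tree)
    if not generated.is_file():
        raise FileNotFoundError(generated)
    if not in_tree.is_file():
        return True  # nothing has been copied yet
    # Byte-for-byte comparison; any difference means copy and recompile
    return generated.read_bytes() != in_tree.read_bytes()


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python check_header.py <generated header> <in-tree header>
    if len(sys.argv) == 3:
        stale = header_is_stale(sys.argv[1], sys.argv[2])
        print("stale: copy the header and recompile" if stale else "up to date")
```

A byte comparison is used rather than timestamps because a copied file can carry a newer modification time than the file it was copied from.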

@rjwb1

rjwb1 commented Dec 14, 2022

@GuillaumeAnoufa no worries, glad you found the solution 👍🏼

@mx2013713828

> @mazm0002 can you try exporting with the changes I have made in #77

Hi, thanks for your work. I changed the files according to your PR #77, moved params.h and recompiled.
But inference is still very slow in 'doPostprocessCuda'.
My model has 4 classes and works correctly in OpenPCDet. Could you give me some ideas? I would appreciate it very much!

@mx2013713828

Could you tell me how you solved your problem?
I have the same problem: I found that it generates more than 1 million boxes before NMS, so post-processing is very slow.
I changed my code following @rjwb1's suggestions, but it does not work.
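
As a rough illustration of why the box count matters so much, the sketch below estimates the number of anchors and the worst-case number of pairwise overlap checks. It assumes a greedy NMS that may compare every candidate against every other one, and one anchor size with two rotations per class. Both helpers are hypothetical and not taken from the repo:

```python
def anchor_count(feature_map_hw, num_classes, rotations_per_class=2):
    """Total anchors for a feature map, assuming one anchor size per class."""
    height, width = feature_map_hw
    return height * width * num_classes * rotations_per_class


def worst_case_nms_pairs(num_boxes):
    """Upper bound on pairwise overlap checks in a greedy NMS: n * (n - 1) / 2."""
    return num_boxes * (num_boxes - 1) // 2


if __name__ == "__main__":
    print("anchors, 3 classes:", anchor_count((248, 216), 3))
    print("anchors, 1 class:  ", anchor_count((248, 216), 1))
    for n in (100, 4158, 1_000_000):
        print(n, "boxes ->", worst_case_nms_pairs(n), "pairs")
```

Because the bound grows quadratically, anything that lets too many candidates through the score threshold can dominate the runtime.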

@big773

big773 commented Mar 20, 2023

I can export my custom model to ONNX, but the result seems incorrect. Can you give me some advice?

@zzt007

zzt007 commented Apr 29, 2023

@rjwb1
Hello, thank you very much for your guidance.
I changed the parameters just like you did, but the problem went from slow post-processing to a CUDA error: illegal memory access.
I am also trying to use my own model, which detects only one class, and I have also added ROS.
I sincerely hope you can tell me how to solve the problem; it has been bothering me for a few days.

@mx2013713828

> I can export my custom model to onnx, but the result seems incorrect,can you give me some advice

Did you solve your problem? I also get incorrect results when I change the voxel size.
