
[ICCV 2023] SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection

video

Abstract

We propose SparseFusion, a novel multi-sensor 3D detection method that exclusively uses sparse candidates and sparse representations. Specifically, SparseFusion utilizes the outputs of parallel detectors in the LiDAR and camera modalities as sparse candidates for fusion. We transform the camera candidates into the LiDAR coordinate space by disentangling the object representations. We then fuse the multi-modality candidates in a unified 3D space with a lightweight self-attention module. To mitigate negative transfer between modalities, we propose novel semantic and geometric cross-modality transfer modules that are applied prior to the modality-specific detectors. SparseFusion achieves state-of-the-art performance on the nuScenes benchmark while also running at the fastest speed.
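As a rough illustration of the fusion step described above, the sketch below concatenates two sets of candidate features and runs them through a single self-attention layer. It is a minimal sketch only: the class name, feature dimension, candidate counts, and the plain `nn.MultiheadAttention` layer are assumptions for illustration, not the modules actually used in this repository.

```python
# Minimal, illustrative sketch of fusing sparse LiDAR and camera candidates
# with self-attention (not the actual SparseFusion implementation).
import torch
import torch.nn as nn


class SparseCandidateFusion(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        # A single lightweight self-attention layer over the union of candidates.
        self.attn = nn.MultiheadAttention(dim, num_heads)
        self.norm = nn.LayerNorm(dim)

    def forward(self, lidar_feats, camera_feats):
        # lidar_feats:  (B, N_l, dim) candidate features from the LiDAR branch.
        # camera_feats: (B, N_c, dim) candidate features already transformed
        #               into the LiDAR coordinate space.
        cands = torch.cat([lidar_feats, camera_feats], dim=1)  # (B, N_l + N_c, dim)
        x = cands.transpose(0, 1)      # (N, B, dim), as nn.MultiheadAttention expects
        fused, _ = self.attn(x, x, x)  # candidates from both modalities attend to each other
        return self.norm(x + fused).transpose(0, 1)  # back to (B, N, dim)


# Toy usage: 200 LiDAR candidates and 200 camera candidates per sample.
fusion = SparseCandidateFusion()
out = fusion(torch.randn(2, 200, 256), torch.randn(2, 200, 256))
print(out.shape)  # torch.Size([2, 400, 256])
```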

[paper link] [Chinese summary (自动驾驶之心)]

Updates

[2023-8-21] Much better training GPU memory efficiency (45GB -> 29GB) without hurting performance or speed!

[2023-7-13] 🔥SparseFusion has been accepted to ICCV 2023!🔥

[2023-3-21] We release the first version of the SparseFusion code.

Overview

teaser

Compared to existing fusion algorithms, SparseFusion achieves state-of-the-art performance as well as the fastest inference speed on the nuScenes test set. †: The official repository of AutoAlignV2 uses flipping as test-time augmentation. ‡: We use the BEVFusion-base results from the official BEVFusion repository to match the input resolutions of the other methods. §: Swin-T is adopted as the image backbone.

nuScenes Performance

We do not use any test-time augmentation or model ensembles to obtain these results. We have released the config files and pretrained checkpoints needed to reproduce them.

Validation Set

| Image Backbone | Point Cloud Backbone | mAP  | NDS  | Link        |
|----------------|----------------------|------|------|-------------|
| ResNet50       | VoxelNet             | 70.5 | 72.8 | config/ckpt |
| Swin-T         | VoxelNet             | 71.0 | 73.1 | config/ckpt |

Test Set

| Image Backbone | Point Cloud Backbone | mAP  | NDS  |
|----------------|----------------------|------|------|
| ResNet50       | VoxelNet             | 72.0 | 73.8 |

Usage

Installation

  • We tested our code in an environment with CUDA 11.5, Python 3.7, PyTorch 1.7.1, TorchVision 0.8.2, NumPy 1.20.0, and numba 0.48.0.

  • We use mmdet==2.10.0 and mmcv==1.2.7. Please refer to their official installation instructions.

  • You can install mmdet3d==0.11.0 directly from our repo by running:

    cd SparseFusion
    pip install -e .
    
  • We use spconv==2.3.3. Please follow the official instructions to install the build matching your CUDA version.

    pip install spconv-cuxxx 
    # e.g. pip install spconv-cu114	
    
  • You also need to install the deformable attention module with the following command.

    pip install ./mmdet3d/models/utils/ops
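To double-check that the installed packages match the versions listed above, a quick script along these lines can help (purely illustrative; adjust it to whatever you actually need to verify):

```python
# Print the versions of the key dependencies and compare them against the
# versions listed in the installation notes above (illustrative sanity check).
import numba
import numpy
import torch
import torchvision
import mmcv
import mmdet
import mmdet3d

print("PyTorch     :", torch.__version__)        # tested with 1.7.1
print("TorchVision :", torchvision.__version__)  # tested with 0.8.2
print("NumPy       :", numpy.__version__)        # tested with 1.20.0
print("numba       :", numba.__version__)        # tested with 0.48.0
print("CUDA (torch):", torch.version.cuda)       # tested with 11.5
print("mmcv        :", mmcv.__version__)         # expected 1.2.7
print("mmdet       :", mmdet.__version__)        # expected 2.10.0
print("mmdet3d     :", mmdet3d.__version__)      # expected 0.11.0

import spconv  # spconv 2.x; a successful import is the main check here
```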
    

Data Preparation

Download the full nuScenes dataset from the official website. You should end up with a folder structure like this:

SparseFusion
├── mmdet3d
├── tools
├── configs
├── data
│   ├── nuscenes
│   │   ├── maps
│   │   ├── samples
│   │   ├── sweeps
│   │   ├── v1.0-test
│   │   ├── v1.0-trainval

Then, choose one of the following two ways to preprocess the data.

  1. Run the following two commands sequentially.

    python tools/create_data.py nuscenes --root-path ./data/nuscenes --out-dir ./data/nuscenes --extra-tag nuscenes
    python tools/combine_view_info.py
    
  2. Alternatively, you may directly download our preprocessed data from Google Drive, and put these files in data/nuscenes.
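Whichever option you pick, a quick check like the sketch below (which only assumes the directory layout shown above) can catch a misplaced dataset before the preprocessing scripts fail halfway through:

```python
# Sanity-check the expected nuScenes directory layout before preprocessing
# (illustrative; adjust the root path if your data lives elsewhere).
from pathlib import Path

root = Path("data/nuscenes")
expected = ["maps", "samples", "sweeps", "v1.0-test", "v1.0-trainval"]

missing = [name for name in expected if not (root / name).exists()]
if missing:
    raise FileNotFoundError(f"Missing under {root}: {missing}")
print("nuScenes folder layout looks good.")
```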

Initial Weights

Please download the initial weights for model training, and put them in checkpoints/.
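If you want to confirm that a downloaded file is a readable PyTorch checkpoint, a minimal inspection like the one below works; the file name is a placeholder, not an actual file shipped with the repo:

```python
# Inspect a downloaded checkpoint (illustrative; replace the placeholder
# file name with the file you actually put in checkpoints/).
import torch

ckpt = torch.load("checkpoints/your_initial_weights.pth", map_location="cpu")
# mmdetection-style checkpoints typically store the weights under "state_dict",
# possibly alongside metadata such as "meta".
print(list(ckpt.keys()))
state_dict = ckpt.get("state_dict", ckpt)
print(len(state_dict), "parameter tensors")
```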

Train & Test

In our default setting, we train the model with 4 GPUs.

# training
bash tools/dist_train.sh configs/sparsefusion_nusc_voxel_LC_r50.py 4 --work-dir work_dirs/sparsefusion_nusc_voxel_LC_r50

# test
bash tools/dist_test.sh configs/sparsefusion_nusc_voxel_LC_r50.py ${CHECKPOINT_FILE} 4 --eval=bbox

Note: We use A6000 GPUs (48GB of memory each) for model training. Training the SparseFusion model (ResNet50 backbone) requires ~29GB of memory per GPU.
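Before launching a multi-GPU run, it can be handy to peek at the fully resolved configuration. The snippet below only relies on mmcv's standard `Config` API; whether the printed fields are the ones you care about depends on this repo's config structure:

```python
# Load and print the resolved training config (illustrative; uses mmcv's
# standard Config API rather than anything specific to this repository).
from mmcv import Config

cfg = Config.fromfile("configs/sparsefusion_nusc_voxel_LC_r50.py")
print(cfg)  # dumps the merged configuration, including dataset and model settings
```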

Contact

If you have any questions, feel free to open an issue or contact us at yichen_xie@berkeley.edu.

Acknowledgments

We sincerely thank the authors of mmdetection3d, TransFusion, BEVFusion, MSMDFusion, and DeepInteraction for providing their code or pretrained weights.

Reference

If you find our work useful, please consider citing the following paper:

@article{xie2023sparsefusion,
  title={SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection},
  author={Xie, Yichen and Xu, Chenfeng and Rakotosaona, Marie-Julie and Rim, Patrick and Tombari, Federico and Keutzer, Kurt and Tomizuka, Masayoshi and Zhan, Wei},
  journal={arXiv preprint arXiv:2304.14340},
  year={2023}
}
