
[ECCV2024] Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding


Ivan-Tang-3D/Any2Point


Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding

Official implementation of 'Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding'.

[2023.5] We release ICCV2023 'ViewRefer3D', a multi-view framework for 3D visual grounding that explores how to grasp view knowledge from both the text and 3D modalities with an LLM.

[2023.9] We release AAAI2024 'Point-PEFT', which adapts 3D pre-trained models to downstream tasks using only 1% of the parameters.

[2024.5] The results of Any2Point on ShapeNetPart will be released soon!

[2024.7] Any2Point has been accepted by ECCV 2024!


Introduction

Large foundation models have recently emerged as a prominent focus of interest, attaining superior performance in a wide range of scenarios. Due to the scarcity of 3D data, many efforts have been made to adapt pre-trained transformers from vision to 3D domains. However, such 2D-to-3D approaches are still limited by the potential loss of spatial geometries and high computation cost. More importantly, their frameworks are mainly designed for 2D models and lack a general any-to-3D paradigm. In this paper, we introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding. Given a frozen transformer from any source modality, we propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points to the original 1D or 2D positions within the source modality. This mechanism enables us to assign each 3D token a positional encoding paired with the pre-trained model, which avoids the 3D geometry loss caused by a true projection and better motivates the transformer for 3D learning with 1D/2D positional priors. Then, within each transformer block, we insert an any-to-3D guided adapter module for parameter-efficient fine-tuning. The adapter incorporates prior spatial knowledge from the source modality to guide the local feature aggregation of 3D tokens, compelling the semantic adaptation of any-modality transformers. We conduct extensive experiments to showcase the effectiveness and efficiency of our method.
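As a rough illustration of the 3D-to-2D virtual projection idea described above, the sketch below maps a 3D point to 2D coordinates on a few views and bilinearly interpolates a grid of positional embeddings at those coordinates, then averages over the views. This is not the repository's implementation: the axis-aligned views, the function names, the embedding grid layout, and the averaging step are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# pos_embed[row][col] is the embedding vector of one 2D patch position.
Grid = List[List[List[float]]]


def project_to_views(point: Sequence[float]) -> List[Tuple[float, float]]:
    """Virtually project a 3D point in [-1, 1]^3 onto three planes.

    Assumption: axis-aligned views, each obtained by dropping one coordinate.
    No image is rendered; only 2D coordinates are produced.
    """
    x, y, z = point
    return [(x, y), (x, z), (y, z)]


def bilinear_lookup(pos_embed: Grid, u: float, v: float) -> List[float]:
    """Interpolate the embedding grid at normalized coordinates (u, v) in [-1, 1]."""
    rows, cols = len(pos_embed), len(pos_embed[0])
    # Map [-1, 1] to continuous grid indices in [0, size - 1], clamped to the grid.
    c = min(max((u + 1.0) / 2.0 * (cols - 1), 0.0), float(cols - 1))
    r = min(max((v + 1.0) / 2.0 * (rows - 1), 0.0), float(rows - 1))
    c0, r0 = int(c), int(r)
    c1, r1 = min(c0 + 1, cols - 1), min(r0 + 1, rows - 1)
    wc, wr = c - c0, r - r0
    dim = len(pos_embed[0][0])
    return [
        (1 - wr) * ((1 - wc) * pos_embed[r0][c0][d] + wc * pos_embed[r0][c1][d])
        + wr * ((1 - wc) * pos_embed[r1][c0][d] + wc * pos_embed[r1][c1][d])
        for d in range(dim)
    ]


def point_positional_encoding(pos_embed: Grid, point: Sequence[float]) -> List[float]:
    """Average the interpolated 2D positional embeddings over all virtual views."""
    per_view = [bilinear_lookup(pos_embed, u, v) for u, v in project_to_views(point)]
    dim = len(per_view[0])
    return [sum(e[d] for e in per_view) / len(per_view) for d in range(dim)]


if __name__ == "__main__":
    # Toy 2x2 grid of 1-dimensional embeddings (hypothetical values).
    toy = [[[0.0], [1.0]], [[2.0], [3.0]]]
    print(point_positional_encoding(toy, (0.0, 0.0, 0.0)))
```

Because only coordinates are projected, the 3D points themselves are never flattened into an image, which is the property the introduction attributes to the virtual projection.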

Main Results

We report the pre-training modality (Pre-train), the number of learnable parameters (#Param), and the accuracy on the "PB-T50-RS" split of ScanObjectNN (SCAN.) and on ModelNet40 (MN.). * indicates that the voting strategy is used.

| Method | Pre-train | #Param(M) | SCAN.(%) | MN.(%) |
|---|---|---|---|---|
| PointNet | N/A | 3.5 | 68.0 | 89.2 |
| PointNet++ | N/A | 1.5 | 77.9 | 90.7 |
| DGCNN | N/A | 1.8 | 78.1 | 92.9 |
| PointMLP | N/A | 12.6 | 85.4 | 94.1 |
| Point-PN | N/A | 0.8 | 87.1 | 93.8 |
| PointNeXt | N/A | 1.4 | 87.7 | 94.0 |
| Point-BERT | 3D | 22.1 | 83.1 | 92.7 |
| Point-MAE | 3D | 22.1 | 85.2 | 93.2 |
| Point-M2AE | 3D | 15.3 | 86.4 | 93.4 |
| P2P-HorNet | 2D | 1.2 | 89.3 | 94.0* |
| ACT | 3D+2D | 22.1 | 88.2 | 93.7 |
| I2P-MAE | 3D+2D | 12.9 | 90.1 | 93.7 |
| ReCon | 3D+2D+Language | 43.6 | 90.6 | 94.1 |
| Any2Point (Audio) | Audio | 0.8 | 87.0 | 92.7 |
| Any2Point (2D) | 2D | 0.8 | 87.7 | 93.2 |
| Any2Point (Language) | Language | 0.9 | 91.9 | 94.3 |
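The asterisk in the table marks a result obtained with a voting strategy, which this README does not describe further. A common form of test-time voting in point cloud classification averages the class scores predicted for several augmented copies of one input and takes the top class. The sketch below assumes that form; the function name and the input layout are hypothetical and not taken from any of the listed methods.

```python
from typing import Sequence


def vote(score_sets: Sequence[Sequence[float]]) -> int:
    """Return the class index with the highest mean score across predictions.

    Each element of score_sets holds the per-class scores predicted for one
    augmented copy of the same point cloud (assumed layout).
    """
    if not score_sets:
        raise ValueError("need at least one prediction to vote")
    num_classes = len(score_sets[0])
    if any(len(scores) != num_classes for scores in score_sets):
        raise ValueError("all predictions must cover the same classes")
    mean = [
        sum(scores[c] for scores in score_sets) / len(score_sets)
        for c in range(num_classes)
    ]
    return max(range(num_classes), key=mean.__getitem__)
```

With scores `[[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]`, the first copy alone would favor class 0, while the averaged vote selects class 1.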

Ckpt Release

Real-world shape classification on the PB-T50-RS split of ScanObjectNN:

| Method | Logs | Acc. | Ckpts |
|---|---|---|---|
| Any2Point-Lang-CLIP | Language_CLIP_Scan.log | 91.9% | Language_CLIP_Scan.pth |
| Any2Point-Vision-DINOV2 | Vision_DINOV2_Scan.log | 87.7% | Vision_DINOV2_Scan.pth |
| Any2Point-Audio-ImageBind | Audio_imagebind_scan.log | 87.0% | Audio_imagebind_scan.pth |

Synthetic shape classification on ModelNet40:

| Method | Logs | Acc. | Ckpts |
|---|---|---|---|
| Any2Point-Lang-CLIP | Language_CLIP_ModelNet.log | 94.3% | Language_CLIP_ModelNet.pth |
| Any2Point-Vision-DINOV2 | Vision_DINOV2_ModelNet.log | 93.2% | Vision_DINOV2_ModelNet.pth |
| Any2Point-Audio-ImageBind | Audio_imagebind_ModelNet.log | 92.7% | Audio_imagebind_ModelNet.pth |

Get Started

Installation

Create a conda environment and install basic dependencies:

git clone https://github.com/Ivan-Tang-3D/Any2Point.git
cd Any2Point

conda create -n Any2Point python=3.7
conda activate Any2Point

# Install the corresponding versions of torch and torchvision
conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch

conda install -c pyg pytorch-cluster pytorch-scatter pytorch-sparse -y
pip install torch-geometric==2.0

source install.sh
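After installation, it can help to confirm that the active environment actually holds the pinned versions from the conda command above. The following is a minimal sketch of such a check, not part of the repository; the helper names are hypothetical, and it only reads each module's `__version__` attribute.

```python
import importlib
from typing import Dict, Optional, Tuple

# Version pins copied from the conda install command in this README.
EXPECTED = {"torch": "1.10.1", "torchvision": "0.11.2", "torchaudio": "0.10.1"}


def installed_version(module_name: str) -> Optional[str]:
    """Return the module's version without a local build tag, or None if missing."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    version = getattr(module, "__version__", None)
    # PyTorch builds may append a local tag such as "+cu113"; drop it.
    return version.split("+")[0] if version else None


def find_mismatches(
    expected: Dict[str, str], found: Dict[str, Optional[str]]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package whose version differs to (expected, found)."""
    return {
        name: (want, found.get(name))
        for name, want in expected.items()
        if found.get(name) != want
    }


if __name__ == "__main__":
    found = {name: installed_version(name) for name in EXPECTED}
    for name, (want, got) in find_mismatches(EXPECTED, found).items():
        print(f"{name}: expected {want}, found {got}")
```

Running the script inside the activated `Any2Point` environment prints nothing when all three packages match.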

Dataset

For pre-training and fine-tuning, please follow DATASET.md to install the ModelNet40, ScanObjectNN, and ShapeNetPart datasets, referring to Point-BERT. In particular, put the unzipped folders under data/. Training the language variant requires only 26GB of memory.

The final directory structure should be:

│Any2Point/
├──Any2Point_CLIP_Lang/
├──ckpts/
├──data/
│   ├──ModelNet/
│   ├──ScanObjectNN/
├──...

Fine-tuning

Please download the CLIP_pre-train.pth, DINOV2_pre-train.pth and ImageBind_audio_pre-train.pth into the ckpts/ folder.
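Before launching a run, the directory layout and checkpoint files named above can be checked with a few lines of Python. This is only a convenience sketch, not a script shipped with the repository; `missing_paths` is a hypothetical helper, and the paths are the ones listed in this README relative to the `Any2Point/` root.

```python
from pathlib import Path
from typing import List

# Directories and checkpoint files named in this README.
EXPECTED_DIRS = ["data/ModelNet", "data/ScanObjectNN", "ckpts"]
EXPECTED_CKPTS = [
    "ckpts/CLIP_pre-train.pth",
    "ckpts/DINOV2_pre-train.pth",
    "ckpts/ImageBind_audio_pre-train.pth",
]


def missing_paths(root: str) -> List[str]:
    """Return the expected directories and checkpoints that are absent under root."""
    base = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (base / d).is_dir()]
    missing += [f for f in EXPECTED_CKPTS if not (base / f).is_file()]
    return missing


if __name__ == "__main__":
    for path in missing_paths("."):
        print(f"missing: {path}")
```

Run it from the repository root; an empty output means every listed path was found.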

For the PB-T50-RS split of ScanObjectNN, run:

Any2Point_CLIP_Lang

cd Any2Point_CLIP_Lang
sh fine_tune.sh

Any2Point_DINOV2_Vision

cd Any2Point_DINOV2_Vision
sh fine_tune.sh

Any2Point_ImageBind_audio

cd Any2Point_ImageBind_audio
sh fine_tune.sh

For ModelNet40, run:

Any2Point_CLIP_Lang

cd Any2Point_clip_lang_modelnet
sh fine_tune.sh

Any2Point_DINOV2

cd Any2Point_DINOV2_modelnet
sh fine_tune.sh

Any2Point_ImageBind

cd Any2Point_ImageBind_Modelnet
sh fine_tune.sh
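Each `cd` above is relative to the repository root, so the runs have to be started one at a time from there. A small driver such as the following sketch can run them in sequence. It is not part of the repository: the dataset keys and function names are hypothetical, and the directory names are copied from the commands above.

```python
import subprocess
from pathlib import Path
from typing import Dict, List

# Fine-tuning directories as they appear in the commands of this README.
FINE_TUNE_DIRS: Dict[str, List[str]] = {
    "scanobjectnn": [
        "Any2Point_CLIP_Lang",
        "Any2Point_DINOV2_Vision",
        "Any2Point_ImageBind_audio",
    ],
    "modelnet40": [
        "Any2Point_clip_lang_modelnet",
        "Any2Point_DINOV2_modelnet",
        "Any2Point_ImageBind_Modelnet",
    ],
}


def fine_tune_dirs(root: str, dataset: str) -> List[Path]:
    """Resolve the fine-tuning directories for a dataset under the repo root."""
    return [Path(root) / name for name in FINE_TUNE_DIRS[dataset]]


def run_all(root: str, dataset: str) -> None:
    """Run fine_tune.sh in each directory, stopping at the first failure."""
    for workdir in fine_tune_dirs(root, dataset):
        # Equivalent to: cd <workdir> && sh fine_tune.sh
        subprocess.run(["sh", "fine_tune.sh"], cwd=str(workdir), check=True)


if __name__ == "__main__":
    run_all(".", "scanobjectnn")
```

Passing `check=True` makes a failed run raise `subprocess.CalledProcessError` instead of silently continuing with the next modality.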

Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝.

@article{tang2024any2point,
  title={Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding},
  author={Tang, Yiwen and Liu, Jiaming and Wang, Dong and Wang, Zhigang and Zhang, Shanghang and Zhao, Bin and Li, Xuelong},
  journal={arXiv preprint arXiv:2404.07989},
  year={2024}
}

Acknowledgement

This repo benefits from Pix4Point, Point-NN, PointTransformerV2, and Openpoints. Thanks for their wonderful work.
