A curated list of research papers in 3D visual grounding. (Contact: jhj20 at mails.tsinghua.edu.cn)
[2022/04/15]: Created this repository.
[2022/05/25]: Expanded the scope to 3D Vision-and-Language, e.g., 3D Visual Grounding, 3D Dense Captioning, and 3D Question Answering.
- Achlioptas, Panos, et al. ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes. ECCV 2020, Oral. [Paper] [Code] [Website]
- Chen, Dave Zhenyu, et al. ScanRefer: 3D Object Localization in RGB-D Scans Using Natural Language. ECCV 2020. [Paper] [Code] [Website]
- Huang, Pin-Hao, et al. Text-Guided Graph Neural Networks for Referring 3D Instance Segmentation. AAAI 2021. [Paper] [Code]
- Feng, Mingtao, et al. Free-form Description Guided 3D Visual Graph Network for Object Grounding in Point Cloud. CVPR 2021. [Paper] [Code]
- Liu, Haolin, et al. Refer-It-in-RGBD: A Bottom-Up Approach for 3D Visual Grounding in RGBD Images. CVPR 2021. [Paper] [Code] [Website]
- Yang, Zhengyuan, et al. SAT: 2D Semantics Assisted Training for 3D Visual Grounding. ICCV 2021, Oral. [Paper] [Code]
  - Personal Notes:
    - Uses the corresponding 2D image data (ROI features, labels, bbox coordinates, and camera poses) to assist 3D grounding; see the sketch below.
    - Very solid experiments.
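A minimal sketch of how such 2D-assisted training might look (the module name, feature dimensions, and the concatenation-based fusion are my assumptions, not SAT's actual implementation):

```python
import torch
import torch.nn as nn

class Assisted3DGrounding(nn.Module):
    """Hypothetical module: 2D features help only at training time."""

    def __init__(self, d_model=256, d_2d=2048, d_3d=128):
        super().__init__()
        self.proj_3d = nn.Linear(d_3d, d_model)  # project 3D proposal features
        self.proj_2d = nn.Linear(d_2d, d_model)  # project 2D ROI features
        self.fuse = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)

    def forward(self, feat_3d, feat_2d=None):
        tokens = self.proj_3d(feat_3d)           # (B, N_3d, d_model)
        if self.training and feat_2d is not None:
            # 2D semantics (ROI features, labels, boxes related to 3D via the
            # camera pose) join the token set only during training; inference
            # runs on the 3D stream alone, so no 2D inputs are needed then.
            tokens = torch.cat([tokens, self.proj_2d(feat_2d)], dim=1)
        return self.fuse(tokens)
```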
- Yuan, Zhihao, et al. InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring. ICCV 2021. [Paper] [Code]
- Zhao, Lichen, et al. 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds. ICCV 2021. [Paper] [Code]
  - Personal Notes:
    - The main novelty is the coordinate-guided contextual aggregation module (see the sketch below).
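My reading of that idea as a sketch (a hypothetical module, not the paper's code; the actual design differs in detail): attention logits between proposals get an extra bias predicted from their relative coordinates, so aggregation is guided by spatial layout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoordGuidedAggregation(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        # Maps a relative offset (dx, dy, dz) to a scalar attention bias.
        self.coord_bias = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feats, centers):
        # feats: (B, N, d) proposal features; centers: (B, N, 3) box centers
        q, k, v = self.qkv(feats).chunk(3, dim=-1)
        logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5    # (B, N, N)
        rel = centers[:, :, None, :] - centers[:, None, :, :]    # (B, N, N, 3)
        logits = logits + self.coord_bias(rel).squeeze(-1)       # spatial prior
        return F.softmax(logits, dim=-1) @ v
```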
- He, Dailan, et al. TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding. ACM-MM 2021. [Paper]
- Huang, Shijia, et al. Multi-View Transformer for 3D Visual Grounding. CVPR 2022. [Paper] [Code]
  - Personal Notes:
    - Rotates the object center coordinates (xyz) to provide view-dependent positional information before the features go through a Transformer decoder; see the sketch below.
    - SOTA results on Nr3D and Sr3D, and good results on ScanRefer.
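A sketch of the rotation trick under my own assumptions (function name and shapes are hypothetical): generate one rotated copy of the object centers per view, rotating around the gravity (z) axis.

```python
import math
import torch

def multi_view_centers(centers: torch.Tensor, num_views: int = 4) -> torch.Tensor:
    """centers: (N, 3) object box centers -> (num_views, N, 3)."""
    views = []
    for i in range(num_views):
        theta = 2 * math.pi * i / num_views
        c, s = math.cos(theta), math.sin(theta)
        # Rotation about the z axis: the views differ only in heading.
        rot_z = torch.tensor([[c, -s, 0.0],
                              [s,  c, 0.0],
                              [0.0, 0.0, 1.0]], dtype=centers.dtype)
        views.append(centers @ rot_z.T)
    return torch.stack(views)
```

Each rotated copy supplies the positional input for one decoding view; aggregating the per-view predictions makes the result less sensitive to the (unknown) viewpoint from which the description was written.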
- Luo, Junyu, et al. 3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection. CVPR 2022, Oral. [Paper] [Code]
  - Personal Notes:
    - The first single-stage work in 3D Visual Grounding!
    - The general idea is similar to the iterative-shrinking work in 2D Visual Grounding, but the design is more elegant (see the sketch below).
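A sketch of the progressive-selection idea (hypothetical names and a deliberately simple language-conditioning scheme, not the paper's code): each stage scores points against the sentence feature and keeps only the top-k for the next stage.

```python
import torch
import torch.nn as nn

class ProgressiveSelector(nn.Module):
    def __init__(self, d_model=288):
        super().__init__()
        self.score = nn.Linear(d_model, 1)  # language-conditioned relevance

    def forward(self, points, feats, lang, keep_ratio=0.5):
        # points: (B, N, 3), feats: (B, N, d), lang: (B, d) pooled text feature
        logits = self.score(feats * lang[:, None, :]).squeeze(-1)  # (B, N)
        k = max(1, int(points.shape[1] * keep_ratio))
        idx = logits.topk(k, dim=1).indices                        # (B, k)
        gather = lambda t: t.gather(1, idx[..., None].expand(-1, -1, t.shape[-1]))
        return gather(points), gather(feats)  # next stage sees fewer points
```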
- Cai, Daigang, et al. 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022.
- ReferIt3D (Nr3D, Sr3D/Sr3D+): Achlioptas, Panos, et al. ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes. ECCV 2020, Oral. [Paper] [Code] [Website] [Leaderboard]
  - Dataset Statistics:
    - Natural Reference in 3D (Nr3D): human-written referential utterances collected via a two-player reference game.
    - Spatial Reference in 3D (Sr3D): synthetic utterances generated from spatial-relation templates.
- ScanRefer: Chen, Dave Zhenyu, et al. ScanRefer: 3D Object Localization in RGB-D Scans Using Natural Language. ECCV 2020. [Paper] [Code] [Website] [Leaderboard]
  - Dataset Statistics:
    - On average, each scene contains 13.81 objects and 64.48 descriptions, with 4.67 descriptions per object.
    - The average description length is 20.27 words. Frequency of object attributes in the descriptions: spatial language (98.7%), color (74.7%), shape terms (64.9%), and size information (14.2%).
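To sanity-check these numbers, a small script along the following lines should work; the file name and the field names (`scene_id`, `object_id`, `token`) are my recollection of the released annotation JSON, so treat them as assumptions:

```python
import json
from collections import defaultdict

with open("ScanRefer_filtered_train.json") as f:  # path is an assumption
    anns = json.load(f)

per_scene, per_object = defaultdict(int), defaultdict(int)
lengths = []
for a in anns:
    per_scene[a["scene_id"]] += 1
    per_object[(a["scene_id"], a["object_id"])] += 1
    lengths.append(len(a["token"]))  # tokenized description

print("descriptions/scene :", sum(per_scene.values()) / len(per_scene))
print("descriptions/object:", sum(per_object.values()) / len(per_object))
print("avg length (tokens):", sum(lengths) / len(lengths))
```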
- CVPR 2021 1st Workshop on Language for 3D Scenes. [Website]
- Azuma, Daichi, et al. ScanQA: 3D Question Answering for Spatial Scene Understanding. CVPR 2022. [Paper] [Code]
- Ma, Xiaojian and Yong, Silong, et al. SQA3D: Situated Question Answering in 3D Scenes. ICLR 2023. [Paper] [Data & Code]
- ScanQA: Azuma, Daichi, et al. ScanQA: 3D Question Answering for Spatial Scene Understanding. CVPR 2022. [Paper] [Data Preparation]
- SQA3D: Ma, Xiaojian and Yong, Silong, et al. SQA3D: Situated Question Answering in 3D Scenes. ICLR 2023. [Paper] [Data & Code]
Pending...