Official implementation of VSFormer: Visual-Spatial Fusion Transformer for Correspondence Pruning.
The paper has been accepted by AAAI 2024.
VSFormer identifies inliers and recovers camera poses accurately. First, highly abstract visual cues of a scene are obtained through cross-attention between the local features of the two-view images. These visual cues and the correspondences are then modeled by a joint visual-spatial fusion module, which embeds the visual cues into the correspondences for pruning. Additionally, to mine the consistency of correspondences, a novel module that combines a KNN-based graph with a transformer captures both local and global contexts.
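To make the KNN-based graph idea concrete, here is a minimal, dependency-free sketch of neighbour lookup over putative correspondences. It is an illustration only: `knn_graph` is a hypothetical helper, not a function from this repository, and it measures distance on raw 4-D coordinates `(x1, y1, x2, y2)`, which is an assumption made for simplicity and may differ from the feature space used in the paper.

```python
import math

def knn_graph(points, k):
    """Return, for each point, the indices of its k nearest neighbours.

    `points` is a list of equal-length coordinate tuples; in this sketch each
    tuple is one putative correspondence (x1, y1, x2, y2). A point is never
    listed as its own neighbour.
    """
    if not 0 < k < len(points):
        raise ValueError("k must satisfy 0 < k < len(points)")
    neighbours = []
    for i, p in enumerate(points):
        # Euclidean distance from point i to every other point j
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        neighbours.append([j for _, j in dists[:k]])
    return neighbours

# Toy data (made up): three mutually consistent matches and one far-away match
correspondences = [
    (0.0, 0.0, 0.1, 0.0),
    (0.1, 0.0, 0.2, 0.0),
    (0.0, 0.1, 0.1, 0.1),
    (5.0, 5.0, -3.0, 2.0),
]
graph = knn_graph(correspondences, k=2)
print(graph)
```

In this toy data the three consistent matches select each other as neighbours, while the distant match is never chosen by them. Neighbourhoods like this provide the local context that the transformer then complements with global context.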
We recommend using Anaconda or Miniconda. To set up the environment, follow the instructions below.
conda create -n vsformer python=3.8 --yes
conda activate vsformer
conda install pytorch==1.7.1 torchvision==0.8.2 cudatoolkit=11.0 -c pytorch --yes
python -m pip install -r requirements.txt
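After installation, a quick sanity check can confirm that the pinned versions are the ones actually in the active environment. The sketch below is not part of this repository: `check_environment` is a hypothetical helper, and it assumes the packages expose metadata under the distribution names `torch` and `torchvision`. If a conda build ships without that metadata, the helper reports `None` even though the package imports fine.

```python
from importlib import metadata

# Versions pinned by the install commands above; distribution names are assumed
EXPECTED = {"torch": "1.7.1", "torchvision": "0.8.2"}

def check_environment(expected=EXPECTED):
    """Map each package name to (installed_version_or_None, matches_expected)."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Builds may append a local tag such as "+cu110"; compare the base version
        ok = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, ok)
    return report

for name, (installed, ok) in check_environment().items():
    status = "OK" if ok else "MISMATCH or missing"
    print(f"{name}: installed={installed} expected={EXPECTED[name]} -> {status}")
```

Run it inside the activated `vsformer` environment. A mismatch usually means a different environment is active or a later install replaced the pinned build.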
Follow the instructions provided here for downloading and preprocessing datasets.
The packaged dataset should be placed in data_dump/, and the directory structure should be:
- If you have multiple GPUs, it is recommended to use
for training.
# train with multiple GPUs
CUDA_VISIBLE_DEVICES=0,1 nohup python -u -m torch.distributed.launch --nproc_per_node=2 --use_env >./logs/vsformer_yfcc.txt 2>&1 &
# train with a single GPU
nohup python -u >./logs/vsformer_yfcc.txt 2>&1 &
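The multi-GPU command above passes `--use_env` to `torch.distributed.launch`. With that flag the launcher does not add a `--local_rank` argument, so each process reads its placement from the `LOCAL_RANK`, `RANK` and `WORLD_SIZE` environment variables instead. The sketch below shows one way a script might read them. `read_launch_env` is a hypothetical helper, and whether this repository's training script does it this way is an assumption.

```python
import os

def read_launch_env(environ=os.environ):
    """Read the placement variables exported by torch.distributed.launch.

    Falls back to single-process defaults when the variables are absent,
    as happens when the script is started directly with `python`.
    """
    return {
        "local_rank": int(environ.get("LOCAL_RANK", 0)),
        "rank": int(environ.get("RANK", 0)),
        "world_size": int(environ.get("WORLD_SIZE", 1)),
    }

info = read_launch_env()
print(f"local_rank={info['local_rank']} rank={info['rank']} world_size={info['world_size']}")
```

Under the two-process launch, one process would see `local_rank=0` and the other `local_rank=1`. Note also that both commands redirect output into `./logs`, so that directory has to exist before they are run.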
- Evaluation
This repo benefits from OANet and CLNet. Thanks to their authors for the wonderful work.
If you find this work useful, please cite our paper:
title={VSFormer: Visual-Spatial Fusion Transformer for Correspondence Pruning},
author={Liao, Tangfei and Zhang, Xiaoqin and Zhao, Li and Wang, Tao and Xiao, Guobao},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},