The proposed method identifies visible (overlapping) image regions without requiring expensive feature detection and matching. It obtains patch-level embeddings from a Vision Transformer backbone and establishes patch-to-patch correspondences; a voting mechanism then aggregates these correspondences into overlap scores for candidate database images, providing a nuanced image retrieval metric in challenging scenarios.
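A minimal, illustrative sketch of this idea, assuming L2-normalized patch embeddings (function and threshold names are made up here; this is not the repository's actual implementation):

import torch

def overlap_score(query_patches, db_patches, sim_threshold=0.8):
    # query_patches: (N, D), db_patches: (M, D); both assumed L2-normalized.
    sim = query_patches @ db_patches.T        # patch-to-patch cosine similarities, (N, M)
    best, _ = sim.max(dim=1)                  # best database patch per query patch
    votes = (best > sim_threshold).float()    # each confidently matched patch casts a vote
    return votes.mean().item()                # fraction of voting patches = overlap score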
torch == 2.2.2
Python == 3.10.13
OmegaConf == 2.3.0
h5py == 3.11.0
tqdm == 4.66.2
faiss == 1.8.0
Here we visualize the patch matches found by the trained encoder, VOP, on one example image pair. No preparation is needed: the images and the model are downloaded automatically, so give it a try! Feel free to change the image paths to play with your own data.
Step 1. Dump the image pairs and save the ground-truth information (e.g., R, K) together with the pretrained DINOv2 [CLS] tokens and patch embeddings (e.g., 1024-dim for the large model; see the sketch after Step 3).
Step 2. Load the trained encoders to build our own embeddings (e.g., 256-dim), run the retrieval process ([CLS] tokens for prefiltering, VOP for reranking), and save the retrieved image pair list.
Step 3. Verify the retrieved image pairs by sending them to relative pose estimation, or to hloc for localization.
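For reference, a hedged sketch of how the pretrained DINOv2 [CLS] token and patch embeddings in Step 1 could be extracted via torch.hub; the actual dump_data.py may differ in preprocessing and storage.

import torch

# ViT-L/14 produces 1024-dim embeddings, matching the "large" setting mentioned in Step 1.
dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14').eval()

image = torch.rand(1, 3, 518, 518)            # dummy input; H and W must be multiples of 14
with torch.no_grad():
    feats = dinov2.forward_features(image)
cls_token = feats['x_norm_clstoken']          # (1, 1024) global descriptor for prefiltering
patch_tokens = feats['x_norm_patchtokens']    # (1, 1369, 1024) patch embeddings for matching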
Here are the instructions for testing each dataset used in our paper, and for testing your own data.
💥 Important: before dumping data, create/update the source directory entry for the specific dataset in dump_datasets/data_dirs.yaml.
dataset_dirs:
  inloc: <src_path>
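You can sanity-check the entry with OmegaConf (already in the dependency list); the key names below simply mirror the snippet above.

from omegaconf import OmegaConf

cfg = OmegaConf.load('dump_datasets/data_dirs.yaml')
print(cfg.dataset_dirs.inloc)   # should print the <src_path> you configured for InLoc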
[InLoc]
- download the cutouts (database images) and arrange them under database/cutouts/; download the query images into query/iphone7/.
- dump the data and perform image retrieval to get the most overlapping image list (top-40 on InLoc; a FAISS prefilter sketch follows this section).
python dump_data.py -ds inloc
python retrieve.py -ds inloc -k 40 -m 09 -v 3 -r 0.3 -pre 100 -cls 1
- install and run hloc to localize the query images.
python inloc_localization.py --loc_pairs outputs/inloc/09/cls_100/top40_overlap_pairs_w_auc.txt -m 09 -ds inloc
- submit the resulting poses to the long-term visual localization benchmark.
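The -pre 100 -cls 1 options above correspond to prefiltering the database with global [CLS] descriptors before VOP reranking. A minimal FAISS sketch of such a prefilter (function and variable names are illustrative, not the repository's API):

import faiss

def cls_prefilter(query_cls, db_cls, top_n=100):
    # query_cls: (Q, D), db_cls: (N, D); contiguous float32 numpy arrays of [CLS] descriptors.
    faiss.normalize_L2(query_cls)                    # cosine similarity via inner product
    faiss.normalize_L2(db_cls)
    index = faiss.IndexFlatIP(db_cls.shape[1])
    index.add(db_cls)
    _, candidates = index.search(query_cls, top_n)   # (Q, top_n) db indices to rerank with VOP
    return candidates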
[MegaDepth]
- download the data from glue-factory: images, scene_info.
- dump the data and perform image retrieval to get the most overlapping image list.
python dump_data.py -ds megadepth
python register.py -k 5 -m 09 -v 4 -r 0.2 -pre 20 -cls -ds megadepth
- run RANSAC on those pairs to estimate relative poses (an OpenCV sketch follows this section).
python relative_pose.py -k 5 -m 09 -v 4 -r 0.2 -pre 20 -cls -ds megadepth
- optional tests: recall@1, 5, 10.
python recall.py -k 5 -m 09 -v 4 -r 0.2 -pre 20 -cls -ds megadepth
Note: use -v 4 -r 0.2 for recall@10; -v 0 -r 0.01 for recall@1.
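relative_pose.py is the repository's script; purely for reference, the sketch below shows one common way to run "RANSAC on those pairs" with OpenCV on already-matched keypoints (the paper's exact estimator and thresholds may differ).

import cv2
import numpy as np

def estimate_relative_pose(pts0, pts1, K0, K1, pix_thresh=1.0):
    # pts0, pts1: (N, 2) matched pixel coordinates; K0, K1: 3x3 intrinsics.
    pts0n = cv2.undistortPoints(pts0.reshape(-1, 1, 2).astype(np.float64), K0, None).reshape(-1, 2)
    pts1n = cv2.undistortPoints(pts1.reshape(-1, 1, 2).astype(np.float64), K1, None).reshape(-1, 2)
    # the threshold is expressed in normalized coordinates, hence the division by the focal length
    E, inliers = cv2.findEssentialMat(pts0n, pts1n, np.eye(3), method=cv2.RANSAC,
                                      prob=0.999, threshold=pix_thresh / K0[0, 0])
    _, R, t, _ = cv2.recoverPose(E, pts0n, pts1n, np.eye(3), mask=inliers)
    return R, t, inliers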
[ETH3D]
- download ETH3D data (5.6G).
- dump the data and perform image retrieval to get the most overlapping image list.
python dump_data.py -ds eth3d
python register.py -k 5 -m 09 -v 3 -r 0.3 -pre 20 -cls -ds eth3d
- run RANSAC on those pairs to estimate relative poses.
python relative_pose.py -k 5 -m 09 -v 3 -r 0.3 -pre 20 -cls -ds eth3d
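For context, relative pose accuracy in this setting is typically summarized as AUC over angular rotation and translation errors; the standard error measures look roughly as follows (the repository's exact evaluation code may differ).

import numpy as np

def pose_errors(R_gt, t_gt, R_est, t_est):
    # Angular rotation error (degrees) between ground-truth and estimated rotations.
    cos_r = np.clip((np.trace(R_gt.T @ R_est) - 1.0) / 2.0, -1.0, 1.0)
    rot_err = np.degrees(np.arccos(cos_r))
    # Angular error (degrees) between translation directions, up to sign.
    cos_t = np.dot(t_gt.ravel(), t_est.ravel()) / (
        np.linalg.norm(t_gt) * np.linalg.norm(t_est) + 1e-8)
    trans_err = np.degrees(np.arccos(np.clip(abs(cos_t), -1.0, 1.0)))
    return rot_err, trans_err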
[Your own data]
- specify the data directory of your data in data_dirs.yaml, and add the dump script here to load the images, the scene information (K, pose, etc.), and, if needed, the query and database image lists (a hypothetical storage sketch follows this section).
- run retrieve.py to retrieve database images for the queries when there is a query/database split; use register.py when each image in the pool is retrieved against all the others.
- run relative_pose.py for relative pose estimation, or inloc_localization.py to localize the queries with the retrieved database images.
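As a purely hypothetical illustration of the per-image information a dump script typically collects (the file name and layout below are made up; the repository's actual storage format may differ), h5py from the dependency list could be used like this:

import h5py
import numpy as np

# hypothetical geometry for two images: intrinsics K and a world-to-camera pose (R, t)
scene_info = {
    'img_0001.jpg': {'K': np.eye(3), 'R': np.eye(3), 't': np.zeros(3)},
    'img_0002.jpg': {'K': np.eye(3), 'R': np.eye(3), 't': np.zeros(3)},
}

with h5py.File('my_dataset_geometry.h5', 'w') as f:      # made-up output file name
    for name, info in scene_info.items():
        grp = f.create_group(name)
        for key, value in info.items():
            grp.create_dataset(key, data=value)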
[Training]
- download the MegaDepth depth maps to build the training supervision from here.
- customize the configs and start training.
python -m gluefactory.train 09 --conf train_configs/09.yaml
The training is based on glue-factory; below we detail the configuration options we focus on.
data:
  # choose the data augmentation type: 'flip', 'dark', 'lightglue'
  photometric: {
    "name": "flip",
    "p": 0.95,
    # 'difficulty': 1.0, # currently unused
  }
model:
  matcher:
    name: overlap_predictor # our model
    add_voting_head: true # whether to train with the contrastive loss on the patch-level positive/negative matches (see the loss sketch below)
    add_cls_tokens: false # whether to train the global embeddings
    attentions: false # whether to use the attentions for supervision
    input_dim: 1024 # the dimension of the pretrained DINOv2 features
train:
  dropout_prob: 0.5 # dropout probability
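For intuition about add_voting_head, here is a minimal InfoNCE-style sketch of a contrastive loss over patch embeddings with known positive matches; the actual loss used by overlap_predictor may differ.

import torch
import torch.nn.functional as F

def patch_contrastive_loss(desc0, desc1, pos_idx, temperature=0.07):
    # desc0: (N, D) query patch embeddings; desc1: (M, D) database patch embeddings;
    # pos_idx: (N,) long tensor, index in desc1 of the ground-truth match of each query patch.
    desc0 = F.normalize(desc0, dim=-1)
    desc1 = F.normalize(desc1, dim=-1)
    logits = desc0 @ desc1.t() / temperature   # (N, M) similarities; non-matches act as negatives
    return F.cross_entropy(logits, pos_idx)    # softmax over database patches per query patch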
[Useful configs]
--radius, the radius for the radius-based kNN search.
--cls, default=False, boolean flag; whether to use [CLS] tokens as a prefilter.
--pre_filter, default=20, the number of database images prefiltered for reranking.
--weighted, default=True, boolean flag; whether to use TF-IDF weights for the voting scores (a rough sketch follows this list).
--vote, the voting method to use.
--k, the number of top-k retrievals.
--overwrite, default=False, boolean flag; overwrite the dumped data, retrieved image lists, relative poses, etc.
--num_workers, default=8, change it to fit your machine.
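To illustrate --weighted, a rough sketch of TF-IDF-style weighting for the votes: patches assigned to visual words that occur in many database images are down-weighted. The word assignment is assumed here to come from some quantization of the patch embeddings, and the exact weighting in the code may differ.

import numpy as np

def idf_weights(db_word_ids, num_words):
    # db_word_ids: one array of visual-word ids per database image.
    num_images = len(db_word_ids)
    df = np.zeros(num_words)
    for words in db_word_ids:
        df[np.unique(words)] += 1.0                      # document frequency per word
    return np.log(num_images / np.maximum(df, 1.0))      # rare words receive larger weights

def weighted_vote_score(query_words, matched_db_words, idf):
    # sum the IDF weights over query patches whose word agrees with their matched db patch
    agree = query_words == matched_db_words
    return float(idf[query_words[agree]].sum())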
[Acknowledgement]
- glue-factory
- the long-term visual localization benchmark
- pre-commit