why out_l = torch.cat((out_l_1, out_l), dim=1) ? #11

Open

wpumain opened this issue Mar 29, 2023 · 3 comments
@wpumain commented Mar 29, 2023

`out_l = torch.cat((out_l_1, out_l), dim=1)`

This operation, `out_l = torch.cat((out_l_1, out_l), dim=1)`, uses a concept similar to ResNet's skip connections, where features from earlier layers are combined with features from later layers. Right?

Why not directly perform element-wise addition, which would also maintain the data dimensionality?

Also, why isn't there a similar skip connection for the output of `transformer_encoder2`?

@BIT-MJY (Owner) commented Apr 9, 2023

> Why not directly perform element-wise addition, which would also maintain the data dimensionality?

Actually, we explicitly concatenate the features along the channel dimension to increase the embedding dimension of the sentence-like input. We think this operation improves the distinguishability of the spatial features, and the experimental results support this.
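For intuition, here is a minimal sketch of the difference (the shapes below are made up for illustration, not taken from SeqOT): concatenation along `dim=1` grows the channel/embedding dimension, while element-wise addition keeps it fixed.

```python
import torch

# Two feature maps of identical shape (batch, channels, length);
# the sizes here are illustrative placeholders.
out_l_1 = torch.randn(4, 256, 900)
out_l = torch.randn(4, 256, 900)

# Concatenation along dim=1 stacks the channels: 256 + 256 = 512.
cat_out = torch.cat((out_l_1, out_l), dim=1)
print(cat_out.shape)  # torch.Size([4, 512, 900])

# Element-wise addition keeps the channel dimension at 256.
add_out = out_l_1 + out_l
print(add_out.shape)  # torch.Size([4, 256, 900])
```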

> Also, why isn't there a similar skip connection for the output of `transformer_encoder2`?

Nice question. We have chosen not to use the concat-based skip connection in order to optimize running efficiency. However, our experiments have shown that the addition-based skip connection actually results in worse place recognition performance, and this warrants further analysis. It's possible that the triplet loss function provides a relatively "soft" constraint for training the place recognition network. Although the addition-based skip connection can accelerate the reduction of triplet loss values during training, it may not lead to a corresponding increase in recall on the test set.
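For reference, the "soft" constraint mentioned above refers to the ranking nature of the triplet loss: it only asks that a query be closer to a same-place sample than to a different-place sample by some margin. Below is a generic PyTorch setup (the margin and descriptor size are arbitrary, not SeqOT's actual training configuration):

```python
import torch
import torch.nn as nn

# Triplet margin loss: a relative (ranking) constraint, not a hard
# regression target, hence relatively "soft" for training.
triplet_loss = nn.TripletMarginLoss(margin=0.5, p=2)

anchor = torch.randn(8, 256, requires_grad=True)    # query place descriptors
positive = torch.randn(8, 256, requires_grad=True)  # same-place descriptors
negative = torch.randn(8, 256, requires_grad=True)  # different-place descriptors

loss = triplet_loss(anchor, positive, negative)
loss.backward()
```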

Btw, thanks for your interest in our work. You can also follow our latest place recognition work, CVTNet, which provides even better recognition results.

@wpumain (Author) commented Apr 9, 2023

Thank you for your help. I have learned a lot from your excellent SeqOT. May I ask three more questions?
1. The NetVLAD you used is not the original NetVLAD method, right? You did not perform clustering when computing NetVLAD, so what is the basis for your calculation?
2. In general VPR tasks, the channel dimension C is usually increased or at least maintained during feature aggregation after the backbone. In SeqOT, the tensor entering NetVLAD has shape (7, 512, 2700, 1), and NetVLAD then reduces it directly from (7, 32768 = 512 × 64) to (7, 256). That is, after aggregation the channel dimension decreases (from 512 to 256) rather than increasing, which differs from typical VPR pipelines (see the sketch after this list). Intuitively, higher-dimensional features represent information more precisely and should make it easier to reach SOTA; likewise, when several feature vectors are fused into fewer vectors, one would expect the fused vector's dimension to grow so it can carry the information of all the vectors being fused. Yet you achieved SOTA this way. How should this be understood?
3. How was Fig. 5 (the t-SNE visualization of place clustering) generated? Where did the data for OT, Output of MSM, and SeqOT come from?
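To make the dimension flow in question 2 concrete, here is a sketch using the numbers from the question (the aggregation output and the projection layer are generic stand-ins, not the actual SeqOT modules):

```python
import torch

B, C, N = 7, 512, 2700 * 1   # tensor entering NetVLAD: (7, 512, 2700, 1)
K, D_out = 64, 256           # 64 VLAD clusters, 256-dim final descriptor

# A NetVLAD-style aggregation produces one C-dim residual per cluster,
# flattened to (B, K * C) = (7, 32768).
vlad = torch.randn(B, K * C)  # stand-in for the aggregated descriptor

# A final projection reduces 32768 -> 256, so the channel dimension
# shrinks after aggregation instead of growing.
proj = torch.nn.Linear(K * C, D_out)
descriptor = proj(vlad)
print(descriptor.shape)       # torch.Size([7, 256])
```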

@wpumain (Author) commented Apr 9, 2023

The NetVLAD aggregation you use is more convenient than the original NetVLAD algorithm because it does not require caching the features extracted by the model backbone, and therefore needs no clustering over those features. However, what is the mathematical basis for your approach?
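For context, a "clustering-free" NetVLAD is commonly written with the cluster centers as ordinary learnable parameters trained end-to-end, so no k-means initialization is needed; the math is still the original soft-assignment VLAD, only the initialization differs. The sketch below follows that common open-source formulation and is an assumption for illustration, not necessarily the exact SeqOT code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Generic NetVLAD layer with learnable cluster centers and no
    k-means initialization (illustrative sketch, not the SeqOT code)."""

    def __init__(self, num_clusters=64, dim=512):
        super().__init__()
        self.num_clusters = num_clusters
        # 1x1 conv computes soft-assignment logits of each local
        # descriptor to each cluster.
        self.conv = nn.Conv2d(dim, num_clusters, kernel_size=1, bias=True)
        # Cluster centers start random and are trained by the task loss
        # via backpropagation instead of being set by clustering.
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x):  # x: (B, C, H, W)
        B, C = x.shape[:2]
        # Soft assignment over clusters for each of the N = H*W locations.
        soft_assign = F.softmax(self.conv(x).view(B, self.num_clusters, -1), dim=1)
        x_flat = x.view(B, C, -1)  # (B, C, N)
        # Residuals between each local descriptor and each center,
        # weighted by the soft assignment and summed over locations.
        residual = x_flat.unsqueeze(1) - self.centroids.view(1, self.num_clusters, C, 1)
        vlad = (residual * soft_assign.unsqueeze(2)).sum(dim=-1)  # (B, K, C)
        vlad = F.normalize(vlad, p=2, dim=2)   # intra-normalization
        vlad = vlad.view(B, -1)                # flatten to (B, K * C)
        return F.normalize(vlad, p=2, dim=1)   # final L2 normalization
```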
