Many detailed questions #13
Hi, thanks for your questions. Here are my responses.
It may happen that the network ignores one or more of the additional inputs and relies on the others during training. To prevent that, we use permutation-invariant pooling.
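To make the idea concrete, here is a minimal numpy sketch of permutation-invariant set pooling, in the spirit of the Aittala et al.-style approach mentioned above. This is not the paper's actual implementation; the feature shapes and the choice of element-wise max are illustrative assumptions (mean pooling works the same way).

```python
import numpy as np

# Hypothetical per-image feature vectors produced by the encoders for
# m additional images: m vectors of dimension d.
m, d = 7, 16
rng = np.random.default_rng(0)
features = rng.standard_normal((m, d))

def permutation_invariant_pool(feats):
    """Pool a set of per-image features into a single vector whose value
    does not depend on the order of the inputs (element-wise max here;
    mean or sum would be equally order-insensitive)."""
    return feats.max(axis=0)

pooled = permutation_invariant_pool(features)

# Shuffling the images leaves the pooled feature unchanged, so the
# network cannot key on any particular input slot.
shuffled = features[rng.permutation(m)]
assert np.allclose(pooled, permutation_invariant_pool(shuffled))
```

Because the pooled vector is identical for every ordering, downstream layers see one aggregate descriptor of the set rather than a fixed tuple of slots, which is what discourages the network from latching onto (or ignoring) any single additional input.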
The value of 'm' affects our network architecture: we have 'm' encoders. So when we say, for example, m=7, that means we have 7 encoders, and of course that 6 additional images are used for training and testing.
m = 1 means that only the query image is used as an input, with no additional images.
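The m-encoder setup described above can be sketched roughly as follows. This is a hypothetical illustration, not the paper's code: the toy linear encoders, the dimensions, and the mean fusion of encoder outputs are all assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_feat = 32, 8  # hypothetical flattened-image and feature sizes

def make_encoder():
    # Stand-in for a learned image encoder: a random linear map + tanh.
    W = rng.standard_normal((d_in, d_feat))
    return lambda x: np.tanh(x @ W)

def forward(query, additional, query_enc, extra_encs):
    """One query encoder plus one encoder per additional image; the
    per-image features are fused before any downstream decoding."""
    feats = [query_enc(query)]
    feats += [enc(img) for enc, img in zip(extra_encs, additional)]
    return np.stack(feats).mean(axis=0)

m = 7  # m encoders total: 1 query encoder + 6 for additional images
query_enc = make_encoder()
extra_encs = [make_encoder() for _ in range(m - 1)]

query = rng.standard_normal(d_in)
additional = [rng.standard_normal(d_in) for _ in range(m - 1)]
out = forward(query, additional, query_enc, extra_encs)

# m = 1 degenerates to using only the query encoder:
out_single = forward(query, [], query_enc, [])
```

The sketch also shows why larger m grows model capacity: each extra image brings its own encoder, so m=13 carries nearly twice the encoder parameters of m=7.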
This needs more investigation, but one likely reason is that m=13, for example, requires 13 encoders, which increases our model capacity and thus leads to some overfitting.
Fine-tuning on the testing set would definitely help and may lead to better results. But the goal of our paper is to avoid any further training/tuning, kind of taking on a challenge. In practice, though, I would argue that fine-tuning on a small set is still feasible and may lead to better results.
Thanks for your great work! It has indeed sparked a lot of inspiration for me. However, there are several aspects I would like to discuss further:
The paper mentioned: "To allow the network to reason about the set of additional input images in a way that is insensitive to their ordering, we adopt the permutation invariant pooling approach of Aittala et al."
1. Could you elaborate on why insensitivity to ordering is crucial? Specifically, I'm curious whether a sufficiently large training dataset would inherently cover all potential orderings.
Regarding the number of additional unlabeled images (m), it appears that they are used in both the training and testing stages. From the ablation study, it seems that the various values of m were only tested on the test camera, as illustrated in Table 4. I have a question about this:
2. During the training process, did you experiment with varying quantities for 'm', or was there a consistent fixed number applied throughout, for example, 8?
When m equals 1, I understand that this means only the query image is used during testing. If so, my question is:
3. Could you clarify whether m=1 only signifies the zero-shot condition, i.e., just inferring, or does it mean that the single query image is used for self-calibration, followed by parameter fixation, and then inference?
4. From the results shown in Table 4, it doesn't seem that the results improve as m increases (i.e., error(m=13) > error(m=7)). Could you provide some insights into this?
5. Have you considered using additional labeled images for fine-tuning? If so, would this lead to better results than the current method?
Thank you for taking the time to answer these questions. Your responses will be greatly beneficial to my understanding.