Few questions about your implementations #18

JGuillaumin opened this issue Nov 9, 2024 · 1 comment

Comments

@JGuillaumin

Hi,
First, thank you very much for your work. It adds a huge improvement to the DETR family.
Your paper is also really well explained and written.
Thank you as well for publishing your code & models; it was very easy to run.

I have a few questions about the implementation:

  • What is the difference between models/ and impl_a/models/? (I compared a few files and only identified some typo changes, but I don't want to miss anything.)
  • Are the model and the training process compatible with fp16 precision?
  • In DeformableAttention, do you use reference points or bbox references? (Is reference_points (bs, len_q, n_levels, 2) or (bs, len_q, n_levels, 4)?)
  • What is the role of self.im2col_step = 64 in MSDeformAttn?
  • Also, in class MSDeformAttn(nn.Module):
attention_weights = self.attention_weights(query).view(N, Len_q, self.n_heads, self.n_levels * self.n_points)
attention_weights = F.softmax(attention_weights, -1).view(N, Len_q, self.n_heads, self.n_levels, self.n_points)

From what I understand of the attention mechanism, we normally have, schematically, attention_weights = softmax(dot_product(proj_q(Q), proj_k(K))), but here we have attention_weights = softmax(proj_q(Q)), where proj_q is self.attention_weights = nn.Linear(d_model, n_heads * n_levels * n_points).

@ZhaoChuyang
Collaborator

Hi,

Thank you for your interest in our work. To answer your questions:

What is the difference between models/ and impl_a/models/?

impl_a is implementation (a) of mixed supervision, as illustrated in Figure 4 (a) of our paper. The main difference lies in deformable_detr.py (L122-L125, L198-L219): for impl_a we do not change the architecture of the decoder layers and instead add auxiliary predictors for the one-to-many predictions. More details are available in Section 3.3 of our paper.
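For illustration only, here is a minimal sketch of what "keep the decoder unchanged, attach auxiliary one-to-many predictors" can look like; the class and attribute names below are hypothetical and not the ones used in this repository:

import torch.nn as nn

# Hypothetical sketch (names are illustrative, not from this repo): the decoder
# is shared, and a second head provides the auxiliary one-to-many predictions.
class MixedSupervisionHeads(nn.Module):
    def __init__(self, d_model=256, num_classes=91):
        super().__init__()
        # one-to-one head, trained with the usual Hungarian (one-to-one) matching
        self.o2o_class_embed = nn.Linear(d_model, num_classes)
        self.o2o_bbox_embed = nn.Linear(d_model, 4)
        # auxiliary one-to-many head, trained with one-to-many assignments
        self.o2m_class_embed = nn.Linear(d_model, num_classes)
        self.o2m_bbox_embed = nn.Linear(d_model, 4)

    def forward(self, hs):
        # hs: (bs, num_queries, d_model) decoder output, architecture unchanged
        o2o = (self.o2o_class_embed(hs), self.o2o_bbox_embed(hs).sigmoid())
        o2m = (self.o2m_class_embed(hs), self.o2m_bbox_embed(hs).sigmoid())
        return o2o, o2m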

Are the model and the training process compatible with fp16 precision?

We did not test it under fp16 precision. It depends on whether the MS-Deform operators, which we borrowed directly from Deformable-DETR, support fp16. You can check the original implementation repository; as far as I know, some third-party implementations have added fp16 support for the MS-Deform operators.
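If you do want to experiment with mixed precision, the standard PyTorch AMP recipe is sketched below; this is untested with our models and only covers the Python side. If the MS-Deform CUDA op has no fp16 kernels, you would additionally have to cast its inputs (value, sampling_locations, attention_weights) back to float32 inside MSDeformAttn before calling the custom function.

import torch
import torch.nn as nn

# Untested sketch of an AMP (fp16) training step; the model here is a stand-in.
model = nn.Linear(256, 91).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(2, 100, 256, device="cuda")      # dummy input batch
target = torch.randn(2, 100, 91, device="cuda")  # dummy regression target

optimizer.zero_grad()
with torch.cuda.amp.autocast():                  # most layers run in fp16
    loss = nn.functional.mse_loss(model(x), target)
scaler.scale(loss).backward()                    # scale to avoid fp16 underflow
scaler.step(optimizer)
scaler.update()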

In DeformableAttention, do you use reference points or bbox references?

We use the reference points.
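For context, the last dimension of reference_points is what decides how the sampling locations are computed. The function below paraphrases the corresponding logic from Deformable-DETR's MSDeformAttn.forward (written from memory, so treat it as a sketch rather than the exact upstream code):

import torch

def make_sampling_locations(reference_points, sampling_offsets, spatial_shapes, n_points):
    # reference_points: (bs, len_q, n_levels, 2) points or (bs, len_q, n_levels, 4) boxes
    # sampling_offsets: (bs, len_q, n_heads, n_levels, n_points, 2)
    # spatial_shapes:   (n_levels, 2) giving (H, W) of each feature level
    if reference_points.shape[-1] == 2:
        # point references: offsets are normalized by each level's (W, H)
        normalizer = torch.stack([spatial_shapes[..., 1], spatial_shapes[..., 0]], -1)
        return reference_points[:, :, None, :, None, :] \
            + sampling_offsets / normalizer[None, None, None, :, None, :]
    elif reference_points.shape[-1] == 4:
        # box references: offsets are scaled by half the box width/height
        return reference_points[:, :, None, :, None, :2] \
            + sampling_offsets / n_points * reference_points[:, :, None, :, None, 2:] * 0.5
    raise ValueError("reference_points must have a last dimension of 2 or 4")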

What is the role of self.im2col_step = 64 in MSDeformAttn?

The im2col_step may relate to memory-efficiency tricks in the tensor operations implemented by Deformable-DETR; I did not investigate it in detail.
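For what it is worth, im2col_step is only forwarded to the custom CUDA function; as far as I recall it controls how many batch elements the kernel processes per launch (so the batch size should be divisible by it), but please verify this against the Deformable-DETR sources. A paraphrased call site, not runnable on its own:

# Inside MSDeformAttn.forward (paraphrased from Deformable-DETR, unverified):
output = MSDeformAttnFunction.apply(
    value,                    # (bs, len_in, n_heads, d_model // n_heads)
    input_spatial_shapes,     # (n_levels, 2)
    input_level_start_index,  # (n_levels,)
    sampling_locations,       # (bs, len_q, n_heads, n_levels, n_points, 2)
    attention_weights,        # (bs, len_q, n_heads, n_levels, n_points)
    self.im2col_step,         # batch chunk size handed to the CUDA kernel
)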

The attention weights computed in MSDeformAttn are indeed different from the vanilla attention operation in PyTorch: the attention weights are computed from the query and the reference points rather than from a query-key dot product, as described in the Deformable-DETR paper, which you can check for more details.
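To make the contrast concrete, here is a small self-contained sketch (illustrative shapes only) comparing vanilla dot-product attention weights with the query-only weight prediction used in deformable attention:

import torch
import torch.nn.functional as F

bs, len_q, len_k, d_model = 2, 100, 300, 256
n_heads, n_levels, n_points = 8, 4, 4

q = torch.randn(bs, len_q, d_model)
k = torch.randn(bs, len_k, d_model)

# Vanilla attention: weights come from a query-key dot product.
vanilla_weights = F.softmax(q @ k.transpose(1, 2) / d_model ** 0.5, dim=-1)
# -> (bs, len_q, len_k)

# Deformable attention: weights are predicted from the query alone by a linear
# layer and normalized over the n_levels * n_points sampled locations, so no
# dot product with keys is needed.
weight_proj = torch.nn.Linear(d_model, n_heads * n_levels * n_points)
deform_weights = F.softmax(
    weight_proj(q).view(bs, len_q, n_heads, n_levels * n_points), dim=-1
).view(bs, len_q, n_heads, n_levels, n_points)
# -> (bs, len_q, n_heads, n_levels, n_points)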

I hope this clears things up. Feel free to reach out if you have any other questions.
