Few questions about your implementations #18

JGuillaumin opened this issue Nov 9, 2024 · 1 comment

Comments

@JGuillaumin

Hi,
First, thank you very much for your work. It adds a huge improvement to the DETR family.
Your paper is also really well explained and written.
Thank you as well for publishing your code & models; it was very easy to run.

I have a few questions about the implementation:

  • What is the difference between models/ and impl_a/models/? (I compared a few files and only identified some typo changes, but I don't want to miss anything.)
  • Are the model and the training process compatible with fp16 precision?
  • In DeformableAttention, do you use reference points or bbox references? (Is reference_points (bs, len_q, n_levels, 2) or (bs, len_q, n_levels, 4)?)
  • What is the role of self.im2col_step = 64 in MSDeformAttn?
  • Also, in class MSDeformAttn(nn.Module):
attention_weights = self.attention_weights(query).view(N, Len_q, self.n_heads, self.n_levels * self.n_points)
attention_weights = F.softmax(attention_weights, -1).view(N, Len_q, self.n_heads, self.n_levels, self.n_points)

From what I understand of the attention mechanism, we normally have, schematically, attention_weights = softmax(dot_product(proj_q(Q), proj_k(K))), but here we have attention_weights = softmax(proj_q(Q)), where proj_q is self.attention_weights = nn.Linear(d_model, n_heads * n_levels * n_points).

@ZhaoChuyang
Collaborator

Hi,

Thank you for your interest in our work. To answer your questions:

What is the difference between models/ and impl_a/models/?

impl_a is implementation (a) of mixed supervision, as illustrated in Figure 4 (a) of our paper. The main difference lies in deformable_detr.py (L122-L125, L198-L219): for impl_a we do not change the architecture of the decoder layers and instead add auxiliary predictors for the one-to-many predictions. More details are available in Section 3.3 of our paper.
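For illustration only, here is a minimal sketch of what "keep the decoder unchanged, attach auxiliary one-to-many predictors" can look like; the class and attribute names below are hypothetical and not the ones used in this repository:

import torch.nn as nn

# Hypothetical sketch (names are illustrative, not from this repo): the decoder
# is shared, and a second head provides the auxiliary one-to-many predictions.
class MixedSupervisionHeads(nn.Module):
    def __init__(self, d_model=256, num_classes=91):
        super().__init__()
        # one-to-one head, trained with the usual Hungarian (one-to-one) matching
        self.o2o_class_embed = nn.Linear(d_model, num_classes)
        self.o2o_bbox_embed = nn.Linear(d_model, 4)
        # auxiliary one-to-many head, trained with one-to-many assignments
        self.o2m_class_embed = nn.Linear(d_model, num_classes)
        self.o2m_bbox_embed = nn.Linear(d_model, 4)

    def forward(self, hs):
        # hs: (bs, num_queries, d_model) decoder output, architecture unchanged
        o2o = (self.o2o_class_embed(hs), self.o2o_bbox_embed(hs).sigmoid())
        o2m = (self.o2m_class_embed(hs), self.o2m_bbox_embed(hs).sigmoid())
        return o2o, o2m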

Are the model and the training process compatible with fp16 precision?

We did not test it under fp16 precision. It depends on whether the MS-Deform operators, which we borrowed directly from Deformable-DETR, support fp16. You can check the original implementation repository; as far as I know, some third-party implementations have added fp16 support for the MS-Deform operators.
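If you do want to experiment with mixed precision, the standard PyTorch AMP recipe is sketched below; this is untested with our models and only covers the Python side. If the MS-Deform CUDA op has no fp16 kernels, you would additionally have to cast its inputs (value, sampling_locations, attention_weights) back to float32 inside MSDeformAttn before calling the custom function.

import torch
import torch.nn as nn

# Untested sketch of an AMP (fp16) training step; the model here is a stand-in.
model = nn.Linear(256, 91).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(2, 100, 256, device="cuda")      # dummy input batch
target = torch.randn(2, 100, 91, device="cuda")  # dummy regression target

optimizer.zero_grad()
with torch.cuda.amp.autocast():                  # most layers run in fp16
    loss = nn.functional.mse_loss(model(x), target)
scaler.scale(loss).backward()                    # scale to avoid fp16 underflow
scaler.step(optimizer)
scaler.update()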

In DeformableAttention, do you use reference points or bbox references?

We use the reference points.
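For context, the last dimension of reference_points is what decides how the sampling locations are computed. The function below paraphrases the corresponding logic from Deformable-DETR's MSDeformAttn.forward (written from memory, so treat it as a sketch rather than the exact upstream code):

import torch

def make_sampling_locations(reference_points, sampling_offsets, spatial_shapes, n_points):
    # reference_points: (bs, len_q, n_levels, 2) points or (bs, len_q, n_levels, 4) boxes
    # sampling_offsets: (bs, len_q, n_heads, n_levels, n_points, 2)
    # spatial_shapes:   (n_levels, 2) giving (H, W) of each feature level
    if reference_points.shape[-1] == 2:
        # point references: offsets are normalized by each level's (W, H)
        normalizer = torch.stack([spatial_shapes[..., 1], spatial_shapes[..., 0]], -1)
        return reference_points[:, :, None, :, None, :] \
            + sampling_offsets / normalizer[None, None, None, :, None, :]
    elif reference_points.shape[-1] == 4:
        # box references: offsets are scaled by half the box width/height
        return reference_points[:, :, None, :, None, :2] \
            + sampling_offsets / n_points * reference_points[:, :, None, :, None, 2:] * 0.5
    raise ValueError("reference_points must have a last dimension of 2 or 4")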

What is the role of self.im2col_step = 64 in MSDeformAttn?

The im2col_step may relate to memory-efficiency tricks in the tensor operations implemented by Deformable-DETR; I did not investigate it in detail.
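For what it is worth, im2col_step is only forwarded to the custom CUDA function; as far as I recall it controls how many batch elements the kernel processes per launch (so the batch size should be divisible by it), but please verify this against the Deformable-DETR sources. A paraphrased call site, not runnable on its own:

# Inside MSDeformAttn.forward (paraphrased from Deformable-DETR, unverified):
output = MSDeformAttnFunction.apply(
    value,                    # (bs, len_in, n_heads, d_model // n_heads)
    input_spatial_shapes,     # (n_levels, 2)
    input_level_start_index,  # (n_levels,)
    sampling_locations,       # (bs, len_q, n_heads, n_levels, n_points, 2)
    attention_weights,        # (bs, len_q, n_heads, n_levels, n_points)
    self.im2col_step,         # batch chunk size handed to the CUDA kernel
)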

The attention weights computed in MSDeformAttn are indeed different from the vanilla attention operation in PyTorch: the attention weights are computed from the query and the reference points rather than from a query-key dot product, as described in the Deformable-DETR paper, which you can check for more details.
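To make the contrast concrete, here is a small self-contained sketch (illustrative shapes only) comparing vanilla dot-product attention weights with the query-only weight prediction used in deformable attention:

import torch
import torch.nn.functional as F

bs, len_q, len_k, d_model = 2, 100, 300, 256
n_heads, n_levels, n_points = 8, 4, 4

q = torch.randn(bs, len_q, d_model)
k = torch.randn(bs, len_k, d_model)

# Vanilla attention: weights come from a query-key dot product.
vanilla_weights = F.softmax(q @ k.transpose(1, 2) / d_model ** 0.5, dim=-1)
# -> (bs, len_q, len_k)

# Deformable attention: weights are predicted from the query alone by a linear
# layer and normalized over the n_levels * n_points sampled locations, so no
# dot product with keys is needed.
weight_proj = torch.nn.Linear(d_model, n_heads * n_levels * n_points)
deform_weights = F.softmax(
    weight_proj(q).view(bs, len_q, n_heads, n_levels * n_points), dim=-1
).view(bs, len_q, n_heads, n_levels, n_points)
# -> (bs, len_q, n_heads, n_levels, n_points)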

I hope this clears things up. Feel free to reach out if you have any other questions.
