[Bug] Mysterious Dimension Swapping in BEVFusion's TransfusionHead #3020
Comments
In addition, reverting this swap, i.e. initializing the heatmap with shape [feature_map_size[0], feature_map_size[1]], makes it possible to train a model. |
Are you training a camera-only model? |
No, I’m dealing with LiDAR-only and LC-fusion models. But this bug will also be present in camera-only BEVFusion, as the vtransform outputs a BEV feature map with the same spatial shape as the SCN output. |
In fact, this operation is present in the original code of the transfusion head in BEVFusion. However, mit-bevfusion and the BEVFusion from NeurIPS 2022 differ in the final step of vtransform: the X and Y outputs of their vtransforms are reversed. |
Thanks for pointing that out. But what do you mean by "reversed"? Do you mean that the vtransform output has a spatial shape of [y, x] rather than the [x, y] of the lidar feature from the SCN? I'm a little confused, because if that were the case, the two feature maps couldn't be stacked and processed by the 2D pts_backbone. |
I'm confused about this too, but in the vtransform of the NeurIPS 2022 bevfusion, the output is [y,x], and in its (and BevDet's) transfusion head the position you mention is also [y,x]. |
Have you tried a non-square bev LCFusion? Does the 2d backbone of the bev model accept input properly? |
Yes, that's exactly the case I'm encountering. I have sparse_shape=[960, 1088, 41], which corresponds to x, y, z in lidar coord. The x, y, z bound of LSS is adjusted accordingly. In this case 2d backbone (pts_backbone) does accept the feature maps in a proper manner. |
@cxnaive Actually there is another confusing snippet, which however might be a hint to understand these ambiguous spatial shapes: mmdetection3d/projects/BEVFusion/bevfusion/transfusion_head.py Lines 727 to 731 in fe25f7a
Say sparse_shape (xyz) is [960, 1088, 41]. The predicted center (which should be [x, y], as it is used to form a LidarInstance3DBbox) is reversed when used to index the hotspot in the heatmap. That means the heatmap should also be reversed (as it is now), shaped as [1088/8, 960/8]. But the SCN outputs a feature map of [960/8, 1088/8] when using xyz voxelization, as can be seen in: mmdetection3d/projects/BEVFusion/bevfusion/sparse_encoder.py Lines 144 to 146 in fe25f7a
This is a conflict, but the operations in TransfusionHead are confusingly correct (except the heatmap coord swapping); using 'center' instead of 'center[[1,0]]' to index a heatmap of shape [960/8, 1088/8] ruins the training and the model never converges. |
draw_heatmap_gaussian(heatmap[gt_labels_3d[idx]], center_int, radius) is the original version of transfusion head. The original version should correspond to BEV features in the format [Y, X], while center_int[[1,0]] corresponds to [X, Y] |
So, the grid_size should also be reversed in the same way, but this was forgotten in the transfusion head of this version. Alternatively, consider using the original transfusion head but reversing the BEV features. |
https://github.com/open-mmlab/mmdetection3d/blob/fe25f7a51d36e3702f961e198894580d83c4387b/mmdet3d/models/utils/gaussian.py#L46C5-L53C69 |
This solves most of the confusion. That's why initializing the heatmap with shape feature_map_size[0], feature_map_size[1] (i.e., the same as the BEV feature) is consistent with everything else, right? |
Yes, you can check the implementation of CenterHead in CenterPoint within mmdet3D, which also uses [Y, X] for BEV features. However, the BEV features obtained from the sparse encoder in BEVFusion are [X, Y] |
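To make the indexing convention discussed above concrete, here is a minimal pure-Python sketch. `draw_peak` is a hypothetical stand-in for `draw_heatmap_gaussian` that writes only the peak value; it assumes the CenterNet-style convention in which `center` is read as (column, row) and the heatmap is indexed as `heatmap[row][col]`. The center index used below is made up for illustration.

```python
def zeros(rows, cols):
    """Build a rows x cols heatmap filled with zeros."""
    return [[0.0] * cols for _ in range(rows)]

def draw_peak(heatmap, center):
    """Hypothetical stand-in for a Gaussian draw: set only the peak to 1.

    Assumes center = (col, row) and heatmap[row][col] indexing.
    Returns False when the center falls outside the heatmap.
    """
    col, row = center
    height, width = len(heatmap), len(heatmap[0])
    if 0 <= row < height and 0 <= col < width:
        heatmap[row][col] = 1.0
        return True
    return False

# BEV feature laid out as [X, Y] = [120, 136], as produced by the sparse encoder here.
nx, ny = 120, 136
center_xy = (110, 130)  # made-up (x, y) grid index of a ground-truth box center

# Heatmap with the same [X, Y] layout as the BEV feature.
heatmap_xy = zeros(nx, ny)

# Swapped center (y, x): the draw writes heatmap_xy[x][y], which is in bounds.
ok_swapped = draw_peak(heatmap_xy, (center_xy[1], center_xy[0]))

# Un-swapped center (x, y): the draw reads row = y = 130, which exceeds 120 rows.
ok_plain = draw_peak(heatmap_xy, center_xy)

print(ok_swapped, ok_plain)
```

Under these assumptions, the swapped center is the one that matches an [X, Y] heatmap, which agrees with the conclusion that the heatmap should be allocated as feature_map_size[0], feature_map_size[1].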
Prerequisite
Task
I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.
Branch
main branch https://github.com/open-mmlab/mmdetection3d
Environment
sys.platform: linux
Python: 3.8.10 (default, Nov 22 2023, 10:22:35) [GCC 9.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.4, V11.4.152
GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 1.10.1+cu113
PyTorch compiling details: PyTorch built with:
TorchVision: 0.11.2+cu113
OpenCV: 4.10.0
MMEngine: 0.10.4
MMDetection: 3.2.0+d509b75
Reproduces the problem - code sample
bash tools/dist_train.sh [configs] 1
Reproduces the problem - command or script
bash tools/dist_train.sh [configs] 1
Reproduces the problem - error message
File "/usr/local/lib/python3.8/dist-packages/mmdet3d/models/detectors/base.py", line 75, in forward
return self.loss(inputs, data_samples, **kwargs)
File "/workspace/bevfusion/bevfusion.py", line 301, in loss
bbox_loss = self.bbox_head.loss(feats, batch_data_samples)
File "/workspace/bevfusion/transfusion_head.py", line 761, in loss
loss = self.loss_by_feat(preds_dicts, batch_gt_instances_3d)
File "/workspace/bevfusion/transfusion_head.py", line 786, in loss_by_feat
loss_heatmap = self.loss_heatmap(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/mmdet/models/losses/gaussian_focal_loss.py", line 176, in forward
loss_reg = self.loss_weight * gaussian_focal_loss(
File "/usr/local/lib/python3.8/dist-packages/mmdet/models/losses/utils.py", line 121, in wrapper
loss = loss_func(pred, target, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/mmdet/models/losses/gaussian_focal_loss.py", line 35, in gaussian_focal_loss
pos_loss = -(pred + eps).log() * (1 - pred).pow(alpha) * pos_weights
RuntimeError: The size of tensor a (136) must match the size of tensor b (120) at non-singleton dimension 3
Additional information
Here I use a custom dataset with non-square bev feature (i.e., the sparse shape is [960, 1088, z], making the bev feature map of spatial shape [120, 136]). When passing the sparse shape as "grid_size", which is used in:
mmdetection3d/projects/BEVFusion/bevfusion/transfusion_head.py
Lines 701 to 702 in fe25f7a
and the feature_map_size is then used to create the heatmap. Here the X and Y dimensions are swapped, making it of spatial shape [136, 120] following the code below:
mmdetection3d/projects/BEVFusion/bevfusion/transfusion_head.py
Lines 703 to 704 in fe25f7a
The problem is finally triggered at
mmdetection3d/projects/BEVFusion/bevfusion/transfusion_head.py
Lines 785 to 790 in fe25f7a
This problem doesn't occur with nuScenes, since its BEV feature is square and the swap has no effect. In conclusion, the swap when initializing the heatmap is ambiguous and makes it impossible for the heatmap to have the same shape as the BEV feature.
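The shape arithmetic described above can be sketched as follows. This is not the code from transfusion_head.py; the helper names `feature_map_size` and `heatmap_shape` and the class count are hypothetical, and the stride of 8 is inferred from 960 -> 120 and 1088 -> 136.

```python
def feature_map_size(grid_size, out_size_factor):
    """Per-axis BEV size: the first two grid entries divided by the stride."""
    return [grid_size[0] // out_size_factor, grid_size[1] // out_size_factor]

def heatmap_shape(num_classes, fmap_size, swap_xy):
    """Shape of the target heatmap, with or without the X/Y swap."""
    if swap_xy:
        return (num_classes, fmap_size[1], fmap_size[0])
    return (num_classes, fmap_size[0], fmap_size[1])

grid_size = [960, 1088, 41]  # sparse shape from this issue, in x, y, z order
out_size_factor = 8          # assumed stride, inferred from 960 -> 120
num_classes = 10             # hypothetical class count

fmap = feature_map_size(grid_size, out_size_factor)
bev_shape = (fmap[0], fmap[1])  # spatial shape of the BEV feature and prediction

swapped = heatmap_shape(num_classes, fmap, swap_xy=True)
plain = heatmap_shape(num_classes, fmap, swap_xy=False)

print(bev_shape, swapped[1:], plain[1:])
# prints: (120, 136) (136, 120) (120, 136)
```

With the swap, the target's last dimension is 120 while the prediction's is 136, which matches the RuntimeError above. Without the swap, the two spatial shapes agree.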