
How does Chat-Scene achieve understanding of the relative spatial relationships between objects without absolute coordinate inputs? #48

Open
zbqq opened this issue Dec 3, 2024 · 5 comments

Comments

@zbqq

zbqq commented Dec 3, 2024

In your implementation, the 3D features extracted by Uni3D and the 2D features extracted by DINOv2 discard the objects' absolute coordinate information. How does the model then capture relative spatial relationships between objects?

@ZzZZCHS
Owner

ZzZZCHS commented Dec 3, 2024

Hi, thanks for your interest!

You’re correct that the current 3D/2D features lack explicit coordinate information. Only the DINOv2 features might capture some local spatial relationships within the images. The strong performance on datasets like ScanRefer could be attributed to their detailed attribute descriptions of the target object. The model can be further improved by incorporating encoders capable of capturing coordinate information.

@zbqq
Author

zbqq commented Dec 3, 2024

Thanks for your reply!

@zbqq
Author

zbqq commented Dec 3, 2024

By the way, when will the DINOv2 feature extraction code be released?

@ZzZZCHS
Owner

ZzZZCHS commented Dec 3, 2024

Probably end of this month.

@cy94

cy94 commented Jan 9, 2025

Hi @ZzZZCHS, thanks for releasing your code and data. Any idea when you can release the DINOv2 feature extraction code?
