Hi, I don't understand why we need to extract the language embeddings first. Doesn't the model just take box prompts and images and then output segmentation masks and labels? Why do we need language embeddings, and how do they work with the model?
The language embeddings serve as a "dictionary": one text embedding is pre-extracted per candidate label. During inference, a visual embedding is extracted for each predicted region and matched against this dictionary to assign a label. You can refer to the code and paper for details. Please let me know if you have any other questions.
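To make the matching step concrete, here is a minimal sketch of how this kind of open-vocabulary label assignment typically works. This is not this repo's actual API; `text_encoder` and the function names are hypothetical stand-ins for whatever CLIP-style text encoder and region feature extractor the model uses:

```python
import torch
import torch.nn.functional as F


def build_label_dictionary(text_encoder, label_names):
    """Pre-extract one text embedding per candidate label (the 'dictionary').

    `text_encoder` is a hypothetical CLIP-style encoder mapping a list of
    label strings to a (num_labels, dim) tensor.
    """
    with torch.no_grad():
        text_embeds = text_encoder(label_names)        # (num_labels, dim)
    return F.normalize(text_embeds, dim=-1)


def classify_region(visual_embed, label_dictionary, label_names):
    """Match one region's visual embedding against the label dictionary.

    The label whose text embedding has the highest cosine similarity
    with the region's visual embedding wins.
    """
    visual_embed = F.normalize(visual_embed, dim=-1)   # (dim,)
    similarity = label_dictionary @ visual_embed       # (num_labels,)
    return label_names[similarity.argmax().item()]
```

This is also why the text side is extracted first: the vocabulary is fixed, so its embeddings can be computed once and reused, and inference reduces to a nearest-neighbor lookup in the shared embedding space. Swapping in a different label list lets the model name new categories without retraining.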