Hey, thanks so much for sharing this repo!

Since R3M is trained with a contrastive, language-aligned objective, it should have learned to align visual representations with text embeddings. Based on this, I wonder whether there is an efficient way, when using R3M, to decode the textual grounding of a given visual representation.
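For concreteness, here is a rough sketch of the kind of decoding I have in mind: score a bank of candidate descriptions against the R3M embedding of a frame. It assumes the `load_r3m` loader from this repo; the DistilBERT text encoder and the linear projection into the visual embedding space are placeholders of my own (the projection would still need to be trained on paired data, since R3M scores video-language pairs with a learned network rather than exposing a shared embedding space):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from r3m import load_r3m

device = "cuda" if torch.cuda.is_available() else "cpu"

# Visual side: R3M's frozen encoder (ResNet50 variant), as in the repo README.
r3m = load_r3m("resnet50").to(device).eval()

# Text side: a DistilBERT sentence encoder (placeholder for R3M's language encoder).
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
text_enc = AutoModel.from_pretrained("distilbert-base-uncased").to(device).eval()

# Hypothetical projection from the 768-d text space into R3M's 2048-d visual space.
# This is NOT part of R3M; it would have to be learned before the scores mean anything.
proj = torch.nn.Linear(768, 2048).to(device)

candidates = ["picking up a cup", "opening a drawer", "pushing a button"]

with torch.no_grad():
    # R3M expects pixel values in [0, 255]; `image` stands in for a real (1, 3, 224, 224) frame.
    image = torch.randint(0, 255, (1, 3, 224, 224), dtype=torch.float32, device=device)
    vis_emb = r3m(image)                                        # (1, 2048)

    batch = tok(candidates, padding=True, return_tensors="pt").to(device)
    txt_emb = text_enc(**batch).last_hidden_state.mean(dim=1)   # (N, 768), mean-pooled
    txt_emb = proj(txt_emb)                                     # (N, 2048)

    scores = F.cosine_similarity(vis_emb, txt_emb)              # (N,)
    print(candidates[scores.argmax().item()])
```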
Another approach I can think of is to use a pre-trained captioning model to generate captions and then infer the description from them. What do you think of this?
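As a concrete example of the captioning route, an off-the-shelf image-to-text model such as BLIP from Hugging Face `transformers` could caption the same frame that is fed to R3M (the model name and file path here are just illustrative):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Off-the-shelf captioning model; any image-to-text model would work here.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("frame.png").convert("RGB")  # the same frame encoded by R3M

inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```

One caveat is that the caption describes the raw frame, not the R3M embedding itself, so it may not reflect what information the representation has actually kept or discarded.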