[examples] Add Matryoshka evaluation plot
#2564
Hello! This is really excellent work! I really appreciate this addition. I've had a quick look at my original plotting script & found the one remaining difference: I used
Thanks, happy to help! Btw, how do users currently do truncation during inference? For example, do they always remember to do I was wondering if you'd be open to adding that
I was actually considering adding a parameter to
Another option is to introduce a new

A potential gotcha w/ an

The advantage of an instance attribute to adding a new

+ model = SentenceTransformer(model_name, output_dims=64)
# or
+ model.output_dims = 64

while the latter requires changing every

- embeddings = model.encode(texts)
+ embeddings = model.encode(texts, output_dims=64)

I'd imagine that there aren't too many applications where the number of dimensions varies. So setting an instance attribute is simple and won't cause problems in those cases. But maybe it's better to wait and see what a common application of truncation looks like, as you suggested.

Another advantage to the instance attribute is that for tools which internally call

The disadvantage to an instance attribute is that it introduces state that the user might need to keep track of, which can make for annoying bugs. A solution to this is to add a context manager which internally sets and then resets the
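To make the context-manager idea concrete, here is a minimal sketch. It does not use the real library: `ToyModel` is a hypothetical stand-in with an `output_dims` attribute (the name used in the diff above), and `truncated` is a hypothetical helper name. The point is only that the attribute is set on entry and restored on exit, even if the body raises.

```python
from contextlib import contextmanager

import numpy as np


class ToyModel:
    """Hypothetical stand-in for a model that has an `output_dims` attribute."""

    def __init__(self, full_dims=8, output_dims=None):
        self.full_dims = full_dims
        self.output_dims = output_dims

    def encode(self, texts):
        # Fake, deterministic embeddings: one row per input text.
        rng = np.random.default_rng(0)
        embeddings = rng.normal(size=(len(texts), self.full_dims))
        # Slicing with None keeps every dimension.
        return embeddings[:, : self.output_dims]


@contextmanager
def truncated(model, output_dims):
    """Temporarily set `model.output_dims`, then restore the previous value."""
    previous = model.output_dims
    model.output_dims = output_dims
    try:
        yield model
    finally:
        # Runs on normal exit and on exceptions, so no state leaks out.
        model.output_dims = previous
```

With this, `with truncated(model, 4): model.encode(texts)` returns 4-dimensional embeddings inside the block, and the model goes back to its earlier setting afterwards, so the caller does not have to track the state by hand.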
Good thinking, I am open to either solution at this time. A notable question is: do people use the same model instance for multiple Matryoshka embedding dimensions? If yes, then an What do you think? Personally, I think if someone wants multiple embedding dimensions, e.g. 512 and 128, then they're best off embedding to 512 and then getting the 128-dimensional embeddings by truncating the 512-dimensional ones. So they'll always just use one embedding dimension as output from the model. But then there's also a bit of precedent with the new
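The "embed to 512, then truncate to 128" approach can be sketched with plain NumPy. This is an illustration under assumptions, not the library's API: `truncate_embeddings` is a hypothetical helper, the random array stands in for real model output, and re-normalizing after truncation is an assumed extra step that matters only if similarities are later computed with a plain dot product.

```python
import numpy as np


def truncate_embeddings(embeddings, dims, normalize=True):
    """Keep the first `dims` columns of a (n_texts, n_dims) embedding array.

    Hypothetical helper for illustration; not part of any library.
    """
    truncated = embeddings[:, :dims]
    if normalize:
        # Rescale each row to unit length; the clip guards against zero rows.
        norms = np.linalg.norm(truncated, axis=1, keepdims=True)
        truncated = truncated / np.clip(norms, 1e-12, None)
    return truncated


# Placeholder for real 512-dimensional model output.
full = np.random.default_rng(0).normal(size=(3, 512))
small = truncate_embeddings(full, 128)  # shape (3, 128), unit-length rows
```

Because the smaller embeddings are just a prefix of the larger ones, the model only has to run once per text, whichever dimensions are needed downstream.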
Ah, yeah that's the most important question. I'd hope that people keep things simple and tie the dimension to the whole model instance. From the Matryoshka paper (Figures 3, 6, and 7) and your STSb experiment, it seems like Matryoshka-trained models preserve enough statistical performance that it's unlikely that one task in an application requires dimension
Good point, I missed that. One difference w/ We could even consider adding both an instance attribute and an
I'll merge this, as I think this is ready, but I am still interested in the discussion of how to best allow users of ST (whether directly or via third party applications) to use Matryoshka models for inference.
Hello,
The docs have a useful evaluation plot. I couldn't find code to reproduce it. This PR adds a script to reproduce it almost exactly—I think the only difference is the title position.
Running the script saves the following plot in ./plot.png:
Here's the script help message:
It requires matplotlib and datasets.