
[examples] Add Matryoshka evaluation plot #2564

Merged
merged 6 commits into UKPLab:master on Apr 3, 2024
Conversation

@kddubey (Contributor) commented Mar 30, 2024

Hello,

The docs have a useful evaluation plot. I couldn't find code to reproduce it. This PR adds a script to reproduce it almost exactly—I think the only difference is the title position.

Running the script—

python matryoshka_eval_stsb.py plot.png

—saves the following plot in ./plot.png:

[image: the generated evaluation plot]

Here's the script help message:

usage: matryoshka_eval_stsb.py [-h] [--model_names MODEL_NAMES [MODEL_NAMES ...]]
                               [--dimensions DIMENSIONS [DIMENSIONS ...]]
                               plot_filename

This script evaluates embedding models truncated at different dimensions on the STS
benchmark.

positional arguments:
  plot_filename         Where to save the plot of results

options:
  -h, --help            show this help message and exit
  --model_names MODEL_NAMES [MODEL_NAMES ...]
                        List of models which can be loaded using
                        sentence_transformers.SentenceTransformer(). Default:
                        tomaarsen/mpnet-base-nli-matryoshka tomaarsen/mpnet-base-
                        nli
  --dimensions DIMENSIONS [DIMENSIONS ...]
                        List of dimensions to truncate to and evaluate. Default:
                        768 512 256 128 64

It requires matplotlib and datasets.
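The core per-dimension scoring step the script performs can be sketched as follows. This is a minimal illustration, not the script's actual code, and `truncated_spearman` is a name I made up: embed the STSb sentence pairs once at full dimension, then, for each target dimension, truncate, renormalize, and compute the Spearman correlation between cosine similarities and the gold labels.

```python
# Illustrative sketch (not the script itself). Assumes emb1/emb2 are
# full-dimension numpy arrays of paired sentence embeddings and
# gold_scores are the STS benchmark similarity labels.
import numpy as np
from scipy.stats import spearmanr

def truncated_spearman(emb1, emb2, gold_scores, dim):
    a, b = emb1[:, :dim], emb2[:, :dim]
    # Renormalize after truncation: embeddings are generally only
    # unit-norm at the full dimension.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    cosine_scores = (a * b).sum(axis=1)
    # spearmanr returns (statistic, pvalue); index 0 works across
    # scipy versions.
    return spearmanr(cosine_scores, gold_scores)[0]
```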

@kddubey kddubey changed the title Add Matryoshka evaluation plot [examples] Add Matryoshka evaluation plot Mar 30, 2024
@tomaarsen (Collaborator)

Hello!

This is really excellent work! I really appreciate this addition. I've had a quick look at my original plotting script & found the one remaining difference: I used fig.tight_layout() before saving the file. I've added that to your PR as well, so now the plots should be pretty much identical. I think this is ready to merge now.

  • Tom Aarsen

@kddubey (Contributor, Author) commented Apr 2, 2024

Thanks, happy to help!

Btw, how do users currently do truncation during inference? For example, do they always remember to do model.encode(texts)[..., :dim] throughout their code?

I was wondering if you'd be open to adding that truncate_embeddings context manager (or a permanent modifier) which monkey-patches model.encode to truncate the embeddings. This way, the user doesn't need to remember to truncate every model.encode call, and maybe the model is easier to use in other packages which internally call model.encode.
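A minimal sketch of what such a context manager could look like (hypothetical; `truncate_embeddings` is not an existing sentence-transformers API, and the name and behavior are assumptions from this discussion):

```python
# Hypothetical sketch of a truncate_embeddings context manager that
# monkey-patches model.encode to truncate outputs to `dim`.
from contextlib import contextmanager

@contextmanager
def truncate_embeddings(model, dim):
    original_encode = model.encode

    def patched_encode(*args, **kwargs):
        # [..., :dim] handles both single (1-D) and batched (2-D) outputs.
        return original_encode(*args, **kwargs)[..., :dim]

    model.encode = patched_encode
    try:
        yield model
    finally:
        # Always restore the original method, even if the block raises.
        model.encode = original_encode
```

Inside `with truncate_embeddings(model, 64): ...`, every `model.encode(texts)` call, including ones made internally by other packages, returns 64-dimensional embeddings; note the truncated embeddings may still need renormalization before cosine similarity.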

@tomaarsen (Collaborator)

> Btw, how do users currently do truncation during inference? For example, do they always remember to do model.encode(texts)[..., :dim] throughout their code?
>
> I was wondering if you'd be open to adding that truncate_embeddings context manager (or a permanent modifier) which monkey-patches model.encode to truncate the embeddings. This way, the user doesn't need to remember to truncate every model.encode call, and maybe the model is easier to use in other packages which internally call model.encode.

I was actually considering adding a parameter to model.encode, e.g. matryoshka_dims=... or output_dims=... which performs this embedding truncation automatically. However, I was waiting to see whether Matryoshka models would catch on at all. If not, then this would only end up being confusing & tricky to maintain.
That said, I think it would make sense to add it now. I'm not sure about what the best name for the argument is, though.

  • Tom Aarsen

@kddubey (Contributor, Author) commented Apr 2, 2024

Another option is to introduce a new SentenceTransformer instance attribute called output_dims: int | None = None which determines the truncation level (defaults to no truncation). The code change to this package would involve adding a new __init__ kwarg, and then modifying encode to output embeddings[..., :self.output_dims] instead of embeddings.

A potential gotcha w/ an encode keyword argument is that model.get_sentence_embedding_dimension() now technically returns the full dimension instead of what model.encode(texts, output_dims=64) will return (64). With an instance attribute, model.get_sentence_embedding_dimension() can return model.output_dims if it's set. It's hard to say whether this is expected behavior though. Just throwing this idea out there :-)

The advantage of an instance attribute over a new encode keyword argument is that the former requires a single code change by the user—

+ model = SentenceTransformer(model_name, output_dims=64)
# or
+ model.output_dims = 64

—while the latter requires changing every model.encode call in the user's application:

- embeddings = model.encode(texts)
+ embeddings = model.encode(texts, output_dims=64)

I'd imagine that there aren't too many applications where the number of dimensions varies. So setting an instance attribute is simple and won't cause problems in those cases. But maybe it's better to wait and see what a common application of truncation looks like, as you suggested.

Another advantage to the instance attribute is that for tools which internally call model.encode (like the EmbeddingSimilarityEvaluator, or external packages which rely on sentence-transformers), no code changes need to be made to support truncation. Only the user needs to set the attribute.

The disadvantage to an instance attribute is that it introduces state that the user might need to keep track of, which can make for annoying bugs. A solution to this is to add a context manager which internally sets and then resets the output_dims attribute.
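To make the instance-attribute idea concrete, here is a hedged sketch written as a standalone mixin (hypothetical; `OutputDimsMixin` is not a real sentence-transformers class), which also keeps get_sentence_embedding_dimension consistent with what encode returns:

```python
# Hypothetical sketch, not a real sentence-transformers API. Intended
# usage would be something like:
#   class TruncatingST(OutputDimsMixin, SentenceTransformer): ...
class OutputDimsMixin:
    output_dims = None  # None means no truncation

    def encode(self, *args, **kwargs):
        embeddings = super().encode(*args, **kwargs)
        if self.output_dims is not None:
            embeddings = embeddings[..., : self.output_dims]
        return embeddings

    def get_sentence_embedding_dimension(self):
        # Report the truncated dimension so it matches encode's output.
        full = super().get_sentence_embedding_dimension()
        return full if self.output_dims is None else min(full, self.output_dims)
```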

@tomaarsen (Collaborator) commented Apr 2, 2024

Good thinking; I am open to either solution at this time. A notable question is: do people use the same model instance for multiple matryoshka embedding dimensions? If yes, then an encode parameter makes sense; if not, then an initialization parameter makes sense.

What do you think?

Personally, I think if someone wants multiple embedding dimensions, e.g. 512 and 128, then they're best off embedding to 512 and then getting the 128-dimensional embeddings by truncating the 512-dimensional ones. So they'll always just use one embedding dimension as output from the model.

But then there's also a bit of precedent with the new precision parameter which makes it more intuitive to add the parameter to encode 🤔

  • Tom Aarsen

@kddubey (Contributor, Author) commented Apr 2, 2024

> Do people use the same model instance for multiple matryoshka embedding dimensions

Ah, yeah that's the most important question. I'd hope that people keep things simple and tie the dimension to the whole model instance.

From the Matryoshka paper (Figures 3, 6, and 7) and your STSb experiment, it seems like Matryoshka-trained models preserve enough statistical performance that it's unlikely that one task in an application requires dimension d while another requires d' for the same model. In case a user wants both d and d', they're free to ignore output_dims and truncate themselves. So the instance attribute solution wouldn't be damning, just a convenience.

> But then there's also a bit of precedent with the new precision parameter which makes it more intuitive to add the parameter to encode 🤔

Good point, I missed that. One difference w/ precision is that (to my knowledge) there isn't an analog to model.get_sentence_embedding_dimension(). The existence of the model.get_sentence_embedding_dimension() method suggests that the state of a SentenceTransformer object includes the output dimension of its encodings, so perhaps encode and model.get_sentence_embedding_dimension() should be tied together. With output_dims as an encode parameter, there's a risk of mismatch between output_dims and model.get_sentence_embedding_dimension(), which may be unexpected.

We could even consider adding both an instance attribute and an encode parameter. This is done in most transformers.PreTrainedModels I've seen: in a transformers.PreTrainedModel, there's usually a few model.config.somethings which are also in model.__call__(..., something=None, ...). I'm not really for this solution in SentenceTransformer b/c (1) it makes the user think about where to set output_dims and (2) setting it in an encode call would contradict model.get_sentence_embedding_dimension(). Just worth throwing out there.

@tomaarsen (Collaborator)

I'll merge this, as I think this is ready, but I am still interested in the discussion of how to best allow users of ST (whether directly or via third party applications) to use Matryoshka models for inference.

@tomaarsen tomaarsen merged commit 444e8d4 into UKPLab:master Apr 3, 2024
9 checks passed
@kddubey kddubey deleted the matryoshka-eval-stsb branch April 3, 2024 13:58