
[examples] Add Matryoshka evaluation plot #2564

Merged
merged 6 commits into UKPLab:master on Apr 3, 2024
Conversation

@kddubey (Contributor) commented Mar 30, 2024

Hello,

The docs have a useful evaluation plot. I couldn't find code to reproduce it. This PR adds a script to reproduce it almost exactly—I think the only difference is the title position.

Running the script—

python matryoshka_eval_stsb.py plot.png

—saves the following plot in ./plot.png:

[image: the generated evaluation plot]

Here's the script help message:

usage: matryoshka_eval_stsb.py [-h] [--model_names MODEL_NAMES [MODEL_NAMES ...]]
                               [--dimensions DIMENSIONS [DIMENSIONS ...]]
                               plot_filename

This script evaluates embedding models truncated at different dimensions on the STS
benchmark.

positional arguments:
  plot_filename         Where to save the plot of results

options:
  -h, --help            show this help message and exit
  --model_names MODEL_NAMES [MODEL_NAMES ...]
                        List of models which can be loaded using
                        sentence_transformers.SentenceTransformer(). Default:
                        tomaarsen/mpnet-base-nli-matryoshka tomaarsen/mpnet-base-
                        nli
  --dimensions DIMENSIONS [DIMENSIONS ...]
                        List of dimensions to truncate to and evaluate. Default:
                        768 512 256 128 64

It requires matplotlib and datasets.
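The core per-dimension scoring step the script performs can be sketched as follows. This is a minimal illustration, not the script's actual code, and `truncated_spearman` is a name I made up: embed the STSb sentence pairs once at full dimension, then, for each target dimension, truncate, renormalize, and compute the Spearman correlation between cosine similarities and the gold labels.

```python
# Illustrative sketch (not the script itself). Assumes emb1/emb2 are
# full-dimension numpy arrays of paired sentence embeddings and
# gold_scores are the STS benchmark similarity labels.
import numpy as np
from scipy.stats import spearmanr

def truncated_spearman(emb1, emb2, gold_scores, dim):
    a, b = emb1[:, :dim], emb2[:, :dim]
    # Renormalize after truncation: embeddings are generally only
    # unit-norm at the full dimension.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    cosine_scores = (a * b).sum(axis=1)
    # spearmanr returns (statistic, pvalue); index 0 works across
    # scipy versions.
    return spearmanr(cosine_scores, gold_scores)[0]
```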

@kddubey kddubey changed the title Add Matryoshka evaluation plot [examples] Add Matryoshka evaluation plot Mar 30, 2024
@tomaarsen (Collaborator)

Hello!

This is really excellent work! I really appreciate this addition. I've had a quick look at my original plotting script & found the one remaining difference: I used fig.tight_layout() before saving the file. I've added that to your PR as well, so now the plots should be pretty much identical. I think this is ready to merge now.

  • Tom Aarsen

@kddubey (Contributor, Author) commented Apr 2, 2024

Thanks, happy to help!

Btw, how do users currently do truncation during inference? For example, do they always remember to do model.encode(texts)[..., :dim] throughout their code?

I was wondering if you'd be open to adding that truncate_embeddings context manager (or a permanent modifier) which monkey-patches model.encode to truncate the embeddings. This way, the user doesn't need to remember to truncate every model.encode call, and maybe the model is easier to use in other packages which internally call model.encode.
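A minimal sketch of what such a context manager could look like (hypothetical; `truncate_embeddings` is not an existing sentence-transformers API, and the name and behavior are assumptions from this discussion):

```python
# Hypothetical sketch of a truncate_embeddings context manager that
# monkey-patches model.encode to truncate outputs to `dim`.
from contextlib import contextmanager

@contextmanager
def truncate_embeddings(model, dim):
    original_encode = model.encode

    def patched_encode(*args, **kwargs):
        # [..., :dim] handles both single (1-D) and batched (2-D) outputs.
        return original_encode(*args, **kwargs)[..., :dim]

    model.encode = patched_encode
    try:
        yield model
    finally:
        # Always restore the original method, even if the block raises.
        model.encode = original_encode
```

Inside `with truncate_embeddings(model, 64): ...`, every `model.encode(texts)` call, including ones made internally by other packages, returns 64-dimensional embeddings; note the truncated embeddings may still need renormalization before cosine similarity.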

@tomaarsen (Collaborator)

> Btw, how do users currently do truncation during inference? For example, do they always remember to do model.encode(texts)[..., :dim] throughout their code?
>
> I was wondering if you'd be open to adding that truncate_embeddings context manager (or a permanent modifier) which monkey-patches model.encode to truncate the embeddings. This way, the user doesn't need to remember to truncate every model.encode call, and maybe the model is easier to use in other packages which internally call model.encode.

I was actually considering adding a parameter to model.encode, e.g. matryoshka_dims=... or output_dims=... which performs this embedding truncation automatically. However, I was waiting to see whether Matryoshka models would catch on at all. If not, then this would only end up being confusing & tricky to maintain.
That said, I think it would make sense to add it now. I'm not sure about what the best name for the argument is, though.

  • Tom Aarsen

@kddubey (Contributor, Author) commented Apr 2, 2024

Another option is to introduce a new SentenceTransformer instance attribute called output_dims: int | None = None which determines the truncation level (defaults to no truncation). The code change to this package would involve adding a new __init__ kwarg, and then modifying encode to output embeddings[..., :self.output_dims] instead of embeddings.

A potential gotcha w/ an encode keyword argument is that model.get_sentence_embedding_dimension() now technically returns the full dimension instead of what model.encode(texts, output_dims=64) will return (64). With an instance attribute, model.get_sentence_embedding_dimension() can return model.output_dims if it's set. It's hard to say whether this is expected behavior though. Just throwing this idea out there :-)

The advantage of an instance attribute over a new encode keyword argument is that the former requires a single code change by the user—

+ model = SentenceTransformer(model_name, output_dims=64)
# or
+ model.output_dims = 64

—while the latter requires changing every model.encode call in the user's application:

- embeddings = model.encode(texts)
+ embeddings = model.encode(texts, output_dims=64)

I'd imagine that there aren't too many applications where the number of dimensions varies. So setting an instance attribute is simple and won't cause problems in those cases. But maybe it's better to wait and see what a common application of truncation looks like, as you suggested.

Another advantage to the instance attribute is that for tools which internally call model.encode (like the EmbeddingSimilarityEvaluator, or external packages which rely on sentence-transformers), no code changes need to be made to support truncation. Only the user needs to set the attribute.

The disadvantage to an instance attribute is that it introduces state that the user might need to keep track of, which can make for annoying bugs. A solution to this is to add a context manager which internally sets and then resets the output_dims attribute.
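To make the instance-attribute idea concrete, here is a hedged sketch written as a standalone mixin (hypothetical; `OutputDimsMixin` is not a real sentence-transformers class), which also keeps get_sentence_embedding_dimension consistent with what encode returns:

```python
# Hypothetical sketch, not a real sentence-transformers API. Intended
# usage would be something like:
#   class TruncatingST(OutputDimsMixin, SentenceTransformer): ...
class OutputDimsMixin:
    output_dims = None  # None means no truncation

    def encode(self, *args, **kwargs):
        embeddings = super().encode(*args, **kwargs)
        if self.output_dims is not None:
            embeddings = embeddings[..., : self.output_dims]
        return embeddings

    def get_sentence_embedding_dimension(self):
        # Report the truncated dimension so it matches encode's output.
        full = super().get_sentence_embedding_dimension()
        return full if self.output_dims is None else min(full, self.output_dims)
```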

@tomaarsen (Collaborator) commented Apr 2, 2024

Good thinking; I am open to either solution at this time. A notable question is: do people use the same model instance for multiple matryoshka embedding dimensions? If yes, then an encode parameter makes sense; if not, then an initialization parameter makes sense.

What do you think?

Personally, I think if someone wants multiple embedding dimensions, e.g. 512 and 128, then they're best off embedding to 512 and then getting the 128-dimensional embeddings by truncating the 512-dimensional ones. So they'll always just use one embedding dimension as output from the model.

But then there's also a bit of precedent with the new precision parameter which makes it more intuitive to add the parameter to encode 🤔

  • Tom Aarsen

@kddubey (Contributor, Author) commented Apr 2, 2024

> Do people use the same model instance for multiple matryoshka embedding dimensions

Ah, yeah that's the most important question. I'd hope that people keep things simple and tie the dimension to the whole model instance.

From the Matryoshka paper (Figures 3, 6, and 7) and your STSb experiment, it seems like Matryoshka-trained models preserve enough statistical performance that it's unlikely that one task in an application requires dimension d while another requires d' for the same model. In case a user wants both d and d', they're free to ignore output_dims and truncate themselves. So the instance attribute solution wouldn't be damning, just a convenience.

> But then there's also a bit of precedent with the new precision parameter which makes it more intuitive to add the parameter to encode 🤔

Good point, I missed that. One difference w/ precision is that (to my knowledge) there isn't an analog to model.get_sentence_embedding_dimension(). The existence of the model.get_sentence_embedding_dimension() method suggests that the state of a SentenceTransformer object includes the output dimension of its encodings, so perhaps encode and model.get_sentence_embedding_dimension() should be tied together. With output_dims as an encode parameter, there's a risk of mismatch between output_dims and model.get_sentence_embedding_dimension(), which may be unexpected.

We could even consider adding both an instance attribute and an encode parameter. This is done in most transformers.PreTrainedModels I've seen: in a transformers.PreTrainedModel, there's usually a few model.config.somethings which are also in model.__call__(..., something=None, ...). I'm not really for this solution in SentenceTransformer b/c (1) it makes the user think about where to set output_dims and (2) setting it in an encode call would contradict model.get_sentence_embedding_dimension(). Just worth throwing out there.

@tomaarsen (Collaborator)

I'll merge this, as I think this is ready, but I am still interested in the discussion of how to best allow users of ST (whether directly or via third party applications) to use Matryoshka models for inference.

@tomaarsen tomaarsen merged commit 444e8d4 into UKPLab:master Apr 3, 2024
9 checks passed
@kddubey kddubey deleted the matryoshka-eval-stsb branch April 3, 2024 13:58