Support for matryoshka indexing #131

npip99 · 2024-09-05T21:50:47Z

CREATE INDEX ix_chunk_embedding
ON chunk USING diskann (embedding) WITH (num_dimensions=1999);

NOTICE:  Starting index build. num_neighbors=-1 search_list_size=100, max_alpha=1.2, storage_layout=SbqCompression
ERROR:  assertion failed: dimensions > 0 && dimensions < 2000

The error above is a bit of a shame.

My column is a Vector(3072). It would be nice to support matryoshka embeddings by allowing the dimension of the index to be < 2000 even when the source vector has a larger dimension. I believe the SQL above should execute successfully, since I'm only indexing a subvector of the original vector.

For now, I have a generated column that stores my desired subvector, but this takes physical space on disk when ideally it would be computed on the fly. It also means that I have to rerank manually by the full vector rather than the index handling it automatically (not a big deal).
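For anyone hitting the same assertion, the generated-column workaround described above might look roughly like the sketch below. It assumes a pgvector version that provides the `subvector(vector, start, count)` function; the `id` column, the `embedding_small` column name, the index name, and the candidate count of 100 are hypothetical and only for illustration. The query uses cosine distance (`<=>`), which does not require renormalizing the truncated vectors; other distance types may.

```sql
-- Hypothetical workaround sketch: store the first 1999 dimensions of the
-- 3072-dimensional embedding in a generated column and index that instead.
-- Assumes pgvector's subvector(); column and index names are made up.
ALTER TABLE chunk
  ADD COLUMN embedding_small vector(1999)
  GENERATED ALWAYS AS (subvector(embedding, 1, 1999)::vector(1999)) STORED;

CREATE INDEX ix_chunk_embedding_small
ON chunk USING diskann (embedding_small);

-- Manual rerank: fetch candidates by the truncated vector (index-assisted),
-- then reorder those candidates by distance on the full vector.
-- $1 is the full 3072-dimensional query vector.
SELECT id
FROM (
  SELECT id, embedding
  FROM chunk
  ORDER BY embedding_small <=> subvector($1::vector, 1, 1999)::vector(1999)
  LIMIT 100
) AS candidates
ORDER BY embedding <=> $1::vector
LIMIT 10;
```

This is the version that costs extra disk space for the stored column, which is exactly what the request above is trying to avoid.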

If it could support e.g. this notation, then the num_dimensions attribute wouldn't be necessary anymore, and that would solve both problems (though supporting that notation might be overkill, I'm not sure).


cevian · 2024-09-23T20:22:28Z

Oh yeah, this seems to be something we overlooked.

cevian added the bug label on Sep 23, 2024
