Managing multiple/changing/experimental encodings #338
-
We've had success using Elastiknn as part of our search pipeline to implement semantic search. However, as we roll out the feature, new constraints are pushing us to experiment with different models to create the vector representations of documents.

Reindexing is costly
We use a unified index and a 'doc' which is a superset of fields across all objects indexed in ES. Any object indexed in ES has the fields relevant to that object type filled out in the 'doc' and the rest are left empty. We regularly add fields to this doc but rarely (if ever) remove or alter fields in such a way as to require reindexing. It can take multiple days to do a full re-index for some customers. We currently only encode and index-via-elastiknn a small portion of the total content indexed in ES, so re-indexing this subset isn't too costly.

There is a speed vs quality tradeoff
Our text->vector encoding system is constrained by several factors, so we may need to allow customers to make an indexing-speed vs retrieval-quality tradeoff. In addition, we would like to continually experiment with variations on models to improve our location along this tradeoff curve.

Discussion
We would love to be able to swap encoding models at will, however if the …

In addition we've considered more fundamental changes to our search pipeline, but I'd like to hear thoughts from the group here as to how they handle changes and experimentation in scenarios like this.
-
If I'm understanding correctly, it sounds like you are looking for a way to store vectors representing the same data, but "encoded" with possibly different dimensions and different model parameters? For example, you take the same text document and encode it to two vectors: one with 10 dimensions and some LSH params (L=50, k=4), one with 100 dimensions and LSH params (L=20, k=3). Is that a correct summary? If so, the only way to do this is to have two different doc fields in your mapping. In the example above, you might have two fields:
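A sketch of what that mapping could look like, assuming hypothetical index/field names (my-index, vec_small, vec_large) and cosine LSH; swap in whatever similarity and LSH parameters fit your models:

```
# Hypothetical index/field names; LSH params are the ones from the example above.
PUT /my-index/_mapping
{
  "properties": {
    "vec_small": {
      "type": "elastiknn_dense_float_vector",
      "elastiknn": {
        "dims": 10,
        "model": "lsh",
        "similarity": "cosine",
        "L": 50,
        "k": 4
      }
    },
    "vec_large": {
      "type": "elastiknn_dense_float_vector",
      "elastiknn": {
        "dims": 100,
        "model": "lsh",
        "similarity": "cosine",
        "L": 20,
        "k": 3
      }
    }
  }
}
```

Each field is then indexed and queried independently, so you can run the same search against either encoding and compare results, e.g.:

```
# Query only the 10-dimensional encoding.
GET /my-index/_search
{
  "query": {
    "elastiknn_nearest_neighbors": {
      "field": "vec_small",
      "vec": { "values": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] },
      "model": "lsh",
      "similarity": "cosine",
      "candidates": 50
    }
  }
}
```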
-
Yes, exactly.
Elastiknn just stores all of its data in a plain Lucene index: vectors are stored as binary blobs and LSH hashes as Lucene terms. So a deletion would propagate the same way as deleting a regular ES field would. It probably won't immediately release space, but after you merge segments to prune deleted docs/fields you'll get the space back. Worth mentioning that there is really no elastiknn-specific magic behind the elastiknn_*_…
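As a concrete sketch of that cleanup, assuming the hypothetical index my-index and a retired vector field vec_small from earlier, you could null the field out with a standard update-by-query and then expunge deletes with a force merge; both are stock Elasticsearch APIs, nothing elastiknn-specific:

```
# Hypothetical index/field names. Remove the retired vector field from matching docs.
POST /my-index/_update_by_query
{
  "query": { "exists": { "field": "vec_small" } },
  "script": {
    "lang": "painless",
    "source": "ctx._source.remove('vec_small')"
  }
}

# Merge segments so the deleted data is actually pruned and space is released.
POST /my-index/_forcemerge?only_expunge_deletes=true
```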