Managing multiple/changing/experimental encodings #338
-
We've had success using Elastiknn as part of our search pipeline to implement semantic search. However, as we roll out the feature, new constraints are pushing us to experiment with different models to create the vector representations of documents.

Reindexing is costly
We use a unified index and a 'doc' which is a superset of fields across all objects indexed in ES. Any object indexed in ES has the fields relevant to that object type filled out in the 'doc' and the rest are left empty. We regularly add fields to this doc but rarely (if ever) remove or alter fields in such a way as to require reindexing. It can take multiple days to do a full re-index for some customers. We currently only encode and index-via-elastiknn a small portion of the total content indexed in ES, so re-indexing this subset isn't too costly.

There is a speed vs quality tradeoff
Our text->vector encoding system is constrained by several factors, so we may need to allow customers to make an indexing-speed vs retrieval-quality tradeoff. In addition, we would like to continually experiment with variations on models to improve our location along this tradeoff curve.

Discussion
We would love to be able to swap encoding models at will, however if the …

In addition we've considered more fundamental changes to our search pipeline, but I'd like to hear thoughts from the group here as to how they handle changes and experimentation in scenarios like this.
-
If I'm understanding correctly, it sounds like you are looking for a way to store vectors representing the same data, but "encoded" with possibly different dimensions and different model parameters? For example, you take the same text document and encode it to two vectors: one with 10 dimensions and some LSH params (L=50, k=4), one with 100 dimensions and LSH params (L=20, k=3). Is that a correct summary? If so, the only way to do this is to have two different doc fields in your mapping. In the example above, you might have two fields:
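A sketch of what that mapping could look like, assuming hypothetical index/field names (my-index, vec_small, vec_large) and cosine LSH; swap in whatever similarity and LSH parameters fit your models:

```
# Hypothetical index/field names; LSH params are the ones from the example above.
PUT /my-index/_mapping
{
  "properties": {
    "vec_small": {
      "type": "elastiknn_dense_float_vector",
      "elastiknn": {
        "dims": 10,
        "model": "lsh",
        "similarity": "cosine",
        "L": 50,
        "k": 4
      }
    },
    "vec_large": {
      "type": "elastiknn_dense_float_vector",
      "elastiknn": {
        "dims": 100,
        "model": "lsh",
        "similarity": "cosine",
        "L": 20,
        "k": 3
      }
    }
  }
}
```

Each field is then indexed and queried independently, so you can run the same search against either encoding and compare results, e.g.:

```
# Query only the 10-dimensional encoding.
GET /my-index/_search
{
  "query": {
    "elastiknn_nearest_neighbors": {
      "field": "vec_small",
      "vec": { "values": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] },
      "model": "lsh",
      "similarity": "cosine",
      "candidates": 50
    }
  }
}
```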
-
Yes, exactly.
Elastiknn just stores all of its data in a plain Lucene index: vectors are stored as binary blobs and LSH hashes as Lucene terms. So a deletion would propagate the same way as deleting a regular ES field would. It probably won't immediately release space, but after you merge segments to prune deleted docs/fields you'll get the space back. Worth mentioning that there is really no elastiknn-specific magic behind the elastiknn_*_…
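As a concrete sketch of that cleanup, assuming the hypothetical index my-index and a retired vector field vec_small from earlier, you could null the field out with a standard update-by-query and then expunge deletes with a force merge; both are stock Elasticsearch APIs, nothing elastiknn-specific:

```
# Hypothetical index/field names. Remove the retired vector field from matching docs.
POST /my-index/_update_by_query
{
  "query": { "exists": { "field": "vec_small" } },
  "script": {
    "lang": "painless",
    "source": "ctx._source.remove('vec_small')"
  }
}

# Merge segments so the deleted data is actually pruned and space is released.
POST /my-index/_forcemerge?only_expunge_deletes=true
```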