-
Notifications
You must be signed in to change notification settings - Fork 25.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for more than one inner_hit when searching nested vectors #104006
Add support for more than one inner_hit when searching nested vectors #104006
Conversation
…ssages-nested-knn-search
…ssages-nested-knn-search
Hi @benwtrent, I've created a changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other major way of handling this change would be figure out if NestedInnerHitSubContext
could work. I am not 100% sure. I am still trying to figure out if that is even possible, but I spent a day+ twiddling with that and gave up.
@jimczi let me know what you think. I will leave this as it is for now and try to see if I can get the subcontext stuff working.
@@ -134,3 +137,93 @@ setup: | |||
- match: {hits.hits.0._id: "3"} | |||
- match: {hits.hits.0.fields.name.0: "rabbit.jpg"} | |||
- match: {hits.hits.0.inner_hits.nested.hits.hits.0.fields.nested.0.paragraph_id.0: "0"} | |||
--- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test shows the current bug around global top-k gathering in DFS. I still need to figure out how to get the global k docs diversified over the parents given the children documents.
@@ -161,6 +161,7 @@ public static List<DfsKnnResults> mergeKnnResults(SearchRequest request, List<Df | |||
|
|||
List<DfsKnnResults> mergedResults = new ArrayList<>(request.source().knnSearch().size()); | |||
for (int i = 0; i < request.source().knnSearch().size(); i++) { | |||
//TODO how do we merge via diversified top-k parents? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a couple ideas here. My main thought is that kNNResults will not only include the child document ScoreDocs, but can also include an int[]
of the matching parent documents. I would need some special logic here, but when combining shard level results, we can ensure that we combine the top k
parent documents given the individual passage scores
* @param innerHitBuilder The inner hit builder to set as current inner hit builder | ||
* @return The previous inner hit builder | ||
*/ | ||
public InnerHitBuilder nextLevelInnerHits(InnerHitBuilder innerHitBuilder) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I needed a way to tell the top-level & knn Query how many child documents we are wanting to gather. Since kNN is eager and returns only a wrapper around ScoreDoc
values, we need to know how many total documents we care about.
server/src/main/java/org/elasticsearch/search/fetch/subphase/InnerHitsContext.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/search/vectors/ChildBlockJoinVectorScorerProvider.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/search/vectors/ChildBlockJoinVectorScorerProvider.java
Outdated
Show resolved
Hide resolved
this.childFilterIterator = childFilterIterator; | ||
this.previouslyFoundChildren = previouslyFoundChildren; | ||
this.parentBitSet = parentBitSet; | ||
this.queue = new HitQueue(numChildrenPerParent, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may ask "Ben, why not just score all and return all vectors because you gotta iterate all the children anyways to find the next nearest". Well, there are degenerate cases we want to avoid. What if one vector has 10k children?
...src/main/java/org/elasticsearch/search/vectors/ESDiversifyingChildrenByteKnnVectorQuery.java
Outdated
Show resolved
Hide resolved
…com:benwtrent/elasticsearch into feature/add-more-passages-nested-knn-search
The nested inner hit sub context should be the way to go imo. |
@jimczi it seems to me the only way to do this (with inner sub context or without), is for there to be a new query that wraps our regular queries and scores the individual child docs given some previously calculated parent docs. I have something like this working locally, I will see if it is cleaner to put that through the subcontext building process or not. |
See my PR, we only need a custom nested query builder (to override extractInnerHits) and an exact knn query builder. |
Pinging @elastic/es-search (Team:Search) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left one general comment. I don't think we need to mix the query phase results with the fetch phase/inner hits. It's ok to recompute the similarity for each children in the fetch phase, that's how inner hits works.
* This query is used to score vector child documents given the results of a {@link DiversifyingChildrenByteKnnVectorQuery} | ||
* or {@link DiversifyingChildrenFloatKnnVectorQuery}. | ||
*/ | ||
public class ESDiversifyingChildrenKnnVectorQuery extends Query { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why this query is needed? Is it to avoid scoring the children returned by the query phase?
I don't think we should use any of the child result of the query phase to compute the inner hits. Ideally we can just have a nested query builder that uses a brute force knn query to score all children. We don't need to bother with the DiversifyingChildren... queries in the fetch phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jimczi, for knn query, there is no way I could find to interject information without leaking the abstraction up to the nested query that it somewhere contains kNN results. We don't rewrite to a KnnScoreAndDoc
query when using knn query. We just use what Lucene gives us, so we need to wrap the query lower down which is what I did here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ouch, thanks I forgot this part. I am conflicted because this solution involves running the approximate query twice. The first time during the query phase and then another time during the inner hits fetch sub-phase. This is more an issue at the Lucene query level since we rely on the rewrite to do the heavy lifting and we have no way to determine the context where the rewrite is executed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am conflicted because this solution involves running the approximate query twice.
Ah, so nested
will run the query again, completely from scratch when gathering the inner_hits? That is indeed frustrating.
I will think a bit more on this. Maybe there is something we can do in Elasticsearch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, looking at the top-level call extract is done here:
InnerHitContextBuilder.extractInnerHits(query, innerHitBuilders); |
This has access to a SearchExecutionContext
and such. Which indicates to me we could have an additional thing that NestedQueryBuilder
could call like rewriteForInnerHits
. Multi-queries would have to push this down (but I think it would be easy enough), but then KnnVectorQueryBuilder
could satisfy this interface and return a separate query builder that is a BruteForceQueryBuilder like you have.
What do you think?
EDIT: We may not need anything more than a "rewriteToInnerHitsBuilder" that defaults to return this
. I am not sure including a SearchExecutionContext
is even necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I reverted something I shouldn't have. I will address and push again. I think this will work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, @jimczi I have it working. Need to add more tests but the gist of it is this:
- For top-level knn, we end up rewriting within the DFS phase. But, we still need to ensure we only match the global top-k, which is determined by the global top-k nearest single nested vector. So, I added
NestedKnnScoreDocQueryBuilder
, which scores all children of the given parent vector. - For query-level knn, we need some way for kNN to do score all matching documents. I added a new interface called
rewriteForInnerHits()
, which returns a newExactKnnQueryBuilder
so that the nested context can score all the children documents correctly.
I want to add some test coverage for NestedKnnScoreDocQueryBuilder
& ExactKnnQueryBuilder
. But what do you think? This is a mix between our two methods. This way query-level kNN doesn't do the entire graph search again with inner-hits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @benwtrent I like the idea. Another option would be to add a new QueryRewriteContext
so that we don't need to implement the new functions on all multi-queries. The advantage would be that it would also work on custom queries added via plugin without requiring any change?
Just for completeness we could also consider having a knn queries that executes at the scorerSupplier level (we know the leadCost there so we could pick between exact and approximate) but that would be a bigger change and would also change the expectation (each segment would return topN results).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option would be to add a new QueryRewriteContext so that we don't need to implement the new functions on all multi-queries.
I thought of that option as well. I will see if it will work. I am not sure at what point the fetch will rewrite these queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jimczi added a "InnerHitsRewriteContext" and rewrite with it in the SearchService and call extractInnerHits on the rewritten query.
…com:benwtrent/elasticsearch into feature/add-more-passages-nested-knn-search
…ssages-nested-knn-search
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for all the iterations @benwtrent !
new ByteKnnVectorFieldSource(name()), | ||
new ConstKnnByteVectorValueSource(bytes) | ||
) | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:Can we use a FieldExistsQuery
to filter out documents without a value? :
yield new BooleanQuery.Builder()
.add(new FieldExistsQuery(name()), BooleanClause.Occur.FILTER)
.add(new ByteVectorSimilarityFunction(
vectorSimilarityFunction,
new ByteKnnVectorFieldSource(name()),
new ConstKnnByteVectorValueSource(bytes)), BooleanClause.Occur.SHOULD)
.build();
That would remove the need to add a new ExactKnnQuery
.
server/src/main/java/org/elasticsearch/index/mapper/vectors/DenseVectorFieldMapper.java
Show resolved
Hide resolved
…ssages-nested-knn-search
server/src/main/java/org/elasticsearch/search/vectors/KnnScoreDocQueryBuilder.java
Show resolved
Hide resolved
if (query != null) { | ||
InnerHitContextBuilder.extractInnerHits(query, innerHitBuilders); | ||
QueryBuilder rewrittenForInnerHits = Rewriteable.rewrite(query, innerHitsRewriteContext, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am confused how much rewrite is happening here. I think we should do instead:
query = Rewriteable.rewrite(query, innerHitsRewriteContext, true);
Otherwise we again need to rewrite the query on line 1246.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. These are separate rewrites. One is only for inner-hits extraction, the other is for querying. These could match different documents. The queries that are rewritten to between them could be different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does rewrite for inner hits happen? It seems to me that it happens both in QueryPhase and FetchPhase. Does it mean that we run exact knn query two times, basically computing scores for all inner hits two times?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does rewrite for inner hits happen? It seems to me that it happens both in QueryPhase and FetchPhase. Does it mean that we run exact knn query two times, basically computing scores for all inner hits two times?
Inner hit query execution only happens during fetch. However no matter the phase, InnerHitContextBuilder.extractInnerHits(query, innerHitBuilders);
is called. We should have a larger rewrite where InnerHitContextBuilder
only occurs during the fetch phase.
This is why I don't use the rewrittenForInnerHits
query anywhere but within extractInnerHits
.
), | ||
BooleanClause.Occur.SHOULD | ||
) | ||
.setMinimumNumberShouldMatch(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we don't dosetMinimumNumberShouldMatch(1)
for the equivalent query for FLOAT
type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, copy paste error. It should be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@benwtrent @jimczi Thanks for your work!
I was following how this solution is evolving, and I very much like how short and elegant is the final version!
Yes, indeed that's a cool feature to use different Also for kNN query we can have additional nested filters that the top level kNN search doesn't support. |
Well, in its current form, it does not. We can re-introduce it later. |
…ssages-nested-knn-search
Many thanks, @benwtrent! |
This looks great! 😍 Thank you for working on this feature @benwtrent! |
Thank you @benwtrent! As far as other score_mode options, I would be very interested in being able to do "sum" to score a document higher based on the summation of multiple close passages. |
Yeah, that is indeed a concern. I could see us providing a way to gather the kNN vectors by nearest passage, and then allowing the inner hits to be re-ranked by some other score_mode, but doing the initial search with a different mode would get tricky. As for right now, if you are using dot-product or maximum-inner-product, why not index a "sum" vector that is the summation of every passage vector? I would have to do some napkin math to make sure this works... This python code seems to indicate that this would work in theory:
|
I think that the idea of doing the max (nearest passage) and then reranking based on sum would get us most of the way there. I was considering doing that on my own post-query when returning n inner-hits, but doing it across the k nearest candidates within ES would be even better. As far as summing the vectors, I think that would be fine for the case of fixed length documents (i.e. same number of passage vectors). However, if the size of the documents (and passages per document) varies wildly then summation will tend to overly favor longer documents. This would be less of an issue in the case of keyword search with score_mode:sum where most of the document will have no hits. In the vector case all of the passages will contribute something to the overall score. This may be asking too much, but being able to sum over the contribution of the top n inner hits (where n >= min passages per doc) would be the optimal outcome. I realize that this type of custom ranking may need to be performed post query or possibly in the max and then rerank mentioned above. |
…RewriteContext The DFS and highlight phases require rewriting the Lucene query outside of the query phase. However, if the query contains a k-NN query, this triggers a nearest neighbor search on the entire shard, which is unnecessary in these phases since computing top-N results is not required. This change builds upon elastic#104006, applying the same transformation used for nested inner hits. As a result, DFS and highlight phases avoid wasting time and resources on costly nearest neighbor searches. Note: The explain and matched query phases are also affected but still require the nearest neighbor search for accurate results, so they remain unchanged for now.
This commit adds the ability to gather more than one inner_hit when searching nested kNN.
Global kNN example
Results in
kNN Query example
With a kNN query, this opens an interesting door, which allows for multiple inner_hit scoring schemes.
Nearest by max passage only
closes: #102950