Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for more than one inner_hit when searching nested vectors #104006

Conversation

benwtrent
Copy link
Member

@benwtrent benwtrent commented Jan 5, 2024

This commit adds the ability to gather more than one inner_hit when searching nested kNN.

Global kNN example

POST test/_search
{
    "_source": false,
    "fields": [
        "name"
    ],
    "knn": {
        "field": "nested.vector",
        "query_vector": [
            -0.5,
            90,
            -10,
            14.8,
            -156
        ],
        "k": 3,
        "num_candidates": 3,
        "inner_hits": {
            "size": 2,
            "fields": [
                "nested.paragraph_id"
            ],
            "_source": false
        }
    }
}

Results in

{
    "took": 66,
    "timed_out": false,
    "_shards": {
        "total": 2,
        "successful": 2,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 2,
            "relation": "eq"
        },
        "max_score": 0.009090909,
        "hits": [
            {
                "_index": "test",
                "_id": "2",
                "_score": 0.009090909,
                "fields": {
                    "name": [
                        "moose.jpg"
                    ]
                },
                "inner_hits": {
                    "nested": {
                        "hits": {
                            "total": {
                                "value": 2,
                                "relation": "eq"
                            },
                            "max_score": 0.009090909,
                            "hits": [
                                {
                                    "_index": "test",
                                    "_id": "2",
                                    "_nested": {
                                        "field": "nested",
                                        "offset": 0
                                    },
                                    "_score": 0.009090909,
                                    "fields": {
                                        "nested": [
                                            {
                                                "paragraph_id": [
                                                    "0"
                                                ]
                                            }
                                        ]
                                    }
                                },
                                {
                                    "_index": "test",
                                    "_id": "2",
                                    "_nested": {
                                        "field": "nested",
                                        "offset": 1
                                    },
                                    "_score": 0.004968944,
                                    "fields": {
                                        "nested": [
                                            {
                                                "paragraph_id": [
                                                    "2"
                                                ]
                                            }
                                        ]
                                    }
                                }
                            ]
                        }
                    }
                }
            },
            {
                "_index": "test",
                "_id": "3",
                "_score": 0.0021519717,
                "fields": {
                    "name": [
                        "rabbit.jpg"
                    ]
                },
                "inner_hits": {
                    "nested": {
                        "hits": {
                            "total": {
                                "value": 1,
                                "relation": "eq"
                            },
                            "max_score": 0.0021519717,
                            "hits": [
                                {
                                    "_index": "test",
                                    "_id": "3",
                                    "_nested": {
                                        "field": "nested",
                                        "offset": 0
                                    },
                                    "_score": 0.0021519717,
                                    "fields": {
                                        "nested": [
                                            {
                                                "paragraph_id": [
                                                    "0"
                                                ]
                                            }
                                        ]
                                    }
                                }
                            ]
                        }
                    }
                }
            }
        ]
    }
}

kNN Query example

With a kNN query, this opens an interesting door, which allows for multiple inner_hit scoring schemes.

Nearest by max passage only

POST test/_search
{
    "size": 3,
    "query": {
        "nested": {
            "path": "nested",
            "score_mode": "max",
            "query": {
                "knn": {
                    "field": "nested.vector",
                    "query_vector": [
                        -0.5,
                        90,
                        -10,
                        14.8,
                        -156
                    ],
                    "num_candidates": 5
                }
            },
            "inner_hits": {
                "size": 2,
                "_source": false,
                "fields": [
                    "nested.paragraph_id"
                ]
            }
        }
    }
}

closes: #102950

@elasticsearchmachine
Copy link
Collaborator

Hi @benwtrent, I've created a changelog YAML for you.

Copy link
Member Author

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other major way of handling this change would be figure out if NestedInnerHitSubContext could work. I am not 100% sure. I am still trying to figure out if that is even possible, but I spent a day+ twiddling with that and gave up.

@jimczi let me know what you think. I will leave this as it is for now and try to see if I can get the subcontext stuff working.

@@ -134,3 +137,93 @@ setup:
- match: {hits.hits.0._id: "3"}
- match: {hits.hits.0.fields.name.0: "rabbit.jpg"}
- match: {hits.hits.0.inner_hits.nested.hits.hits.0.fields.nested.0.paragraph_id.0: "0"}
---
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test shows the current bug around global top-k gathering in DFS. I still need to figure out how to get the global k docs diversified over the parents given the children documents.

@@ -161,6 +161,7 @@ public static List<DfsKnnResults> mergeKnnResults(SearchRequest request, List<Df

List<DfsKnnResults> mergedResults = new ArrayList<>(request.source().knnSearch().size());
for (int i = 0; i < request.source().knnSearch().size(); i++) {
//TODO how do we merge via diversified top-k parents?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a couple ideas here. My main thought is that kNNResults will not only include the child document ScoreDocs, but can also include an int[] of the matching parent documents. I would need some special logic here, but when combining shard level results, we can ensure that we combine the top k parent documents given the individual passage scores

* @param innerHitBuilder The inner hit builder to set as current inner hit builder
* @return The previous inner hit builder
*/
public InnerHitBuilder nextLevelInnerHits(InnerHitBuilder innerHitBuilder) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed a way to tell the top-level & knn Query how many child documents we are wanting to gather. Since kNN is eager and returns only a wrapper around ScoreDoc values, we need to know how many total documents we care about.

this.childFilterIterator = childFilterIterator;
this.previouslyFoundChildren = previouslyFoundChildren;
this.parentBitSet = parentBitSet;
this.queue = new HitQueue(numChildrenPerParent, false);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may ask "Ben, why not just score all and return all vectors because you gotta iterate all the children anyways to find the next nearest". Well, there are degenerate cases we want to avoid. What if one vector has 10k children?

…com:benwtrent/elasticsearch into feature/add-more-passages-nested-knn-search
@jimczi
Copy link
Contributor

jimczi commented Jan 8, 2024

The other major way of handling this change would be figure out if NestedInnerHitSubContext could work. I am not 100% sure. I am still trying to figure out if that is even possible, but I spent a day+ twiddling with that and gave up.

The nested inner hit sub context should be the way to go imo.
I tried to reuse the contexts as much as possible and I ended up with:
https://github.com/jimczi/elasticsearch/pull/new/knn_nested_inner_hits
Is it similar to what you tried?
It only works for the knn section but it should be reproducible for the "knn as query path".

@benwtrent
Copy link
Member Author

The nested inner hit sub context should be the way to go imo.

@jimczi it seems to me the only way to do this (with inner sub context or without), is for there to be a new query that wraps our regular queries and scores the individual child docs given some previously calculated parent docs.

I have something like this working locally, I will see if it is cleaner to put that through the subcontext building process or not.

@jimczi
Copy link
Contributor

jimczi commented Jan 8, 2024

is for there to be a new query that wraps our regular queries and scores the individual child docs given some previously calculated parent docs.

See my PR, we only need a custom nested query builder (to override extractInnerHits) and an exact knn query builder.
We don't need to take the scored doc of the first pass (query phase) into account, I think it's ok to recompute every children and to build a top N per hits.

@benwtrent benwtrent marked this pull request as ready for review January 9, 2024 14:56
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Jan 9, 2024
Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left one general comment. I don't think we need to mix the query phase results with the fetch phase/inner hits. It's ok to recompute the similarity for each children in the fetch phase, that's how inner hits works.

* This query is used to score vector child documents given the results of a {@link DiversifyingChildrenByteKnnVectorQuery}
* or {@link DiversifyingChildrenFloatKnnVectorQuery}.
*/
public class ESDiversifyingChildrenKnnVectorQuery extends Query {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this query is needed? Is it to avoid scoring the children returned by the query phase?
I don't think we should use any of the child result of the query phase to compute the inner hits. Ideally we can just have a nested query builder that uses a brute force knn query to score all children. We don't need to bother with the DiversifyingChildren... queries in the fetch phase.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jimczi, for knn query, there is no way I could find to interject information without leaking the abstraction up to the nested query that it somewhere contains kNN results. We don't rewrite to a KnnScoreAndDoc query when using knn query. We just use what Lucene gives us, so we need to wrap the query lower down which is what I did here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouch, thanks I forgot this part. I am conflicted because this solution involves running the approximate query twice. The first time during the query phase and then another time during the inner hits fetch sub-phase. This is more an issue at the Lucene query level since we rely on the rewrite to do the heavy lifting and we have no way to determine the context where the rewrite is executed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am conflicted because this solution involves running the approximate query twice.

Ah, so nested will run the query again, completely from scratch when gathering the inner_hits? That is indeed frustrating.

I will think a bit more on this. Maybe there is something we can do in Elasticsearch.

Copy link
Member Author

@benwtrent benwtrent Jan 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, looking at the top-level call extract is done here:

InnerHitContextBuilder.extractInnerHits(query, innerHitBuilders);

This has access to a SearchExecutionContext and such. Which indicates to me we could have an additional thing that NestedQueryBuilder could call like rewriteForInnerHits. Multi-queries would have to push this down (but I think it would be easy enough), but then KnnVectorQueryBuilder could satisfy this interface and return a separate query builder that is a BruteForceQueryBuilder like you have.

What do you think?

EDIT: We may not need anything more than a "rewriteToInnerHitsBuilder" that defaults to return this. I am not sure including a SearchExecutionContext is even necessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I reverted something I shouldn't have. I will address and push again. I think this will work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, @jimczi I have it working. Need to add more tests but the gist of it is this:

  • For top-level knn, we end up rewriting within the DFS phase. But, we still need to ensure we only match the global top-k, which is determined by the global top-k nearest single nested vector. So, I added NestedKnnScoreDocQueryBuilder, which scores all children of the given parent vector.
  • For query-level knn, we need some way for kNN to do score all matching documents. I added a new interface called rewriteForInnerHits(), which returns a new ExactKnnQueryBuilder so that the nested context can score all the children documents correctly.

I want to add some test coverage for NestedKnnScoreDocQueryBuilder & ExactKnnQueryBuilder. But what do you think? This is a mix between our two methods. This way query-level kNN doesn't do the entire graph search again with inner-hits.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @benwtrent I like the idea. Another option would be to add a new QueryRewriteContext so that we don't need to implement the new functions on all multi-queries. The advantage would be that it would also work on custom queries added via plugin without requiring any change?
Just for completeness we could also consider having a knn queries that executes at the scorerSupplier level (we know the leadCost there so we could pick between exact and approximate) but that would be a bigger change and would also change the expectation (each segment would return topN results).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be to add a new QueryRewriteContext so that we don't need to implement the new functions on all multi-queries.

I thought of that option as well. I will see if it will work. I am not sure at what point the fetch will rewrite these queries.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jimczi added a "InnerHitsRewriteContext" and rewrite with it in the SearchService and call extractInnerHits on the rewritten query.

…com:benwtrent/elasticsearch into feature/add-more-passages-nested-knn-search
Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for all the iterations @benwtrent !

new ByteKnnVectorFieldSource(name()),
new ConstKnnByteVectorValueSource(bytes)
)
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:Can we use a FieldExistsQuery to filter out documents without a value? :

yield new BooleanQuery.Builder()
                    .add(new FieldExistsQuery(name()), BooleanClause.Occur.FILTER)
                    .add(new ByteVectorSimilarityFunction(
                            vectorSimilarityFunction,
                            new ByteKnnVectorFieldSource(name()),
                            new ConstKnnByteVectorValueSource(bytes)), BooleanClause.Occur.SHOULD)
                    .build();

That would remove the need to add a new ExactKnnQuery.

if (query != null) {
InnerHitContextBuilder.extractInnerHits(query, innerHitBuilders);
QueryBuilder rewrittenForInnerHits = Rewriteable.rewrite(query, innerHitsRewriteContext, true);
Copy link
Contributor

@mayya-sharipova mayya-sharipova Jan 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused how much rewrite is happening here. I think we should do instead:
query = Rewriteable.rewrite(query, innerHitsRewriteContext, true);

Otherwise we again need to rewrite the query on line 1246.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. These are separate rewrites. One is only for inner-hits extraction, the other is for querying. These could match different documents. The queries that are rewritten to between them could be different.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does rewrite for inner hits happen? It seems to me that it happens both in QueryPhase and FetchPhase. Does it mean that we run exact knn query two times, basically computing scores for all inner hits two times?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does rewrite for inner hits happen? It seems to me that it happens both in QueryPhase and FetchPhase. Does it mean that we run exact knn query two times, basically computing scores for all inner hits two times?

Inner hit query execution only happens during fetch. However no matter the phase, InnerHitContextBuilder.extractInnerHits(query, innerHitBuilders); is called. We should have a larger rewrite where InnerHitContextBuilder only occurs during the fetch phase.

This is why I don't use the rewrittenForInnerHits query anywhere but within extractInnerHits.

),
BooleanClause.Occur.SHOULD
)
.setMinimumNumberShouldMatch(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we don't dosetMinimumNumberShouldMatch(1) for the equivalent query for FLOAT type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, copy paste error. It should be removed

Copy link
Contributor

@mayya-sharipova mayya-sharipova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benwtrent @jimczi Thanks for your work!
I was following how this solution is evolving, and I very much like how short and elegant is the final version!

@mayya-sharipova
Copy link
Contributor

mayya-sharipova commented Jan 12, 2024

With a kNN query, this opens an interesting door, which allows for multiple inner_hit scoring schemes.

Yes, indeed that's a cool feature to use different score_mode.

Also for kNN query we can have additional nested filters that the top level kNN search doesn't support.

@benwtrent
Copy link
Member Author

Yes, indeed that's a cool feature to use different score_mode.

Well, in its current form, it does not. We can re-introduce it later.

@benwtrent benwtrent added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jan 17, 2024
@elasticsearchmachine elasticsearchmachine merged commit e4feaff into elastic:main Jan 17, 2024
15 checks passed
@benwtrent benwtrent deleted the feature/add-more-passages-nested-knn-search branch January 17, 2024 16:33
@karmi
Copy link
Contributor

karmi commented Jan 17, 2024

Many thanks, @benwtrent!

@dkarlovi
Copy link

This looks great! 😍 Thank you for working on this feature @benwtrent!

@ephraim-s
Copy link

Thank you @benwtrent!

As far as other score_mode options, I would be very interested in being able to do "sum" to score a document higher based on the summation of multiple close passages.
I just wonder how you would do this efficiently in HNSW without having to traverse the full graph...

@benwtrent
Copy link
Member Author

@ephraim-s

I just wonder how you would do this efficiently in HNSW without having to traverse the full graph...

Yeah, that is indeed a concern.

I could see us providing a way to gather the kNN vectors by nearest passage, and then allowing the inner hits to be re-ranked by some other score_mode, but doing the initial search with a different mode would get tricky.

As for right now, if you are using dot-product or maximum-inner-product, why not index a "sum" vector that is the summation of every passage vector? I would have to do some napkin math to make sure this works...

This python code seems to indicate that this would work in theory:

import numpy as np

# some test vectors
test_vectors = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
query_vector = np.array([10, 11, 12])

# dot product, or max_inner_product in Elasticsearch
dot_products = np.dot(test_vectors, query_vector)
sum_of_dot_products = np.sum(dot_products)

print("Sum of dot products:", sum_of_dot_products)
# compare with the dot product of the sum of the vectors
sum_of_vectors = np.sum(test_vectors, axis=0)
dot_product_sum = np.dot(sum_of_vectors, query_vector)
print("Dot product of the sum of the vectors:", dot_product_sum)

@ephraim-s
Copy link

I think that the idea of doing the max (nearest passage) and then reranking based on sum would get us most of the way there. I was considering doing that on my own post-query when returning n inner-hits, but doing it across the k nearest candidates within ES would be even better.

As far as summing the vectors, I think that would be fine for the case of fixed length documents (i.e. same number of passage vectors). However, if the size of the documents (and passages per document) varies wildly then summation will tend to overly favor longer documents. This would be less of an issue in the case of keyword search with score_mode:sum where most of the document will have no hits. In the vector case all of the passages will contribute something to the overall score.

This may be asking too much, but being able to sum over the contribution of the top n inner hits (where n >= min passages per doc) would be the optimal outcome. I realize that this type of custom ranking may need to be performed post query or possibly in the max and then rerank mentioned above.

jimczi added a commit to jimczi/elasticsearch that referenced this pull request Jan 31, 2025
…RewriteContext

The DFS and highlight phases require rewriting the Lucene query outside of the query phase.
However, if the query contains a k-NN query, this triggers a nearest neighbor search on the entire shard,
which is unnecessary in these phases since computing top-N results is not required.

This change builds upon elastic#104006, applying the same transformation used for nested inner hits.
As a result, DFS and highlight phases avoid wasting time and resources on costly nearest neighbor searches.

Note: The explain and matched query phases are also affected but still require the nearest neighbor search
for accurate results, so they remain unchanged for now.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >enhancement :Search Relevance/Vectors Vector search Team:Search Meta label for search team v8.13.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature Request] Support inner_hits.size > 1 for nested dense vector fields
7 participants