Add support for more than one inner_hit when searching nested vectors #104006

benwtrent · 2024-01-05T19:27:25Z

This commit adds the ability to gather more than one inner_hit when searching nested kNN.

Global kNN example

POST test/_search
{
    "_source": false,
    "fields": [
        "name"
    ],
    "knn": {
        "field": "nested.vector",
        "query_vector": [
            -0.5,
            90,
            -10,
            14.8,
            -156
        ],
        "k": 3,
        "num_candidates": 3,
        "inner_hits": {
            "size": 2,
            "fields": [
                "nested.paragraph_id"
            ],
            "_source": false
        }
    }
}

Results in

{
    "took": 66,
    "timed_out": false,
    "_shards": {
        "total": 2,
        "successful": 2,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 2,
            "relation": "eq"
        },
        "max_score": 0.009090909,
        "hits": [
            {
                "_index": "test",
                "_id": "2",
                "_score": 0.009090909,
                "fields": {
                    "name": [
                        "moose.jpg"
                    ]
                },
                "inner_hits": {
                    "nested": {
                        "hits": {
                            "total": {
                                "value": 2,
                                "relation": "eq"
                            },
                            "max_score": 0.009090909,
                            "hits": [
                                {
                                    "_index": "test",
                                    "_id": "2",
                                    "_nested": {
                                        "field": "nested",
                                        "offset": 0
                                    },
                                    "_score": 0.009090909,
                                    "fields": {
                                        "nested": [
                                            {
                                                "paragraph_id": [
                                                    "0"
                                                ]
                                            }
                                        ]
                                    }
                                },
                                {
                                    "_index": "test",
                                    "_id": "2",
                                    "_nested": {
                                        "field": "nested",
                                        "offset": 1
                                    },
                                    "_score": 0.004968944,
                                    "fields": {
                                        "nested": [
                                            {
                                                "paragraph_id": [
                                                    "2"
                                                ]
                                            }
                                        ]
                                    }
                                }
                            ]
                        }
                    }
                }
            },
            {
                "_index": "test",
                "_id": "3",
                "_score": 0.0021519717,
                "fields": {
                    "name": [
                        "rabbit.jpg"
                    ]
                },
                "inner_hits": {
                    "nested": {
                        "hits": {
                            "total": {
                                "value": 1,
                                "relation": "eq"
                            },
                            "max_score": 0.0021519717,
                            "hits": [
                                {
                                    "_index": "test",
                                    "_id": "3",
                                    "_nested": {
                                        "field": "nested",
                                        "offset": 0
                                    },
                                    "_score": 0.0021519717,
                                    "fields": {
                                        "nested": [
                                            {
                                                "paragraph_id": [
                                                    "0"
                                                ]
                                            }
                                        ]
                                    }
                                }
                            ]
                        }
                    }
                }
            }
        ]
    }
}
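
As a reading aid, here is a minimal Python sketch of how a client might flatten a response shaped like the one above into one row per inner hit. It assumes the response has already been parsed into a dict. The helper name `flatten_inner_hits` and its `path` and `field` arguments are hypothetical and not part of any Elasticsearch client.

```python
def flatten_inner_hits(response, path="nested", field="paragraph_id"):
    """Return one row per inner hit: (parent _id, nested offset, score, field value)."""
    rows = []
    for hit in response["hits"]["hits"]:
        inner = hit.get("inner_hits", {}).get(path, {}).get("hits", {}).get("hits", [])
        for child in inner:
            # Nested fields come back as a list of objects keyed by sub-field name.
            values = child["fields"][path][0][field]
            rows.append((hit["_id"], child["_nested"]["offset"], child["_score"], values[0]))
    return rows


# Trimmed copy of the response shown above (only the keys the helper reads).
response = {
    "hits": {"hits": [
        {"_id": "2", "inner_hits": {"nested": {"hits": {"hits": [
            {"_nested": {"field": "nested", "offset": 0}, "_score": 0.009090909,
             "fields": {"nested": [{"paragraph_id": ["0"]}]}},
            {"_nested": {"field": "nested", "offset": 1}, "_score": 0.004968944,
             "fields": {"nested": [{"paragraph_id": ["2"]}]}},
        ]}}}},
        {"_id": "3", "inner_hits": {"nested": {"hits": {"hits": [
            {"_nested": {"field": "nested", "offset": 0}, "_score": 0.0021519717,
             "fields": {"nested": [{"paragraph_id": ["0"]}]}},
        ]}}}},
    ]}
}

rows = flatten_inner_hits(response)
for row in rows:
    print(row)
```

Each parent contributes up to `inner_hits.size` rows, ordered by passage score within that parent.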

kNN Query example

With a kNN query, this opens an interesting door, which allows for multiple inner_hit scoring schemes.

Nearest by max passage only

POST test/_search
{
    "size": 3,
    "query": {
        "nested": {
            "path": "nested",
            "score_mode": "max",
            "query": {
                "knn": {
                    "field": "nested.vector",
                    "query_vector": [
                        -0.5,
                        90,
                        -10,
                        14.8,
                        -156
                    ],
                    "num_candidates": 5
                }
            },
            "inner_hits": {
                "size": 2,
                "_source": false,
                "fields": [
                    "nested.paragraph_id"
                ]
            }
        }
    }
}

closes: #102950


elasticsearchmachine · 2024-01-05T19:27:49Z

Hi @benwtrent, I've created a changelog YAML for you.

benwtrent

The other major way of handling this change would be to figure out if NestedInnerHitSubContext could work. I am not 100% sure. I am still trying to figure out if that is even possible, but I spent a day+ twiddling with that and gave up.

@jimczi let me know what you think. I will leave this as it is for now and try to see if I can get the subcontext stuff working.

benwtrent · 2024-01-05T19:28:35Z

...-spec/src/yamlRestTest/resources/rest-api-spec/test/search.vectors/100_knn_nested_search.yml

@@ -134,3 +137,93 @@ setup:
  - match: {hits.hits.0._id: "3"}
  - match: {hits.hits.0.fields.name.0: "rabbit.jpg"}
  - match: {hits.hits.0.inner_hits.nested.hits.hits.0.fields.nested.0.paragraph_id.0: "0"}
+---


This test shows the current bug around global top-k gathering in DFS. I still need to figure out how to get the global k docs diversified over the parents given the children documents.

benwtrent · 2024-01-05T19:30:01Z

server/src/main/java/org/elasticsearch/action/search/SearchPhaseController.java

@@ -161,6 +161,7 @@ public static List<DfsKnnResults> mergeKnnResults(SearchRequest request, List<Df

        List<DfsKnnResults> mergedResults = new ArrayList<>(request.source().knnSearch().size());
        for (int i = 0; i < request.source().knnSearch().size(); i++) {
+            //TODO how do we merge via diversified top-k parents?


I have a couple of ideas here. My main thought is that kNNResults will include not only the child document ScoreDocs, but also an int[] of the matching parent documents. I would need some special logic here, but when combining shard-level results, we can ensure that we combine the top k parent documents given the individual passage scores.
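
The merge idea above could look roughly like the following Python sketch. This is an illustration only, not the code in this PR: `ChildHit` and `merge_top_k_parents` are hypothetical names, and parents are identified by a single integer for simplicity, whereas real shard results carry per-shard doc ids.

```python
import heapq
from typing import NamedTuple


class ChildHit(NamedTuple):
    parent: int   # parent document id (simplified, hypothetical global id)
    child: int    # child (passage) document id
    score: float


def merge_top_k_parents(shard_results, k):
    """Merge per-shard child hits, keeping the k parents with the best single passage."""
    best = {}
    for hits in shard_results:
        for hit in hits:
            current = best.get(hit.parent)
            if current is None or hit.score > current.score:
                best[hit.parent] = hit
    # A parent's rank is decided by its nearest passage only.
    return heapq.nlargest(k, best.values(), key=lambda h: h.score)


shard_a = [ChildHit(1, 10, 0.9), ChildHit(1, 11, 0.8), ChildHit(2, 20, 0.5)]
shard_b = [ChildHit(3, 30, 0.7), ChildHit(4, 40, 0.1)]
top = merge_top_k_parents([shard_a, shard_b], k=2)
print([(h.parent, h.child) for h in top])
```

Without the per-parent deduplication, parent 1 would occupy two of the k slots, which is the diversification problem described in the comment.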

benwtrent · 2024-01-05T19:31:05Z

server/src/main/java/org/elasticsearch/index/query/support/NestedScope.java

+     * @param innerHitBuilder The inner hit builder to set as current inner hit builder
+     * @return The previous inner hit builder
+     */
+    public InnerHitBuilder nextLevelInnerHits(InnerHitBuilder innerHitBuilder) {


I needed a way to tell the top-level & knn Query how many child documents we want to gather. Since kNN is eager and returns only a wrapper around ScoreDoc values, we need to know how many total documents we care about.

server/src/main/java/org/elasticsearch/search/fetch/subphase/InnerHitsContext.java

server/src/main/java/org/elasticsearch/search/vectors/ChildBlockJoinVectorScorerProvider.java

benwtrent · 2024-01-05T19:34:58Z

server/src/main/java/org/elasticsearch/search/vectors/ChildBlockJoinVectorScorerProvider.java

+            this.childFilterIterator = childFilterIterator;
+            this.previouslyFoundChildren = previouslyFoundChildren;
+            this.parentBitSet = parentBitSet;
+            this.queue = new HitQueue(numChildrenPerParent, false);


You may ask "Ben, why not just score all and return all vectors because you gotta iterate all the children anyways to find the next nearest". Well, there are degenerate cases we want to avoid. What if one parent has 10k children?
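
To illustrate why a bounded queue helps in that degenerate case, here is a minimal Python sketch using a fixed-size min-heap. It is an analogy for the `HitQueue(numChildrenPerParent, false)` line in the diff, not a port of it; `top_children` and the synthetic scores are assumptions made for the example.

```python
import heapq


def top_children(scored_children, num_children_per_parent):
    """Keep only the best `num_children_per_parent` (score, child_id) pairs.

    A fixed-size min-heap keeps memory proportional to the requested size,
    even if the parent has thousands of children.
    """
    heap = []
    for child_id, score in scored_children:
        entry = (score, child_id)
        if len(heap) < num_children_per_parent:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # Evict the current worst entry and insert the better one.
            heapq.heapreplace(heap, entry)
    # Best first.
    return sorted(heap, reverse=True)


# A degenerate parent with 10k children; scores are synthetic and peak at child 5000.
children = ((child_id, 1.0 / (1 + abs(child_id - 5000))) for child_id in range(10_000))
best = top_children(children, num_children_per_parent=2)
print(best)
```

Every child is still visited once, but only two entries are ever held at a time.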

...src/main/java/org/elasticsearch/search/vectors/ESDiversifyingChildrenByteKnnVectorQuery.java


jimczi · 2024-01-08T14:27:12Z

> The other major way of handling this change would be to figure out if NestedInnerHitSubContext could work. I am not 100% sure. I am still trying to figure out if that is even possible, but I spent a day+ twiddling with that and gave up.

The nested inner hit sub context should be the way to go imo.
I tried to reuse the contexts as much as possible and I ended up with:
https://github.com/jimczi/elasticsearch/pull/new/knn_nested_inner_hits
Is it similar to what you tried?
It only works for the knn section but it should be reproducible for the "knn as query path".

benwtrent · 2024-01-08T17:00:59Z

> The nested inner hit sub context should be the way to go imo.

@jimczi it seems to me that the only way to do this (with or without the inner sub context) is for there to be a new query that wraps our regular queries and scores the individual child docs given some previously calculated parent docs.

I have something like this working locally. I will see if it is cleaner to put that through the subcontext building process or not.

jimczi · 2024-01-08T18:20:51Z

> is for there to be a new query that wraps our regular queries and scores the individual child docs given some previously calculated parent docs.

See my PR: we only need a custom nested query builder (to override extractInnerHits) and an exact knn query builder.
We don't need to take the scored docs of the first pass (query phase) into account. I think it's OK to recompute every child and to build a top N per hit.
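
A rough Python sketch of that "recompute every child, keep a top N per hit" approach follows. It assumes dot product as the similarity and plain lists as vectors; `exact_inner_hits` and the sample data are hypothetical and only illustrate the idea, not the exact knn query builder in the linked branch.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def exact_inner_hits(query_vector, children_by_parent, size):
    """For each parent hit, brute-force score every child and keep the top `size`."""
    result = {}
    for parent_id, child_vectors in children_by_parent.items():
        scored = [(offset, dot(vec, query_vector)) for offset, vec in enumerate(child_vectors)]
        # Highest score first; the sort is stable, so ties keep offset order.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[parent_id] = scored[:size]
    return result


# Parents that survived the query phase, with all of their child vectors.
parent_hits = {
    "2": [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
    "3": [[0.2, 0.1]],
}
inner = exact_inner_hits([1.0, 0.0], parent_hits, size=2)
print(inner)
```

The work is bounded by the number of children under the returned parents, not by the size of the index, which is why rescoring in the fetch phase is considered acceptable here.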

elasticsearchmachine · 2024-01-09T14:57:10Z

Pinging @elastic/es-search (Team:Search)

jimczi

I left one general comment. I don't think we need to mix the query phase results with the fetch phase/inner hits. It's OK to recompute the similarity for each child in the fetch phase; that's how inner hits work.

jimczi · 2024-01-10T11:50:52Z

server/src/main/java/org/elasticsearch/search/vectors/ESDiversifyingChildrenKnnVectorQuery.java

+ * This query is used to score vector child documents given the results of a {@link DiversifyingChildrenByteKnnVectorQuery}
+ * or {@link DiversifyingChildrenFloatKnnVectorQuery}.
+ */
+public class ESDiversifyingChildrenKnnVectorQuery extends Query {


I don't understand why this query is needed. Is it to avoid scoring the children returned by the query phase?
I don't think we should use any of the child results of the query phase to compute the inner hits. Ideally we can just have a nested query builder that uses a brute force knn query to score all children. We don't need to bother with the DiversifyingChildren... queries in the fetch phase.

@jimczi, for the knn query, I could find no way to inject this information without leaking the abstraction up to the nested query, i.e. that it contains kNN results somewhere. We don't rewrite to a KnnScoreAndDoc query when using the knn query. We just use what Lucene gives us, so we need to wrap the query lower down, which is what I did here.

Ouch, thanks, I forgot this part. I am conflicted because this solution involves running the approximate query twice: the first time during the query phase and then again during the inner hits fetch sub-phase. This is more of an issue at the Lucene query level, since we rely on the rewrite to do the heavy lifting and we have no way to determine the context in which the rewrite is executed.

> I am conflicted because this solution involves running the approximate query twice.

Ah, so nested will run the query again, completely from scratch when gathering the inner_hits? That is indeed frustrating.

I will think a bit more on this. Maybe there is something we can do in Elasticsearch.

OK, looking at the top-level call extract is done here:

elasticsearch/server/src/main/java/org/elasticsearch/search/SearchService.java

Line 1238 in f33122a

InnerHitContextBuilder.extractInnerHits(query, innerHitBuilders);

This has access to a SearchExecutionContext and such, which indicates to me that we could have an additional method that NestedQueryBuilder could call, like rewriteForInnerHits. Multi-queries would have to push this down (but I think it would be easy enough), and then KnnVectorQueryBuilder could satisfy this interface and return a separate query builder that is a BruteForceQueryBuilder like you have.

What do you think?

EDIT: We may not need anything more than a "rewriteToInnerHitsBuilder" that defaults to return this. I am not sure including a SearchExecutionContext is even necessary.

Ah, I reverted something I shouldn't have. I will address and push again. I think this will work.

OK, @jimczi I have it working. Need to add more tests but the gist of it is this:

For top-level knn, we end up rewriting within the DFS phase. But, we still need to ensure we only match the global top-k, which is determined by the global top-k nearest single nested vector. So, I added NestedKnnScoreDocQueryBuilder, which scores all children of the given parent vector.

For query-level knn, we need some way for kNN to score all matching documents. I added a new interface called rewriteForInnerHits(), which returns a new ExactKnnQueryBuilder so that the nested context can score all the children documents correctly.

I want to add some test coverage for NestedKnnScoreDocQueryBuilder & ExactKnnQueryBuilder. But what do you think? This is a mix between our two methods. This way query-level kNN doesn't do the entire graph search again with inner-hits.

Thanks @benwtrent, I like the idea. Another option would be to add a new QueryRewriteContext so that we don't need to implement the new functions on all multi-queries. The advantage would be that it would also work on custom queries added via plugin without requiring any change?
Just for completeness, we could also consider having a knn query that executes at the scorerSupplier level (we know the leadCost there, so we could pick between exact and approximate), but that would be a bigger change and would also change the expectation (each segment would return topN results).

> Another option would be to add a new QueryRewriteContext so that we don't need to implement the new functions on all multi-queries.

I thought of that option as well. I will see if it will work. I am not sure at what point the fetch will rewrite these queries.

@jimczi I added an "InnerHitsRewriteContext"; we now rewrite with it in the SearchService and call extractInnerHits on the rewritten query.
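
The rewrite-context idea discussed in this thread can be sketched in a few lines of Python. The classes below are toy stand-ins that borrow names from the conversation; they are not the Elasticsearch Java classes, and their constructors and `rewrite` signatures are assumptions made for the example.

```python
class QueryRewriteContext:
    """Default rewrite context, e.g. for the query phase."""


class InnerHitsRewriteContext(QueryRewriteContext):
    """Marker context: the rewrite is only used to extract inner hits."""


class ExactKnnQueryBuilder:
    """Stand-in for a builder that scores every child exactly."""

    def __init__(self, field, query_vector):
        self.field = field
        self.query_vector = query_vector

    def rewrite(self, context):
        return self


class KnnVectorQueryBuilder:
    """Stand-in for the approximate kNN query builder."""

    def __init__(self, field, query_vector, num_candidates):
        self.field = field
        self.query_vector = query_vector
        self.num_candidates = num_candidates

    def rewrite(self, context):
        # Only the inner-hits context swaps approximate search for exact scoring.
        if isinstance(context, InnerHitsRewriteContext):
            return ExactKnnQueryBuilder(self.field, self.query_vector)
        return self


class NestedQueryBuilder:
    """Stand-in for a wrapping query; it only forwards the rewrite."""

    def __init__(self, path, query):
        self.path = path
        self.query = query

    def rewrite(self, context):
        rewritten = self.query.rewrite(context)
        if rewritten is self.query:
            return self
        return NestedQueryBuilder(self.path, rewritten)


query = NestedQueryBuilder("nested", KnnVectorQueryBuilder("nested.vector", [0.1, 0.2], 5))
for_query_phase = query.rewrite(QueryRewriteContext())
for_inner_hits = query.rewrite(InnerHitsRewriteContext())
print(type(for_query_phase.query).__name__, type(for_inner_hits.query).__name__)
```

Because the decision lives in the context type, wrapping queries need no knowledge of kNN; they forward whatever context they receive, which is the advantage raised above for custom queries added via plugins.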


jimczi

LGTM, thanks for all the iterations @benwtrent !

jimczi · 2024-01-12T12:06:12Z

server/src/main/java/org/elasticsearch/index/mapper/vectors/DenseVectorFieldMapper.java

+                            new ByteKnnVectorFieldSource(name()),
+                            new ConstKnnByteVectorValueSource(bytes)
+                        )
+                    );


nit: Can we use a FieldExistsQuery to filter out documents without a value?

yield new BooleanQuery.Builder()
    .add(new FieldExistsQuery(name()), BooleanClause.Occur.FILTER)
    .add(
        new ByteVectorSimilarityFunction(
            vectorSimilarityFunction,
            new ByteKnnVectorFieldSource(name()),
            new ConstKnnByteVectorValueSource(bytes)
        ),
        BooleanClause.Occur.SHOULD
    )
    .build();

That would remove the need to add a new ExactKnnQuery.

server/src/main/java/org/elasticsearch/index/mapper/vectors/DenseVectorFieldMapper.java


server/src/main/java/org/elasticsearch/search/vectors/KnnScoreDocQueryBuilder.java

mayya-sharipova · 2024-01-12T15:24:07Z

server/src/main/java/org/elasticsearch/search/SearchService.java

        if (query != null) {
-            InnerHitContextBuilder.extractInnerHits(query, innerHitBuilders);
+            QueryBuilder rewrittenForInnerHits = Rewriteable.rewrite(query, innerHitsRewriteContext, true);


I am confused about how much rewriting is happening here. I think we should instead do:
query = Rewriteable.rewrite(query, innerHitsRewriteContext, true);

Otherwise we need to rewrite the query again on line 1246.

I don't think so. These are separate rewrites. One is only for inner-hits extraction, the other is for querying. These could match different documents, and the queries they rewrite to could be different.

Where does rewrite for inner hits happen? It seems to me that it happens both in QueryPhase and FetchPhase. Does it mean that we run exact knn query two times, basically computing scores for all inner hits two times?

> Where does rewrite for inner hits happen? It seems to me that it happens both in QueryPhase and FetchPhase. Does it mean that we run exact knn query two times, basically computing scores for all inner hits two times?

Inner hit query execution only happens during fetch. However, no matter the phase, InnerHitContextBuilder.extractInnerHits(query, innerHitBuilders); is called. We should have a larger refactoring so that InnerHitContextBuilder is only used during the fetch phase.

This is why I don't use the rewrittenForInnerHits query anywhere but within extractInnerHits.

mayya-sharipova · 2024-01-12T19:54:37Z

server/src/main/java/org/elasticsearch/index/mapper/vectors/DenseVectorFieldMapper.java

+                            ),
+                            BooleanClause.Occur.SHOULD
+                        )
+                        .setMinimumNumberShouldMatch(1)


Is there a reason we don't do setMinimumNumberShouldMatch(1) for the equivalent query for FLOAT type?

No, copy-paste error. It should be removed.

mayya-sharipova

@benwtrent @jimczi Thanks for your work!
I was following how this solution evolved, and I very much like how short and elegant the final version is!

mayya-sharipova · 2024-01-12T20:41:25Z

> With a kNN query, this opens an interesting door, which allows for multiple inner_hit scoring schemes.

Yes, indeed that's a cool feature to use different score_mode.

Also for kNN query we can have additional nested filters that the top level kNN search doesn't support.

benwtrent · 2024-01-12T21:32:03Z

> Yes, indeed that's a cool feature to use different score_mode.

Well, in its current form, it does not. We can re-introduce it later.


karmi · 2024-01-17T16:38:26Z

Many thanks, @benwtrent!

dkarlovi · 2024-01-18T09:38:16Z

This looks great! 😍 Thank you for working on this feature @benwtrent!

ephraim-s · 2024-05-07T16:58:12Z

Thank you @benwtrent!

As far as other score_mode options, I would be very interested in being able to do "sum" to score a document higher based on the summation of multiple close passages.
I just wonder how you would do this efficiently in HNSW without having to traverse the full graph...

benwtrent · 2024-05-07T20:22:26Z

@ephraim-s

> I just wonder how you would do this efficiently in HNSW without having to traverse the full graph...

Yeah, that is indeed a concern.

I could see us providing a way to gather the kNN vectors by nearest passage, and then allowing the inner hits to be re-ranked by some other score_mode, but doing the initial search with a different mode would get tricky.

As for right now, if you are using dot-product or maximum-inner-product, why not index a "sum" vector that is the summation of every passage vector? I would have to do some napkin math to make sure this works...

This Python code seems to indicate that this would work in theory:

import numpy as np

# some test vectors
test_vectors = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
query_vector = np.array([10, 11, 12])

# dot product, or max_inner_product in Elasticsearch
dot_products = np.dot(test_vectors, query_vector)
sum_of_dot_products = np.sum(dot_products)

print("Sum of dot products:", sum_of_dot_products)
# compare with the dot product of the sum of the vectors
sum_of_vectors = np.sum(test_vectors, axis=0)
dot_product_sum = np.dot(sum_of_vectors, query_vector)
print("Dot product of the sum of the vectors:", dot_product_sum)

ephraim-s · 2024-05-08T13:37:06Z

I think that the idea of doing the max (nearest passage) and then reranking based on sum would get us most of the way there. I was considering doing that on my own post-query when returning n inner-hits, but doing it across the k nearest candidates within ES would be even better.

As far as summing the vectors, I think that would be fine for the case of fixed length documents (i.e. same number of passage vectors). However, if the size of the documents (and passages per document) varies wildly then summation will tend to overly favor longer documents. This would be less of an issue in the case of keyword search with score_mode:sum where most of the document will have no hits. In the vector case all of the passages will contribute something to the overall score.

This may be asking too much, but being able to sum over the contribution of the top n inner hits (where n >= min passages per doc) would be the optimal outcome. I realize that this type of custom ranking may need to be performed post query or possibly in the max and then rerank mentioned above.
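
A small Python sketch of the "sum over the top n inner hits" idea, done client-side as a post-query rerank. It assumes dot-product similarity and that the passage vectors are available to the caller; `rerank_by_top_n_sum` and the sample documents are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rerank_by_top_n_sum(query_vector, passages_by_doc, n):
    """Score each document by the sum of its n best passage dot products."""
    scores = {}
    for doc_id, passages in passages_by_doc.items():
        best = sorted((dot(p, query_vector) for p in passages), reverse=True)[:n]
        scores[doc_id] = sum(best)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


query = [1.0, 0.0]
docs = {
    "short": [[0.9, 0.0], [0.8, 0.0]],   # two close passages
    "long": [[0.3, 0.0]] * 10,           # ten mediocre passages
}
top2 = rerank_by_top_n_sum(query, docs, n=2)
everything = rerank_by_top_n_sum(query, docs, n=10)
print([doc_id for doc_id, _ in top2])
print([doc_id for doc_id, _ in everything])
```

With n=2 the short document ranks first. Summing every passage (n=10) lets the long document overtake it, which is the length bias described above.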

…RewriteContext The DFS and highlight phases require rewriting the Lucene query outside of the query phase. However, if the query contains a k-NN query, this triggers a nearest neighbor search on the entire shard, which is unnecessary in these phases since computing top-N results is not required. This change builds upon elastic#104006, applying the same transformation used for nested inner hits. As a result, DFS and highlight phases avoid wasting time and resources on costly nearest neighbor searches. Note: The explain and matched query phases are also affected but still require the nearest neighbor search for accurate results, so they remain unchanged for now.

benwtrent added 8 commits January 3, 2024 16:40

Allow more than one nearest passage to be returned via nested kNN

6fbb7cf

iter

a11921e

Merge remote-tracking branch 'upstream/main' into feature/add-more-passages-nested-knn-search

5435c4a

iter

cdbbe2d

Merge remote-tracking branch 'upstream/main' into feature/add-more-passages-nested-knn-search

3db5822

iter

149acf1

iter

273a69d

iter

419d5aa

benwtrent added >enhancement :Search Relevance/Vectors Vector search v8.13.0 labels Jan 5, 2024

Update docs/changelog/104006.yaml

2f9ed46

benwtrent commented Jan 5, 2024

benwtrent added 2 commits January 5, 2024 15:23

fixing some tests

800371b

Merge branch 'feature/add-more-passages-nested-knn-search' of github.com:benwtrent/elasticsearch into feature/add-more-passages-nested-knn-search

628a688

benwtrent and others added 4 commits January 8, 2024 13:51

fix top level kNN

7bc1881

fixing tests, refactoring

bf34dc3

fixing tests and cleaning up

26942c6

Merge branch 'main' into feature/add-more-passages-nested-knn-search

aa0a14f

benwtrent marked this pull request as ready for review January 9, 2024 14:56

elasticsearchmachine added the Team:Search Meta label for search team label Jan 9, 2024

benwtrent requested a review from mayya-sharipova January 9, 2024 19:15

jimczi reviewed Jan 10, 2024

benwtrent added 2 commits January 10, 2024 15:50

attempt 2

afd4069

Merge branch 'feature/add-more-passages-nested-knn-search' of github.com:benwtrent/elasticsearch into feature/add-more-passages-nested-knn-search

405ef16

benwtrent added 3 commits January 11, 2024 13:23

Merge remote-tracking branch 'upstream/main' into feature/add-more-passages-nested-knn-search

2c95fd4

fixing tests

375947e

simplifying

8b5fa93

benwtrent requested review from mayya-sharipova and jimczi January 11, 2024 21:08

jimczi approved these changes Jan 12, 2024

benwtrent added 2 commits January 12, 2024 07:22

Merge remote-tracking branch 'upstream/main' into feature/add-more-passages-nested-knn-search

31f7139

fixing tests and simplifying

d2dfdef

mayya-sharipova reviewed Jan 12, 2024

server/src/main/java/org/elasticsearch/search/vectors/KnnScoreDocQueryBuilder.java

mayya-sharipova reviewed Jan 12, 2024

fixing tests

f2efd95

mayya-sharipova reviewed Jan 12, 2024

mayya-sharipova approved these changes Jan 12, 2024

benwtrent added 2 commits January 17, 2024 10:28

Merge remote-tracking branch 'upstream/main' into feature/add-more-passages-nested-knn-search

ae8f334

fixing bug

1083602

benwtrent added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jan 17, 2024

benwtrent mentioned this pull request Jan 17, 2024

Allow representing a document with multiple embeddings (dense vectors) #72068

Open

elasticsearchmachine merged commit e4feaff into elastic:main Jan 17, 2024
15 checks passed

benwtrent deleted the feature/add-more-passages-nested-knn-search branch January 17, 2024 16:33

miltonhultgren mentioned this pull request Feb 2, 2024

[Obs AI Assistant] Make content from Search connectors fully searchable elastic/kibana#175434

Closed

konstadin mentioned this pull request Jun 21, 2024

[FEATURE] add support for more than one kNN query on nested vectors with multiple inner hits and filter opensearch-project/k-NN#1768

Closed

jimczi mentioned this pull request Jan 31, 2025

Refactor InnerHitsRewriteContext into a more generic PerDocumentQueryRewriteContext #121405

Open
