Add AbstractKnnVectorQuery.seed for seeded HNSW #13635

seanmacavaney · 2024-08-06T11:45:15Z

Description

This PR addresses #13634.

The main changes are in:

AbstractKnnVectorQuery, which adds a seed field. It scores this query if provided, and passes these seeds along to the HnswGraphSearcher as a DocIdSetIterator.
HnswGraphSearcher, which bypasses the findBestEntryPoint step if seeds are provided.

Unfortunately, passing the seeds down to HnswGraphSearcher resulted in changes in several places, e.g., to backward codecs.

benwtrent

It would be very interesting to see how this behaves in Lucene directly. I could see it being very nice, or behaving poorly depending on the seed query (which, I guess is expected).

The API of adding the DocIdSetIterator for the seeding docs is what I would initially think of as well.

lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java

lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java

seanmacavaney · 2024-08-06T13:52:43Z

I could see it being very nice, or behaving poorly depending on the seed query (which, I guess is expected).

We could probably predict whether a seed set is good or bad based on how many documents it's needed to score and terminate early if it seems like it's too many. Could be accomplished with the existing KnnCollector visited count. I'll look into this.

per apache#13635 (comment)

lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java

lucene/core/src/java/org/apache/lucene/search/KnnByteVectorQuery.java

lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java

lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java

seanmacavaney · 2024-09-05T12:22:20Z

Thanks all! I was away for a bit but back on it now :)

….java Co-authored-by: Christine Poerschke <cpoerschke@apache.org>

…ry.java Co-authored-by: Christine Poerschke <cpoerschke@apache.org>

…cher.java Co-authored-by: Christine Poerschke <cpoerschke@apache.org>

…rQuery.java Co-authored-by: Christine Poerschke <cpoerschke@apache.org>

cpoerschke · 2024-10-02T10:39:06Z

lucene/core/src/java/org/apache/lucene/search/SeededKnnCollector.java

+package org.apache.lucene.search;
+
+/** A {@link KnnCollector} that provides an initial set of seeds to initialize the search. */
+class SeededKnnCollector implements KnnCollector {


minor/subjective: could potentially factor out a FilterKnnCollector class (similar to FilterCollector) either in addition to or instead of SeededKnnCollector e.g.

class FilterKnnCollector implements KnnCollector class SeededKnnCollector extends FilterKnnCollector

or

class FilterKnnCollector implements KnnCollector if (seedDocs != null) { knnCollector = new FilterKnnCollector(knnCollector) { @Override public DocIdSetIterator getSeedEntryPoints() { return seedDocs; } }; }

I like that, thanks!

Alright, I've added a generic KnnCollector.Decorator, which applies in several places. I stopped short of applying filtering via this same mechanism, since that would ultamately change the LeafCollector.

seanmacavaney · 2024-10-02T10:47:14Z

Thanks all! Here is a summary of the design:

Addition of seed to *KnnVectorQuery. This is the main user-facing API.
Addition of methods to convert between docids and ordinals in *KnnVectorQuery and *VectorValues. A new inner class MappedDISI facilitates this conversion process.
A new SeededKnnCollector with a getSeedEntryPoints method to faciliate passing the seed ordinals down to the HnswGraphSearcher without affecting the LeafReader.

Is this better? If so, I want to go ahead and re-run some benchmarks to make sure that performance didn't degrade with the redesign.

cpoerschke · 2024-10-02T16:31:27Z

lucene/core/src/java/org/apache/lucene/search/KnnCollector.java

+   * @lucene.experimental
+   */
+  public abstract static class Decorator implements KnnCollector {
+    private KnnCollector collector;


Suggested change

private KnnCollector collector;

final private KnnCollector collector;

or could be protected if deriving classes might need access

Suggested change

private KnnCollector collector;

final protected KnnCollector collector;

cpoerschke · 2024-10-02T16:32:28Z

lucene/core/src/java/org/apache/lucene/search/KnnCollector.java

+   * @lucene.experimental
+   */
+  public static class Seeded extends Decorator {
+    private DocIdSetIterator seedEntryPoints;


Suggested change

private DocIdSetIterator seedEntryPoints;

final private DocIdSetIterator seedEntryPoints;

benwtrent · 2024-10-02T16:47:27Z

lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java

    TimeLimitingKnnCollectorManager knnCollectorManager =
        new TimeLimitingKnnCollectorManager(
            getKnnCollectorManager(k, indexSearcher), indexSearcher.getTimeout());


So, I was actually thinking of a new collector manager as well that would:

delegate to other managers as necessary

Build the seed weight & run the query when it builds the individual collectors (since it has access to the leaf context).

This helps abstract away all the messing about with the seed stuff into

SeededKnnCollectorManager which will then return SeededKnnCollectors that again, delegate as necessary.

We don't need to really update anything directly in any of the queries as all the new logic could live in these two new classes.

This will make the change much more contained, and keep us from having to update the approximateSearch API as the query classes are not experimental (and thus subject to bwc).

benwtrent · 2024-10-02T17:01:21Z

lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java

+    ArrayList<Integer> entryPointOrdInts = null;
+    DocIdSetIterator entryPoints = knnCollector.getSeedEntryPoints();
+    if (entryPoints != null) {
+      entryPointOrdInts = new ArrayList<Integer>();
+      int entryPointOrdInt;
+      while ((entryPointOrdInt = entryPoints.nextDoc()) != NO_MORE_DOCS) {
+        entryPointOrdInts.add(entryPointOrdInt);
+      }
+    }


Suggested change

ArrayList<Integer> entryPointOrdInts = null;

DocIdSetIterator entryPoints = knnCollector.getSeedEntryPoints();

if (entryPoints != null) {

entryPointOrdInts = new ArrayList<Integer>();

int entryPointOrdInt;

while ((entryPointOrdInt = entryPoints.nextDoc()) != NO_MORE_DOCS) {

entryPointOrdInts.add(entryPointOrdInt);

}

}

final int[] entryPoints;

if (knnCollector instanceOf SeededEntryPointProvider seeded) {

DocIdSetIterator entryPoints = seeded.getSeedEntryPoints();

if (entryPoints != null) {

entryPoints = new int[entryPoints.cost()];

int entryPointOrdInt;

while ((entryPointOrdInt = entryPoints.nextDoc()) != NO_MORE_DOCS) {

entryPointOrdInts.add(entryPointOrdInt);

}

}

} else {

entryPoints = null;

}

I think we can do this.

I am not sure about adding a new default thing to ALL collectors, but having an interface that the SeededKnnCollector satisfies would be just fine.

cpoerschke · 2024-10-02T17:02:50Z

lucene/core/src/java/org/apache/lucene/index/ByteVectorValues.java

+   * @lucene.experimental
+   */
+  public DocIdSetIterator convertDocIdsToVectorOrdinals(DocIdSetIterator docIds) {
+    return docIds;


Wondering if this could move to the KnnVectorValues base class and be abstract and/or if a default implementation here (and in FloatVectorValues) should be throw new UnsupportedOperationException(); i.e. a mapping needs to be provided by the implementations of this abstract class?

FWIW, I am against default implementations that throw, just require the interface and thus catch it at compile time.

benwtrent · 2024-10-02T17:03:15Z

@seanmacavaney I realize I am making some broad API suggestions. If it would help, I can attempt to convey my ideas in a branch from your fork.

I do like the direction now by providing something via the collectors. This allows us to keep the query objects effectively unchanged (which is sensitive to BWC).

cpoerschke · 2024-10-02T17:06:52Z

lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java

+        leafCollector.finish();
+      } catch (
+          @SuppressWarnings("unused")
+          CollectionTerminatedException e) {
+      }


Suggested change

leafCollector.finish();

} catch (

@SuppressWarnings("unused")

CollectionTerminatedException e) {

}

} catch (

@SuppressWarnings("unused")

CollectionTerminatedException e) {

}

leafCollector.finish();

benwtrent · 2024-10-02T17:37:59Z

@seanmacavaney here is my idea more fully: https://github.com/apache/lucene/compare/main...benwtrent:lucene:seeds-refactor-idea?expand=1

I didn't create a byte version of the query, nor am I handling IOExceptions correctly, but this should better show what I had in mind.

All the code should be able to be mostly encapsulated into the collector & collector manager with no changes to either query.

seanmacavaney · 2024-10-03T08:42:58Z

Thanks- I like the design. I'll try to make the remaining changes accordingly.

cpoerschke · 2024-10-03T08:46:42Z

@seanmacavaney here is my idea more fully: https://github.com/apache/lucene/compare/main...benwtrent:lucene:seeds-refactor-idea?expand=1
...

Wondering if the

public SeededKnnFloatVectorQuery(String field, float[] target, int k, Query filter, Query seed) {

then potentially makes it easier to have a seed count value too i.e. it could be different from k value?

public SeededKnnFloatVectorQuery(String field, float[] target, int k, Query filter, Query seed, int seed_count) {

seanmacavaney · 2024-10-03T08:47:48Z

Good call!

benwtrent · 2024-10-16T15:09:15Z

Hey @seanmacavaney didn't want this to die on the vine. I think with some refactoring and adding new experimental queries, this could be a nice experimental feature for vector search.

Do you want help on refactoring and pushing code?

Am I blocking your progress in anyway?

seanmacavaney · 2024-10-16T15:34:41Z

Not at all-- thanks a lot for the help @benwtrent! I totally agree with the proposed changes and it's clear how to move forward on this. I'm just occupied with several other priorities at the uni right now, so I haven't been able to get to these.

github-actions · 2024-10-31T00:23:18Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

benwtrent · 2024-12-19T15:39:50Z

@seanmacavaney do you still want this contributed to Apache Lucene? Its excellent work and I don't want it dying on the vine.

If its ok with you, I plan on refactoring it (with attribution to you of course), and attempting to get it into Lucene as a new query. Seeding entry points for HNSW exploration (and not just HNSW, but other vector indices as well), is powerful and can provide some unique interactions with vector search.

seanmacavaney · 2024-12-20T01:26:09Z

Hey @benwtrent -- it's been on my todo list to get back to this, but I've gotten bogged down with a bunch of other stuff.

If you're willing, please do go ahead and refactor :D! (I'd also love to see how the refactor looks to better understand Lucene internals.)

github-actions · 2025-01-04T00:22:58Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

Sean MacAvaney added 2 commits August 6, 2024 11:26

implement seeded knn queries

60e3fab

cleanup

82e7053

seanmacavaney mentioned this pull request Aug 6, 2024

Seeding HNSW Search #13634

Open

benwtrent reviewed Aug 6, 2024

View reviewed changes

lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java Outdated Show resolved Hide resolved

lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java Show resolved Hide resolved

Sean MacAvaney added 2 commits August 6, 2024 15:12

ensure seed docs have a vector

7955148

per apache#13635 (comment)

apply filter to seed queries

40d972d

per apache#13635 (comment)