Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AbstractKnnVectorQuery.seed for seeded HNSW #13635

Open
wants to merge 40 commits into
base: main
Choose a base branch
from

Conversation

seanmacavaney
Copy link

Description

This PR addresses #13634.

The main changes are in:

  • AbstractKnnVectorQuery, which adds a seed field. It scores this query if provided, and passes these seeds along to the HnswGraphSearcher as a DocIdSetIterator.
  • HnswGraphSearcher, which bypasses the findBestEntryPoint step if seeds are provided.

Unfortunately, passing the seeds down to HnswGraphSearcher resulted in changes in several places, e.g., to backward codecs.

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be very interesting to see how this behaves in Lucene directly. I could see it being very nice, or behaving poorly depending on the seed query (which, I guess is expected).

The API of adding the DocIdSetIterator for the seeding docs is what I would initially think of as well.

@seanmacavaney
Copy link
Author

I could see it being very nice, or behaving poorly depending on the seed query (which, I guess is expected).

We could probably predict whether a seed set is good or bad based on how many documents it's needed to score and terminate early if it seems like it's too many. Could be accomplished with the existing KnnCollector visited count. I'll look into this.

@seanmacavaney
Copy link
Author

Thanks all! I was away for a bit but back on it now :)

seanmacavaney and others added 12 commits September 5, 2024 14:47
….java

Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
….java

Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
…ry.java

Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
…cher.java

Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
…cher.java

Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
…rQuery.java

Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
…rQuery.java

Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
@seanmacavaney seanmacavaney marked this pull request as ready for review September 10, 2024 10:52
@benwtrent benwtrent self-requested a review September 10, 2024 19:06
package org.apache.lucene.search;

/** A {@link KnnCollector} that provides an initial set of seeds to initialize the search. */
class SeededKnnCollector implements KnnCollector {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor/subjective: could potentially factor out a FilterKnnCollector class (similar to FilterCollector) either in addition to or instead of SeededKnnCollector e.g.

class FilterKnnCollector implements KnnCollector
class SeededKnnCollector extends FilterKnnCollector

or

class FilterKnnCollector implements KnnCollector

if (seedDocs != null) {
  knnCollector = new FilterKnnCollector(knnCollector) {
    @Override
    public DocIdSetIterator getSeedEntryPoints() {
      return seedDocs;
    }
  };
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that, thanks!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I've added a generic KnnCollector.Decorator, which applies in several places. I stopped short of applying filtering via this same mechanism, since that would ultamately change the LeafCollector.

@seanmacavaney
Copy link
Author

Thanks all! Here is a summary of the design:

  1. Addition of seed to *KnnVectorQuery. This is the main user-facing API.
  2. Addition of methods to convert between docids and ordinals in *KnnVectorQuery and *VectorValues. A new inner class MappedDISI facilitates this conversion process.
  3. A new SeededKnnCollector with a getSeedEntryPoints method to faciliate passing the seed ordinals down to the HnswGraphSearcher without affecting the LeafReader.

Is this better? If so, I want to go ahead and re-run some benchmarks to make sure that performance didn't degrade with the redesign.

* @lucene.experimental
*/
public abstract static class Decorator implements KnnCollector {
private KnnCollector collector;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private KnnCollector collector;
final private KnnCollector collector;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or could be protected if deriving classes might need access

Suggested change
private KnnCollector collector;
final protected KnnCollector collector;

* @lucene.experimental
*/
public static class Seeded extends Decorator {
private DocIdSetIterator seedEntryPoints;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private DocIdSetIterator seedEntryPoints;
final private DocIdSetIterator seedEntryPoints;

Comment on lines 112 to 114
TimeLimitingKnnCollectorManager knnCollectorManager =
new TimeLimitingKnnCollectorManager(
getKnnCollectorManager(k, indexSearcher), indexSearcher.getTimeout());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I was actually thinking of a new collector manager as well that would:

  • delegate to other managers as necessary
  • Build the seed weight & run the query when it builds the individual collectors (since it has access to the leaf context).

This helps abstract away all the messing about with the seed stuff into

SeededKnnCollectorManager which will then return SeededKnnCollectors that again, delegate as necessary.

We don't need to really update anything directly in any of the queries as all the new logic could live in these two new classes.

This will make the change much more contained, and keep us from having to update the approximateSearch API as the query classes are not experimental (and thus subject to bwc).

Comment on lines +70 to +78
ArrayList<Integer> entryPointOrdInts = null;
DocIdSetIterator entryPoints = knnCollector.getSeedEntryPoints();
if (entryPoints != null) {
entryPointOrdInts = new ArrayList<Integer>();
int entryPointOrdInt;
while ((entryPointOrdInt = entryPoints.nextDoc()) != NO_MORE_DOCS) {
entryPointOrdInts.add(entryPointOrdInt);
}
}
Copy link
Member

@benwtrent benwtrent Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ArrayList<Integer> entryPointOrdInts = null;
DocIdSetIterator entryPoints = knnCollector.getSeedEntryPoints();
if (entryPoints != null) {
entryPointOrdInts = new ArrayList<Integer>();
int entryPointOrdInt;
while ((entryPointOrdInt = entryPoints.nextDoc()) != NO_MORE_DOCS) {
entryPointOrdInts.add(entryPointOrdInt);
}
}
final int[] entryPoints;
if (knnCollector instanceOf SeededEntryPointProvider seeded) {
DocIdSetIterator entryPoints = seeded.getSeedEntryPoints();
if (entryPoints != null) {
entryPoints = new int[entryPoints.cost()];
int entryPointOrdInt;
while ((entryPointOrdInt = entryPoints.nextDoc()) != NO_MORE_DOCS) {
entryPointOrdInts.add(entryPointOrdInt);
}
}
} else {
entryPoints = null;
}

I think we can do this.

I am not sure about adding a new default thing to ALL collectors, but having an interface that the SeededKnnCollector satisfies would be just fine.

* @lucene.experimental
*/
public DocIdSetIterator convertDocIdsToVectorOrdinals(DocIdSetIterator docIds) {
return docIds;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if this could move to the KnnVectorValues base class and be abstract and/or if a default implementation here (and in FloatVectorValues) should be throw new UnsupportedOperationException(); i.e. a mapping needs to be provided by the implementations of this abstract class?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I am against default implementations that throw, just require the interface and thus catch it at compile time.

@benwtrent
Copy link
Member

@seanmacavaney I realize I am making some broad API suggestions. If it would help, I can attempt to convey my ideas in a branch from your fork.

I do like the direction now by providing something via the collectors. This allows us to keep the query objects effectively unchanged (which is sensitive to BWC).

Comment on lines +222 to +226
leafCollector.finish();
} catch (
@SuppressWarnings("unused")
CollectionTerminatedException e) {
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
leafCollector.finish();
} catch (
@SuppressWarnings("unused")
CollectionTerminatedException e) {
}
} catch (
@SuppressWarnings("unused")
CollectionTerminatedException e) {
}
leafCollector.finish();

@benwtrent
Copy link
Member

@seanmacavaney here is my idea more fully: https://github.com/apache/lucene/compare/main...benwtrent:lucene:seeds-refactor-idea?expand=1

I didn't create a byte version of the query, nor am I handling IOExceptions correctly, but this should better show what I had in mind.

All the code should be able to be mostly encapsulated into the collector & collector manager with no changes to either query.

@seanmacavaney
Copy link
Author

Thanks- I like the design. I'll try to make the remaining changes accordingly.

@cpoerschke
Copy link
Contributor

@seanmacavaney here is my idea more fully: https://github.com/apache/lucene/compare/main...benwtrent:lucene:seeds-refactor-idea?expand=1
...

Wondering if the

public SeededKnnFloatVectorQuery(String field, float[] target, int k, Query filter, Query seed) {

then potentially makes it easier to have a seed count value too i.e. it could be different from k value?

public SeededKnnFloatVectorQuery(String field, float[] target, int k, Query filter, Query seed, int seed_count) {

@seanmacavaney
Copy link
Author

Good call!

@benwtrent
Copy link
Member

Hey @seanmacavaney didn't want this to die on the vine. I think with some refactoring and adding new experimental queries, this could be a nice experimental feature for vector search.

Do you want help on refactoring and pushing code?

Am I blocking your progress in anyway?

@seanmacavaney
Copy link
Author

Not at all-- thanks a lot for the help @benwtrent! I totally agree with the proposed changes and it's clear how to move forward on this. I'm just occupied with several other priorities at the uni right now, so I haven't been able to get to these.

Copy link

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Oct 31, 2024
@benwtrent
Copy link
Member

@seanmacavaney do you still want this contributed to Apache Lucene? Its excellent work and I don't want it dying on the vine.

If its ok with you, I plan on refactoring it (with attribution to you of course), and attempting to get it into Lucene as a new query. Seeding entry points for HNSW exploration (and not just HNSW, but other vector indices as well), is powerful and can provide some unique interactions with vector search.

@github-actions github-actions bot removed the Stale label Dec 20, 2024
@seanmacavaney
Copy link
Author

Hey @benwtrent -- it's been on my todo list to get back to this, but I've gotten bogged down with a bunch of other stuff.

If you're willing, please do go ahead and refactor :D! (I'd also love to see how the refactor looks to better understand Lucene internals.)

Copy link

github-actions bot commented Jan 4, 2025

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants