-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add AbstractKnnVectorQuery.seed for seeded HNSW #13635
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be very interesting to see how this behaves in Lucene directly. I could see it being very nice, or behaving poorly depending on the seed query (which, I guess is expected).
The API of adding the DocIdSetIterator
for the seeding docs is what I would initially think of as well.
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java
Show resolved
Hide resolved
We could probably predict whether a seed set is good or bad based on how many documents it's needed to score and terminate early if it seems like it's too many. Could be accomplished with the existing |
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/search/KnnByteVectorQuery.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java
Outdated
Show resolved
Hide resolved
Thanks all! I was away for a bit but back on it now :) |
….java Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
….java Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
…ry.java Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
…cher.java Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
…cher.java Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
…rQuery.java Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
…rQuery.java Co-authored-by: Christine Poerschke <cpoerschke@apache.org>
package org.apache.lucene.search; | ||
|
||
/** A {@link KnnCollector} that provides an initial set of seeds to initialize the search. */ | ||
class SeededKnnCollector implements KnnCollector { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor/subjective: could potentially factor out a FilterKnnCollector
class (similar to FilterCollector
) either in addition to or instead of SeededKnnCollector
e.g.
class FilterKnnCollector implements KnnCollector
class SeededKnnCollector extends FilterKnnCollector
or
class FilterKnnCollector implements KnnCollector
if (seedDocs != null) {
knnCollector = new FilterKnnCollector(knnCollector) {
@Override
public DocIdSetIterator getSeedEntryPoints() {
return seedDocs;
}
};
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I've added a generic KnnCollector.Decorator
, which applies in several places. I stopped short of applying filtering via this same mechanism, since that would ultamately change the LeafCollector
.
Thanks all! Here is a summary of the design:
Is this better? If so, I want to go ahead and re-run some benchmarks to make sure that performance didn't degrade with the redesign. |
* @lucene.experimental | ||
*/ | ||
public abstract static class Decorator implements KnnCollector { | ||
private KnnCollector collector; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private KnnCollector collector; | |
final private KnnCollector collector; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or could be protected
if deriving classes might need access
private KnnCollector collector; | |
final protected KnnCollector collector; |
* @lucene.experimental | ||
*/ | ||
public static class Seeded extends Decorator { | ||
private DocIdSetIterator seedEntryPoints; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private DocIdSetIterator seedEntryPoints; | |
final private DocIdSetIterator seedEntryPoints; |
TimeLimitingKnnCollectorManager knnCollectorManager = | ||
new TimeLimitingKnnCollectorManager( | ||
getKnnCollectorManager(k, indexSearcher), indexSearcher.getTimeout()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I was actually thinking of a new collector manager as well that would:
- delegate to other managers as necessary
- Build the seed weight & run the query when it builds the individual collectors (since it has access to the leaf context).
This helps abstract away all the messing about with the seed stuff into
SeededKnnCollectorManager
which will then return SeededKnnCollectors
that again, delegate as necessary.
We don't need to really update anything directly in any of the queries as all the new logic could live in these two new classes.
This will make the change much more contained, and keep us from having to update the approximateSearch
API as the query classes are not experimental (and thus subject to bwc).
ArrayList<Integer> entryPointOrdInts = null; | ||
DocIdSetIterator entryPoints = knnCollector.getSeedEntryPoints(); | ||
if (entryPoints != null) { | ||
entryPointOrdInts = new ArrayList<Integer>(); | ||
int entryPointOrdInt; | ||
while ((entryPointOrdInt = entryPoints.nextDoc()) != NO_MORE_DOCS) { | ||
entryPointOrdInts.add(entryPointOrdInt); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ArrayList<Integer> entryPointOrdInts = null; | |
DocIdSetIterator entryPoints = knnCollector.getSeedEntryPoints(); | |
if (entryPoints != null) { | |
entryPointOrdInts = new ArrayList<Integer>(); | |
int entryPointOrdInt; | |
while ((entryPointOrdInt = entryPoints.nextDoc()) != NO_MORE_DOCS) { | |
entryPointOrdInts.add(entryPointOrdInt); | |
} | |
} | |
final int[] entryPoints; | |
if (knnCollector instanceOf SeededEntryPointProvider seeded) { | |
DocIdSetIterator entryPoints = seeded.getSeedEntryPoints(); | |
if (entryPoints != null) { | |
entryPoints = new int[entryPoints.cost()]; | |
int entryPointOrdInt; | |
while ((entryPointOrdInt = entryPoints.nextDoc()) != NO_MORE_DOCS) { | |
entryPointOrdInts.add(entryPointOrdInt); | |
} | |
} | |
} else { | |
entryPoints = null; | |
} |
I think we can do this.
I am not sure about adding a new default thing to ALL collectors, but having an interface that the SeededKnnCollector satisfies would be just fine.
* @lucene.experimental | ||
*/ | ||
public DocIdSetIterator convertDocIdsToVectorOrdinals(DocIdSetIterator docIds) { | ||
return docIds; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if this could move to the KnnVectorValues
base class and be abstract and/or if a default implementation here (and in FloatVectorValues
) should be throw new UnsupportedOperationException();
i.e. a mapping needs to be provided by the implementations of this abstract class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I am against default implementations that throw, just require the interface and thus catch it at compile time.
@seanmacavaney I realize I am making some broad API suggestions. If it would help, I can attempt to convey my ideas in a branch from your fork. I do like the direction now by providing something via the collectors. This allows us to keep the query objects effectively unchanged (which is sensitive to BWC). |
leafCollector.finish(); | ||
} catch ( | ||
@SuppressWarnings("unused") | ||
CollectionTerminatedException e) { | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leafCollector.finish(); | |
} catch ( | |
@SuppressWarnings("unused") | |
CollectionTerminatedException e) { | |
} | |
} catch ( | |
@SuppressWarnings("unused") | |
CollectionTerminatedException e) { | |
} | |
leafCollector.finish(); |
@seanmacavaney here is my idea more fully: https://github.com/apache/lucene/compare/main...benwtrent:lucene:seeds-refactor-idea?expand=1 I didn't create a byte version of the query, nor am I handling IOExceptions correctly, but this should better show what I had in mind. All the code should be able to be mostly encapsulated into the collector & collector manager with no changes to either query. |
Thanks- I like the design. I'll try to make the remaining changes accordingly. |
Wondering if the
then potentially makes it easier to have a seed count value too i.e. it could be different from
|
Good call! |
Hey @seanmacavaney didn't want this to die on the vine. I think with some refactoring and adding new experimental queries, this could be a nice experimental feature for vector search. Do you want help on refactoring and pushing code? Am I blocking your progress in anyway? |
Not at all-- thanks a lot for the help @benwtrent! I totally agree with the proposed changes and it's clear how to move forward on this. I'm just occupied with several other priorities at the uni right now, so I haven't been able to get to these. |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
@seanmacavaney do you still want this contributed to Apache Lucene? Its excellent work and I don't want it dying on the vine. If its ok with you, I plan on refactoring it (with attribution to you of course), and attempting to get it into Lucene as a new query. Seeding entry points for HNSW exploration (and not just HNSW, but other vector indices as well), is powerful and can provide some unique interactions with vector search. |
Hey @benwtrent -- it's been on my todo list to get back to this, but I've gotten bogged down with a bunch of other stuff. If you're willing, please do go ahead and refactor :D! (I'd also love to see how the refactor looks to better understand Lucene internals.) |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
Description
This PR addresses #13634.
The main changes are in:
AbstractKnnVectorQuery
, which adds aseed
field. It scores this query if provided, and passes these seeds along to theHnswGraphSearcher
as aDocIdSetIterator
.HnswGraphSearcher
, which bypasses thefindBestEntryPoint
step if seeds are provided.Unfortunately, passing the seeds down to
HnswGraphSearcher
resulted in changes in several places, e.g., to backward codecs.