-
Notifications
You must be signed in to change notification settings - Fork 472
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(search): Hnsw Vector Search Plan Operator & Executor #2434
Conversation
Hiii @PragmaTwice, once we discussed the potential solutions for vector search query executor on Slack, this PR is attempting to solve it. I'm wondering if the solution above sounds good to you as an initial implementation, and whether you see any room for improvement. Thanks! |
Yeah I think it's good to be an initial implementation. We can gradually refactor it to a more iterative way. |
This PR is ready for review! @PragmaTwice |
Yeah we can then add new nodes in IR and support these new field scan operators in optimizer passes. e.g. for RediSearch query As for which query plan is the best, we can leave it to the cost model to determine. |
|
||
IndexInfo *index; | ||
redis::SearchKey search_key; | ||
redis::HnswVectorFieldMetadata field_metadata; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems here we can just keep a pointer e.g. const redis::HnswVectorFieldMetadata *field_metadata
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh this was because HnswIndex
constructor does not support const HnswVectorFieldMetadata*
.
HnswIndex(const SearchKey& search_key, HnswVectorFieldMetadata* vector, engine::Storage* storage);
HnswIndex
needs to modify HnswVectorFieldMetadata*
in other functions, so I did copy here. Do you have ideas on how we can improve it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me think, if the metadata is modified in a certain vector query, it seems that the global metadata of the vector field (in server->index_mgr->index_map
) will not change. Will this cause any problems?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HnswVectorFieldMetadata*
is only modified when the node is inserted/deleted in a higher layer than all other nodes, so num_levels
in HnswVectorFieldMetadata : IndexFieldMetadata
is modified. So the affected field is server->index_mgr->index_map[<index_key>]->fields[<field>]->metadata
.
In indexer.cc
, we do
auto *metadata = iter->second.metadata.get();
if (auto vector = dynamic_cast<HnswVectorFieldMetadata *>(metadata)) {
GET_OR_RET(UpdateHnswVectorIndex(key, original, current, search_key, vector));
}
So *metadata
changes correspondingly.
But in this PR, since we don't modify metadata
, I guess it will not cause actual problems by copying (maybe with consistency issue but this would be fixed later aligning with #2310 , and the expensive copy because of the large size of HnswIndex
but I can fix that right after the PR if the solution using static
member variable sounds good to you <3).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you mean.
It's good for now, but we can enhance it later. For example, we could create a const version of HnswIndex that takes a const pointer to HnswVectorFieldMetadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good!
I wanted to overload with a const version, but there are some nested calls also asking for HnswVectorFieldMetadata*
as a parameter, so eventually didn't implement it to avoid making this PR look too messy. I'll take a note in tracking issue #2426 and improve it in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rest looks good to me. Thank you!
Quality Gate passedIssues Measures |
This PR is part of #2426
Summary
"Pure" KNN Plan Operator + Executor (i.e. search on HNSW graph directly on disk)
"Pure" Range Query Plan Operator + Executor
k
nearest candidates as a mini-batch initialization, ensuring each one of them is within the expected rangeEnd
;Ref: Idea from Pgvector Hnsw Scan Code - bool hnswgettuple(IndexScanDesc scan, ScanDirection dir)
Next
Hybrid Query with Filter (Potential Solution)
Note that since both "Pure" Range Query Executor and "Pure" KNN Search Executor are based on disk, "Pure" Range Query Executor cannot be reused for hybrid query. I plan to add the new hybrid query executor after implementing the
ir
expression for HNSW (e.g. reuseFilterExecutor
forHnswRangeQueryExpr
?). Also, I currently don't have a clear idea on how to continue fetching more nearest vectors after some vectors are filtered for hybrid queries. So for hybrid query, I'll make changes in future PRs after the expression layer is more clear to me.Resources