[BUG]: fix missing record in query result when many records were deleted and pending persist #2532

Merged: 1 commit, Jul 18, 2024
17 changes: 10 additions & 7 deletions chromadb/segment/impl/vector/local_persistent_hnsw.py
@@ -414,6 +414,14 @@ def query_vectors(
hnsw_pointer: int = 0
curr_bf_result: Sequence[VectorQueryResult] = bf_results[i]
curr_hnsw_result: Sequence[VectorQueryResult] = hnsw_results[i]
+
+ # Filter deleted results that haven't yet been removed from the persisted index
+ curr_hnsw_result = [
+     x
+     for x in curr_hnsw_result
+     if not self._curr_batch.is_deleted(x["id"])
+ ]
Contributor Author

After reasoning through it for a while, I think this fix is the correct approach.


curr_results: List[VectorQueryResult] = []
# In the case where filters cause the number of results to be less than k,
# we set k to be the number of results
@@ -433,10 +441,7 @@ def query_vectors(
else:
id = curr_hnsw_result[hnsw_pointer]["id"]
# Only add the hnsw result if it is not in the brute force index
- # as updated or deleted
- if not self._brute_force_index.has_id(
-     id
- ) and not self._curr_batch.is_deleted(id):
+ if not self._brute_force_index.has_id(id):
curr_results.append(curr_hnsw_result[hnsw_pointer])
hnsw_pointer += 1
else:
@@ -448,9 +453,7 @@ def query_vectors(
min(len(curr_hnsw_result), hnsw_pointer + remaining + 1),
):
id = curr_hnsw_result[i]["id"]
- if not self._brute_force_index.has_id(
-     id
- ) and not self._curr_batch.is_deleted(id):
+ if not self._brute_force_index.has_id(id):
curr_results.append(curr_hnsw_result[i])
elif remaining > 0 and bf_pointer < len(curr_bf_result):
curr_results.extend(
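
For readers skimming the diff, here is a minimal standalone sketch of why moving the deleted-id filter ahead of the merge can matter. It is not the Chroma implementation: `take_then_filter`, `filter_then_take`, the `Result` type, and the sample data are all hypothetical. It also assumes the failure mode is a fixed-size window over the HNSW results in which deleted ids were skipped without widening the window, which is one plausible reading of the third hunk rather than something the PR states outright.

```python
from typing import List, Set, TypedDict


class Result(TypedDict):
    # Hypothetical stand-in for VectorQueryResult, reduced to two fields.
    id: str
    distance: float


def take_then_filter(
    results: List[Result], deleted: Set[str], remaining: int
) -> List[Result]:
    """Old shape (assumed): pick a fixed-size window, then skip deleted ids in it."""
    window = results[:remaining]
    return [r for r in window if r["id"] not in deleted]


def filter_then_take(
    results: List[Result], deleted: Set[str], remaining: int
) -> List[Result]:
    """New shape: drop deleted ids first, then pick the window from live results."""
    live = [r for r in results if r["id"] not in deleted]
    return live[:remaining]


# Six nearest neighbours from the persisted index; the three closest have
# been deleted in the current batch but not yet removed from the index.
hnsw_results: List[Result] = [
    {"id": f"doc{i}", "distance": float(i)} for i in range(6)
]
deleted = {"doc0", "doc1", "doc2"}

old = take_then_filter(hnsw_results, deleted, remaining=3)
new = filter_then_take(hnsw_results, deleted, remaining=3)

print([r["id"] for r in old])
print([r["id"] for r in new])
```

In this toy case the window in the old shape is filled entirely by deleted ids, so live records further down the list never reach the result. Filtering up front means every slot in the window is spent on a record that can actually be returned.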