Use TopK node for KNN #1324

wjones127 · 2023-09-27T01:02:26Z

Right now, to perform KNN, we compute the top k for each batch, concatenate all the per-batch results, and then take the top k of the concatenated result. If there are a lot of batches, this can lead to an OOM error.

KNN is essentially Project(distance) -> TopK(k=k, order_by=distance), so we might just want to use the DataFusion nodes and build upon them.
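To illustrate why a TopK node avoids the memory blow-up, here is a minimal sketch of a streaming top-k over batches using a bounded max-heap. It is not the DataFusion operator or Lance's code: `Candidate` and `top_k_streaming` are hypothetical names, and batches are modeled as plain `Vec`s of already-computed distances rather than Arrow record batches. The point is only that memory stays at O(k) no matter how many batches come through.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Hypothetical (distance, row id) pair; stands in for a projected row.
#[derive(Debug, Clone, Copy)]
struct Candidate {
    distance: f32,
    row_id: u64,
}

// f32 is not `Ord`, so order by `total_cmp` and break ties on row id.
impl Ord for Candidate {
    fn cmp(&self, other: &Self) -> Ordering {
        self.distance
            .total_cmp(&other.distance)
            .then(self.row_id.cmp(&other.row_id))
    }
}

impl PartialOrd for Candidate {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for Candidate {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for Candidate {}

/// Keep only the k smallest distances seen so far while consuming batches.
/// The heap is a max-heap, so its top is the worst candidate currently kept.
fn top_k_streaming<I>(batches: I, k: usize) -> Vec<Candidate>
where
    I: IntoIterator<Item = Vec<Candidate>>,
{
    let mut heap: BinaryHeap<Candidate> = BinaryHeap::new();
    for batch in batches {
        for cand in batch {
            if heap.len() < k {
                heap.push(cand);
            } else if let Some(worst) = heap.peek() {
                // Replace the current worst only if the new one is closer.
                if cand < *worst {
                    heap.pop();
                    heap.push(cand);
                }
            }
        }
    }
    // Ascending by distance: nearest neighbor first.
    heap.into_sorted_vec()
}
```

Compared with the per-batch top k plus concatenation, nothing here grows with the number of batches; each batch can be dropped as soon as it has been scanned.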

There is a tracking issue upstream in DataFusion: apache/datafusion#7195
Also there is a drafted PR for an optimized TopK node: apache/datafusion#7250

We could complete that PR and use it to implement an optimized KNN query plan.

wjones127 · 2024-02-13T16:38:16Z

Also handle here, in lance/rust/lance/src/index/vector/ivf.rs (line 430 in ff793ad):

// TODO: Use a heap sort to get the top-k.

wjones127 added the enhancement, arrow, and performance labels Sep 27, 2023

wjones127 mentioned this issue Jan 2, 2024

perf: Use heap to run flat search #1773

Closed

westonpace added the priority: high label Jan 29, 2024

wjones127 self-assigned this Jan 29, 2024

changhiskhan assigned eddyxu and unassigned wjones127 Feb 9, 2024

eddyxu mentioned this issue Jun 26, 2024

refactor: flat search to use datafusion top k #2535

Merged

eddyxu closed this as completed in #2535 Jun 27, 2024

eddyxu closed this as completed in f8c5f4d Jun 27, 2024
