Use TopK node for KNN #1324
Labels: arrow, enhancement, performance, priority: high
Right now, to perform KNN, we compute the top `k` for each batch, concatenate all the results, and then take the top `k` from the concatenated results. If there are a lot of batches, this can lead to an OOM error.

KNN is essentially `Project(distance) -> TopK(k=k, order_by=distance)`, so we might just want to use the DataFusion nodes and build upon them.

There is a tracking issue upstream in DataFusion: apache/datafusion#7195
Also, there is a draft PR for an optimized TopK node: apache/datafusion#7250
We could complete that PR and use it to implement an optimized KNN query plan.
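
To illustrate why a TopK node avoids the memory growth described above, here is a minimal sketch of a streaming top-k over batches using a bounded max-heap. It is not the DataFusion TopK implementation or the code in the linked PR; `Candidate` and `streaming_top_k` are hypothetical names, and batches are modeled as plain `Vec`s rather than Arrow record batches. The point is only that the working set stays at `k` entries no matter how many batches arrive.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Hypothetical (distance, row id) pair standing in for one projected row.
#[derive(Debug, Clone, Copy)]
struct Candidate {
    distance: f32,
    row_id: u64,
}

// Total order on distance (via f32::total_cmp), with row id as a tie-breaker,
// so the type can be stored in a BinaryHeap.
impl Ord for Candidate {
    fn cmp(&self, other: &Self) -> Ordering {
        self.distance
            .total_cmp(&other.distance)
            .then(self.row_id.cmp(&other.row_id))
    }
}

impl PartialOrd for Candidate {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for Candidate {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for Candidate {}

/// Consume batches one at a time and keep only the k nearest candidates.
/// The max-heap holds at most k entries, with the current worst on top.
fn streaming_top_k<I>(batches: I, k: usize) -> Vec<Candidate>
where
    I: IntoIterator<Item = Vec<Candidate>>,
{
    if k == 0 {
        return Vec::new();
    }
    let mut heap: BinaryHeap<Candidate> = BinaryHeap::with_capacity(k);
    for batch in batches {
        for cand in batch {
            if heap.len() < k {
                heap.push(cand);
            } else if let Some(mut worst) = heap.peek_mut() {
                // Replace the current worst only if the new candidate is closer.
                if cand < *worst {
                    *worst = cand;
                }
            }
        }
    }
    // Ascending by distance: nearest neighbor first.
    heap.into_sorted_vec()
}
```

Compared with computing a per-batch top `k` and concatenating, which holds up to `k` rows per batch before the final selection, this keeps a single heap of `k` rows for the whole stream.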