Fast incremental update with occasional reindexing #129

Open
npip99 opened this issue Aug 27, 2024 · 0 comments
Labels
bug Something isn't working

npip99 commented Aug 27, 2024

I'm not sure if I'm missing something in the documentation, but how do we periodically reindex new embeddings? Having a diskann index makes "INSERT" incredibly slow.

This is the behavior that I observe:

Initial Index

CREATE INDEX ix_chunk_embedding
    ON chunk
    USING diskann (embedding)
    WHERE (indexed = true);

(The indexed = true condition is for easily bringing rows into and out of the index.)

Fresh Index

SELECT pg_size_pretty(pg_relation_size('ix_chunk_embedding')) AS index_size;
 index_size 
------------
 24 kB
(1 row)

Add rows to index

UPDATE chunk
SET indexed = true
WHERE id IN (... 500 ids ...);

This takes 4.1 seconds. The only reason I'm using the indexed = true flag is so that appending rows to the database doesn't take forever.

Check Index size again

SELECT pg_size_pretty(pg_relation_size('ix_chunk_embedding')) AS index_size;
 index_size 
------------
 4024 kB
(1 row)

Reindex

REINDEX INDEX ix_chunk_embedding;

This runs almost instantly (<100ms?)

SELECT pg_size_pretty(pg_relation_size('ix_chunk_embedding')) AS index_size;
 index_size 
------------
 352 kB
(1 row)

After the reindex, the index is over 10x smaller (4024 kB down to 352 kB).


My issue is that incremental updates are both incredibly slow and not space efficient. Is there a way to disable incremental indexing altogether? I don't mind if incrementally indexed rows don't show up in recall, because I can just do SET indexed = true and then follow up with a REINDEX CONCURRENTLY. Right now, I have no efficient way to do this.

Theoretically, there should be ways to make updates more efficient (MSFT advertises 1000 QPS and sub-ms latency for inserts: https://youtu.be/BnYNdSIKibQ?t=352), but even then, a setting such as SET LOCAL diskann_suppress_indexing = on; to disable INSERT/UPDATE indexing would still be useful. Drop+Recreate doesn't work because queries stop working in the meantime.
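
To make the request concrete, here is a rough sketch of the workflow I have in mind. Note that diskann_suppress_indexing is hypothetical: it is the setting proposed above and does not exist today, so the SET LOCAL line would currently fail. The sketch also assumes REINDEX CONCURRENTLY behaves for diskann indexes the way it does for built-in index types.

```sql
-- Step 1: flag the new rows without paying the per-row index insert cost.
-- HYPOTHETICAL: diskann_suppress_indexing is the proposed setting, not an existing one.
BEGIN;
SET LOCAL diskann_suppress_indexing = on;

UPDATE chunk
SET indexed = true
WHERE id IN (... 500 ids ...);
COMMIT;

-- Step 2: rebuild the index in one pass so the flagged rows become searchable.
-- REINDEX ... CONCURRENTLY cannot run inside a transaction block,
-- so it has to be issued after the COMMIT above.
REINDEX INDEX CONCURRENTLY ix_chunk_embedding;
```

Between step 1 and step 2 the newly flagged rows would be missing from index results, which is acceptable for my use case.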

@cevian cevian added the bug Something isn't working label Sep 23, 2024