[Question] Treating disk index shards as dynamic indices? #405

Closed
jaredcthomas opened this issue Jul 27, 2023 · 3 comments
Labels
question Further information is requested

Comments

@jaredcthomas

Hello!
I'm wondering whether it is viable to send new points to the appropriate shards of an existing disk index, apply the streaming procedure within each shard, and stitch everything back together. Are there any logistical/performance penalties for this strategy?

jaredcthomas added the "question" label on Jul 27, 2023
@harsha-simhadri
Contributor

If you can load the shards into memory, apply the delta and write back to disk, you would save a lot of small random writes.
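A minimal sketch of that I/O pattern, assuming a toy shard format (a pickled dict of point id to vector) rather than the real DiskANN on-disk layout. The function name `apply_delta_to_shard` and the file format are hypothetical, and the graph-update step is left out entirely. It only illustrates doing one sequential read and one sequential write per shard instead of many small random writes.

```python
import os
import pickle
import tempfile


def apply_delta_to_shard(path, inserts, deletes):
    """Load a whole shard, apply a batch of changes in memory, write it back once.

    Toy format (an assumption, not the DiskANN layout): a pickled dict
    mapping point id -> list of floats. Graph edges are not modeled.
    """
    with open(path, "rb") as f:          # one sequential read
        shard = pickle.load(f)

    for point_id in deletes:             # in-memory deletes; unknown ids are ignored
        shard.pop(point_id, None)
    shard.update(inserts)                # in-memory inserts / overwrites

    # Write to a temp file in the same directory, then rename over the
    # original, so a crash does not leave a half-written shard behind.
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "wb") as f:       # one sequential write
        pickle.dump(shard, f)
    os.replace(tmp_path, path)
    return len(shard)
```

Whether this pays off depends on each shard fitting in DRAM alongside the batch of changes.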

Could you elaborate on your scenario a bit more? What are your machine size, dataset size, and number of shards?

@jaredcthomas
Author

@harsha-simhadri Certainly, thanks for responding.

I have ~80M 768d vectors and a build DRAM budget of 150 GB, so I wound up with 5 shards. Really, I am just looking for the most efficient way of handling bulk insertions/deletions (say several million vectors at a time) without having to rebuild, if possible.

@harsha-simhadri
Contributor

@jaredcthomas I assume you meant 768d floats. So the overall index size is about 80M × (3 KB vector + degree 100 × 4 B) ≈ 250 GB?
Loading one shard (~50GB) at a time into DRAM, batch updating and writing to SSD seems reasonable.
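The back-of-the-envelope estimate above can be written out as a short calculation. This is a rough sketch: it assumes 4-byte floats and 4-byte neighbor IDs, ignores any sector padding or per-node metadata, and assumes 5 equal, non-overlapping shards. The helper `estimate_index_bytes` is hypothetical, not part of any library.

```python
def estimate_index_bytes(num_points, dim, degree,
                         bytes_per_coord=4, bytes_per_neighbor=4):
    """Rough on-disk size: one full-precision vector plus one neighbor list per point."""
    per_point = dim * bytes_per_coord + degree * bytes_per_neighbor
    return num_points * per_point


total = estimate_index_bytes(num_points=80_000_000, dim=768, degree=100)
per_shard = total / 5  # assumption: 5 equal, non-overlapping shards

print(f"per point: {768 * 4 + 100 * 4} bytes")   # 3472 bytes
print(f"total:     {total / 1e9:.0f} GB")        # 278 GB
print(f"per shard: {per_shard / 1e9:.0f} GB")    # 56 GB
```

These figures land in the same range as the ~250 GB total and ~50 GB per shard quoted above, which were rounded estimates.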
We have a long-overdue (and out-of-date) PR #11 that can merge batch updates to disk, as in https://arxiv.org/abs/2105.09613. We would appreciate help with that PR. In any case, we plan to get to it in the next few months.
