[Question] StreamingMerge Entrypoint #355

Open
infrawhispers opened this issue May 20, 2023 · 3 comments
Labels
question Further information is requested

Comments

@infrawhispers

Hi!

The FreshDiskANN paper outlines the StreamingMerge procedure. Combing through the codebase (main @ f8ef303), I can't find a single entry point that lets a caller use the FreshDiskANN API contract without being aware of every index type.

  • test_streaming_scenario.cpp shows how to build an in-memory index that supports inserts and deletes.
  • build_stitched_index.cpp shows how to merge indices.
  • search_disk_index.cpp shows how to run a search against an index stored on disk.

Given a client that provides a memory budget and no starting list of vectors, my reading of the paper is that a wrapping class would need to do the following:

  1. Create an empty, streaming-enabled, in-memory index that holds writes. This is outlined in test_streaming_scenario.cpp and is the only sink for insertions.
  2. Create an empty, SSD-resident index, as demonstrated by build_disk_index.cpp. This index would not have a true build phase, as there is nothing to add.
  3. Once the mutable index in [1] is full, merge [1] and [2] using the merge_shards routine in disk_utils.h. By the time the merge runs, we would already have created a new mutable in-memory index for any in-flight writes and deletes.
  4. Separately, maintain a list of deletions that is used to filter results from all live indices.
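
To make the four steps concrete, here is a rough sketch of the kind of wrapper I have in mind. Every name in it (ToyTier, FreshIndexFacade, max_mem_points) is made up for illustration and is not DiskANN API. The tiers are brute-force maps rather than graph indices, the merge is a plain copy that runs synchronously rather than merge_shards running in the background, and ids are assumed never to be reused after a delete.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical names throughout; nothing here is DiskANN API.
using Id = std::uint32_t;
using Vec = std::vector<float>;

// Stand-in for one index tier. A brute-force scan replaces the graph search.
struct ToyTier {
    std::unordered_map<Id, Vec> points;

    // Updates (best, best_dist) if this tier holds a closer, non-deleted point.
    void nearest(const Vec &q, const std::unordered_set<Id> &deleted,
                 Id &best, float &best_dist, bool &found) const {
        for (const auto &kv : points) {
            if (deleted.count(kv.first)) continue;  // step 4: filter deletes
            assert(kv.second.size() == q.size());
            float d = 0.0f;
            for (std::size_t i = 0; i < q.size(); ++i) {
                const float diff = kv.second[i] - q[i];
                d += diff * diff;  // squared L2 distance
            }
            if (!found || d < best_dist) { best = kv.first; best_dist = d; found = true; }
        }
    }
};

class FreshIndexFacade {
  public:
    // max_mem_points plays the role of the caller's memory budget.
    explicit FreshIndexFacade(std::size_t max_mem_points) : budget_(max_mem_points) {}

    // Step 1: the mutable in-memory tier is the only sink for insertions.
    // Assumes ids are unique and never reused after a delete.
    void insert(Id id, Vec v) {
        rw_.points[id] = std::move(v);
        if (rw_.points.size() >= budget_) merge();  // step 3 trigger
    }

    // Step 4: deletions are only recorded here and applied at merge time.
    void remove(Id id) { deleted_.insert(id); }

    // Searches every live tier; on success writes the closest live id to out.
    bool search(const Vec &q, Id &out) const {
        float dist = std::numeric_limits<float>::max();
        bool found = false;
        rw_.nearest(q, deleted_, out, dist, found);
        long_term_.nearest(q, deleted_, out, dist, found);
        return found;
    }

    std::size_t in_memory_size() const { return rw_.points.size(); }
    std::size_t long_term_size() const { return long_term_.points.size(); }

  private:
    // Step 3, run synchronously here; a real version would merge in the background.
    void merge() {
        ToyTier frozen;          // read-only snapshot of the full in-memory tier
        std::swap(frozen, rw_);  // rw_ is now empty and ready for new writes
        ToyTier next;            // the merge produces a new long-term tier
        for (const auto &kv : long_term_.points)
            if (!deleted_.count(kv.first)) next.points.insert(kv);
        for (const auto &kv : frozen.points)
            if (!deleted_.count(kv.first)) next.points.insert(kv);
        long_term_ = std::move(next);
        deleted_.clear();        // every pending delete has now been applied
    }

    std::size_t budget_;
    ToyTier rw_;         // mutable, in-memory tier
    ToyTier long_term_;  // stand-in for the SSD-resident index (step 2: starts empty)
    std::unordered_set<Id> deleted_;
};
```

The caller only ever sees insert, remove, and search. Which tier a point lives in, and when the merge happens, stays internal to the facade, which is the property I would want from a single entry point.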

I would be happy to submit a patch that unifies the above so that a caller can just create an Index and not have to worry about the RO-TempIndex, RW-TempIndex, and SSD-resident index. First, though, I would like to confirm that my reading of the current codebase is correct and that there is no single entry point for this.

@infrawhispers infrawhispers added the question Further information is requested label May 20, 2023
@harsha-simhadri
Contributor

@infrawhispers You are right, there is no single entry point for this yet. Also, merge_shards is not the procedure from the FreshDiskANN paper; it is the method described in the original DiskANN paper. The procedure that merges an in-memory index into an SSD index and creates a new SSD index is not yet in main. There is an outdated version in #11, which needs to be redone for the latest main. Once that is done, we can attempt a single entry point. You are most welcome to contribute any of these.

@lisirrx

lisirrx commented Aug 8, 2023

Hi @harsha-simhadri, I'm also interested in the FreshDiskANN implementation. Is there a roadmap for it?
By the way, what is the difference between #11 and the current code behind apps/test_insert_deletes_consolidate?

@FredJiang0324

Hi @harsha-simhadri, I'm also interested in the FreshDiskANN implementation. Is there a roadmap for it?
By the way, what is the difference between #11 and the current code behind apps/test_insert_deletes_consolidate?

It seems that the tests are all in-memory?
