Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Best way to use memory mapping with a SparseVector #8

Open
GSGerritsen opened this issue May 5, 2021 · 1 comment
Open

Best way to use memory mapping with a SparseVector #8

GSGerritsen opened this issue May 5, 2021 · 1 comment

Comments

@GSGerritsen
Copy link

Hello,

I'm hoping to do something like this:

struct Example {
    ... misc fields,
   path: PathBuf
}

Where path points to the location of a serialized SparseVector. Is it possible to use MemoryMap in some way with SparseVector such that I can do something like: let sparse_vector = SparseVector::load(memory_mapped_sparse_vector)?

I'm hoping to be able to do things like sparse_vector.rank(10) etc without having to load in a potentially huge SparseVector in its entirety from disk.

In addition, I'd also like to be able to append to the SparseVector - is this also possible to do with memory mapping where I can avoid loading the whole thing, writing to it, then serializing it back?

Thanks in advance for any guidance!

@jltsiren
Copy link
Owner

jltsiren commented May 6, 2021

Implementing a memory-mapped SparseVector would be straightforward but tedious. You can duplicate code, much in the same way as RawVectorMapper and IntVectorMapper duplicate the in-memory RawVector and IntVector. Alternatively, you can use generics in BitVector and SparseVector to have them use either in-memory or memory-mapped backing structures, but the code will probably get quite ugly.

Appending to a SparseVector is possible – SparseBuilder already does that for in-memory structures. First you choose divisor d, which is hardcoded to be a power of 2. The optimal choice depends on knowing the density of values in advance. Then you append the values in increasing order:

  • Quotient v / d determines the bucket the value goes to.
  • Bitvector high encodes the number of values in each bucket in unary, while integer vector low stores the remainders.
  • If v is the ith value, you append it by setting bit high[i + v / d] and appending v % d to low.

However, due to the way the structures have been implemented, you would have to use two separate files, with a RawVectorWriter for high and IntVectorWriter for low, and concatenate the files afterwards. Also, queries require select support structures for high, and those structures are immutable. Mutable alternatives do exist, but their time/space trade-offs are worse. Finally, the current memory mapping implementation assumes read-only files, and even my plans for mutable files assume that the size of the file does not change.

The overall design of simple-sds reflects how I have been using SDSL data structures in bioinformatics applications. In-memory data structures are assumed to be gigabytes or tens of gigabytes in size and often shared by tens of threads. The writer structures are intended for writing temporary files, while memory-mapped structures are used for reading them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants