Reduce the memory usage of the write-ahead log #354
Comments
Thanks for writing this up, that's great! I also tend to think that option […]
One more note: this is broken with this solution, as the RO instance relies on the on-disk state, which can be modified by RW. Consider the following scenario: […]
It seems easy to overcome, e.g. with a header check from the RO to make sure a merge hasn't happened, but perhaps that also hurts the performance?
@pascutto If my understanding is correct, this shouldn't be a problem, as the RO will keep a file descriptor pointing to the old […]
EDIT: […]
Ah right, I missed that change, it looks like it should be OK then.
Context
The total IO done by an index writer is roughly proportional to the maximum size of the log file, since the store must be entirely rewritten once per `log_size` replace operations. As a result, the in-memory mirror of the log file tends to dominate the overall memory usage (since it's worth making `log_size` as large as possible). When used in `irmin-pack`, this data-structure expands out as follows: […]

When `Sys.word_size = 64`, this makes for approximately 14.5 words per binding, broken down into: […]

In the worst case, there are both reader and writer instances and merges are stacking, meaning there are 4 such hashtables (`log` and `log_async` for each of the reader and writer) and the writer also has an array of sorted entries (5 words per binding). When `log_size = 500_000`, this sums to ~250MB.
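As a back-of-the-envelope check of that figure (a sketch; the per-binding word counts are the ones quoted above, and 8-byte words follow from `Sys.word_size = 64`):

```ocaml
(* Worst-case estimate: 4 hashtables at 14.5 words/binding, plus the
   writer's sorted array at 5 words/binding, with 8-byte words. *)
let () =
  let word = 8 and log_size = 500_000 in
  let hashtables = 4 * log_size * (29 * word / 2) (* 14.5 words/binding *) in
  let sorted_array = log_size * (5 * word) in
  (* 232 MB for the hashtables + 20 MB for the array *)
  Printf.printf "total: ~%d MB\n" ((hashtables + sorted_array) / 1_000_000)
```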
Here are two (conflicting) optimisations we could make:

Suggestion 1: functorise over a user-defined hashtable implementation
We could provide a functor that allows the user to provide their own `Hashtbl` implementation, specialised to their particular `key` and `value` types. In the specific case of `irmin-pack`, we could do the following:

- store the `string` keys in an arena (-2 words)
- take the `int63`s (`offset` and `kind_then_length`) and keep them as immediates in the hashtable (-3 words)

This makes for 8 words per binding. By itself, this isn't much of a change, since the entries must be unpacked from their compact representation to sort them in an array before the merge. The simple solution here is just to pick the hashtable bucket using the upper bits of `key_hash`, so that the hashtable is already almost sorted in the correct order (only the bindings within a particular bucket are relatively out-of-order). This means we can expose a `to_sorted_seq` that takes O(1) additional memory; see the sketch below.

When `log_size = 500_000`, this works out to ~128MB in the worst case. (Half what we had before.)
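A rough sketch of the shape the functor's input could take. The signature name and every operation apart from `to_sorted_seq` are invented for illustration, not an existing Index API:

```ocaml
(* Hypothetical input signature for the proposed functor; everything
   except [to_sorted_seq] is illustrative. *)
module type LOG_TABLE = sig
  type key
  type value
  type t

  val create : int -> t
  val replace : t -> key -> value -> unit
  val find : t -> key -> value option

  (* With buckets keyed by the upper bits of the key hash, the table is
     already almost sorted: iterate buckets in order and sort only
     within each bucket, using O(1) additional memory. *)
  val to_sorted_seq : t -> (key * value) Seq.t
end
```

An `irmin-pack`-specific implementation of this signature could then store its `string` keys in an arena and keep the two `int63` fields as immediates, as described above.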
Suggestion 2: keep only log offsets in memory
The previous approach effectively lets the user pick an efficient packed representation for entries in the write-ahead log. However, Index can already uniquely identify entries by their offset on disk (in 1 word). If we commit to needing to `read` when finding in the log file (and when resizing the hashtable), we can keep only the offsets in memory (in hashset buckets, again keyed by the upper bits of `key_hash`). This achieves 2 words per binding.

When `log_size = 500_000`, this is ~32MB in the worst case. The disadvantage is that values in the log can no longer be found without reading from disk, so best-case read performance is worse. If there are many `key_hash` prefix collisions in a given bucket, we have to read and decode each entry on `find`, as the sketch below shows.
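A minimal sketch of the offsets-only structure, under assumptions: `read_key_at` stands in for decoding the key stored at a given log offset (one disk read), and the bucket sizing is arbitrary:

```ocaml
(* Sketch of Suggestion 2: only 1-word offsets live in memory; keys are
   recovered from disk on demand. *)
type t = {
  buckets : int list array;
  (* [int list] is illustrative; a real version would use flat int
     arrays to reach ~2 words/binding. Length must be [1 lsl bucket_bits]. *)
  read_key_at : int -> string; (* offset -> key; costs a disk read *)
}

let bucket_bits = 16 (* 65_536 buckets; illustrative *)

(* Bucket chosen from the *upper* bits of the hash, so that iterating
   buckets in index order visits entries almost sorted by [key_hash]. *)
let bucket_of_hash key_hash =
  (key_hash lsr (Sys.int_size - bucket_bits)) land ((1 lsl bucket_bits) - 1)

let find t ~key ~key_hash =
  t.buckets.(bucket_of_hash key_hash)
  (* On [key_hash] prefix collisions, each candidate entry must be read
     and decoded from disk until the key matches. *)
  |> List.find_opt (fun off -> String.equal (t.read_key_at off) key)
```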
Analysis
With Suggestion 2, `find`s must always read from disk. However, the improved memory footprint could allow an LRU to be added, which may be more effective anyway.

Here are some preliminary measurements using the Index replay trace with ~100 million operations. Firstly, internally-measured heap size as a function of operations performed:
Secondly, externally-measured `maxrss` as a function of time:

This shows that the performance hit of a naive implementation of Suggestion 2 is roughly 10% (but my laptop isn't a very precise environment). I suspect it's possible to recover the performance with a relatively small LRU (sketched below), but haven't done this yet.
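A naive bounded cache, just to illustrate where such an LRU would sit. The capacity and the drop-everything eviction are placeholders for a real LRU policy, and `read_key_at` is the hypothetical disk reader from the earlier sketch:

```ocaml
(* Memoise the disk read behind Suggestion 2's [find]. This is a
   placeholder for a real LRU: when full it simply drops everything. *)
let cached ~capacity read_key_at =
  let cache = Hashtbl.create capacity in
  fun off ->
    match Hashtbl.find_opt cache off with
    | Some key -> key (* hit: no disk read *)
    | None ->
        if Hashtbl.length cache >= capacity then Hashtbl.reset cache;
        let key = read_key_at off in
        Hashtbl.add cache off key;
        key
```

Wrapping the reader as `cached ~capacity:10_000 read_key_at` would then trade a bounded amount of memory for fewer disk reads on hot keys.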
Overall, I'm leaning towards Suggestion 2, depending on the measured benefits of adding an LRU. This should make it possible to substantially increase the default `log_size` in Tezos (and so avoid stacking merges / improve IO performance). In the medium term, we'll have complementary solutions to the log size issue in the form of reduced index load and/or an alternative index implementation.

CC @pascutto, who kindly volunteered to review the suggestions.