
Reduce the memory usage of the write-ahead log #354

Closed · craigfe opened this issue Oct 3, 2021 · 4 comments · Fixed by #355

Comments

craigfe (Member) commented Oct 3, 2021

Context

The total IO done by an index writer is roughly inversely proportional to the maximum size of the log file, since the store must be entirely rewritten once per log_size replace operations. As a result, the in-memory mirror of the log file tends to dominate the overall memory usage (since it's worth making log_size as large as possible). When used in irmin-pack, this data structure expands as follows:

  (key, value) Stdlib.Hashtbl.t
= (string<32>, (int63 * int * char)) Stdlib.Hashtbl.t
= (string<32>, (int63 * int * char)) list2 array

When Sys.word_size = 64, this makes for approximately 14.5 words per binding, broken down into:

  • 6 word string (header + 4 words data + null padding)
  • 4 word triple (header + 3 fields)
  • 4 word list node (header + key pointer + value pointer + tail pointer)
  • ~0.5 word bucket slot (in the best case, there are ~2 bindings per bucket)

In the worst case, there are both reader and writer instances and merges are stacking, meaning there are four such hashtables (log and log_async for each of the reader and the writer) and the writer also has an array of sorted entries (5 words per binding). When log_size = 500_000, this sums to ~250MB.
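As a quick sanity check, here is the arithmetic behind that figure (a sketch only; the constants simply restate the breakdown and worst case above):

```ocaml
(* Back-of-the-envelope check of the ~250MB worst case (illustrative only). *)
let bytes_per_word = 8.          (* Sys.word_size = 64 *)
let log_size = 500_000.
let words_per_binding = 14.5     (* hashtable representation, as broken down above *)
let hashtables = 4.              (* log + log_async, for both reader and writer *)
let sorted_array_words = 5.      (* the writer's sorted-entries array *)

let total_bytes =
  log_size *. bytes_per_word
  *. ((hashtables *. words_per_binding) +. sorted_array_words)
(* = 500_000 * 8 * (58 + 5) = 252_000_000 bytes, i.e. ~250MB *)
```

Here are two (conflicting) optimisations we could make: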

Suggestion 1: functorise over a user-defined hashtable implementation

We could expose a functor that allows the user to supply their own Hashtbl implementation, specialised to their particular key and value types. In the specific case of irmin-pack, we could do the following:

  • keep string keys in an arena (-2 words)
  • use a small-size optimised list for hashtable buckets (-1.5 words)
  • pack the value triple into a pair of int63s (offset and kind_then_length) and keep them as immediates in the hashtable (-3 words)

This makes for 8 words per binding. By itself, this isn't much of an improvement, since the entries must still be unpacked from their compact representation and sorted in an array before the merge. The simple solution here is to pick the hashtable bucket using the upper bits of key_hash, so that the hashtable is already almost in the correct order (only the bindings within a particular bucket are relatively out-of-order). This means we can expose a to_sorted_seq that takes O(1) additional memory.
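A minimal sketch of both ideas, with hypothetical names and bit widths (the real irmin-pack encoding may differ):

```ocaml
(* Pack (kind, length) into a single tagged int, so that the value side of a
   binding is two immediates: the offset (an Int63) and kind_then_length. *)
let pack_kind_length ~kind ~length =
  assert (length < 1 lsl 54);
  (Char.code kind lsl 54) lor length

let unpack_kind_length packed =
  (Char.chr (packed lsr 54), packed land ((1 lsl 54) - 1))

(* Derive the bucket index from the top bits of the key hash. Iterating the
   buckets in index order then visits bindings almost in hash order: only the
   bindings inside a single (small) bucket still need sorting, so a
   [to_sorted_seq] needs only O(1) additional memory. *)
let bucket_of ~hash_bits ~bucket_bits key_hash =
  key_hash lsr (hash_bits - bucket_bits)
```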

When log_size = 500_000, this works out to ~128MB in the worst case. (Half what we had before.)

Suggestion 2: keep only log offsets in memory

The previous approach effectively lets the user pick an efficient packed representation for entries in the write-ahead log. However, Index can already uniquely identify entries by their offset on disk (in 1 word). If we accept that finding an entry in the log (and resizing the hashtable) requires reading from disk, we can keep only the offsets in memory (in hashset buckets, again keyed by the upper bits of key_hash). This achieves 2 words per binding.

When log_size = 500_000, this is ~32MB in the worst case. The disadvantage is that values in the log can no longer be found without reading from disk, so best-case read performance is worse. If there are many key_hash prefix collisions in a given bucket, we have to read and decode each entry on find.
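A minimal sketch of the offset-only structure, assuming a user-supplied read_entry_at that reads and decodes the entry stored at a given offset in the log file (all names here are illustrative, not the Index API):

```ocaml
module Offset_log (K : sig
  type t
  val equal : t -> t -> bool
  val hash : t -> int            (* the key_hash above *)
end) (V : sig
  type t                         (* e.g. the (offset, length, kind) triple *)
end) =
struct
  type t = {
    hash_bits : int;                   (* width of K.hash results *)
    bucket_bits : int;
    buckets : int list array;          (* log-file offsets only *)
    read_entry_at : int -> K.t * V.t;  (* hypothetical disk read + decode *)
  }

  let create ~hash_bits ~bucket_bits ~read_entry_at =
    { hash_bits;
      bucket_bits;
      buckets = Array.make (1 lsl bucket_bits) [];
      read_entry_at }

  (* Bucket index = upper bits of the key hash, as in Suggestion 1, so that
     iterating the buckets in order yields offsets almost sorted by hash. *)
  let bucket_of t key = K.hash key lsr (t.hash_bits - t.bucket_bits)

  let add t key log_off =
    let i = bucket_of t key in
    t.buckets.(i) <- log_off :: t.buckets.(i)

  (* Every candidate offset in the bucket costs a disk read, so a hash-prefix
     collision means reading and decoding more than one entry. *)
  let find t key =
    List.find_map
      (fun log_off ->
        let k, v = t.read_entry_at log_off in
        if K.equal k key then Some v else None)
      t.buckets.(bucket_of t key)
end
```

The int list buckets are only for brevity; reaching the quoted ~2 words per binding would need a flatter bucket representation (e.g. plain integer arrays).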

Analysis

|             | Suggestion 1 | Suggestion 2 |
|-------------|--------------|--------------|
| Memory      | ~2x improvement | ~7x improvement |
| Performance | Same as before – fast. | Worse than before: successful log finds must always read from disk. However, the improved memory footprint could allow an LRU to be added, which may be more effective anyway. |
| Interface   | Optimised Index is more complicated and requires that complex code (i.e. an extended hashtable implementation) lives outside Index in the user's code. We can provide helper functors to build compact hashtable implementations, but users won't benefit from them for free. | Same as before – simple. |

Here are some preliminary measurements using the Index replay trace with ~100 million operations. Firstly, internally-measured heap size as a function of operations performed:

[image: internally-measured heap size vs. operations performed]

Secondly, externally-measured maxrss as a function of time:

[image: externally-measured maxrss vs. time]

This shows that the performance hit of a naive implementation of Suggestion 2 is roughly 10% (but my laptop isn't a very precise environment). I suspect it's possible to recover the performance with a relatively small LRU, but haven't done this yet.

Overall, I'm leaning towards #2, depending on the measured benefits of adding an LRU. This should make it possible to substantially increase the default log_size in Tezos (and so avoid stacking merges / improve IO performance). In the medium term, we'll have complementary solutions to the log size issue in the form of reduced index load and/or an alternative index implementation.

CC @pascutto, who kindly volunteered to review the suggestions.

pascutto (Contributor) commented Oct 4, 2021

Thanks for writing this up, that's great!

I also tend to think that option #2 is a better (and good) solution. I had thought about it a while ago, but I assumed the performance cost would be higher (at the time we were optimising for syscalls): not only do all finds read from disk, but we also break the append-only access pattern on the log file and need to lseek more. It's great to read that the impact is not that big! I also agree an LRU can balance it out (perhaps something more like an LFU, but we're closer to having a good LRU implementation).
It also does not conflict with #299, which is fortunate.

#1 seems a bit too hacky to me, and complicates the use of index even more (and it's already quite complicated), so I wouldn't go that way.

pascutto (Contributor) commented Oct 4, 2021

One more note on #2 though: I think that changes the behaviour of find for RO instances a bit.
IIRC, at the moment RO instances keep a copy of the log independent from the RW's, which is updated via sync, so you get answers from the index as it was at the last sync.

This is broken by the new scheme, since the RO instance now relies on on-disk state that can be modified by the RW. Consider the following scenario:

  • RW writes binding with key k and value v at offset off.
  • RO syncs, now its log contains k -> off.
  • RW writes more and merges, the log is cleared (file and memory), k -> v is transferred to data.
  • RW writes more, to the point that off is used for another binding with key k' and value v'.
  • RO finds k (without syncing beforehand): its log says the entry is at offset off, but it now reads v' at that location.

In the current implementation, that last find returns the correct value, whereas with this change it (silently) returns a wrong one.

It seems easy to overcome e.g. with a header check from the RO to make sure a merge hasn't happened, but perhaps that also hurts the performance?
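For illustration, such a check could look something like the sketch below, assuming the log header carries a generation counter that the writer bumps on merge (every name here is hypothetical, and whether such a check is needed at all depends on how the RO accesses the log file):

```ocaml
(* Hypothetical guard around an offset-based find on an RO instance: re-read
   the on-disk header generation and compare it with the generation seen at
   the last sync before trusting any in-memory offset. *)
type ('key, 'value) ro = {
  mutable last_synced_generation : int64;
  read_header_generation : unit -> int64;   (* re-read the on-disk header *)
  find_by_offset : 'key -> 'value option;   (* the offset-based find above *)
}

let guarded_find t key =
  if t.read_header_generation () <> t.last_synced_generation then
    Error `Needs_sync    (* a merge happened: in-memory offsets may be stale *)
  else
    Ok (t.find_by_offset key)
```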

Ngoguey42 commented Oct 4, 2021

@pascutto If my understanding is correct, this shouldn't be a problem as the RO will keep a file descriptor pointing to the old log file. See:

EDIT:
And see https://github.com/mirage/index/blob/master/src/index.ml?ts=4#L823

pascutto (Contributor) commented Oct 4, 2021

Ah right, I missed that change, it looks like it should be OK then
