zigzag: cache optimizations #465
Conversation
Baseline benchmark: …
Need to change the entire …
I don't think you need to; take a look at this: https://doc.rust-lang.org/book/ch15-05-interior-mutability.html
Force-pushed from f68adac to 465c48a.
Thanks for that reference @dignifiedquire! I implemented the `RefCell` solution but at the end I got an error because the `ZigZag` seemed to be used across different threads, so I'm trying to use `RwLock` instead (which is currently not working due to another error I need to figure out). Does that make sense to you?
The current error is due to the fact that the …
I believe the common solution to `RefCell` across threads is to use `Arc<RefCell<T>>`.
On 4 Feb 2019, 18:13 +0100, Lucas Molas <***@***.***> wrote:
Thanks for that reference @dignifiedquire! I implemented the RefCell solution but at the end I got an error because the ZigZag seemed to be used across different threads, so I'm trying to use the RwLock instead (which is currently not working due to another error I need to figure out), does that make sense to you?
Force-pushed from 26069e7 to 2ab0798.
Using `RwLock` and implementing `PartialEq` and `Clone` manually seems like the right call to me. It would be nice to use an LRU cache instead of a `HashMap`, to avoid unbounded growth, but I guess that can be done later.
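For illustration, a minimal sketch of the pattern being endorsed here, assuming a simplified graph type; `CachedGraph`, `ParentCache`, and the fields are placeholders rather than the PR's actual code:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type ParentCache = HashMap<usize, Vec<usize>>;

struct CachedGraph {
    degree: usize,
    cache: Arc<RwLock<ParentCache>>,
}

impl Clone for CachedGraph {
    fn clone(&self) -> Self {
        CachedGraph {
            degree: self.degree,
            // Clones share the same cache instead of copying its contents.
            cache: Arc::clone(&self.cache),
        }
    }
}

impl PartialEq for CachedGraph {
    fn eq(&self, other: &Self) -> bool {
        // Equality is defined on the graph parameters only; the cache is an
        // optimization detail and is deliberately ignored.
        self.degree == other.degree
    }
}
```

Writing `Clone` and `PartialEq` by hand is needed because `RwLock` implements neither, and because the lock should not take part in equality anyway.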
storage-proofs/src/zigzag_graph.rs
Outdated
@@ -223,7 +231,13 @@ where
    #[inline]
    fn expanded_parents(&self, node: usize) -> Vec<usize> {
        (0..self.expansion_degree)
        …
        let parents_cache = self.parents_cache().read().unwrap();
This lock should be inside a block like this:
{
    let parents_cache = ...
}
Otherwise the lock is held for too long; the same goes for the write lock below.
Yep, just found that out in my local test.
I was actually deadlocking the second lock by never releasing the first one.
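A minimal sketch of the scoping pattern discussed above (names are illustrative, not the crate's code): the read guard is confined to its own block so it is released before the write lock is taken, which is what avoids both holding the read lock too long and the deadlock described here.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

fn expanded_parents(cache: &RwLock<HashMap<usize, Vec<usize>>>, node: usize) -> Vec<usize> {
    {
        // The read guard lives only inside this block.
        let parents_cache = cache.read().unwrap();
        if let Some(parents) = parents_cache.get(&node) {
            return parents.clone();
        }
    } // Read guard dropped here, releasing the read lock.

    // Cache miss: compute and store under the write lock.
    let parents = compute_parents(node);
    cache.write().unwrap().insert(node, parents.clone());
    parents
}

fn compute_parents(node: usize) -> Vec<usize> {
    // Stand-in for the real Feistel-based parent generation.
    vec![node]
}
```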
storage-proofs/src/zigzag_graph.rs
Outdated
let parents_cache = self.parents_cache().read().unwrap();
if (*parents_cache).contains_key(&node) {
    return (*parents_cache)[&node].clone();
We should change the signature to return a `&[u8]` instead; cloning vectors is expensive and reduces the usefulness of the cache.
There seem to be some issues indicating that this has already been fixed, e.g.:
rust-lang/rust#13472
rust-lang/rust#11015
rust-lang/rust#13539
WDYT?
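For reference, one way to avoid the clone without changing the cache type, sketched here under the assumption that callers can do their work against a borrowed slice while the read lock is held; the helper name `with_cached_parents` is illustrative and not part of the PR:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Run a caller-supplied closure against the borrowed parents while the read
// lock is held, so nothing is copied and no reference escapes the guard.
fn with_cached_parents<R>(
    cache: &RwLock<HashMap<usize, Vec<usize>>>,
    node: usize,
    f: impl FnOnce(&[usize]) -> R,
) -> Option<R> {
    let guard = cache.read().unwrap();
    guard.get(&node).map(|parents| f(parents.as_slice()))
}
```

The trade-off is that the lock stays held for the duration of the closure, so this only pays off when the per-node work is short.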
Force-pushed from 2ab0798 to dc7906a.
Made a temporary implementation of the …
storage-proofs/src/zigzag_graph.rs
Outdated
// TODO(dig): We should change the signature to return a &[]u8 instead,
// cloning vectors is expensive, and reduces teh usefullness of the cache.
}
// Release the read lock (a write one will be taken later). |
We can, but that would slow things down as it would always block (`RwLock` is single-writer, many-reader).
Pretty much the same time as without the cache (actually this is a bit slower), so I'll take a closer look at the implementation to check if I'm actually caching the results correctly (I'll try to add a test for it). Until then I won't bother too much about the … (Thanks for the help @dignifiedquire.)
It's fairly likely that caching …
Good to know. My current obstacles are independent of what and where we'll be caching, mainly: …
Note to self: to simplify the initial …
It turns out that the cache in …

What was the motivation behind including a cache in the first place? In which scenario can we get a significant number of cache hits to actually test its performance impact? /cc @porcuquine
(Copying the cache when cloning the structure only increased the cache hit count to 15.)
Share both caches. The caches need to be distinguished (even for the same type of graph) between forward and reverse; the parents are not the same. Most of the ZigZag graphs are created from their …

As a starting point every ZigZag graph will hold both caches, even if it will only use one of them throughout its lifetime, but it will retain the other one because it will be needed by the next …

This is already reducing the cache misses to only the encoding of the first layer (which was the desired effect) and providing a performance increase of 20%.
With an actual working cache that proves its usefulness we can iterate from here, evaluating the different trade-offs.
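A rough sketch of this sharing scheme, assuming a simplified `ZigZag` type with a hypothetical `zigzag()` constructor; names and fields are illustrative, not the PR's code:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type ParentCache = HashMap<usize, Vec<usize>>;

// Both caches live together behind one `Arc`, so every graph in the zig-zag
// chain sees the same pair.
struct SharedCaches {
    forward: RwLock<ParentCache>,
    reversed: RwLock<ParentCache>,
}

struct ZigZag {
    reversed: bool,
    caches: Arc<SharedCaches>,
}

impl ZigZag {
    // The graph for the next layer flips direction but keeps both caches,
    // so parents computed for earlier layers stay warm.
    fn zigzag(&self) -> ZigZag {
        ZigZag {
            reversed: !self.reversed,
            caches: Arc::clone(&self.caches),
        }
    }

    // Pick the cache matching this graph's direction.
    fn parents_cache(&self) -> &RwLock<ParentCache> {
        if self.reversed {
            &self.caches.reversed
        } else {
            &self.caches.forward
        }
    }
}
```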
Force-pushed from 0f1393e to b524f5a.
@porcuquine Ready for review. This is not the optimal/final solution; it's just a low-impact cache (provided …).

Strangely, this is performing much better than the previous implementation, which was conceptually the same; the only concrete difference was allocating the size of the caches up front. (Since I can't really explain this, let's keep assuming for now the conservative 20% speed improvement mentioned before and not this 35%.)
Force-pushed from b524f5a to 274a52d.
(Dropping the CircleCI benchmark now.)
Actually, taking a look at the …, it may be more expensive than I originally thought.
Force-pushed from 274a52d to a9d50f9.
(Fix …)
I tried running this and got a message about cache size: …

Is that by design? [Okay, I see that it is.]
storage-proofs/src/zigzag_graph.rs
Outdated
pub type ParentCache = HashMap<usize, Vec<usize>>;
// TODO: Using `usize` as its the dominant type throughout the
// code, but it should be reconciled with the underlying `u32`
// used in Feistel. |
`u32` will be fine up to a point, but I don't think it needs to be matched to the Feistel implementation. Graph nodes are 32 bytes (2^5), so `u32` will let us handle sectors of up to 2^5 bytes * 2^32 = 2^37 bytes = 128 GiB. We may eventually want/need to support larger sectors, in which case we would need larger index representations. Since 64 bits would be wasteful, maybe we should just be using the smallest number of bytes that will hold all the indexes we need for the graph in question.
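A quick sanity check of that arithmetic (nothing here is project code, just the numbers from the comment above):

```rust
const NODE_SIZE: u64 = 32; // 2^5 bytes per node
const MAX_NODES: u64 = 1 << 32; // node indexes representable in a u32
const MAX_SECTOR: u64 = NODE_SIZE * MAX_NODES; // 2^5 * 2^32 = 2^37 bytes

fn main() {
    // 2^37 bytes is exactly 128 GiB.
    assert_eq!(MAX_SECTOR, 128 * (1 << 30));
}
```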
> but I don't think it needs to be matched to the Feistel implementation.

My concern is that at the moment I haven't seen any check restricting the number of nodes (although I might have missed it in other related files). I seem to be able to pass any value to `zigzag --size 100000000000` (I haven't waited for the generation of fake data to see if this actually continues the execution), and a `usize` in the code also gives the impression that any value is possible, when we're actually coercing it later to `u32` (so any value above that range will seem to violate the ZigZag semantics).
I agree that we are implicitly limited by Feistel. Here's what I suggest:

Let's put in an explicit check on the number of nodes allowed in a `ZigZagGraph`. As we move forward, we are going to need to support larger and larger sector sizes. 128 GiB is still out of range, so we don't need to solve the problem yet. Once we can otherwise handle such large sectors, we can extend our use of Feistel to accommodate that need.
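A hedged sketch of what such an explicit check might look like, assuming a hypothetical constructor function rather than the crate's real API:

```rust
// Refuse to build a graph whose node indexes would not fit in the u32 space
// that the Feistel permutation currently works over.
fn new_zigzag_graph(nodes: usize) -> Result<(), String> {
    if nodes > u32::MAX as usize {
        return Err(format!(
            "ZigZagGraph of {} nodes exceeds the u32 index range used by Feistel",
            nodes
        ));
    }
    // ... proceed with actual graph construction ...
    Ok(())
}
```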
storage-proofs/src/zigzag_graph.rs
Outdated
// ZigZagGraph will hold two different (but related) `ParentCache`,
// the first one for the `forward` direction and the second one
// for the `reversed`.
pub type TwoWayParentCache = Vec<ParentCache>;
Since it's exactly two, you might also use a pair `(ParentCache, ParentCache)`.
Alternatively, instead of two caches, you might consider storing a pair (or struct) holding both the 'forward' and 'backward' parents for each node. If your data structure is a `BTreeMap` of such pairs, this might be faster and/or smaller (I don't think there's physical overhead to such a static tuple) than two trees. Locality will also differ, so it might be something to play with when tweaking.
> Since it's exactly two, you might also use a pair `(ParentCache, ParentCache)`.

Yes, this seems more natural.
The tuple doesn't seem to allow dynamic indexing like `tuple.index`; it needs the literal number. I'll change the `Vec` to an array of fixed length, though.
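A small sketch of that difference, with illustrative types only: a tuple accepts nothing but literal field indexes, while a fixed-length array can be indexed with a value computed at runtime.

```rust
use std::collections::HashMap;

type ParentCache = HashMap<usize, Vec<usize>>;

// Tuples require the literal field: `.0` or `.1`, chosen by branching.
fn pick_from_pair(pair: &(ParentCache, ParentCache), reversed: bool) -> &ParentCache {
    if reversed {
        &pair.1
    } else {
        &pair.0
    }
}

// A fixed-length array can be indexed directly by the direction flag.
fn pick_from_array(caches: &[ParentCache; 2], reversed: bool) -> &ParentCache {
    // `reversed as usize` is 0 or 1, so the direction selects the slot.
    &caches[reversed as usize]
}
```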
storage-proofs/src/zigzag_graph.rs
Outdated
// TODO: Evaluate decoupling the two caches in different `RwLock` to reduce
// contention. At the moment they are joined under the same lock for simplicity
// since `transform_and_replicate_layers` even in the parallel case seems to
// generate the parents (`vde::encode`) of the different layers sequentially.
I think ZigZag almost gives a guarantee of serial access to subsequent layers when encoding. You can't begin to encode the next layer (in the opposite direction) until having finished with the current one. This could be fudged a little, and since the graph is reused for a period of time, one could analyze and perhaps come up with an ordering that violated this assumption, though.
However, multiple simultaneous replication processes certainly will want to have access. Multiple readers should be fine with `RwLock`, though (right?), so I assume you're just talking about initially populating the cache. We probably don't need to hyper-optimize that.
> Multiple readers should be fine with `RwLock`, though (right?)

Yes.
// This is not an LRU cache, it holds the first `cache_entries` of the total
// possible `base_graph.size()` (the assumption here is that we either request
// all entries sequentially when encoding or any random entry once when proving
// or verifying, but there's no locality to take advantage of so keep the logic
I think that assumption is accurate.
// it would allow to batch parents calculations with that single lock. Also,
// since there is a reciprocity between forward and reversed parents,
// we would only need to compute the parents in one direction and with
// that fill both caches.
Good observation. If you populate the forward and backward cache on the first pass, you can cut the Feistel calls in half and make full use of each.
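A hedged sketch of that single-pass idea. It assumes, as the comment does, that an expander edge `parent -> node` in the forward direction corresponds to the edge `node -> parent` in the reversed direction; the real index mapping between directions may need an extra transformation, so this only shows the bookkeeping. The function name and parameters are illustrative.

```rust
use std::collections::HashMap;

type ParentCache = HashMap<usize, Vec<usize>>;

fn fill_both_caches(
    nodes: usize,
    forward_parents: impl Fn(usize) -> Vec<usize>, // e.g. the Feistel-based generator
    forward_cache: &mut ParentCache,
    reversed_cache: &mut ParentCache,
) {
    for node in 0..nodes {
        let parents = forward_parents(node);
        for &parent in &parents {
            // Each forward edge also tells us one reversed parent,
            // so a single pass of Feistel calls fills both caches.
            reversed_cache.entry(parent).or_default().push(node);
        }
        forward_cache.insert(node, parents);
    }
}
```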
// TODO: Arbitrarily chosen for tests.
…
// Cache of node's parents.
pub type ParentCache = HashMap<usize, Vec<usize>>;
You may want to consider using a `BTreeMap`. I believe it will be more compact and faster to iterate through sequentially (in either direction).
Depending on how we end up designing it, I'm considering pre-allocating all of it and leaving just a `[u8]`, but that's something that should be discussed in the issue.
@porcuquine Thanks for the thorough review. I think most of what's discussed here should actually be moved to the original issue (see #455 (comment)) to finish delineating the design of the cache. The purpose of this PR is just to have a basic implementation with a low memory footprint to help move the design discussion forward. Besides a minor change I'll do about the cache tuple, what do you think needs to be changed here now (instead of postponing it for the design discussion in the issue) to land this PR? I think we should set …
We can make that a …
If I understand correctly, the idea is to hold off on merging this until a configuration API is in place (which I think is captured in #501) that would help adjust the knobs of this cache.
That is correct. Please coordinate with @sidke about the ETA on that feature.
Force-pushed from d1a0196 to ce33e38.
@porcuquine rebased and unlimited, go wild 🏃‍♂️
Thank you.
@sidke delivered, so it's my turn to push this forward (next week).
@porcuquine heads-up, this PR will be changing in the following days (so make sure to …)
Depends on (and will adapt to) #539.
@schomatis Now that #539 has merged, I think you are clear to lightly adapt this and finally get it merged. Thank you for your patience.
Force-pushed from 3e41d91 to 4175ba3.
@porcuquine Adapted to the new config, ready for review.
Force-pushed from 4175ba3 to 9c2b2b7.
Force-pushed from 9c2b2b7 to e12f5c4.
(Rebasing.)
// If we can't find the `MAXIMIZE_CACHING` assume the conservative
// option of no cache.
};
This looks good for now. We will probably move this logic into another layer later, but putting it at the point of use seems optimal for present purposes.
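For illustration, the decision being described boils down to something like the following. The environment-variable lookup used here is a stand-in for the project's actual configuration layer, and the `MAXIMIZE_CACHING` key name is taken from the diff comment above rather than from a verified setting name.

```rust
use std::env;

// Decide how many parent entries to cache for a graph of `graph_size` nodes.
fn cache_entries(graph_size: usize) -> usize {
    match env::var("MAXIMIZE_CACHING") {
        // Flag present and truthy: cache parents for every node of the graph.
        Ok(val) if val == "1" || val.eq_ignore_ascii_case("true") => graph_size,
        // If we can't find the flag, take the conservative option of no cache.
        _ => 0,
    }
}
```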