zigzag: cache optimizations #465
Conversation
Baseline benchmark: …
Need to change the entire …
I don't think you need to; take a look at this: https://doc.rust-lang.org/book/ch15-05-interior-mutability.html
Force-pushed from f68adac to 465c48a.
Thanks for that reference @dignifiedquire! I implemented the `RefCell` solution but at the end I got an error because the `ZigZag` seemed to be used across different threads, so I'm trying to use `RwLock` instead (which is currently not working due to another error I need to figure out). Does that make sense to you?
The current error is due to the fact that the …
I believe the common solution to `RefCell` across threads is to use `Arc<RefCell<T>>`.
On 4 Feb 2019, 18:13 +0100, Lucas Molas <***@***.***> wrote:
Thanks for that reference @dignifiedquire! I implemented the RefCell solution but at the end I got an error because the ZigZag seemed to be used across different threads, so I'm trying to use the RwLock instead (which is currently not working due to another error I need to figure out), does that make sense to you?
Force-pushed from 26069e7 to 2ab0798.
Using `RwLock` and implementing `PartialEq` and `Clone` manually seems like the right call to me. It would be nice to use an LRU cache instead of a `HashMap`, to avoid unbounded growth, but I guess that can be done later.
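For illustration, a minimal sketch of the pattern being endorsed here, assuming a simplified graph type; `CachedGraph`, `ParentCache`, and the fields are placeholders rather than the PR's actual code:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type ParentCache = HashMap<usize, Vec<usize>>;

struct CachedGraph {
    degree: usize,
    cache: Arc<RwLock<ParentCache>>,
}

impl Clone for CachedGraph {
    fn clone(&self) -> Self {
        CachedGraph {
            degree: self.degree,
            // Clones share the same cache instead of copying its contents.
            cache: Arc::clone(&self.cache),
        }
    }
}

impl PartialEq for CachedGraph {
    fn eq(&self, other: &Self) -> bool {
        // Equality is defined on the graph parameters only; the cache is an
        // optimization detail and is deliberately ignored.
        self.degree == other.degree
    }
}
```

Writing `Clone` and `PartialEq` by hand is needed because `RwLock` implements neither, and because the lock should not take part in equality anyway.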
storage-proofs/src/zigzag_graph.rs
Outdated
@@ -223,7 +231,13 @@ where
    #[inline]
    fn expanded_parents(&self, node: usize) -> Vec<usize> {
        (0..self.expansion_degree)
        …
        let parents_cache = self.parents_cache().read().unwrap();
This lock should be inside a block like this:
{
    let parents_cache = ...
}
Otherwise the lock is held for too long; the same goes for the write lock below.
Yep, just found that out in my local test.
I was actually deadlocking the second lock by never releasing the first one.
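A minimal sketch of the scoping pattern discussed above (names are illustrative, not the crate's code): the read guard is confined to its own block so it is released before the write lock is taken, which is what avoids both holding the read lock too long and the deadlock described here.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

fn expanded_parents(cache: &RwLock<HashMap<usize, Vec<usize>>>, node: usize) -> Vec<usize> {
    {
        // The read guard lives only inside this block.
        let parents_cache = cache.read().unwrap();
        if let Some(parents) = parents_cache.get(&node) {
            return parents.clone();
        }
    } // Read guard dropped here, releasing the read lock.

    // Cache miss: compute and store under the write lock.
    let parents = compute_parents(node);
    cache.write().unwrap().insert(node, parents.clone());
    parents
}

fn compute_parents(node: usize) -> Vec<usize> {
    // Stand-in for the real Feistel-based parent generation.
    vec![node]
}
```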
storage-proofs/src/zigzag_graph.rs
Outdated
let parents_cache = self.parents_cache().read().unwrap();
if (*parents_cache).contains_key(&node) {
    return (*parents_cache)[&node].clone();
We should change the signature to return a `&[u8]` instead; cloning vectors is expensive and reduces the usefulness of the cache.
There seem to be some issues indicating that this has already been fixed, e.g.:
rust-lang/rust#13472
rust-lang/rust#11015
rust-lang/rust#13539
WDYT?
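For reference, one way to avoid the clone without changing the cache type, sketched here under the assumption that callers can do their work against a borrowed slice while the read lock is held; the helper name `with_cached_parents` is illustrative and not part of the PR:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Run a caller-supplied closure against the borrowed parents while the read
// lock is held, so nothing is copied and no reference escapes the guard.
fn with_cached_parents<R>(
    cache: &RwLock<HashMap<usize, Vec<usize>>>,
    node: usize,
    f: impl FnOnce(&[usize]) -> R,
) -> Option<R> {
    let guard = cache.read().unwrap();
    guard.get(&node).map(|parents| f(parents.as_slice()))
}
```

The trade-off is that the lock stays held for the duration of the closure, so this only pays off when the per-node work is short.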
Force-pushed from 2ab0798 to dc7906a.
Made a temporary implementation of the …
storage-proofs/src/zigzag_graph.rs
Outdated
// TODO(dig): We should change the signature to return a &[]u8 instead,
// cloning vectors is expensive, and reduces teh usefullness of the cache.
}
// Release the read lock (a write one will be taken later). |
We can, but that would slow things down as it would always block (`RwLock` is single-writer, many-reader).
Pretty much the same time as without the cache (actually this is a bit slower), so I'll take a closer look at the implementation to check if I'm actually caching the results correctly (I'll try to add a test for it). Until then I won't bother too much about the … (Thanks for the help @dignifiedquire.)
It's fairly likely that caching …
Good to know. My current obstacles are independent of what and where we'll be caching, mainly: …
Note to self: to simplify the initial …
It turns out that the cache in …

What was the motivation behind including a cache in the first place? In which scenario can we get a significant number of cache hits to actually test its performance impact? /cc @porcuquine
(Copying the cache when cloning the structure only increased the cache hit count to 15.)
Share both caches. The caches need to be distinguished (even for the same type of graph) between forward and reverse; the parents are not the same. Most of the ZigZag graphs are created from their …

As a starting point every ZigZag graph will hold both caches, even if it will only use one of them throughout its lifetime, but it will retain the other one because it will be needed by the next …

This is already reducing the cache misses to only the encoding of the first layer (which was the desired effect) and providing a performance increase of 20%.
With an actual working cache that proves its usefulness we can iterate from here, evaluating the different trade-offs.
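A rough sketch of this sharing scheme, assuming a simplified `ZigZag` type with a hypothetical `zigzag()` constructor; names and fields are illustrative, not the PR's code:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type ParentCache = HashMap<usize, Vec<usize>>;

// Both caches live together behind one `Arc`, so every graph in the zig-zag
// chain sees the same pair.
struct SharedCaches {
    forward: RwLock<ParentCache>,
    reversed: RwLock<ParentCache>,
}

struct ZigZag {
    reversed: bool,
    caches: Arc<SharedCaches>,
}

impl ZigZag {
    // The graph for the next layer flips direction but keeps both caches,
    // so parents computed for earlier layers stay warm.
    fn zigzag(&self) -> ZigZag {
        ZigZag {
            reversed: !self.reversed,
            caches: Arc::clone(&self.caches),
        }
    }

    // Pick the cache matching this graph's direction.
    fn parents_cache(&self) -> &RwLock<ParentCache> {
        if self.reversed {
            &self.caches.reversed
        } else {
            &self.caches.forward
        }
    }
}
```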
Force-pushed from 0f1393e to b524f5a.
@porcuquine Ready for review. This is not the optimal/final solution; it's just a low-impact cache (provided …).

Strangely, this is performing much better than the previous implementation, which was conceptually the same; the only concrete difference was allocating the size of the caches up front. (Since I can't really explain this, let's keep assuming for now the conservative 20% speed improvement mentioned before and not this 35%.)
Force-pushed from b524f5a to 274a52d.
(Dropping the CircleCI benchmark now.)
Actually, taking a look at the …, it may be more expensive than I originally thought.
Force-pushed from 274a52d to a9d50f9.
(Fix …)
I tried running this and got a message about cache size: …

Is that by design? [Okay, I see that it is.]
storage-proofs/src/zigzag_graph.rs
Outdated
pub type ParentCache = HashMap<usize, Vec<usize>>;
// TODO: Using `usize` as its the dominant type throughout the
// code, but it should be reconciled with the underlying `u32`
// used in Feistel. |
`u32` will be fine up to a point, but I don't think it needs to be matched to the Feistel implementation. Graph nodes are 32 bytes (2^5), so `u32` will let us handle sectors of up to 2^5 bytes * 2^32 = 2^37 bytes = 128 GiB. We may eventually want/need to support larger sectors, in which case we would need larger index representations. Since 64 bits would be wasteful, maybe we should just be using the smallest number of bytes that will hold all the indexes we need for the graph in question.
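A quick sanity check of that arithmetic (nothing here is project code, just the numbers from the comment above):

```rust
const NODE_SIZE: u64 = 32; // 2^5 bytes per node
const MAX_NODES: u64 = 1 << 32; // node indexes representable in a u32
const MAX_SECTOR: u64 = NODE_SIZE * MAX_NODES; // 2^5 * 2^32 = 2^37 bytes

fn main() {
    // 2^37 bytes is exactly 128 GiB.
    assert_eq!(MAX_SECTOR, 128 * (1 << 30));
}
```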
> but I don't think it needs to be matched to the Feistel implementation.

My concern is that at the moment I haven't seen any check restricting the number of nodes (although I might have missed it in other related files). I seem to be able to pass any value to `zigzag --size 100000000000` (I haven't waited for the generation of fake data to see if this actually continues the execution), and a `usize` in the code also gives the impression that any value is possible, when we're actually coercing it later to `u32` (so any value above that range will seem to violate the ZigZag semantics).
I agree that we are implicitly limited by Feistel. Here's what I suggest:

Let's put in an explicit check on the number of nodes allowed in a `ZigZagGraph`. As we move forward, we are going to need to support larger and larger sector sizes. 128 GiB is still out of range, so we don't need to solve the problem yet. Once we can otherwise handle such large sectors, we can extend our use of Feistel to accommodate that need.
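A hedged sketch of what such an explicit check might look like, assuming a hypothetical constructor function rather than the crate's real API:

```rust
// Refuse to build a graph whose node indexes would not fit in the u32 space
// that the Feistel permutation currently works over.
fn new_zigzag_graph(nodes: usize) -> Result<(), String> {
    if nodes > u32::MAX as usize {
        return Err(format!(
            "ZigZagGraph of {} nodes exceeds the u32 index range used by Feistel",
            nodes
        ));
    }
    // ... proceed with actual graph construction ...
    Ok(())
}
```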
storage-proofs/src/zigzag_graph.rs
Outdated
// ZigZagGraph will hold two different (but related) `ParentCache`,
// the first one for the `forward` direction and the second one
// for the `reversed`.
pub type TwoWayParentCache = Vec<ParentCache>;
Since it's exactly two, you might also use a pair `(ParentCache, ParentCache)`.
Alternatively, instead of two caches, you might consider storing a pair (or struct) holding both the 'forward' and 'backward' parents for each node. If your data structure is a `BTreeMap` of such pairs, this might be faster and/or smaller (I don't think there's physical overhead to such a static tuple) than two trees. Locality will also differ, so it might be something to play with when tweaking.
> Since it's exactly two, you might also use a pair `(ParentCache, ParentCache)`.

Yes, this seems more natural.
The tuple doesn't seem to allow dynamic indexing like `tuple.index`; it needs the literal number. I'll change the `Vec` to an array of fixed length, though.
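A small sketch of that difference, with illustrative types only: a tuple accepts nothing but literal field indexes, while a fixed-length array can be indexed with a value computed at runtime.

```rust
use std::collections::HashMap;

type ParentCache = HashMap<usize, Vec<usize>>;

// Tuples require the literal field: `.0` or `.1`, chosen by branching.
fn pick_from_pair(pair: &(ParentCache, ParentCache), reversed: bool) -> &ParentCache {
    if reversed {
        &pair.1
    } else {
        &pair.0
    }
}

// A fixed-length array can be indexed directly by the direction flag.
fn pick_from_array(caches: &[ParentCache; 2], reversed: bool) -> &ParentCache {
    // `reversed as usize` is 0 or 1, so the direction selects the slot.
    &caches[reversed as usize]
}
```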
storage-proofs/src/zigzag_graph.rs
Outdated
// TODO: Evaluate decoupling the two caches in different `RwLock` to reduce
// contention. At the moment they are joined under the same lock for simplicity
// since `transform_and_replicate_layers` even in the parallel case seems to
// generate the parents (`vde::encode`) of the different layers sequentially.
I think ZigZag almost gives a guarantee of serial access to subsequent layers when encoding. You can't begin to encode the next layer (in the opposite direction) until having finished with the current one. This could be fudged a little, and since the graph is reused for a period of time, one could analyze and perhaps come up with an ordering that violated this assumption, though.
However, multiple simultaneous replication processes certainly will want to have access. Multiple readers should be fine with `RwLock`, though (right?), so I assume you're just talking about initially populating the cache. We probably don't need to hyper-optimize that.
> Multiple readers should be fine with `RwLock`, though (right?)

Yes.
// This is not an LRU cache, it holds the first `cache_entries` of the total
// possible `base_graph.size()` (the assumption here is that we either request
// all entries sequentially when encoding or any random entry once when proving
// or verifying, but there's no locality to take advantage of so keep the logic
I think that assumption is accurate.
// it would allow to batch parents calculations with that single lock. Also,
// since there is a reciprocity between forward and reversed parents,
// we would only need to compute the parents in one direction and with
// that fill both caches.
Good observation. If you populate the forward and backward cache on the first pass, you can cut the Feistel calls in half and make full use of each.
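A hedged sketch of that single-pass idea. It assumes, as the comment does, that an expander edge `parent -> node` in the forward direction corresponds to the edge `node -> parent` in the reversed direction; the real index mapping between directions may need an extra transformation, so this only shows the bookkeeping. The function name and parameters are illustrative.

```rust
use std::collections::HashMap;

type ParentCache = HashMap<usize, Vec<usize>>;

fn fill_both_caches(
    nodes: usize,
    forward_parents: impl Fn(usize) -> Vec<usize>, // e.g. the Feistel-based generator
    forward_cache: &mut ParentCache,
    reversed_cache: &mut ParentCache,
) {
    for node in 0..nodes {
        let parents = forward_parents(node);
        for &parent in &parents {
            // Each forward edge also tells us one reversed parent,
            // so a single pass of Feistel calls fills both caches.
            reversed_cache.entry(parent).or_default().push(node);
        }
        forward_cache.insert(node, parents);
    }
}
```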
// TODO: Arbitrarily chosen for tests.
…
// Cache of node's parents.
pub type ParentCache = HashMap<usize, Vec<usize>>;
You may want to consider using a `BTreeMap`. I believe it will be more compact and faster to iterate through sequentially (in either direction).
Depending on how we end up designing it, I'm considering pre-allocating all of it and leaving just a `[u8]`, but that's something that should be discussed in the issue.
@porcuquine Thanks for the thorough review. I think most of what's discussed here should actually be moved to the original issue (see #455 (comment)) to finish delineating the design of the cache. The purpose of this PR is just to have a basic implementation with a low memory footprint to help move the design discussion forward. Besides a minor change I'll do about the cache tuple, what do you think needs to be changed here now (instead of postponing it for the design discussion in the issue) to land this PR? I think we should set …
We can make that a …
If I understand correctly, the idea is to hold off on merging this until a configuration API is in place (which I think is captured in #501) that would help adjust the knobs of this cache.
That is correct. Please coordinate with @sidke about the ETA on that feature.
Force-pushed from d1a0196 to ce33e38.
@porcuquine rebased and unlimited, go wild 🏃‍♂️
Thank you.
@sidke delivered, so it's my turn to push this forward (next week).
@porcuquine heads-up, this PR will be changing in the following days (so make sure to …)
Depends on (and will adapt to) #539.
@schomatis Now that #539 has merged, I think you are clear to lightly adapt this and finally get it merged. Thank you for your patience.
Force-pushed from 3e41d91 to 4175ba3.
@porcuquine Adapted to the new config, ready for review.
Force-pushed from 4175ba3 to 9c2b2b7.
Force-pushed from 9c2b2b7 to e12f5c4.
(Rebasing.)
// If we can't find the `MAXIMIZE_CACHING` assume the conservative
// option of no cache.
};
This looks good for now. We will probably move this logic into another layer later, but putting it at the point of use seems optimal for present purposes.
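For illustration, the decision being described boils down to something like the following. The environment-variable lookup used here is a stand-in for the project's actual configuration layer, and the `MAXIMIZE_CACHING` key name is taken from the diff comment above rather than from a verified setting name.

```rust
use std::env;

// Decide how many parent entries to cache for a graph of `graph_size` nodes.
fn cache_entries(graph_size: usize) -> usize {
    match env::var("MAXIMIZE_CACHING") {
        // Flag present and truthy: cache parents for every node of the graph.
        Ok(val) if val == "1" || val.eq_ignore_ascii_case("true") => graph_size,
        // If we can't find the flag, take the conservative option of no cache.
        _ => 0,
    }
}
```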