Optimizing Stacked Borrows (part 1?): Cache locations of Tags in a Borrow Stack #1935
Thanks a lot for this PR! Those are some impressive wins given the conceptually simple changes. :)
Presumably this also has basically no memory cost, so I wonder how much just doing this would help?
So could we save some memory by removing this part?
Note that I have no idea how suited these benchmarks are. It might be good to do at least superficial benchmarking of the code in Also measuring with default flags to ensure that doesn't regress (or cost a lot of memory for no gain) would be good.
That was one of my hypotheses for the main culprit. The other one is that even if RangeMap is fitting to the access pattern (so we don't keep splitting and merging subranges), just the fact that the stacks keep growing will be expensive in the long run -- even if we have perfect sharing, that's still quadratic since we keep iterating a larger and larger vector.
Hah! I was afraid these would look prohibitively complex.
We could, if we always tag raw pointers. Without As a sidenote: This is definitely the reason
Yeah. I'll cook up a little script to run through some more reasonable benchmarking code with the various SB options.
I suspect all iteration can be efficiently amortized out. If the parent stack from a split is stored in an |
Well, we could always use a linear scan instead, right? That will only slow down raw ptr accesses, and save some memory. This might or might not be a reasonable trade-off, not sure.
That is odd, since the search should stop at the first untagged item it finds -- except when it is looking for a writeable item, then it will skip all the read-only untagged items.
You're right. I don't know what I was looking at before, but this is a good insight. Nearly every access of an Untagged item wants the topmost Untagged item (easily demonstrated with some So I of course slapped a trivial cache on it, and now it's an improvement across the board. I was planning on using |
rust-lang/rust#92121 should help with that.
☔ The latest upstream changes (presumably #1945) made this pull request unmergeable. Please resolve the merge conflicts.
Branch updated from 9606bce to a73d32f.
I'm still slowly working on this. I tried running miri-test-libstd over
So I have absolutely no idea what was going on there. It would be cool to have a way to peer inside Miri and see what code it is executing; my best guess is that something in the test harness ends up quadratic or worse inside Miri due to the nature of Stacked Borrows. I also tried running the tests for At the moment I'm chewing on ideas to do something about this. I don't think a patch that makes running the standard library tests with tag-raw-pointers a non-starter is a good idea.
Yeah that would be cool... Cc #1782.
Marking as "waiting for author" based on above comment. Please let me know if that is inaccurate. :)
That is accurate. I'm currently focusing on other things, but I'm becoming increasingly convinced that the only good way out of this borrow stack growth situation is to GC the borrow stacks. But even if that works, it would require that Miri be notified when a pointer or reference goes away, which of course bumps up against ptr-int-ptr casts. So overall I feel like I don't have any ideas which are easy enough to implement that they rise up my priority list, and the hard ideas I have are stepping right into the most unstable parts of Miri.
☔ The latest upstream changes (presumably #2030) made this pull request unmergeable. Please resolve the merge conflicts.
I've been thinking about this problem for a while, and I finally decided to code up my idea for a super simple limited-size cache. I've rebased in the new implementation and updated the top comment. This is freaking awesome, can't believe I didn't try this approach before. I think there are a lot of clear ways that this could be micro-optimized, but what I'm seeing in profiles makes that work hard to justify. My biggest concern at this point is code organization.
Yeah, I agree. As a starter, actually documenting the data structure invariants and how exactly the cache relates to the "live" data would be good. :) Right now, to review this I would have to reverse engineer all that. In fact, I think we should have an actual Rust function that checks this invariant, i.e., that checks whether the cache is consistent with the "live" data, and we should have a compile-time flag to enable run-time checking that the cache does not get out-of-sync with the live data. Such a machine-readable invariant would probably also help clarify/complement the human-readable comments. As for factoring things, the first "obvious" proposal would be to move |
☔ The latest upstream changes (presumably #2297) made this pull request unmergeable. Please resolve the merge conflicts.
This adds a very simple LRU-like cache which stores the locations of often-used tags. While the implementation is very simple, the cache hit rate is incredible at ~99.9% on most programs, and often the element at position 0 in the cache has a hit rate of 90%. So the sub-optimality of this cache basically vanishes into the noise in a profile. Additionally, we keep a range which denotes where there might be an item granting Unique permission in the stack, so that when we invalidate Uniques we do not need to scan much of the stack, and often scan nothing at all.
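For readers skimming the thread, here is a minimal sketch of the kind of LRU-like tag cache described above, together with the sort of consistency check requested earlier in the review. It is not the code from this PR: `CachedStack`, `CACHE_LEN`, the `u64` tag type, and every method name are hypothetical, and the stack is reduced to a bare list of tags that are assumed to be unique.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a pointer tag.
type Tag = u64;

/// Number of (tag, index) pairs remembered. The value 32 is an assumption.
const CACHE_LEN: usize = 32;

/// A borrow stack reduced to its tags, plus a small recently-used cache.
struct CachedStack {
    items: Vec<Tag>,
    /// Front = most recently used. Each entry is (tag, index into `items`).
    cache: VecDeque<(Tag, usize)>,
}

impl CachedStack {
    fn new() -> Self {
        Self { items: Vec::new(), cache: VecDeque::with_capacity(CACHE_LEN) }
    }

    /// Push a new item and remember where it landed.
    fn push(&mut self, tag: Tag) {
        self.items.push(tag);
        let idx = self.items.len() - 1;
        self.remember(tag, idx);
    }

    /// Keep only the bottom `keep` items and drop cache entries that
    /// would now point past the top of the stack.
    fn truncate(&mut self, keep: usize) {
        self.items.truncate(keep);
        self.cache.retain(|&(_, idx)| idx < keep);
    }

    /// Find the position of `tag`: cache first, linear scan on a miss.
    fn find(&mut self, tag: Tag) -> Option<usize> {
        if let Some(pos) = self.cache.iter().position(|&(t, _)| t == tag) {
            // Hit: move the entry to the front so hot tags stay cheap.
            let entry = self.cache.remove(pos).unwrap();
            self.cache.push_front(entry);
            return Some(entry.1);
        }
        // Miss: scan from the top of the stack, then cache the result.
        let idx = self.items.iter().rposition(|&t| t == tag)?;
        self.remember(tag, idx);
        Some(idx)
    }

    /// Insert at the front, evicting the least recently used entry if full.
    fn remember(&mut self, tag: Tag, idx: usize) {
        if self.cache.len() == CACHE_LEN {
            self.cache.pop_back();
        }
        self.cache.push_front((tag, idx));
    }

    /// Invariant: every cache entry points at an item holding that tag.
    fn cache_is_consistent(&self) -> bool {
        self.cache.iter().all(|&(tag, idx)| self.items.get(idx) == Some(&tag))
    }
}
```

A check like `cache_is_consistent` could sit behind a compile-time flag and run after every mutation, which is roughly the machine-readable invariant asked for above.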
I'm getting denied by clippy:
But isn't that same signature currently on the master branch? I'm happy to add an allow here, but it feels odd to be adding the allow in this PR.
Other than my last rounds of comments, I think this is good to go. :)
Stacked Borrows go brrrrrr
📌 Commit b004a03 has been approved by |
☀️ Test successful - checks-actions |
Before this PR, a profile of Miri under almost any workload points quite squarely at these regions of code as being incredibly hot (each being ~40% of cycles):
miri/src/stacked_borrows.rs, lines 259 to 269 in dadcbeb
miri/src/stacked_borrows.rs, lines 362 to 369 in dadcbeb
This code is one of at least three reasons that stacked borrows analysis is super-linear: These are both linear in the number of borrows in the stack and they are positioned along the most commonly-taken paths.
I'm addressing the first loop (which is in Stack::find_granting) by adding a very very simple sort of LRU cache implemented on a VecDeque, which maps recently-looked-up tags to their position in the stack. For Untagged access we fall back to the same sort of linear search. But as far as I can tell there are never enough Untagged items to be significant.
I'm addressing the second loop by keeping track of the region of stack where there could be items granting Permission::Unique. This optimization is incredibly effective because Read access tends to dominate and many trips through this code path now skip the loop entirely.
These optimizations result in pretty enormous improvements:
Without raw pointer tagging, mse 34.5s -> 2.4s, serde1 5.6s -> 3.6s
With raw pointer tagging, mse 35.3s -> 2.4s, serde1 5.7s -> 3.6s
And there is hardly any impact on memory usage:
Memory usage on mse 844 MB -> 848 MB, serde1 184 MB -> 184 MB (jitter on these is a few MB).
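The second optimization above (tracking the region of the stack that could hold Unique items) can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the `Perm` enum is trimmed down, `Stack`, `unique_range`, and `disable_uniques_above` are hypothetical names, and a read is modeled simply as "disable every Unique above the granting item". The tracked range is allowed to be a conservative superset of where Uniques actually live.

```rust
use std::ops::Range;

/// Simplified permissions; a hypothetical subset for illustration.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Perm {
    Unique,
    SharedReadOnly,
    Disabled,
}

struct Stack {
    items: Vec<Perm>,
    /// Every `Unique` item has its index inside this range. The range may
    /// cover non-Unique items too; an empty range means "no Unique at all".
    unique_range: Range<usize>,
}

impl Stack {
    fn new() -> Self {
        Self { items: Vec::new(), unique_range: 0..0 }
    }

    fn push(&mut self, perm: Perm) {
        let idx = self.items.len();
        self.items.push(perm);
        if perm == Perm::Unique {
            if self.unique_range.is_empty() {
                self.unique_range = idx..idx + 1;
            } else {
                // Grow the range upward to cover the new item.
                self.unique_range.end = idx + 1;
            }
        }
    }

    /// Disable all `Unique` items above `granting_idx`, scanning only the
    /// part of the stack that may contain one. Returns how many slots were
    /// visited, so the saving is observable.
    fn disable_uniques_above(&mut self, granting_idx: usize) -> usize {
        let start = self.unique_range.start.max(granting_idx + 1);
        let end = self.unique_range.end.min(self.items.len());
        let mut scanned = 0;
        // If start >= end this loop body never runs: the common fast path.
        for idx in start..end {
            scanned += 1;
            if self.items[idx] == Perm::Unique {
                self.items[idx] = Perm::Disabled;
            }
        }
        // No Unique remains at or above `start`, so shrink the range.
        self.unique_range.end = self.unique_range.end.min(start);
        scanned
    }
}
```

In this sketch, a read whose granting item sits above the tracked range visits zero slots, which is the "skip the loop entirely" case the description mentions.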