
Optimizing Stacked Borrows (part 1?): Cache locations of Tags in a Borrow Stack #1935

Merged · 5 commits · Jul 3, 2022

Conversation

@saethlin (Member) commented Dec 12, 2021

Before this PR, a profile of Miri under almost any workload points quite squarely at these regions of code as being incredibly hot (each being ~40% of cycles):

miri/src/stacked_borrows.rs, lines 259 to 269 (at dadcbeb):

```rust
self.borrows
    .iter()
    .enumerate() // we also need to know *where* in the stack
    .rev() // search top-to-bottom
    // Return permission of first item that grants access.
    // We require a permission with the right tag, ensuring U3 and F3.
    .find_map(
        |(idx, item)| {
            if tag == item.tag && item.perm.grants(access) { Some(idx) } else { None }
        },
    )
```

miri/src/stacked_borrows.rs, lines 362 to 369 (at dadcbeb):

```rust
for idx in ((granting_idx + 1)..self.borrows.len()).rev() {
    let item = &mut self.borrows[idx];
    if item.perm == Permission::Unique {
        trace!("access: disabling item {:?}", item);
        Stack::check_protector(item, Some(tag), global)?;
        item.perm = Permission::Disabled;
    }
}
```

This code is one of at least three reasons that stacked borrows analysis is super-linear: these loops are both linear in the number of borrows in the stack, and they sit along the most commonly-taken paths.

I'm addressing the first loop (which is in Stack::find_granting) by adding a very simple LRU-like cache implemented on a VecDeque, which maps recently-looked-up tags to their positions in the stack. For Untagged accesses we fall back to the same sort of linear search, but as far as I can tell there are never enough Untagged items for that to be significant.
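As a rough sketch of that idea (the field and method names are hypothetical, and the tag is shown as a plain u64 standing in for Miri's tag type), the cache can be pictured like this:

```rust
use std::collections::VecDeque;

/// Hypothetical sketch of a tiny LRU-style cache mapping a borrow tag to
/// its position in the borrow stack; not the PR's actual code.
const CACHE_LEN: usize = 32;

#[derive(Default)]
struct TagCache {
    /// Most recently used entries sit at the front.
    entries: VecDeque<(u64, usize)>, // (tag, index into the borrow stack)
}

impl TagCache {
    /// Look up a tag; on a hit, move its entry to the front (LRU behavior).
    fn lookup(&mut self, tag: u64) -> Option<usize> {
        let pos = self.entries.iter().position(|&(t, _)| t == tag)?;
        let (_, idx) = self.entries.remove(pos)?;
        self.entries.push_front((tag, idx));
        Some(idx)
    }

    /// Record the result of a linear search, evicting the least recently
    /// used entry when the cache is full.
    fn insert(&mut self, tag: u64, idx: usize) {
        if self.entries.len() == CACHE_LEN {
            self.entries.pop_back();
        }
        self.entries.push_front((tag, idx));
    }
}
```

On a miss, the existing linear scan in find_granting runs and its result is inserted; any operation that reorders or truncates the stack then has to invalidate or repair the cached indices, which is where most of the care goes.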

I'm addressing the second loop by keeping track of the region of stack where there could be items granting Permission::Unique. This optimization is incredibly effective because Read access tends to dominate and many trips through this code path now skip the loop entirely.
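A minimal sketch of that bookkeeping, again with hypothetical names: keep a half-open index range outside of which no Permission::Unique item can exist, and intersect it with the part of the stack above the granting item.

```rust
/// Hypothetical: the region of the borrow stack that may contain
/// `Permission::Unique` items; everything outside it can be skipped
/// when a read access disables Uniques.
struct UniqueRange {
    start: usize,
    end: usize, // exclusive
}

impl UniqueRange {
    fn empty() -> Self {
        UniqueRange { start: 0, end: 0 }
    }

    /// A Unique item was pushed at `idx`: grow the range to cover it.
    fn on_push_unique(&mut self, idx: usize) {
        if self.start == self.end {
            self.start = idx;
        }
        self.end = idx + 1;
    }

    /// Indices that a read access through `granting_idx` actually has to
    /// scan; empty whenever no Uniques can sit above the granting item.
    fn to_scan(&self, granting_idx: usize, stack_len: usize) -> std::ops::Range<usize> {
        let lo = self.start.max(granting_idx + 1);
        let hi = self.end.min(stack_len);
        lo..hi.max(lo)
    }
}
```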

These optimizations result in pretty enormous improvements:
Without raw pointer tagging, mse 34.5s -> 2.4s, serde1 5.6s -> 3.6s
With raw pointer tagging, mse 35.3s -> 2.4s, serde1 5.7s -> 3.6s

And there is hardly any impact on memory usage:
Memory usage on mse 844 MB -> 848 MB, serde1 184 MB -> 184 MB (jitter on these is a few MB).

@RalfJung (Member) commented Dec 16, 2021

Thanks a lot for this PR! Those are some impressive wins given the conceptually simple changes. :)

I'm addressing the second loop by keeping track of the region of stack where there could be items granting Permission::Unique. This optimization is incredibly effective because Read access tends to dominate and many trips through this code path now skip the loop entirely.

Presumably this also has basically no memory cost, so I wonder how much just doing this would help?

But as far as I can tell there are never enough Untagged items to be significant.

So could we save some memory by removing this part?

These optimizations result in a 30% improvement in the serde1 benchmark and an 85% (~6x) improvement in the mse benchmark, which is pretty great because I didn't even develop these changes by running against those programs.

Note that I have no idea how suited these benchmarks are. It might be good to do at least superficial benchmarking of the code in tests/run-pass/stacked-borrows to avoid surprises for more "normal" code.

Also measuring with default flags to ensure that doesn't regress (or cost a lot of memory for no gain) would be good.

But as far as I can tell from some very casual instrumenting, the root cause of the stacked borrows memory usage is that borrow stacks grow over time without bound and a RangeMap will keep duplicating Stacks. The Clone+PartialEq activity that this generates eventually dominates as stacks grow, because that's O(n) where the HashMap optimization here gets darn close to O(1) for all other operations. I think it may be possible to ease this by nesting stacks, something like parent: Option<Rc>, to avoid the expensive Clone operation until we actually need to modify the older elements.

That was one of my hypotheses for the main culprit. The other one is that even if RangeMap is fitting to the access pattern (so we don't keep splitting and merging subranges), just the fact that the stacks keep growing will be expensive in the long run -- even if we have perfect sharing, that's still quadratic since we keep iterating a larger and larger vector.

@saethlin (Member, Author)

Thanks a lot for this PR! Those are some impressive wins given the conceptually simple changes. :)

Hah! I was afraid these would look prohibitively complex.

So could we save some memory by removing this part?

We could, if we always tag raw pointers. Without -Zmiri-tag-raw-pointers, this data structure returns indexes in find_granting and we need the right index to invalidate the right part of the stack. Or at least that's what I've been able to glean so far.

As a sidenote: This is definitely the reason -Zmiri-tag-raw-pointers is faster with this PR. Without it, there are still linear searches of the untagged stack in find_granting, and the number of Untagged borrows is up to about half that of the Tagged ones in large stacks.

Note that I have no idea how suited these benchmarks are.

Yeah. I'll cook up a little script to run through some more reasonable benchmarking code with the various SB options.

just the fact that the stacks keep growing will be expensive in the long run -- even if we have perfect sharing, that's still quadratic since we keep iterating a larger and larger vector.

I suspect all iteration can be efficiently amortized out. If the parent stack from a split is stored in an Rc/Arc, we can use ptr_eq to do a constant-time parent comparison. But there are also other questions about how that implementation turns out on a real workload (I haven't even started on it yet).
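A minimal sketch of that idea (purely hypothetical, and not something this PR implements): share the older items behind an Rc so clones are cheap, and let comparisons short-circuit when both stacks literally share the same parent allocation.

```rust
use std::rc::Rc;

/// Hypothetical copy-on-write layout: older items are shared via `parent`,
/// newer items live in `top`.
#[derive(Clone)]
struct SharedStack<T> {
    parent: Option<Rc<SharedStack<T>>>,
    top: Vec<T>,
}

impl<T: PartialEq> SharedStack<T> {
    /// Conservative equality: `Rc::ptr_eq` is a constant-time pointer
    /// comparison, so this only says "equal" when both parents are the
    /// same allocation; structurally equal but unshared parents would
    /// still need a deep comparison.
    fn shallow_eq(&self, other: &Self) -> bool {
        let parents_shared = match (&self.parent, &other.parent) {
            (Some(a), Some(b)) => Rc::ptr_eq(a, b),
            (None, None) => true,
            _ => false,
        };
        parents_shared && self.top == other.top
    }
}
```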

@RalfJung (Member)

We could, if we always tag raw pointers. Without -Zmiri-tag-raw-pointers, this data structure returns indexes in find_granting and we need the right index to invalidate the right part of the stack. Or at least that's what I've been able to glean so far.

Well, we could always use a linear scan instead, right? That will only slow down raw ptr accesses, and save some memory. This might or might not be a reasonable trade-off, not sure.

Without it, there are still linear searches of the untagged stack in find_granting, and the number of Untagged borrows is up to about half that of the Tagged ones in large stacks.

That is odd, since the search should stop at the first untagged item it finds -- except when it is looking for a writeable item, then it will skip all the read-only untagged items.

@saethlin (Member, Author)

That is odd, since the search should stop at the first untagged item it finds

You're right. I don't know what I was looking at before, but this is a good insight. Nearly every access of an Untagged item wants the topmost Untagged item (easily demonstrated with some println! and cargo miri run | counts). Removing the Vec of untagged indices gives a ~3% improvement in RSS on mse and no change on serde1. But I see a ~50% improvement in the runtime of mse and a similarly-large regression on serde1, which is itself interesting: some workloads are very Untagged-heavy but don't have an Untagged item near the top of the stack.

So I of course slapped a trivial cache on it, and now it's an improvement across the board.
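For reference, "a trivial cache" here could be as small as remembering where the last Untagged lookup landed and trying that slot first; the sketch below is only illustrative, not the exact change on the branch.

```rust
/// Hypothetical single-entry cache for Untagged lookups.
struct UntaggedCache {
    last_idx: Option<usize>,
}

impl UntaggedCache {
    /// `matches` decides whether the item at a given stack index is an
    /// Untagged item granting the requested access.
    fn find(&mut self, matches: impl Fn(usize) -> bool, stack_len: usize) -> Option<usize> {
        // Fast path: the previously found slot still satisfies the query.
        if let Some(idx) = self.last_idx {
            if idx < stack_len && matches(idx) {
                return Some(idx);
            }
        }
        // Slow path: the usual top-to-bottom linear scan.
        let found = (0..stack_len).rev().find(|&idx| matches(idx));
        self.last_idx = found;
        found
    }
}
```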

I was planning on using core as my benchmarks suite, but of course the join test is failing 😢 so I'm going to collect some interesting cases from third-party crates I've run cargo miri test on.

@RalfJung (Member)

I was planning on using core as my benchmarks suite, but of course the join test is failing

rust-lang/rust#92121 should help with that.

@bors (Contributor) commented Dec 21, 2021

☔ The latest upstream changes (presumably #1945) made this pull request unmergeable. Please resolve the merge conflicts.

@saethlin (Member, Author) commented Jan 7, 2022

I'm still slowly working on this.

I tried running miri-test-libstd over core with -Zmiri-tag-raw-pointers and I killed the process after 2 hours (using 20 GB). It spent over an hour doing something after printing

Failures:

So I have absolutely no idea what was going on there. It would be cool to have a way to peer inside miri and see what code it is executing; my best guess is that something in the test harness ends up quadratic or worse inside miri due to the nature of stacked borrows.

I also tried running the tests for core with this branch, and the memory usage started growing rapidly at the same stage in the run; I killed the process when it hit 100 GB. Neither of these runs completed, so I don't think there is a 5x memory blowup from this PR; it seems likely that memory would have only grown to this level if I'd let the current tip of master run long enough.

At the moment I'm chewing on ideas to do something about this. I don't think a patch that makes running the standard library tests with tag-raw-pointers a non-starter is a good idea.

@RalfJung (Member) commented Jan 8, 2022

It would be cool to have a way to peer inside miri and see what code it is executing

Yeah that would be cool... Cc #1782.
The issue you describe sounds familiar as well: #1780. However it is odd that one of the tests would fail... Still, #1780 (comment) is probably interesting for you.

@RalfJung added the S-waiting-on-author label (Status: Waiting for the PR author to address review comments) on Mar 5, 2022
@RalfJung (Member) commented Mar 5, 2022

I'm still slowly working on this.

Marking as "waiting for author" based on above comment. Please let me know if that is inaccurate. :)

@saethlin (Member, Author) commented Mar 6, 2022

That is accurate.

I'm currently focusing on other things, but I'm becoming increasingly convinced that the only good way out of this borrow stack growth situation is to GC the borrow stacks. But even if that works, it would require that Miri be notified when a pointer or reference goes away, which of course bumps up against ptr-int-ptr casts. So overall I feel like I don't have any ideas which are easy enough to implement that they rise up my priority list, and the hard ideas I have are stepping right into the most unstable parts of Miri.

@bors (Contributor) commented May 14, 2022

☔ The latest upstream changes (presumably #2030) made this pull request unmergeable. Please resolve the merge conflicts.

@saethlin changed the title from "Optimizing Stacked Borrows (part 1?): Remove linear searches of borrow stacks" to "Optimizing Stacked Borrows (part 1?): Cache locations of Tags in a Borrow Stack" on May 14, 2022
@saethlin (Member, Author) commented May 14, 2022

I've been thinking about this problem for a while, and I finally decided to code up my idea for a super simple limited-size cache. I've rebased in the new implementation and updated the top comment.

This is freaking awesome, can't believe I didn't try this approach before. I think there are a lot of clear ways that this could be micro-optimized, but what I'm seeing in profiles makes that work hard to justify. My biggest concern at this point is code organization.

@saethlin (Member, Author)

@RalfJung @oli-obk this is not waiting-on-author anymore, it's looking-for-advice-on-factoring :)

@saethlin (Member, Author)

Examples where this doesn't help as much:

regex (regex==1.5.5): the stack merging code seems to be really active here.

unicode-xid (unicode-xid==0.2.3): the cache hit rate in find_granting is still really high, but we keep having to do searches of tens of thousands of tags. I bumped the cache size to 32 with no impact; unclear if making it super big would help.

@RalfJung added the S-waiting-on-review label (Status: Waiting for a review to complete) and removed the S-waiting-on-author label on May 15, 2022
@RalfJung (Member)

My biggest concern at this point is code organization.

Yeah, I agree. As a starter, actually documenting the data structure invariants and how exactly the cache relates to the "live" data would be good. :) Right now, to review this I would have to reverse engineer all that. In fact, I think we should have an actual Rust function that checks this invariant, i.e., that checks whether the cache is consistent with the "live" data, and we should have a compile-time flag to enable run-time checking that the cache does not get out-of-sync with the live data. Such a machine-readable invariant would probably also help clarify/complement the human-readable comments.

As for factoring things, the first "obvious" proposal would be to move Stack to its own file and make sure the public API is such that caching can never be broken. Is that possible?
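A sketch of what such a machine-checkable invariant could look like (the types are cut down for illustration, and the "expensive-debug-assertions" feature name is made up):

```rust
/// Cut-down stand-ins for the real types in src/stacked_borrows.rs.
struct Item {
    tag: u64,
}
struct Stack {
    borrows: Vec<Item>,
    /// Hypothetical cache of (tag, index) pairs, most recently used first.
    cache: Vec<(u64, usize)>,
}

impl Stack {
    /// Check that the cache is consistent with the live data: every cached
    /// (tag, index) pair must point at a stack item carrying that tag.
    /// Compiled in only behind an opt-in flag so normal runs stay fast.
    #[cfg(feature = "expensive-debug-assertions")]
    fn verify_cache_consistency(&self) {
        for &(tag, idx) in &self.cache {
            assert!(idx < self.borrows.len(), "cached index is out of bounds");
            assert_eq!(self.borrows[idx].tag, tag, "cached index points at the wrong tag");
        }
    }
}
```

Calling such a check at the end of every mutating Stack method (under the flag) would catch any place where the cache drifts out of sync with the live data.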

@RalfJung added the S-waiting-on-author label (Status: Waiting for the PR author to address review comments) and removed the S-waiting-on-review label on May 18, 2022
@bors (Contributor) commented Jul 1, 2022

☔ The latest upstream changes (presumably #2297) made this pull request unmergeable. Please resolve the merge conflicts.

This adds a very simple LRU-like cache which stores the locations of
often-used tags. While the implementation is very simple, the cache hit
rate is incredible at ~99.9% on most programs, and often the element at
position 0 in the cache has a hit rate of 90%. So the sub-optimality of
this cache basically vanishes into the noise in a profile.

Additionally, we keep a range which denotes where there might be an item
granting Unique permission in the stack, so that when we invalidate
Uniques we do not need to scan much of the stack, and often scan nothing
at all.
@saethlin (Member, Author) commented Jul 2, 2022

I'm getting denied by clippy:

error: this returns a Result<_, ()>

But isn't that same signature currently on the master branch? I'm happy to add an allow here, but it feels odd to be adding the allow in this PR.
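If the Result<_, ()> signature is intentional, a narrowly scoped allow is one option; this assumes the lint is clippy's result_unit_err (the message quoted above matches it), and the function below is just a placeholder:

```rust
// Hypothetical placeholder showing a targeted allow for the
// `result_unit_err` lint, rather than a crate-wide one.
#[allow(clippy::result_unit_err)]
pub fn check_something(ok: bool) -> Result<(), ()> {
    if ok { Ok(()) } else { Err(()) }
}
```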

@RalfJung (Member) commented Jul 2, 2022

Other than my last rounds of comments, I think this is good to go. :)

@RalfJung (Member) commented Jul 3, 2022

Stacked Borrows go brrrrrr
@bors r+

@bors (Contributor) commented Jul 3, 2022

📌 Commit b004a03 has been approved by RalfJung

@bors (Contributor) commented Jul 3, 2022

⌛ Testing commit b004a03 with merge cfad9d1...

@bors (Contributor) commented Jul 3, 2022

☀️ Test successful - checks-actions
Approved by: RalfJung
Pushing cfad9d1 to master...

@bors merged commit cfad9d1 into rust-lang:master on Jul 3, 2022
@saethlin deleted the optimize-sb branch on July 13, 2022