
Optimizing Stacked Borrows (part 1?): Cache locations of Tags in a Borrow Stack #1935

Merged · 5 commits · Jul 3, 2022

Conversation

@saethlin (Member) commented Dec 12, 2021

Before this PR, a profile of Miri under almost any workload points quite squarely at these regions of code as being incredibly hot (each being ~40% of cycles):

miri/src/stacked_borrows.rs, lines 259 to 269 (at dadcbeb):

```rust
self.borrows
    .iter()
    .enumerate() // we also need to know *where* in the stack
    .rev() // search top-to-bottom
    // Return permission of first item that grants access.
    // We require a permission with the right tag, ensuring U3 and F3.
    .find_map(
        |(idx, item)| {
            if tag == item.tag && item.perm.grants(access) { Some(idx) } else { None }
        },
    )
```

miri/src/stacked_borrows.rs, lines 362 to 369 (at dadcbeb):

```rust
for idx in ((granting_idx + 1)..self.borrows.len()).rev() {
    let item = &mut self.borrows[idx];
    if item.perm == Permission::Unique {
        trace!("access: disabling item {:?}", item);
        Stack::check_protector(item, Some(tag), global)?;
        item.perm = Permission::Disabled;
    }
}
```

This code is one of at least three reasons that stacked borrows analysis is super-linear: these loops are both linear in the number of borrows in the stack, and they sit along the most commonly-taken paths.

I'm addressing the first loop (which is in Stack::find_granting) by adding a very simple LRU-like cache implemented on a VecDeque, which maps recently-looked-up tags to their positions in the stack. For Untagged accesses we fall back to the same sort of linear search, but as far as I can tell there are never enough Untagged items for that to be significant.
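As a rough sketch of that idea (the field and method names are hypothetical, and the tag is shown as a plain u64 standing in for Miri's tag type), the cache can be pictured like this:

```rust
use std::collections::VecDeque;

/// Hypothetical sketch of a tiny LRU-style cache mapping a borrow tag to
/// its position in the borrow stack; not the PR's actual code.
const CACHE_LEN: usize = 32;

#[derive(Default)]
struct TagCache {
    /// Most recently used entries sit at the front.
    entries: VecDeque<(u64, usize)>, // (tag, index into the borrow stack)
}

impl TagCache {
    /// Look up a tag; on a hit, move its entry to the front (LRU behavior).
    fn lookup(&mut self, tag: u64) -> Option<usize> {
        let pos = self.entries.iter().position(|&(t, _)| t == tag)?;
        let (_, idx) = self.entries.remove(pos)?;
        self.entries.push_front((tag, idx));
        Some(idx)
    }

    /// Record the result of a linear search, evicting the least recently
    /// used entry when the cache is full.
    fn insert(&mut self, tag: u64, idx: usize) {
        if self.entries.len() == CACHE_LEN {
            self.entries.pop_back();
        }
        self.entries.push_front((tag, idx));
    }
}
```

On a miss, the existing linear scan in find_granting runs and its result is inserted; any operation that reorders or truncates the stack then has to invalidate or repair the cached indices, which is where most of the care goes.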

I'm addressing the second loop by keeping track of the region of stack where there could be items granting Permission::Unique. This optimization is incredibly effective because Read access tends to dominate and many trips through this code path now skip the loop entirely.
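A minimal sketch of that bookkeeping, again with hypothetical names: keep a half-open index range outside of which no Permission::Unique item can exist, and intersect it with the part of the stack above the granting item.

```rust
/// Hypothetical: the region of the borrow stack that may contain
/// `Permission::Unique` items; everything outside it can be skipped
/// when a read access disables Uniques.
struct UniqueRange {
    start: usize,
    end: usize, // exclusive
}

impl UniqueRange {
    fn empty() -> Self {
        UniqueRange { start: 0, end: 0 }
    }

    /// A Unique item was pushed at `idx`: grow the range to cover it.
    fn on_push_unique(&mut self, idx: usize) {
        if self.start == self.end {
            self.start = idx;
        }
        self.end = idx + 1;
    }

    /// Indices that a read access through `granting_idx` actually has to
    /// scan; empty whenever no Uniques can sit above the granting item.
    fn to_scan(&self, granting_idx: usize, stack_len: usize) -> std::ops::Range<usize> {
        let lo = self.start.max(granting_idx + 1);
        let hi = self.end.min(stack_len);
        lo..hi.max(lo)
    }
}
```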

These optimizations result in pretty enormous improvements:
Without raw pointer tagging, mse 34.5s -> 2.4s, serde1 5.6s -> 3.6s
With raw pointer tagging, mse 35.3s -> 2.4s, serde1 5.7s -> 3.6s

And there is hardly any impact on memory usage:
Memory usage on mse 844 MB -> 848 MB, serde1 184 MB -> 184 MB (jitter on these is a few MB).

@RalfJung (Member) commented Dec 16, 2021

Thanks a lot for this PR! Those are some impressive wins given the conceptually simple changes. :)

I'm addressing the second loop by keeping track of the region of stack where there could be items granting Permission::Unique. This optimization is incredibly effective because Read access tends to dominate and many trips through this code path now skip the loop entirely.

Presumably this also has basically no memory cost, so I wonder how much just doing this would help?

But as far as I can tell there are never enough Untagged items to be significant.

So could we save some memory by removing this part?

These optimizations result in a 30% improvement in the serde1 benchmark and an 85% (~6x) improvement in the mse benchmark, which is pretty great because I didn't even develop these changes by running against those programs.

Note that I have no idea how suited these benchmarks are. It might be good to do at least superficial benchmarking of the code in tests/run-pass/stacked-borrows to avoid surprises for more "normal" code.

Also measuring with default flags to ensure that doesn't regress (or cost a lot of memory for no gain) would be good.

But as far as I can tell from some very casual instrumenting, the root cause of the stacked borrows memory usage is that borrow stacks grow over time without bound and a RangeMap will keep duplicating Stacks. The Clone+PartialEq activity that this generates eventually dominates as stacks grow, because that's O(n) where the HashMap optimization here gets darn close to O(1) for all other operations. I think it may be possible to ease this by nesting stacks, something like parent: Option<Rc>, to avoid the expensive Clone operation until we actually need to modify the older elements.

That was one of my hypotheses for the main culprit. The other one is that even if RangeMap is fitting to the access pattern (so we don't keep splitting and merging subranges), just the fact that the stacks keep growing will be expensive in the long run -- even if we have perfect sharing, that's still quadratic since we keep iterating a larger and larger vector.

@saethlin (Member, Author)

Thanks a lot for this PR! Those are some impressive wins given the conceptually simple changes. :)

Hah! I was afraid these would look prohibitively complex.

So could we save some memory by removing this part?

We could, if we always tag raw pointers. Without -Zmiri-tag-raw-pointers, this data structure returns indexes in find_granting and we need the right index to invalidate the right part of the stack. Or at least that's what I've been able to glean so far.

As a sidenote: This is definitely the reason -Zmiri-tag-raw-pointers is faster with this PR. Without it, there are still linear searches of the untagged stack in find_granting, and the number of Untagged borrows is up to about half that of the Tagged ones in large stacks.

Note that I have no idea how suited these benchmarks are.

Yeah. I'll cook up a little script to run through some more reasonable benchmarking code with the various SB options.

just the fact that the stacks keep growing will be expensive in the long run -- even if we have perfect sharing, that's still quadratic since we keep iterating a larger and larger vector.

I suspect all iteration can be efficiently amortized out. If the parent stack from a split is stored in an Rc/Arc, we can use ptr_eq to do a constant-time parent comparison. But there are also other questions about how that implementation turns out on a real workload (I haven't even started on it yet).
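A minimal sketch of that idea (purely hypothetical, and not something this PR implements): share the older items behind an Rc so clones are cheap, and let comparisons short-circuit when both stacks literally share the same parent allocation.

```rust
use std::rc::Rc;

/// Hypothetical copy-on-write layout: older items are shared via `parent`,
/// newer items live in `top`.
#[derive(Clone)]
struct SharedStack<T> {
    parent: Option<Rc<SharedStack<T>>>,
    top: Vec<T>,
}

impl<T: PartialEq> SharedStack<T> {
    /// Conservative equality: `Rc::ptr_eq` is a constant-time pointer
    /// comparison, so this only says "equal" when both parents are the
    /// same allocation; structurally equal but unshared parents would
    /// still need a deep comparison.
    fn shallow_eq(&self, other: &Self) -> bool {
        let parents_shared = match (&self.parent, &other.parent) {
            (Some(a), Some(b)) => Rc::ptr_eq(a, b),
            (None, None) => true,
            _ => false,
        };
        parents_shared && self.top == other.top
    }
}
```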

@RalfJung (Member)

We could, if we always tag raw pointers. Without -Zmiri-tag-raw-pointers, this data structure returns indexes in find_granting and we need the right index to invalidate the right part of the stack. Or at least that's what I've been able to glean so far.

Well, we could always use a linear scan instead, right? That will only slow down raw ptr accesses, and save some memory. This might or might not be a reasonable trade-off, not sure.

Without it, there are still linear searches of the untagged stack in find_granting, and the number of Untagged borrows is up to about half that of the Tagged ones in large stacks.

That is odd, since the search should stop at the first untagged item it finds -- except when it is looking for a writeable item, then it will skip all the read-only untagged items.

@saethlin (Member, Author)

That is odd, since the search should stop at the first untagged item it finds

You're right. I don't know what I was looking at before, but this is a good insight. Nearly every access of an Untagged item wants the topmost Untagged item (easily demonstrated with some println! and cargo miri run | counts). Removing the Vec of untagged indices gives a ~3% improvement in RSS on mse and no change on serde1. But I see a ~50% improvement in the runtime of mse and a similarly-large regression on serde1, which is itself interesting: some workloads are very Untagged-heavy but don't have an Untagged item near the top of the stack.

So I of course slapped a trivial cache on it, and now it's an improvement across the board.
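For reference, "a trivial cache" here could be as small as remembering where the last Untagged lookup landed and trying that slot first; the sketch below is only illustrative, not the exact change on the branch.

```rust
/// Hypothetical single-entry cache for Untagged lookups.
struct UntaggedCache {
    last_idx: Option<usize>,
}

impl UntaggedCache {
    /// `matches` decides whether the item at a given stack index is an
    /// Untagged item granting the requested access.
    fn find(&mut self, matches: impl Fn(usize) -> bool, stack_len: usize) -> Option<usize> {
        // Fast path: the previously found slot still satisfies the query.
        if let Some(idx) = self.last_idx {
            if idx < stack_len && matches(idx) {
                return Some(idx);
            }
        }
        // Slow path: the usual top-to-bottom linear scan.
        let found = (0..stack_len).rev().find(|&idx| matches(idx));
        self.last_idx = found;
        found
    }
}
```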

I was planning on using core as my benchmarks suite, but of course the join test is failing 😢 so I'm going to collect some interesting cases from third-party crates I've run cargo miri test on.

@RalfJung (Member)

I was planning on using core as my benchmarks suite, but of course the join test is failing

rust-lang/rust#92121 should help with that.

@bors (Contributor) commented Dec 21, 2021

☔ The latest upstream changes (presumably #1945) made this pull request unmergeable. Please resolve the merge conflicts.

@saethlin (Member, Author) commented Jan 7, 2022

I'm still slowly working on this.

I tried running miri-test-libstd over core with -Zmiri-tag-raw-pointers and I killed the process after 2 hours (using 20 GB). It spent over an hour doing something after printing

Failures:

So I have absolutely no idea what was going on there. It would be cool to have a way to peer inside miri and see what code it is executing; my best guess is that something in the test harness ends up quadratic or worse inside miri due to the nature of stacked borrows.

I also tried running the tests for core with this branch, and the memory usage started growing rapidly at the same stage in the run; I killed the process when it hit 100 GB. Neither of these runs completed, so I don't think there is a 5x memory blowup from this PR; it seems likely that memory would have only grown to this level if I'd let the current tip of master run long enough.

At the moment I'm chewing on ideas to do something about this. I don't think a patch that makes running the standard library tests with tag-raw-pointers a non-starter is a good idea.

@RalfJung (Member) commented Jan 8, 2022

It would be cool to have a way to peer inside miri and see what code it is executing

Yeah that would be cool... Cc #1782.
The issue you describe sounds familiar as well: #1780. However it is odd that one of the tests would fail... Still, #1780 (comment) is probably interesting for you.

@RalfJung added the S-waiting-on-author label (Status: Waiting for the PR author to address review comments) on Mar 5, 2022
@RalfJung (Member) commented Mar 5, 2022

I'm still slowly working on this.

Marking as "waiting for author" based on above comment. Please let me know if that is inaccurate. :)

@saethlin (Member, Author) commented Mar 6, 2022

That is accurate.

I'm currently focusing on other things, but I'm becoming increasingly convinced that the only good way out of this borrow stack growth situation is to GC the borrow stacks. But even if that works, it would require that Miri be notified when a pointer or reference goes away, which of course bumps up against ptr-int-ptr casts. So overall I feel like I don't have any ideas which are easy enough to implement that they rise up my priority list, and the hard ideas I have are stepping right into the most unstable parts of Miri.

@bors (Contributor) commented May 14, 2022

☔ The latest upstream changes (presumably #2030) made this pull request unmergeable. Please resolve the merge conflicts.

@saethlin changed the title from "Optimizing Stacked Borrows (part 1?): Remove linear searches of borrow stacks" to "Optimizing Stacked Borrows (part 1?): Cache locations of Tags in a Borrow Stack" on May 14, 2022
@saethlin (Member, Author) commented May 14, 2022

I've been thinking about this problem for a while, and I finally decided to code up my idea for a super simple limited-size cache. I've rebased in the new implementation and updated the top comment.

This is freaking awesome, can't believe I didn't try this approach before. I think there are a lot of clear ways that this could be micro-optimized, but what I'm seeing in profiles makes that work hard to justify. My biggest concern at this point is code organization.

@saethlin (Member, Author)

@RalfJung @oli-obk this is not waiting-on-author anymore, it's looking-for-advice-on-factoring :)

@saethlin (Member, Author)

Examples where this doesn't help as much:

regex (regex==1.5.5): the stack merging code seems to be really active here.

unicode-xid (unicode-xid==0.2.3): the cache hit rate in find_granting is still really high, but we keep having to do searches of tens of thousands of tags. I bumped the cache size to 32 with no impact; unclear if making it super big would help.

@RalfJung added the S-waiting-on-review label (Status: Waiting for a review to complete) and removed the S-waiting-on-author label on May 15, 2022
@RalfJung (Member)

My biggest concern at this point is code organization.

Yeah, I agree. As a starter, actually documenting the data structure invariants and how exactly the cache relates to the "live" data would be good. :) Right now, to review this I would have to reverse engineer all that. In fact, I think we should have an actual Rust function that checks this invariant, i.e., that checks whether the cache is consistent with the "live" data, and we should have a compile-time flag to enable run-time checking that the cache does not get out-of-sync with the live data. Such a machine-readable invariant would probably also help clarify/complement the human-readable comments.

As for factoring things, the first "obvious" proposal would be to move Stack to its own file and make sure the public API is such that caching can never be broken. Is that possible?
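A sketch of what such a machine-checkable invariant could look like (the types are cut down for illustration, and the "expensive-debug-assertions" feature name is made up):

```rust
/// Cut-down stand-ins for the real types in src/stacked_borrows.rs.
struct Item {
    tag: u64,
}
struct Stack {
    borrows: Vec<Item>,
    /// Hypothetical cache of (tag, index) pairs, most recently used first.
    cache: Vec<(u64, usize)>,
}

impl Stack {
    /// Check that the cache is consistent with the live data: every cached
    /// (tag, index) pair must point at a stack item carrying that tag.
    /// Compiled in only behind an opt-in flag so normal runs stay fast.
    #[cfg(feature = "expensive-debug-assertions")]
    fn verify_cache_consistency(&self) {
        for &(tag, idx) in &self.cache {
            assert!(idx < self.borrows.len(), "cached index is out of bounds");
            assert_eq!(self.borrows[idx].tag, tag, "cached index points at the wrong tag");
        }
    }
}
```

Calling such a check at the end of every mutating Stack method (under the flag) would catch any place where the cache drifts out of sync with the live data.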

@RalfJung added the S-waiting-on-author label (Status: Waiting for the PR author to address review comments) and removed the S-waiting-on-review label on May 18, 2022
@bors (Contributor) commented Jul 1, 2022

☔ The latest upstream changes (presumably #2297) made this pull request unmergeable. Please resolve the merge conflicts.

This adds a very simple LRU-like cache which stores the locations of
often-used tags. While the implementation is very simple, the cache hit
rate is incredible at ~99.9% on most programs, and often the element at
position 0 in the cache has a hit rate of 90%. So the sub-optimality of
this cache basically vanishes into the noise in a profile.

Additionally, we keep a range which denotes where there might be an item
granting Unique permission in the stack, so that when we invalidate
Uniques we do not need to scan much of the stack, and often scan nothing
at all.
@saethlin (Member, Author) commented Jul 2, 2022

I'm getting denied by clippy:

error: this returns a Result<_, ()>

But isn't that same signature currently on the master branch? I'm happy to add an allow here, but it feels odd to be adding the allow in this PR.
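If the Result<_, ()> signature is intentional, a narrowly scoped allow is one option; this assumes the lint is clippy's result_unit_err (the message quoted above matches it), and the function below is just a placeholder:

```rust
// Hypothetical placeholder showing a targeted allow for the
// `result_unit_err` lint, rather than a crate-wide one.
#[allow(clippy::result_unit_err)]
pub fn check_something(ok: bool) -> Result<(), ()> {
    if ok { Ok(()) } else { Err(()) }
}
```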

@RalfJung (Member) commented Jul 2, 2022

Other than my last rounds of comments, I think this is good to go. :)

@RalfJung (Member) commented Jul 3, 2022

Stacked Borrows go brrrrrr
@bors r+

@bors (Contributor) commented Jul 3, 2022

📌 Commit b004a03 has been approved by RalfJung

@bors (Contributor) commented Jul 3, 2022

⌛ Testing commit b004a03 with merge cfad9d1...

@bors (Contributor) commented Jul 3, 2022

☀️ Test successful - checks-actions
Approved by: RalfJung
Pushing cfad9d1 to master...

@bors merged commit cfad9d1 into rust-lang:master on Jul 3, 2022
@saethlin deleted the optimize-sb branch on July 13, 2022