Introduce `ChunkedBitSet` and use it for some dataflow analyses. #93984

nnethercote · 2022-02-14T05:26:58Z

This reduces peak memory usage significantly for some programs with very
large functions.

nnethercote · 2022-02-14T05:28:14Z

This will be a complex one to evaluate, because the metrics contradict each other: some improve, some worsen. Anyway, let's get a perf run started to serve as the basis for any discussion:

@bors try @rust-timer queue

rust-timer · 2022-02-14T05:28:15Z

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

nnethercote · 2022-02-14T20:24:37Z

@bors try

bors · 2022-02-14T20:25:03Z

⌛ Trying commit fd9228607b35e1094193fe9e423e356544b527f2 with merge 9eec9616e3c91ff7231a3fadff261984e83f251f...

bors · 2022-02-14T21:55:22Z

☀️ Try build successful - checks-actions
Build commit: 9eec9616e3c91ff7231a3fadff261984e83f251f (9eec9616e3c91ff7231a3fadff261984e83f251f)

rust-timer · 2022-02-14T21:55:24Z

Queued 9eec9616e3c91ff7231a3fadff261984e83f251f with parent 52dd59e, future comparison URL.

rust-timer · 2022-02-15T00:47:38Z

Finished benchmarking commit (9eec9616e3c91ff7231a3fadff261984e83f251f): comparison url.

Summary: This benchmark run shows 115 relevant regressions 😿 to instruction counts.

Average relevant regression: 1.1%
Largest regression in instruction counts: 5.5% on full builds of clap-rs check

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression

nnethercote · 2022-02-15T02:57:56Z

If you look at instruction counts, it looks bad. If you look at every other metric, it looks good. This is one of the cases where instruction counts are misleading, I think.

A big part of this: x86-64 has "string processing" instructions like REPE MOVS. They effectively implement a loop, and things like memcpy can be implemented with them. A single string processing instruction is counted as one instruction by hardware counters, which perf uses. But Cachegrind counts them as N instructions, where N is the number of repetitions. (See section 4.1 of this paper for details.) And this commit gets rid of a lot of memcpy calls within iterate_to_fixpoint. If we went by Cachegrind's instruction counts, rather than perf's instruction counts, things would look substantially better.
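To make the counting difference concrete, here is a toy model in Rust. It is only a sketch: the function names and the 8-bytes-per-repetition figure are assumptions for illustration, not how perf or Cachegrind are implemented.

```rust
/// Assumed bytes moved per repetition of a string instruction
/// (as for a quadword move); hypothetical, for illustration only.
const BYTES_PER_REP: u64 = 8;

/// Hardware-counter style: each string instruction counts once,
/// regardless of how many repetitions it performs.
fn hardware_style_count(copy_sizes: &[u64]) -> u64 {
    copy_sizes.len() as u64
}

/// Cachegrind style: each repetition counts as one instruction.
fn cachegrind_style_count(copy_sizes: &[u64]) -> u64 {
    copy_sizes
        .iter()
        .map(|bytes| (bytes + BYTES_PER_REP - 1) / BYTES_PER_REP)
        .sum()
}
```

Under this model, three copies of 4096, 4096 and 8192 bytes count as 3 instructions in the hardware style but 2048 in the Cachegrind style, so removing those copies barely moves the first number while it moves the second a lot.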

nnethercote · 2022-02-15T03:03:50Z

This also improves the http crate quite a bit, though not as much as keccak. Plus I have some draft code locally that uses Rc within chunks that reduces memory usage quite a bit more, mostly on keccak.
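For readers unfamiliar with the idea, here is a minimal sketch of a chunked bitset whose mixed chunks are held in an `Rc`. The type name, the 2048-bit chunk size, and the `Zeros`/`Ones`/`Mixed` layout are assumptions for illustration; the code in this PR may differ in all of them.

```rust
use std::rc::Rc;

/// Bits per chunk; a hypothetical value chosen for this sketch.
const CHUNK_BITS: usize = 2048;
const WORD_BITS: usize = 64;
const CHUNK_WORDS: usize = CHUNK_BITS / WORD_BITS;

#[derive(Clone)]
enum Chunk {
    /// Every bit is 0; no word storage is needed.
    Zeros,
    /// Every bit is 1; no word storage is needed.
    Ones,
    /// Some bits set; the words are reference-counted, so cloning
    /// the bitset shares them instead of copying them.
    Mixed(Rc<[u64; CHUNK_WORDS]>),
}

#[derive(Clone)]
struct SketchChunkedBitSet {
    domain_size: usize,
    chunks: Vec<Chunk>,
}

impl SketchChunkedBitSet {
    fn new_empty(domain_size: usize) -> Self {
        let num_chunks = (domain_size + CHUNK_BITS - 1) / CHUNK_BITS;
        Self { domain_size, chunks: vec![Chunk::Zeros; num_chunks] }
    }

    fn contains(&self, bit: usize) -> bool {
        assert!(bit < self.domain_size);
        let i = bit % CHUNK_BITS;
        match &self.chunks[bit / CHUNK_BITS] {
            Chunk::Zeros => false,
            Chunk::Ones => true,
            Chunk::Mixed(words) => words[i / WORD_BITS] & (1u64 << (i % WORD_BITS)) != 0,
        }
    }

    /// Sets `bit` and returns whether the set changed.
    /// Simplification: a chunk that becomes full is not promoted to `Ones`.
    fn insert(&mut self, bit: usize) -> bool {
        assert!(bit < self.domain_size);
        let i = bit % CHUNK_BITS;
        let mask = 1u64 << (i % WORD_BITS);
        let chunk = &mut self.chunks[bit / CHUNK_BITS];
        match chunk {
            Chunk::Ones => false,
            Chunk::Zeros => {
                let mut words = [0u64; CHUNK_WORDS];
                words[i / WORD_BITS] = mask;
                *chunk = Chunk::Mixed(Rc::new(words));
                true
            }
            Chunk::Mixed(words) => {
                if words[i / WORD_BITS] & mask != 0 {
                    return false;
                }
                // Copy-on-write: the words are cloned only if another
                // bitset still shares this chunk.
                Rc::make_mut(words)[i / WORD_BITS] |= mask;
                true
            }
        }
    }
}
```

In this sketch a clone costs one reference-count bump per mixed chunk, and uniform chunks cost nothing, which is the kind of saving that matters when a fixpoint iteration keeps one large bitset per basic block.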

nnethercote · 2022-02-15T03:27:31Z

BTW, this will fix #54208, if it ends up being merged.

Mark-Simulacrum · 2022-02-15T03:30:17Z

Not sure if draft status is intended to reflect some unreadiness, but assigning myself to at least take an initial look.

nnethercote · 2022-02-15T05:29:18Z

> Not sure if draft status is intended to reflect some unreadiness, but assigning myself to at least take an initial look.

Thanks! The code quality isn't great right now, there are numerous "njn:" comments indicating things that need to be fixed, and the Rc addition isn't uploaded yet. But I'm happy to hear early feedback :)

nnethercote · 2022-02-15T11:55:04Z

I have updated the code. The second commit contains the use of Rc.

@bors try @rust-timer queue

rust-timer · 2022-02-15T11:55:06Z

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

bors · 2022-02-15T11:55:11Z

⌛ Trying commit 3ba04f6462ebd682fb50c5b7123c50e0ee748f99 with merge 5a267675a45b7dcaef6afd739f55f3b207eaba3b...

Mark-Simulacrum

Left high-level commentary on data structures for now, didn't look at usage/impls.

compiler/rustc_index/src/bit_set.rs

bors · 2022-02-15T13:23:54Z

☀️ Try build successful - checks-actions
Build commit: 5a267675a45b7dcaef6afd739f55f3b207eaba3b (5a267675a45b7dcaef6afd739f55f3b207eaba3b)

rust-timer · 2022-02-15T13:23:56Z

Queued 5a267675a45b7dcaef6afd739f55f3b207eaba3b with parent 6655109, future comparison URL.

rust-timer · 2022-02-15T15:04:15Z

Finished benchmarking commit (5a267675a45b7dcaef6afd739f55f3b207eaba3b): comparison url.

Summary: This benchmark run shows 6 relevant improvements 🎉 but 121 relevant regressions 😿 to instruction counts.

Average relevant regression: 1.1%
Average relevant improvement: -3.8%
Largest improvement in instruction counts: -5.5% on full builds of keccak debug
Largest regression in instruction counts: 6.3% on full builds of clap-rs check

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression

rust-timer · 2022-02-18T05:56:11Z

Finished benchmarking commit (9003e92b6f7298995cc9772da0ec54c403a09a7e): comparison url.

Summary: This benchmark run shows 12 relevant improvements 🎉 but 105 relevant regressions 😿 to instruction counts.

Average relevant regression: 1.0%
Average relevant improvement: -2.8%
Largest improvement in instruction counts: -5.4% on full builds of keccak debug
Largest regression in instruction counts: 6.1% on full builds of clap-rs check

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression

nnethercote · 2022-02-18T06:35:12Z

> Finished benchmarking commit (9003e92): comparison url.
>
> Summary: This benchmark run shows 12 relevant improvements 🎉 but 105 relevant regressions 😿 to instruction counts.

As mentioned here and here, this is a rare case where instruction counts are misleading. Instead look at wall-time and cycles (some big wins along with the usual noise) along with max-rss and faults (some massive wins along with the usual noise).

@rustbot label: +perf-regression-triaged

Mark-Simulacrum

Looks pretty good overall

compiler/rustc_index/src/bit_set.rs

Mark-Simulacrum · 2022-02-22T13:13:04Z

Mostly nits, r=me if you don't make major changes

This reduces peak memory usage significantly for some programs with very large functions, such as:
- `keccak`, `unicode_normalization`, and `match-stress-enum`, from the `rustc-perf` benchmark suite;
- `http-0.2.6` from crates.io.

The new type is used in the analyses where the bitsets can get huge (e.g. 10s of thousands of bits): `MaybeInitializedPlaces`, `MaybeUninitializedPlaces`, and `EverInitializedPlaces`.

Some refactoring was required in `rustc_mir_dataflow`. All existing analysis domains are either `BitSet` or a trivial wrapper around `BitSet`, and access in a few places is done via `Borrow<BitSet>` or `BorrowMut<BitSet>`. Now that some of these domains are `ChunkedBitSet`, that no longer works. So this commit replaces the `Borrow`/`BorrowMut` usage with a new trait `BitSetExt` containing the needed bitset operations. The impls just forward these to the underlying bitset type. This required fiddling with trait bounds in a few places.

The commit also:
- Moves `static_assert_size` from `rustc_data_structures` to `rustc_index` so it can be used in the latter; the former now re-exports it so existing users are unaffected.
- Factors out some common "clear excess bits in the final word" functionality in `bit_set.rs`.
- Uses `fill` in a few places instead of loops.
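The trait-based refactoring described in the commit message can be sketched roughly as follows. Everything here is hypothetical: `SketchBitSetExt`, its two methods, and the `Dense`/`Other`/`Wrapper` types are stand-ins, and the real `BitSetExt` may expose a different set of operations.

```rust
use std::collections::BTreeSet;

/// Hypothetical trait in the spirit of `BitSetExt`: just the
/// operations that analysis code needs from a domain.
trait SketchBitSetExt {
    fn contains(&self, bit: usize) -> bool;
    fn insert(&mut self, bit: usize) -> bool;
}

/// Stand-in for a dense bitset: one `u64` word per 64 bits.
struct Dense {
    words: Vec<u64>,
}

/// Stand-in for a differently represented set (not a real chunked layout).
struct Other {
    bits: BTreeSet<usize>,
}

impl SketchBitSetExt for Dense {
    fn contains(&self, bit: usize) -> bool {
        self.words[bit / 64] & (1u64 << (bit % 64)) != 0
    }
    fn insert(&mut self, bit: usize) -> bool {
        let mask = 1u64 << (bit % 64);
        let word = &mut self.words[bit / 64];
        let changed = *word & mask == 0;
        *word |= mask;
        changed
    }
}

impl SketchBitSetExt for Other {
    fn contains(&self, bit: usize) -> bool {
        self.bits.contains(&bit)
    }
    fn insert(&mut self, bit: usize) -> bool {
        self.bits.insert(bit)
    }
}

/// A trivial wrapper domain; its impl only forwards to the inner set.
struct Wrapper<S>(S);

impl<S: SketchBitSetExt> SketchBitSetExt for Wrapper<S> {
    fn contains(&self, bit: usize) -> bool {
        self.0.contains(bit)
    }
    fn insert(&mut self, bit: usize) -> bool {
        self.0.insert(bit)
    }
}

/// Analysis-side code is bounded on the trait rather than borrowing
/// one concrete bitset type. Returns how many bits were newly set.
fn gen_all<D: SketchBitSetExt>(state: &mut D, bits: &[usize]) -> usize {
    bits.iter().filter(|&&b| state.insert(b)).count()
}
```

The point of the design is that `gen_all` works unchanged for `Dense`, `Other`, or any wrapper around them, whereas a `Borrow<Dense>` bound would only admit domains that can hand out a reference to one particular representation.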

nnethercote · 2022-02-23T00:24:47Z

@bors r=Mark-Simulacrum

bors · 2022-02-23T00:24:48Z

📌 Commit 36b495f has been approved by Mark-Simulacrum

bors · 2022-02-23T01:26:09Z

⌛ Testing commit 36b495f with merge bafe8d0...

nnethercote · 2022-02-23T02:32:44Z

I took some extra measurements on my machine for affected benchmarks:

From rustc-perf: cranelift-codegen, inflate, keccak, match-stress-enum, unicode_normalization
External: http-0.2.6, rust-language-tags-0.3.2, tinyvec-1.5.1, vte-0.10.1

instructions:u

As mentioned above, these aren't useful for this change, but I've put them here for completeness.

Benchmark & Profile	Scenario	% Change	Significance Factor?	Chunk0	Chunk1
match-stress-enum check	full	3.56%	17.82x	6206276362.00	6427489110.00
unicode_normalization check	full	2.61%	13.05x	4394337716.00	4509070318.00
http check	full	-2.17%	10.85x	5639733100.00	5517400820.00
inflate check	full	1.07%	5.33x	4265448539.00	4310912103.00
keccak check	full	-1.05%	5.23x	23069621635.00	22828169903.00
cranelift-codegen check	full	0.82%	4.08x	14411424274.00	14529074725.00
vte check	full	-0.73%	3.67x	3260641577.00	3236722513.00
tinyvec check	full	0.56%	2.82x	3378759222.00	3397785087.00

cycles

The cycle results for this PR were consistently better on CI than they were for my local builds. Maybe because my local builds don't have PGO? Anyway, these aren't all that reliable.

Benchmark & Profile	Scenario	% Change	Significance Factor?	Chunk0	Chunk1
match-stress-enum check	full	5.49%	27.47x	2037211439.00	2149120206.00
vte check	full	-5.06%	25.31x	2192701285.00	2081692906.00
keccak check	full	-5.05%	25.26x	13616846225.00	12929031896.00
http check	full	-4.34%	21.71x	4499140807.00	4303815599.00
cranelift-codegen check	full	3.07%	15.37x	12658880013.00	13047971256.00
rust-language-tags check	full	-2.57%	12.85x	2098872505.00	2044922945.00
unicode_normalization check	full	-1.91%	9.53x	2663754160.00	2613007641.00

wall-time

Same story for wall-time as for cycles.

Benchmark & Profile	Scenario	% Change	Significance Factor?	Chunk0	Chunk1
match-stress-enum check	full	12.25%	61.26x	0.59	0.67
http check	full	-8.70%	43.51x	1.26	1.15
keccak check	full	-7.74%	38.69x	3.56	3.29
vte check	full	-7.53%	37.67x	0.72	0.67
unicode_normalization check	full	4.79%	23.95x	0.84	0.88
inflate check	full	1.64%	8.20x	0.63	0.64
cranelift-codegen check	full	1.16%	5.80x	3.28	3.31
rust-language-tags check	full	-0.99%	4.95x	0.68	0.68
tinyvec check	full	-0.95%	4.75x	0.62	0.62

max-rss

Huge wins here.

Benchmark & Profile	Scenario	% Change	Significance Factor?	Chunk0	Chunk1
keccak check	full	-59.05%	295.25x	974100.00	398892.00
http check	full	-32.11%	160.54x	390080.00	264832.00
vte check	full	-26.96%	134.78x	256144.00	187096.00
unicode_normalization check	full	-15.45%	77.27x	264228.00	223396.00
match-stress-enum check	full	-11.79%	58.94x	162364.00	143224.00
inflate check	full	-9.10%	45.49x	205048.00	186392.00
rust-language-tags check	full	-8.39%	41.93x	229596.00	210344.00
tinyvec check	full	-7.16%	35.79x	163636.00	151924.00
cranelift-codegen check	full	-4.03%	20.14x	400552.00	384420.00
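
The `% Change` column in these tables appears to be the relative change from the `Chunk0` value to the `Chunk1` value. A small sketch of that arithmetic, assuming `Chunk0` is the "before" measurement and `Chunk1` the "after" one (the function name is hypothetical):

```rust
/// Relative change from `before` to `after`, in percent.
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}
```

For the keccak max-rss row, `percent_change(974100.0, 398892.0)` is about -59.05, matching the table.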

faults

Huge wins here, too.

Benchmark & Profile	Scenario	% Change	Significance Factor?	Chunk0	Chunk1
keccak check	full	-60.21%	301.06x	240390.00	95647.00
http check	full	-46.54%	232.69x	68926.00	36849.00
vte check	full	-46.42%	232.08x	43003.00	23043.00
unicode_normalization check	full	-27.03%	135.17x	39006.00	28461.00
match-stress-enum check	full	-24.91%	124.55x	19325.00	14511.00
inflate check	full	-21.48%	107.39x	23373.00	18353.00
tinyvec check	full	-18.21%	91.07x	18809.00	15383.00
rust-language-tags check	full	-15.42%	77.08x	33191.00	28074.00
cranelift-codegen check	full	-5.83%	29.13x	75575.00	71172.00

bors · 2022-02-23T04:06:50Z

☀️ Test successful - checks-actions
Approved by: Mark-Simulacrum
Pushing bafe8d0 to master...

rust-timer · 2022-02-23T05:38:39Z

Finished benchmarking commit (bafe8d0): comparison url.

Summary: This benchmark run shows 6 relevant improvements 🎉 but 107 relevant regressions 😿 to instruction counts.

Average relevant regression: 1.0%
Average relevant improvement: -3.8%
Largest improvement in instruction counts: -5.3% on full builds of keccak debug
Largest regression in instruction counts: 6.0% on full builds of clap-rs check

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please open an issue or create a new PR that fixes the regressions, add a comment linking to the newly created issue or PR, and then add the perf-regression-triaged label to this PR.

@rustbot label: +perf-regression

nnethercote · 2022-02-23T06:26:14Z

As mentioned here and here, this is a rare case where instruction counts are misleading. Instead look at wall-time and cycles (some big wins along with the usual noise) along with max-rss and faults (some massive wins along with the usual noise).

@rustbot label: +perf-regression-triaged

…Simulacrum

Remove `HybridBitSet`

`HybridBitSet` was introduced under the name `HybridIdxSetBuf` way back in rust-lang#53383 where it was a big win for NLL borrow checker performance. In rust-lang#93984 the more flexible `ChunkedBitSet` was added. Uses of `HybridBitSet` have gradually disappeared (e.g. rust-lang#116152) and there are now few enough that they can be replaced with `BitSet` or `ChunkedBitSet`, and `HybridBitSet` can be removed, cutting more than 700 lines of code.

r? `@Mark-Simulacrum`

rustbot added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Feb 14, 2022

rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 14, 2022

nnethercote marked this pull request as draft February 14, 2022 06:25

rustbot added perf-regression Performance regression. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Feb 15, 2022

Mark-Simulacrum self-assigned this Feb 15, 2022

nnethercote force-pushed the ChunkedBitSet branch from fd92286 to 3ba04f6 Compare February 15, 2022 11:54

rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 15, 2022

Mark-Simulacrum reviewed Feb 15, 2022

rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 18, 2022

rustbot added the perf-regression-triaged The performance regression has been triaged. label Feb 18, 2022

Mark-Simulacrum reviewed Feb 22, 2022

Mark-Simulacrum added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 22, 2022

nnethercote force-pushed the ChunkedBitSet branch from d97c79e to 36b495f Compare February 23, 2022 00:24

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Feb 23, 2022

bors mentioned this pull request Feb 23, 2022

fix names in feature(...) suggestion #94213

Merged

bors added the merged-by-bors This PR was explicitly merged by bors. label Feb 23, 2022

bors merged commit bafe8d0 into rust-lang:master Feb 23, 2022

rustbot added this to the 1.61.0 milestone Feb 23, 2022

bors mentioned this pull request Feb 23, 2022

Convert newtype_index to a proc macro #93878

Merged

nnethercote deleted the ChunkedBitSet branch February 23, 2022 06:24

nnethercote mentioned this pull request Feb 24, 2022

High memory usage compiling keccak benchmark #54208

Closed

nnethercote mentioned this pull request Mar 9, 2022

Optimize ascii::escape_default #94776

Merged

lqd mentioned this pull request Sep 25, 2023

Only use dense bitsets in dataflow analyses #116152

Merged

nnethercote mentioned this pull request Nov 25, 2024

Remove HybridBitSet #133431

Merged
