Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Force 2 iterations of OptRepeat #100473

Conversation

BruceForstall
Copy link
Member

No description provided.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 31, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@EgorBo
Copy link
Member

EgorBo commented Apr 1, 2024

Ouch, quite expensive (@SingleAccretion predicted similar numbers)
It's still nice to have JitOptRepeat in good shape and it can point us to missing opportunities.
Maybe there are some hidden fruits to reduce the TP? E.g.

  • Don't run it for huge methods (e.g. non-linear complexity for number of basic blocks, etc)
  • Run on demand? E.g. if RBO removes a branch and invalidates dominators or Assertprop propagates a constant and potentionally enables more opportunities for RBO?
  • Don't run some opts in the 2nd iter like assertprop and rangecheck if they don't contribute to the final diffs a lot

But of course that is quite non-trivial to guess whether you might benefit from the 2nd iter or not. Maybe if we study these diffs we'll figure out what we benefit the most from in most cases?

@BruceForstall
Copy link
Member Author

Diffs

Code size improvement from about 3 to 4.5MB. TP cost of about 20%.

@BruceForstall
Copy link
Member Author

Ouch, quite expensive (@SingleAccretion predicted similar numbers) It's still nice to have JitOptRepeat in good shape and it can point us to missing opportunities. Maybe there are some hidden fruits to reduce the TP? E.g.

  • Don't run it for huge methods (e.g. non-linear complexity for number of basic blocks, etc)
  • Run on demand? E.g. if RBO removes a branch and invalidates dominators or Assertprop propagates a constant and potentionally enables more opportunities for RBO?
  • Don't run some opts in the 2nd iter like assertprop and rangecheck if they don't contribute to the final diffs a lot

But of course that is quite non-trivial to guess whether you might benefit from the 2nd iter or not. Maybe if we study these diffs we'll figure out what we benefit the most from in most cases?

Lots of good ideas here. I agree that we wouldn't want to run it unconditionally. But running on big methods might be the most beneficial. And maybe we'd be willing to take the 20% TP cost for NAOT compiles.

@BruceForstall
Copy link
Member Author

cc @dotnet/jit-contrib

Example of enabling JitOptRepeat always, with 2 iterations. Code size improvement from about 3 to 4.5MB. TP cost of about 20%.

@BruceForstall BruceForstall force-pushed the FixOptRepeat_ForceRepeat2InRelease branch from bea3f5f to 19c5d79 Compare April 1, 2024 22:39
@BruceForstall BruceForstall changed the title Force OptRepeat, 2 iterations. Enable in Release. Force 2 iterations of OptRepeat Apr 2, 2024
@BruceForstall BruceForstall force-pushed the FixOptRepeat_ForceRepeat2InRelease branch from 19c5d79 to 2c9dbda Compare April 2, 2024 02:04
@BruceForstall BruceForstall added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Apr 2, 2024
@BruceForstall
Copy link
Member Author

Diffs

@BruceForstall
Copy link
Member Author

image
image
image

@BruceForstall
Copy link
Member Author

Here is a run of asmdiffs on win-x64 between JitOptRepeat 2 and 3 iterations, showing there is some benefit to be had to run more than twice.

Diffs are based on 2,384,245 contexts (931,974 MinOpts, 1,452,271 FullOpts).

MISSED contexts: base: 7,980 (0.33%), diff: 8,122 (0.34%)

Base JIT options: JitOptRepeat#*

Diff JIT options: JitOptRepeat#*;JitOptRepeatCount#3

Overall (-149,570 bytes)
Collection Base size (bytes) Diff size (bytes) PerfScore in Diffs
benchmarks.run.windows.x64.checked.mch 8,524,826 -2,200 -1.16%
benchmarks.run_pgo.windows.x64.checked.mch 35,113,082 -8,390 +36.54%
benchmarks.run_tiered.windows.x64.checked.mch 12,444,446 -1,131 -1.25%
coreclr_tests.run.windows.x64.checked.mch 398,753,006 -53,627 +50.20%
libraries.crossgen2.windows.x64.checked.mch 44,921,842 -7,029 -0.59%
libraries.pmi.windows.x64.checked.mch 61,240,935 -6,528 +125.54%
libraries_tests.run.windows.x64.Release.mch 274,740,189 -23,431 +20.96%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 135,902,387 -44,863 +160.81%
realworld.run.windows.x64.checked.mch 11,060,751 -1,196 -0.12%
smoke_tests.nativeaot.windows.x64.checked.mch 3,565,641 -1,175 -0.83%
FullOpts (-149,570 bytes)
Collection Base size (bytes) Diff size (bytes) PerfScore in Diffs
benchmarks.run.windows.x64.checked.mch 8,524,465 -2,200 -1.16%
benchmarks.run_pgo.windows.x64.checked.mch 20,929,241 -8,390 +36.54%
benchmarks.run_tiered.windows.x64.checked.mch 3,279,500 -1,131 -1.25%
coreclr_tests.run.windows.x64.checked.mch 118,412,975 -53,627 +50.20%
libraries.crossgen2.windows.x64.checked.mch 44,920,652 -7,029 -0.59%
libraries.pmi.windows.x64.checked.mch 61,127,434 -6,528 +125.54%
libraries_tests.run.windows.x64.Release.mch 102,077,791 -23,431 +20.96%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 125,588,969 -44,863 +160.81%
realworld.run.windows.x64.checked.mch 10,647,587 -1,196 -0.12%
smoke_tests.nativeaot.windows.x64.checked.mch 3,565,548 -1,175 -0.83%

@EgorBo
Copy link
Member

EgorBo commented Apr 4, 2024

Sounds like 2 iterations are enough? 🙂 It'd be nice to have a breakdown TP report like @jakobbotsch did here for example to know what is the most expensive thing

@BruceForstall
Copy link
Member Author

Here's a run of asmdiffs on win-x64 between JitOptRepeat 3 and 4 iterations. There's still diffs.

Diffs are based on 2,384,210 contexts (931,974 MinOpts, 1,452,236 FullOpts).

MISSED contexts: base: 8,122 (0.34%), diff: 8,157 (0.34%)

Base JIT options: JitOptRepeat#*;JitOptRepeatCount#3

Diff JIT options: JitOptRepeat#*;JitOptRepeatCount#4

Overall (-71,360 bytes)
Collection Base size (bytes) Diff size (bytes) PerfScore in Diffs
benchmarks.run.windows.x64.checked.mch 8,522,626 -1,154 +0.26%
benchmarks.run_pgo.windows.x64.checked.mch 35,104,692 -3,889 +70.31%
benchmarks.run_tiered.windows.x64.checked.mch 12,443,315 -295 -0.03%
coreclr_tests.run.windows.x64.checked.mch 398,699,105 -64,162 +420.88%
libraries.crossgen2.windows.x64.checked.mch 44,914,813 -3,739 -0.10%
libraries.pmi.windows.x64.checked.mch 61,219,837 -8,283 +306.51%
libraries_tests.run.windows.x64.Release.mch 274,716,321 +913 +41.18%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 135,840,633 +11,773 +374.53%
realworld.run.windows.x64.checked.mch 11,059,555 -2,040 -0.08%
smoke_tests.nativeaot.windows.x64.checked.mch 3,564,466 -484 -0.19%
FullOpts (-71,360 bytes)
Collection Base size (bytes) Diff size (bytes) PerfScore in Diffs
benchmarks.run.windows.x64.checked.mch 8,522,265 -1,154 +0.26%
benchmarks.run_pgo.windows.x64.checked.mch 20,920,851 -3,889 +70.31%
benchmarks.run_tiered.windows.x64.checked.mch 3,278,369 -295 -0.03%
coreclr_tests.run.windows.x64.checked.mch 118,359,074 -64,162 +420.88%
libraries.crossgen2.windows.x64.checked.mch 44,913,623 -3,739 -0.10%
libraries.pmi.windows.x64.checked.mch 61,106,336 -8,283 +306.51%
libraries_tests.run.windows.x64.Release.mch 102,053,923 +913 +41.18%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 125,527,215 +11,773 +374.53%
realworld.run.windows.x64.checked.mch 10,646,391 -2,040 -0.08%
smoke_tests.nativeaot.windows.x64.checked.mch 3,564,373 -484 -0.19%

@BruceForstall
Copy link
Member Author

Here's a run of asmdiffs on win-x64 between JitOptRepeat 4 and 5 iterations.

Diffs are based on 2,384,205 contexts (931,974 MinOpts, 1,452,231 FullOpts).

MISSED contexts: base: 8,157 (0.34%), diff: 8,162 (0.34%)

Base JIT options: JitOptRepeat#*;JitOptRepeatCount#4

Diff JIT options: JitOptRepeat#*;JitOptRepeatCount#5

Overall (-14,538 bytes)
Collection Base size (bytes) Diff size (bytes) PerfScore in Diffs
benchmarks.run.windows.x64.checked.mch 8,521,472 -388 -0.12%
benchmarks.run_pgo.windows.x64.checked.mch 35,100,803 -1,606 +79.30%
benchmarks.run_tiered.windows.x64.checked.mch 12,443,020 -318 -0.19%
coreclr_tests.run.windows.x64.checked.mch 398,634,943 -28,742 +118.37%
libraries.crossgen2.windows.x64.checked.mch 44,911,074 +2,441 -0.31%
libraries.pmi.windows.x64.checked.mch 61,206,674 +4,410 +746.88%
libraries_tests.run.windows.x64.Release.mch 274,717,234 +4,295 +46.76%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 135,838,676 +5,179 +577.00%
realworld.run.windows.x64.checked.mch 11,057,515 +452 +0.04%
smoke_tests.nativeaot.windows.x64.checked.mch 3,563,982 -261 -0.02%
FullOpts (-14,538 bytes)
Collection Base size (bytes) Diff size (bytes) PerfScore in Diffs
benchmarks.run.windows.x64.checked.mch 8,521,111 -388 -0.12%
benchmarks.run_pgo.windows.x64.checked.mch 20,916,962 -1,606 +79.30%
benchmarks.run_tiered.windows.x64.checked.mch 3,278,074 -318 -0.19%
coreclr_tests.run.windows.x64.checked.mch 118,294,912 -28,742 +118.37%
libraries.crossgen2.windows.x64.checked.mch 44,909,884 +2,441 -0.31%
libraries.pmi.windows.x64.checked.mch 61,093,173 +4,410 +746.88%
libraries_tests.run.windows.x64.Release.mch 102,054,836 +4,295 +46.76%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 125,525,258 +5,179 +577.00%
realworld.run.windows.x64.checked.mch 10,644,351 +452 +0.04%
smoke_tests.nativeaot.windows.x64.checked.mch 3,563,889 -261 -0.02%

@BruceForstall
Copy link
Member Author

And just "for fun", here is asmdiffs on win-x64 between JitOptRepeat 8 and 9 iterations! Yes, there are still diffs. One question: are optimization changes somehow monotonic, or is it possible the optimizations are oscillating (e.g., one iteration creates a CSE, another removes it)?

Diffs are based on 2,384,201 contexts (931,974 MinOpts, 1,452,227 FullOpts).

MISSED contexts: 8,166 (0.34%)

Base JIT options: JitOptRepeat#*;JitOptRepeatCount#8

Diff JIT options: JitOptRepeat#*;JitOptRepeatCount#9

Overall (+4,080 bytes)
Collection Base size (bytes) Diff size (bytes) PerfScore in Diffs
benchmarks.run.windows.x64.checked.mch 8,520,257 -64 -0.02%
benchmarks.run_pgo.windows.x64.checked.mch 35,095,669 -227 +87.21%
benchmarks.run_tiered.windows.x64.checked.mch 12,442,539 -38 +0.00%
coreclr_tests.run.windows.x64.checked.mch 398,475,394 -1,406 +2445.47%
libraries.crossgen2.windows.x64.checked.mch 44,910,655 +1,299 -0.03%
libraries.pmi.windows.x64.checked.mch 61,193,216 +2,633 +2193.97%
libraries_tests.run.windows.x64.Release.mch 274,706,838 +3,977 +61.81%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 135,827,290 -2,776 +1920.18%
realworld.run.windows.x64.checked.mch 11,057,374 +600 +0.00%
smoke_tests.nativeaot.windows.x64.checked.mch 3,563,813 +82 +0.07%
FullOpts (+4,080 bytes)
Collection Base size (bytes) Diff size (bytes) PerfScore in Diffs
benchmarks.run.windows.x64.checked.mch 8,519,896 -64 -0.02%
benchmarks.run_pgo.windows.x64.checked.mch 20,911,828 -227 +87.21%
benchmarks.run_tiered.windows.x64.checked.mch 3,277,593 -38 +0.00%
coreclr_tests.run.windows.x64.checked.mch 118,135,363 -1,406 +2445.47%
libraries.crossgen2.windows.x64.checked.mch 44,909,465 +1,299 -0.03%
libraries.pmi.windows.x64.checked.mch 61,079,715 +2,633 +2193.97%
libraries_tests.run.windows.x64.Release.mch 102,044,440 +3,977 +61.81%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 125,513,872 -2,776 +1920.18%
realworld.run.windows.x64.checked.mch 10,644,210 +600 +0.00%
smoke_tests.nativeaot.windows.x64.checked.mch 3,563,720 +82 +0.07%

@github-actions github-actions bot locked and limited conversation to collaborators May 5, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants