Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow some intrinsics in Tier0 #77357

Merged
merged 12 commits into from
Feb 13, 2023
Merged

Allow some intrinsics in Tier0 #77357

merged 12 commits into from
Feb 13, 2023

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Oct 23, 2022

Expand some intrinsics in Tier0, e.g. these:

image

Plaintext-plaintext, Tier0 instrumented:

| load                   | Main            | PR              |         |
| ---------------------- | --------------- | --------------- | ------- |
| First Request (ms)     |             199 |             192 |  -3.52% |
| Requests/sec           |         545,865 |         610,476 | +11.84% |
| Mean latency (ms)      |            4.30 |            3.89 |  -9.53% |
| Latency 50th (ms)      |            4.17 |            3.78 |  -9.35% |
| Latency 75th (ms)      |            6.14 |            5.55 |  -9.61% |
| Latency 90th (ms)      |            7.58 |            6.90 |  -8.97% |

Mainly due to Span.get_Item and typeof() == typeof()

E.g. here is Main's codegen for typeof():

image

This PR fixes it to just return false

@ghost ghost assigned EgorBo Oct 23, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 23, 2022
@ghost
Copy link

ghost commented Oct 23, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

As it turns out, even extremely hot methods may stuck in Tier0 for several seconds (see #70941 (comment)) pending call-counting stub installation. So we'd better optimize Enum.HasFlag in tier0 too because it allocates and produces a large tree, e.g.:

bool Test(MyEnum e) => e.HasFlag(MyEnum.B);

Current Tier0 codegen:

; Assembly listing for method Program:Test(int):bool:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-0 compilation
; MinOpts code
; rbp based frame
; partially interruptible
G_M000_IG01:                ;; offset=0000H
       55                   push     rbp
       4883EC30             sub      rsp, 48
       488D6C2430           lea      rbp, [rsp+30H]
       33C0                 xor      eax, eax
       488945F8             mov      qword ptr [rbp-08H], rax
       488945F0             mov      qword ptr [rbp-10H], rax
       48894D10             mov      gword ptr [rbp+10H], rcx
       895518               mov      dword ptr [rbp+18H], ed
G_M000_IG02:                ;; offset=001BH
       48B928667DEAFD7F0000 mov      rcx, 0x7FFDEA7D6628
       E8E6C0AB5F           call     CORINFO_HELP_NEWSFAST
       488945F8             mov      gword ptr [rbp-08H], rax
       488B4DF8             mov      rcx, gword ptr [rbp-08H]
       8B4518               mov      eax, dword ptr [rbp+18H]
       894108               mov      dword ptr [rcx+08H], eax
       48B928667DEAFD7F0000 mov      rcx, 0x7FFDEA7D6628
       E8C9C0AB5F           call     CORINFO_HELP_NEWSFAST
       488945F0             mov      gword ptr [rbp-10H], rax
       488B4DF0             mov      rcx, gword ptr [rbp-10H]
       C7410801000000       mov      dword ptr [rcx+08H], 1
       488B4DF8             mov      rcx, gword ptr [rbp-08H]
       488B55F0             mov      rdx, gword ptr [rbp-10H]
       FF15DCF0F4FF         call     [System.Enum:HasFlag(System.Enum):bool:this]
       90                   nop      
G_M000_IG03:                ;; offset=0065H
       4883C430             add      rsp, 48
       5D                   pop      rbp
       C3                   ret      
; Total bytes of code 107

New tier0 codegen:

; Assembly listing for method Program:Test(int):bool:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-0 compilation
; MinOpts code
; rbp based frame
; partially interruptible
G_M57656_IG01:              ;; offset=0000H
       55                   push     rbp
       4883EC20             sub      rsp, 32
       488D6C2420           lea      rbp, [rsp+20H]
       33C0                 xor      eax, eax
       488945F8             mov      qword ptr [rbp-08H], rax
       488945F0             mov      qword ptr [rbp-10H], rax
       48894D10             mov      gword ptr [rbp+10H], rcx
       895518               mov      dword ptr [rbp+18H], edx
G_M57656_IG02:              ;; offset=001BH
       8B4518               mov      eax, dword ptr [rbp+18H]
       8945EC               mov      dword ptr [rbp-14H], eax
       8B45EC               mov      eax, dword ptr [rbp-14H]
       83E001               and      eax, 1
       83F801               cmp      eax, 1
       0F94C0               sete     al
       0FB6C0               movzx    rax, al
G_M57656_IG03:              ;; offset=0030H
       4883C420             add      rsp, 32
       5D                   pop      rbp
       C3                   ret      
; Total bytes of code 54

Also, allow expansion for some other lightweight intrinsics for two reasons:

  1. GT_CALL nodes are not cheap too
  2. We want to avoid unnecessary JIT compilations e.g. why would we need to spend time compiling String.get_Length() + installing a call-counting stub for it when the expansion for it super trivial. Etc.
Author: EgorBo
Assignees: EgorBo
Labels:

area-CodeGen-coreclr

Milestone: -

@EgorBo
Copy link
Member Author

EgorBo commented Oct 23, 2022

Avalonia's version of Enum.HasFlag: https://github.com/AvaloniaUI/Avalonia/blob/28d2765abcac2bd55128611c8a416aef051b6e82/src/Avalonia.Base/EnumExtensions.cs#L12-L42 - it was invented because our Enum.HasFlag is slow and allocates in tier0 (and for net4x support)

@AndyAyersMS
Copy link
Member

Linking to #9120.

@BruceForstall
Copy link
Member

@EgorBo No activity on this for a while. Do you want to convert it to Draft until it is ready? Or are you waiting on more reviews?

@EgorBo EgorBo marked this pull request as draft November 28, 2022 19:48
@EgorBo
Copy link
Member Author

EgorBo commented Nov 28, 2022

@EgorBo No activity on this for a while. Do you want to convert it to Draft until it is ready? Or are you waiting on more reviews?

Yes, it's blocked on the other PR to introduce opt levels.

@ghost ghost closed this Dec 28, 2022
@ghost
Copy link

ghost commented Dec 28, 2022

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

@EgorBo EgorBo reopened this Jan 12, 2023
@EgorBo
Copy link
Member Author

EgorBo commented Feb 10, 2023

Reminder for myself to land this.

TechEmpower Tier0:

image

A lot of time is spent inside Span.get_Item that could be expanded (and is in this PR). Our theory that it's expensive becuase of block counters inside (PGO is enabled) that we don't need for it

@EgorBo
Copy link
Member Author

EgorBo commented Feb 11, 2023

@EgorBo EgorBo marked this pull request as ready for review February 11, 2023 01:19
@EgorBo
Copy link
Member Author

EgorBo commented Feb 12, 2023

image

😕 (that's for Main, this PR fixes it)

@EgorBo EgorBo changed the title Allow some intrinsics in Tier0 (e.g. Enum.HasFlag) Allow some intrinsics in Tier0 Feb 12, 2023
@EgorBo
Copy link
Member Author

EgorBo commented Feb 12, 2023

Plaintext-plaintext, Tier0 instrumented:

| load                   | Main            | PR              |         |
| ---------------------- | --------------- | --------------- | ------- |
| First Request (ms)     |             199 |             192 |  -3.52% |
| Requests/sec           |         545,865 |         610,476 | +11.84% |
| Mean latency (ms)      |            4.30 |            3.89 |  -9.53% |
| Latency 50th (ms)      |            4.17 |            3.78 |  -9.35% |
| Latency 75th (ms)      |            6.14 |            5.55 |  -9.61% |
| Latency 90th (ms)      |            7.58 |            6.90 |  -8.97% |

Mainly due to Span.get_Item and typeof() == typeof()

@EgorBo
Copy link
Member Author

EgorBo commented Feb 12, 2023

Even better numbers for YARP, start time is also decreased!

| application                             | Main        | PR          |         |
| --------------------------------------- | ----------- | ----------- | ------- |
| CPU Usage (%)                           |         100 |          99 |  -1.00% |
| Cores usage (%)                         |       2,794 |       2,777 |  -0.61% |
| Working Set (MB)                        |         460 |         459 |  -0.22% |
| Private Memory (MB)                     |       1,232 |       1,233 |  +0.08% |
| Build Time (ms)                         |       5,127 |       4,346 | -15.23% |
| Start Time (ms)                         |         846 |         802 |  -5.20% |
| Published Size (KB)                     |      90,941 |      90,941 |   0.00% |
| Symbols Size (KB)                       |           0 |           0 |         |
| Max CPU Usage (%)                       |         100 |          99 |  -0.56% |
| Max Working Set (MB)                    |         482 |         481 |  -0.07% |
| Max GC Heap Size (MB)                   |         339 |         339 |  +0.12% |
| Size of committed memory by the GC (MB) |         398 |         403 |  +1.10% |
| Max Number of Gen 0 GCs / sec           |        1.00 |        1.00 |   0.00% |
| Max Number of Gen 1 GCs / sec           |        1.00 |        1.00 |   0.00% |
| Max Number of Gen 2 GCs / sec           |        1.00 |        1.00 |   0.00% |
| Max Time in GC (%)                      |        0.00 |        1.00 |     +∞% |
| Max Gen 0 Size (B)                      |   5,541,728 |   6,963,296 | +25.65% |
| Max Gen 1 Size (B)                      |   9,210,496 |   9,532,536 |  +3.50% |
| Max Gen 2 Size (B)                      |  11,709,272 |  11,713,680 |  +0.04% |
| Max LOH Size (B)                        |      85,072 |      85,072 |   0.00% |
| Max POH Size (B)                        |   1,186,016 |   1,181,896 |  -0.35% |
| Max Allocation Rate (B/sec)             | 113,178,784 | 179,763,992 | +58.83% |
| Max GC Heap Fragmentation               |          34 |          33 |  -3.02% |
| # of Assemblies Loaded                  |         104 |         104 |   0.00% |
| Max Exceptions (#/s)                    |           0 |           0 |         |
| Max Lock Contention (#/s)               |         246 |         304 | +23.58% |
| Max ThreadPool Threads Count            |          48 |          48 |   0.00% |
| Max ThreadPool Queue Length             |         181 |         193 |  +6.63% |
| Max ThreadPool Items (#/s)              |      64,596 |      99,265 | +53.67% |
| Max Active Timers                       |         221 |         212 |  -4.07% |
| IL Jitted (B)                           |     724,634 |     723,985 |  -0.09% |
| Methods Jitted                          |      10,422 |      10,385 |  -0.36% |
| Cores Usage (90th %)                    |       2,780 |       2,750 |  -1.08% |

| load                   | Main        | PR          |         |
| ---------------------- | ----------- |------------ | ------- |
| CPU Usage (%)          |          14 |          19 | +35.71% |
| Cores usage (%)        |         383 |         539 | +40.73% |
| Working Set (MB)       |         125 |         125 |   0.00% |
| Private Memory (MB)    |         217 |         217 |   0.00% |
| Start Time (ms)        |         109 |         106 |  -2.75% |
| First Request (ms)     |         496 |         499 |  +0.60% |
| Requests               |     740,445 |   1,157,557 | +56.33% |
| Latency 50th (ms)      |        5.08 |        3.24 | -36.33% |
| Latency 75th (ms)      |        5.33 |        3.48 | -34.67% |
| Latency 90th (ms)      |        5.69 |        3.76 | -33.88% |
| Latency 95th (ms)      |        5.94 |        3.95 | -33.41% |
| Latency 99th (ms)      |        6.80 |        4.90 | -27.94% |
| Mean latency (ms)      |        5.18 |        3.31 | -36.07% |
| Max latency (ms)       |       78.73 |       91.77 | +16.56% |
| Requests/sec           |      49,478 |      77,391 | +56.42% |
| Requests/sec (max)     |      67,962 |      87,072 | +28.12% |
| Read throughput (MB/s) |        7.15 |       11.19 | +56.35% |
| Cores Usage (90th %)   |         382 |         538 | +40.84% |

@EgorBo
Copy link
Member Author

EgorBo commented Feb 12, 2023

PTAL @AndyAyersMS @dotnet/jit-contrib

Co-authored-by: Jakob Botsch Nielsen <Jakob.botsch.nielsen@gmail.com>
@EgorBo
Copy link
Member Author

EgorBo commented Feb 13, 2023

ping @AndyAyersMS @jakobbotsch - I need this to land for my next pr 🙂

Copy link
Member

@jakobbotsch jakobbotsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a reasonable set to expand to me.

@EgorBo EgorBo merged commit e1edb16 into dotnet:main Feb 13, 2023
@EgorBo EgorBo deleted the allow-tier0-opts branch February 13, 2023 22:46
@AndyAyersMS
Copy link
Member

@EgorBo sorry I missed seeing this. Going forward can you assign me as reviewer? Otherwise I may not look in the right places.

Also if you have copies of your evaluation scripts laying around, point me at them.

BruceForstall added a commit to BruceForstall/runtime that referenced this pull request Feb 14, 2023
@BruceForstall
Copy link
Member

This caused outerloop regressions: #82078

BruceForstall added a commit that referenced this pull request Feb 14, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Mar 16, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants