Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up ILEmitTrie jump table using Binary Search Tree #52812

Closed
andrewjsaid opened this issue Dec 14, 2023 · 4 comments · Fixed by #52928
Closed

Speed up ILEmitTrie jump table using Binary Search Tree #52812

andrewjsaid opened this issue Dec 14, 2023 · 4 comments · Fixed by #52928
Labels
area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions area-routing
Milestone

Comments

@andrewjsaid
Copy link
Contributor

andrewjsaid commented Dec 14, 2023

The proposal here is that we can improve ILEmitTrie by using binary search instead of the linear search we currently do for both finding the length as well as the vectorized search.

This improves the MicroBenchmarks as follows:

  • VectorTrie (Count = 25, Count = 50) is now 60% of original duration. Curiously only 80% for Count = 100
  • Trie (Count = 100) is now 40% of original duration. Curiously very minor / insignificant impact on other Counts

Implementation in andrewjsaid#1. The PR template implied I had to first create an issue before the PR so I did so. Please let me know if design makes sense I can then point the PR to here for detailed code review.

In the linked PR you can see the changes are:

  1. Binary search for length instead of linear search
  2. Binary search for vectorized comparison instead of linear search
  3. Where possible, binary search for single character comparison instead of linear
  4. Reduce comparison of single character from 2 branches to 1
  5. In vectorized comparison when only comparing letters (which is the majority of real-world cases), I removed 14 IL operations for each iteration of 4 characters. <-- kind of the same optimization as is done in StringSearchValues
dotnet run -c Release --filter *JumpTableMultiple*Trie*  --launchCount 3

Before

BenchmarkDotNet=v0.13.0, OS=Windows 10.0.22621
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=9.0.100-alpha.1.23608.3
  [Host]     : .NET 9.0.0 (9.0.23.61215), X64 RyuJIT
  Job-RMGOVO : .NET 9.0.0 (9.0.23.57707), X64 RyuJIT

Server=True  Toolchain=.NET Core 9.0  LaunchCount=3  
RunStrategy=Throughput  
Method Count Mean Error StdDev Op/s
Trie 2 3.284 ns 0.0152 ns 0.0286 ns 304,474,863.5
VectorTrie 2 3.373 ns 0.0190 ns 0.0357 ns 296,466,572.2
Trie 5 3.872 ns 0.0208 ns 0.0386 ns 258,288,594.1
VectorTrie 5 4.081 ns 0.0209 ns 0.0398 ns 245,050,802.2
Trie 10 4.960 ns 0.0501 ns 0.0942 ns 201,618,909.5
VectorTrie 10 5.069 ns 0.0594 ns 0.1129 ns 197,283,921.0
Trie 25 24.888 ns 0.2683 ns 0.5104 ns 40,180,047.5
VectorTrie 25 8.783 ns 0.0418 ns 0.0796 ns 113,854,026.0
Trie 50 41.514 ns 0.3654 ns 0.6952 ns 24,088,288.4
VectorTrie 50 15.320 ns 0.1335 ns 0.2540 ns 65,273,554.0
Trie 100 162.799 ns 1.0319 ns 1.9382 ns 6,142,555.1
VectorTrie 100 38.724 ns 0.2069 ns 0.3887 ns 25,823,634.6

After

BenchmarkDotNet=v0.13.0, OS=Windows 10.0.22621
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=9.0.100-alpha.1.23608.3
  [Host]     : .NET 9.0.0 (9.0.23.61215), X64 RyuJIT
  Job-VILVTP : .NET 9.0.0 (9.0.23.57707), X64 RyuJIT

Server=True  Toolchain=.NET Core 9.0  LaunchCount=3  
RunStrategy=Throughput  
Method Count Mean Error StdDev Median Op/s
Trie 2 3.149 ns 0.0169 ns 0.0321 ns 3.148 ns 317,515,568.2
VectorTrie 2 3.380 ns 0.0193 ns 0.0347 ns 3.371 ns 295,849,542.1
Trie 5 3.791 ns 0.0209 ns 0.0397 ns 3.806 ns 263,756,033.7
VectorTrie 5 4.036 ns 0.0210 ns 0.0395 ns 4.026 ns 247,769,489.7
Trie 10 4.079 ns 0.0419 ns 0.0766 ns 4.054 ns 245,174,104.0
VectorTrie 10 4.197 ns 0.0210 ns 0.0400 ns 4.192 ns 238,244,013.7
Trie 25 25.089 ns 0.2536 ns 0.4826 ns 24.883 ns 39,858,055.0
VectorTrie 25 5.478 ns 0.0306 ns 0.0582 ns 5.474 ns 182,558,428.1
Trie 50 37.309 ns 0.2636 ns 0.5015 ns 37.299 ns 26,803,430.1
VectorTrie 50 8.761 ns 0.0302 ns 0.0575 ns 8.768 ns 114,138,948.1
Trie 100 60.896 ns 0.2635 ns 0.5013 ns 60.724 ns 16,421,417.5
VectorTrie 100 31.195 ns 0.1460 ns 0.2742 ns 31.206 ns 32,055,996.7
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label Used by the dotnet-issue-labeler to label those issues which couldn't be triaged automatically label Dec 14, 2023
@gfoidl gfoidl added area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions and removed needs-area-label Used by the dotnet-issue-labeler to label those issues which couldn't be triaged automatically labels Dec 14, 2023
@BrennanConroy
Copy link
Member

+@javiercn @JamesNK you two are familiar with routing

@JamesNK
Copy link
Member

JamesNK commented Dec 20, 2023

I know what the ILEmitTrie does but I've never touched it. From looking at the file history, no one has made significant changes to it since @rynowak wrote it years ago. Snaps for Ryan.

If the tests all pass and perf is good then we're halfway there. I'd like to get someone to double check the vertorized path. I don't have enough expertise to review code and offer an opinion.

What should think about:

  • Is the new generated IL significantly larger? There can be a lot of trie jump tables so we need to keep its size in check
  • There are rules about what jump table to create. There is a check that prefers a dictionary jump table over a trie above 100 items. Should this change?

@JamesNK
Copy link
Member

JamesNK commented Dec 20, 2023

@andrewjsaid Go ahead and create a PR. We should have discussion in the PR.

@rynowak
Copy link
Member

rynowak commented Dec 20, 2023

There are rules about what jump table to create. There is a check that prefers a dictionary jump table over a trie above 100 items. Should this change?

I can probably provide some degree of feedback on topics like this in the PR. Unfortunately it's all from memory because I didn't work on this code much after the original version 😓

The good news is that I think the stakes probably pretty low. Most of the decisions like the switch between emit and dictionary were made empirically based on the benchmarks. So if the benchmarks change then the breakpoints should change too.

Binary search wasn't something I thought of while building the original version. Compared to the previous version of routing all of this work was in a different galaxy perf-wise 🪐 . Most of my time was spent worrying about overhead in the N == 1 case.

Is the new generated IL significantly larger? There can be a lot of trie jump tables so we need to keep its size in check

I think @JamesNK raises a good point about code size. Changing the complexity to log(N) means that the emit approach will scale to a higher breakpoint (where dictionary becomes better) than the linear version. We may not want to make this decision solely based on the throughput. In my anecdotal experience a "big" node of the routing tree has 10+ branches, not hundreds or thousands so cases that would result in big tries aren't that common.

@dotnet-policy-service dotnet-policy-service bot added the pending-ci-rerun When assigned to a PR indicates that the CI checks should be rerun label Feb 6, 2024
@wtgodbe wtgodbe removed the pending-ci-rerun When assigned to a PR indicates that the CI checks should be rerun label Feb 6, 2024
@dotnet-policy-service dotnet-policy-service bot added the pending-ci-rerun When assigned to a PR indicates that the CI checks should be rerun label Feb 6, 2024
@wtgodbe wtgodbe removed the pending-ci-rerun When assigned to a PR indicates that the CI checks should be rerun label Feb 13, 2024
@dotnet dotnet deleted a comment from dotnet-policy-service bot Feb 13, 2024
@dotnet dotnet deleted a comment from dotnet-policy-service bot Feb 13, 2024
@BrennanConroy BrennanConroy added this to the 9.0-preview2 milestone Feb 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions area-routing
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants