Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove the triage suspected machinery and break cycles with weak refs #5910

Merged
merged 20 commits into from
Jul 9, 2024

Conversation

teoxoy
Copy link
Member

@teoxoy teoxoy commented Jul 4, 2024

Resolves #5560.
Resolves #5120.
Resolves #5592 & resolves #5583 by removing Device::temp_suspected.

PR doesn't need to be squashed, each commit builds by itself.

@teoxoy teoxoy requested a review from a team as a code owner July 4, 2024 20:20
@teoxoy
Copy link
Member Author

teoxoy commented Jul 8, 2024

renderpass benchmark results of current trunk vs this PR (TLDR: 39-64% less time spent in .submit()):

Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time): Warming up for 3.0000 sAdapterInfo { name: "NVIDIA GeForce RTX 2080 with Max-Q Design", vendor: 4318, device: 7824, device_type: DiscreteGpu, driver: "NVIDIA", driver_info: "555.99", backend: Vulkan }
Renderpass: Single Threaded/1 renderpasses x 10000 draws (Renderpass Time)
                        time:   [45.386 ms 45.612 ms 45.862 ms]
                        thrpt:  [218.04 Kelem/s 219.24 Kelem/s 220.33 Kelem/s]
                 change:
                        time:   [-45.146% -43.159% -41.027%] (p = 0.00 < 0.05)
                        thrpt:  [+69.569% +75.928% +82.302%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
Renderpass: Single Threaded/2 renderpasses x 5000 draws (Renderpass Time)
                        time:   [45.995 ms 46.243 ms 46.515 ms]
                        thrpt:  [214.98 Kelem/s 216.25 Kelem/s 217.41 Kelem/s]
                 change:
                        time:   [-38.633% -36.379% -34.017%] (p = 0.00 < 0.05)
                        thrpt:  [+51.554% +57.181% +62.954%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
Renderpass: Single Threaded/4 renderpasses x 2500 draws (Renderpass Time)
                        time:   [46.370 ms 46.602 ms 46.849 ms]
                        thrpt:  [213.45 Kelem/s 214.58 Kelem/s 215.66 Kelem/s]
                 change:
                        time:   [-36.293% -33.691% -30.883%] (p = 0.00 < 0.05)
                        thrpt:  [+44.681% +50.809% +56.969%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild
Renderpass: Single Threaded/8 renderpasses x 1250 draws (Renderpass Time)
                        time:   [47.662 ms 47.883 ms 48.131 ms]
                        thrpt:  [207.77 Kelem/s 208.84 Kelem/s 209.81 Kelem/s]
                 change:
                        time:   [-48.640% -46.851% -45.073%] (p = 0.00 < 0.05)
                        thrpt:  [+82.058% +88.149% +94.703%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
Renderpass: Single Threaded/1 renderpasses x 10000 draws (Submit Time)
                        time:   [7.6377 ms 7.6754 ms 7.7151 ms]
                        thrpt:  [1.2962 Melem/s 1.3029 Melem/s 1.3093 Melem/s]
                 change:
                        time:   [-66.548% -64.128% -61.611%] (p = 0.00 < 0.05)
                        thrpt:  [+160.49% +178.77% +198.93%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
Renderpass: Single Threaded/2 renderpasses x 5000 draws (Submit Time)
                        time:   [9.2829 ms 9.3285 ms 9.3782 ms]
                        thrpt:  [1.0663 Melem/s 1.0720 Melem/s 1.0772 Melem/s]
                 change:
                        time:   [-52.002% -50.382% -48.791%] (p = 0.00 < 0.05)
                        thrpt:  [+95.276% +101.54% +108.34%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
Renderpass: Single Threaded/4 renderpasses x 2500 draws (Submit Time)
                        time:   [10.535 ms 10.598 ms 10.669 ms]
                        thrpt:  [937.28 Kelem/s 943.61 Kelem/s 949.21 Kelem/s]
                 change:
                        time:   [-59.243% -56.854% -54.335%] (p = 0.00 < 0.05)
                        thrpt:  [+118.99% +131.77% +145.36%]
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
Renderpass: Single Threaded/8 renderpasses x 1250 draws (Submit Time)
                        time:   [12.488 ms 12.657 ms 12.852 ms]
                        thrpt:  [778.10 Kelem/s 790.06 Kelem/s 800.78 Kelem/s]
                 change:
                        time:   [-40.483% -39.474% -38.497%] (p = 0.00 < 0.05)
                        thrpt:  [+62.594% +65.217% +68.020%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe

Renderpass: Multi Threaded/2 threads x 5000 draws
                        time:   [26.509 ms 26.626 ms 26.747 ms]
                        thrpt:  [373.88 Kelem/s 375.57 Kelem/s 377.22 Kelem/s]
                 change:
                        time:   [-1.3415% -0.7862% -0.2772%] (p = 0.00 < 0.05)
                        thrpt:  [+0.2779% +0.7924% +1.3597%]
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Renderpass: Multi Threaded/4 threads x 2500 draws
                        time:   [17.354 ms 17.457 ms 17.562 ms]
                        thrpt:  [569.42 Kelem/s 572.83 Kelem/s 576.23 Kelem/s]
                 change:
                        time:   [-0.5265% +0.2546% +1.0503%] (p = 0.52 > 0.05)
                        thrpt:  [-1.0394% -0.2539% +0.5293%]
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
Renderpass: Multi Threaded/8 threads x 1250 draws
                        time:   [13.877 ms 13.933 ms 13.991 ms]
                        thrpt:  [714.74 Kelem/s 717.70 Kelem/s 720.60 Kelem/s]
                 change:
                        time:   [-1.8826% -1.1951% -0.5450%] (p = 0.00 < 0.05)
                        thrpt:  [+0.5480% +1.2096% +1.9188%]
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

Renderpass: Bindless/10000 draws
                        time:   [18.025 ms 18.092 ms 18.160 ms]
                        thrpt:  [550.67 Kelem/s 552.73 Kelem/s 554.79 Kelem/s]
                 change:
                        time:   [-3.7818% -3.2353% -2.7117%] (p = 0.00 < 0.05)
                        thrpt:  [+2.7873% +3.3435% +3.9305%]
                        Performance has improved.

Renderpass: Empty Submit with 90000 Resources
                        time:   [4.1057 µs 4.1180 µs 4.1325 µs]
                        change: [-99.905% -99.904% -99.903%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

Copy link
Contributor

@nical nical left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! This makes significant changes to some pretty complicated systems. The changes look good to me although I usually have a firmer understanding of the potential side effects when I approve a PR.

So this calls for some more thorough testing than usual or a second pair of eyes.

@teoxoy
Copy link
Member Author

teoxoy commented Jul 8, 2024

I modified the benchmark to check the poll time to see what impact removing triage_suspected and dropping the Trackers in triage_submissions had. Times improved for 1 and 2 command buffers, remained the same for 4 and increased for 8; I don't think this is surprising, but I was actually expecting perf regressions in poll across the board.

Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Poll Time): Warming up for 3.0000 sAdapterInfo { name: "NVIDIA GeForce RTX 2080 with Max-Q Design", vendor: 4318, device: 7824, device_type: DiscreteGpu, driver: "NVIDIA", driver_info: "555.99", backend: Vulkan }
Renderpass: Single Threaded/1 renderpasses x 10000 draws (Poll Time)
                        time:   [4.3303 ms 4.3876 ms 4.4523 ms]
                        thrpt:  [2.2460 Melem/s 2.2792 Melem/s 2.3093 Melem/s]
                 change:
                        time:   [-34.959% -27.738% -19.927%] (p = 0.00 < 0.05)
                        thrpt:  [+24.886% +38.386% +53.748%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
Renderpass: Single Threaded/2 renderpasses x 5000 draws (Poll Time)
                        time:   [4.4162 ms 4.4597 ms 4.5056 ms]
                        thrpt:  [2.2195 Melem/s 2.2423 Melem/s 2.2644 Melem/s]
                 change:
                        time:   [-49.601% -45.172% -40.151%] (p = 0.00 < 0.05)
                        thrpt:  [+67.087% +82.389% +98.418%]
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
Renderpass: Single Threaded/4 renderpasses x 2500 draws (Poll Time)
                        time:   [5.3340 ms 5.3993 ms 5.4708 ms]
                        thrpt:  [1.8279 Melem/s 1.8521 Melem/s 1.8748 Melem/s]
                 change:
                        time:   [-8.7991% -1.7697% +5.5161%] (p = 0.64 > 0.05)
                        thrpt:  [-5.2277% +1.8016% +9.6481%]
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
Renderpass: Single Threaded/8 renderpasses x 1250 draws (Poll Time)
                        time:   [7.1084 ms 7.1824 ms 7.2583 ms]
                        thrpt:  [1.3777 Melem/s 1.3923 Melem/s 1.4068 Melem/s]
                 change:
                        time:   [+43.896% +52.281% +60.210%] (p = 0.00 < 0.05)
                        thrpt:  [-37.582% -34.332% -30.505%]
                        Performance has regressed.

@teoxoy
Copy link
Member Author

teoxoy commented Jul 8, 2024

All in all, perf of submit + poll has improved by 22-32%.

Benchmarking Renderpass: Single Threaded/1 renderpasses x 10000 draws (Submit + Poll Time): Warming up for 3.0000 sAdapterInfo { name: "NVIDIA GeForce RTX 2080 with Max-Q Design", vendor: 4318, device: 7824, device_type: DiscreteGpu, driver: "NVIDIA", driver_info: "555.99", backend: Vulkan }
Renderpass: Single Threaded/1 renderpasses x 10000 draws (Submit + Poll Time)
                        time:   [11.894 ms 11.955 ms 12.020 ms]
                        thrpt:  [831.93 Kelem/s 836.46 Kelem/s 840.79 Kelem/s]
                 change:
                        time:   [-33.133% -32.534% -31.944%] (p = 0.00 < 0.05)
                        thrpt:  [+46.938% +48.223% +49.551%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
Renderpass: Single Threaded/2 renderpasses x 5000 draws (Submit + Poll Time)
                        time:   [13.566 ms 13.656 ms 13.757 ms]
                        thrpt:  [726.90 Kelem/s 732.26 Kelem/s 737.13 Kelem/s]
                 change:
                        time:   [-30.535% -29.948% -29.234%] (p = 0.00 < 0.05)
                        thrpt:  [+41.311% +42.752% +43.958%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
Renderpass: Single Threaded/4 renderpasses x 2500 draws (Submit + Poll Time)
                        time:   [15.506 ms 15.593 ms 15.685 ms]
                        thrpt:  [637.56 Kelem/s 641.30 Kelem/s 644.91 Kelem/s]
                 change:
                        time:   [-27.790% -27.327% -26.847%] (p = 0.00 < 0.05)
                        thrpt:  [+36.699% +37.603% +38.485%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
Renderpass: Single Threaded/8 renderpasses x 1250 draws (Submit + Poll Time)
                        time:   [19.174 ms 19.330 ms 19.534 ms]
                        thrpt:  [511.92 Kelem/s 517.34 Kelem/s 521.55 Kelem/s]
                 change:
                        time:   [-23.052% -22.296% -21.462%] (p = 0.00 < 0.05)
                        thrpt:  [+27.326% +28.693% +29.958%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

Copy link
Member

@cwfitzgerald cwfitzgerald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! I have one comment about the trackers. Did you do much changes in the trackers, or just moving code around, it's a bit hard to tell.

Love the diff numbers!

wgpu-core/src/track/buffer.rs Outdated Show resolved Hide resolved
@teoxoy
Copy link
Member Author

teoxoy commented Jul 9, 2024

Did you do much changes in the trackers, or just moving code around, it's a bit hard to tell.

The only behavioral change was .insert_single() will no longer panic if the resource at the tracker index already exists. This was done because we recycle tracker indices on resource drop and I found no issue with just overwriting the data in the tracker at a given index.

teoxoy added 20 commits July 9, 2024 10:01
…ources.

It's worth noting that `suspected_resources` never contained those resources.
The Vec only ever contained 0 or 1 command buffers.
We now acquire an encoder on every submit for pending writes but that's not a problem since those are coming from a pool anyway.
…ive in `LifetimeTracker.handle_mapping`

This change doesn't change behavior as `Global.buffer_drop` already unmaps the buffer.
@teoxoy teoxoy force-pushed the submit-tracking branch from 20a8e35 to 7ba4498 Compare July 9, 2024 08:35
@teoxoy teoxoy merged commit 4255268 into gfx-rs:trunk Jul 9, 2024
25 checks passed
@teoxoy teoxoy deleted the submit-tracking branch July 9, 2024 08:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
3 participants