This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Implement APIs for some threading metrics (CoreCLR) #22754

Closed
kouvel wants to merge 7 commits

kouvel (Member) commented Feb 21, 2019

@kouvel kouvel added this to the 3.0 milestone Feb 21, 2019
@kouvel kouvel self-assigned this Feb 21, 2019
@kouvel kouvel requested a review from stephentoub February 21, 2019 20:44
kouvel added a commit to kouvel/corert that referenced this pull request Feb 21, 2019
…ixes

- API review: https://github.com/dotnet/corefx/issues/35500
- May depend on dotnet/coreclr#22754
- Fixed `Timer` implementation on Unixes. Previously there was only ever one timer request from the upper-level implementation and that is not the case anymore, so the lower-level "app domain timer" implementation needed to handle multiple timer requests.
@kouvel kouvel added the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Feb 21, 2019
kouvel (Member, Author) commented Feb 21, 2019

There was no measurable change in perf from these changes.

Review thread on src/vm/syncblk.cpp (outdated, resolved)
kouvel (Member, Author) commented Mar 6, 2019

Some perf numbers:

  • Left score = Before the changes, work items per ms
  • Right score = After the changes, work items per ms

Initial change:

CoreCLR 4-core 4-thread 1-node          Left score      Right score     ∆ Score %
--------------------------------------  --------------  --------------  ---------
TaskBurstWorkThroughput 1PcT 000.5PcWi   670.88 ±0.20%   677.22 ±0.16%      0.95%
TaskBurstWorkThroughput 1PcT 001.0PcWi  1642.87 ±0.39%  1661.91 ±0.45%      1.16%
TaskBurstWorkThroughput 1PcT 004.0PcWi  1859.70 ±0.61%  1877.41 ±0.63%      0.95%
TaskBurstWorkThroughput 1PcT 016.0PcWi  1696.58 ±0.60%  1681.65 ±0.57%     -0.88%
TaskBurstWorkThroughput 1PcT 064.0PcWi  1843.14 ±0.70%  1872.47 ±0.50%      1.59%
TaskBurstWorkThroughput 1PcT 256.0PcWi  2025.88 ±0.62%  2015.75 ±0.54%     -0.50%
TaskSustainedWorkThroughput 1PcT        2906.67 ±0.25%  2913.72 ±0.76%      0.24%
--------------------------------------  --------------  --------------  ---------

CoreRT 4-core 4-thread 1-node           Left score      Right score     ∆ Score %
--------------------------------------  --------------  --------------  ---------
TaskBurstWorkThroughput 1PcT 000.5PcWi   403.73 ±0.21%   407.09 ±0.12%      0.83%
TaskBurstWorkThroughput 1PcT 001.0PcWi   976.50 ±0.11%   981.36 ±0.08%      0.50%
TaskBurstWorkThroughput 1PcT 004.0PcWi  2913.15 ±0.07%  2892.59 ±0.10%     -0.71%
TaskBurstWorkThroughput 1PcT 016.0PcWi  2610.12 ±0.06%  2589.73 ±0.08%     -0.78%
TaskBurstWorkThroughput 1PcT 064.0PcWi  3057.35 ±0.11%  3038.80 ±0.10%     -0.61%
TaskBurstWorkThroughput 1PcT 256.0PcWi  3235.46 ±0.11%  3207.21 ±0.13%     -0.87%
TaskSustainedWorkThroughput 1PcT        3943.24 ±0.17%  3924.89 ±0.08%     -0.47%
--------------------------------------  --------------  --------------  ---------

CoreCLR 48-core 48-thread 4-node        Left score      Right score     ∆ Score %
--------------------------------------  --------------  --------------  ---------
TaskBurstWorkThroughput 1PcT 000.5PcWi   878.83 ±0.40%   877.38 ±0.40%     -0.16%
TaskBurstWorkThroughput 1PcT 001.0PcWi  4541.56 ±0.74%  4647.15 ±0.90%      2.32%
TaskBurstWorkThroughput 1PcT 004.0PcWi  5295.10 ±0.30%  5431.67 ±0.33%      2.58%
TaskBurstWorkThroughput 1PcT 016.0PcWi  5301.83 ±0.26%  5356.74 ±0.44%      1.04%
TaskBurstWorkThroughput 1PcT 064.0PcWi  5284.73 ±0.22%  5321.00 ±0.39%      0.69%
TaskBurstWorkThroughput 1PcT 256.0PcWi  5323.88 ±0.10%  5264.35 ±0.39%     -1.12%
TaskSustainedWorkThroughput 1PcT        7874.07 ±0.88%  8029.77 ±0.66%      1.98%
--------------------------------------  --------------  --------------  ---------

CoreRT 48-core 48-thread 4-node         Left score      Right score     ∆ Score %
--------------------------------------  --------------  --------------  ---------
TaskBurstWorkThroughput 1PcT 000.5PcWi  3546.97 ±0.12%  3264.71 ±0.17%     -7.96%
TaskBurstWorkThroughput 1PcT 001.0PcWi  2133.60 ±0.23%  1823.23 ±0.49%    -14.55%
TaskBurstWorkThroughput 1PcT 004.0PcWi  4514.30 ±0.30%  3846.87 ±0.12%    -14.78%
TaskBurstWorkThroughput 1PcT 016.0PcWi  4867.67 ±0.29%  4115.50 ±0.21%    -15.45%
TaskBurstWorkThroughput 1PcT 064.0PcWi  4927.05 ±0.32%  4144.38 ±0.35%    -15.89%
TaskBurstWorkThroughput 1PcT 256.0PcWi  4541.31 ±0.50%  3946.43 ±0.45%    -13.10%
TaskSustainedWorkThroughput 1PcT        7061.97 ±0.41%  6064.36 ±0.37%    -14.13%
--------------------------------------  --------------  --------------  ---------

After the change to use thread-locals for more things (monitor lock contention counting in CoreCLR, all of the relevant counting in CoreRT):

CoreCLR 4-core 4-thread 1-node          Left score      Right score     ∆ Score %
--------------------------------------  --------------  --------------  ---------
TaskBurstWorkThroughput 1PcT 000.5PcWi   673.14 ±0.28%   672.85 ±0.25%     -0.04%
TaskBurstWorkThroughput 1PcT 001.0PcWi  1658.72 ±0.32%  1669.33 ±0.23%      0.64%
TaskBurstWorkThroughput 1PcT 004.0PcWi  1845.45 ±0.49%  1874.94 ±0.51%      1.60%
TaskBurstWorkThroughput 1PcT 016.0PcWi  1674.97 ±0.68%  1687.24 ±0.72%      0.73%
TaskBurstWorkThroughput 1PcT 064.0PcWi  1846.55 ±0.46%  1839.79 ±0.58%     -0.37%
TaskBurstWorkThroughput 1PcT 256.0PcWi  2024.60 ±0.35%  2034.10 ±0.31%      0.47%
TaskSustainedWorkThroughput 1PcT        2904.78 ±0.33%  3051.85 ±0.42%      5.06%
--------------------------------------  --------------  --------------  ---------

CoreRT 4-core 4-thread 1-node           Left score      Right score     ∆ Score %
--------------------------------------  --------------  --------------  ---------
TaskBurstWorkThroughput 1PcT 000.5PcWi   405.24 ±0.13%   406.74 ±0.17%      0.37%
TaskBurstWorkThroughput 1PcT 001.0PcWi   978.79 ±0.06%   985.89 ±0.06%      0.73%
TaskBurstWorkThroughput 1PcT 004.0PcWi  2934.53 ±0.09%  2922.13 ±0.11%     -0.42%
TaskBurstWorkThroughput 1PcT 016.0PcWi  2620.20 ±0.05%  2592.52 ±0.06%     -1.06%
TaskBurstWorkThroughput 1PcT 064.0PcWi  3055.01 ±0.16%  3041.20 ±0.07%     -0.45%
TaskBurstWorkThroughput 1PcT 256.0PcWi  3243.40 ±0.08%  3237.24 ±0.12%     -0.19%
TaskSustainedWorkThroughput 1PcT        3984.48 ±0.09%  3974.17 ±0.08%     -0.26%
--------------------------------------  --------------  --------------  ---------

CoreCLR 48-core 48-thread 4-node        Left score      Right score     ∆ Score %
--------------------------------------  --------------  --------------  ---------
TaskBurstWorkThroughput 1PcT 000.5PcWi   874.75 ±0.33%   706.56 ±0.46%    -19.23%
TaskBurstWorkThroughput 1PcT 001.0PcWi  4472.33 ±1.28%  2294.10 ±1.53%    -48.70%
TaskBurstWorkThroughput 1PcT 004.0PcWi  5324.16 ±0.33%  5411.74 ±0.44%      1.64%
TaskBurstWorkThroughput 1PcT 016.0PcWi  5321.38 ±0.27%  5379.79 ±0.13%      1.10%
TaskBurstWorkThroughput 1PcT 064.0PcWi  5267.20 ±0.39%  5419.75 ±0.20%      2.90%
TaskBurstWorkThroughput 1PcT 256.0PcWi  5241.08 ±0.46%  5389.11 ±0.26%      2.82%
TaskSustainedWorkThroughput 1PcT        7535.42 ±0.43%  7880.48 ±0.70%      4.58%
--------------------------------------  --------------  --------------  ---------

CoreRT 48-core 48-thread 4-node         Left score      Right score     ∆ Score %
--------------------------------------  --------------  --------------  ---------
TaskBurstWorkThroughput 1PcT 000.5PcWi  3531.92 ±0.11%  3302.87 ±0.11%     -6.49%
TaskBurstWorkThroughput 1PcT 001.0PcWi  2153.34 ±0.23%  1971.45 ±0.36%     -8.45%
TaskBurstWorkThroughput 1PcT 004.0PcWi  4518.53 ±0.21%  3803.60 ±0.14%    -15.82%
TaskBurstWorkThroughput 1PcT 016.0PcWi  4818.32 ±0.24%  4002.37 ±0.28%    -16.93%
TaskBurstWorkThroughput 1PcT 064.0PcWi  4917.06 ±0.43%  4065.70 ±0.29%    -17.31%
TaskBurstWorkThroughput 1PcT 256.0PcWi  4507.61 ±0.50%  3863.19 ±0.48%    -14.30%
TaskSustainedWorkThroughput 1PcT        6984.64 ±0.40%  6150.49 ±0.24%    -11.94%
--------------------------------------  --------------  --------------  ---------
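
The ∆ Score % column in the tables above appears to be the relative change of the right score against the left score. A minimal sketch of that arithmetic, assuming this is how the column was derived (the function name `delta_score_percent` is hypothetical, and the sample rows are copied from the "Initial change" tables):

```python
def delta_score_percent(left: float, right: float) -> float:
    """Relative change of the right (after) score versus the left (before) score, in percent."""
    if left == 0:
        raise ValueError("left score must be non-zero")
    return (right - left) / left * 100.0


# (left score, right score) pairs copied from the "Initial change" tables.
sample_rows = {
    "CoreCLR 4-core, TaskBurstWorkThroughput 1PcT 000.5PcWi": (670.88, 677.22),
    "CoreRT 48-core, TaskBurstWorkThroughput 1PcT 000.5PcWi": (3546.97, 3264.71),
}

if __name__ == "__main__":
    for name, (left, right) in sample_rows.items():
        print(f"{name}: {delta_score_percent(left, right):+.2f}%")
```

Rounded to two decimals, these two rows reproduce the 0.95% and -7.96% entries reported in the tables.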

Summary:

  • Updated the task burst tests to scale a bit better. The code is here, and I modified it to use a medium-length delay (RandomMediumDelay) for the numbers above.
  • None of these regressions or improvements are caused by either change (counting with interlocked or thread-local).
  • According to profiles, the perf changes appear to be in code that was not changed. Perhaps slight changes in timing are affecting where and how badly contention occurs.
  • For monitor lock contention counting, I tried a test in which multiple threads wait on an already acquired lock; once all threads reach a waiting state the lock is released, and this repeats. I did not see any noticeable contention from using interlocked counting for Monitor lock contentions before waiting.
  • Decided to go with the thread-local approach for all of the relevant counting. In the profiles it doesn't appear to be any worse than counting with interlocked operations, and it shouldn't be a cause of contention in the future. At the moment the difference doesn't seem to be noticeable. For the thread pool, there are probably other bottlenecks that would all need to be improved before a difference shows up.
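
To illustrate the thread-local counting idea from the summary above, here is a minimal Python sketch. It is not the CoreCLR or CoreRT implementation: the class `ThreadLocalCounter` and the helper `run_workers` are hypothetical names, and slots of exited threads are simply kept in a list rather than being folded into a shared total on thread exit.

```python
import threading


class ThreadLocalCounter:
    """Counter where each thread increments its own slot and readers sum all slots."""

    def __init__(self) -> None:
        self._local = threading.local()
        self._slots = []                      # one single-element list per thread
        self._slots_lock = threading.Lock()   # taken only to register a slot or to read

    def increment(self) -> None:
        slot = getattr(self._local, "slot", None)
        if slot is None:
            # First increment on this thread: create and register its slot.
            slot = [0]
            self._local.slot = slot
            with self._slots_lock:
                self._slots.append(slot)
        slot[0] += 1  # only the owning thread ever writes this slot

    @property
    def count(self) -> int:
        # The sum may lag slightly behind increments that are in flight.
        with self._slots_lock:
            return sum(slot[0] for slot in self._slots)


def run_workers(counter: ThreadLocalCounter, thread_count: int, increments_per_thread: int) -> int:
    """Increment the counter from several threads and return the total after they finish."""
    def work() -> None:
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.count
```

The hot path touches only per-thread state, so increments never contend with each other; the cost moves to the reader, which has to walk every slot. That trade-off fits counters that are incremented often and read rarely.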

kouvel (Member, Author) commented Mar 7, 2019

The CoreRT implementation is broken; ignore those perf numbers. I'll fix and retest.

kouvel (Member, Author) commented Mar 7, 2019

Perf numbers after the fixes were similar to before; updated inline above.

kouvel (Member, Author) commented Mar 7, 2019

@dotnet-bot test this please

kouvel (Member, Author) commented Mar 18, 2019

Ping for review please

kouvel (Member, Author) commented Mar 18, 2019

@dotnet-bot test Ubuntu arm Cross Checked crossgen_comparison Build and Test

jkotas (Member) commented Mar 19, 2019

I have been waiting for API review to be done before looking at this.

kouvel (Member, Author) commented Mar 19, 2019

Oh ok, I'll ping that one as well

kouvel (Member, Author) commented Mar 20, 2019

The API review was approved. A couple of pending items remain:

  • Whether to expose PendingLocalWorkItemCount and PendingGlobalWorkItemCount. I have asked for more info on whether they would be useful; by default I'll go ahead and remove them.
  • Whether we want to expose CompletedWorkItemCount at all.
    • I think that's the most expensive one to track of the lot, as it involves a bit of extra work per work item
    • It is possible to mostly eliminate the extra cost for managed work items in CoreCLR and CoreRT. Work items that may have some extra cost are timer callbacks in CoreRT Windows and IO completion work items in CoreCLR Windows.
    • The info would enable easily determining work item throughput and whether the thread pool is stalled
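
As a rough illustration of the last point, the sketch below shows how two samples of a completed-work-item count could yield throughput and a stall signal. It is written in Python for brevity. `PoolSample`, `throughput_per_ms`, and `looks_stalled` are hypothetical names, and the sampled values are assumed to come from counters like CompletedWorkItemCount and a pending work item count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolSample:
    """One observation of the (hypothetical) thread pool counters."""
    timestamp_s: float
    completed_work_items: int
    pending_work_items: int


def throughput_per_ms(earlier: PoolSample, later: PoolSample) -> float:
    """Work items completed per millisecond between two samples."""
    elapsed_ms = (later.timestamp_s - earlier.timestamp_s) * 1000.0
    if elapsed_ms <= 0:
        raise ValueError("samples must be taken at increasing timestamps")
    return (later.completed_work_items - earlier.completed_work_items) / elapsed_ms


def looks_stalled(earlier: PoolSample, later: PoolSample) -> bool:
    """Heuristic: work is queued but nothing completed between the two samples."""
    made_progress = later.completed_work_items > earlier.completed_work_items
    return later.pending_work_items > 0 and not made_progress
```

With samples taken 500 ms apart showing 1000 and then 2500 completed items, the throughput is 3 work items per ms. If a third sample still shows 2500 completed items while work remains queued, the heuristic flags a possible stall.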

@kouvel kouvel removed the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Mar 30, 2019
jkotas (Member) left a comment

LGTM

jkotas (Member) commented Apr 2, 2019

> Preserve flush for hill climbing

Why do you think we need the flush for hill climbing? Any reasonable processor will flush the caches to main memory at a reasonable rate. I would think that it is fine for hill climbing to miss the last few microseconds' worth of updates.

kouvel (Member, Author) commented Apr 2, 2019

> Why do you think we need the flush for hill climbing?

Hill climbing wants to know what the current counts are so that it can respond, and having the flush may help it respond sooner. Some processors may take longer than others to sync those updates. I doubt that syncing those updates would take long enough to be comparable to the hill climbing interval, but I'm not sure about that. I don't think it will make a noticeable difference on recent processors; I'm not sure about older ones.

jkotas (Member) commented Apr 2, 2019

> having the flush may help it respond sooner.

At the cost of slowing down every processor in the system...

kouvel (Member, Author) commented Apr 2, 2019

The hill climbing interval is between 10 and 200 ms, and is higher when the thread count is already at the minimum number of threads. I'm not sure if the perf hit would be significant on those time scales. I also doubt that the flush would improve anything; I just kept it because I don't have all of the information on why it was there to begin with.

kouvel (Member, Author) commented Apr 2, 2019

Changed my mind again; I'll just remove it. If not having a flush were an issue, it likely would have shown up in a bunch of other places before.

kouvel (Member, Author) commented Apr 3, 2019

@dotnet-bot test Windows_NT x64 Checked CoreFX Tests

kouvel (Member, Author) commented Apr 20, 2019

Merged in #24113

@kouvel kouvel closed this Apr 20, 2019
@kouvel kouvel deleted the ThreadMetrics branch April 20, 2019 02:23
kouvel added a commit to kouvel/corert that referenced this pull request Apr 23, 2019
jkotas pushed a commit to dotnet/corert that referenced this pull request Apr 23, 2019
* Implement APIs for some threading metrics (CoreRT)

- API review: https://github.com/dotnet/corefx/issues/35500
- May depend on dotnet/coreclr#22754

* Use thread-locals for counting, use finalizer instead of runtime to detect thread exit

* Don't let the count properties throw OOM

* Remove some flushes
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/coreclr that referenced this pull request Apr 23, 2019
* Implement APIs for some threading metrics (CoreRT)

- API review: https://github.com/dotnet/corefx/issues/35500
- May depend on dotnet#22754

* Use thread-locals for counting, use finalizer instead of runtime to detect thread exit

* Don't let the count properties throw OOM

* Remove some flushes

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/corefx that referenced this pull request Apr 23, 2019
* Implement APIs for some threading metrics (CoreRT)

- API review: https://github.com/dotnet/corefx/issues/35500
- May depend on dotnet/coreclr#22754

* Use thread-locals for counting, use finalizer instead of runtime to detect thread exit

* Don't let the count properties throw OOM

* Remove some flushes

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
stephentoub pushed a commit to dotnet/corefx that referenced this pull request Apr 23, 2019
* Implement APIs for some threading metrics (CoreRT)

- API review: https://github.com/dotnet/corefx/issues/35500
- May depend on dotnet/coreclr#22754

* Use thread-locals for counting, use finalizer instead of runtime to detect thread exit

* Don't let the count properties throw OOM

* Remove some flushes

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
jkotas pushed a commit that referenced this pull request Apr 23, 2019
* Implement APIs for some threading metrics (CoreRT)

- API review: https://github.com/dotnet/corefx/issues/35500
- May depend on #22754

* Use thread-locals for counting, use finalizer instead of runtime to detect thread exit

* Don't let the count properties throw OOM

* Remove some flushes

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
stephentoub pushed a commit to dotnet/corefx that referenced this pull request May 9, 2019
* Expose and test APIs for some threading metrics (CoreFX)

- API review: https://github.com/dotnet/corefx/issues/35500
- Depends on dotnet/coreclr#22754, dotnet/corert#7066

* Separate and expose pending local vs global work item count

* Remove local/global variants of PendingWorkItemCount

* Remove unrelated test

* Add test for a fix to ThreadLocal.Values property throwing NullReferenceException when disposed

Fix is in dotnet/corert#7066

* Fix build

* Fix test

* Add API compat baselines for uapaot

* Fix test

* Use RemoteExecutor for MetricsTest

* Address feedback
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
…et/coreclr#7066)

* Implement APIs for some threading metrics (CoreRT)

- API review: https://github.com/dotnet/corefx/issues/35500
- May depend on dotnet/coreclr#22754

* Use thread-locals for counting, use finalizer instead of runtime to detect thread exit

* Don't let the count properties throw OOM

* Remove some flushes

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>


Commit migrated from dotnet/coreclr@447b655
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
…fx#37401)

* Expose and test APIs for some threading metrics (CoreFX)

- API review: https://github.com/dotnet/corefx/issues/35500
- Depends on dotnet/coreclr#22754, dotnet/corert#7066

* Separate and expose pending local vs global work item count

* Remove local/global variants of PendingWorkItemCount

* Remove unrelated test

* Add test for a fix to ThreadLocal.Values property throwing NullReferenceException when disposed

Fix is in dotnet/corert#7066

* Fix build

* Fix test

* Add API compat baselines for uapaot

* Fix test

* Use RemoteExecutor for MetricsTest

* Address feedback


Commit migrated from dotnet/corefx@34fe566