-
Notifications
You must be signed in to change notification settings - Fork 4.9k
Add ref/in/out overloads of methods for Vectors and Matrices #25388
Conversation
… applicable. Also, use FEATURE pattern for Numerics.Vectors.Tests so they compile on both netcore and netfx.
@@ -0,0 +1,15 @@ | |||
<?xml version="1.0" encoding="utf-8"?> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, not sure how this file snuck in. I'll remove it.
@@ -52,6 +52,27 @@ public partial struct Matrix3x2 : System.IEquatable<System.Numerics.Matrix3x2> | |||
public static System.Numerics.Matrix3x2 operator -(System.Numerics.Matrix3x2 value) { throw null; } | |||
public static System.Numerics.Matrix3x2 Subtract(System.Numerics.Matrix3x2 value1, System.Numerics.Matrix3x2 value2) { throw null; } | |||
public override string ToString() { throw null; } | |||
#if FEATURE_REF_OVERLOADS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we build for netfx, and these APIs aren't in desktop (yet). But we wanted to make these APIs available in as many places as possible. So we are starting with netcoreapp and uap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@weshaggard, in many projects we add to the refs without special-casing them for netfx. What's special about this one and others that still require special-casing?
@eerhardt, personally I'd prefer if these were added in a .netcoreapp.cs specific file rather than using #ifdefs, but I'm ok being overruled if this is the practice we've already established elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I didn’t put it in .netcoreapp.cs file was because it gets compiled in for both netcoreapp AND uap. When searching other scenarios, I found the FEATURE pattern being used.
Ex
corefx/src/System.Security.Cryptography.Cng/ref/System.Security.Cryptography.Cng.cs
Line 269 in 1ce37df
#if FEATURE_HASHDATA // uap and netcoreapp specific |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I didn’t put it in .netcoreapp.cs file was because it gets compiled in for both netcoreapp AND uap.
We use "netcoreapp" all over the place when it's also used for uap. That's now the case for the vast majority of new APIs across corefx. The terminology needs to be cleaned up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use "netcoreapp" all over the place when it's also used for uap.
This seems like an anti-pattern to me. There are plenty of places where .netcoreapp.cs
means "only used on netcoreapp". Here's just a couple I found really quickly:
<Compile Include="System\Reflection\Internal\Utilities\EncodingHelper.netcoreapp.cs" Condition="'$(TargetGroup)' == 'netcoreapp'" /> |
corefx/src/System.Threading.Tasks/tests/System.Threading.Tasks.Tests.csproj
Lines 55 to 58 in 6f08f5a
<ItemGroup Condition="'$(TargetGroup)' == 'netcoreapp'"> | |
<Compile Include="CancellationTokenTests.netcoreapp.cs" /> | |
<Compile Include="Task\TaskStatusTest.netcoreapp.cs" /> | |
</ItemGroup> |
corefx/src/System.Drawing.Primitives/tests/System.Drawing.Primitives.Tests.csproj
Lines 25 to 29 in 6f08f5a
<ItemGroup Condition="'$(TargetGroup)'=='netcoreapp'"> | |
<Compile Include="SizeFTests.netcoreapp.cs" /> | |
<Compile Include="ColorTests.netcoreapp.cs" /> | |
<Compile Include="SizeTests.netcoreapp.cs" /> | |
</ItemGroup> |
This one even mixes them both into the same project:
corefx/src/System.Runtime.Extensions/tests/System.Runtime.Extensions.Tests.csproj
Lines 34 to 39 in 6f08f5a
<ItemGroup Condition="'$(TargetGroup)'=='netcoreapp'"> | |
<Compile Include="System\Random.netcoreapp.cs" /> | |
</ItemGroup> | |
<ItemGroup Condition="'$(TargetGroup)'!='netstandard'"> | |
<Compile Include="System\BitConverterSpan.cs" /> | |
<Compile Include="System\BitConverter.netcoreapp.cs" /> |
I assume the intention of suffixing these files with a TFM was so that it was easily recognizable that this file was only compiled for this TFM. When we use one TFM to mean it is sometimes compiled for 2 TFMs, the only way you can know is to go look in the .csproj file.
I'd be happy to split these out into a separate file, but maybe a .netcoreapp.uap.cs
suffix would be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like I said, the terminology needs updating. "netcoreapp" has morphed into also including all new APIs, as all new APIs are being added for all platforms that use corefx/coreclr/corert.
I'm going to actually review this in a bit. However, my primary concern (from a brief glance) is that this will lead to poorer performance on architectures using the System V ABI, where these are already supposed to be passing around via register (and will now be passed as a reference instead). This might also lead to poorer performance if Windows ever gets |
Had wondered about this as BOTR ABI refers to the "x64 Software Conventions" which does specify use of XMM0:YMM15 registers; but via |
result.M22 = 1.0f; | ||
|
||
result.M31 = position.X; | ||
result.M32 = position.Y; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than duplicating all of these implementations, would it make sense to implement the existing overload in terms of the new one? e.g.
public static Matrix3x2 CreateTranslation(Vector2 position)
{
Matrix3x2 result;
CreateTranslation(in position, out result);
return result;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will force architectures using System V ABI
to have worse performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will force architectures using System V ABI to have worse performance.
To what extent? If significant, then ok, but there is a ton of duplicated code being added here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the Jit used __vectorcall
then wouldn't need any in
params for the Vector types as they are all register sized?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stephentoub, I believe it would be measurable. It should be whatever the latency/overhead of copying from register to memory would be and back would be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it would be measurable
Can we measure then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've seen various poor CQ issues in the JIT around handling the SysV conventions for structs; things may be better for Vectors/HVAs but I' be kind of surprised if this is the case. In particular the jit seems to spill promoted structs to memory before calls (then reloading into registers) and vice versa on method entry.
So, please look carefully at the actual code the jit generates in any sort of A/B experiment. The results may say more about the abilities or limitations of the jit than any inherent "best" way of passing args.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are limitations or poor CQ with regard to Vectors/HVAs, then I would prefer to see those get fixed over adding new APIs which only patch the issue.
That being said, I will be sure to also look at the codegen when comparing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been able to run some benchmarks today. I'm using the suite that @mellinoe put together in the issue #157.
Machine 1: OS=ubuntu 16.04 Processor=Intel Xeon CPU E5-1620 0 3.60GHz, ProcessorCount=8 .NET Core 2.1.0-preview1-25919-02 (Framework 4.6.25917.03), 64bit RyuJIT
Baseline (none of my changes)
Method | Mean | Error | StdDev |
---|---|---|---|
Matrix4x4ByValue | 271.720 ns | 0.2741 ns | 0.2430 ns |
QuaterionByValue | 206.123 ns | 0.0605 ns | 0.0566 ns |
Vector3ByValue | 11.839 ns | 0.0379 ns | 0.0354 ns |
Vector3SimpleAddByValue | 2.118 ns | 0.0017 ns | 0.0016 ns |
Current changes (duplicated code)
Method | Mean | Error | StdDev |
---|---|---|---|
Matrix4x4ByValue | 275.4029 ns | 0.5192 ns | 0.4335 ns |
Matrix4x4ByRef | 201.6788 ns | 0.9473 ns | 0.7910 ns |
QuaterionByValue | 202.4441 ns | 0.1351 ns | 0.1128 ns |
QuaterionByRef | 282.7828 ns | 0.1976 ns | 0.1848 ns |
Vector3ByValue | 12.3704 ns | 0.0042 ns | 0.0035 ns |
Vector3ByRef | 0.7947 ns | 0.0009 ns | 0.0007 ns |
Vector3SimpleAddByValue | 1.8711 ns | 0.0018 ns | 0.0016 ns |
Vector3SimpleAddByRef | 0.8011 ns | 0.0211 ns | 0.0187 ns |
Doing some refactoring suggested here
See this commit for the code changes
Method | Mean | Error | StdDev |
---|---|---|---|
Matrix4x4ByValue | 250.2053 ns | 0.2409 ns | 0.2135 ns |
Matrix4x4ByRef | 235.3991 ns | 0.1826 ns | 0.1619 ns |
QuaterionByValue | 338.3308 ns | 0.3273 ns | 0.2733 ns |
QuaterionByRef | 283.7789 ns | 0.4899 ns | 0.4343 ns |
Vector3ByValue | 21.1915 ns | 0.0120 ns | 0.0106 ns |
Vector3ByRef | 0.7941 ns | 0.0013 ns | 0.0012 ns |
Vector3SimpleAddByValue | 1.8746 ns | 0.0022 ns | 0.0019 ns |
Vector3SimpleAddByRef | 0.8000 ns | 0.0135 ns | 0.0120 ns |
Machine 2: OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.19) Processor=, ProcessorCount=8 .NET Core 2.1.0-preview1-25907-02 (Framework 4.6.25901.06), 64bit RyuJIT
I also ran the same tests on my Windows machine:
Baseline (none of my changes)
Method | Mean | Error | StdDev |
---|---|---|---|
Matrix4x4ByValue | 205.941 ns | 0.8044 ns | 0.7524 ns |
QuaterionByValue | 157.711 ns | 0.5429 ns | 0.5078 ns |
Vector3ByValue | 9.050 ns | 0.0350 ns | 0.0328 ns |
Vector3SimpleAddByValue | 1.391 ns | 0.0098 ns | 0.0087 ns |
Current changes (duplicated code)
Method | Mean | Error | StdDev |
---|---|---|---|
Matrix4x4ByValue | 212.2870 ns | 1.2065 ns | 1.1285 ns |
Matrix4x4ByRef | 159.5631 ns | 0.5604 ns | 0.5242 ns |
QuaterionByValue | 158.7156 ns | 0.7299 ns | 0.6095 ns |
QuaterionByRef | 147.2884 ns | 0.6121 ns | 0.5111 ns |
Vector3ByValue | 9.0204 ns | 0.0310 ns | 0.0290 ns |
Vector3ByRef | 9.8258 ns | 0.0334 ns | 0.0296 ns |
Vector3SimpleAddByValue | 1.3929 ns | 0.0118 ns | 0.0110 ns |
Vector3SimpleAddByRef | 0.4443 ns | 0.0075 ns | 0.0066 ns |
Doing some refactoring suggested here
Method | Mean | Error | StdDev |
---|---|---|---|
Matrix4x4ByValue | 181.4270 ns | 0.5213 ns | 0.4876 ns |
Matrix4x4ByRef | 172.6201 ns | 0.6639 ns | 0.5885 ns |
QuaterionByValue | 181.0366 ns | 0.7396 ns | 0.6918 ns |
QuaterionByRef | 148.7591 ns | 0.6140 ns | 0.5744 ns |
Vector3ByValue | 19.4549 ns | 0.0934 ns | 0.0828 ns |
Vector3ByRef | 9.2186 ns | 0.0154 ns | 0.0129 ns |
Vector3SimpleAddByValue | 1.3787 ns | 0.0106 ns | 0.0099 ns |
Vector3SimpleAddByRef | 0.4396 ns | 0.0110 ns | 0.0097 ns |
On both Windows and Ubuntu, it appears that the Matrix4x4ByValue
test gets better with the refactoring, but both QuaterionByValue
and Vector3ByValue
get worse.
My thoughts are that we shouldn't do the refactoring at this time, as it adds risk to this change
Another observation is that @tannergooding's initial primary concern holds for the QuaterionByValue
vs. QuaterionByRef
test on Ubuntu.
Method | Mean | Error | StdDev |
---|---|---|---|
QuaterionByValue | 202.4441 ns | 0.1351 ns | 0.1128 ns |
QuaterionByRef | 282.7828 ns | 0.1976 ns | 0.1848 ns |
In this scenario, on my Ubuntu machine, using the new ref
overloads appears to lead to poorer performance. But on my Windows machine, this isn't true and it wasn't true for @mellinoe's reports either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But on my Windows machine, this isn't true and it wasn't true for @mellinoe's reports either.
On Windows, only the default x64 calling convention is used and there isn't support for __vectorcall
today. This means that (on Windows) all of these are passed as normal structs (requiring shadow copying to the stack) and will likely see a perf increase (due to no longer requiring shadow-copying of the values).
On Unix based systems, we use the System V ABI, which allows these values to be passed in register, without shadow copying to the stack. However, as @AndyAyersMS pointed out, the code gen for System V
ABI isn't the best and we are still shadow copying the values in some scenarios (which appears to be why you are seeing improvements).
I'm still working on collecting actual codegen data here.
#if FEATURE_REF_OVERLOADS | ||
Vector2.Distance(in a, in b, out actual); | ||
Assert.True(MathHelper.Equal(expected, actual), "Vector2f.Distance did not return the expected value."); | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've generally separated such tests out into their own .netcoreapp.cs files.
9b9a89c
to
57a6e73
Compare
Can someone explain to me why are |
Does @mellinoe's answer here help? https://github.com/dotnet/corefx/issues/157#issuecomment-231807676 Specifically
|
Which runtimes to be precise? .NET Framework x86 (that doesn't use RyuJIT yet)? .NET Framework 2.0-3.5? Unknown runtime nobody uses? I think that such API clutter deserves a less vague explanation. |
@tannergooding writes:
Getting some good scenarios, even as micro-benchmarks if they're known to be representative, would be desirable either way we decide to go on this. @mellinoe wrote:
I am of two minds on this. I would really like to see us rely on good codegen for this. And note that, at least for the issues that I've seen, the problem isn't that the JIT isn't optimizing adequately, it's that the code that implements struct arguments and return values is unnecessarily generating conservative code - both in introducing excessive copies as well as marking structs as address-taken when they're not. I don't think that fixing it is a large task, and much of it is outlined in https://github.com/dotnet/coreclr/blob/master/Documentation/design-docs/first-class-structs.md, but it hasn't bubbled to the top priority. |
As pointed out in dotnet/csharplang#1155 Will the Jit elide this copy? |
That's a different Vector3, it has property getters and setters.
That may turn out to be problematic, the JIT will have to figure out that v1 and v2 fields are read before the result is written. It's funny how people tend to think that passing values by refs has only positive consequences. |
I was rather troubled by the large performance differences, so I took a look at the benchmarks. What should have been obvious to me right away is that these benchmarks aren't using their results, so much or all of the code gets optimized away. By adding a static dummy float to the
There's a separate question about why in the Ref case more code is optimized away, but that's really an orthogonal issue. A more interesting question to me is how much of the One more note:
|
Aha, so as I suspected it's only
|
Hmm, turns out that these contain a hidden copy - they create the matrix in a local variable and then copy it to the output buffer. It looks like that copy isn't necessary, it's probably only needed if an exception can be thrown and that's not the case here. |
@eerhardt, given the feedback, what's the plan here? |
I think the plan is to close this PR and the related issue as 'won't fix'. And then log a coreclr bug for the Matrix4x4 performance issues illustrated above. @CarolEidt and @tannergooding - does that sound like a good plan? |
I think that's a great plan. |
This PR uses 'in' . Does that (in) actually pass by REference? https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/keywords/ref Only shows 'ref' and 'out' as methods to pass by reference. The benchmarks shown above only show a insiginifcant improvement in performance, if it was that slight, I wouldn't have put so much effort into making a case for making new methods that pass values by reference. My main system is not up for now but I'm sure there's benchmarks that show more notable improvement, especially in the case of matrix 4x4. |
@d3x0r - the
I've logged https://github.com/dotnet/coreclr/issues/15467 to improve the code gen for this scenario. |
Adding overloads to Vector and Matrix types for better memory and performance characteristics.
Fixes #157
Notes
The EOF newline changes came from merging these old commits into the master branch. I wanted to preserve the commit history, so I didn't try fixing up these newline changes.FIXED#if FEATURE_REF_OVERLOADS
in the tests. But I didn't see a great way to factor them without the#if
s. I felt keeping the tests contained together was a high priority, that way the "expected" values were only in a single place. I'm open to better ideas here.