-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Add SoA raytracer as a CQ test for Intel hardware intrinsic #18839
Conversation
@fiigii, You can find open-source/vectorized versions of these functions here (MIT Licensed, Microsoft Owned): https://github.com/Microsoft/DirectXMath/blob/master/Inc/DirectXMathVector.inl They should be good starting points for speed/precision. |
@tannergooding Thank you so much! Will try to port it to C#. |
You need to add a coreclr/tests/src/Common/test_dependencies/test_dependencies.csproj Lines 25 to 27 in e0e2859
|
tests/src/JIT/Performance/CodeQuality/HWIntrinsic/X86/PacketTracer/PacketTracer.cs
Outdated
Show resolved
Hide resolved
The precision issue of vectorized Pow() is fixed, and this program no longer crashes after #18849 merged (Thank @tannergooding) |
@@ -103,6 +103,9 @@ | |||
<PackageReference Include="xunit.runner.utility"> | |||
<Version>$(XunitPackageVersion)</Version> | |||
</PackageReference> | |||
<PackageReference Include="System.Runtime.Intrinsics.Experimental"> | |||
<Version>$(MicrosoftPrivateCoreFxNETCoreAppPackageVersion)</Version> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eerhardt I added this reference, but this still cannot be built. Do I need to add a new csproj
under tests\src\JIT\config
? Or do something like #16378? cc @AndyAyersMS
Benchmarks are still targeting netstandard1.4, and the new package is not compatible:
Seems like it ought to be simple to update the main benchmark dependence to netcoreapp2.1 or later, but this has proved difficult to fix. See #16126 for some attempts. You might be able to clone the benchmark config and target netcoreapp2.1 in the cloned version and then refer to for your new test. |
<Platform Condition=" '$(Platform)' == '' ">AnyCPU</Platform> | ||
<CLRTestKind>BuildOnly</CLRTestKind> | ||
<NugetTargetMoniker>.NETCoreApp,Version=v3.0</NugetTargetMoniker> | ||
<NugetTargetMonikerShort>netcoreapp3.0</NugetTargetMonikerShort> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AndyAyersMS Thank you so much for the help. I created a "benchmark+intrinsic.csproj" and target it to netcoreapp3.0., so it can be built now.
All the issues have been addressed, I think this PR is ready to review. @CarolEidt @tannergooding @AndyAyersMS @mikedn @eerhardt PTAL Now, the SoA raytracer outputs exactly same picture as the original AoS raytracer, and I will provide detailed profiling data later. |
@dotnet-bot test Windows_NT x64 Checked jitincompletehwintrinsic please @dotnet-bot test Windows_NT x86 Checked jitincompletehwintrinsic please @dotnet-bot test Ubuntu x64 Checked jitincompletehwintrinsic please |
It would be nice to have an Having an None of this required now, of course, but as future "up for grabs" work items. |
// The .NET Foundation licenses this file to you under the MIT license. | ||
// See the LICENSE file in the project root for more information. | ||
// | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General nit: It would be useful to format all these documents according to the recommended conventions....
Things like one statement per line, braces on their own line, no stray newlines, no trailing whitespace, a newline after the top level using statements and before the first namespace, etc....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe, VS, for the most part, should do this if you run the "Format Document" command
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fiigii, did this happen? It looks like we still have things like:
- Stray or missing newlines
- Unsorted usings
- Methods entirely on a single line
- etc
public Vector256<float> Distances; | ||
public Vector256<int> ThingIndeces; | ||
|
||
public static readonly Vector256<float> NullDistance = SetAllVector256<float>(float.MaxValue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why float.MaxValue
, instead of something like NaN
or Infinity
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we need to compare the NullDistance
against the result of Intersect
in method MinIntersect
.
internal struct Intersections | ||
{ | ||
public Vector256<float> Distances; | ||
public Vector256<int> ThingIndeces; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Indices
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
public static Int32RGBPacket256 ConvertToIntRGB(this VectorPacket256 colors) | ||
{ | ||
var one = SetAllVector256<float>(1.0f); | ||
var max = SetAllVector256<float>(255.0f); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is a static readonly
more efficient, since we don't have JIT support for this yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and, even if we did have JIT support, since the constants are per method, a static readonly might still be better...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, will cache them in static fields.
public static Camera Create(VectorPacket256 pos, VectorPacket256 lookAt) | ||
{ | ||
VectorPacket256 forward = (lookAt - pos).Normalize(); | ||
VectorPacket256 down = new VectorPacket256(SetAllVector256<float>(0), SetAllVector256<float>(-1), SetAllVector256<float>(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SetAllVector256<float>(0)
should be SetZeroVector256
?
VectorPacket256 forward = (lookAt - pos).Normalize(); | ||
VectorPacket256 down = new VectorPacket256(SetAllVector256<float>(0), SetAllVector256<float>(-1), SetAllVector256<float>(0)); | ||
VectorPacket256 right = SetAllVector256<float>(1.5f) * VectorPacket256.CrossProduct(forward, down).Normalize(); | ||
VectorPacket256 up = SetAllVector256<float>(1.5f) * VectorPacket256.CrossProduct(forward, right).Normalize(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be more efficient to create a single instance of SetAllVector256(1.5f)
and reuse it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
{ | ||
var cmp = Compare(dis, NullDistance, FloatComparisonMode.EqualOrderedNonSignaling); | ||
var zero = SetZeroVector256<int>(); | ||
var mask = Avx2.CompareEqual(zero, zero); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we getting a mask of ones this way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the most efficient way, some C++ compilers also generate it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A comment here would be useful then.
public ObjectPool(Func<T> generator, IProducerConsumerCollection<T> collection) | ||
: base(collection) | ||
{ | ||
if (generator == null) throw new ArgumentNullException("generator"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generator is null
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nameof(generator)
?
public T GetObject() | ||
{ | ||
T value; | ||
return base.TryTake(out value) ? value : _generator(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out T value
?
{ | ||
var items = new List<T>(); | ||
T value; | ||
while (base.TryTake(out value)) items.Add(value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out T value
?
|
||
protected override bool TryAdd(T item) | ||
{ | ||
PutObject(item); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm.... PutObject
itself calls base.TryAdd
, which returns a bool....
Seems we should have a TryPutObject
and shouldn't just always return true
|
||
protected override bool TryTake(out T item) | ||
{ | ||
item = GetObject(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is directly copied from AoS raytracer, dot not need to change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is an existing file, I agree it doesn't need to be changed in this PR.
It would still be nice to get it cleaned up separately, however.
|
||
public Packet256Tracer(int _width, int _hight) | ||
{ | ||
if (_width % VectorPacket256.Packet256Size != 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parenthesis around mathematical expressions are helpful 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(not everyone may agree, but that is my preference)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
internal class Packet256Tracer | ||
{ | ||
public int Width { get; private set; } | ||
public int Hight { get; private set; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Height
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
// See the LICENSE file in the project root for more information. | ||
// | ||
|
||
using System.Runtime.Intrinsics; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These usings are all unused.
|
||
internal abstract class ObjectPacket256 | ||
{ | ||
public Surface Surface { get; private set; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need private set
{ | ||
if ((_width % VectorPacket256.Packet256Size) != 0) | ||
{ | ||
_width += VectorPacket256.Packet256Size - _width % VectorPacket256.Packet256Size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: My preference is to add parenthesis where they help improve readability or where they help clarify operator precedence.
|
||
private static readonly Vector256<float> SevenToZero = SetVector256(7f, 6f, 5f, 4f, 3f, 2f, 1f, 0f); | ||
|
||
public Packet256Tracer(int _width, int _height) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parameters names should not start with _
Vector256<float> Xs = Add(SetAllVector256(fx), SevenToZero); | ||
var dirs = GetPoints(Xs, SetAllVector256<float>(y), camera); | ||
var rayPacket256 = new RayPacket256(camera.Pos, dirs); | ||
var SoAcolors = TraceRay(rayPacket256, scene, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Make 0
a named parameter to help improve readability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0
seems already very clear as the first value of depth
😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was indicating that, if you were to just read TraceRay(rayPacket256, scene, 0);
, it isn't obvious that 0 is for depth (you have to go look at the TraceRay
signature).
While TraceRay(rayPacket256, scene, depth: 0);
makes this immediately obvious
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see, misunderstood your words...
|
||
// `FastTranspose` returns an "incomplete" AoS structure, | ||
// which can be written into memory 16-byte by 16-byte. | ||
// Now, .NET Core does not guarantee the 32-byte alignment, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't think .NET Core guaranteed 16-byte alignment either (just that the first local was 16-byte aligned, or something similar)...
Avx2.ExtractVector128(output + 20, intAoS.Bs, 1); | ||
} | ||
|
||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like you commented out some unused code?
float heightRate1 = Height / 2.0f; | ||
float heightRate2 = Height * 2.0f; | ||
|
||
var recenteredX = Divide(Subtract(x, SetAllVector256(widthRate1)), SetAllVector256(widthRate2)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this more or less perforamnt than a SetAllVector256(Width)
and SetAllVector256(2)
along with a Multiple
and Divide
call?
@tannergooding Thank you so much for the review, I have addressed your feedback. |
logged the CRT issue at https://github.com/dotnet/coreclr/issues/19203 |
Removed the package dependency on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM.
I would still really like to see some of the document formatting fixed/improved (even if it is an up-for-grabs bug). There are a number of stray or missing newlines, multiple statements on a single line, unused or unsorted usings, code simplifications (sometimes from new language features, sometimes where the code was unnecessarily verbose), etc.
@dotnet-bot test OSX10.12 x64 Checked CoreFX Tests |
@dotnet-bot test OSX10.12 x64 Checked CoreFX Tests |
@CarolEidt, @AndyAyersMS, @eerhardt, @fiigii. Unless someone has feedback saying otherwise, I plan on merging this after the tests show as all green. |
I am fine with merging this. I had hoped to take time to review, but just haven't found the time, and would indeed like to see this added. I second Tanner's general comments about improving the documentation and formatting, and am fine with making that a future work item. One specific comment I had was that I hadn't (yet) found any attribution about where this came from. The existing RayTracer was derived from a sample program that was available on MSDN, and if this was also derived from that it would be good to note it somewhere. |
@CarolEidt Thanks for the comment. I will continue to improve the code/doc of this benchmark in future PRs. Especially, I have made a prototype of struct promotion to address the GC overhead from VectorPacket, so this benchmark may be changed to
This benchmark is not derived from the current RayTracer. I am just using its input data (models) and some top-level framework code with a totally new algorithm, so that these two benchmarks can be directly compared. I will add some comments to note the reused code in the next PR. Thanks! |
Merging. The CoreFX tests aren't impacted by this change and are failing with unrelated issues. |
This PR ports the current SIMD benchmark RayTracer ( using
Vector3
) to a SoA algorithm using AVX/AVX2 intrinsics. The new benchmark keeps the same shading approach of the original raytracer as much as possible, so they can generate the same images and be compared directly.Performance data (rendering a 2k image)
Updated the performance data with #19663
The data collected on
VTune characterization (module level)
Windows
Linux
According to the execution time and the module-level VTune data, we can see that
ucrtbase.dll
takes ~50% execution time.libm-2.23.so
has more reasonable performance (19.10%).Math.Pow
at https://github.com/dotnet/coreclr/blob/master/tests/src/JIT/Performance/CodeQuality/SIMD/RayTracer/Raytracer.cs#L153, which calls into CRT via P/Invoke.VectorMath.Pow
in managed code.The most obvious module-level difference between the baseline and SoA is that
coreclr.dll
/libcoreclr.so
/libc-2.23.so
)VTune characterization (managed code)
Windows
Linux
The codegen issues of RyuJIT have been logged at
VTune characterization (CoreCLR runtime)
Windows
Linux
Close https://github.com/dotnet/coreclr/issues/17798