Questions #24
Ok, never mind; I looked at the very clean source code and it obviously doesn't support it. It would be nice to have, though implementing it would not be easy. I would also add an IParallelProvider generic parameter. For the machine learning part, the input is usually computed in parallel: there are inputs x1, x2, x3, x4, so there is a matrix operation a(W·(x1|x2|x3|x4) + b) = (y1|y2|y3|y4). Allowing such an operation, with SIMD-optimized matrix multiplication plus Parallel.For, would make NetFabric.Numerics.Tensors very appealing for machine learning. My example uses a vector input, so W is 2D and each x is 1D, but nothing prevents using 2D images as input (making W a 3D tensor) and so on, even though input with three or more dimensions is very rare. Nonetheless, that's what I'm doing right now: I was searching for an open-source C# framework that does this, and this one is the closest to what I need. (I just need the matrix multiplication; I don't require other methods like singular value decomposition.)
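For the record, here is the batched forward pass described above written out as a plain scalar kernel (an illustrative sketch only, not NetFabric.Numerics.Tensors API; the `Forward` helper and the row-major layout are assumptions):

```csharp
// a(W·X + b) = Y for a batch X = (x1|x2|x3|x4), all buffers flat and row-major.
// W: outputs×inputs, X: inputs×batch, b: outputs, Y: outputs×batch.
static void Forward(
    ReadOnlySpan<float> w, ReadOnlySpan<float> x, ReadOnlySpan<float> b,
    Span<float> y, int outputs, int inputs, int batch)
{
    for (var o = 0; o < outputs; o++) // output rows are independent of each other
    {
        for (var n = 0; n < batch; n++)
        {
            var sum = b[o];
            for (var i = 0; i < inputs; i++)
                sum += w[o * inputs + i] * x[i * batch + n];
            y[o * batch + n] = MathF.Max(sum, 0f); // a = ReLU, just as an example activation
        }
    }
}
```

The loop over output rows is embarrassingly parallel, which is where Parallel.For would apply; a SIMD version would instead vectorize along the batch dimension, where X and Y are contiguous in memory.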
Hi @Darelbi! I kicked off this library to streamline SIMD operations on spans. Along the way, I stumbled upon System.Numerics.Tensors, which shared some similarities but came with its own set of limitations. So I've been refining my version to overcome those limitations and enhance both performance and functionality. I've laid a solid foundation and am now ready to ramp up improvements. I'm open to new ideas and contributions, and intrigued by the idea of using Parallel.For.
I experimented adding
Hi @Darelbi, I've been playing around with this idea lately. Unfortunately, every attempt I've made runs into the snag that a `Span<T>` is a ref struct, so it can neither be captured by a lambda nor stored on the heap. Check out this prototype you can fiddle with:

```csharp
using System.Numerics;
const int size = 10_000;
var source = Enumerable.Range(0, size).ToArray();
var destination = new int[size];

Tensor.Apply<int, DoubleOperator<int>>(source, destination);

Console.WriteLine("Array processing complete.");

static class Tensor
{
    public static void Apply<T, TOperator>(ReadOnlySpan<T> source, Span<T> destination)
        where TOperator : IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source, destination);

    public static void Apply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination)
        where TOperator : IUnaryOperator<T, TResult>
    {
        if (source.Length > 100)
            ParallelApply<T, TResult, TOperator>(source, destination);
        else
            Apply<T, TResult, TOperator>(source, destination, 0, source.Length);
    }

    static void ParallelApply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination)
        where TOperator : IUnaryOperator<T, TResult>
    {
        var availableCores = Environment.ProcessorCount;
        var size = source.Length;
        var chunkSize = size / availableCores;
        var actions = new Action[availableCores];
        for (var coreIndex = 0; coreIndex < availableCores; coreIndex++)
        {
            var startIndex = coreIndex * chunkSize;
            var endIndex = (coreIndex == availableCores - 1)
                ? size
                : (coreIndex + 1) * chunkSize;
            // Compile error here: 'source' and 'destination' are spans (ref structs)
            // and cannot be captured by the lambda closure.
            actions[coreIndex] = () => Apply<T, TResult, TOperator>(source, destination, startIndex, endIndex);
        }
        Parallel.Invoke(actions);
    }

    static void Apply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination, int startIndex, int endIndex)
        where TOperator : IUnaryOperator<T, TResult>
    {
        for (var index = startIndex; index < endIndex; index++)
        {
            destination[index] = TOperator.Invoke(source[index]);
        }
    }
}

interface IUnaryOperator<T, TResult>
{
    static abstract TResult Invoke(T x);
}

readonly struct DoubleOperator<T>
    : IUnaryOperator<T, T>
    where T : INumberBase<T>, IMultiplyOperators<T, T, T>
{
    public static T Invoke(T x) => T.CreateChecked(2) * x;
}
```

Attempting to pin the span, as suggested here, doesn't work with generics (see the sketch after the second prototype below). Also, giving a callback delegate a shot, as suggested here, lands us in the same heap issue (boxing). At this point, it seems like the only way forward is to switch to `Memory<T>`. Here's a prototype using `Memory<T>`:

```csharp
using System.Numerics;
const int size = 10_000;
var source = Enumerable.Range(0, size).ToArray();
var destination = new int[size];

Tensor.Apply<int, DoubleOperator<int>>(source, destination);

Console.WriteLine("Array processing complete.");

static class Tensor
{
    public static void Apply<T, TOperator>(ReadOnlyMemory<T> source, Memory<T> destination)
        where TOperator : IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source, destination);

    public static void Apply<T, TResult, TOperator>(ReadOnlyMemory<T> source, Memory<TResult> destination)
        where TOperator : IUnaryOperator<T, TResult>
    {
        if (source.Length > 100)
            ParallelApply<T, TResult, TOperator>(source, destination);
        else
            Apply<T, TResult, TOperator>(source.Span, destination.Span, 0, source.Length);
    }

    static void ParallelApply<T, TResult, TOperator>(ReadOnlyMemory<T> source, Memory<TResult> destination)
        where TOperator : IUnaryOperator<T, TResult>
    {
        var availableCores = Environment.ProcessorCount;
        var size = source.Length;
        var chunkSize = size / availableCores;
        var actions = new Action[availableCores];
        for (var coreIndex = 0; coreIndex < availableCores; coreIndex++)
        {
            var startIndex = coreIndex * chunkSize;
            var endIndex = (coreIndex == availableCores - 1)
                ? size
                : (coreIndex + 1) * chunkSize;
            // Memory<T> is an ordinary struct, so the lambda can capture it;
            // .Span is only evaluated inside the delegate.
            actions[coreIndex] = () => Apply<T, TResult, TOperator>(source.Span, destination.Span, startIndex, endIndex);
        }
        Parallel.Invoke(actions);
    }

    static void Apply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination, int startIndex, int endIndex)
        where TOperator : IUnaryOperator<T, TResult>
    {
        for (var index = startIndex; index < endIndex; index++)
        {
            destination[index] = TOperator.Invoke(source[index]);
        }
    }
}

interface IUnaryOperator<T, TResult>
{
    static abstract TResult Invoke(T x);
}

readonly struct DoubleOperator<T>
    : IUnaryOperator<T, T>
    where T : INumberBase<T>, IMultiplyOperators<T, T, T>
{
    public static T Invoke(T x) => T.CreateChecked(2) * x;
}
```

Do you have any more suggestions?
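For reference, a minimal sketch of why the pinning route clashes with generics (my illustration, not code from the thread): `fixed` requires the `unmanaged` constraint, which the library's public generic API doesn't impose, and it needs `<AllowUnsafeBlocks>` in the project file. `PinnedTensor` is hypothetical.

```csharp
// Sketch only: pinning lets lambdas capture the raw addresses (as nint), but it
// compiles solely because of the 'T : unmanaged' constraint below.
static unsafe class PinnedTensor
{
    public static void Apply<T>(ReadOnlySpan<T> source, Span<T> destination)
        where T : unmanaged // remove this and both 'fixed' statements fail to compile
    {
        fixed (T* sourcePtr = source)
        fixed (T* destinationPtr = destination)
        {
            var sourceAddress = (nint)sourcePtr;           // capturable by a lambda
            var destinationAddress = (nint)destinationPtr;
            var half = source.Length / 2;
            var length = source.Length;
            // Parallel.Invoke blocks until done, so the buffers stay pinned throughout.
            Parallel.Invoke(
                () => Process<T>(sourceAddress, destinationAddress, 0, half),
                () => Process<T>(sourceAddress, destinationAddress, half, length));
        }
    }

    static void Process<T>(nint source, nint destination, int start, int end)
        where T : unmanaged
    {
        var sourcePtr = (T*)source;
        var destinationPtr = (T*)destination;
        for (var index = start; index < end; index++)
            destinationPtr[index] = sourcePtr[index]; // identity op; a real operator would go here
    }
}
```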
I made some enhancements. Each chunk now has a minimum size, to avoid spending more time managing threads than processing the data. The APIs now support both `Memory<T>` and `Span<T>`, with array overloads for convenience:

```csharp
using System.Numerics;
const int size = 10_100;
var source = Enumerable.Range(0, size).ToArray();
var destination = new int[size];

Console.WriteLine("Array processing started.");
Tensor.Apply<int, DoubleOperator<int>>(source, destination);
Console.WriteLine("Array processing complete.");

static class Tensor
{
    const int minChunkSize = 100;

    public static void Apply<T, TOperator>(T[] source, T[] destination)
        where TOperator : IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source.AsMemory(), destination.AsMemory());

    public static void Apply<T, TResult, TOperator>(T[] source, TResult[] destination)
        where TOperator : IUnaryOperator<T, TResult>
        => Apply<T, TResult, TOperator>(source.AsMemory(), destination.AsMemory());

    public static void Apply<T, TOperator>(ReadOnlyMemory<T> source, Memory<T> destination)
        where TOperator : IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source, destination);

    public static void Apply<T, TResult, TOperator>(ReadOnlyMemory<T> source, Memory<TResult> destination)
        where TOperator : IUnaryOperator<T, TResult>
    {
        // Parallelize only when there is enough work for at least two chunks.
        if (source.Length > 2 * minChunkSize)
            ParallelApply<T, TResult, TOperator>(source, destination);
        else
            Apply<T, TResult, TOperator>(source.Span, destination.Span);
    }

    static void ParallelApply<T, TResult, TOperator>(ReadOnlyMemory<T> source, Memory<TResult> destination)
        where TOperator : IUnaryOperator<T, TResult>
    {
        var availableCores = Environment.ProcessorCount;
        var size = source.Length;
        var chunkSize = int.Max(size / availableCores, minChunkSize);
        var actions = new Action[size / chunkSize];
        for (var index = 0; index < actions.Length; index++)
        {
            var start = index * chunkSize;
            var length = (index == actions.Length - 1)
                ? size - start // the last chunk absorbs the remainder
                : chunkSize;
            Console.WriteLine($"Core: {index} Start: {start} Length: {length}");
            var sourceSlice = source.Slice(start, length);
            var destinationSlice = destination.Slice(start, length);
            actions[index] = () => Apply<T, TResult, TOperator>(sourceSlice.Span, destinationSlice.Span);
        }
        Console.WriteLine("Parallel processing started.");
        Parallel.Invoke(actions);
    }

    public static void Apply<T, TOperator>(ReadOnlySpan<T> source, Span<T> destination)
        where TOperator : IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source, destination);

    public static void Apply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination)
        where TOperator : IUnaryOperator<T, TResult>
    {
        Console.WriteLine($"Processing chunk! Source: {source.Length} Destination: {destination.Length}");
        // SIMD processing to be added here
        for (var index = 0; index < source.Length && index < destination.Length; index++)
            destination[index] = TOperator.Invoke(source[index]);
    }
}

interface IUnaryOperator<T, TResult>
{
    static abstract TResult Invoke(T x);
}

readonly struct DoubleOperator<T>
    : IUnaryOperator<T, T>
    where T : INumberBase<T>, IMultiplyOperators<T, T, T>
{
    public static T Invoke(T x) => T.CreateChecked(2) * x;
}
```
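As a side note, the manual chunking could also be handed off to the runtime's range partitioner. A minimal sketch reusing the same `IUnaryOperator` shape (`PartitionedTensor` is hypothetical, not part of the repository):

```csharp
using System.Collections.Concurrent;

static class PartitionedTensor
{
    // Same idea as ParallelApply above, but chunk sizing is delegated to the runtime;
    // the third argument of Partitioner.Create plays the role of minChunkSize.
    public static void Apply<T, TResult, TOperator>(ReadOnlyMemory<T> source, Memory<TResult> destination)
        where TOperator : IUnaryOperator<T, TResult>
        => Parallel.ForEach(
            Partitioner.Create(0, source.Length, 100),
            range =>
            {
                var (start, end) = (range.Item1, range.Item2);
                var sourceSpan = source.Span.Slice(start, end - start);
                var destinationSpan = destination.Span.Slice(start, end - start);
                for (var index = 0; index < sourceSpan.Length; index++)
                    destinationSpan[index] = TOperator.Invoke(sourceSpan[index]);
            });
}
```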
I ran tests on branch #29, but the results aren't too encouraging: the multicore performance is slower, regardless of whether SIMD is employed. I need to explore the hardware constraints to understand this better.
```
BenchmarkDotNet v0.13.12, Windows 10 (10.0.19045.4353/22H2/2022Update)
Intel Core i7-7567U CPU 3.50GHz (Kaby Lake), 1 CPU, 4 logical and 2 physical cores
.NET SDK 9.0.100-preview.1.24101.2
  [Host]    : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
  Scalar    : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT
  Vector128 : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX
  Vector256 : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
```

| Method | Job | Categories | Count | Mean | StdDev | Median | Ratio |
|---|---|---|---:|---:|---:|---:|---:|
| Baseline_Double | Scalar | Double | 1000 | 1,078.60 ns | 40.114 ns | 1,060.22 ns | baseline |
| System_Double | Scalar | Double | 1000 | 406.24 ns | 13.758 ns | 399.47 ns | 2.67x faster |
| NetFabric_Double | Scalar | Double | 1000 | 2,180.41 ns | 166.326 ns | 2,150.15 ns | 2.05x slower |
| Baseline_Double | Vector128 | Double | 1000 | 1,042.76 ns | 20.052 ns | 1,035.44 ns | 1.04x faster |
| System_Double | Vector128 | Double | 1000 | 205.04 ns | 4.889 ns | 203.97 ns | 5.28x faster |
| NetFabric_Double | Vector128 | Double | 1000 | 2,270.89 ns | 125.795 ns | 2,307.96 ns | 2.07x slower |
| Baseline_Double | Vector256 | Double | 1000 | 1,444.38 ns | 116.838 ns | 1,456.48 ns | 1.31x slower |
| System_Double | Vector256 | Double | 1000 | 152.14 ns | 11.168 ns | 149.91 ns | 7.20x faster |
| NetFabric_Double | Vector256 | Double | 1000 | 2,201.52 ns | 80.728 ns | 2,205.31 ns | 2.04x slower |
| | | | | | | | |
| Baseline_Float | Scalar | Float | 1000 | 1,209.83 ns | 61.967 ns | 1,197.02 ns | baseline |
| System_Float | Scalar | Float | 1000 | 480.37 ns | 33.768 ns | 472.45 ns | 2.54x faster |
| NetFabric_Float | Scalar | Float | 1000 | 2,359.35 ns | 93.242 ns | 2,387.54 ns | 1.96x slower |
| Baseline_Float | Vector128 | Float | 1000 | 770.33 ns | 53.606 ns | 750.29 ns | 1.57x faster |
| System_Float | Vector128 | Float | 1000 | 126.49 ns | 9.335 ns | 125.69 ns | 9.58x faster |
| NetFabric_Float | Vector128 | Float | 1000 | 2,152.34 ns | 89.694 ns | 2,153.25 ns | 1.79x slower |
| Baseline_Float | Vector256 | Float | 1000 | 762.05 ns | 79.493 ns | 753.45 ns | 1.56x faster |
| System_Float | Vector256 | Float | 1000 | 67.04 ns | 1.134 ns | 66.90 ns | 18.42x faster |
| NetFabric_Float | Vector256 | Float | 1000 | 1,999.26 ns | 90.642 ns | 2,017.39 ns | 1.66x slower |
| | | | | | | | |
| Baseline_Half | Scalar | Half | 1000 | 12,504.44 ns | 286.312 ns | 12,399.19 ns | baseline |
| System_Half | Scalar | Half | 1000 | 12,231.32 ns | 120.729 ns | 12,238.40 ns | 1.02x faster |
| NetFabric_Half | Scalar | Half | 1000 | 9,433.74 ns | 867.650 ns | 9,546.42 ns | 1.35x faster |
| Baseline_Half | Vector128 | Half | 1000 | 9,697.71 ns | 240.589 ns | 9,676.38 ns | 1.29x faster |
| System_Half | Vector128 | Half | 1000 | 10,333.35 ns | 852.316 ns | 9,931.87 ns | 1.18x faster |
| NetFabric_Half | Vector128 | Half | 1000 | 8,915.24 ns | 799.399 ns | 8,905.60 ns | 1.51x faster |
| Baseline_Half | Vector256 | Half | 1000 | 10,267.79 ns | 924.079 ns | 9,858.21 ns | 1.26x faster |
| System_Half | Vector256 | Half | 1000 | 9,777.72 ns | 98.069 ns | 9,765.89 ns | 1.28x faster |
| NetFabric_Half | Vector256 | Half | 1000 | 9,393.03 ns | 475.270 ns | 9,403.79 ns | 1.36x faster |
| | | | | | | | |
| Baseline_Int | Scalar | Int | 1000 | 1,297.64 ns | 12.022 ns | 1,299.23 ns | baseline |
| System_Int | Scalar | Int | 1000 | 407.63 ns | 4.247 ns | 409.42 ns | 3.18x faster |
| NetFabric_Int | Scalar | Int | 1000 | 2,341.00 ns | 112.485 ns | 2,360.99 ns | 1.69x slower |
| Baseline_Int | Vector128 | Int | 1000 | 1,353.19 ns | 75.724 ns | 1,316.32 ns | 1.05x slower |
| System_Int | Vector128 | Int | 1000 | 115.52 ns | 6.332 ns | 114.52 ns | 11.38x faster |
| NetFabric_Int | Vector128 | Int | 1000 | 2,108.18 ns | 110.913 ns | 2,122.89 ns | 1.54x slower |
| Baseline_Int | Vector256 | Int | 1000 | 1,307.51 ns | 21.841 ns | 1,305.11 ns | 1.01x slower |
| System_Int | Vector256 | Int | 1000 | 64.33 ns | 1.039 ns | 64.19 ns | 20.18x faster |
| NetFabric_Int | Vector256 | Int | 1000 | 1,993.01 ns | 90.504 ns | 2,016.42 ns | 1.55x slower |
| | | | | | | | |
| Baseline_Long | Scalar | Long | 1000 | 1,045.51 ns | 18.504 ns | 1,044.03 ns | baseline |
| System_Long | Scalar | Long | 1000 | 406.87 ns | 7.117 ns | 405.92 ns | 2.57x faster |
| NetFabric_Long | Scalar | Long | 1000 | 2,256.12 ns | 163.947 ns | 2,250.57 ns | 2.18x slower |
| Baseline_Long | Vector128 | Long | 1000 | 1,071.94 ns | 48.088 ns | 1,050.91 ns | 1.04x slower |
| System_Long | Vector128 | Long | 1000 | 207.46 ns | 4.846 ns | 205.69 ns | 5.03x faster |
| NetFabric_Long | Vector128 | Long | 1000 | 2,197.30 ns | 162.174 ns | 2,164.07 ns | 2.15x slower |
| Baseline_Long | Vector256 | Long | 1000 | 1,047.96 ns | 16.598 ns | 1,042.90 ns | 1.00x slower |
| System_Long | Vector256 | Long | 1000 | 123.71 ns | 0.750 ns | 123.83 ns | 8.46x faster |
| NetFabric_Long | Vector256 | Long | 1000 | 2,191.66 ns | 103.227 ns | 2,201.34 ns | 2.03x slower |
| | | | | | | | |
| Baseline_Short | Scalar | Short | 1000 | 1,050.32 ns | 13.160 ns | 1,051.75 ns | baseline |
| System_Short | Scalar | Short | 1000 | 413.54 ns | 14.802 ns | 409.65 ns | 2.52x faster |
| NetFabric_Short | Scalar | Short | 1000 | 2,185.30 ns | 169.597 ns | 2,129.97 ns | 2.16x slower |
| Baseline_Short | Vector128 | Short | 1000 | 1,042.56 ns | 10.547 ns | 1,041.07 ns | 1.01x faster |
| System_Short | Vector128 | Short | 1000 | 57.45 ns | 2.324 ns | 56.74 ns | 18.53x faster |
| NetFabric_Short | Vector128 | Short | 1000 | 2,001.51 ns | 93.791 ns | 2,016.08 ns | 1.89x slower |
| Baseline_Short | Vector256 | Short | 1000 | 1,125.94 ns | 93.649 ns | 1,092.19 ns | 1.05x slower |
| System_Short | Vector256 | Short | 1000 | 39.64 ns | 3.571 ns | 38.01 ns | 26.02x faster |
| NetFabric_Short | Vector256 | Short | 1000 | 1,980.78 ns | 87.917 ns | 2,002.02 ns | 1.85x slower |
Maybe you're running into a bandwidth limitation?
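For what it's worth, a back-of-envelope check of that hypothesis against the System_Double/Vector256 row above (my arithmetic, not a measurement from the thread):

```csharp
// 1000 doubles read and 1000 written in ~152 ns implies:
const int count = 1_000;
const double seconds = 152.14e-9;                     // mean from the table
const double bytesMoved = count * sizeof(double) * 2; // source read + destination write
Console.WriteLine($"{bytesMoved / seconds / 1e9:F0} GB/s"); // prints ≈ 105 GB/s
```

If that arithmetic holds, the single-threaded SIMD loop is already running at cache-level throughput on a 16 KB working set, and the whole kernel finishes in a few hundred nanoseconds, well below the cost of dispatching work to other threads.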
@Darelbi I kept on researching and wrote an article explaining the current steps: https://aalmada.github.io/posts/Unleashing-parallelism/ Feedback and ideas are welcome!
Thanks for it, very interesting. By the way, what are the specs of your system?
I've been testing it on multiple systems. The benchmarks in the article are for the AMD.
Hi! Thank you for this amazing library. However, it is not clear from the documentation whether it supports matrix/tensor multiplication.
Does it also employ thread parallelism (Parallel.For, in addition to SIMD instructions)?
If it supports tensor/matrix multiplication, it would be great for machine learning: for example, a forward pass in a neural network is just a(Wx + b), where W is the matrix of weights, x the input vector, b the bias, and a the activation function. If it is already supported, how do I use the tensor/matrix multiplication? Thanks!