Questions #24
Ok, never mind; I looked at the very clean source code and it obviously doesn't support it. It would be nice to have, though implementing it would not be easy. I would also add an IParallelProvider generic parameter. For the machine learning part, the input is usually computed in parallel: there are inputs x1, x2, x3, x4, so there is a matrix operation a(W·(x1|x2|x3|x4) + b) = (y1|y2|y3|y4). Allowing such an operation, with SIMD-optimized matrix multiplication plus Parallel.For, would make NetFabric.Numerics.Tensors very appealing for machine learning. My example uses a vector input, so W is 2D and each x is 1D, but nothing prevents using 2D images as input (making W a 3D tensor) and so on, even though input with three or more dimensions is very rare. Nonetheless, that's what I'm doing right now: I was searching for an open-source C# framework that does this, and this one is the closest to what I need. (I just need the matrix multiplication; I don't require other methods like singular value decomposition.)
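For the record, here is the batched forward pass described above written out as a plain scalar kernel (an illustrative sketch only, not NetFabric.Numerics.Tensors API; the `Forward` helper and the row-major layout are assumptions):

```csharp
// a(W·X + b) = Y for a batch X = (x1|x2|x3|x4), all buffers flat and row-major.
// W: outputs×inputs, X: inputs×batch, b: outputs, Y: outputs×batch.
static void Forward(
    ReadOnlySpan<float> w, ReadOnlySpan<float> x, ReadOnlySpan<float> b,
    Span<float> y, int outputs, int inputs, int batch)
{
    for (var o = 0; o < outputs; o++) // output rows are independent of each other
    {
        for (var n = 0; n < batch; n++)
        {
            var sum = b[o];
            for (var i = 0; i < inputs; i++)
                sum += w[o * inputs + i] * x[i * batch + n];
            y[o * batch + n] = MathF.Max(sum, 0f); // a = ReLU, just as an example activation
        }
    }
}
```

The loop over output rows is embarrassingly parallel, which is where Parallel.For would apply; a SIMD version would instead vectorize along the batch dimension, where X and Y are contiguous in memory.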
Hi @Darelbi! I kicked off this library to streamline SIMD operations on spans. Along the way, I stumbled upon System.Numerics.Tensors, which shared some similarities but came with its own set of limitations. So I've been refining my version to overcome those limitations and enhance both performance and functionality. I've laid a solid foundation and am now ready to ramp up improvements. I'm open to new ideas and contributions, and intrigued by the idea of using Parallel.For.
I experimented adding
Hi @Darelbi, I've been playing around with this idea lately. Unfortunately, every attempt I've made runs into the snag that a `Span<T>` is a ref struct, so it can neither be captured by a lambda nor stored on the heap. Check out this prototype you can fiddle with:

```csharp
using System.Numerics;
const int size = 10_000;
var source = Enumerable.Range(0, size).ToArray();
var destination = new int[size];

Tensor.Apply<int, DoubleOperator<int>>(source, destination);

Console.WriteLine("Array processing complete.");

static class Tensor
{
    public static void Apply<T, TOperator>(ReadOnlySpan<T> source, Span<T> destination)
        where TOperator : IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source, destination);

    public static void Apply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination)
        where TOperator : IUnaryOperator<T, TResult>
    {
        if (source.Length > 100)
            ParallelApply<T, TResult, TOperator>(source, destination);
        else
            Apply<T, TResult, TOperator>(source, destination, 0, source.Length);
    }

    static void ParallelApply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination)
        where TOperator : IUnaryOperator<T, TResult>
    {
        var availableCores = Environment.ProcessorCount;
        var size = source.Length;
        var chunkSize = size / availableCores;
        var actions = new Action[availableCores];
        for (var coreIndex = 0; coreIndex < availableCores; coreIndex++)
        {
            var startIndex = coreIndex * chunkSize;
            var endIndex = (coreIndex == availableCores - 1)
                ? size
                : (coreIndex + 1) * chunkSize;
            // Compile error here: 'source' and 'destination' are spans (ref structs)
            // and cannot be captured by the lambda closure.
            actions[coreIndex] = () => Apply<T, TResult, TOperator>(source, destination, startIndex, endIndex);
        }
        Parallel.Invoke(actions);
    }

    static void Apply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination, int startIndex, int endIndex)
        where TOperator : IUnaryOperator<T, TResult>
    {
        for (var index = startIndex; index < endIndex; index++)
        {
            destination[index] = TOperator.Invoke(source[index]);
        }
    }
}

interface IUnaryOperator<T, TResult>
{
    static abstract TResult Invoke(T x);
}

readonly struct DoubleOperator<T>
    : IUnaryOperator<T, T>
    where T : INumberBase<T>, IMultiplyOperators<T, T, T>
{
    public static T Invoke(T x) => T.CreateChecked(2) * x;
}
```

Attempting to pin the span, as suggested here, doesn't work with generics (see the sketch after the second prototype below). Also, giving a callback delegate a shot, as suggested here, lands us in the same heap issue (boxing). At this point, it seems like the only way forward is to switch to `Memory<T>`. Here's a prototype using `Memory<T>`:

```csharp
using System.Numerics;
const int size = 10_000;
var source = Enumerable.Range(0, size).ToArray();
var destination = new int[size];

Tensor.Apply<int, DoubleOperator<int>>(source, destination);

Console.WriteLine("Array processing complete.");

static class Tensor
{
    public static void Apply<T, TOperator>(ReadOnlyMemory<T> source, Memory<T> destination)
        where TOperator : IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source, destination);

    public static void Apply<T, TResult, TOperator>(ReadOnlyMemory<T> source, Memory<TResult> destination)
        where TOperator : IUnaryOperator<T, TResult>
    {
        if (source.Length > 100)
            ParallelApply<T, TResult, TOperator>(source, destination);
        else
            Apply<T, TResult, TOperator>(source.Span, destination.Span, 0, source.Length);
    }

    static void ParallelApply<T, TResult, TOperator>(ReadOnlyMemory<T> source, Memory<TResult> destination)
        where TOperator : IUnaryOperator<T, TResult>
    {
        var availableCores = Environment.ProcessorCount;
        var size = source.Length;
        var chunkSize = size / availableCores;
        var actions = new Action[availableCores];
        for (var coreIndex = 0; coreIndex < availableCores; coreIndex++)
        {
            var startIndex = coreIndex * chunkSize;
            var endIndex = (coreIndex == availableCores - 1)
                ? size
                : (coreIndex + 1) * chunkSize;
            // Memory<T> is an ordinary struct, so the lambda can capture it;
            // .Span is only evaluated inside the delegate.
            actions[coreIndex] = () => Apply<T, TResult, TOperator>(source.Span, destination.Span, startIndex, endIndex);
        }
        Parallel.Invoke(actions);
    }

    static void Apply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination, int startIndex, int endIndex)
        where TOperator : IUnaryOperator<T, TResult>
    {
        for (var index = startIndex; index < endIndex; index++)
        {
            destination[index] = TOperator.Invoke(source[index]);
        }
    }
}

interface IUnaryOperator<T, TResult>
{
    static abstract TResult Invoke(T x);
}

readonly struct DoubleOperator<T>
    : IUnaryOperator<T, T>
    where T : INumberBase<T>, IMultiplyOperators<T, T, T>
{
    public static T Invoke(T x) => T.CreateChecked(2) * x;
}
```

Do you have any more suggestions?
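For reference, a minimal sketch of why the pinning route clashes with generics (my illustration, not code from the thread): `fixed` requires the `unmanaged` constraint, which the library's public generic API doesn't impose, and it needs `<AllowUnsafeBlocks>` in the project file. `PinnedTensor` is hypothetical.

```csharp
// Sketch only: pinning lets lambdas capture the raw addresses (as nint), but it
// compiles solely because of the 'T : unmanaged' constraint below.
static unsafe class PinnedTensor
{
    public static void Apply<T>(ReadOnlySpan<T> source, Span<T> destination)
        where T : unmanaged // remove this and both 'fixed' statements fail to compile
    {
        fixed (T* sourcePtr = source)
        fixed (T* destinationPtr = destination)
        {
            var sourceAddress = (nint)sourcePtr;           // capturable by a lambda
            var destinationAddress = (nint)destinationPtr;
            var half = source.Length / 2;
            var length = source.Length;
            // Parallel.Invoke blocks until done, so the buffers stay pinned throughout.
            Parallel.Invoke(
                () => Process<T>(sourceAddress, destinationAddress, 0, half),
                () => Process<T>(sourceAddress, destinationAddress, half, length));
        }
    }

    static void Process<T>(nint source, nint destination, int start, int end)
        where T : unmanaged
    {
        var sourcePtr = (T*)source;
        var destinationPtr = (T*)destination;
        for (var index = start; index < end; index++)
            destinationPtr[index] = sourcePtr[index]; // identity op; a real operator would go here
    }
}
```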
I made some enhancements. Each chunk now has a minimum size, to avoid spending more time managing threads than processing the data. The APIs now support both `Memory<T>` and `Span<T>`, with array overloads for convenience:

```csharp
using System.Numerics;
const int size = 10_100;
var source = Enumerable.Range(0, size).ToArray();
var destination = new int[size];

Console.WriteLine("Array processing started.");
Tensor.Apply<int, DoubleOperator<int>>(source, destination);
Console.WriteLine("Array processing complete.");

static class Tensor
{
    const int minChunkSize = 100;

    public static void Apply<T, TOperator>(T[] source, T[] destination)
        where TOperator : IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source.AsMemory(), destination.AsMemory());

    public static void Apply<T, TResult, TOperator>(T[] source, TResult[] destination)
        where TOperator : IUnaryOperator<T, TResult>
        => Apply<T, TResult, TOperator>(source.AsMemory(), destination.AsMemory());

    public static void Apply<T, TOperator>(ReadOnlyMemory<T> source, Memory<T> destination)
        where TOperator : IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source, destination);

    public static void Apply<T, TResult, TOperator>(ReadOnlyMemory<T> source, Memory<TResult> destination)
        where TOperator : IUnaryOperator<T, TResult>
    {
        // Parallelize only when there is enough work for at least two chunks.
        if (source.Length > 2 * minChunkSize)
            ParallelApply<T, TResult, TOperator>(source, destination);
        else
            Apply<T, TResult, TOperator>(source.Span, destination.Span);
    }

    static void ParallelApply<T, TResult, TOperator>(ReadOnlyMemory<T> source, Memory<TResult> destination)
        where TOperator : IUnaryOperator<T, TResult>
    {
        var availableCores = Environment.ProcessorCount;
        var size = source.Length;
        var chunkSize = int.Max(size / availableCores, minChunkSize);
        var actions = new Action[size / chunkSize];
        for (var index = 0; index < actions.Length; index++)
        {
            var start = index * chunkSize;
            var length = (index == actions.Length - 1)
                ? size - start // the last chunk absorbs the remainder
                : chunkSize;
            Console.WriteLine($"Core: {index} Start: {start} Length: {length}");
            var sourceSlice = source.Slice(start, length);
            var destinationSlice = destination.Slice(start, length);
            actions[index] = () => Apply<T, TResult, TOperator>(sourceSlice.Span, destinationSlice.Span);
        }
        Console.WriteLine("Parallel processing started.");
        Parallel.Invoke(actions);
    }

    public static void Apply<T, TOperator>(ReadOnlySpan<T> source, Span<T> destination)
        where TOperator : IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source, destination);

    public static void Apply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination)
        where TOperator : IUnaryOperator<T, TResult>
    {
        Console.WriteLine($"Processing chunk! Source: {source.Length} Destination: {destination.Length}");
        // SIMD processing to be added here
        for (var index = 0; index < source.Length && index < destination.Length; index++)
            destination[index] = TOperator.Invoke(source[index]);
    }
}

interface IUnaryOperator<T, TResult>
{
    static abstract TResult Invoke(T x);
}

readonly struct DoubleOperator<T>
    : IUnaryOperator<T, T>
    where T : INumberBase<T>, IMultiplyOperators<T, T, T>
{
    public static T Invoke(T x) => T.CreateChecked(2) * x;
}
```
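As a side note, the manual chunking could also be handed off to the runtime's range partitioner. A minimal sketch reusing the same `IUnaryOperator` shape (`PartitionedTensor` is hypothetical, not part of the repository):

```csharp
using System.Collections.Concurrent;

static class PartitionedTensor
{
    // Same idea as ParallelApply above, but chunk sizing is delegated to the runtime;
    // the third argument of Partitioner.Create plays the role of minChunkSize.
    public static void Apply<T, TResult, TOperator>(ReadOnlyMemory<T> source, Memory<TResult> destination)
        where TOperator : IUnaryOperator<T, TResult>
        => Parallel.ForEach(
            Partitioner.Create(0, source.Length, 100),
            range =>
            {
                var (start, end) = (range.Item1, range.Item2);
                var sourceSpan = source.Span.Slice(start, end - start);
                var destinationSpan = destination.Span.Slice(start, end - start);
                for (var index = 0; index < sourceSpan.Length; index++)
                    destinationSpan[index] = TOperator.Invoke(sourceSpan[index]);
            });
}
```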
I ran tests on branch #29, but the results aren't too encouraging: the multicore performance is slower, regardless of whether SIMD is employed. I need to explore the hardware constraints to understand this better.
```
BenchmarkDotNet v0.13.12, Windows 10 (10.0.19045.4353/22H2/2022Update)
Intel Core i7-7567U CPU 3.50GHz (Kaby Lake), 1 CPU, 4 logical and 2 physical cores
.NET SDK 9.0.100-preview.1.24101.2
  [Host]    : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
  Scalar    : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT
  Vector128 : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX
  Vector256 : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
```

| Method | Job | Categories | Count | Mean | StdDev | Median | Ratio |
|---|---|---|---:|---:|---:|---:|---:|
| Baseline_Double | Scalar | Double | 1000 | 1,078.60 ns | 40.114 ns | 1,060.22 ns | baseline |
| System_Double | Scalar | Double | 1000 | 406.24 ns | 13.758 ns | 399.47 ns | 2.67x faster |
| NetFabric_Double | Scalar | Double | 1000 | 2,180.41 ns | 166.326 ns | 2,150.15 ns | 2.05x slower |
| Baseline_Double | Vector128 | Double | 1000 | 1,042.76 ns | 20.052 ns | 1,035.44 ns | 1.04x faster |
| System_Double | Vector128 | Double | 1000 | 205.04 ns | 4.889 ns | 203.97 ns | 5.28x faster |
| NetFabric_Double | Vector128 | Double | 1000 | 2,270.89 ns | 125.795 ns | 2,307.96 ns | 2.07x slower |
| Baseline_Double | Vector256 | Double | 1000 | 1,444.38 ns | 116.838 ns | 1,456.48 ns | 1.31x slower |
| System_Double | Vector256 | Double | 1000 | 152.14 ns | 11.168 ns | 149.91 ns | 7.20x faster |
| NetFabric_Double | Vector256 | Double | 1000 | 2,201.52 ns | 80.728 ns | 2,205.31 ns | 2.04x slower |
| | | | | | | | |
| Baseline_Float | Scalar | Float | 1000 | 1,209.83 ns | 61.967 ns | 1,197.02 ns | baseline |
| System_Float | Scalar | Float | 1000 | 480.37 ns | 33.768 ns | 472.45 ns | 2.54x faster |
| NetFabric_Float | Scalar | Float | 1000 | 2,359.35 ns | 93.242 ns | 2,387.54 ns | 1.96x slower |
| Baseline_Float | Vector128 | Float | 1000 | 770.33 ns | 53.606 ns | 750.29 ns | 1.57x faster |
| System_Float | Vector128 | Float | 1000 | 126.49 ns | 9.335 ns | 125.69 ns | 9.58x faster |
| NetFabric_Float | Vector128 | Float | 1000 | 2,152.34 ns | 89.694 ns | 2,153.25 ns | 1.79x slower |
| Baseline_Float | Vector256 | Float | 1000 | 762.05 ns | 79.493 ns | 753.45 ns | 1.56x faster |
| System_Float | Vector256 | Float | 1000 | 67.04 ns | 1.134 ns | 66.90 ns | 18.42x faster |
| NetFabric_Float | Vector256 | Float | 1000 | 1,999.26 ns | 90.642 ns | 2,017.39 ns | 1.66x slower |
| | | | | | | | |
| Baseline_Half | Scalar | Half | 1000 | 12,504.44 ns | 286.312 ns | 12,399.19 ns | baseline |
| System_Half | Scalar | Half | 1000 | 12,231.32 ns | 120.729 ns | 12,238.40 ns | 1.02x faster |
| NetFabric_Half | Scalar | Half | 1000 | 9,433.74 ns | 867.650 ns | 9,546.42 ns | 1.35x faster |
| Baseline_Half | Vector128 | Half | 1000 | 9,697.71 ns | 240.589 ns | 9,676.38 ns | 1.29x faster |
| System_Half | Vector128 | Half | 1000 | 10,333.35 ns | 852.316 ns | 9,931.87 ns | 1.18x faster |
| NetFabric_Half | Vector128 | Half | 1000 | 8,915.24 ns | 799.399 ns | 8,905.60 ns | 1.51x faster |
| Baseline_Half | Vector256 | Half | 1000 | 10,267.79 ns | 924.079 ns | 9,858.21 ns | 1.26x faster |
| System_Half | Vector256 | Half | 1000 | 9,777.72 ns | 98.069 ns | 9,765.89 ns | 1.28x faster |
| NetFabric_Half | Vector256 | Half | 1000 | 9,393.03 ns | 475.270 ns | 9,403.79 ns | 1.36x faster |
| | | | | | | | |
| Baseline_Int | Scalar | Int | 1000 | 1,297.64 ns | 12.022 ns | 1,299.23 ns | baseline |
| System_Int | Scalar | Int | 1000 | 407.63 ns | 4.247 ns | 409.42 ns | 3.18x faster |
| NetFabric_Int | Scalar | Int | 1000 | 2,341.00 ns | 112.485 ns | 2,360.99 ns | 1.69x slower |
| Baseline_Int | Vector128 | Int | 1000 | 1,353.19 ns | 75.724 ns | 1,316.32 ns | 1.05x slower |
| System_Int | Vector128 | Int | 1000 | 115.52 ns | 6.332 ns | 114.52 ns | 11.38x faster |
| NetFabric_Int | Vector128 | Int | 1000 | 2,108.18 ns | 110.913 ns | 2,122.89 ns | 1.54x slower |
| Baseline_Int | Vector256 | Int | 1000 | 1,307.51 ns | 21.841 ns | 1,305.11 ns | 1.01x slower |
| System_Int | Vector256 | Int | 1000 | 64.33 ns | 1.039 ns | 64.19 ns | 20.18x faster |
| NetFabric_Int | Vector256 | Int | 1000 | 1,993.01 ns | 90.504 ns | 2,016.42 ns | 1.55x slower |
| | | | | | | | |
| Baseline_Long | Scalar | Long | 1000 | 1,045.51 ns | 18.504 ns | 1,044.03 ns | baseline |
| System_Long | Scalar | Long | 1000 | 406.87 ns | 7.117 ns | 405.92 ns | 2.57x faster |
| NetFabric_Long | Scalar | Long | 1000 | 2,256.12 ns | 163.947 ns | 2,250.57 ns | 2.18x slower |
| Baseline_Long | Vector128 | Long | 1000 | 1,071.94 ns | 48.088 ns | 1,050.91 ns | 1.04x slower |
| System_Long | Vector128 | Long | 1000 | 207.46 ns | 4.846 ns | 205.69 ns | 5.03x faster |
| NetFabric_Long | Vector128 | Long | 1000 | 2,197.30 ns | 162.174 ns | 2,164.07 ns | 2.15x slower |
| Baseline_Long | Vector256 | Long | 1000 | 1,047.96 ns | 16.598 ns | 1,042.90 ns | 1.00x slower |
| System_Long | Vector256 | Long | 1000 | 123.71 ns | 0.750 ns | 123.83 ns | 8.46x faster |
| NetFabric_Long | Vector256 | Long | 1000 | 2,191.66 ns | 103.227 ns | 2,201.34 ns | 2.03x slower |
| | | | | | | | |
| Baseline_Short | Scalar | Short | 1000 | 1,050.32 ns | 13.160 ns | 1,051.75 ns | baseline |
| System_Short | Scalar | Short | 1000 | 413.54 ns | 14.802 ns | 409.65 ns | 2.52x faster |
| NetFabric_Short | Scalar | Short | 1000 | 2,185.30 ns | 169.597 ns | 2,129.97 ns | 2.16x slower |
| Baseline_Short | Vector128 | Short | 1000 | 1,042.56 ns | 10.547 ns | 1,041.07 ns | 1.01x faster |
| System_Short | Vector128 | Short | 1000 | 57.45 ns | 2.324 ns | 56.74 ns | 18.53x faster |
| NetFabric_Short | Vector128 | Short | 1000 | 2,001.51 ns | 93.791 ns | 2,016.08 ns | 1.89x slower |
| Baseline_Short | Vector256 | Short | 1000 | 1,125.94 ns | 93.649 ns | 1,092.19 ns | 1.05x slower |
| System_Short | Vector256 | Short | 1000 | 39.64 ns | 3.571 ns | 38.01 ns | 26.02x faster |
| NetFabric_Short | Vector256 | Short | 1000 | 1,980.78 ns | 87.917 ns | 2,002.02 ns | 1.85x slower |
Maybe you're running into a bandwidth limitation?
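For what it's worth, a back-of-envelope check of that hypothesis against the System_Double/Vector256 row above (my arithmetic, not a measurement from the thread):

```csharp
// 1000 doubles read and 1000 written in ~152 ns implies:
const int count = 1_000;
const double seconds = 152.14e-9;                     // mean from the table
const double bytesMoved = count * sizeof(double) * 2; // source read + destination write
Console.WriteLine($"{bytesMoved / seconds / 1e9:F0} GB/s"); // prints ≈ 105 GB/s
```

If that arithmetic holds, the single-threaded SIMD loop is already running at cache-level throughput on a 16 KB working set, and the whole kernel finishes in a few hundred nanoseconds, well below the cost of dispatching work to other threads.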
@Darelbi I kept on researching and wrote an article explaining the current steps: https://aalmada.github.io/posts/Unleashing-parallelism/ Feedback and ideas are welcome!
Thanks for it, very interesting. By the way, what are the specs of your system?
I've been testing it on multiple systems. The benchmarks in the article are for the AMD.
Hi! Thank you for this amazing library. However, it is not clear from the documentation whether it supports matrix/tensor multiplication.
Does it also employ thread parallelism (Parallel.For, in addition to SIMD instructions)?
If it supports tensor/matrix multiplication, it would be great for machine learning: for example, a forward pass in a neural network is just a(Wx + b), where W is the matrix of weights, x the input vector, b the bias, and a the activation function. If it is already supported, how do I use the tensor/matrix multiplication? Thanks!