
Investigate the performance issues and consider moving to GemmKernels.jl #2

Open
GiggleLiu opened this issue Jul 5, 2023 · 5 comments
Labels: good first issue

Comments

@GiggleLiu
Member

Sorry for the previous chaos, I thought these parts would not be published as part of the package.

The following changes have been made:

  • The .so file is uploaded to a gist as an artifact, so there are no longer any binaries in the repo.
  • I relocated all the files into the src, test and benchmark folders.
  • Scripts used for the benchmarks are provided, including the fallback implementation in CUDA.jl. However, I found something strange: CUDA.@sync does not seem to work when calling the function from the .so library, so I failed to benchmark our code in Julia (see the timing sketch right after this list).
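For completeness, here is roughly how such a timing could be set up around a ccall into the shared library. This is only a hedged sketch: the library path (./libtropicalgemm.so), the symbol name (maxadd_fp32) and its argument order are placeholders, not the actual exports of the .so.

using CUDA, BenchmarkTools

# Placeholder wrapper around the compiled .so; the real library path, symbol
# name and argument list will differ.
function tropical_maxadd!(C::CuMatrix{Float32}, A::CuMatrix{Float32}, B::CuMatrix{Float32})
    m, k = size(A); n = size(B, 2)
    @ccall "./libtropicalgemm.so".maxadd_fp32(pointer(C)::CuPtr{Float32},
                                              pointer(A)::CuPtr{Float32},
                                              pointer(B)::CuPtr{Float32},
                                              m::Cint, n::Cint, k::Cint)::Cvoid
    return C
end

A = CUDA.rand(Float32, 1024, 1024)
B = CUDA.rand(Float32, 1024, 1024)
C = CUDA.fill(-Inf32, 1024, 1024)

# CUDA.@sync only waits on the stream CUDA.jl uses for the current task; a kernel
# launched from inside the .so is typically enqueued on another (default) stream,
# so it is invisible to CUDA.@sync. A full device synchronization is a heavier but
# correct way to time it:
@btime begin
    tropical_maxadd!($C, $A, $B)
    CUDA.device_synchronize()
end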

The new benchmark result is shown below:
[figure: benchmark results]

Originally posted by @ArrogantGao in #1 (comment)

@ArrogantGao
Copy link
Collaborator

I further profiled the Tropical operations of GemmKernels and of our CuTropicalGemm code via the Nvidia Nsight Compute tool. The results demonstrate the current code's bottlenecks. During testing, I took m = 10240, n = 8192, k = 8192.

The main result is shown below:
[figure: Pipe Utilization of CuTropicalGemm]
[figure: Pipe Utilization of GemmKernels]
The figure above shows the Pipe Utilization of CuTropicalGemm and the one below shows that of GemmKernels. It seems that the ALU units (which execute the FMNMX instruction) and the FMA units (which execute the FADD instruction) can perform calculations in parallel, so even though we cannot use fused operations, the code still achieves a performance greater than fifty percent.
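For reference, the inner update of a (max, +) matrix product is one floating-point add followed by one max, and these two lower to separate FADD/FMNMX-style instructions rather than a single fused FFMA. Below is a minimal, deliberately naive CUDA.jl kernel (not the actual CuTropicalGemm kernel, which is tiled and uses shared memory) that makes this lowering visible:

using CUDA

# Naive (max, +) GEMM kernel, only meant to show which instructions the inner
# loop compiles to.
function tropical_mm_naive!(C, A, B)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= size(C, 1) && j <= size(C, 2)
        acc = -Inf32
        for k in 1:size(A, 2)
            acc = max(acc, A[i, k] + B[k, j])   # add on the FMA pipe, max on the ALU pipe
        end
        C[i, j] = acc
    end
    return nothing
end

A = CUDA.rand(Float32, 256, 256)
B = CUDA.rand(Float32, 256, 256)
C = CUDA.fill(-Inf32, 256, 256)
threads = (16, 16)
blocks = cld.(size(C), threads)

# The dumped PTX should contain add.f32 and max.f32 in the inner loop (executed
# as FADD and FMNMX on the hardware), with no fused fma instruction.
CUDA.@device_code_ptx @cuda threads=threads blocks=blocks tropical_mm_naive!(C, A, B)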

It can also be observed that in both cases the utilization of the computational units only reaches about 80% and 60%, respectively, which is far below the 96.3% achieved by SGEMM with pure FFMA operations.
Furthermore, the register analysis shown in the graph below indicates that the performance bottleneck is not the register bandwidth, but rather insufficient parallelism of the ALU and FMA computational units.
[figure: register analysis]
To further improve the performance, we may need to analyze the behavior of the ALU and FMA units.

P.S.: I also benchmarked the max-mul operation; the result is almost the same as that of max-add.

@GiggleLiu
Member Author

GiggleLiu commented Aug 2, 2023

Thanks! If possible, could you please also print the generated PTX code and post an Nsight Systems screenshot here? This information could help in analysing the performance difference between the C implementation and the GemmKernels implementation.

For the GemmKernels implementation, just go through the steps here:
https://cuda.juliagpu.org/stable/development/profiling/
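Concretely, that could look like the sketch below (assuming conf, A, B and C are already set up for a GemmKernels.matmul call; this is only a sketch of the workflow from the linked docs):

using CUDA, GemmKernels

# Dump the PTX of the kernel compiled inside the expression; best done in a
# fresh session so the kernel is not already cached.
CUDA.@device_code_ptx GemmKernels.matmul(conf, A, B, C, C; kernel = Kernel.matmul_pipelined)

# For an Nsight Systems trace, start Julia under the profiler (e.g. nsys launch julia)
# and mark the region of interest. On the CUDA.jl versions used here this is plain
# CUDA.@profile; newer releases use CUDA.@profile external=true for the same purpose.
CUDA.@profile begin
    GemmKernels.matmul(conf, A, B, C, C; kernel = Kernel.matmul_pipelined)
    CUDA.synchronize()
end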

@ArrogantGao
Collaborator

ArrogantGao commented Aug 2, 2023

Sure! Since the PTX code and the benchmark results are both quite long, I will upload the files directly here.

Here are the benchmark results from Nvidia Nsight Compute for GemmKernels, CuTropicalGemm.MaxMulFP32! and CuTropicalGemm.MaxAddFP32!:
GemmKernels.pdf
CuTropical_MAXMUL.pdf
CuTropical_MAXADD.pdf

Figures of the ncu benchmark results:
[figure: result of GemmKernels.jl]
[figure: result of CuTropicalGemm.MaxAddFP32!]
[figure: result of CuTropicalGemm.MaxMulFP32!]
[figure: result of sgemm.cu]

Here is the PTX code of sgemm.cu, TropicalGemm.cu and GemmKernels.jl (the first two are generated directly from the .cu code, because CUDA.@device_code_ptx does not work with our Julia interface).
sgemm.txt
TropicalGemm.txt
GemmKernels.txt

@GiggleLiu added the good first issue label on Sep 28, 2023
@maleadt

maleadt commented Dec 7, 2023

Can you share your benchmarking code? I did a test myself, and with two performance fixes to GemmKernels.jl (JuliaGPU/GemmKernels.jl#182, and a tuned configuration) I'm getting very similar performance. For example, on an RTX 6000 Ada using 4096x4096 Float32 inputs:

CuTropicalGEMM:   11.866 ms (2 allocations: 48 bytes)
GemmKernels:   11.356 ms (47 allocations: 3.00 KiB)

GemmKernels.jl seems consistently a little faster than CuTropicalGEMM, even when re-using this block/operator configuration for different input sizes (i.e. where additional tuning might result in even better performance).

I'm benchmarking using the following code:

using CUDA, GemmKernels, LinearAlgebra
using TropicalNumbers, CuTropicalGEMM
using BenchmarkTools

function main()
    M = K = N = 1024

    A = CUDA.rand(Float32, M, K)
    B = CUDA.rand(Float32, K, N)
    C = CUDA.zeros(Float32, M, N)

    print("CuTropicalGEMM: ")
    let
        tA = Tropical.(A)
        tB = Tropical.(B)
        tC = Tropical.(C)
        @btime begin
            mul!($tC, $tA, $tB)
            # XXX: not sure why `CUDA.@sync` doesn't work here;
            #      is CuTropicalGEMM doing its own stream management?
            device_synchronize()
        end
    end

    print("GemmKernels: ")
    let
        # result of tuning
        BLOCK_M = 128
        BLOCK_N = 64
        BLOCK_K = 32
        OP_M = 16
        OP_N = 4
        OP_K = 4
        OP_MB = 8
        OP_NB = 4
        OP_KB = 1
        kernel = Kernel.matmul_pipelined

        # pow2-sized, 128-bit aligned inputs, so we can use aligned layouts.
        # we don't have transposed inputs, so everything is column major.
        @assert stride(A, 2) % 16 == 0
        global_a_layout = Layout.UnsafeAlignedColMajor{eltype(A)}
        @assert stride(B, 2) % 16 == 0
        global_b_layout = Layout.UnsafeAlignedColMajor{eltype(B)}
        # we want to do a simple C = A * B, so no need to load C first.
        global_c_layout = Layout.Zero{eltype(C)}
        @assert stride(C, 2) % 16 == 0
        global_d_layout = Layout.UnsafeAlignedColMajor{eltype(C)}

        # shared layouts are similar.
        # the frequently-accessed a/b shmems are padded to avoid bank conflicts.
        shared_a_layout = Layout.Padded{Layout.UnsafeAlignedColMajor{eltype(A)}, 8}
        shared_b_layout = Layout.Padded{Layout.UnsafeAlignedColMajor{eltype(B)}, 8}
        shared_c_layout = shared_d_layout = Layout.UnsafeAlignedColMajor{eltype(C)}

        # we use the tropical FPU operator
        compute_type = promote_type(eltype(A), eltype(B))
        operator = Operator.TropicalFPUOp{OP_M, OP_N, OP_K, OP_MB, OP_NB, OP_KB,
                                          compute_type, eltype(C)}

        # the block shape is the result of tuning
        block_shape = (M = BLOCK_M, N = BLOCK_N, K = BLOCK_K)
        @assert M % block_shape.M == 0
        @assert N % block_shape.N == 0
        @assert K % block_shape.K == 0

        conf = GemmKernels.get_config(;
            gemm_shape = (M = M, N = N, K = K),
            block_shape,
            operator,

            global_a_layout, global_b_layout, global_c_layout, global_d_layout,
            shared_a_layout, shared_b_layout, shared_c_layout, shared_d_layout,

            is_a_col_major = true,
            is_b_col_major = true
        )

        @btime CUDA.@sync GemmKernels.matmul($conf, $A, $B, $C, $C; kernel=$kernel)
    end

    CUDA.unsafe_free!(A)
    CUDA.unsafe_free!(B)
    CUDA.unsafe_free!(C)
end

isinteractive() || main()

Now, GemmKernels.jl likely needs some improvements to be better across the board (e.g. more generalization to handle arbitrary input sizes, a better API, etc.), but it nonetheless seems like a good starting point, with all the advantages that native Julia implementations have (arbitrary type support, ease of development, etc.).

@ArrogantGao
Collaborator

These results look really great.
In the current version the stream handling is not working properly; this will be fixed by PR #27 and released soon.
Previously, the benchmarks we referred to were run directly with C-CUDA.
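(As a counterpart to the device_synchronize workaround sketched earlier in this thread: once the launcher accepts the task-local stream from CUDA.jl and enqueues its kernels there, plain CUDA.@sync measures the call correctly. The sketch below only shows the idea; the exported symbol, its signature, and whether PR #27 actually does it this way are assumptions.)

using CUDA

# Hypothetical stream-aware entry point; the real C function would take a
# cudaStream_t and launch its kernels on that stream.
function tropical_maxadd!(C::CuMatrix{Float32}, A::CuMatrix{Float32}, B::CuMatrix{Float32})
    m, k = size(A); n = size(B, 2)
    stream = CUDA.stream()   # the stream CUDA.@sync waits on
    @ccall "./libtropicalgemm.so".maxadd_fp32_stream(pointer(C)::CuPtr{Float32},
                                                     pointer(A)::CuPtr{Float32},
                                                     pointer(B)::CuPtr{Float32},
                                                     m::Cint, n::Cint, k::Cint,
                                                     stream::CUDA.CUstream)::Cvoid
    return C
end

# With the work enqueued on the task-local stream, `@btime CUDA.@sync tropical_maxadd!($C, $A, $B)`
# gives a correct timing without a full device synchronization.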

Actually, I also have an implementation using GemmKernels.jl, as shown in https://github.com/TensorBFS/CuTropicalGEMM.jl/blob/julia_cuda, which is designed to make GemmKernels.jl work with the package TropicalNumbers.jl.
However, in our previous tests I found that there seemed to be serious performance issues when using Operator.TropicalFPUOp, and I am happy to see that this has been fixed.
I will retry the benchmark tomorrow, thank you very much!
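For reference, here is a rough sketch of what routing mul! for tropical CuArrays through GemmKernels.jl could look like, reusing the operator, layouts and tuned block shape from the benchmark above. The method signature, the reinterpret-based wrapping, and the handling of the tropical zero element (-Inf32) are assumptions of this sketch, not the actual code on the julia_cuda branch.

using CUDA, GemmKernels, LinearAlgebra, TropicalNumbers

function LinearAlgebra.mul!(C::CuMatrix{Tropical{Float32}},
                            A::CuMatrix{Tropical{Float32}},
                            B::CuMatrix{Tropical{Float32}})
    # Tropical{Float32} is an isbits wrapper around a Float32, so the storage can
    # be reinterpreted and handed to GemmKernels together with the tropical operator.
    Af, Bf, Cf = reinterpret.(Float32, (A, B, C))
    M, K = size(A); N = size(B, 2)

    conf = GemmKernels.get_config(;
        gemm_shape = (M = M, N = N, K = K),
        block_shape = (M = 128, N = 64, K = 32),   # tuned values from the comment above
        operator = Operator.TropicalFPUOp{16, 4, 4, 8, 4, 1, Float32, Float32},

        global_a_layout = Layout.UnsafeAlignedColMajor{Float32},
        global_b_layout = Layout.UnsafeAlignedColMajor{Float32},
        global_c_layout = Layout.Zero{Float32},
        global_d_layout = Layout.UnsafeAlignedColMajor{Float32},

        shared_a_layout = Layout.Padded{Layout.UnsafeAlignedColMajor{Float32}, 8},
        shared_b_layout = Layout.Padded{Layout.UnsafeAlignedColMajor{Float32}, 8},
        shared_c_layout = Layout.UnsafeAlignedColMajor{Float32},
        shared_d_layout = Layout.UnsafeAlignedColMajor{Float32},

        is_a_col_major = true,
        is_b_col_major = true)

    GemmKernels.matmul(conf, Af, Bf, Cf, Cf; kernel = Kernel.matmul_pipelined)
    return C
end

This sketch also glosses over non-power-of-two sizes and alignment, which a real integration would need to handle.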
