
Investigate the performance issues and consider moving to GemmKernels.jl #2

Open
GiggleLiu opened this issue Jul 5, 2023 · 5 comments
Labels: good first issue

Comments

@GiggleLiu
Member

Sorry for the previous chaos, I thought these parts would not be published as part of the package.

The following changes have been made:

  • The .so file is uploaded to a gist as an artifact, so there are no longer any binaries in the repo.
  • I relocated all the files into the src, test and benchmark folders.
  • Scripts used for the benchmarks are provided, including the fallback implementation in CUDA.jl. However, I found something strange: CUDA.@sync does not seem to work when calling the function from the .so library, so I failed to benchmark our code in Julia (see the timing sketch right after this list).
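For completeness, here is roughly how such a timing could be set up around a ccall into the shared library. This is only a hedged sketch: the library path (./libtropicalgemm.so), the symbol name (maxadd_fp32) and its argument order are placeholders, not the actual exports of the .so.

using CUDA, BenchmarkTools

# Placeholder wrapper around the compiled .so; the real library path, symbol
# name and argument list will differ.
function tropical_maxadd!(C::CuMatrix{Float32}, A::CuMatrix{Float32}, B::CuMatrix{Float32})
    m, k = size(A); n = size(B, 2)
    @ccall "./libtropicalgemm.so".maxadd_fp32(pointer(C)::CuPtr{Float32},
                                              pointer(A)::CuPtr{Float32},
                                              pointer(B)::CuPtr{Float32},
                                              m::Cint, n::Cint, k::Cint)::Cvoid
    return C
end

A = CUDA.rand(Float32, 1024, 1024)
B = CUDA.rand(Float32, 1024, 1024)
C = CUDA.fill(-Inf32, 1024, 1024)

# CUDA.@sync only waits on the stream CUDA.jl uses for the current task; a kernel
# launched from inside the .so is typically enqueued on another (default) stream,
# so it is invisible to CUDA.@sync. A full device synchronization is a heavier but
# correct way to time it:
@btime begin
    tropical_maxadd!($C, $A, $B)
    CUDA.device_synchronize()
end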

The new benchmark result is shown below:
[figure: benchmark results]

Originally posted by @ArrogantGao in #1 (comment)

@ArrogantGao
Copy link
Collaborator

I further profiled the Tropical operations of GemmKernels and of our CuTropicalGemm code via the Nvidia Nsight Compute tool. The results demonstrate the current code's bottlenecks. During testing, I took m = 10240, n = 8192, k = 8192.

The main result is shown below:
[figure: Pipe Utilization of CuTropicalGemm]
[figure: Pipe Utilization of GemmKernels]
The figure above shows the Pipe Utilization of CuTropicalGemm and the one below shows that of GemmKernels. It seems that the ALU units (which execute the FMNMX instruction) and the FMA units (which execute the FADD instruction) can perform calculations in parallel, so even though we cannot use fused operations, the code still achieves a performance greater than fifty percent.
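For reference, the inner update of a (max, +) matrix product is one floating-point add followed by one max, and these two lower to separate FADD/FMNMX-style instructions rather than a single fused FFMA. Below is a minimal, deliberately naive CUDA.jl kernel (not the actual CuTropicalGemm kernel, which is tiled and uses shared memory) that makes this lowering visible:

using CUDA

# Naive (max, +) GEMM kernel, only meant to show which instructions the inner
# loop compiles to.
function tropical_mm_naive!(C, A, B)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= size(C, 1) && j <= size(C, 2)
        acc = -Inf32
        for k in 1:size(A, 2)
            acc = max(acc, A[i, k] + B[k, j])   # add on the FMA pipe, max on the ALU pipe
        end
        C[i, j] = acc
    end
    return nothing
end

A = CUDA.rand(Float32, 256, 256)
B = CUDA.rand(Float32, 256, 256)
C = CUDA.fill(-Inf32, 256, 256)
threads = (16, 16)
blocks = cld.(size(C), threads)

# The dumped PTX should contain add.f32 and max.f32 in the inner loop (executed
# as FADD and FMNMX on the hardware), with no fused fma instruction.
CUDA.@device_code_ptx @cuda threads=threads blocks=blocks tropical_mm_naive!(C, A, B)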

It can also be observed that in both cases the utilization of the computational units only reaches about 80% and 60%, respectively, which is far below the 96.3% achieved by SGEMM with pure FFMA operations.
Furthermore, the register analysis shown in the graph below indicates that the performance bottleneck is not the register bandwidth, but rather insufficient parallelism of the ALU and FMA computational units.
[figure: register analysis]
To further improve the performance, we may need to analyze the behavior of the ALU and FMA units.

P.S.: I also benchmarked the max-mul operation; the result is almost the same as that of max-add.

@GiggleLiu
Member Author

GiggleLiu commented Aug 2, 2023

Thanks! If possible, could you please also print the generated PTX code and post an Nsight Systems screenshot here? This information could help in analysing the performance difference between the C implementation and the GemmKernels implementation.

For the GemmKernels implementation, just go through the steps here:
https://cuda.juliagpu.org/stable/development/profiling/
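Concretely, that could look like the sketch below (assuming conf, A, B and C are already set up for a GemmKernels.matmul call; this is only a sketch of the workflow from the linked docs):

using CUDA, GemmKernels

# Dump the PTX of the kernel compiled inside the expression; best done in a
# fresh session so the kernel is not already cached.
CUDA.@device_code_ptx GemmKernels.matmul(conf, A, B, C, C; kernel = Kernel.matmul_pipelined)

# For an Nsight Systems trace, start Julia under the profiler (e.g. nsys launch julia)
# and mark the region of interest. On the CUDA.jl versions used here this is plain
# CUDA.@profile; newer releases use CUDA.@profile external=true for the same purpose.
CUDA.@profile begin
    GemmKernels.matmul(conf, A, B, C, C; kernel = Kernel.matmul_pipelined)
    CUDA.synchronize()
end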

@ArrogantGao
Collaborator

ArrogantGao commented Aug 2, 2023

Sure! Since the PTX code and the benchmark results are both quite long, I will upload the files directly here.

Here are the benchmark results from Nvidia Nsight Compute for GemmKernels, CuTropicalGemm.MaxMulFP32! and CuTropicalGemm.MaxAddFP32!:
GemmKernels.pdf
CuTropical_MAXMUL.pdf
CuTropical_MAXADD.pdf

Figures of the ncu benchmark results:
[figure: result of GemmKernels.jl]
[figure: result of CuTropicalGemm.MaxAddFP32!]
[figure: result of CuTropicalGemm.MaxMulFP32!]
[figure: result of sgemm.cu]

Here is the PTX code of sgemm.cu, TropicalGemm.cu and GemmKernels.jl (the first two are generated directly from the .cu code, because CUDA.@device_code_ptx does not work with our Julia interface).
sgemm.txt
TropicalGemm.txt
GemmKernels.txt

@GiggleLiu added the good first issue label on Sep 28, 2023
@maleadt

maleadt commented Dec 7, 2023

Can you share your benchmarking code? I did a test myself, and with two performance fixes to GemmKernels.jl (JuliaGPU/GemmKernels.jl#182, and a tuned configuration) I'm getting very similar performance. For example, on an RTX 6000 Ada using 4096x4096 Float32 inputs:

CuTropicalGEMM:   11.866 ms (2 allocations: 48 bytes)
GemmKernels:   11.356 ms (47 allocations: 3.00 KiB)

GemmKernels.jl seems consistently a little faster than CuTropicalGEMM, even when re-using this block/operator configuration for different input sizes (i.e. where additional tuning might result in even better performance).

I'm benchmarking using the following code:

using CUDA, GemmKernels, LinearAlgebra
using TropicalNumbers, CuTropicalGEMM
using BenchmarkTools

function main()
    M = K = N = 1024

    A = CUDA.rand(Float32, M, K)
    B = CUDA.rand(Float32, K, N)
    C = CUDA.zeros(Float32, M, N)

    print("CuTropicalGEMM: ")
    let
        tA = Tropical.(A)
        tB = Tropical.(B)
        tC = Tropical.(C)
        @btime begin
            mul!($tC, $tA, $tB)
            # XXX: not sure why `CUDA.@sync` doesn't work here;
            #      is CuTropicalGEMM doing its own stream management?
            device_synchronize()
        end
    end

    print("GemmKernels: ")
    let
        # result of tuning
        BLOCK_M = 128
        BLOCK_N = 64
        BLOCK_K = 32
        OP_M = 16
        OP_N = 4
        OP_K = 4
        OP_MB = 8
        OP_NB = 4
        OP_KB = 1
        kernel = Kernel.matmul_pipelined

        # pow2-sized, 128-bit aligned inputs, so we can use aligned layouts.
        # we don't have transposed inputs, so everything is column major.
        @assert stride(A, 2) % 16 == 0
        global_a_layout = Layout.UnsafeAlignedColMajor{eltype(A)}
        @assert stride(B, 2) % 16 == 0
        global_b_layout = Layout.UnsafeAlignedColMajor{eltype(B)}
        # we want to do a simple C = A * B, so no need to load C first.
        global_c_layout = Layout.Zero{eltype(C)}
        @assert stride(C, 2) % 16 == 0
        global_d_layout = Layout.UnsafeAlignedColMajor{eltype(C)}

        # shared layouts are similar.
        # the frequently-accessed a/b shmems are padded to avoid bank conflicts.
        shared_a_layout = Layout.Padded{Layout.UnsafeAlignedColMajor{eltype(A)}, 8}
        shared_b_layout = Layout.Padded{Layout.UnsafeAlignedColMajor{eltype(B)}, 8}
        shared_c_layout = shared_d_layout = Layout.UnsafeAlignedColMajor{eltype(C)}

        # we use the tropical FPU operator
        compute_type = promote_type(eltype(A), eltype(B))
        operator = Operator.TropicalFPUOp{OP_M, OP_N, OP_K, OP_MB, OP_NB, OP_KB,
                                          compute_type, eltype(C)}

        # the block shape is the result of tuning
        block_shape = (M = BLOCK_M, N = BLOCK_N, K = BLOCK_K)
        @assert M % block_shape.M == 0
        @assert N % block_shape.N == 0
        @assert K % block_shape.K == 0

        conf = GemmKernels.get_config(;
            gemm_shape = (M = M, N = N, K = K),
            block_shape,
            operator,

            global_a_layout, global_b_layout, global_c_layout, global_d_layout,
            shared_a_layout, shared_b_layout, shared_c_layout, shared_d_layout,

            is_a_col_major = true,
            is_b_col_major = true
        )

        @btime CUDA.@sync GemmKernels.matmul($conf, $A, $B, $C, $C; kernel=$kernel)
    end

    CUDA.unsafe_free!(A)
    CUDA.unsafe_free!(B)
    CUDA.unsafe_free!(C)
end

isinteractive() || main()

Now, GemmKernels.jl likely needs some improvements to be better across the board (e.g. more generalization to handle arbitrary input sizes, a better API, etc.), but it nonetheless seems like a good starting point, with all the advantages that native Julia implementations have (arbitrary type support, ease of development, etc.).

@ArrogantGao
Collaborator

These results look really great.
In the current version the stream handling is not working properly; this will be fixed by PR #27 and released soon.
Previously, the benchmarks we referred to were run directly with C-CUDA.
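(As a counterpart to the device_synchronize workaround sketched earlier in this thread: once the launcher accepts the task-local stream from CUDA.jl and enqueues its kernels there, plain CUDA.@sync measures the call correctly. The sketch below only shows the idea; the exported symbol, its signature, and whether PR #27 actually does it this way are assumptions.)

using CUDA

# Hypothetical stream-aware entry point; the real C function would take a
# cudaStream_t and launch its kernels on that stream.
function tropical_maxadd!(C::CuMatrix{Float32}, A::CuMatrix{Float32}, B::CuMatrix{Float32})
    m, k = size(A); n = size(B, 2)
    stream = CUDA.stream()   # the stream CUDA.@sync waits on
    @ccall "./libtropicalgemm.so".maxadd_fp32_stream(pointer(C)::CuPtr{Float32},
                                                     pointer(A)::CuPtr{Float32},
                                                     pointer(B)::CuPtr{Float32},
                                                     m::Cint, n::Cint, k::Cint,
                                                     stream::CUDA.CUstream)::Cvoid
    return C
end

# With the work enqueued on the task-local stream, `@btime CUDA.@sync tropical_maxadd!($C, $A, $B)`
# gives a correct timing without a full device synchronization.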

Actually, I also have an implementation using GemmKernels.jl, as shown in https://github.com/TensorBFS/CuTropicalGEMM.jl/blob/julia_cuda, which is designed to make GemmKernels.jl work with the package TropicalNumbers.jl.
However, in our previous tests I found that there seemed to be serious performance issues when using Operator.TropicalFPUOp, and I am happy to see that this has been fixed.
I will retry the benchmark tomorrow, thank you very much!
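For reference, here is a rough sketch of what routing mul! for tropical CuArrays through GemmKernels.jl could look like, reusing the operator, layouts and tuned block shape from the benchmark above. The method signature, the reinterpret-based wrapping, and the handling of the tropical zero element (-Inf32) are assumptions of this sketch, not the actual code on the julia_cuda branch.

using CUDA, GemmKernels, LinearAlgebra, TropicalNumbers

function LinearAlgebra.mul!(C::CuMatrix{Tropical{Float32}},
                            A::CuMatrix{Tropical{Float32}},
                            B::CuMatrix{Tropical{Float32}})
    # Tropical{Float32} is an isbits wrapper around a Float32, so the storage can
    # be reinterpreted and handed to GemmKernels together with the tropical operator.
    Af, Bf, Cf = reinterpret.(Float32, (A, B, C))
    M, K = size(A); N = size(B, 2)

    conf = GemmKernels.get_config(;
        gemm_shape = (M = M, N = N, K = K),
        block_shape = (M = 128, N = 64, K = 32),   # tuned values from the comment above
        operator = Operator.TropicalFPUOp{16, 4, 4, 8, 4, 1, Float32, Float32},

        global_a_layout = Layout.UnsafeAlignedColMajor{Float32},
        global_b_layout = Layout.UnsafeAlignedColMajor{Float32},
        global_c_layout = Layout.Zero{Float32},
        global_d_layout = Layout.UnsafeAlignedColMajor{Float32},

        shared_a_layout = Layout.Padded{Layout.UnsafeAlignedColMajor{Float32}, 8},
        shared_b_layout = Layout.Padded{Layout.UnsafeAlignedColMajor{Float32}, 8},
        shared_c_layout = Layout.UnsafeAlignedColMajor{Float32},
        shared_d_layout = Layout.UnsafeAlignedColMajor{Float32},

        is_a_col_major = true,
        is_b_col_major = true)

    GemmKernels.matmul(conf, Af, Bf, Cf, Cf; kernel = Kernel.matmul_pipelined)
    return C
end

This sketch also glosses over non-power-of-two sizes and alignment, which a real integration would need to handle.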
