
Added wrapped C CUDA code and runnable examples #1

Merged
merged 3 commits into main on Jul 5, 2023
Conversation

ArrogantGao
Collaborator

No description provided.

@ArrogantGao
Collaborator Author

I compared the speed of our CUDA code against the tropical operator from GemmKernels in Julia 1.7; the result is shown below.
[benchmark plot]
GemmKernels performs much better in Julia 1.7 than in Julia 1.6, but still reaches only about half of the maximum performance.

I wrapped our CUDA code as a .so library and added a Julia interface; related tests have also been created.
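For context, wrapping a kernel from a shared library usually comes down to a `ccall` that passes raw device pointers. This is only a sketch: the library path and the exported symbol `tropical_sgemm` are assumptions for illustration, and the actual names in this PR may differ.

```julia
using CUDA

const LIB = "libtropicalgemm.so"  # path to the compiled .so (assumption)

# Hypothetical wrapper: the C function launches the kernel internally,
# outside of CUDA.jl's knowledge, so we only hand over device pointers
# and the matrix dimensions.
function tropical_sgemm!(C::CuMatrix{Float32}, A::CuMatrix{Float32}, B::CuMatrix{Float32})
    M, K = size(A)
    K2, N = size(B)
    @assert K == K2 && size(C) == (M, N)
    ccall((:tropical_sgemm, LIB), Cvoid,
          (CuPtr{Float32}, CuPtr{Float32}, CuPtr{Float32}, Cint, Cint, Cint),
          pointer(A), pointer(B), pointer(C), M, N, K)
    return C
end
```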

src/SGEMM/CuTropicalSGEMM_example.jl
src/SGEMM/TropicalSGemm.cu
src/SGEMM/benchmark_GemmKerenls_Tropical.jl
@ArrogantGao
Collaborator Author

ArrogantGao commented Jul 5, 2023

Sorry for the previous chaos; I thought these parts would not be published as part of the package.

The following changes have been made:

  • The .so file is uploaded to a gist as an artifact, so there are no more binaries in the repo.
  • I relocated all the files into the src, test and benchmark folders.
  • Scripts used for the benchmarks are provided, including the fallback implementation in CUDA.jl. However, I found something strange: CUDA.@sync does not seem to work when calling a function from a .so library, so I failed to benchmark our code in Julia.
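On the CUDA.@sync point: CUDA.@sync only waits for work that CUDA.jl itself submitted on its task-local stream, so a kernel launched inside an external .so is invisible to it. One possible workaround, sketched here under the assumption that the external kernel runs on the default stream, is to synchronize the whole device around the timed region; `run_gemm` stands for any hypothetical wrapper that calls into the shared library.

```julia
using CUDA

# CUDA.@sync cannot see kernels launched inside an external .so, so for
# timing we fall back to a full device synchronization, which waits for
# *all* pending GPU work regardless of who launched it.
function time_external_gemm(run_gemm)
    CUDA.device_synchronize()       # drain previously queued work
    t0 = time_ns()
    run_gemm()                      # launches the kernel inside the .so
    CUDA.device_synchronize()       # wait until that kernel has finished
    return (time_ns() - t0) / 1e9   # elapsed seconds
end
```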

The new benchmark result is shown here:
[benchmark plot]

@GiggleLiu
Member

Hi @maleadt, I am mentoring an Open Source Promotion Plan student to implement Tropical GEMM on GPUs. Regarding the recent update in GemmKernels.jl (JuliaGPU/GemmKernels.jl#101), I suggested that he try GemmKernels.jl to make the implementation compatible with the Julia CUDA ecosystem. However, the benchmark above shows that its performance is not as good as the 600-line C code.

We might need your help to decide which way forward is technically more feasible:

  1. We can either implement the TropicalGEMM in C and port it to CUDA ecosystem, or
  2. try polishing the GemmKernels implementation.

Also, @ArrogantGao found that CUDA.@sync does not work when calling a function from a .so library. I could not find any issue discussing how CUDA.@sync interacts with .so libraries, so if you could suggest a direction to investigate, that would be very helpful. If there is any existing sample project for reference, that would be perfect.

NOTE: All the benchmarks and implementations are included in this repo.
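For reference, Tropical GEMM here means matrix multiplication over the max-plus semiring, i.e. C[i,j] = max_k (A[i,k] + B[k,j]). A plain-Julia CPU version is useful as ground truth when validating either GPU implementation; the function name below is illustrative, not an API of this repo.

```julia
# CPU reference for max-plus (tropical) matrix multiplication:
#   C[i,j] = max_k (A[i,k] + B[k,j])
# The accumulator starts at typemin(T), the additive identity of max.
function tropical_matmul(A::AbstractMatrix{T}, B::AbstractMatrix{T}) where {T<:Real}
    M, K = size(A)
    K2, N = size(B)
    @assert K == K2
    C = fill(typemin(T), M, N)
    @inbounds for j in 1:N, k in 1:K, i in 1:M
        C[i, j] = max(C[i, j], A[i, k] + B[k, j])
    end
    return C
end

A = Float32[1 2; 3 4]
B = Float32[0 1; 1 0]
tropical_matmul(A, B)  # → Float32[3 2; 5 4]
```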

@GiggleLiu (Member) left a comment

I think the changes look great, well done!

@@ -0,0 +1 @@
{}

vscode configuration files should not be committed.

Artifacts.toml Show resolved Hide resolved
@@ -0,0 +1,627 @@
// This CUDA code is modified based on the GitHub repo https://github.com/Yinghan-Li/YHs_Sample, which is under the GPL 3.0 License

Holy, the GPL3 license, that is sexy. If we decide to keep this version in our code base, we have to include the GPL3 license.

To “propagate” a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. Propagation includes copying, distribution (with or without modification), making available to the public, and in some countries other activities as well.

@maleadt

maleadt commented Jul 5, 2023

try polishing the GemmKernels implementation.

I would recommend doing so. An all-Julia implementation is always preferable, for so many reasons: support for different datatypes, easier to tune using metaprogramming instead of the hard-coded 128x128x8 here, easier for other people to contribute to, etc. The code generated by GemmKernels.jl is generally pretty good, so it should be possible to compare the generated PTX code of both implementations, and/or use NSight Compute to compare executions. Maybe it's something simple, like GemmKernels.jl not using ldg. It's possible that it's more serious, like how we use 64-bit integers for pointer arithmetic (JuliaGPU/CUDA.jl#1895), but I wouldn't expect so with a memory-bound kernel.
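Comparing the generated PTX as suggested can be done from the Julia side with CUDA.jl's reflection macros. A minimal sketch, in which the toy kernel is only a placeholder and not the actual tropical kernel:

```julia
using CUDA

# Placeholder elementwise kernel with a tropical-style max operation,
# just to demonstrate PTX inspection.
function maxcopy!(y, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = max(y[i], x[i])
    end
    return
end

x = CUDA.rand(Float32, 1024)
y = CUDA.rand(Float32, 1024)

# Prints the PTX generated for this launch; the output can then be diffed
# against the PTX that nvcc emits for TropicalSGemm.cu (`nvcc -ptx ...`).
CUDA.@device_code_ptx @cuda threads=256 blocks=4 maxcopy!(y, x)
```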

@ArrogantGao
Collaborator Author

Removed the .vscode file and changed the license to GPL 3.0 (indeed, I also like that better).

@GiggleLiu
Member

GiggleLiu commented Jul 5, 2023

@maleadt Thank you for your prompt reply. @ArrogantGao Let us do some profiling and get some understanding of the performance issues.
This page explains how to profile GPU code with CUDA.jl: https://cuda.juliagpu.org/stable/development/profiling/
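A minimal profiling sketch following that guide. This assumes a recent CUDA.jl whose integrated `CUDA.@profile` is available; otherwise the same script can be run under an external profiler such as `nsys profile julia script.jl`. The exact report format depends on the CUDA.jl version.

```julia
using CUDA

A = CUDA.rand(Float32, 2048, 2048)
B = CUDA.rand(Float32, 2048, 2048)

C = A * B                  # warm up: compile the kernels first
CUDA.@profile begin
    CUDA.@sync A * B       # profiled region: kernel timings are reported
end
```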

Let me merge the PR first and move the discussion to #2, where we can update the profiling results and the generated PTX code.

@GiggleLiu GiggleLiu merged commit 895cfa5 into main Jul 5, 2023