Polynomial arithmetic implemented in CUDA #2
Comments
Wondering if you have considered OpenCL for wider hardware compatibility?
Based on experience, this can be achieved with C++ templating. It would dramatically increase code complexity and decrease readability, but it requires the fewest tooling changes (any C++98 compiler would do).
Not a direct answer, but I think you could borrow the idea from the C++17 polymorphic allocator.
Thanks Wei. I read online that OpenCL was ~40% slower than Metal and ~30% slower than CUDA. I'd be really curious what kind of performance an OpenCL implementation of miniSTARK would get though. Also, there are some things I'd be excited to try out with the CUDA implementation that I don't think are possible with OpenCL or Metal. For instance, the Decoupled Lookback algorithm could be used in a few places to get some significant performance gains. C++ templating is currently used for the GPU kernels (here for instance). The issue was trying to figure out a nice way to create a type without using keywords specific to the Metal Shading Language. For instance, https://github.com/andrewmilson/ministark/blob/main/gpu-poly/src/metal/felt_u64.h.metal#L61 uses the "constant" keyword (it won't work if it's removed), which isn't standard C++ as far as I'm aware. I guess all the... The allocator stuff sounds cool, but the Metal Shading Language only supports C++14 :(
Hey Andrew! CUDA has unified memory which abstracts away CPU<->GPU transfers. The memory pages are migrated implicitly by the CUDA driver according to where the memory is accessed. Here are some resources you might find helpful:
Fault-driven migration comes with the additional overhead of the GPU MMU stalling until the required memory range is available on the GPU. To avoid this overhead, you can distribute memory between the CPU and GPU, with memory mappings from GPU to CPU, to facilitate fault-free memory access. Look at the cudaMemPrefetchAsync and cudaMemAdvise APIs. Hope this helps!
Hahah the CUDA legend himself! This is super helpful. Thanks mate
This likely requires a fair bit of work.
Since Metal and CUDA are both C++ based, it would be great if field implementations (and other functionality) could be shared between the CUDA and Metal code. One issue might be the address space keywords that Metal uses, e.g. "constant" (which is being used for constants in field implementations).
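One possible way to sidestep the keyword problem is to hide the backend-specific qualifiers behind macros, so the field arithmetic itself is written once. The following is only a sketch under stated assumptions: the macro names (`FELT_FN`, `FELT_CONSTANT`, `FELT_ASSERT`), the `felt_word` typedef and the functions are hypothetical and not taken from gpu-poly, the modulus 2^64 - 2^32 + 1 is assumed purely for illustration, and the Metal and CUDA branches have not been compiled. Only the plain C++ fallback path is exercised here.

```cpp
// Each backend spells "device function" and "program-scope constant" differently,
// so the differences are collected in one place.
#if defined(__METAL_VERSION__)
  // Metal Shading Language: program-scope variables need the `constant` address space.
  #define FELT_FN inline
  #define FELT_CONSTANT constant
  #define FELT_ASSERT(cond) ((void)0)
  typedef unsigned long felt_word;       // 64-bit in MSL
#elif defined(__CUDACC__)
  // CUDA: a constexpr integral constant can be read from both host and device code.
  #define FELT_FN __host__ __device__ inline
  #define FELT_CONSTANT constexpr
  #define FELT_ASSERT(cond) ((void)0)
  typedef unsigned long long felt_word;
#else
  // Plain C++ fallback, handy for unit-testing the shared code on the CPU.
  #include <cassert>
  #include <cstdint>
  #define FELT_FN inline
  #define FELT_CONSTANT constexpr
  #define FELT_ASSERT(cond) assert(cond)
  typedef std::uint64_t felt_word;
#endif

// Assumed modulus for illustration: p = 2^64 - 2^32 + 1.
FELT_CONSTANT felt_word FELT_MODULUS = 0xFFFFFFFF00000001;

// Modular addition for inputs already reduced to [0, p).
// Comparing against p - b avoids overflowing the 64-bit word.
FELT_FN felt_word felt_add(felt_word a, felt_word b) {
    FELT_ASSERT(a < FELT_MODULUS && b < FELT_MODULUS);
    felt_word neg_b = FELT_MODULUS - b;  // in (0, p]
    return a >= neg_b ? a - neg_b : a + b;
}

// Modular subtraction for inputs already reduced to [0, p).
FELT_FN felt_word felt_sub(felt_word a, felt_word b) {
    FELT_ASSERT(a < FELT_MODULUS && b < FELT_MODULUS);
    return a >= b ? a - b : a + (FELT_MODULUS - b);
}
```

The trade-off is readability: every shared header has to use the macros instead of the native keywords, but the arithmetic can then be tested on the CPU without a GPU toolchain.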
Also the first gpu-poly version was written for my M1 Mac. M1 has a unified memory architecture so the memory doesn't have to be moved to and from the GPU. This will no longer be the case if CUDA support is added. Might be worth creating a new Buffer type that abstracts away CPU<->GPU memory movement from the library.
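To make the Buffer idea a bit more concrete, here is a rough sketch of what such a type could track. It is written in C++ to match the kernel code and runs on the CPU only: the "device" copy is simulated with a second `std::vector`, and every name here (`Buffer`, `UnifiedMemory`, `DiscreteMemory`, `copies()`) is hypothetical. A real implementation would replace the vector assignments with the backend's transfer calls.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical memory models. Unified (e.g. M1): one allocation, no transfers.
// Discrete (e.g. a typical CUDA card): separate host and device copies.
struct UnifiedMemory  { static constexpr bool needs_copy = false; };
struct DiscreteMemory { static constexpr bool needs_copy = true; };

template <typename T, typename Memory>
class Buffer {
public:
    explicit Buffer(std::vector<T> data) : host_(std::move(data)) {}

    // Host-side access. Pulls the data back first if the device copy is newer.
    std::vector<T>& host() {
        if (Memory::needs_copy && newest_ == Side::Device) {
            host_ = device_;   // stand-in for a device-to-host transfer
            ++copies_;
        }
        newest_ = Side::Host;  // conservatively assume the caller may write
        return host_;
    }

    // Device-side access. Pushes the data over first if the host copy is newer.
    std::vector<T>& device() {
        if (!Memory::needs_copy) return host_;  // unified: same allocation
        if (newest_ == Side::Host) {
            device_ = host_;   // stand-in for a host-to-device transfer
            ++copies_;
        }
        newest_ = Side::Device;
        return device_;
    }

    // Number of simulated transfers so far (useful for tests and profiling).
    std::size_t copies() const { return copies_; }

private:
    enum class Side { Host, Device };
    std::vector<T> host_;
    std::vector<T> device_;    // unused when memory is unified
    Side newest_ = Side::Host;
    std::size_t copies_ = 0;
};
```

With this shape, library code only ever asks for `host()` or `device()`. On unified memory both return the same storage and no transfer happens, while on discrete memory a transfer happens only when the other side holds the newer data.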