Polynomial arithmetic implemented in CUDA #2

Open
andrewmilson opened this issue Nov 13, 2022 · 4 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@andrewmilson
Owner

andrewmilson commented Nov 13, 2022

This likely requires a fair bit of work.

Since Metal and CUDA are both C++ based it would be great if field implementations (and other functionality) could be shared between the CUDA and Metal code. One issue might be the address space keywords that Metal uses, such as "constant" (which is being used for constants in field implementations).

Also the first gpu-poly version was written for my M1 Mac. M1 has a unified memory architecture so the memory doesn't have to be moved to and from the GPU. This will no longer be the case if CUDA support is added. Might be worth creating a new Buffer type that abstracts away CPU<->GPU memory movement from the library.
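
A minimal sketch of what such a Buffer type might look like, assuming the storage strategy is chosen at compile time. All names here (`Buffer`, `UnifiedStorage`, `DiscreteStorage`) are hypothetical and not part of gpu-poly. The "device" side of `DiscreteStorage` is simulated with a second host vector so the example runs without a GPU; a real CUDA backend would presumably use `cudaMalloc` and `cudaMemcpy` at the marked places.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical backend for unified-memory devices (e.g. Apple M1):
// host and device share one allocation, so syncing is a no-op.
template <class T>
struct UnifiedStorage {
    std::vector<T> data;
    explicit UnifiedStorage(std::size_t n) : data(n) {}
    T* host() { return data.data(); }
    T* device() { return data.data(); }
    void to_device() {}
    void to_host() {}
};

// Hypothetical backend for discrete GPUs. `device_copy` only stands in
// for device memory; real code would allocate and copy with the GPU API.
template <class T>
struct DiscreteStorage {
    std::vector<T> host_copy;
    std::vector<T> device_copy;
    explicit DiscreteStorage(std::size_t n) : host_copy(n), device_copy(n) {}
    T* host() { return host_copy.data(); }
    T* device() { return device_copy.data(); }
    void to_device() { device_copy = host_copy; }  // stand-in for host-to-device copy
    void to_host() { host_copy = device_copy; }    // stand-in for device-to-host copy
};

// Library code talks to Buffer only, never to a specific backend.
template <class T, template <class> class Storage>
class Buffer {
public:
    explicit Buffer(std::size_t n) : storage_(n), len_(n) {}
    std::size_t size() const { return len_; }
    T* host() { return storage_.host(); }
    T* device() { return storage_.device(); }
    void to_device() { storage_.to_device(); }
    void to_host() { storage_.to_host(); }

private:
    Storage<T> storage_;
    std::size_t len_;
};
```

With this shape, code that calls `to_device()` before a kernel launch and `to_host()` afterwards behaves the same on both backends; on unified memory the calls simply cost nothing.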

@andrewmilson added the enhancement and help wanted labels on Nov 13, 2022
@powergun

Wondering if you have considered OpenCL for wider hardware compatibility?

Since Metal and CUDA are both C++ based it would be great if field implementations (and other functionality) could be shared between the CUDA and Metal code

Based on experience, this can be achieved with C++ templating. It would dramatically increase code complexity and decrease readability, but it requires the fewest tooling changes (any C++98 compiler would do).

M1 has a unified memory architecture so the memory doesn't have to be moved to and from the GPU. This will no longer be the case if CUDA support is added.

Not a direct answer, but I think you could borrow the idea from the C++17 polymorphic allocator:
basically, it lets people write different memory allocator implementations that conform to the C++17 allocator interface (std::pmr::memory_resource); each allocator could use the heap, an arena, or a device-specific address space; the STL data structures using the allocator interface won't notice the difference and will just work...
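
To make the allocator idea concrete, here is a small host-side sketch, assuming a C++17 standard library with `<memory_resource>`. `CountingResource` is a hypothetical name; it only counts bytes and forwards to the default heap resource, but a device-aware resource could hand out pinned or device-visible memory from the same three overrides.

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

// Hypothetical custom resource: tracks how many bytes were requested and
// forwards the actual allocation to the default new/delete resource.
class CountingResource : public std::pmr::memory_resource {
public:
    std::size_t bytes_allocated = 0;

private:
    void* do_allocate(std::size_t bytes, std::size_t align) override {
        bytes_allocated += bytes;
        return std::pmr::new_delete_resource()->allocate(bytes, align);
    }

    void do_deallocate(void* p, std::size_t bytes, std::size_t align) override {
        std::pmr::new_delete_resource()->deallocate(p, bytes, align);
    }

    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }
};
```

A container opts in by taking the resource at construction, for example `std::pmr::vector<int> v(&res);`, and the rest of the code uses `v` like any other vector.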

@andrewmilson
Owner Author

Thanks Wei.

I read online that OpenCL was ~40% slower than Metal and ~30% slower than CUDA. I'd be really curious what kind of performance an OpenCL implementation of miniSTARK would get though. Also, there are some things I'd be excited to try out with the CUDA implementation that I don't think are possible with OpenCL or Metal. For instance, the Decoupled Lookback algorithm (a single-pass parallel prefix scan) could be used in a few places to get some significant performance gains.

C++ templating is currently used for the GPU kernels (here for instance). The issue was trying to figure out a nice way to create a type without using keywords specific to the Metal Shading Language. For instance, https://github.com/andrewmilson/ministark/blob/main/gpu-poly/src/metal/felt_u64.h.metal#L61 uses the "constant" keyword (won't work if it's removed), which isn't standard C++ as far as I'm aware. I guess all the Ns could be replaced by 18446744069414584321, but I'd really like to keep things readable. If "constant" can be removed somehow, then I think the types can be shared between Metal and CUDA code.
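
One possible way to keep a named modulus without hard-coding the Metal keyword is to hide it behind a macro. The sketch below is only an illustration under assumptions: it relies on `__METAL_VERSION__` being predefined by the Metal compiler and `__CUDACC__` by nvcc, and every `FELT_*` name, `u64`, and `felt_add` is hypothetical rather than taken from gpu-poly. Only the plain C++ branch is exercised here.

```cpp
#if defined(__METAL_VERSION__)
  // Metal: program-scope variables need the `constant` address space.
  #define FELT_CONSTANT constant
  #define FELT_FN inline
  typedef unsigned long u64;  // assumed to be 64 bits wide in Metal
#elif defined(__CUDACC__)
  #include <cstdint>
  // CUDA: a constexpr value can be read from device code.
  #define FELT_CONSTANT constexpr
  #define FELT_FN __host__ __device__ inline
  typedef std::uint64_t u64;
#else
  #include <cstdint>
  // Plain C++ fallback, handy for host-side unit tests.
  #define FELT_CONSTANT constexpr
  #define FELT_FN inline
  typedef std::uint64_t u64;
#endif

// The modulus from the thread: 18446744069414584321 = 2^64 - 2^32 + 1.
FELT_CONSTANT u64 N = 0xFFFFFFFF00000001;

// Modular addition for inputs already reduced below N.
FELT_FN u64 felt_add(u64 a, u64 b) {
    u64 sum = a + b;           // may wrap around 2^64
    bool overflow = sum < a;   // wrap-around detected
    // In both cases the wrapping subtraction yields a + b - N.
    return (overflow || sum >= N) ? sum - N : sum;
}
```

The shared header then refers to `N` everywhere, and only the few macro definitions at the top differ per backend.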

The allocator stuff sounds cool but Metal Shading Language only supports C++14 :(

@cheinger

Hey Andrew!

CUDA has unified memory which abstracts away CPU<->GPU transfers. The memory pages are migrated implicitly by the CUDA driver according to where the memory is accessed. Here are some resources you might find helpful:

Fault-driven migration comes with the additional overhead of the GPU MMU stalling until the required memory range is available on the GPU. To overcome this overhead, you can distribute memory between CPU and GPU with memory mappings from GPU to CPU to facilitate fault-free memory access. Look at the cudaMemPrefetchAsync and cudaMemAdvise APIs.

Hope this helps!

@andrewmilson
Owner Author

Hahah the CUDA legend himself! This is super helpful. Thanks mate
