Clarify build strategy for heterogeneous applications (and clean all build options) #318

Open
valassi opened this issue Dec 17, 2021 · 4 comments


@valassi
Member

valassi commented Dec 17, 2021

I am opening an issue that is a bit of a catch-all container in the area of build options, c++ vs cuda, and host vs device.

This started off from the work I want to do to integrate the bridge, with a simple test emulating fortran random numbers/sampling connected to the cuda ME.

Take a component like rambo, for instance. It can do its work on the host or on the device, even if in both cases the ME is computed on the device. The point is that I need a build of the c++/host version of rambo that links against the cuda/device version of the ME. So far, for things like rambo we only had EITHER a gcc build of the c++/host version OR an nvcc build of the cuda/device version. Now I would also like a c++/host version that I can use with the cuda ME. (All these issues will become the norm for truly heterogeneous workloads as in #85.)

The easiest approach would essentially be to build the rambo c++/host version with nvcc: after all, nvcc is a c++ compiler too, and it would be nice, for instance, to test SIMD c++ vectorization in an nvcc build. The problem is that simply setting CXX=nvcc runs into various other issues. Some may be fixed using the nvcc options that forward unknown flags to the host compiler/linker, but not all of them. There are also some -ccbin and -Xcompiler options to clean up. Also, is CXXFLAGS really needed on all link commands in the Makefile? There is quite some cleanup to do.

On the code side, there are (my fault) many different namespaces for cuda and c++. I am converging on the idea of having just two, say mg5amcOnCpu and mg5amcOnGpu: the latter for a CUDACC (i.e. nvcc) build, the former for a gcc/clang build.
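As a hedged sketch (the guard macro usage and the namespace body here are illustrative, not the final code), the two-namespace idea could look like this:

```cpp
// Illustrative sketch only: one source tree, two namespaces selected by the compiler.
// nvcc defines __CUDACC__, so an nvcc build populates mg5amcOnGpu while a gcc/clang
// build populates mg5amcOnCpu; symbols from the two builds can then never clash.
#ifdef __CUDACC__
namespace mg5amcOnGpu
#else
namespace mg5amcOnCpu
#endif
{
  // memory helpers, rambo, matrix elements, ... would all live in this namespace
}
```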

Also on the code side, declaring things like rambo as both device and host functions should ensure that a single nvcc build makes them usable both on the CPU and on the GPU.
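A minimal hedged sketch of that idea (the function below is an invented stand-in, not the real rambo code):

```cpp
#include <cmath>

// A rambo-like helper declared for both host and device: within a single nvcc build,
// the same function can be called from ordinary CPU code and from inside CUDA kernels.
__host__ __device__ inline double toyEnergy( double px, double py, double pz, double mass )
{
  return sqrt( px * px + py * py + pz * pz + mass * mass );
}

// Host-side use; the device-side use would be an identical call inside a __global__ kernel.
inline void toyEnergiesOnHost( const double* momenta, double* energies, int nevt )
{
  for( int ievt = 0; ievt < nevt; ievt++ ) // momenta laid out as (E,px,py,pz) per event
    energies[ievt] = toyEnergy( momenta[4 * ievt + 1], momenta[4 * ievt + 2], momenta[4 * ievt + 3], 0. );
}
```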

So, in principle, one could aim for:

  • CPU-only application: mg5amcOnCpu namespace, build everything with your favorite gcc/clang/icpx compiler
  • CPU+GPU application (which for instance requires cudaMallocHost instead of malloc on the host; see the sketch after this list): mg5amcOnGpu namespace, build everything with nvcc, making sure that it delegates the c++ parts correctly to your favorite gcc/clang/icpx compiler (so in principle you should get the same performance, even from the ME vectorization in c++)
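Here is a hedged sketch of the host allocation difference mentioned in the second bullet (the helper names are invented; the real code may organise this differently):

```cpp
#include <cstdlib>
#ifdef __CUDACC__
#include <cuda_runtime.h>
#endif

// Allocate a host-side buffer of 'size' doubles.
// The CPU+GPU build uses pinned (page-locked) memory to speed up host<->device copies.
inline double* newHostBuffer( std::size_t size )
{
#ifdef __CUDACC__
  double* buffer = nullptr;
  cudaMallocHost( (void**)&buffer, size * sizeof( double ) ); // pinned host memory
  return buffer;
#else
  return (double*)malloc( size * sizeof( double ) ); // plain host memory
#endif
}

inline void deleteHostBuffer( double* buffer )
{
#ifdef __CUDACC__
  cudaFreeHost( buffer );
#else
  free( buffer );
#endif
}
```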

This is not urgent, but for some of these issues it is better to think earlier rather than later.

@valassi
Member Author

valassi commented Dec 18, 2021

On second thought, it does not make sense to build the c++ SIMD versions with nvcc anyway, because each build uses a different definition of neppV and fptype_sv: the SIMD types differ from build to build. One probably needs to do separate builds and link them together, as in the multi-SIMD idea #177.
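A hedged sketch of why a single pass cannot cover all cases: the vector types themselves depend on the compiler and the target ISA (the exact macros and widths in the real code may differ):

```cpp
// Illustrative only: each compiler/ISA combination yields a different vector type,
// so the variants cannot coexist in a single translation unit or a single nvcc pass.
typedef double fptype;

#ifdef __CUDACC__
typedef fptype fptype_sv;                                        // GPU: scalar, one event per thread
constexpr int neppV = 1;
#elif defined __AVX512F__
typedef fptype fptype_sv __attribute__( ( vector_size( 64 ) ) ); // 8 doubles per SIMD vector
constexpr int neppV = 8;
#elif defined __AVX2__
typedef fptype fptype_sv __attribute__( ( vector_size( 32 ) ) ); // 4 doubles per SIMD vector
constexpr int neppV = 4;
#else
typedef fptype fptype_sv;                                        // scalar fallback
constexpr int neppV = 1;
#endif
```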

Probably best to rethink the API, cleanly separate data classes from processing classes, and strip data ownership out of the kernel launchers. That is, three sets of classes: data classes (own the data), data access classes/methods (interpret the AOSOA patterns if/where required), and computational classes. The distinction between host and device, and between gcc and nvcc, is slightly different in each of these cases.
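A hedged sketch of that separation (all class and method names below are invented for illustration, not the actual API):

```cpp
#include <vector>

typedef double fptype;

// 1. Data class: owns a host-side buffer of momenta (other variants could own pinned or device memory).
class HostMomentaBuffer
{
public:
  HostMomentaBuffer( int nevt, int np4, int npar )
    : m_nevt( nevt ), m_np4( np4 ), m_npar( npar ), m_data( nevt * np4 * npar ) {}
  fptype* data() { return m_data.data(); }
  int nevt() const { return m_nevt; }
private:
  int m_nevt, m_np4, m_npar;
  std::vector<fptype> m_data;
};

// 2. Data access: interprets the memory layout (trivially AOS here; the real code would decode AOSOA).
struct MomentaAccess
{
  static fptype& ip4Ipar( fptype* buffer, int npar, int np4, int ievt, int ipar, int ip4 )
  {
    return buffer[( ievt * npar + ipar ) * np4 + ip4];
  }
};

// 3. Computational class: operates on buffers it does not own.
class ToySamplingKernel
{
public:
  void generateMomenta( const fptype* randomNumbers, HostMomentaBuffer& momenta ) const
  {
    // ... fill momenta from randomNumbers (body omitted in this sketch) ...
  }
};
```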

valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 21, 2021
valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 21, 2021
valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 21, 2021
… to be both global and host (madgraph5#318)

(The code builds but RamboSamplingKernels is incomplete - and not yet linked to check_sa.cc)
valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 21, 2021
…cc and runTest.cc - same performance

(The code builds and runs on host/c++ and device/cuda - but not yet on host/cuda, issue madgraph5#318)
@valassi
Member Author

valassi commented Jan 11, 2022

There is one important issue (I realised this while looking at #307): note that things like cxtype currently have two different definitions in gcc and nvcc builds, and they even live within the same typedef! This is really horrible and a recipe for clashes and disasters. We should clean up these namespaces.
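A hedged illustration of the problem (the concrete complex types below are just an example of how one typedef can resolve differently per compiler):

```cpp
#include <complex>
#ifdef __CUDACC__
#include <thrust/complex.h>
#endif

// The same name means two unrelated types depending on which compiler sees it:
// any interface exposing cxtype across a gcc/nvcc boundary risks ODR and ABI clashes.
#ifdef __CUDACC__
typedef thrust::complex<double> cxtype; // nvcc build
#else
typedef std::complex<double> cxtype;    // gcc/clang build
#endif
```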

@valassi
Member Author

valassi commented Jan 12, 2022

Another random comment: note that things like the FFV functions can only be EITHER global OR host in nvcc builds. There are two ways out:

  • foresee the FFVs as host+device functions, and add global wrappers to call them as kernels: this looks very cumbersome (but would allow building the FFVs as host functions in nvcc builds too)
  • or cleanly decide that FFV on the host is only built with gcc (as we do now), while FFV on the device is built as a global kernel with nvcc (as we also do now)

The second option sounds much better, but then we need to link gcc and nvcc objects together, at least for the ME calculations. That is probably what we do anyway (see also #319: using CXX=nvcc is really cumbersome). A minimal sketch of the first option follows below.
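A hedged sketch of the first option, to make the "cumbersome" part concrete (the toy function stands in for a real FFV routine, whose actual signature is different):

```cpp
// Option 1: mark the FFV-like helper as host+device, so one nvcc build serves both sides...
__host__ __device__ inline void toyFFV( const double* f1, const double* f2, const double* v3, double* vertex )
{
  *vertex = f1[0] * f2[0] * v3[0]; // placeholder arithmetic, not a real amplitude
}

// ...but launching it on the GPU still needs a hand-written __global__ wrapper per function:
// this per-function boilerplate is what makes the first option cumbersome.
__global__ void toyFFVKernel( const double* f1, const double* f2, const double* v3, double* vertices, int nevt )
{
  const int ievt = blockIdx.x * blockDim.x + threadIdx.x;
  if( ievt < nevt ) toyFFV( &f1[ievt * 4], &f2[ievt * 4], &v3[ievt * 4], &vertices[ievt] );
}
```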

@valassi
Member Author

valassi commented Jul 19, 2023

Note: in MR #723, fixing #725, I improved the separation of the cpu and gpu namespaces (so now it is a bit safer to mix the two codes... though it is maybe a better idea not to do that anyway). So #723 does a lot of the work described here...
