Implement HIP/ROCm build rules and test packages #40627
Comments
assign core
assign heterogeneous
New categories assigned: heterogeneous,core @fwyzard,@Dr15Jones,@smuzaffar,@makortel,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks
A new Issue was created by @fwyzard Andrea Bocci. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
@smuzaffar could you try to set up these rules?
@smuzaffar thank you for cms-sw/cmsdist#8268 and cms-sw/cmsdist#8271, I've experimented a bit with them and I have some follow-up:
If you could make these changes, I can experiment with them in a couple of days.
This is fixed in ROCm 5.2.x and later. |
@fwyzard, I have a few questions:
yes: for the alpaka/rocm backend,
yes (see above)
I think we should never have both
@fwyzard, as we do not distribute
I know the ROCm distribution is big - that's why we are using symlinks in the first place - but I don't really know how we can trim it. We definitely do not use all the components and libraries that it provides, but I'm not sure what we should include and what we can drop.
Correct, but I am not sure if grid sites do an auto mount or an explicit mount of cvmfs repos. Sites with an explicit mount have to explicitly include patatrack.cern.ch in the mount list. Maybe add a rocm-runtime package which can just copy all the runtime libs and ship it along with cmssw?
Mhm, I don't think this aspect is terribly urgent, because so far I'm not aware of any site with AMD GPUs.
Also, what about a system ROCm installation? I assume a system with an AMD GPU will have some local libs too. Do we also need to do some dynamic env setting to choose between the system libs and the libs from the cms externals?
I have absolutely no idea about how to deal with the system driver and runtime... this too goes into the "let's get more experience" part. Of course, if anybody with experience with ROCm and AMD GPUs wants to help, that would be great :-)
+heterogeneous |
@smuzaffar Do we have anything to add to this issue from
Hmm, I thought the build rules were not complete for the alpaka/rocm backend. @fwyzard did you try enabling the rocm backend, and did it work? If yes, then we need to update the build rules to enable rocm by default.
Mhm, my bad, I thought we had already enabled them by default... instead, I have the change in my local area:

```diff
diff --git a/Projects/CMSSW/Self.xml b/Projects/CMSSW/Self.xml
index 24bc3a9..154617f 100644
--- a/Projects/CMSSW/Self.xml
+++ b/Projects/CMSSW/Self.xml
@@ -25,7 +25,7 @@
     <flags CHECK_PRIVATE_HEADERS="1"/>
     <flags SCRAM_TARGETS="haswell sandybridge nehalem"/>
     <flags OVERRIDABLE_FLAGS="CPPDEFINES CXXFLAGS FFLAGS CFLAGS CPPFLAGS LDFLAGS CUDA_FLAGS CUDA_LDFLAGS LTO_FLAGS ROCM_FLAGS ROCM_LDFLAGS"/>
-    <flags ALPAKA_BACKENDS="cuda serial"/>
+    <flags ALPAKA_BACKENDS="cuda rocm serial"/>
     <flags CODE_CHECK_ALPAKA_BACKEND="serial"/>
     <flags ENABLE_LTO="@ENABLE_LTO@"/>
   </client>
```

but I never made a PR with it.
@fwyzard, I just tested locally and am surprised that the alpaka/rocm backend works (at least it builds). It generates a lot of warnings though.
Yes, that's something I'm following up inside Alpaka.
@smuzaffar some of the warnings can be fixed inside Alpaka (done in alpaka-group/alpaka#1914), while some are caused by the HIP headers. That is, changing … to … ?
I think I found it: `<flags SYSTEM_INCLUDE="1"/>`
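For illustration, assuming `SYSTEM_INCLUDE="1"` makes scram pass the tool's include paths to the compiler with `-isystem` instead of `-I`, so that the compiler treats them as system headers and suppresses the warnings coming from them (`ROCM_BASE` here just stands for the ROCm installation path):

```sh
# with -I, warnings from the ROCm headers are reported:
hipcc -c -I${ROCM_BASE}/include file.hip.cc

# with -isystem, they are treated as system headers and their warnings
# are suppressed:
hipcc -c -isystem ${ROCM_BASE}/include file.hip.cc
```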
yes |
I think that with the set of PRs that entered CMSSW_13_0_0 and 13_1_0_pre1, everything is in place and working.
+heterogeneous |
+core |
This issue is fully signed and ready to be closed. |
@cmsbuild, please close |
Here is a first attempt at defining the build rules to build HIP files for the ROCm backend.
I am confident that these are not final, but having something to play with will likely help me make progress towards more robust rules.
#### affected files

These rules apply to:

- `*.hip.cc` files under the `src`, `plugins`, `test`, and `bin` directories;
- `alpaka/*.dev.cc` files under the same directories, when building an Alpaka-based package for the ROCm backend.

#### compilation

A `file.hip.cc` should be compiled by `hipcc` using the variables defined in the `rocm` scram tool:
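A rough sketch of what that compile step could look like, in GNU make style; the rule shape and the object-file naming are assumptions, not the actual scram rules:

```make
# Rough sketch, not the actual scram rule: compile each file.hip.cc with
# hipcc, passing the target's preprocessor and C++ flags, after filtering
# the flags listed in ROCM_HOST_REM_CXXFLAGS out of CXXFLAGS.
%.hip.cc.o: %.hip.cc
	hipcc -c $(ROCM_FLAGS) $(CPPFLAGS) \
	  $(filter-out $(ROCM_HOST_REM_CXXFLAGS),$(CXXFLAGS)) \
	  -o $@ $<
```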
where:

- `${ROCM_FLAGS}` are the ROCm-specific flags for the target; mostly they are defined by the `rocm` tool, but each target could add or remove flags as usual;
- `${CPPFLAGS}` are the preprocessor flags (`-D...`, `-U...`, `-I...`, etc.) for the target;
- `${CXXFLAGS}` are the C++ compiler flags for the target, with the flags listed in the `${ROCM_HOST_REM_CXXFLAGS}` variable filtered out.

Note: as `hipcc` supports the same options as the corresponding `clang` compiler, some of the `gcc`-specific options should be removed; `${ROCM_HOST_REM_CXXFLAGS}` should match the `${REM_CXXFLAGS}` used by `llvm-cxxcompiler`. It's possible that the two have diverged since we defined the tool, and more (or fewer) options may need to be removed, for example those related to LTO builds.

Note 2: currently `rocm` has `-fno-gpu-rdc`, but we should change it to `-fgpu-rdc`.

#### linking a device library
When building a library (not a plugin or binary), all the `*.hip.cc.o` files should be bundled into a static library, for use by other libraries and plugins that need to link to ROCm device code. For example, one library could define a `__device__` function, and a plugin could define a `__global__` kernel that calls that function, as in the sketch below.
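A minimal illustration of that use case; the file and function names are invented for the example, and it assumes relocatable device code (`-fgpu-rdc`, see Note 2 above):

```cpp
#include <hip/hip_runtime.h>

// In the library, e.g. in a (hypothetical) helpers.hip.cc:
__device__ float scale(float x) { return 2.f * x; }

// In the plugin, e.g. in a (hypothetical) kernel.hip.cc; the device-side
// call to scale() is resolved when the device code is linked:
__device__ float scale(float x);  // declaration of the library's function

__global__ void scaleAll(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    data[i] = scale(data[i]);
}
```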
This is similar to the `libTARGET_nv.a` static library we build for CUDA packages, but (for the moment) it will also contain the compiled host code. If it works, I'll look into keeping only the device code at a later time.
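The bundling itself could be as simple as archiving the HIP objects; a sketch in GNU make style, with the `HIP_OBJECTS` variable name assumed:

```make
# Rough sketch, not the actual scram rule: bundle the package's compiled
# HIP objects into a static device library, for dependent packages to link.
libTARGET_rocm.a: $(HIP_OBJECTS)
	ar rcs $@ $^
```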
If a library, plugin or binary with any `*.hip.cc` files "uses" a library with `*.hip.cc` files, that library's `libTARGET_rocm.a` should be linked in (see the next part). I think this is what we are doing for the CUDA `libTARGET_nv.a` library as well?

#### linking a host library or binary
When building a library, plugin or binary that contains `*.hip.cc` files, we should do the linking with `hipcc` instead of `g++`; it should link in any dependency's `_rocm.a` library, and it should use the target's flags for building a shared library or binary. A sketch of the combined step is given below.
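Putting the pieces together, a rough sketch of the host link step, again in GNU make style; the `OBJECTS`, `HIP_OBJECTS` and `ROCM_DEPS` variable names and the rule shape are assumptions:

```make
# Rough sketch, not the actual scram rule: link a shared library with hipcc,
# pulling in the static device libraries of the dependencies.
libTARGET.so: $(OBJECTS) $(HIP_OBJECTS)
	hipcc --hip-link -shared $(LDFLAGS) $(ROCM_LDFLAGS) $^ \
	  $(foreach dep,$(ROCM_DEPS),lib$(dep)_rocm.a) \
	  -o $@
```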
Note: with respect to the CUDA rules, using `hipcc --hip-link` avoids the need for the extra linking step. If we settle on this approach, we may be able to simplify the CUDA rules in a similar way.