Update to ROCm 5.4.2 and TBB 2021.8.0 #8273

fwyzard
Contributor

@fwyzard fwyzard commented Jan 30, 2023

Update ROCm to version 5.4.2.

Enable support for newer AMD GPUs:

  • AMD Instinct MI50/MI60 (gfx906)
  • AMD Instinct MI100/MI210/MI250 (gfx908)

Update TBB to version 2021.8.0.

@fwyzard
Contributor Author

fwyzard commented Jan 30, 2023

please test

@cmsbuild
Contributor

A new Pull Request was created by @fwyzard (Andrea Bocci) for branch IB/CMSSW_13_0_X/master.

@smuzaffar, @aandvalenzuela, @iarspider can you please review it and eventually sign? Thanks.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.
cms-bot commands are listed here

@cmsbuild
Contributor

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a60203/30243/summary.html
COMMIT: 646f4f0
CMSSW: CMSSW_13_0_X_2023-01-29-2300/el8_amd64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/8273/30243/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found compilation error when building:

+ '[' -d /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/689d2a6a8858d80b8fa33b31f7723664/opt/cmssw/el8_amd64_gcc11/external/rocm/5.4.2-689d2a6a8858d80b8fa33b31f7723664/lib/pkgconfig ']'
+ rm -f '/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/689d2a6a8858d80b8fa33b31f7723664/opt/cmssw/el8_amd64_gcc11/external/rocm/5.4.2-689d2a6a8858d80b8fa33b31f7723664/lib/*.la' /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/689d2a6a8858d80b8fa33b31f7723664/opt/cmssw/el8_amd64_gcc11/external/rocm/5.4.2-689d2a6a8858d80b8fa33b31f7723664/lib/libhipfort-amdgcn.a /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/689d2a6a8858d80b8fa33b31f7723664/opt/cmssw/el8_amd64_gcc11/external/rocm/5.4.2-689d2a6a8858d80b8fa33b31f7723664/lib/libhipfort-nvptx.a /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/689d2a6a8858d80b8fa33b31f7723664/opt/cmssw/el8_amd64_gcc11/external/rocm/5.4.2-689d2a6a8858d80b8fa33b31f7723664/lib/libhsakmt.a
rm: cannot remove '/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/689d2a6a8858d80b8fa33b31f7723664/opt/cmssw/el8_amd64_gcc11/external/rocm/5.4.2-689d2a6a8858d80b8fa33b31f7723664/lib/libhipfort-amdgcn.a': Read-only file system
rm: cannot remove '/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/689d2a6a8858d80b8fa33b31f7723664/opt/cmssw/el8_amd64_gcc11/external/rocm/5.4.2-689d2a6a8858d80b8fa33b31f7723664/lib/libhipfort-nvptx.a': Read-only file system
rm: cannot remove '/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/689d2a6a8858d80b8fa33b31f7723664/opt/cmssw/el8_amd64_gcc11/external/rocm/5.4.2-689d2a6a8858d80b8fa33b31f7723664/lib/libhsakmt.a': Read-only file system
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.7NvLWG (%install)


RPM build errors:
line 35: It's not recommended to have unversioned Obsoletes: Obsoletes: external+rocm+5.4.2-689d2a6a8858d80b8fa33b31f7723664
Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.7NvLWG (%install)


Add support for
  - AMD Instinct MI50/MI60 (gfx906)
  - AMD Instinct MI100/MI210/MI250 (gfx908)
@cmsbuild
Contributor

Pull request #8273 was updated.

@fwyzard
Contributor Author

fwyzard commented Jan 30, 2023

please test

@cmsbuild
Contributor

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a60203/30246/summary.html
COMMIT: 79f4570
CMSSW: CMSSW_13_0_X_2023-01-29-2300/el8_amd64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/8273/30246/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found compilation error when building:

+ '[' -d /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/689d2a6a8858d80b8fa33b31f7723664/opt/cmssw/el8_amd64_gcc11/external/rocm/5.4.2-689d2a6a8858d80b8fa33b31f7723664/lib/pkgconfig ']'
+ rm -f '/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/689d2a6a8858d80b8fa33b31f7723664/opt/cmssw/el8_amd64_gcc11/external/rocm/5.4.2-689d2a6a8858d80b8fa33b31f7723664/lib/*.la' /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/689d2a6a8858d80b8fa33b31f7723664/opt/cmssw/el8_amd64_gcc11/external/rocm/5.4.2-689d2a6a8858d80b8fa33b31f7723664/lib/libhipfort-amdgcn.a /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/689d2a6a8858d80b8fa33b31f7723664/opt/cmssw/el8_amd64_gcc11/external/rocm/5.4.2-689d2a6a8858d80b8fa33b31f7723664/lib/libhipfort-nvptx.a /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/689d2a6a8858d80b8fa33b31f7723664/opt/cmssw/el8_amd64_gcc11/external/rocm/5.4.2-689d2a6a8858d80b8fa33b31f7723664/lib/libhsakmt.a
rm: cannot remove '/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/689d2a6a8858d80b8fa33b31f7723664/opt/cmssw/el8_amd64_gcc11/external/rocm/5.4.2-689d2a6a8858d80b8fa33b31f7723664/lib/libhipfort-amdgcn.a': Read-only file system
rm: cannot remove '/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/689d2a6a8858d80b8fa33b31f7723664/opt/cmssw/el8_amd64_gcc11/external/rocm/5.4.2-689d2a6a8858d80b8fa33b31f7723664/lib/libhipfort-nvptx.a': Read-only file system
rm: cannot remove '/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/689d2a6a8858d80b8fa33b31f7723664/opt/cmssw/el8_amd64_gcc11/external/rocm/5.4.2-689d2a6a8858d80b8fa33b31f7723664/lib/libhsakmt.a': Read-only file system
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.TZqyUN (%install)


RPM build errors:
line 35: It's not recommended to have unversioned Obsoletes: Obsoletes: external+rocm+5.4.2-689d2a6a8858d80b8fa33b31f7723664
Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.TZqyUN (%install)


@fwyzard
Contributor Author

fwyzard commented Jan 30, 2023

please test

@cmsbuild
Contributor

Pull request #8273 was updated.

@cmsbuild
Contributor

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a60203/30256/summary.html
COMMIT: d343cd8
CMSSW: CMSSW_13_0_X_2023-01-30-1100/el8_amd64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/8273/30256/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found compilation error when building:

+ '[' -d /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/7481b1c5a78445d17d8889da9095c769/opt/cmssw/el8_amd64_gcc11/external/rocm/5.4.2-7481b1c5a78445d17d8889da9095c769/lib/pkgconfig ']'
+ rm -f '/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/7481b1c5a78445d17d8889da9095c769/opt/cmssw/el8_amd64_gcc11/external/rocm/5.4.2-7481b1c5a78445d17d8889da9095c769/lib/*.la' /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/7481b1c5a78445d17d8889da9095c769/opt/cmssw/el8_amd64_gcc11/external/rocm/5.4.2-7481b1c5a78445d17d8889da9095c769/lib/libhipfort-amdgcn.a /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/7481b1c5a78445d17d8889da9095c769/opt/cmssw/el8_amd64_gcc11/external/rocm/5.4.2-7481b1c5a78445d17d8889da9095c769/lib/libhipfort-nvptx.a /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/7481b1c5a78445d17d8889da9095c769/opt/cmssw/el8_amd64_gcc11/external/rocm/5.4.2-7481b1c5a78445d17d8889da9095c769/lib/libhsakmt.a
rm: cannot remove '/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/7481b1c5a78445d17d8889da9095c769/opt/cmssw/el8_amd64_gcc11/external/rocm/5.4.2-7481b1c5a78445d17d8889da9095c769/lib/libhipfort-amdgcn.a': Read-only file system
rm: cannot remove '/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/7481b1c5a78445d17d8889da9095c769/opt/cmssw/el8_amd64_gcc11/external/rocm/5.4.2-7481b1c5a78445d17d8889da9095c769/lib/libhipfort-nvptx.a': Read-only file system
rm: cannot remove '/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/BUILDROOT/7481b1c5a78445d17d8889da9095c769/opt/cmssw/el8_amd64_gcc11/external/rocm/5.4.2-7481b1c5a78445d17d8889da9095c769/lib/libhsakmt.a': Read-only file system
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.u4v7Yw (%install)


RPM build errors:
line 35: It's not recommended to have unversioned Obsoletes: Obsoletes: external+rocm+5.4.2-7481b1c5a78445d17d8889da9095c769
Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.u4v7Yw (%install)


@fwyzard fwyzard force-pushed the IB/CMSSW_13_0_X/master_rocm_5.4.2 branch from d343cd8 to d4ec6e6 Compare January 30, 2023 21:25
@smuzaffar
Contributor

@fwyzard, hipcc generates a lot of temporary files under the /tmp directory. I tried adding the -fno-temp-file flag, but it still creates temp files. Do you have any idea how to tell it either not to create temp files or to create them under the cmssw/tmp/src/package directory?

@fwyzard
Contributor Author

fwyzard commented Jan 31, 2023

Would it be OK to create the temporary files and leave them there until they are deleted by hand?

@fwyzard
Contributor Author

fwyzard commented Jan 31, 2023

@smuzaffar , if we add -save-temps=obj the temporary files should be created in the same directory as the target files, e.g. under tmp/el8_amd64_gcc11/src/HeterogeneousTest/ROCmDevice/plugins/HeterogeneousTestROCmDevicePlugins/.

@smuzaffar
Contributor

I noticed that setting TMPDIR to $CMSSW_BASE/tmp/$SCRAM_ARCH works: hipcc then creates its temp files under the CMSSW tmp area. I think we can update the build rules to explicitly set TMPDIR to point to the package tmp directory.

Creating temp files under /tmp fills up the disk on our build (and lxplus) nodes, so I would like to avoid this.

@smuzaffar
Contributor

@smuzaffar , if we add -save-temps=obj the temporary files should be created in the same directory as the target files, e.g. under tmp/el8_amd64_gcc11/src/HeterogeneousTest/ROCmDevice/plugins/HeterogeneousTestROCmDevicePlugins/.

ah ok, thanks. I am testing it now

@fwyzard
Contributor Author

fwyzard commented Jan 31, 2023

However, I do not know if that may create problems during concurrent builds (scram b -j), as some files seem to have pretty generic names, like a.out-hip-amdgcn-amd-amdhsa-gfx900.

@fwyzard
Contributor Author

fwyzard commented Jan 31, 2023

I noticed that setting TMPDIR to $CMSSW_BASE/tmp/$SCRAM_ARCH works: hipcc then creates its temp files under the CMSSW tmp area. I think we can update the build rules to explicitly set TMPDIR to point to the package tmp directory.

I think this might be a better solution - maybe even use TMPDIR=$CMSSW_BASE/tmp/$SCRAM_ARCH/some/target/specific/dir to keep them per-package.
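
The per-target TMPDIR idea can be sketched as a small shell helper. This is only an illustration under stated assumptions, not the actual cmssw-config build rule: the function name `run_with_target_tmpdir` and the `hipcc-tmp/<target>` directory layout are hypothetical, and any command can stand in for the real hipcc invocation.

```shell
# Hypothetical helper: run a command with TMPDIR pointing at a
# per-target directory under the CMSSW tmp area, so that temporary
# files with generic names from concurrent targets cannot collide.
run_with_target_tmpdir() {
    target="$1"
    shift
    # The hipcc-tmp/<target> layout is an assumption for illustration.
    dir="${CMSSW_BASE:?CMSSW_BASE is not set}/tmp/${SCRAM_ARCH:?SCRAM_ARCH is not set}/hipcc-tmp/${target}"
    mkdir -p "$dir" || return 1
    # The assignment applies only to the environment of this one command.
    TMPDIR="$dir" "$@"
}
```

Each target gets its own directory, so a parallel build (scram b -j) would not have two compiler processes writing same-named temporary files into a shared location, and the files stay inside the work area rather than in /tmp.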

@fwyzard
Contributor Author

fwyzard commented Jan 31, 2023

By the way, can we make any further changes in a separate PR?
If it doesn't break things, I'd like to merge this one so I can move forward with the CMSSW part...

@smuzaffar
Contributor

Hopefully the current changes will not break anything in CMSSW, as nothing in CMSSW depends on ROCm. We could make the change here, but that would mean restarting the PR tests, so I suggest we get this in once the current tests are done (the PR tests are already building CMSSW now).

The TMPDIR change for hipcc needs to go into cmssw-config; I will open a separate PR for that.

@smuzaffar
Contributor

@fwyzard, do we have ROCm distributions for SLC7 and RHEL9?

@fwyzard
Contributor Author

fwyzard commented Jan 31, 2023

We have them for RHEL7 under /cvmfs/patatrack.cern.ch/externals/x86_64/rhel7/amd/.

We have them for RHEL9 under /cvmfs/patatrack.cern.ch/externals/x86_64/rhel9/amd/ starting from ROCm 5.3.0, which was the first to be released for this architecture.
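
As a rough sketch of how a script might pick the matching base directory, the lookup below encodes only the two paths and the 5.3.0 lower bound quoted above. The function name `rocm_base_dir` is hypothetical, the layout below each base directory is deliberately left out because it is not described here, and `sort -V` (version sort) is assumed to be available.

```shell
# Hypothetical lookup built only from the two paths quoted above.
# Prints the base directory for a given OS tag and ROCm version,
# or returns non-zero if that combination is not known to exist.
rocm_base_dir() {
    os="$1"
    version="$2"
    case "$os" in
        rhel7)
            ;;
        rhel9)
            # ROCm 5.3.0 was the first release available for RHEL9.
            lowest=$(printf '%s\n' "$version" 5.3.0 | sort -V | head -n 1)
            [ "$lowest" = "5.3.0" ] || return 1
            ;;
        *)
            return 1
            ;;
    esac
    printf '%s\n' "/cvmfs/patatrack.cern.ch/externals/x86_64/${os}/amd/"
}
```

For example, `rocm_base_dir rhel9 5.4.2` prints the RHEL9 path, while `rocm_base_dir rhel9 5.2.0` fails because that release predates RHEL9 support.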

@fwyzard
Contributor Author

fwyzard commented Jan 31, 2023

@smuzaffar, if on a CMSSW PR I ask the bot to "please test with cms-sw/cmsdist#8273", will it rebuild the externals from scratch or reuse the ones built here?

@fwyzard
Contributor Author

fwyzard commented Jan 31, 2023

---> test testTriggerMonitors had ERRORS

Looks pretty much unrelated?

@fwyzard
Contributor Author

fwyzard commented Jan 31, 2023

@smuzaffar, can we merge this, or should I try to rerun the tests?

@cmsbuild
Contributor

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a60203/30269/summary.html
COMMIT: 0be4176
CMSSW: CMSSW_13_0_X_2023-01-30-2300/el8_amd64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/8273/30269/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found errors in the following unit tests:

---> test testTriggerMonitors had ERRORS

Comparison Summary

Summary:

  • You potentially removed 13 lines from the logs
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3555486
  • DQMHistoTests: Total failures: 4
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3555460
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.007 KiB( 48 files compared)
  • DQMHistoSizes: changed ( 11634.0,... ): -0.001 KiB HLT/Filters
  • Checked 211 log files, 162 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@smuzaffar
Contributor

+externals
looks good, the unit test failure is not related to this change.

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_13_0_X/master IBs (but tests are reportedly failing). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@fwyzard fwyzard deleted the IB/CMSSW_13_0_X/master_rocm_5.4.2 branch January 31, 2023 17:18