[Gcc12]Update cuda version 12.2.1 #8643

Merged
merged 1 commit into from
Aug 10, 2023

smuzaffar
Contributor

No description provided.

@smuzaffar
Contributor Author

please test

@cmsbuild
Contributor

cmsbuild commented Aug 7, 2023

A new Pull Request was created by @smuzaffar (Malik Shahzad Muzaffar) for branch IB/CMSSW_13_3_X/g12.

@smuzaffar, @aandvalenzuela, @iarspider can you please review it and eventually sign? Thanks.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.
cms-bot commands are listed here

@smuzaffar
Contributor Author

please test for el8_amd64_gcc12

@cmsbuild
Contributor

cmsbuild commented Aug 8, 2023

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-8e0abc/34147/summary.html
COMMIT: 54b7f60
CMSSW: CMSSW_13_3_X_2023-08-07-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/8643/34147/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-8e0abc/34147/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-8e0abc/34147/git-merge-result

Comparison Summary

Summary:

  • You potentially added 5622 lines to the logs
  • Reco comparison results: 15162 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3150947
  • DQMHistoTests: Total failures: 36078
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3114846
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 207 log files, 159 edm output root files, 48 DQM output files
  • TriggerResults: no differences found
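
The DQMHistoTests counters above appear to partition the compared histograms: failures, nulls, successes and skipped add up to the total. A minimal Python sketch of that check, with the numbers copied from this summary (the dictionary keys and function names are illustrative and not part of cms-bot, and the assumption that the four categories are mutually exclusive is mine):

```python
# Counters copied from the DQMHistoTests lines of the summary above.
dqm = {
    "compared": 3150947,
    "failures": 36078,
    "nulls": 1,
    "successes": 3114846,
    "skipped": 22,
}


def outcome_total(counters):
    """Sum the four outcome categories (assumed to be mutually exclusive)."""
    return sum(counters[key] for key in ("failures", "nulls", "successes", "skipped"))


def failure_fraction(counters):
    """Fraction of compared histograms that were flagged as failures."""
    return counters["failures"] / counters["compared"]


# True when the outcome categories account for every compared histogram.
print(outcome_total(dqm) == dqm["compared"])
# Failure rate as a percentage of all compared histograms.
print(f"{failure_fraction(dqm):.2%}")
```

With these inputs the categories do sum to the total, and the failures are a little over 1% of the compared histograms.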

@fwyzard
Contributor

fwyzard commented Aug 8, 2023

nice 👍🏻

@smuzaffar
Contributor Author

Enable gpu

@smuzaffar
Contributor Author

please test for el8_amd64_gcc12

@cmsbuild
Contributor

cmsbuild commented Aug 8, 2023

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-8e0abc/34151/summary.html
COMMIT: 54b7f60
CMSSW: CMSSW_13_3_X_2023-08-07-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/8643/34151/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-8e0abc/34151/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-8e0abc/34151/git-merge-result

Comparison Summary

Summary:

  • You potentially added 5794 lines to the logs
  • Reco comparison results: 15161 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3150947
  • DQMHistoTests: Total failures: 36079
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3114845
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 207 log files, 159 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

@smuzaffar
Contributor Author

@fwyzard, CUDA 12.2.1 looks good, but all GPU tests were skipped with message [a]. Any idea?

[a]

Failed to initialise the CUDA runtime, the test will be skipped.

@smuzaffar
Contributor Author

The GPU tests were run on an HTCondor GPU batch node with the GPU shown below. Note that the same type of GPU node is used by the GPU IBs to run unit tests, and there all tests except one run.

+ nvidia-smi
Tue Aug  8 13:22:02 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB           Off| 00000000:00:06.0 Off |                    0 |
| N/A   41C    P0               41W / 250W|    146MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     13141      G   /usr/bin/X                                   33MiB |
|    0   N/A  N/A     13215      G   /usr/bin/gnome-shell                         19MiB |
+---------------------------------------------------------------------------------------+
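
The driver and CUDA versions that matter for the rest of this discussion are in the header line of the `nvidia-smi` output. A minimal Python sketch for pulling them out of that line (the regular expression and function name are illustrative; the "CUDA Version" field is, as far as I know, the highest CUDA version the installed driver supports, not the toolkit in use):

```python
import re

# Matches the "Driver Version: X    CUDA Version: Y" part of the header line.
HEADER_RE = re.compile(
    r"Driver Version:\s*(?P<driver>[\d.]+)\s+CUDA Version:\s*(?P<cuda>[\d.]+)"
)


def parse_nvidia_smi_header(text):
    """Return (driver_version, cuda_version) as strings, or None if not found."""
    match = HEADER_RE.search(text)
    if match is None:
        return None
    return match.group("driver"), match.group("cuda")


# Header line copied from the output above.
header = "| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |"
print(parse_nvidia_smi_header(header))
```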

@fwyzard
Contributor

fwyzard commented Aug 9, 2023

Can you run cudaComputeCapabilities in a CMSSW environment on one of the nodes where the tests are skipped?

@smuzaffar
Contributor Author

smuzaffar commented Aug 9, 2023

@fwyzard, on lxplus-gpu (which has GPU [a]), where unit tests are also skipped, I get this:

lxplus> cmssw-el8 --nv
Singularity> cd /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/8643/34151/CMSSW_13_3_X_2023-08-07-1100/
Singularity> cmsenv
Singularity> echo $LD_LIBRARY_PATH | tr : '\n'
/cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/8643/34151/CMSSW_13_3_X_2023-08-07-1100/biglib/el8_amd64_gcc12
/cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/8643/34151/CMSSW_13_3_X_2023-08-07-1100/lib/el8_amd64_gcc12
/cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/8643/34151/CMSSW_13_3_X_2023-08-07-1100/external/el8_amd64_gcc12/lib
/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_13_3_X_2023-08-07-1100/biglib/el8_amd64_gcc12
/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_13_3_X_2023-08-07-1100/lib/el8_amd64_gcc12
/cvmfs/cms-ci.cern.ch/week1/PR_3f407229/el8_amd64_gcc12/external/llvm/16.0.3-6ebd5c85e6445ff7c541390dd84225aa/lib64
/cvmfs/cms-ci.cern.ch/week1/PR_3f407229/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/lib64
/cvmfs/cms-ci.cern.ch/week1/PR_3f407229/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/lib
/cvmfs/cms-ci.cern.ch/week1/PR_3f407229/el8_amd64_gcc12/external/cuda/12.2.1-bdf3fff69eaec65abe18a7569592cab6/lib64/stubs
/.singularity.d/libs
Singularity> cudaComputeCapabilities                       
cudaComputeCapabilities: CUDA driver is a stub library
Singularity> $CMSSW_BASE/test/el8_amd64_gcc12/test_calo_rechit 
Failed to initialise the CUDA runtime, the test will be skipped.
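
The listing above contains the CUDA `lib64/stubs` directory just before `/.singularity.d/libs`, and `cudaComputeCapabilities` reports a stub driver. A small Python sketch for spotting such entries in an `LD_LIBRARY_PATH`-style string (helper names are hypothetical; whether the stub entry is the actual cause here is not established by this output alone):

```python
import posixpath


def split_search_path(value):
    """Split an LD_LIBRARY_PATH-style string, dropping empty entries."""
    return [entry for entry in value.split(":") if entry]


def stub_entries(value):
    """Return (position, directory) for entries whose last component is 'stubs'."""
    return [
        (index, entry)
        for index, entry in enumerate(split_search_path(value))
        if posixpath.basename(entry.rstrip("/")) == "stubs"
    ]


# The last two entries of the listing above, copied verbatim.
cuda_stubs = (
    "/cvmfs/cms-ci.cern.ch/week1/PR_3f407229/el8_amd64_gcc12/external/cuda/"
    "12.2.1-bdf3fff69eaec65abe18a7569592cab6/lib64/stubs"
)
example = cuda_stubs + ":/.singularity.d/libs"

for index, entry in stub_entries(example):
    print(index, entry)
```

In a live session the same function could be applied to `os.environ.get("LD_LIBRARY_PATH", "")`.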

[a]

Singularity> nvidia-smi
Wed Aug  9 09:56:58 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:07:00.0 Off |                    0 |
| N/A   50C    P0    28W /  70W |      2MiB / 15360MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@fwyzard
Contributor

fwyzard commented Aug 9, 2023

Looks like the NVIDIA drivers and CUDA version on the machine are not compatible with CUDA 12.2.

If I try running the binary (without cmsenv), I get

Singularity> ./cudaComputeCapabilities
./cudaComputeCapabilities: error while loading shared libraries: libcudart.so.12: cannot open shared object file: No such file or directory

If I add the CUDA libraries to the path, I get

Singularity> LD_LIBRARY_PATH=/cvmfs/cms-ci.cern.ch/week1/PR_3f407229/el8_amd64_gcc12/external/cuda/12.2.1-bdf3fff69eaec65abe18a7569592cab6/lib64:$LD_LIBRARY_PATH ./cudaComputeCapabilities
cudaComputeCapabilities: CUDA driver version is insufficient for CUDA runtime version

If I try adding also the compatibility drivers, I get

Singularity> LD_LIBRARY_PATH=/cvmfs/cms-ci.cern.ch/week1/PR_3f407229/el8_amd64_gcc12/external/cuda/12.2.1-bdf3fff69eaec65abe18a7569592cab6/drivers:/cvmfs/cms-ci.cern.ch/week1/PR_3f407229/el8_amd64_gcc12/external/cuda/12.2.1-bdf3fff69eaec65abe18a7569592cab6/lib64:$LD_LIBRARY_PATH ./cudaComputeCapabilities
cudaComputeCapabilities: system has unsupported display driver / cuda driver combination
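
The three CUDA messages seen so far in this thread each point at a different failure mode. A minimal Python sketch that classifies a line of output (the message strings are copied from this thread; the enum names are my recollection of the CUDA runtime API error codes and should be checked against the CUDA documentation; the function name is hypothetical):

```python
# Message text (as printed in this thread) -> CUDA runtime error name.
# The enum names are an assumption to be verified against the CUDA docs.
KNOWN_MESSAGES = {
    "CUDA driver is a stub library": "cudaErrorStubLibrary",
    "CUDA driver version is insufficient for CUDA runtime version": "cudaErrorInsufficientDriver",
    "system has unsupported display driver / cuda driver combination": "cudaErrorSystemDriverMismatch",
}


def classify(output_line):
    """Return the matching error name, or None if the line is not recognised."""
    for message, name in KNOWN_MESSAGES.items():
        if message in output_line:
            return name
    return None


print(classify("cudaComputeCapabilities: CUDA driver is a stub library"))
```

The `libcudart.so.12: cannot open shared object file` line is deliberately not in the table: it comes from the dynamic loader before the CUDA runtime is loaded at all.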

@fwyzard
Contributor

fwyzard commented Aug 9, 2023

Indeed, according to https://docs.nvidia.com/deploy/cuda-compatibility/ this combination is not supported.

The machine has driver version 520.61.05.

To use the updated CUDA runtime directly, we need at least driver version 525.60.13:
image

To use the forward compatibility drivers, we need either an older driver (up to 470.57.02+ from CUDA 11.4) or a newer one (525.60.04+ from CUDA 12.0):
image
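
The version thresholds quoted above can be written as a small check. This is only a sketch of the comparison described in this comment, not NVIDIA's full compatibility matrix: it assumes that "470.57.02+" means a driver within the 470 branch and that "525.60.04+" is open-ended, and the function names are hypothetical.

```python
def parse_version(text):
    """Turn a dotted driver version such as '520.61.05' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Thresholds quoted in the comment above.
MIN_DIRECT = parse_version("525.60.13")
COMPAT_470_MIN = parse_version("470.57.02")
COMPAT_NEW_MIN = parse_version("525.60.04")


def runs_directly(driver):
    """True if the driver is new enough to use the CUDA 12.2 runtime directly."""
    return parse_version(driver) >= MIN_DIRECT


def runs_with_forward_compat(driver):
    """True if the forward compatibility drivers are expected to work.

    Assumption: the older option only covers the 470 branch.
    """
    version = parse_version(driver)
    in_470_branch = version[0] == 470 and version >= COMPAT_470_MIN
    return in_470_branch or version >= COMPAT_NEW_MIN


driver = "520.61.05"  # driver reported by nvidia-smi on lxplus-gpu
print(driver, runs_directly(driver), runs_with_forward_compat(driver))
```

For the lxplus-gpu driver both checks come out false, which matches the two error messages seen above.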

@smuzaffar
Contributor Author

Since the gcc12 branch is already using CUDA 12.0.1, I suggest merging this for gcc12. Any objections, @fwyzard?

@smuzaffar
Contributor Author

please test for el8_aarch64_gcc12

@smuzaffar
Contributor Author

please test for el8_ppc64le_gcc12

@fwyzard
Contributor

fwyzard commented Aug 9, 2023

Since the gcc12 branch is already using CUDA 12.0.1, I suggest merging this for gcc12. Any objections, @fwyzard?

👍🏻

@cmsbuild
Contributor

cmsbuild commented Aug 9, 2023

-1

Failed Tests: UnitTests RelVals AddOn
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-8e0abc/34188/summary.html
COMMIT: 54b7f60
CMSSW: CMSSW_13_3_X_2023-08-03-2300/el8_ppc64le_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/8643/34188/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-8e0abc/34188/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-8e0abc/34188/git-merge-result

Unit Tests

I found errors in the following unit tests:

---> test testCSCTriggerMapping had ERRORS
---> test testFWCoreConcurrency had ERRORS

RelVals

  • 11634.0: 11634.0_TTbar_14TeV+2021/step1_TTbar_14TeV+2021.log
  • 11634.911: 11634.911_TTbar_14TeV+2021_DD4hep/step1_TTbar_14TeV+2021_DD4hep.log
  • 11834.0: 11834.0_TTbar_14TeV+2021PU/step1_TTbar_14TeV+2021PU.log
Expand to see more relval errors ...

AddOn Tests

[hlt_mc_GRun:1] cmsDriver.py TTbar_13TeV_TuneCUETP8M1_cfi -s GEN,SIM,DIGI,L1,DIGI2RAW --mc --scenario=pp -n 10 --conditions auto:run3_mc_GRun --relval 9000,50 --datatier "GEN-SIM-RAW" --eventcontent RAWSIM --customise=HLTrigger/Configuration/CustomConfigs.L1T --era Run3_2023 --fileout file:RelVal_Raw_GRun_MC.root : FAILED - elapsed time: 92 sec (ended on Wed Aug  9 23:44:17 2023) - exit: 34304
----- Begin Fatal Exception 09-Aug-2023 23:45:50 CEST-----------------------
An exception of category 'FileOpenError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing input source of type PoolSource
   [2] Calling RootInputFileSequence::initTheFile()
   [3] Calling StorageFactory::open()
   [4] Calling File::sysopen()
Exception Message:
Failed to open the file 'RelVal_Raw_GRun_MC.root'
   Additional Info:
      [a] Input file file:RelVal_Raw_GRun_MC.root could not be opened.
      [b] open() failed with system error 'No such file or directory' (error code 2)
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 09-Aug-2023 23:49:33 CEST-----------------------
An exception of category 'FileOpenError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing input source of type PoolSource
   [2] Calling RootInputFileSequence::initTheFile()
   [3] Calling StorageFactory::open()
   [4] Calling File::sysopen()
Exception Message:
Failed to open the file 'RelVal_Raw_GRun_MC.root'
   Additional Info:
      [a] Input file file:RelVal_Raw_GRun_MC.root could not be opened.
      [b] open() failed with system error 'No such file or directory' (error code 2)
----- End Fatal Exception -------------------------------------------------
Expand to see more addon errors ...
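
The `exit: 34304` value in the cmsDriver.py line is larger than 255, so it is probably not a plain exit code. Assuming it is a raw POSIX wait status, such as the value returned by `os.system` (an assumption; the log does not say how the test harness records it), the exit code sits in the high byte. A minimal Python sketch with a hypothetical helper name:

```python
def decode_wait_status(status):
    """Split a 16-bit POSIX wait status into (exit_code, terminating_signal).

    Assumes the usual Linux encoding: exit code in bits 8-15,
    terminating signal number in bits 0-6.
    """
    exit_code = (status >> 8) & 0xFF
    signal_number = status & 0x7F
    return exit_code, signal_number


exit_code, signal_number = decode_wait_status(34304)
print(exit_code, signal_number)
```

Under that assumption the exit code is 134. By shell convention an exit code of 128 + N reports termination by signal N, which here would be signal 6 (SIGABRT on Linux). That would be consistent with the first step aborting before `RelVal_Raw_GRun_MC.root` was written, so that the following steps fail with `FileOpenError`.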

@smuzaffar
Contributor Author

please test for el8_ppc64le_gcc12

@cmsbuild
Contributor

-1

Failed Tests: UnitTests RelVals AddOn
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-8e0abc/34204/summary.html
COMMIT: 54b7f60
CMSSW: CMSSW_13_3_X_2023-08-03-2300/el8_ppc64le_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/8643/34204/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-8e0abc/34204/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-8e0abc/34204/git-merge-result

Unit Tests

I found errors in the following unit tests:

---> test testCSCTriggerMapping had ERRORS

RelVals

  • 24900.0: 24900.0_CloseByPGun_CE_H_Coarse_Scint+2026D98/step2_CloseByPGun_CE_H_Coarse_Scint+2026D98.log
  • 24896.0: 24896.0_CloseByPGun_CE_E_Front_120um+2026D98/step2_CloseByPGun_CE_E_Front_120um+2026D98.log
  • 23234.0: 23234.0_TTbar_14TeV+2026D94/step2_TTbar_14TeV+2026D94.log
Expand to see more relval errors ...

AddOn Tests

[hlt_mc_GRun:1] cmsDriver.py TTbar_13TeV_TuneCUETP8M1_cfi -s GEN,SIM,DIGI,L1,DIGI2RAW --mc --scenario=pp -n 10 --conditions auto:run3_mc_GRun --relval 9000,50 --datatier "GEN-SIM-RAW" --eventcontent RAWSIM --customise=HLTrigger/Configuration/CustomConfigs.L1T --era Run3_2023 --fileout file:RelVal_Raw_GRun_MC.root : FAILED - elapsed time: 45 sec (ended on Thu Aug 10 10:47:07 2023) - exit: 34304
----- Begin Fatal Exception 10-Aug-2023 10:48:12 CEST-----------------------
An exception of category 'FileOpenError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing input source of type PoolSource
   [2] Calling RootInputFileSequence::initTheFile()
   [3] Calling StorageFactory::open()
   [4] Calling File::sysopen()
Exception Message:
Failed to open the file 'RelVal_Raw_GRun_MC.root'
   Additional Info:
      [a] Input file file:RelVal_Raw_GRun_MC.root could not be opened.
      [b] open() failed with system error 'No such file or directory' (error code 2)
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 10-Aug-2023 10:51:32 CEST-----------------------
An exception of category 'FileOpenError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing input source of type PoolSource
   [2] Calling RootInputFileSequence::initTheFile()
   [3] Calling StorageFactory::open()
   [4] Calling File::sysopen()
Exception Message:
Failed to open the file 'RelVal_Raw_GRun_MC.root'
   Additional Info:
      [a] Input file file:RelVal_Raw_GRun_MC.root could not be opened.
      [b] open() failed with system error 'No such file or directory' (error code 2)
----- End Fatal Exception -------------------------------------------------
Expand to see more addon errors ...

@smuzaffar smuzaffar merged commit 3050e37 into IB/CMSSW_13_3_X/g12 Aug 10, 2023
@smuzaffar smuzaffar deleted the smuzaffar-patch-2 branch September 8, 2023 09:46