fix check on shrEnergyM0TotalAccum in HCAL-Alpaka kernel #45452
Conversation
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-45452/40907
A new Pull Request was created by @missirol for master. It involves the following packages:
@cmsbuild, @jfernan2, @mandrenguyen can you please review it and eventually sign? Thanks. cms-bot commands are listed here
assign heterogeneous
Force-pushed from b527d4c to 892a926
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-45452/40909
please test
+heterogeneous
Force-pushed from ea1a1bd to 34905bd
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-45452/40921
+1
+1 Size: This PR adds an extra 28KB to repository
Comparison Summary:
+heterogeneous
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @rappoccio, @sextonkennedy, @antoniovilela, @mandrenguyen (and backports should be raised in the release meeting by the corresponding L2)
+1
PR description:
This PR fixes how the values in the shared-memory array
shrEnergyM0TotalAccum
are used in one of the HCAL-Alpaka kernels. The per-channel value of
shrEnergyM0TotalAccum[lch]
is filled by summing values across "samples", and this is done concurrently on GPU. The issue is that the same value is accessed before the threads are synchronised. In the examples tested offline, this tends to return almost always the correct results on GPU, but incorrect ones on CPU (or rather, with the "serial_sync" backend), since in the latter case the samples are processed one at a time. This change and the related validation (see below) were discussed offline last week with @fwyzard and @kakwok.
PR validation:
We found one event where the list of HBHE RecHits produced with the "serial_sync" backend did not match the one of the "cuda_async" backend. A reproducer is in [0]. Adding the printouts in [1], one can see that for that event the value of
shrEnergyM0TotalAccum[lch]
differs on Alpaka-on-CPU [2] compared to Alpaka-on-GPU [3] (see "MAHI-05" in the printout). The fix in this PR leads to agreement for the HBHE RecHits in the 4 cases (legacy, CUDA, Alpaka-on-CPU, Alpaka-on-GPU). The list of RecHits was also tested explicitly on a few more events, and no differences were found. @fwyzard also compared the HLT trigger results for O(100k) HLTPhysics events including this change, and saw that the discrepancies between Alpaka-on-CPU and Alpaka-on-GPU are reduced, as expected (the remaining discrepancies are likely the usual ones, and unrelated to HCAL).
These checks were done on top of
CMSSW_14_0_11_MULTIARCHS
. More details are also available in CMSHLT-3283.
[0] Reproducer (CMSSW_14_0_11_MULTIARCHS).
https://raw.githubusercontent.com/missirol/hltScripts/ee345c0d12fa9a6004e48ec828cbea1036dd6c50/hltTests/test_hcalAlpaka_debugEvent.sh
[1] Patch I used, on top of CMSSW_14_0_11_MULTIARCHS, just to add some printouts.
https://gist.github.com/missirol/3e37c1c0f0798ed3fa4728b47b42071b
[2] Output of [0]+[1] for Alpaka-on-CPU.
[3] Output of [0]+[1] for Alpaka-on-GPU.
If this PR is a backport, please specify the original PR and why you need to backport that PR. If this PR will be backported, please specify to which release cycle the backport is meant for:
CMSSW_14_0_X
Fix relevant to 2024 data-taking.