fix check on `shrEnergyM0TotalAccum` in HCAL-Alpaka kernel #45452

missirol · 2024-07-15T09:41:08Z

PR description:

This PR fixes how the values in the array shrEnergyM0TotalAccum in shared memory are used in one of the HCAL-Alpaka kernels.

The per-channel value of shrEnergyM0TotalAccum[lch] is filled summing values across "samples", and this is done concurrently on GPU. The issue is that the same value is accessed before threads are synchronised. In the examples tested offline, this tends to return almost-always the correct results on GPU, but incorrect ones on CPU (or rather, "serial_sync" backend) as the samples in the latter case are processed one at the time.

This change and related validation (see below) were discussed offline last week with @fwyzard and @kakwok.

PR validation:

We found one event where the list of HBHE RecHits produced with the "serial_sync" backend did not match the one of the "cuda_async" backend. A reproducer is in [0]. Adding the printouts in [1], one can see that the value of shrEnergyM0TotalAccum[lch] is different on Alpaka-on-CPU [2] compared to Alpaka-on-GPU [3] (see "MAHI-05" in the printout). For that event,

1 hit (id 1164465195) is present in legacy, CUDA, and Alpaka-on-GPU, but it is missing in Alpaka-on-CPU, and
2 other hits (ids 1162890307 and 1164463148) have different energy values in legacy vs Alpaka-on-CPU (in legacy they have energy==0, while in Alpaka-on-CPU they have energy > 0).

The fix in this PR leads to agreement for the HBHE-RecHits in the 4 cases (legacy, CUDA, Alpaka-on-CPU, Alpaka-on-GPU). The list of RecHits was also tested explicitly on a few more events, and no differences were found. @fwyzard also compared the trigger results of HLT for O(100k) HLTPhysics including this change, and saw that discrepancies between Alpaka-on-CPU and Alpaka-on-GPU are reduced, as expected (remaining discrepancies are likely the usual ones, and unrelated to HCAL).

These checks were done on top of CMSSW_14_0_11_MULTIARCHS.

More details are also available in CMSHLT-3283.

[0] Reproducer (CMSSW_14_0_11_MULTIARCHS).
https://raw.githubusercontent.com/missirol/hltScripts/ee345c0d12fa9a6004e48ec828cbea1036dd6c50/hltTests/test_hcalAlpaka_debugEvent.sh

[1] Patch I used, on top of CMSSW_14_0_11_MULTIARCHS, just to add some printouts.
https://gist.github.com/missirol/3e37c1c0f0798ed3fa4728b47b42071b

[2] Output of [0]+[1] for Alpaka-on-CPU.

Begin processing the 1st record. Run 382250, Event 216051612, LumiSection 194 on stream 0 at 07-Jul-2024 23:41:38.180 CEST
YYY id = 1164465195 sample = 0 gch = 3059 hashedId = 12035 adc = 4 capid = 1
YYY   charge = 85.437630 rawCharge = 85.437119 dfc = 18.986141 pedestal = 108.625000
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = -0.018619 energym0_per_ts_gain0 = -0.018619 shrEnergyM0TotalAccum[lch] = -0.018619
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY id = 1164465195 sample = 1 gch = 3059 hashedId = 12035 adc = 4 capid = 2
YYY   charge = 85.681648 rawCharge = 85.680008 dfc = 19.040367 pedestal = 116.514000
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = -0.024758 energym0_per_ts_gain0 = -0.024758 shrEnergyM0TotalAccum[lch] = -0.043377
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY id = 1164465195 sample = 2 gch = 3059 hashedId = 12035 adc = 4 capid = 3
YYY   charge = 85.681648 rawCharge = 85.681496 dfc = 19.040367 pedestal = 104.394997
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = -0.015026 energym0_per_ts_gain0 = -0.015026 shrEnergyM0TotalAccum[lch] = -0.058403
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY id = 1164465195 sample = 3 gch = 3059 hashedId = 12035 adc = 8 capid = 0
YYY   charge = 161.443497 rawCharge = 161.452850 dfc = 18.993355 pedestal = 109.536003
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = 0.041686 energym0_per_ts_gain0 = 0.041686 shrEnergyM0TotalAccum[lch] = -0.016716
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
MAHI-05 | detId=0 chi2=-9999.000000 shrEnergyM0TotalAccum=-0.016716 energym0_per_ts_gain0=0.041686 ts4Thresh=0.000000
YYY id = 1164465195 sample = 4 gch = 3059 hashedId = 12035 adc = 8 capid = 1
YYY   charge = 161.382202 rawCharge = 161.391434 dfc = 18.986137 pedestal = 108.625000
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = 0.042369 energym0_per_ts_gain0 = 0.042369 shrEnergyM0TotalAccum[lch] = 0.025652
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY id = 1164465195 sample = 5 gch = 3059 hashedId = 12035 adc = 7 capid = 2
YYY   charge = 142.802750 rawCharge = 142.808441 dfc = 19.040367 pedestal = 116.514000
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = 0.021113 energym0_per_ts_gain0 = 0.021113 shrEnergyM0TotalAccum[lch] = 0.046765
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY id = 1164465195 sample = 6 gch = 3059 hashedId = 12035 adc = 5 capid = 3
YYY   charge = 104.722015 rawCharge = 104.724304 dfc = 19.040363 pedestal = 104.394997
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = 0.000264 energym0_per_ts_gain0 = 0.000264 shrEnergyM0TotalAccum[lch] = 0.047030
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY id = 1164465195 sample = 7 gch = 3059 hashedId = 12035 adc = 4 capid = 0
YYY   charge = 85.470085 rawCharge = 85.469696 dfc = 18.993351 pedestal = 109.536003
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = -0.019324 energym0_per_ts_gain0 = -0.019324 shrEnergyM0TotalAccum[lch] = 0.027706
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
MAHI-01 | detId=1164465195 chi2=-9999.000000
XXX1 1164465195

[3] Output of [0]+[1] for Alpaka-on-GPU.

Begin processing the 1st record. Run 382250, Event 216051612, LumiSection 194 on stream 0 at 07-Jul-2024 23:42:09.899 CEST
YYY id = 1164465195 sample = 0 gch = 3059 hashedId = 12035 adc = 4 capid = 1
YYY   charge = 85.437630 rawCharge = 85.437119 dfc = 18.986141 pedestal = 108.625000
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = -0.018619 energym0_per_ts_gain0 = -0.018619 shrEnergyM0TotalAccum[lch] = 0.027706
YYY id = 1164465195 sample = 1 gch = 3059 hashedId = 12035 adc = 4 capid = 2
YYY   charge = 85.681648 rawCharge = 85.680008 dfc = 19.040367 pedestal = 116.514000
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = -0.024758 energym0_per_ts_gain0 = -0.024758 shrEnergyM0TotalAccum[lch] = 0.027706
YYY id = 1164465195 sample = 2 gch = 3059 hashedId = 12035 adc = 4 capid = 3
YYY   charge = 85.681648 rawCharge = 85.681496 dfc = 19.040367 pedestal = 104.394997
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = -0.015026 energym0_per_ts_gain0 = -0.015026 shrEnergyM0TotalAccum[lch] = 0.027706
YYY id = 1164465195 sample = 3 gch = 3059 hashedId = 12035 adc = 8 capid = 0
YYY   charge = 161.443497 rawCharge = 161.452850 dfc = 18.993355 pedestal = 109.536003
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = 0.041686 energym0_per_ts_gain0 = 0.041686 shrEnergyM0TotalAccum[lch] = 0.027706
YYY id = 1164465195 sample = 4 gch = 3059 hashedId = 12035 adc = 8 capid = 1
YYY   charge = 161.382202 rawCharge = 161.391434 dfc = 18.986137 pedestal = 108.625000
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = 0.042369 energym0_per_ts_gain0 = 0.042369 shrEnergyM0TotalAccum[lch] = 0.027706
YYY id = 1164465195 sample = 5 gch = 3059 hashedId = 12035 adc = 7 capid = 2
YYY   charge = 142.802750 rawCharge = 142.808441 dfc = 19.040367 pedestal = 116.514000
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = 0.021113 energym0_per_ts_gain0 = 0.021113 shrEnergyM0TotalAccum[lch] = 0.027706
YYY id = 1164465195 sample = 6 gch = 3059 hashedId = 12035 adc = 5 capid = 3
YYY   charge = 104.722015 rawCharge = 104.724304 dfc = 19.040363 pedestal = 104.394997
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = 0.000264 energym0_per_ts_gain0 = 0.000264 shrEnergyM0TotalAccum[lch] = 0.027706
YYY id = 1164465195 sample = 7 gch = 3059 hashedId = 12035 adc = 4 capid = 0
YYY   charge = 85.470085 rawCharge = 85.469696 dfc = 18.993351 pedestal = 109.536003
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = -0.019324 energym0_per_ts_gain0 = -0.019324 shrEnergyM0TotalAccum[lch] = 0.027706
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
MAHI-05 | detId=0 chi2=0.000000 shrEnergyM0TotalAccum=0.027706 energym0_per_ts_gain0=0.041686 ts4Thresh=0.000000
MAHI-01 | detId=1164465195 chi2=0.000000
MAHI-02 | detId=1164465195 chi2=0.000000
XXX1 1164465195
XXX2 1164465195

If this PR is a backport, please specify the original PR and why you need to backport that PR. If this PR will be backported, please specify to which release cycle the backport is meant for:

CMSSW_14_0_X

Fix relevant to 2024 data-taking.

cmsbuild · 2024-07-15T09:41:33Z

cms-bot internal usage

cmsbuild · 2024-07-15T09:47:46Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-45452/40907

cmsbuild · 2024-07-15T09:48:05Z

A new Pull Request was created by @missirol for master.

It involves the following packages:

RecoLocalCalo/HcalRecProducers (reconstruction)

@cmsbuild, @jfernan2, @mandrenguyen can you please review it and eventually sign? Thanks.
@abdoulline, @apsallid, @bsunanda, @mariadalfonso, @youyingli this is something you requested to watch as well.
@antoniovilela, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

RecoLocalCalo/HcalRecProducers/plugins/alpaka/Mahi.dev.cc

jfernan2 · 2024-07-15T10:01:38Z

assign heterogeneous

cmsbuild · 2024-07-15T10:01:57Z

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild · 2024-07-15T10:14:23Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-45452/40909

cmsbuild · 2024-07-15T10:14:39Z

Pull request #45452 was updated. @cmsbuild, @fwyzard, @jfernan2, @makortel, @mandrenguyen can you please check and sign again.

fwyzard · 2024-07-15T10:18:56Z

please test

fwyzard · 2024-07-15T10:19:00Z

+heterogeneous

cmsbuild · 2024-07-15T14:05:08Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-45452/40921

cmsbuild · 2024-07-15T14:05:31Z

Pull request #45452 was updated. @cmsbuild, @fwyzard, @jfernan2, @makortel, @mandrenguyen can you please check and sign again.

jfernan2 · 2024-07-15T14:44:44Z

+1

cmsbuild · 2024-07-15T18:04:37Z

+1

Size: This PR adds an extra 28KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bd9a63/40407/summary.html
COMMIT: 34905bd
CMSSW: CMSSW_14_1_X_2024-07-15-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/45452/40407/install.sh to create a dev area with all the needed externals and cmssw changes.

DAS Queries: The DAS query tests failed, see the summary page for details.

Comparison Summary

Summary:

You potentially removed 7 lines from the logs
Reco comparison results: 5 differences found in the comparisons
DQMHistoTests: Total files compared: 48
DQMHistoTests: Total histograms compared: 3345094
DQMHistoTests: Total failures: 6
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3345068
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
Checked 202 log files, 165 edm output root files, 48 DQM output files
TriggerResults: no differences found

fwyzard · 2024-07-16T07:30:41Z

+heterogeneous

cmsbuild · 2024-07-16T07:31:00Z

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @rappoccio, @sextonkennedy, @antoniovilela, @mandrenguyen (and backports should be raised in the release meeting by the corresponding L2)

rappoccio · 2024-07-16T14:47:31Z

+1

cmsbuild added this to the CMSSW_14_1_X milestone Jul 15, 2024

cmsbuild added reconstruction-pending pending-signatures tests-pending orp-pending code-checks-pending labels Jul 15, 2024

missirol mentioned this pull request Jul 15, 2024

fix check on shrEnergyM0TotalAccum in HCAL-Alpaka kernel [14_0_X] #45453

Merged

cmsbuild added code-checks-approved and removed code-checks-pending labels Jul 15, 2024

missirol commented Jul 15, 2024

View reviewed changes

RecoLocalCalo/HcalRecProducers/plugins/alpaka/Mahi.dev.cc Outdated Show resolved Hide resolved

fwyzard reviewed Jul 15, 2024

View reviewed changes

RecoLocalCalo/HcalRecProducers/plugins/alpaka/Mahi.dev.cc Outdated Show resolved Hide resolved

cmsbuild added the heterogeneous-pending label Jul 15, 2024

missirol force-pushed the devel_hcalAlpaka_fixShrEnergyM0TotalAccum_141X branch from b527d4c to 892a926 Compare July 15, 2024 10:10

cmsbuild added code-checks-pending and removed code-checks-approved labels Jul 15, 2024

cmsbuild added code-checks-approved and removed code-checks-pending labels Jul 15, 2024

cmsbuild added tests-started heterogeneous-approved and removed tests-pending heterogeneous-pending labels Jul 15, 2024

missirol force-pushed the devel_hcalAlpaka_fixShrEnergyM0TotalAccum_141X branch from ea1a1bd to 34905bd Compare July 15, 2024 14:03

cmsbuild added tests-pending code-checks-pending heterogeneous-pending and removed tests-started code-checks-approved heterogeneous-approved labels Jul 15, 2024

cmsbuild added code-checks-approved and removed code-checks-pending labels Jul 15, 2024

cmsbuild added reconstruction-approved tests-started and removed reconstruction-pending tests-pending labels Jul 15, 2024

cmsbuild added tests-approved and removed tests-started labels Jul 15, 2024

cmsbuild added fully-signed heterogeneous-approved and removed pending-signatures heterogeneous-pending labels Jul 16, 2024

cmsbuild added orp-approved and removed orp-pending labels Jul 16, 2024

cmsbuild merged commit 92286ac into cms-sw:master Jul 16, 2024
11 checks passed

makortel mentioned this pull request Jul 19, 2024

Added TestSourceProcessor helper #45443

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix check on `shrEnergyM0TotalAccum` in HCAL-Alpaka kernel #45452

fix check on `shrEnergyM0TotalAccum` in HCAL-Alpaka kernel #45452

missirol commented Jul 15, 2024 •

edited

Loading

cmsbuild commented Jul 15, 2024 •

edited

Loading

cmsbuild commented Jul 15, 2024

cmsbuild commented Jul 15, 2024

jfernan2 commented Jul 15, 2024

cmsbuild commented Jul 15, 2024

cmsbuild commented Jul 15, 2024

cmsbuild commented Jul 15, 2024

fwyzard commented Jul 15, 2024

fwyzard commented Jul 15, 2024

cmsbuild commented Jul 15, 2024

cmsbuild commented Jul 15, 2024

jfernan2 commented Jul 15, 2024

cmsbuild commented Jul 15, 2024

fwyzard commented Jul 16, 2024

cmsbuild commented Jul 16, 2024

rappoccio commented Jul 16, 2024

fix check on shrEnergyM0TotalAccum in HCAL-Alpaka kernel #45452

fix check on shrEnergyM0TotalAccum in HCAL-Alpaka kernel #45452

Conversation

missirol commented Jul 15, 2024 • edited Loading

PR description:

PR validation:

If this PR is a backport, please specify the original PR and why you need to backport that PR. If this PR will be backported, please specify to which release cycle the backport is meant for:

cmsbuild commented Jul 15, 2024 • edited Loading

cmsbuild commented Jul 15, 2024

cmsbuild commented Jul 15, 2024

jfernan2 commented Jul 15, 2024

cmsbuild commented Jul 15, 2024

cmsbuild commented Jul 15, 2024

cmsbuild commented Jul 15, 2024

fwyzard commented Jul 15, 2024

fwyzard commented Jul 15, 2024

cmsbuild commented Jul 15, 2024

cmsbuild commented Jul 15, 2024

jfernan2 commented Jul 15, 2024

cmsbuild commented Jul 15, 2024

Comparison Summary

fwyzard commented Jul 16, 2024

cmsbuild commented Jul 16, 2024

rappoccio commented Jul 16, 2024

fix check on `shrEnergyM0TotalAccum` in HCAL-Alpaka kernel #45452

fix check on `shrEnergyM0TotalAccum` in HCAL-Alpaka kernel #45452

missirol commented Jul 15, 2024 •

edited

Loading

cmsbuild commented Jul 15, 2024 •

edited

Loading