Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix check on shrEnergyM0TotalAccum in HCAL-Alpaka kernel #45452

Conversation

missirol
Copy link
Contributor

@missirol missirol commented Jul 15, 2024

PR description:

This PR fixes how the values in the array shrEnergyM0TotalAccum in shared memory are used in one of the HCAL-Alpaka kernels.

The per-channel value of shrEnergyM0TotalAccum[lch] is filled summing values across "samples", and this is done concurrently on GPU. The issue is that the same value is accessed before threads are synchronised. In the examples tested offline, this tends to return almost-always the correct results on GPU, but incorrect ones on CPU (or rather, "serial_sync" backend) as the samples in the latter case are processed one at the time.

This change and related validation (see below) were discussed offline last week with @fwyzard and @kakwok.

PR validation:

We found one event where the list of HBHE RecHits produced with the "serial_sync" backend did not match the one of the "cuda_async" backend. A reproducer is in [0]. Adding the printouts in [1], one can see that the value of shrEnergyM0TotalAccum[lch] is different on Alpaka-on-CPU [2] compared to Alpaka-on-GPU [3] (see "MAHI-05" in the printout). For that event,

  • 1 hit (id 1164465195) is present in legacy, CUDA, and Alpaka-on-GPU, but it is missing in Alpaka-on-CPU, and
  • 2 other hits (ids 1162890307 and 1164463148) have different energy values in legacy vs Alpaka-on-CPU (in legacy they have energy==0, while in Alpaka-on-CPU they have energy > 0).

The fix in this PR leads to agreement for the HBHE-RecHits in the 4 cases (legacy, CUDA, Alpaka-on-CPU, Alpaka-on-GPU). The list of RecHits was also tested explicitly on a few more events, and no differences were found. @fwyzard also compared the trigger results of HLT for O(100k) HLTPhysics including this change, and saw that discrepancies between Alpaka-on-CPU and Alpaka-on-GPU are reduced, as expected (remaining discrepancies are likely the usual ones, and unrelated to HCAL).

These checks were done on top of CMSSW_14_0_11_MULTIARCHS.

More details are also available in CMSHLT-3283.

[0] Reproducer (CMSSW_14_0_11_MULTIARCHS).
https://raw.githubusercontent.com/missirol/hltScripts/ee345c0d12fa9a6004e48ec828cbea1036dd6c50/hltTests/test_hcalAlpaka_debugEvent.sh

[1] Patch I used, on top of CMSSW_14_0_11_MULTIARCHS, just to add some printouts.
https://gist.github.com/missirol/3e37c1c0f0798ed3fa4728b47b42071b

[2] Output of [0]+[1] for Alpaka-on-CPU.

Begin processing the 1st record. Run 382250, Event 216051612, LumiSection 194 on stream 0 at 07-Jul-2024 23:41:38.180 CEST
YYY id = 1164465195 sample = 0 gch = 3059 hashedId = 12035 adc = 4 capid = 1
YYY   charge = 85.437630 rawCharge = 85.437119 dfc = 18.986141 pedestal = 108.625000
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = -0.018619 energym0_per_ts_gain0 = -0.018619 shrEnergyM0TotalAccum[lch] = -0.018619
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY id = 1164465195 sample = 1 gch = 3059 hashedId = 12035 adc = 4 capid = 2
YYY   charge = 85.681648 rawCharge = 85.680008 dfc = 19.040367 pedestal = 116.514000
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = -0.024758 energym0_per_ts_gain0 = -0.024758 shrEnergyM0TotalAccum[lch] = -0.043377
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY id = 1164465195 sample = 2 gch = 3059 hashedId = 12035 adc = 4 capid = 3
YYY   charge = 85.681648 rawCharge = 85.681496 dfc = 19.040367 pedestal = 104.394997
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = -0.015026 energym0_per_ts_gain0 = -0.015026 shrEnergyM0TotalAccum[lch] = -0.058403
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY id = 1164465195 sample = 3 gch = 3059 hashedId = 12035 adc = 8 capid = 0
YYY   charge = 161.443497 rawCharge = 161.452850 dfc = 18.993355 pedestal = 109.536003
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = 0.041686 energym0_per_ts_gain0 = 0.041686 shrEnergyM0TotalAccum[lch] = -0.016716
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
MAHI-05 | detId=0 chi2=-9999.000000 shrEnergyM0TotalAccum=-0.016716 energym0_per_ts_gain0=0.041686 ts4Thresh=0.000000
YYY id = 1164465195 sample = 4 gch = 3059 hashedId = 12035 adc = 8 capid = 1
YYY   charge = 161.382202 rawCharge = 161.391434 dfc = 18.986137 pedestal = 108.625000
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = 0.042369 energym0_per_ts_gain0 = 0.042369 shrEnergyM0TotalAccum[lch] = 0.025652
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY id = 1164465195 sample = 5 gch = 3059 hashedId = 12035 adc = 7 capid = 2
YYY   charge = 142.802750 rawCharge = 142.808441 dfc = 19.040367 pedestal = 116.514000
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = 0.021113 energym0_per_ts_gain0 = 0.021113 shrEnergyM0TotalAccum[lch] = 0.046765
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY id = 1164465195 sample = 6 gch = 3059 hashedId = 12035 adc = 5 capid = 3
YYY   charge = 104.722015 rawCharge = 104.724304 dfc = 19.040363 pedestal = 104.394997
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = 0.000264 energym0_per_ts_gain0 = 0.000264 shrEnergyM0TotalAccum[lch] = 0.047030
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY id = 1164465195 sample = 7 gch = 3059 hashedId = 12035 adc = 4 capid = 0
YYY   charge = 85.470085 rawCharge = 85.469696 dfc = 18.993351 pedestal = 109.536003
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = -0.019324 energym0_per_ts_gain0 = -0.019324 shrEnergyM0TotalAccum[lch] = 0.027706
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
MAHI-01 | detId=1164465195 chi2=-9999.000000
XXX1 1164465195

[3] Output of [0]+[1] for Alpaka-on-GPU.

Begin processing the 1st record. Run 382250, Event 216051612, LumiSection 194 on stream 0 at 07-Jul-2024 23:42:09.899 CEST
YYY id = 1164465195 sample = 0 gch = 3059 hashedId = 12035 adc = 4 capid = 1
YYY   charge = 85.437630 rawCharge = 85.437119 dfc = 18.986141 pedestal = 108.625000
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = -0.018619 energym0_per_ts_gain0 = -0.018619 shrEnergyM0TotalAccum[lch] = 0.027706
YYY id = 1164465195 sample = 1 gch = 3059 hashedId = 12035 adc = 4 capid = 2
YYY   charge = 85.681648 rawCharge = 85.680008 dfc = 19.040367 pedestal = 116.514000
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = -0.024758 energym0_per_ts_gain0 = -0.024758 shrEnergyM0TotalAccum[lch] = 0.027706
YYY id = 1164465195 sample = 2 gch = 3059 hashedId = 12035 adc = 4 capid = 3
YYY   charge = 85.681648 rawCharge = 85.681496 dfc = 19.040367 pedestal = 104.394997
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = -0.015026 energym0_per_ts_gain0 = -0.015026 shrEnergyM0TotalAccum[lch] = 0.027706
YYY id = 1164465195 sample = 3 gch = 3059 hashedId = 12035 adc = 8 capid = 0
YYY   charge = 161.443497 rawCharge = 161.452850 dfc = 18.993355 pedestal = 109.536003
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = 0.041686 energym0_per_ts_gain0 = 0.041686 shrEnergyM0TotalAccum[lch] = 0.027706
YYY id = 1164465195 sample = 4 gch = 3059 hashedId = 12035 adc = 8 capid = 1
YYY   charge = 161.382202 rawCharge = 161.391434 dfc = 18.986137 pedestal = 108.625000
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = 0.042369 energym0_per_ts_gain0 = 0.042369 shrEnergyM0TotalAccum[lch] = 0.027706
YYY id = 1164465195 sample = 5 gch = 3059 hashedId = 12035 adc = 7 capid = 2
YYY   charge = 142.802750 rawCharge = 142.808441 dfc = 19.040367 pedestal = 116.514000
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = 0.021113 energym0_per_ts_gain0 = 0.021113 shrEnergyM0TotalAccum[lch] = 0.027706
YYY id = 1164465195 sample = 6 gch = 3059 hashedId = 12035 adc = 5 capid = 3
YYY   charge = 104.722015 rawCharge = 104.724304 dfc = 19.040363 pedestal = 104.394997
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = 0.000264 energym0_per_ts_gain0 = 0.000264 shrEnergyM0TotalAccum[lch] = 0.027706
YYY id = 1164465195 sample = 7 gch = 3059 hashedId = 12035 adc = 4 capid = 0
YYY   charge = 85.470085 rawCharge = 85.469696 dfc = 18.993351 pedestal = 109.536003
YYY   gain = 0.000630 gain0 = 0.000630 respCorrection = 1.275295 energym0_per_ts = -0.019324 energym0_per_ts_gain0 = -0.019324 shrEnergyM0TotalAccum[lch] = 0.027706
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
YYY startSample = 3 endSample = 5 param1 = 54308449 param2 = 134751264
MAHI-05 | detId=0 chi2=0.000000 shrEnergyM0TotalAccum=0.027706 energym0_per_ts_gain0=0.041686 ts4Thresh=0.000000
MAHI-01 | detId=1164465195 chi2=0.000000
MAHI-02 | detId=1164465195 chi2=0.000000
XXX1 1164465195
XXX2 1164465195

If this PR is a backport, please specify the original PR and why you need to backport that PR. If this PR will be backported, please specify to which release cycle the backport is meant for:

CMSSW_14_0_X

Fix relevant to 2024 data-taking.

@cmsbuild
Copy link
Contributor

cmsbuild commented Jul 15, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @missirol for master.

It involves the following packages:

  • RecoLocalCalo/HcalRecProducers (reconstruction)

@cmsbuild, @jfernan2, @mandrenguyen can you please review it and eventually sign? Thanks.
@abdoulline, @apsallid, @bsunanda, @mariadalfonso, @youyingli this is something you requested to watch as well.
@antoniovilela, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@jfernan2
Copy link
Contributor

assign heterogeneous

@cmsbuild
Copy link
Contributor

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@missirol missirol force-pushed the devel_hcalAlpaka_fixShrEnergyM0TotalAccum_141X branch from b527d4c to 892a926 Compare July 15, 2024 10:10
@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

Pull request #45452 was updated. @cmsbuild, @fwyzard, @jfernan2, @makortel, @mandrenguyen can you please check and sign again.

@fwyzard
Copy link
Contributor

fwyzard commented Jul 15, 2024

please test

@fwyzard
Copy link
Contributor

fwyzard commented Jul 15, 2024

+heterogeneous

@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

Pull request #45452 was updated. @cmsbuild, @fwyzard, @jfernan2, @makortel, @mandrenguyen can you please check and sign again.

@jfernan2
Copy link
Contributor

+1

@cmsbuild
Copy link
Contributor

+1

Size: This PR adds an extra 28KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bd9a63/40407/summary.html
COMMIT: 34905bd
CMSSW: CMSSW_14_1_X_2024-07-15-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/45452/40407/install.sh to create a dev area with all the needed externals and cmssw changes.

  • DAS Queries: The DAS query tests failed, see the summary page for details.

Comparison Summary

Summary:

  • You potentially removed 7 lines from the logs
  • Reco comparison results: 5 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3345094
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3345068
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 202 log files, 165 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

@fwyzard
Copy link
Contributor

fwyzard commented Jul 16, 2024

+heterogeneous

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @rappoccio, @sextonkennedy, @antoniovilela, @mandrenguyen (and backports should be raised in the release meeting by the corresponding L2)

@rappoccio
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit 92286ac into cms-sw:master Jul 16, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants