Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MkFit] Fix UBSAN error - skip processing of layers with no hits. #37146

Merged

Conversation

osschar
Copy link
Contributor

@osschar osschar commented Mar 4, 2022

When processing layers with no hits the LayerOfHits vector was in default state with storage pointer set to nullptr.

Loaders of hits for vector processing were still being initialized, resulting in doing pointer arithmetic with nullptr -- though they were never actually used.

This should fix #37136. Note that the wf there was 10004.0, single photon 10GeV and there were indeed several occurrences of empty mkfit hit containers.

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 4, 2022

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-37146/28688

  • This PR adds an extra 24KB to repository

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 4, 2022

A new Pull Request was created by @osschar (Matevž Tadel) for master.

It involves the following packages:

  • RecoTracker/MkFitCore (reconstruction)

@jpata, @cmsbuild, @clacaputo, @slava77 can you please review it and eventually sign? Thanks.
@makortel, @felicepantaleo, @GiacomoSguazzoni, @JanFSchulte, @rovere, @VinInn, @ebrondol, @gpetruc, @mmusich, @mtosi, @dgulhan this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@makortel
Copy link
Contributor

makortel commented Mar 4, 2022

test parameters:

  • workflow = 10004.0

@makortel
Copy link
Contributor

makortel commented Mar 4, 2022

@cmsbuild, please test

@mmusich
Copy link
Contributor

mmusich commented Mar 7, 2022

@smuzaffar this has been running tests for 3 days (!?).
Is there something stuck?

@smuzaffar
Copy link
Contributor

smuzaffar commented Mar 7, 2022

@mmusich , UBSAN IB comparison does not work. The job timedout after 16 hours ( https://cmssdt.cern.ch/jenkins/job/compare-root-files-short-matrix/48727/ ) . Check the relvals for wf 10004.0 to see if USBAN issue is fixed or not?

I have restarted the comparison job now but I am afraid it will timeout again.

@smuzaffar
Copy link
Contributor

please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 7, 2022

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cdca02/22911/summary.html
COMMIT: 81e7728
CMSSW: CMSSW_12_3_X_2022-03-07-1100/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/37146/22911/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-cdca02/10004.0_SingleGammaPt10+2017+SingleGammaPt10_pythia8_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 27 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3984447
  • DQMHistoTests: Total failures: 864
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3983561
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 204 log files, 45 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@clacaputo
Copy link
Contributor

@mmusich , UBSAN IB comparison does not work. The job timedout after 16 hours ( https://cmssdt.cern.ch/jenkins/job/compare-root-files-short-matrix/48727/ ) . Check the relvals for wf 10004.0 to see if USBAN issue is fixed or not?

I have restarted the comparison job now but I am afraid it will timeout again.

Hi @smuzaffar, where can I check wf 10004.0 relvals for UBSAN?

@smuzaffar
Copy link
Contributor

URL (i.e. https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cdca02/22856) for cms/37146/UBSAN/slc7_amd64_gcc11 context points to UBSAN results. Go there and then click on the summary.html page to see the UBSAN results

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 8, 2022

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cdca02/22856/summary.html
COMMIT: 81e7728
CMSSW: CMSSW_12_3_UBSAN_X_2022-03-04-1100/slc7_amd64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/37146/22856/install.sh to create a dev area with all the needed externals and cmssw changes.

@smuzaffar
Copy link
Contributor

ok, I have forced set the status for ubsan comparison check, but should add the comment for ubsan now

@clacaputo
Copy link
Contributor

If I check the output in 10004.0 step3 log, the runtime error is still there

/cvmfs/cms-ib.cern.ch/nweek-02722/slc7_amd64_gcc11/external/gcc/11.2.1-f478fee2760dbd22aaabb4e3a8fe1640/include/c++/11.2.1/bits/stl_vector.h:1046:34: runtime error: reference binding to null pointer of type 'value_type'
    #0 0x2abc3b477a03  (/cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/37146/22856/CMSSW_12_3_UBSAN_X_2022-03-04-1100/lib/slc7_amd64_gcc11/libRecoTrackerMkFitCore.so+0x730a03)
    #1 0x2abc3b49126b  (/cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/37146/22856/CMSSW_12_3_UBSAN_X_2022-03-04-1100/lib/slc7_amd64_gcc11/libRecoTrackerMkFitCore.so+0x74a26b)
    #2 0x2abc3ad1aefc  (/cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_UBSAN_X_2022-03-04-1100/lib/slc7_amd64_gcc11/libRecoTrackerMkFitCMS.so+0x50efc)
    #3 0x2abc3a82fd8e  (/cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_UBSAN_X_2022-03-04-1100/lib/slc7_amd64_gcc11/pluginRecoTrackerMkFitPlugins.so+0x65cd8e)

@mmusich
Copy link
Contributor

mmusich commented Mar 8, 2022

If I check the output in 10004.0 step3 log, the runtime error is still there

I observe the same locally:

cmsrel CMSSW_12_3_UBSAN_X_2022-03-04-1100
cd CMSSW_12_3_UBSAN_X_2022-03-04-1100/src/
cmsenv
git cms-merge-topic 37146
scramv1 b -j 20
runTheMatrix.py -l 10004.0 -t 4 -j 8
more 10004.0_SingleGammaPt10+2017+SingleGammaPt10_pythia8_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano/step3_SingleGammaPt10+2017+SingleGammaPt10_pythia8_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano.log | grep libRecoTrackerMkFitCore | wc -l 
2

@osschar
Copy link
Contributor Author

osschar commented Mar 8, 2022

Oh rats. How does one setup the build environment for this?
Reading gcc man page ... is USER_CXXFLAGS="-fsanitize=undefined -fsanitize=null" sufficient?

…eeds

Shortcut MkFitProducer to produce an empty track vector when input
seed vector is empty.

In MkBuilder::import_seeds() assert that the input vector is not
empty.
@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 9, 2022

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-37146/28767

  • This PR adds an extra 44KB to repository

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 9, 2022

Pull request #37146 was updated. @jpata, @cmsbuild, @clacaputo, @slava77 can you please check and sign again.

@slava77
Copy link
Contributor

slava77 commented Mar 10, 2022

@cmsbuild please test

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-cdca02/22994/summary.html
COMMIT: b548ebf
CMSSW: CMSSW_12_3_X_2022-03-09-1100/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/37146/22994/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-cdca02/10004.0_SingleGammaPt10+2017+SingleGammaPt10_pythia8_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 6 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3806002
  • DQMHistoTests: Total failures: 15
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3805964
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.004 KiB( 49 files compared)
  • DQMHistoSizes: changed ( 312.0 ): 0.004 KiB MessageLogger/Warnings
  • Checked 210 log files, 45 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

@clacaputo
Copy link
Contributor

+reconstruction

  • technical PR, fix for UBSAN
  • checked locally
cmsrel CMSSW_12_3_UBSAN_X_2022-03-04-1100
cd CMSSW_12_3_UBSAN_X_2022-03-04-1100/src/
cmsenv
git cms-merge-topic 37146
scramv1 b -j 20
runTheMatrix.py -l 10004.0 -t 4 -j 8
more 10004.0_SingleGammaPt10+2017+SingleGammaPt10_pythia8_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano/step3_SingleGammaPt10+2017+SingleGammaPt10_pythia8_GenSim+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano.log | grep libRecoTrackerMkFitCore | wc -l 
0
  • no reco changes expected or observed after b548ebf

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@smuzaffar smuzaffar modified the milestones: CMSSW_12_3_X, CMSSW_12_4_X Mar 11, 2022
@perrotta
Copy link
Contributor

+1

@slava77
Copy link
Contributor

slava77 commented Mar 15, 2022

@cms-sw/core-l2 @perrotta @qliphy
do you need a backport to 12_3_X?

@makortel
Copy link
Contributor

I suppose it would be useful (to minimize UB in 12_3).

cmsbuild added a commit that referenced this pull request Mar 19, 2022
…-1100_mkFit_X

[MkFit] Fix UBSAN error in (almost) empty events (backport of #37146)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[UBSAN] MkFinder is assigning nullptr to a reference
8 participants