Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ROOT] Unit Tests testDataFormatsScoutingRunX failing in ROOT IBs #41222

Closed
aandvalenzuela opened this issue Mar 29, 2023 · 32 comments
Closed

[ROOT] Unit Tests testDataFormatsScoutingRunX failing in ROOT IBs #41222

aandvalenzuela opened this issue Mar 29, 2023 · 32 comments

Comments

@aandvalenzuela
Copy link
Contributor

Hello,

There are two Unit Tests (testDataFormatsScoutingRun2 and testDataFormatsScoutingRun3 from module DataFormats/Scouting) failing in both ROOT6 and ROOT628 IBs.

I have reproduced the failure using CMSSW_13_1_ROOT628_X_2023-03-27-2300:

28-Mar-2023 10:02:46 CEST  Closed file root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/mc/RunIISummer20UL18RECO/DoubleElectron_Pt-1To300-gun/AODSIM/FlatPU0to70EdalIdealGT_EdalIdealGT_106X_upgrade2018_realistic_v11_L1v1_EcalIdealIC-v2/270000/4CDD9457-E14C-D84A-9BD4-3140CB6AEEB6.root
/tmp/avalenzu/CMSSW_13_1_ROOT628_X_2023-03-27-2300/src/DataFormats/Scouting/test/testDataFormatsScoutingRun2.sh: line 25: 19155 Segmentation fault      "${SCRAM_TEST_PATH}"/scoutingCollectionsDumper.py -v 1 -n 1 --skip 81 -i testDataFormatsScoutingRun2_step1.root -k Scouting &> testDataFormatsScoutingRun2_step2.log

malloc_consolidate(): invalid chunk size
 *** Break *** abort
malloc(): unsorted double linked list corrupted
 *** Break *** abort

They fail when running scoutingCollectionsDumper.py (https://github.com/cms-sw/cmssw/blob/master/DataFormats/Scouting/test/testDataFormatsScoutingRun2.sh#L23) and I would say after analyzing the first event (https://github.com/cms-sw/cmssw/blob/master/DataFormats/Scouting/test/scoutingCollectionsDumper.py#L174).

Find the full log here.

Let's follow-up the discussion started on #41189 (comment).

FYI, @Dr15Jones, @smuzaffar, @iarspider

@cmsbuild
Copy link
Contributor

A new Issue was created by @aandvalenzuela Andrea Valenzuela.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@aandvalenzuela
Copy link
Contributor Author

assign core

@cmsbuild
Copy link
Contributor

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@smuzaffar
Copy link
Contributor

gdb shows this

Events processed = 1
[Thread 0x7fffd26c1700 (LWP 30936) exited]
malloc_consolidate(): invalid chunk size

Thread 1 "python3" received signal SIGABRT, Aborted.
0x00007ffff6ad2aff in raise () from /lib64/libc.so.6
(gdb) where
#0  0x00007ffff6ad2aff in raise () from /lib64/libc.so.6
#1  0x00007ffff6aa5ea5 in abort () from /lib64/libc.so.6
#2  0x00007ffff6b15097 in __libc_message () from /lib64/libc.so.6
#3  0x00007ffff6b1c4ec in malloc_printerr () from /lib64/libc.so.6
#4  0x00007ffff6b1ceb8 in malloc_consolidate () from /lib64/libc.so.6
#5  0x00007ffff6b1f268 in _int_malloc () from /lib64/libc.so.6
#6  0x00007ffff6b20b72 in malloc () from /lib64/libc.so.6
#7  0x00007ffff5ec19ac in operator new (sz=5560) at ../../../../libstdc++-v3/libsupc++/new_op.cc:50
#8  0x00007fffef7b72e4 in cling::IncrementalParser::getAllTransactions() ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc11/cms/cmssw/CMSSW_13_1_ROOT628_X_2023-03-28-2300/external/el8_amd64_gcc11/lib/libCling.so
#9  0x00007fffef734d82 in cling::Interpreter::runAndRemoveStaticDestructors() ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc11/cms/cmssw/CMSSW_13_1_ROOT628_X_2023-03-28-2300/external/el8_amd64_gcc11/lib/libCling.so
#10 0x00007fffef507b8c in TCling::ResetGlobals (this=0x57e350)
    at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/root-6.28.01/core/metacling/src/TCling.cxx:3718
#11 0x00007fffef4ff9f7 in TCling::ShutDown (this=0x57e350)
    at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/root-6.28.01/core/metacling/src/TCling.cxx:1692
#12 0x00007ffff62de1a2 in TROOT::EndOfProcessCleanups (this=0x7ffff6655ba0 <ROOT::Internal::GetROOT1()::alloc>)
    at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/root-6.28.01/core/base/src/TROOT.cxx:1236
#13 0x00007fffc6835010 in ?? ()
#14 0x00007fffffff19e0 in ?? ()
#15 0x00007ffff66bd0fd in WrapperCall (method=0, nargs=140737188853984, args_=0x7ffff7fb1040, self=0x7fffffff1880, result=0x5f4fc0)
    at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/root-6.28.01/bindings/pyroot/cppyy/cppyy-backend/clingwrapper/src/clingwrapper.cxx:774

@smuzaffar
Copy link
Contributor

@vgvassilev , these tests are failing for root 6.28 and root master based builds but work for root 6.26. tests fail at the end when python (https://github.com/cms-sw/cmssw/blob/master/DataFormats/Scouting/test/scoutingCollectionsDumper.py) process is finished. Any idea why this is failing for root 628/master?

@vgvassilev
Copy link
Contributor

This area has been quite problematic due to the way that ROOT is being shut down. @pcanal, do you know how we can work this issue with EndOfProcessCleanups around?

@pcanal
Copy link
Contributor

pcanal commented Mar 30, 2023

The destruction order is delicate and I remember we had some issue with the tear down triggering from python. Any chance to have a standalone reproducer (so that I can start by bisecting where/when it started failing)?

@pcanal
Copy link
Contributor

pcanal commented Mar 30, 2023

Also could you run the failing example with valgrind --suppressions=$ROOTSYS/etc/valgrind-root.supp ... to get more information about this (seemingly odd because it seems the allocation is failing) problem.

@smuzaffar
Copy link
Contributor

@pcanal , you can find the valgrid output here

@davidlange6
Copy link
Contributor

davidlange6 commented Mar 31, 2023 via email

@vgvassilev
Copy link
Contributor

Also could you run the failing example with valgrind --suppressions=$ROOTSYS/etc/valgrind-root.supp ... to get more information about this (seemingly odd because it seems the allocation is failing) problem.

@pcanal probably meant valgrind --suppressions=$ROOTSYS/etc/valgrind-root-python.supp. This would avoid the false positives coming from the python interpreter.

@smuzaffar
Copy link
Contributor

valgrind --suppressions=$ROOTSYS/etc/valgrind-root-python.supp log is available here

@pcanal
Copy link
Contributor

pcanal commented Mar 31, 2023

@pcanal probably meant valgrind --suppressions=$ROOTSYS/etc/valgrind-root-python.supp. This would avoid the false positives coming from the python interpreter.

Actually you need both.

@pcanal
Copy link
Contributor

pcanal commented Mar 31, 2023

Also this seem eerily similar to cms-sw/root#136

@Axel-Naumann
Copy link

Just providing the full context for @davidlange6 's (imo correct) issue diagnosis:

==19874== Invalid free() / delete / delete[] / realloc()
==19874==    at 0x403C4CD: operator delete(void*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/external/valgrind/3.17.0-7ca83817e7379e83453f913e11e14834/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==19874==    by 0x4ECCA85E: ???
==19874==    by 0x9EDBF06: cling::IncrementalExecutor::runAndRemoveStaticDestructors(cling::Transaction*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so)
==19874==    by 0x9E65DB8: cling::Interpreter::runAndRemoveStaticDestructors() (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so)
==19874==    by 0x9C38B8B: TCling::ResetGlobals() (TCling.cxx:3718)
==19874==    by 0x8026FE9: TROOT::EndOfProcessCleanups() (TROOT.cxx:1215)
==19874==    by 0x551E600F: ???
==19874==    by 0x7D400FC: WrapperCall(long, unsigned long, void*, void*, void*) (clingwrapper.cxx:774)
==19874==    by 0x7D403EA: Cppyy::CallV(long, void*, unsigned long, void*) (clingwrapper.cxx:825)
==19874==    by 0x7C82290: GILCallV(long, void*, CPyCppyy::CallContext*) (Executors.cxx:68)
==19874==    by 0x7C846EA: CPyCppyy::(anonymous namespace)::VoidExecutor::Execute(long, void*, CPyCppyy::CallContext*) (Executors.cxx:410)
==19874==    by 0x7C6730B: CPyCppyy::CPPMethod::ExecuteFast(void*, long, CPyCppyy::CallContext*) (CPPMethod.cxx:74)
==19874==  Address 0x4c4ca1c0 is 0 bytes inside a block of size 29 free'd
==19874==    at 0x403C4CD: operator delete(void*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/external/valgrind/3.17.0-7ca83817e7379e83453f913e11e14834/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==19874==    by 0x4ECCA85E: ???
==19874==    by 0x9EDBF06: cling::IncrementalExecutor::runAndRemoveStaticDestructors(cling::Transaction*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so)
==19874==    by 0x9E65DB8: cling::Interpreter::runAndRemoveStaticDestructors() (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so)
==19874==    by 0x9C38B8B: TCling::ResetGlobals() (TCling.cxx:3718)
==19874==    by 0x8026FE9: TROOT::EndOfProcessCleanups() (TROOT.cxx:1215)
==19874==    by 0x551E600F: ???
==19874==    by 0x7D400FC: WrapperCall(long, unsigned long, void*, void*, void*) (clingwrapper.cxx:774)
==19874==    by 0x7D403EA: Cppyy::CallV(long, void*, unsigned long, void*) (clingwrapper.cxx:825)
==19874==    by 0x7C82290: GILCallV(long, void*, CPyCppyy::CallContext*) (Executors.cxx:68)
==19874==    by 0x7C846EA: CPyCppyy::(anonymous namespace)::VoidExecutor::Execute(long, void*, CPyCppyy::CallContext*) (Executors.cxx:410)
==19874==    by 0x7C6730B: CPyCppyy::CPPMethod::ExecuteFast(void*, long, CPyCppyy::CallContext*) (CPPMethod.cxx:74)
==19874==  Block was alloc'd at
==19874==    at 0x4039F01: operator new(unsigned long) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/external/valgrind/3.17.0-7ca83817e7379e83453f913e11e14834/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==19874==    by 0x4ECCAAA1: ???
==19874==    by 0x4ECCA9C1: ???
==19874==    by 0x4ECCA92F: ???
==19874==    by 0x4ECCA2B2: ???
==19874==    by 0x4ECCA90B: ???
==19874==    by 0x4ECCA91C: ???
==19874==    by 0x4ECCA90F: ???
==19874==    by 0xB445061: (anonymous namespace)::GenericLLVMIRPlatformSupport::initialize(llvm::orc::JITDylib&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so)
==19874==    by 0x9EDAEE8: cling::IncrementalExecutor::runStaticInitializersOnce(cling::Transaction&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so)
==19874==    by 0x9E664A7: cling::Interpreter::executeTransaction(cling::Transaction&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so)
==19874==    by 0x9EE97F9: cling::IncrementalParser::commitTransaction(llvm::PointerIntPair<cling::Transaction*, 2u, cling::IncrementalParser::EParseResult, llvm::PointerLikeTypeTraits<cling::Transaction*>, llvm::PointerIntPairInfo<cling::Transaction*, 2u, llvm::PointerLikeTypeTraits<cling::Transaction*> > >&, bool) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/lcg/root/6.28.01-24fc933274de02f0e1bb17b71bafaf30/lib/libCling.so)

@Dr15Jones
Copy link
Contributor

@pcanal any progress on this?

@Axel-Naumann
Copy link

Axel-Naumann commented May 17, 2023

It's a traditional double delete, here as a global static in the interpreter memory. We need to find out which object it is, and why it's registered twice as static global. @pcanal do you think you can try to reproduce this? If so please take it. If not please assign it to me and I will debug next week (I'm off now).

@pcanal
Copy link
Contributor

pcanal commented May 22, 2023

@Dr15Jones Can you remind me the reproducer?

@Dr15Jones
Copy link
Contributor

@makortel could you help Philippe to reproduce the problem?

@davidlange6
Copy link
Contributor

something like

scram p CMSSW_13_2_ROOT628_X_2023-05-22-2300
cd CMSSW_13_2_ROOT628_X_2023-05-22-2300/src
cmsenv
git cms-addpkg DataFormats/Scouting
scram b
scram b runtests

@makortel
Copy link
Contributor

Just for the record, I shared a simpler recipe to Philippe.

@makortel
Copy link
Contributor

#41964 removes the failing tests (after better tests were developed).

@pcanal Is there a need to preserve a setup to reproduce the crash?

@pcanal
Copy link
Contributor

pcanal commented Jun 15, 2023

Yes, this is not resolved.

@makortel
Copy link
Contributor

@smuzaffar Would it be feasible to have a longer-than-IB-lifetime build for ROOT628? Or do we keep the tests failing for now?

@smuzaffar
Copy link
Contributor

I would say, let the test fail for now.

@Dr15Jones
Copy link
Contributor

So I did a small amount of testing and I see the failure using the following minimum python

import ROOT

f = ROOT.TFile.Open("testDataFormatsScoutingRun2_step1.root")
e = ROOT.fwlite.Event(f)

the job fails as it is attempting to shutdown.

@hahnjo
Copy link
Contributor

hahnjo commented Sep 4, 2023

I believe this may also be fixed via #42628, can you confirm?

@makortel
Copy link
Contributor

makortel commented Sep 5, 2023

It seems so. The last failures were in CMSSW_13_3_ROOT6_X_2023-08-28-2300, and #42628 was merged in CMSSW_13_3_X_2023-08-29-1100. So far there has been 6 IBs (x2 to cover both 6.28 and master) without these failures. I also tested the simple reproducer of #41222 (comment) in CMSSW_13_3_ROOT628_X_2023-09-03-2300, and the job succeeded now.

@makortel
Copy link
Contributor

makortel commented Sep 5, 2023

+core

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 5, 2023

This issue is fully signed and ready to be closed.

@makortel
Copy link
Contributor

makortel commented Sep 5, 2023

@cmsbuild, please close

@cmsbuild cmsbuild closed this as completed Sep 5, 2023
@makortel
Copy link
Contributor

makortel commented Sep 5, 2023

I'm going to now proceed with the deletion of these unit tests (as they were superseded earlier, and were kept only to the investigation of this issue). If anyone subscribed thinks there would still be value in keeping the tests, scream soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

10 participants