Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RelVal 132.0: (PDF) Info file not found #43526

Open
iarspider opened this issue Dec 8, 2023 · 19 comments
Open

RelVal 132.0: (PDF) Info file not found #43526

iarspider opened this issue Dec 8, 2023 · 19 comments

Comments

@iarspider
Copy link
Contributor

iarspider commented Dec 8, 2023

Since CMSSW_14_0_X_2023-12-04-1100 RelVal 132.0 fails for all flavors/architectures:

----- Begin Fatal Exception 04-Dec-2023 16:45:52 CET-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 2
   [1] Running path 'simulation_step'
   [2] Calling method for module Pythia6GeneratorFilter/'generator'
Exception Message:
A std::exception was thrown.
Info file not found for PDF set ''
----- End Fatal Exception -------------------------------------------------

Indeed, the code tries to open files like /_0000.dat, /.info or even "" (empty string). This is the origin of one such call (extracted with strace -f -k`: https://gist.github.com/iarspider/fdd17a07202e6ff4bb884f3751e37cd5

This seems to be a threading issue: running this relVal with single thread doesn't generate an exception, while running with 2+ threads does.

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 8, 2023

A new Issue was created by @iarspider .

@smuzaffar, @makortel, @rappoccio, @antoniovilela, @Dr15Jones, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@iarspider
Copy link
Contributor Author

assign GeneratorInterface/Pythia6Interface

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 8, 2023

New categories assigned: generators

@alberto-sanchez,@bbilin,@GurpreetSinghChahal,@mkirsano,@menglu21,@SiewYan you have been requested to review this Pull request/Issue and eventually sign? Thanks

@Dr15Jones
Copy link
Contributor

The problem is LHAPDF uses thread locals to store its 'cached' values. In particular there is a container ACTIVESETS defined in LHAGlue.cc

 /// Collection of active sets
  static thread_local map<int, PDFSetHandler> ACTIVESETS;

In our code, this gets initialized at begin LuminosityBlock by the call to pdfset_ via the call stack

#0  0x00007fffc09a6b20 in pdfset_ () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/external/el8_amd64_gcc12/lib/libLHAPDF.so
#1  0x00007fffc288b4a6 in pyinit (frame=..., beam=..., target=..., win=8000, _frame=3, _beam=1, _target=1) at pyinit.f:115
#2  0x00007fffc2c63aaf in gen::Pythia6Hadronizer::initializeForInternalPartons() ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/lib/el8_amd64_gcc12/pluginGeneratorInterfacePythia6Filters.so
#3  0x00007fffc2c64a60 in edm::GeneratorFilter<gen::Pythia6Hadronizer, gen::ExternalDecayDriver>::beginLuminosityBlockProduce(edm::LuminosityBlock&, edm::EventSetup const&) ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/lib/el8_amd64_gcc12/pluginGeneratorInterfacePythia6Filters.so

In the job I was running, this happened on thread 5.

Then later in the job, a call to PDFSetHandler::member(int) is made on Thread 1 which leads to the exception being thrown with the call stack

#0  0x00007ffff5bc92f1 in __cxxabiv1::__cxa_throw (obj=0x7ffff0de36a0, tinfo=0x7fffc09ca818 <typeinfo for LHAPDF::ReadError>, dest=0x7fffc0927a90 <LHAPDF::ReadError::~ReadError()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:81
#1  0x00007fffc0909425 in LHAPDF::PDFSet::PDFSet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) [clone .cold] ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/external/el8_amd64_gcc12/lib/libLHAPDF.so
#2  0x00007fffc0940cef in LHAPDF::getPDFSet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/external/el8_amd64_gcc12/lib/libLHAPDF.so
#3  0x00007fffc094214d in LHAPDF::mkPDF(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long) ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/external/el8_amd64_gcc12/lib/libLHAPDF.so
#4  0x00007fffc09a0aed in (anonymous namespace)::PDFSetHandler::loadMember(int) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/external/el8_amd64_gcc12/lib/libLHAPDF.so
#5  0x00007fffc09a0c56 in (anonymous namespace)::PDFSetHandler::member(int) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/external/el8_amd64_gcc12/lib/libLHAPDF.so
#6  0x00007fffc09a3abb in structm_ () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/external/el8_amd64_gcc12/lib/libLHAPDF.so
#7  0x00007fffc28bd1cf in pypdfu (kf=2212, x=<optimized out>, q2=<optimized out>, xpq=...) at pypdfu.f:356
#8  0x00007fffc290553c in pysigh (nchn=0, sigs=0) at pysigh.f:420
#9  0x00007fffc28d4e7a in pyrand () at pyrand.f:1005
#10 0x00007fffc287ea1b in pyevnw () at pyevnw.f:111
#11 0x00007fffc2c6310a in gen::Pythia6Hadronizer::generatePartonsAndHadronize() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/lib/el8_amd64_gcc12/pluginGeneratorInterfacePythia6Filters.so
#12 0x00007fffc2c66a01 in edm::GeneratorFilter<gen::Pythia6Hadronizer, gen::ExternalDecayDriver>::filter(edm::Event&, edm::EventSetup const&) ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-04-1100/lib/el8_amd64_gcc12/pluginGeneratorInterfacePythia6Filters.so

The problem is, the call to structm_ assumes that the ACTIVESETS was already initialized on this thread but we only call pyinit once on the thread that happens to call begin LuminosityBlock on the generator module.

@makortel
Copy link
Contributor

makortel commented Dec 8, 2023

Tagging a related LHAPDF update cms-sw/cmsdist#8852

@Dr15Jones
Copy link
Contributor

This was most likely caused by cms-sw/cmsdist#8852

@Dr15Jones
Copy link
Contributor

So it looks like the FORTRAN adapter layer for LHAPDF assumes all FORTRAN calls will occur on one and only one thread (or at least all FORTRAN calls will happen independently on each thread).

@Dr15Jones
Copy link
Contributor

I can confirm that the change from LHAPDF 6.4.0 to 6.5.4 is the reason. Between those two version there was the change from

 static map<int, PDFSetHandler> ACTIVESETS;

to

static thread_local map<int, PDFSetHandler> ACTIVESETS;

@Dr15Jones
Copy link
Contributor

Probably the only safe change would be to modify GeneratorFilter so it uses its own thread for all calls to FORTRAN. This is similar to how we handed the problems with Geant4's thread local use.

@makortel
Copy link
Contributor

makortel commented Dec 8, 2023

Would it be feasible to give feedback that this change didn't work for us?

@mkirsano
Copy link
Contributor

I can undertake to communicate the problem to the authors

@smuzaffar
Copy link
Contributor

@mkirsano , should we revert the lhapdf change ?

@mkirsano
Copy link
Contributor

I am trying to check if we can revert only lhapdf library.

@mkirsano
Copy link
Contributor

I checked that it is possible to use new pdf sets with lhapdf 6.4.0 (checked on pythia8, Generator step, 1 thread).
So I suggest to try to move back to lhapdf 6.4.0 and have only sets upgraded for the moment.

@smuzaffar
Copy link
Contributor

cms-sw/cmsdist#8869 reverts LHAPDF to 6.4.0 while keeping the pdf sets to 6.5.1

@mkirsano
Copy link
Contributor

The problem is reported to lhapdf-support. Will write also to the main developer.

@mkirsano
Copy link
Contributor

This is the answer from lhapdf developers:

The initialization needs to be called explicitly on each thread: they cannot inherit their memory from the calling process, as shared memory is open to run conditions and cross-thread corruptions. So where you say that you only call pyinit once, you should change the calling code to do so thread-locally!
Prof Andy Buckley
MSc Convener, School of Physics & Astronomy
Particle Physics Experiment Group, University of Glasgow, UK

@makortel
Copy link
Contributor

Thanks @mkirsano, sounds like the way forward is then what @Dr15Jones suggested in #43526 (comment).

(and for what it's worth, from our standpoint this is a terrible way to handle multithreading)

@makortel
Copy link
Contributor

Workflow 561.0 crashed in CMSSW_14_0_X_2023-12-13-2300 on el9_amd64_gcc12

Thread 1 (Thread 0x148801d8c640 (LWP 1351141) "cmsRun"):
#3  0x00001487fe3e17b0 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00001487c690d31c in LHAPDF::LogBicubicInterpolator::_interpolateXQ2(LHAPDF::KnotArray const&, double, unsigned long, double, unsigned long, int) const () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libLHAPDF.so
#6  0x00001487c68f92ca in LHAPDF::PDF::xfxQ2(int, double, double) const () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libLHAPDF.so
#7  0x00001487aae34879 in Pythia8::LHAPDF6::xfUpdate(int, double, double) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libpythia8lhapdf6.so
#8  0x00001487c6f213b9 in Pythia8::LHAPDF::xfUpdate(int, double, double) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libpythia8.so
#9  0x00001487c6f24bae in Pythia8::PDF::xf(int, double, double) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libpythia8.so
#10 0x00001487c6ee9ce4 in Pythia8::MultipartonInteractions::upperEnvelope() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libpythia8.so
#11 0x00001487c6ef9ace in Pythia8::MultipartonInteractions::init(bool, int, Pythia8::BeamParticle*, Pythia8::BeamParticle*, std::shared_ptr<Pythia8::PartonVertex>, bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libpythia8.so
#12 0x00001487c6f3b961 in Pythia8::PartonLevel::init(std::shared_ptr<Pythia8::TimeShower>, std::shared_ptr<Pythia8::TimeShower>, std::shared_ptr<Pythia8::SpaceShower>, Pythia8::RHadrons*, std::shared_ptr<Pythia8::MergingHooks>, std::shared_ptr<Pythia8::PartonVertex>, std::shared_ptr<Pythia8::StringInteractions>, bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libpythia8.so
#13 0x00001487c6fc6ef4 in Pythia8::Pythia::init() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el9_amd64_gcc12/lib/libpythia8.so
#14 0x00001487c87a56f1 in Pythia8Hadronizer::initializeForInternalPartons() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el9_amd64_gcc12/pluginGeneratorInterfacePythia8Filters.so
#15 0x00001487c87c3917 in edm::ConcurrentGeneratorFilter<Pythia8Hadronizer, gen::ConcurrentExternalDecayDriver>::initLumi(edm::gen::GenStreamCache<Pythia8Hadronizer, gen::ConcurrentExternalDecayDriver>*, edm::LuminosityBlock const&, edm::EventSetup const&) const [clone .constprop.0] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el9_amd64_gcc12/pluginGeneratorInterfacePythia8Filters.so
#16 0x00001487c87be701 in edm::ConcurrentGeneratorFilter<Pythia8Hadronizer, gen::ConcurrentExternalDecayDriver>::globalBeginLuminosityBlockProduce(edm::LuminosityBlock&, edm::EventSetup const&) const () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el9_amd64_gcc12/pluginGeneratorInterfacePythia8Filters.so
#17 0x000014880444e2cd in edm::global::EDFilterBase::doBeginLuminosityBlock(edm::LumiTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02815/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el9_amd64_gcc12/libFWCoreFramework.so

Current Modules:
Module: Pythia8ConcurrentGeneratorFilter:generator (crashed)
Module: none
Module: none
Module: none

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el9_amd64_gcc12/CMSSW_14_0_X_2023-12-13-2300/pyRelValMatrixLogs/run/561.0_WprimeTolNu_M3000_13TeV_pythia8/step1_WprimeTolNu_M3000_13TeV_pythia8.log#/

I'm not sure if it is related to this issue, or (way) earlier crashes #35082 #35251

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants