Seg Fault at beginning of event loop in threaded job #15145
Comments
A new Issue was created by @Dr15Jones Chris Jones. @davidlange6, @smuzaffar, @Degano, @davidlt, @Dr15Jones can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign core |
New categories assigned: core @Dr15Jones,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks |
A somewhat similar no-traceback crash: I looked at that one a bit on Tuesday, but my main discovery so far is the uninitialized variable that led to #15131. -dan |
The no-traceback crashes happen periodically and I haven't made an effort to open issues for them since I could normally find ones with full tracebacks to look at :). We should definitely start cataloging these no-traceback ones. My suspicion is they come from code generated by cling. |
Another one: I did figure out that the reason the first one only has two threads processing events is because that workflow only has two events :) I'm going to start cataloging what the other threads are doing, as that may give some clue where the (assumed) conflict is occurring. In this case, two of the other threads are sitting in edm::SharedResourcesAcquirer::lock(), and the fourth is in edm::Principal::prefetch():
|
A thread waiting in edm::SharedResourcesAcquirer::lock() usually means the thread wants to run a legacy module but another thread is already running one. The thread running edm::Principal::prefetch() is the one running the legacy module since it has edm::EDProducer::doEvent in the call stack. That same thread also has the EventSetup mutex lock since it is running an ESProducer. P.S. I'm somewhere in the middle of the Gulf of Mexico :) |
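To make that serialization concrete, here is a minimal sketch of the idea (illustrative only; runLegacyModule and legacyModuleMutex are hypothetical names, not the actual edm::SharedResourcesAcquirer code): legacy modules all share one resource, so acquiring it amounts to taking a common mutex, and any other thread that wants to run a legacy module blocks in lock() until the current one finishes.
```cpp
// Illustrative sketch only; the names here are hypothetical, not the real CMSSW framework API.
#include <functional>
#include <mutex>

// One global mutex stands in for the shared resource that all legacy modules declare.
std::mutex legacyModuleMutex;

// A thread that wants to run a legacy module blocks here until no other legacy
// module is running; this is what "waiting in lock()" corresponds to in the traces.
void runLegacyModule(const std::function<void()>& doEvent) {
  std::lock_guard<std::mutex> guard(legacyModuleMutex);
  doEvent();  // e.g. the module's analyze()/produce() for this event
}
```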
I'm not seeing any pattern developing... Something I find surprising: the "A fatal system signal has occurred: segmentation violation" message occurs while processing the initial four records, but between that and the stack trace there are 800 lines of normal output, including "begin processing" messages for 8 events. So the stack trace may not be at all representative of what the other threads were doing at the time of the segfault. Odd that it took so long. Is there a way to have the signal handler suspend everything besides the stack-trace thread so that we get more informative stack traces?
|
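One way to get that behavior (and presumably the rough shape of the "pause feature" mentioned below, though this is only a sketch, not the framework's actual implementation) is for the fatal-signal handler to send a user signal to every other thread, whose handler simply parks the thread, so the stack traces are taken while everything else is frozen. The thread bookkeeping (gAllThreads) is hypothetical and the sketch is not fully async-signal-safe; it just illustrates the mechanism.
```cpp
// Sketch of pausing the other threads from a fatal-signal handler.
// gAllThreads is hypothetical bookkeeping; a real implementation needs
// async-signal-safe data structures and error handling.
#include <pthread.h>
#include <signal.h>
#include <unistd.h>
#include <vector>

std::vector<pthread_t> gAllThreads;  // assume this is filled as threads start

void parkThread(int) {
  while (true) pause();  // freeze this thread until the process exits
}

void fatalSignalHandler(int sig) {
  pthread_t self = pthread_self();
  for (pthread_t t : gAllThreads)
    if (!pthread_equal(t, self)) pthread_kill(t, SIGUSR1);  // ask them to park
  // ... print the stack trace(s) here, with the other threads now frozen ...
  signal(sig, SIG_DFL);  // then fall back to the default action (core dump)
  raise(sig);
}

void installHandlers() {
  signal(SIGUSR1, parkThread);
  signal(SIGSEGV, fatalSignalHandler);
}
```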
Same thing, 6 events processed after segfault: most threads are idle because, while the segfault happened in the first 4 events, the job had started on all 10 events requested and some threads had run out of work.
|
Yet another workflow
|
First stack trace with the pause feature inspired by this issue!
|
and another, pointing strongly to an issue with calling into TCling
|
Similar to previous-1
|
Looks like TrackingMonitor again. So probably something to do with the lazy evaluation mode of the expression parser. |
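For reference, the cut strings quoted a few comments below go through the generic string-cut parser; a standalone sketch of how such a cut is typically evaluated follows. The header path, the template signature, and the meaning of the second constructor argument are quoted from memory (treat them as assumptions), and the wrapper function is hypothetical. The point is that with lazy parsing the method lookups ("pt", "quality") are deferred to the first evaluation, i.e. they can happen on a worker thread, which is where the cling/TClass machinery gets involved.
```cpp
// Sketch of how a TrackingMonitor-style cut string is typically evaluated.
// Header path and constructor signature are from memory; verify before relying on them.
#include "CommonTools/Utils/interface/StringCutObjectSelector.h"
#include "DataFormats/TrackReco/interface/Track.h"

bool passesNumeratorCut(const reco::Track& track) {
  // Second argument 'true' requests lazy parsing: resolving "pt" and "quality"
  // against the dictionary is deferred until the first call, i.e. it can run
  // concurrently on worker threads during the event loop.
  static const StringCutObjectSelector<reco::Track> numCut(
      " pt >= 1 & quality('highPurity') ", true);
  return numCut(track);
}
```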
Another one, all four threads in TrackingMonitor: I believe we can declare this our current most frequent threading-related crash. Attempts to reproduce with a stripped-down config are so far not showing any success.
|
What about running the job under gdb and putting a break point on TrackingMonitor::analyze then stepping through the module until you can actually cause the '????' back trace? That way at least we will know what call chain the crash happens in (assuming the '????' also appear in normal running). |
I took a look at TrackingMonitor and it appears that the strings being parsed are [https://github.com/cms-sw/cmssw/blob/b2d14d7371fb80ecd6fce9613d2e847b94b95ee7/DQM/TrackingMonitor/python/TrackingMonitor_cfi.py#L8]:
numCut = cms.string(" pt >= 1 & quality('highPurity') ")
denCut = cms.string(" pt >= 1 ")
An interesting thing to note is |
The enum |
Another one from one of the HI workflows that only does 2 events, so if there's interference we know it's from the other thread
|
New variation
|
|
Another one with only two threads, will have to try using these to narrow down what the other thread was just doing...
|
Threads 4 and 2 are both in TListOfFunctionsIter::Next, thread 4 has the mutex, thread 2 is waiting on it. I've been looking at the locking scheme, suspecting that we have another case of a race being amplified by a mutex, so any stack trace with a mutex held is of special interest.
|
Another interesting one
|
|
|
Similar to one last week. I've been trying to put loops in reco::findMethod() and a few other places to stress the locking scheme, but still no success replicating.
|
|
still there...
|
This one is kind of interesting because two of the threads are waiting on gInterpreterMutex in TListOfFunctionsIter::Next(), with thread 1 in TClass::GetClass holding the mutex.
|
Two threads waiting on mutex, and one unlocking:
leads eventually to
and it looks like we have a set-before-initialized race on fMethodArgs. Full stack trace:
|
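For anyone reading along, a generic illustration of what a "set before initialized" race looks like and how it is typically fixed (the class and member names below are invented; this is not the actual TClass/fMethodArgs code): the racy version publishes the pointer before the object behind it is fully built, so a second thread that sees a non-null pointer can walk a half-initialized list; the fix is to finish initialization first and publish last.
```cpp
// Generic illustration of a set-before-initialized race and its fix.
// Names are invented for the example; this is not the real ROOT code.
#include <atomic>
#include <mutex>
#include <vector>

struct MethodArgList {
  std::vector<int> args;  // stands in for whatever the lazily built list holds
};

class Holder {
  std::atomic<MethodArgList*> fList{nullptr};
  std::mutex fMutex;

public:
  // Racy version: the pointer becomes visible before the list is filled, so a
  // reader that sees non-null may iterate an incompletely built object.
  MethodArgList* getRacy() {
    if (!fList.load(std::memory_order_acquire)) {
      std::lock_guard<std::mutex> lock(fMutex);
      if (!fList.load(std::memory_order_relaxed)) {
        MethodArgList* l = new MethodArgList;
        fList.store(l, std::memory_order_release);  // published too early...
        l->args.assign({1, 2, 3});                  // ...while still being filled
      }
    }
    return fList.load(std::memory_order_acquire);
  }

  // Fixed version: fully initialize first, publish last.
  MethodArgList* getFixed() {
    if (!fList.load(std::memory_order_acquire)) {
      std::lock_guard<std::mutex> lock(fMutex);
      if (!fList.load(std::memory_order_relaxed)) {
        MethodArgList* l = new MethodArgList;
        l->args.assign({1, 2, 3});                  // finish building the object
        fList.store(l, std::memory_order_release);  // then make it visible
      }
    }
    return fList.load(std::memory_order_acquire);
  }
};
```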
AFAIK this doesn't yet have the fMethodArgs bug fix (it looks like cms-sw/root and root.spec were just updated to add it, after this IB), but this crash is still interesting because it isn't all "???"--there's a stack trace with "__cf_16" crashing, which I'm pretty sure is cling-generated:
|
I talked with Shahzad today about this (I'm at CERN) and he's updating our version of ROOT to include the needed fix. |
This bug hasn't recurred in a month, so I think it is safe to conclude that the fMethodArgs bug was the cause, and it is now fixed. |
Superficially looks like the previous crashes, so maybe still a race condition somewhere that's harder to hit?
|
Another one along the same lines
|
Apparently still not quite fixed, but really rare:
|
Still seeing these at a low rate, this is partly a reminder to myself to continue looking at these, especially checking the other threads for clues.
|
Haven't seen one of these in a while...no obvious clues, but adding it to the collection:
|
Closing this issue. Please open a new one if you still see such threading issues. |
This job crashed with a segmentation fault:
https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/slc6_amd64_gcc530/CMSSW_8_1_THREADED_X_2016-07-06-2300/pyRelValMatrixLogs/run/140.2_PhotonJets_Pt_10_13_HI+PhotonJets_Pt_10_13_HIINPUT+DIGIHI+RECOHI+HARVESTHI/step3_PhotonJets_Pt_10_13_HI+PhotonJets_Pt_10_13_HIINPUT+DIGIHI+RECOHI+HARVESTHI.log
What is interesting is that only two of the threads had begun processing events before the crash (two of the threads are waiting in TBB, probably because they have yet to get their task to run). So although the thread with the crash has no usable traceback, we know we are out of the Source and somewhere processing data for one of the events.
The relevant parts of the log are below: