-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HLT crash caused by SiPixelDigisClustersFromSoA
(run 357271)
#39045
Comments
A new Issue was created by @missirol Marino Missiroli. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign reconstruction, heterogeneous |
FYI @cms-sw/trk-dpg-l2 @VinInn |
New categories assigned: heterogeneous,reconstruction @jpata,@fwyzard,@clacaputo,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks |
I could reproduce it running on the original error stream file on the online machines: cmsrel CMSSW_12_4_6
mkdir CMSSW_12_4_6/run
cd CMSSW_12_4_6/run
cmsenv
https_proxy=http://cmsproxy.cms:3128 hltConfigFromDB --runNumber 357271 > hlt.py
cat >> hlt.py <<@EOF
process.source.fileListMode = True
process.source.fileNames = [ '/store/error_stream/run357271/run357271_ls1351_index000134_fu-c2b03-14-01_pid4069417.raw' ]
@EOF
cmsRun hlt.py I did get the same error, and there is a message shortly before it that may be related:
|
I did a bit of investigation and the code is crashing in https://github.com/cms-sw/cmssw/blob/CMSSW_12_4_6/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc#L114 where
obtained by adding the following changes to diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc
index d36c345ecf0..c0328d665b0 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc
@@ -43,6 +43,8 @@ private:
const bool produceDigis_;
const bool storeDigis_;
const bool isPhase2_;
+ const std::string moduleType_;
+ const std::string moduleLabel_;
};
SiPixelDigisClustersFromSoA::SiPixelDigisClustersFromSoA(const edm::ParameterSet& iConfig)
@@ -53,7 +55,9 @@ SiPixelDigisClustersFromSoA::SiPixelDigisClustersFromSoA(const edm::ParameterSet
iConfig.getParameter<int>("clusterThreshold_otherLayers")},
produceDigis_(iConfig.getParameter<bool>("produceDigis")),
storeDigis_(iConfig.getParameter<bool>("produceDigis") & iConfig.getParameter<bool>("storeDigis")),
- isPhase2_(iConfig.getParameter<bool>("isPhase2")) {
+ isPhase2_(iConfig.getParameter<bool>("isPhase2")),
+ moduleType_(iConfig.getParameter<std::string>("@module_type")),
+ moduleLabel_(iConfig.getParameter<std::string>("@module_label")) {
if (produceDigis_)
digiPutToken_ = produces<edm::DetSetVector<PixelDigi>>();
}
@@ -111,10 +115,18 @@ void SiPixelDigisClustersFromSoA::produce(edm::StreamID, edm::Event& iEvent, con
auto fillClusters = [&](uint32_t detId) {
if (nclus < 0)
return; // this in reality should never happen
+ if (outputClusters->exists(detId)) {
+ edm::LogWarning("SiPixelDigisClustersFromSoA")
+ << "Problem det present twice in input! " << detId;
+ nclus = -1;
+ return;
+ }
edmNew::DetSetVector<SiPixelCluster>::FastFiller spc(*outputClusters, detId);
+ std::cout << ">> DetId: " << detId << std::endl;
auto layer = (DetId(detId).subdetId() == 1) ? ttopo.pxbLayer(detId) : 0;
auto clusterThreshold = clusterThresholds_.getThresholdForLayerOnCondition(layer == 1);
for (int32_t ic = 0; ic < nclus + 1; ++ic) {
+ std::cout << ">>>> clusId: " << ic << std::endl;
auto const& acluster = aclusters[ic];
// in any case we cannot go out of sync with gpu...
if (acluster.charge < clusterThreshold and !isPhase2_)
@@ -143,6 +155,9 @@ void SiPixelDigisClustersFromSoA::produce(edm::StreamID, edm::Event& iEvent, con
spc.abort();
};
+ std::cout << "===========================================================================" << std::endl;
+ std::cout << moduleType_ << ":" << moduleLabel_ << " START" << std::endl;
+ std::cout << "===========================================================================" << std::endl;
for (uint32_t i = 0; i < nDigis; i++) {
// check for uninitialized digis
if (digis.rawIdArr(i) == 0)
@@ -172,6 +187,7 @@ void SiPixelDigisClustersFromSoA::produce(edm::StreamID, edm::Event& iEvent, con
}
}
}
+ std::cout << ">> DetId digi: " << detId << " " << i << std::endl;
PixelDigi dig(digis.pdigi(i));
if (storeDigis_)
(*detDigis).data.emplace_back(dig);
@@ -186,7 +202,9 @@ void SiPixelDigisClustersFromSoA::produce(edm::StreamID, edm::Event& iEvent, con
SiPixelCluster::PixelPos pix(row, col);
aclusters[digis.clus(i)].add(pix, digis.adc(i));
}
-
+ std::cout << "===========================================================================" << std::endl;
+ std::cout << moduleType_ << ":" << moduleLabel_ << " END" << std::endl;
+ std::cout << "===========================================================================" << std::endl;
// fill final clusters
if (detId > 0)
fillClusters(detId); The root cause of the problem therefore seems to be somewhere else, presumably in either |
For reference, this type of crash occurred online again in runs 357759 and 357778. (If it helps, we can get the problematic events from those runs, although there is already a recipe to reproduce the crash.) |
Hi, I was doing some investigations on top of those done by @ferencek and I've found that for the detID that are failing something weird happens. By modifying --- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
@@ -332,6 +332,23 @@ namespace pixelgpudetails {
//if (threadIdx.x==0) printf("Event: %u blockIdx.x: %u start: %u end: %u\n", eventno, blockIdx.x, begin, end);
int32_t first = threadIdx.x + blockIdx.x * blockDim.x;
+ if(first==0)
+ {
+ for (uint32_t i = 0; i < wordCounter; i++) {
+ uint32_t ww = word[i];
+ uint8_t fedId = fedIds[i / 2]; // +1200;
+
+
+ uint32_t link = sipixelconstants::getLink(ww); // Extract link
+ uint32_t roc = sipixelconstants::getROC(ww); // Extract Roc in link
+ pixelgpudetails::DetIdGPU detId = getRawId(cablingMap, fedId, link, roc);
+
+ uint32_t rawId = detId.rawId;
+
+ printf("wordGPU %d %d %d \n",i, ww, rawId);
+
+ }
+ }
for (int32_t iloop = first, nend = wordCounter; iloop < nend; iloop += blockDim.x * gridDim.x) {
auto gIndex = iloop;
xx[gIndex] = 0;
@@ -559,6 +577,10 @@ namespace pixelgpudetails {
cudaCheck(
cudaMemcpyAsync(word_d.get(), wordFed.word(), wordCounter * sizeof(uint32_t), cudaMemcpyDefault, stream));
+
+ for (unsigned int j = 0; j < wordCounter; j++) {
+ std::cout << "words "<< j << " - " << wordFed.word()[j] << std::endl;
+ }
cudaCheck(cudaMemcpyAsync(
fedId_d.get(), wordFed.fedId(), wordCounter * sizeof(uint8_t) / 2, cudaMemcpyDefault, stream)); what I see (I'm clamping the output here) is that the words on CPU are OK, e.g:
while on GPU (the last number being the raw detID and there's the incriminated one,
where it seems that going above
It isn't normal, is it? I'm trying to understand why this happens (and seems this could be the cause of the crashes, or at least being correlated to it) and posting it here if somebody sees something I don't see. |
Shouldn't that be
I'm surprised the compiler didn't warn, modern compilers are generally pretty good at spotting implicit type conversions in printf. |
You are right. Sorry for the huge noise. For the warning I imagine that's |
Using @AdrianoDee's code and printing the
The Since this is printed sequentially inside the Excuse my ignorance, I haven't found any documentation on where this sorting is done. |
I don't think there is any explicit sorting in the CUDA code path. The body of Could the out-of-order appearance of DetId's be a feature of the raw data itself? The legacy raw-to-digi code seems to handle that case cmssw/EventFilter/SiPixelRawToDigi/src/PixelDataFormatter.cc Lines 205 to 211 in 73bb0f6
The cmssw/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc Lines 166 to 172 in 73bb0f6
(is that warning message visible?) for digis (that use edm::DetSetVector ). For clusters (that use edmNew::DetSetVector , I think, the exception gets thrown herecmssw/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc Line 114 in 73bb0f6
in case the detId already exists in the edmNew::DetSetVector .
So for the legacy digi-cluster chain it seems to me the sorting of digis happens implicitly in filling first the |
For the record, we continue to see this crash online sporadically. Runs where this happened:
(I haven't investigated recent crashes, I hope the original reproducer is sufficient) FYI: @cms-sw/hlt-l2 |
Having tested the configuration posted by @missirol above with a dummy fix in which we sort the digis in --git a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc
index d36c345ecf0..d5d2ae8a0c6 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelDigisClustersFromSoA.cc
@@ -17,6 +17,8 @@
#include "Geometry/Records/interface/TrackerTopologyRcd.h"
#include "Geometry/CommonTopologies/interface/SimplePixelTopology.h"
+#include <numeric>
+
// local include(s)
#include "PixelClusterizerBase.h"
#include "SiPixelClusterThresholds.h"
@@ -143,7 +145,13 @@ void SiPixelDigisClustersFromSoA::produce(edm::StreamID, edm::Event& iEvent, con
spc.abort();
};
- for (uint32_t i = 0; i < nDigis; i++) {
+ std::vector<uint32_t> sortIdxs(nDigis);
+ std::iota(sortIdxs.begin(), sortIdxs.end(), 0);
+ std::sort(
+ sortIdxs.begin(), sortIdxs.end(), [&](int32_t const i, int32_t const j) { return digis.rawIdArr(i) > digis.rawIdArr(j); });
+
+ for (uint32_t id = 0; id < nDigis; id++) {
+ auto i = sortIdxs[id]; everything run smoothly. I don't see any obvious drawback (the sorting itself takes |
@AdrianoDee the one thing that is not clear to me is if there is anything that makes use of the digis being sorted on the gpu ? |
GPU code expects data from a single module to be contiguous. |
OK, then the sorting (at least by module id) should be added to the cpu code ? |
Not sure what to do. The input format is not what expected (duplicated modules?). It's a bit like duplicated pixels. |
here |
In my opinion if one switch |
Actually, no, I do not get an assertion failure:
only the failure in the CPU conversion:
|
ok. let's protect the conversion from SoA and go on: (rechits module most probably did not run yet) |
Adding the |
@ferencek @nothingface0 while sorting the data technically fixes the crash, it is not clear to us if the resulting reconstruction is correct, and in fact if the input data are valid or not. |
assign trk-dpg |
proposed solution
this should protect the "map" (maybe also |
It is then not just roc=17 that is spurious but any roc>8 (seems to be confirmed by https://github.com/cms-sw/cmssw/blob/CMSSW_12_4_0/EventFilter/SiPixelRawToDigi/src/PixelDataFormatter.cc#L49). In August of last year I privately emailed Danek about fed=1326, link=41, roc=9 after Marco Musich reported seeing these warnings when privately re-recoing CRUZET data. The answer from Danek was:
There was also a response from Danek to Marco on Mattermost. So in general spurious ROCs could be identified based on the cabling map info. However, I am somewhat puzzled by detId 344795140 appearing on the same link with detId 344794116. Shouldn't detId be defined by the cabling map and therefore couldn't be messed up by readout errors. |
the cabling map is a simple vector on gpu whose index is
so if roc is >MAX_ROC it will spill in next link... The protection
is not only safe: I would say it is mandatory given the algorithm above. |
IMHO We cannot ignore it. CMS cannot continue to loose data. I'm very surprised that such a serious problem is still not solved after two months (whatever solution would have been including reverting to legacy). |
I am trying to run with these changes: diff --git a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
index 670d5a9131b3..4ac62ada8160 100644
--- a/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
+++ b/RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu
@@ -354,8 +354,15 @@ namespace pixelgpudetails {
uint32_t link = sipixelconstants::getLink(ww); // Extract link
uint32_t roc = sipixelconstants::getROC(ww); // Extract Roc in link
pixelgpudetails::DetIdGPU detId = getRawId(cablingMap, fedId, link, roc);
+ uint32_t rawId = detId.rawId;
+
+ // check for spurious channels
+ if (roc > MAX_ROC or link > MAX_LINK) {
+ printf("spurious link %d, ROC %d, found with index %d in detector %d\n", link, roc, gIndex, rawId);
+ continue;
+ }
uint8_t errorType = checkROC(ww, fedId, link, cablingMap, debug);
skipROC = (roc < pixelgpudetails::maxROCIndex) ? false : (errorType != 0);
if (includeErrors and skipROC) { This does avoid the crashes, but it prints a huge amount of messages, about 600 per event:
Does this really make sense ? |
Also, should the check be if (roc > MAX_ROC or link > MAX_LINK)
continue; or if (roc >= MAX_ROC or link > MAX_LINK)
continue; ?
uint32_t index = fedId * MAX_LINK * MAX_ROC + (link - 1) * MAX_ROC + roc; while |
too early. The check should go AFTER the error decoding |
MAX_ROC is 8 and rocs are numbered from 1 to 8 |
maybe the check should go in checkROC (but then one should "invent a new error code.....") |
The original error stream file
is no longer there :-/ |
True.. it can still be found at
|
Thanks Marino, and thanks Vincenzo for the suggestions. After moving the check lower in the code now I get
So the check seems consistent with the legacy unpacker. |
can't be anything done on the pixel f/w side? |
Status as of today: #39711 is merged, will wait for IBs this evening and then merge the 12_4 PR when completed, and cut a 12_4_10_patch1. |
By the way, the release is ready now: https://github.com/cms-sw/cmssw/releases/tag/CMSSW_12_4_10_patch1 |
Thanks for building the release, @cms-sw/orp-l2 . Attn: @trtomei @silviodonato |
@missirol, can this be considered fixed, and therefore close the issue? |
I think so, but it would be good to have signature/confirmation from experts, e.g. @cms-sw/heterogeneous-l2 @cms-sw/trk-dpg-l2 . (in any case, I'll close it by the end of the week) |
+heterogeneous |
please close |
In run-357271, one HLT job crashed with the following error message:
(the monitoring tool does not provide the full error message from
cmsRun
, afaik)The error is reproducible (see recipe below). Since it originates from the GPU branch of the reconstruction sequence, it can be reproduced only on a machine with a GPU. The input file is currently on
lxplus
.FYI: @fwyzard @silviodonato
The text was updated successfully, but these errors were encountered: