DeepTauId throws (in event with zero taus?) #42444
Comments
A new Issue was created by @VinInn Vincenzo Innocente. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
I still do not understand how a memory corruption can modify the values in an otherwise empty |
the tau collection is a view: no way to survive even the first dereference |
It also sounds like there is a reproducibility problem. Could it be a multithreaded reproducibility issue? |
assign reconstruction |
New categories assigned: reconstruction @mandrenguyen, @clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Let me tag @cms-sw/tau-pog-l2 also in this issue |
Just for the record, I'm hitting this issue while trying to run an HLT job through VTune. Without VTune the configuration works, but within VTune the job crashes in a few different ways, this exception being one of them. |
Running VTune on a single-thread job in an ASAN build resulted in
that hints towards ASAN's own data structures being overwritten |
Running VTune on a single-thread job of
|
Looks like memory corruption inside Tensorflow is not unheard of
(no idea if any of these are in any way related) |
We updated tensorflow to 2.12 in 13_3_0_pre1 (built on Aug 3). I wonder if there would be any temporal correlation with the RECO profiling crashes? |
(maybe I'm going in the wrong direction, given that the DeepTauId throwing an exception on |
In my HLT menu test case with VTune, if I "disable" the two |
Hi @makortel. By "succeed", do you mean just that it finishes without crashing, or that the outputs are always identical? Especially for other taggers that use TF. In the past, we couldn't conclude with 100% certainty whether the memory corruption happens in the DeepTau module or occurs elsewhere and corrupts DeepTau-related memory. Is the machine on which one can reproduce the crash accessible to normal users? If yes, could you please share instructions on how one can reproduce the crash? |
Hi @kandrosov
I mean "finishes without crashing"; I'm not checking the outputs in any way. I sent you privately the HLT recipe that I used. By adding an event-rejecting EDFilter before/after the DeepTauId modules in the menu I see the behavior that
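For illustration only, an event-rejecting EDFilter of the kind mentioned above could be added to a configuration roughly as sketched below; the HLTBool filter type and the module/path names are assumptions for this sketch, not the actual recipe that was shared privately.

```python
# Hypothetical fragment: a filter that rejects every event, placed in the path
# in front of the DeepTau modules so that they never process any event.
process.rejectAllEvents = cms.EDFilter("HLTBool",
    result = cms.bool(False)  # False => the filter fails, later modules in the path are skipped
)
# e.g. process.tauPath = cms.Path(process.rejectAllEvents + process.deepTauSequence)
```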
I also started to look into the step3 of workflow 136.889, which is used in the IgProf profiling in IBs (and which is currently crashing in 13_3_X and 14_0_X). Running the step3 as it is indeed crashes also in VTune, in a way that looks like memory corruption. If I run everything up to and including the DeepTauId modules there (*), the job still crashes. But, contrary to the HLT test case, if I run everything up to the modules DeepTauId consumes (**), but not DeepTauId itself, the job crashes as well. This behavior would indicate that something other than (or maybe in addition to?) DeepTauId causes memory corruption when the job is run through VTune. I'll continue to investigate (albeit slowly).

(*) by adding the snippet

```python
process.deepTauSequence = cms.Sequence(process.deepTau2017v2p1ForMini+process.deepTau2018v2p5ForMini)
process.tauPath = cms.Path(process.deepTauSequence, process.patAlgosToolsTask, process.patTask)
process.schedule = cms.Schedule(
    process.raw2digi_step,
    process.reconstruction_step,
    process.tauPath
)
```

to the end of the step3 configuration

(**) by adding the snippet

```python
process.deepTauSequence = cms.Sequence(
    process.offlineSlimmedPrimaryVertices
    + process.packedPFCandidates
    + process.hpsPFTauTransverseImpactParameters
    + process.slimmedTausNoDeepIDs
    + process.slimmedElectrons
    + process.slimmedMuons
)
process.tauPath = cms.Path(process.deepTauSequence, process.patAlgosToolsTask, process.patTask)
process.schedule = cms.Schedule(
    process.raw2digi_step,
    process.reconstruction_step,
    process.tauPath
)
```

to the end of the step3 configuration |
With the step3 of 136.889 I got to the point that running
Recipe to reproduce (at CERN) is more or less
|
To be more specific about DeepTauId again, I took workflow 140.201, which runs re-MINI, and was able to reproduce the behavior I saw in the earlier HLT job, i.e. including DeepTauId crashes (under VTune), but running only up to the modules consumed by the DeepTauId modules technically works. Recipe to reproduce at CERN
|
I repeated my tests in #42444 (comment) and #42444 (comment) with IgProf (*), and see exactly the same behavior. I.e. running up to the modules that (*) replace
with
FYI @gartung. I think we should add crash detection to the IB profiling jobs: something along the lines of, after detecting a crash in some step of the workflow, not running the subsequent steps, and communicating the failure to the IB dashboard (e.g. similarly to the HLT validation tests that show a red background when there are failures). |
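A minimal sketch of that idea, assuming the profiling steps are standalone commands and that a simple status file would be enough for the dashboard to pick up (the names and file format here are made up; the actual run-ib-profiling machinery may look quite different):

```python
#!/usr/bin/env python3
# Hypothetical sketch: run the workflow steps in order, stop at the first
# crashing step, and record the failure so a dashboard could flag it.
import json
import subprocess
import sys

steps = ["step1.sh", "step2.sh", "step3.sh"]  # placeholder step commands

status = {"failed_step": None, "exit_code": 0}
for step in steps:
    ret = subprocess.run(["bash", step]).returncode
    if ret != 0:
        status["failed_step"] = step   # remember which step crashed
        status["exit_code"] = ret
        break                          # do not run the subsequent steps

with open("profiling_status.json", "w") as f:
    json.dump(status, f, indent=2)

sys.exit(1 if status["failed_step"] else 0)
```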
With IgProf I managed to run it together with Valgrind, and got (on 136.889 step 3 customized to run
|
Just shooting in the dark, I increased the stack size per thread from 10 MB to 20 MB with |
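For reference, one way to change the per-thread stack size in a CMSSW configuration is the framework option sizeOfStackForThreadsInKB; whether that is the mechanism actually used above is an assumption, since the command was not quoted.

```python
# Hypothetical fragment to append to an existing configuration. The option is
# given in KB, so going from the 10 MB default to 20 MB means 20480.
process.options.sizeOfStackForThreadsInKB = cms.untracked.uint32(20480)
```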
The heap-allocator "metadata" is most probably corrupted. Is this with jemalloc? Maybe TensorFlow makes assumptions that jemalloc does not comply with (alignment?). |
My recent tests were indeed with jemalloc. My HLT test case crashed with glibc malloc as well (#42444 (comment)) when run through VTune. I could test the offline workflows and IgProf with glibc malloc too. |
The oneDNN library must be compiled with the CMake option ONEDNN_ENABLE_MAX_CPU_ISA=1. Not sure how TensorFlow compiles oneDNN. |
|
Adding |
JIT profiling can be disabled in the code by passing 0 to dnnl_set_jit_profiling_flags |
There is an example of running perf on the jitted code with the appropriate environment variables |
Setting |
I was incorrect about the environment variables before. I think I had |
I made one more try with
|
FYI, we now have |
The results I got from running the run-ib-profiling still showed segfaults, although they are not in TensorFlow-using modules |
@smuzaffar I saw this commit while looking at libeigen on gitlab. Could it be related? |
The DeepTauId model manipulates very sparse tensors; could it be that Eigen considers zero tensors as empty tensors and then contracts them? This may be correlated with the fact that this crash does not happen with other TF models. |
There were some occurrences of |
I vaguely recall we saw some cases before too where the crash was in a destructor, like in
My interpretation was/is that this kind of crash is compatible with memory corruption, which seems to me to be the overarching theme of these problems. |
@gartung, I have tested this Eigen fix cms-externals/eigen-git-mirror#8, but both IgProf and VTune still crash. I tried running #42444 (comment) after setting the environment from /cvmfs/cms-ci.cern.ch/week0/cms-externals/eigen-git-mirror/8/37124/CMSSW_14_0_X_2024-01-30-2300 |
I changed run-ib-profiling to run VTune instead of IgProf, and there are segfaults running workflows involving TensorFlow |
cms-sw/cmsdist#9021 includes the "Don't crash on empty tensor contraction" fix. Note that this fix is already in the Eigen version used by the TF_X IBs (which are based on TensorFlow 2.15 + a newer Eigen) |
I started to wonder if a Tensorflow built with |
I ran some tests again with VTune (2024.2) on CMSSW_14_1_0_pre7 on EL8 natively. With 12634.21 (2023 TTBar+PU MC ProdLike) step3 I ran 4 attempts with all succeeding (I didn't check the profiles though). With 136.889 (2018D MET data) step3 I reproduce a failure, although now with exceptions of
or
I tested the environment variables
separately and all together, and all resulted in a failure. |
Would it make sense (and work?) to wrap the calls to TensorFlow with the following?

```cpp
#include <ittnotify.h>
...
__itt_pause();   // Pause profiling
// call to TensorFlow goes here
__itt_resume();  // Resume profiling
```
|
Mhm, no, because |
I have a draft pull request for cmsdist that sets enable-tf-mkldnn in the rpm spec file |
With the CMSSW installation provided by cms-sw/cmsdist#9471 (comment), workflow 136.889 step 3 + VTune still fails for me, with or without
@gartung I thought disabling the oneDNN JITting helped earlier? |
I thought so too. Maybe it was a different TensorFlow build option. |
Using the CMSSW installation provided by cms-sw/cmsdist#9471 (comment), workflow step3 + VTune works: it does not segfault, but it does throw a DeepTauId exception. |
A DeepTauId exception was one symptom reported earlier, #42444 (comment). So I'm tempted to conclude cms-sw/cmsdist#9471 (comment) did not work. |
The original issue #40437 has taken a different path, so I reopen here just for this.
Reminder of relevant posts:
#40437 (comment)
#40437 (comment)
#28358 (closed. WHY?)
a log file
https://cms-unified.web.cern.ch/cms-unified/joblogs/haozturk_ACDC0_Run2022D_BTagMu_10Dec2022_221221_171338_6693/8001/DataProcessing/020da37f-6871-4688-a86e-2b7e8a6bc683-26-0-logArchive/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log
Here the "smoking gun": trying to reproduce, the event happens to have ZERO taus
#40437 (comment)