Identify and kill looping tracks #685
* Add SimParams with threshold values for detecting and killing loopers
* Increase maximum substeps in FieldPropagator to 1000 to match Geant4
* Kill looping tracks in along-step
* Update SimpleCmsAlongStep test

* Add from_import method to SimParams
* Update four-steel-slabs.root for tests

…W filter (this will make it easier to investigate the properties of abandoned looping tracks)

…imit even if they are not looping). Also:
* Rename looping action
* Clean up
@amandalund @stognini identified one of the stuck track volumes in cms2018, vol ID 3084, as "ESPM". Googling shows that CMS run through Geant4 (with and without VecGeom) has the same problem: https://indico.cern.ch/event/629803/contributions/2820951/attachments/1573948/2484706/G4-Talk11.pdf
Looks good overall. We may need to brainstorm for a few minutes about how to use the smaller value for "max substeps" while keeping the "kill looping particles" value the same. Maybe the value we save to the TransParameters should be "maximum substeps before killing", and we build that factor of 1000 into the geant exporter? And maybe we change "immediately kill" to have a second threshold value, going from an implicit "1 step" to an explicit "number of unconverged steps for unimportant tracks"?
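The two-threshold idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: `LoopingThreshold`, its fields, `update_looping_count`, and `should_kill` are hypothetical names, not the Celeritas or Geant4 API, and the energies used with it are arbitrary.

```cpp
#include <cassert>

// Hypothetical parameter set (illustrative names, not the Celeritas API).
struct LoopingThreshold
{
    int max_subthreshold_steps;  // unconverged steps allowed for unimportant tracks
    int max_steps;               // unconverged steps allowed for important tracks
    double threshold_energy;     // tracks at or above this energy are "important"
};

// Count consecutive unconverged (looping) steps; a converged step resets it.
inline int update_looping_count(int count, bool is_looping)
{
    return is_looping ? count + 1 : 0;
}

// Decide whether a track that has looped for `num_looping_steps` consecutive
// steps should be abandoned. Setting max_subthreshold_steps to 1 reproduces
// the "immediately kill" behavior for low-energy tracks.
inline bool should_kill(LoopingThreshold const& p,
                        double energy,
                        int num_looping_steps)
{
    int const limit = (energy < p.threshold_energy) ? p.max_subthreshold_steps
                                                    : p.max_steps;
    return num_looping_steps >= limit;
}
```

With this shape, the field propagator's internal substep limit can stay small while the number of tolerated looping steps carries the rest of the budget.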
@@ -62,7 +62,7 @@ class FieldPropagator
     inline CELER_FUNCTION result_type operator()(real_type dist);

     //! Limit on substeps
-    static CELER_CONSTEXPR_FUNCTION short int max_substeps() { return 128; }
+    static CELER_CONSTEXPR_FUNCTION short int max_substeps() { return 1000; }
As discussed, let's turn this down to 128 and adjust the looping kill threshold to match.
Will do. One thing to keep in mind is that by lowering this threshold (and increasing the number of steps allowed while looping accordingly) we seem to be killing more "stuck" tracks in cms2018 than we were with the higher threshold. These were the tracks with an "abandoned looping" action with max_substeps set to 100:
************************************************************************
* Row * particle * pre_energ * pre_volum * step_leng * track_ste *
************************************************************************
* 0 * 11 * 0.5121168 * 5065 * 26.678759 * 17 *
* 1 * 11 * 0.4632876 * 5082 * 41.447479 * 70 *
* 2 * 11 * 0.1013242 * 5093 * 23.316498 * 115 *
* 3 * 11 * 2.6904279 * 5063 * 46.786605 * 41 *
* 4 * 11 * 0.4679226 * 2221 * 0.0004375 * 30 *
* 5 * -11 * 0.7619003 * 5069 * 28.127852 * 193 *
* 6 * 11 * 1.0009525 * 5063 * 30.970636 * 25 *
* 7 * 11 * 1.0137611 * 2555 * 0.0003759 * 19 *
* 8 * 11 * 12.893576 * 3084 * 0.0001773 * 20 *
* 9 * 11 * 0.5493182 * 5082 * 25.224891 * 31 *
* 10 * 11 * 5.7339849 * 3084 * 0.0001562 * 33 *
* 11 * 11 * 0.6566941 * 5082 * 47.337937 * 26 *
* 12 * 11 * 56.431254 * 3084 * 0.0001120 * 47 *
* 13 * -11 * 16.367950 * 2411 * 0.0001352 * 36 *
* 14 * 11 * 5.9874424 * 3084 * 0.0002270 * 22 *
* 15 * 11 * 20.15793 * 3084 * 0.0001632 * 39 *
* 16 * -11 * 0.6087991 * 2423 * 0.0001202 * 73 *
* 17 * 11 * 19.771174 * 2449 * 0.0002086 * 17 *
* 18 * 11 * 4.9902261 * 3084 * 0.0001984 * 47 *
* 19 * -11 * 9.1363966 * 2603 * 0.0003791 * 67 *
* 20 * 11 * 6.1166736 * 3084 * 0.0001377 * 39 *
* 21 * 11 * 5.1025576 * 2803 * 0.0001873 * 12 *
* 22 * 11 * 10.336403 * 3084 * 0.0005061 * 19 *
* 23 * 11 * 1.6552678 * 2543 * 0.0002113 * 32 *
* 24 * 11 * 30.658153 * 2499 * 0.0001122 * 13 *
* 25 * 11 * 0.7747261 * 3084 * 0.0001552 * 45 *
* 26 * 11 * 10.615128 * 3084 * 0.0001538 * 59 *
* 27 * 11 * 9.8609919 * 2399 * 0.0021863 * 32 *
* 28 * -11 * 16.588916 * 3084 * 0.0001552 * 21 *
* 29 * -11 * 22.491110 * 3084 * 0.0003995 * 24 *
* 30 * 11 * 1.4691755 * 2373 * 0.0001356 * 17 *
* 31 * -11 * 12.016336 * 2951 * 0.0001250 * 20 *
* 32 * 11 * 1.4753258 * 2540 * 0.0009912 * 34 *
* 33 * 11 * 0.3667352 * 2903 * 0.0014047 * 13 *
* 34 * 11 * 0.5869807 * 2540 * 0.0004919 * 27 *
* 35 * 11 * 0.3228921 * 3084 * 0.0008961 * 66 *
* 36 * 11 * 2.3988821 * 5082 * 44.248788 * 18 *
* 37 * 11 * 6.2385344 * 2523 * 0.0065271 * 534 *
* 38 * 11 * 1.9358633 * 3084 * 0.0002501 * 26 *
* 39 * 11 * 0.0978481 * 3084 * 0.0001941 * 29 *
* 40 * 11 * 1.8782077 * 2779 * 0.0005870 * 44 *
* 41 * 11 * 0.1482836 * 2449 * 0.0030499 * 49 *
* 42 * 11 * 20.478014 * 2535 * 0.0003603 * 39 *
* 43 * 11 * 0.3548805 * 2543 * 0.0002631 * 41 *
* 44 * -11 * 8.5498498 * 2651 * 0.0002175 * 13 *
* 45 * -11 * 8.5716421 * 3084 * 0.0001628 * 20 *
* 46 * 11 * 1.7660504 * 2543 * 0.0001756 * 36 *
* 47 * 11 * 0.1619487 * 5079 * 17.756033 * 37 *
* 48 * 11 * 0.2548802 * 3084 * 0.0097583 * 75 *
* 49 * 11 * 0.5579446 * 3084 * 0.0006049 * 25 *
* 50 * -11 * 3.3970329 * 3084 * 0.0001787 * 18 *
* 51 * 11 * 0.1322118 * 2361 * 0.0002601 * 65 *
* 52 * -11 * 5.8025104 * 3084 * 0.0001559 * 27 *
* 53 * -11 * 10.213309 * 3084 * 0.0002972 * 17 *
* 54 * -11 * 12.832119 * 3084 * 0.0002463 * 29 *
* 55 * 11 * 0.3618347 * 2540 * 0.0034016 * 29 *
* 56 * 11 * 0.5063281 * 3084 * 0.0001194 * 22 *
* 57 * -11 * 11.250343 * 3084 * 0.0003123 * 18 *
* 58 * 11 * 0.6521067 * 3084 * 0.0002174 * 24 *
* 59 * 11 * 0.1810190 * 2543 * 0.0003782 * 71 *
* 60 * -11 * 26.993893 * 3084 * 0.0003740 * 23 *
* 61 * 11 * 0.6104166 * 2540 * 0.0001526 * 47 *
* 62 * -11 * 38.762917 * 3084 * 0.0006995 * 17 *
* 63 * -11 * 14.415399 * 2727 * 0.0001256 * 47 *
* 64 * 11 * 6.8315401 * 3084 * 0.0001870 * 24 *
* 65 * 11 * 2.4796035 * 3084 * 0.0002482 * 21 *
* 66 * -11 * 0.2472740 * 3084 * 0.0005049 * 113 *
* 67 * 11 * 1.5772067 * 3084 * 0.0002863 * 37 *
* 68 * -11 * 7.7838493 * 3084 * 0.0002034 * 31 *
* 69 * 11 * 0.3152484 * 3084 * 0.0005012 * 707 *
************************************************************************
out of 129477153 total tracks
Good point; for this to be more equivalent with the old implementation, instead of "kill immediately" we'd have to write "kill if it's looped for the (max input looping steps / field propagator internal loop count) consecutive time"... which is I guess about 8?
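As a quick check of that estimate, here is a minimal sketch of the ratio, rounded up. The function name `equivalent_looping_steps` is hypothetical, and the inputs are just the values discussed in this thread (a Geant4-style loop count of 1000 and a propagator limit of 128 or 100).

```cpp
#include <cassert>

// Hypothetical helper: number of consecutive looping steps needed so that
// (steps * substeps per step) covers the original total substep budget.
// Uses integer ceiling division.
constexpr int equivalent_looping_steps(int max_loop_count, int max_substeps)
{
    return (max_loop_count + max_substeps - 1) / max_substeps;
}

static_assert(equivalent_looping_steps(1000, 128) == 8,
              "1000 / 128 = 7.8125 rounds up to 8");
static_assert(equivalent_looping_steps(1000, 100) == 10,
              "1000 / 100 is exactly 10");
```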
Oh yes sorry, that's what I meant by "increasing the number of steps allowed while looping accordingly" -- these results (and the regression results above with 100 max_substeps) already had that modification in along-step.
This is a very nice study! Since @sethrj has already scrutinized the proposed changes, I would rather add some general comments:
This looks great, @amandalund! I have only one small suggestion to consider.
I merged this branch before waiting for hip-ndebug to finish, but after the last changes, one of the Field propagator tests fails on hip-ndebug:

SimpleCmsTest.vecgeom_failure
[ RUN ] SimpleCmsTest.vecgeom_failure
/var/jenkins/workspace/Celeritas_PR-685/test/celeritas/field/FieldPropagator.test.cc:1275: Failure
Value of: calc_radius()
  Actual: 125.08080441423482
Expected: 125
Which is: 125 (Relative error 0.00064643531387855542 exceeds tolerance 0.0001)
/var/jenkins/workspace/Celeritas_PR-685/test/celeritas/field/FieldPropagator.test.cc:1279: Failure
Value of: result.distance
  Actual: 11.676851876556075
Expected: 14.946488966946923
Which is: 14.946488966946923 (Relative error -0.21875619736657978 exceeds tolerance 9.9999999999999998e-13)
/var/jenkins/workspace/Celeritas_PR-685/test/celeritas/field/FieldPropagator.test.cc:1281: Failure
Expected equality of these values:
  9984
  stepper.count()
    Which is: 7800
[ FAILED ] SimpleCmsTest.vecgeom_failure (1 ms)

TestEm15MscField.device
[ RUN ] TestEm15MscField.device
status: Reading and building Livermore PE model data
status: Reading and building Seltzer Berger model data
/var/jenkins/workspace/Celeritas_PR-685/test/celeritas/global/Stepper.test.cc:587: Failure
Expected equality of these values:
  17
  result.num_step_iters()
    Which is: 21
/var/jenkins/workspace/Celeritas_PR-685/test/celeritas/global/Stepper.test.cc:588: Failure
Value of: result.calc_avg_steps_per_primary()
  Actual: 41.625
Expected: 34
Which is: 34 (Relative error 0.22426470588235295 exceeds tolerance 9.9999999999999998e-13)
/var/jenkins/workspace/Celeritas_PR-685/test/celeritas/global/Stepper.test.cc:589: Failure
Expected equality of these values:
  5
  result.calc_emptying_step()
    Which is: 9
/var/jenkins/workspace/Celeritas_PR-685/test/celeritas/global/Stepper.test.cc:590: Failure
Expected equality of these values:
  RunResult::StepCount({1, 10})
    Which is: (1, 10)
  result.calc_queue_hwm()
    Which is: (7, 11)
[ FAILED ] TestEm15MscField.device (30 ms)

Unfortunately, it looks like these do pass on the Crusher ndebug configuration (ROCM 5.1) even though they fail on the CI (ROCM 5.4). It still passes on crusher with ROCM 5.4 (Clang 15.0.0). The only difference I immediately see is that Crusher's GPU is …
In many of our regression problems we're seeing unconverged tracks: a few tracks are still alive after a very large number of steps. These appear to be looping particles (low energy, in low density materials, taking many substeps in the FieldPropagator). I added an along-step test that reproduces one of these tracks from simple-cms+field+msc-vecgeom-gpu that was still alive after 100k steps in the transport loop, taking max_substeps = 128 iterations in the FieldPropagator at each step (when I tried removing the max_substeps limit, it took ~150k substeps in the propagator before reaching a boundary).

Geant4 will kill certain tracks if they are found to be looping:
This implements the same strategy for killing looping tracks as Geant4. Changes include:
* Increasing max_substeps in the FieldPropagator from 128 to the Geant4 default of 1000
* Adding a SimParams data/class and a sim track state to store the number of steps the particle has taken since it was flagged as looping

I reran the regression suite with this branch (with timing results from runs with unconverged tracks discarded, and problems with unconverged tracks in red). All the regression problems now converge. Each plot shows the mean total time over four runs:
regression GPU performance
regression CPU performance
We can use the ROOT MC truth output filtered on the implicit "abandon looping" action to investigate which tracks are being killed:
testem3-flat+field+msc-vecgeom-cpu abandoned looping tracks
simple-cms+field+msc-vecgeom-cpu abandoned looping tracks
cms2018+field+msc-vecgeom-cpu abandoned looping tracks
Questions/TODO:
* … (G4PropagatorInField::GetMaxLoopCount())
* … FieldPropagator. Reducing that value from 1000 to 100 (and increasing the number of steps a track is allowed to take while looping to compensate) gives a small speedup:
regression GPU performance
regression CPU performance
so we may be able to play with those parameters to improve load balancing/performance.