Identify and kill looping tracks #685

amandalund · 2023-03-15T21:23:55Z

In many of our regression problems we're seeing unconverged tracks: a few tracks are still alive after a very large number of steps. These appear to be looping particles (low energy, in low density materials, taking many substeps in the FieldPropagator). I added an along-step test that reproduces one of these tracks from simple-cms+field+msc-vecgeom-gpu that was still alive after 100k steps in the transport loop, taking max_substeps = 128 iterations in the FieldPropagator at each step (when I tried removing the max_substeps limit, it took ~150k substeps in the propagator before reaching a boundary).

Geant4 will kill certain tracks if they are found to be looping:

The propagation in field will flag a track as looping if it takes more than some max number of substeps (default 1000)
If the particle is stable and its energy is below some threshold (the "important" energy), transportation will kill it immediately if it was flagged as looping
If the energy is above the threshold, the track is allowed to continue stepping. If it is still looping after some threshold number of steps, it will be killed

This implements the same strategy for killing looping tracks as Geant4. Changes include:

Importing the looping threshold parameters from Geant4
Increasing max_substeps in the FieldPropagator from 128 to the Geant4 default of 1000
Adding SimParams data/class and a sim track state to store the number of steps the particle has taken since it was flagged as looping
Killing looping tracks in along-step if the conditions described above are met
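
As a rough illustration of the kill criteria described above (not the code in this branch), the decision could be sketched as follows. All names here (`LoopingThreshold`, `TrackLoopState`, `update_looping`) are hypothetical, and the threshold values are supplied by the caller rather than taken from Geant4. The sketch also assumes that the looping counter resets whenever a step is not flagged as looping, and that unstable particles are never killed.

```cpp
#include <cassert>

// Hypothetical container for the imported looping thresholds.
struct LoopingThreshold
{
    int max_substeps;         // substeps in one step before it counts as looping
    int max_looping_steps;    // looping steps allowed above the energy threshold
    double important_energy;  // tracks below this energy are killed immediately
};

// Hypothetical per-track state: steps taken since being flagged as looping.
struct TrackLoopState
{
    int num_looping_steps = 0;
};

enum class LoopingAction
{
    keep_tracking,
    kill
};

// Decide what to do with a track after one along-step propagation.
// `step_was_looping` is true if the propagator hit its substep limit.
inline LoopingAction update_looping(LoopingThreshold const& thresh,
                                    bool is_stable,
                                    double energy,
                                    bool step_was_looping,
                                    TrackLoopState& state)
{
    assert(thresh.max_looping_steps > 0);

    if (!step_was_looping)
    {
        // Assumption: a converged step clears the looping counter
        state.num_looping_steps = 0;
        return LoopingAction::keep_tracking;
    }

    ++state.num_looping_steps;

    if (!is_stable)
    {
        // Assumption: only stable particles are candidates for killing
        return LoopingAction::keep_tracking;
    }
    if (energy < thresh.important_energy)
    {
        // Below the "important" energy: kill as soon as it is flagged
        return LoopingAction::kill;
    }
    // Above the threshold: kill only after enough consecutive looping steps
    return state.num_looping_steps >= thresh.max_looping_steps
               ? LoopingAction::kill
               : LoopingAction::keep_tracking;
}
```

Keeping the counter in per-track state lets the check run once per step in along-step without any knowledge of the propagator internals beyond the "hit the substep limit" flag.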

I reran the regression suite with this branch (with timing results from runs with unconverged tracks discarded, and problems with unconverged tracks in red). All the regression problems now converge. Each plot shows the mean total time over four runs:

regression GPU performance

regression CPU performance

We can use the ROOT MC truth output filtered on the implicit "abandon looping" action to investigate which tracks are being killed:

testem3-flat+field+msc-vecgeom-cpu abandoned looping tracks

************************************************************************
*    Row   *  particle * pre_energ * pre_volum * step_leng * track_ste *
************************************************************************
*        0 *        11 * 17.410741 *        37 * 0.0021631 *        11 *
*        1 *       -11 * 20.725924 *        19 * 0.0027216 *        10 *
*        2 *        11 * 0.1155200 *       100 * 191.64264 *         7 *
*        3 *        11 * 2.0564479 *       100 * 410.76694 *         7 *
************************************************************************

out of 130261723 total tracks

simple-cms+field+msc-vecgeom-cpu abandoned looping tracks

************************************************************************
*    Row   *  particle * pre_energ * pre_volum * step_leng * track_ste *
************************************************************************
*        0 *        11 * 0.4284925 *         0 * 328.92998 *         7 *
*        1 *        11 * 0.5415748 *         6 * 336.01412 *        19 *
*        2 *        11 * 0.0328769 *         0 * 141.95439 *        17 *
*        3 *        11 * 8.7250492 *         6 * 1082.8599 *        24 *
*        4 *        11 * 0.3138696 *         6 * 281.41158 *        12 *
*        5 *        11 * 1.1964144 *         6 * 475.82615 *         5 *
*        6 *        11 * 0.2774569 *         6 * 219.11813 *         6 *
*        7 *        11 * 2.4181753 *         6 * 872.61418 *         4 *
*        8 *       -11 * 0.7633382 *         6 * 332.14011 *         8 *
*        9 *       -11 * 1.5476162 *         6 * 366.68110 *         6 *
*       10 *        11 * 6.6343545 *         6 * 1012.3510 *         7 *
*       11 *       -11 * 6.9846367 *         6 * 1171.6445 *        21 *
*       12 *       -11 * 3.8364578 *         6 * 641.42077 *        13 *
*       13 *        11 * 1.2989297 *         6 * 382.06274 *         5 *
*       14 *        11 * 9.3976182 *         6 * 1296.6854 *         2 *
*       15 *        11 * 1.5514753 *         6 * 402.37965 *         8 *
*       16 *        11 * 8.8955081 *         6 * 1302.1596 *         2 *
*       17 *       -11 * 2.8309174 *         6 * 1026.2208 *         2 *
*       18 *       -11 * 9.4258080 *         6 * 1170.6220 *         4 *
*       19 *       -11 * 2.8718643 *         6 * 660.33176 *         5 *
*       20 *        11 * 3.4736794 *         6 * 1081.4739 *         3 *
*       21 *       -11 * 0.9931597 *         6 * 364.33565 *        20 *
*       22 *        11 * 8.8151281 *         6 * 1003.4138 *        24 *
*       23 *       -11 * 9.3939252 *         6 * 1358.6969 *        15 *
*       24 *       -11 * 1.3479123 *         6 * 492.88287 *         4 *
*       25 *        11 * 5.6744346 *         6 * 1229.8893 *         9 *
*       26 *        11 * 3.0527115 *         6 * 625.72645 *         3 *
*       27 *        11 * 2.4160603 *         6 * 490.33345 *         4 *
*       28 *        11 * 0.2229918 *         0 * 224.05321 *        21 *
************************************************************************

out of 156691602 total tracks

cms2018+field+msc-vecgeom-cpu abandoned looping tracks

************************************************************************
*    Row   *  particle * pre_energ * pre_volum * step_leng * track_ste *
************************************************************************
*        0 *        11 * 0.5121168 *      5065 * 261.40531 *         8 *
*        1 *        11 * 0.1947578 *      5082 * 380.81933 *        55 *
*        2 *        11 * 0.2006966 *      5082 * 369.53779 *        77 *
*        3 *        11 * 0.0965214 *      5082 * 213.93839 *        23 *
*        4 *       -11 * 1.4034612 *      5082 * 375.11510 *        95 *
*        5 *       -11 * 2.6108996 *      5087 * 473.23694 *       146 *
*        6 *        11 * 1.1036962 *      5082 * 321.68408 *        36 *
*        7 *        11 * 9.0519847 *      3084 * 0.0020934 *         6 *
*        8 *       -11 * 14.733626 *      2363 * 0.0026204 *        35 *
*        9 *        11 * 27.626263 *      3084 * 0.0049625 *        11 *
*       10 *        11 *  5.879322 *      3084 * 0.0031583 *        42 *
*       11 *       -11 * 15.457228 *      3084 * 0.0058699 *         9 *
*       12 *        11 * 2.5345548 *      5082 * 503.93044 *        13 *
*       13 *       -11 * 2.6174246 *      5082 * 479.84149 *        76 *
*       14 *       -11 * 5.4691775 *      3084 * 0.0049343 *        17 *
*       15 *       -11 * 10.036764 *      3084 * 0.0053894 *        10 *
*       16 *       -11 * 8.9171694 *      3084 * 0.0028921 *        40 *
*       17 *       -11 * 9.7544970 *      3084 * 0.0030024 *        22 *
*       18 *        11 * 5.1466872 *      3084 * 0.0012914 *        12 *
*       19 *       -11 * 18.409088 *      3084 * 0.0077950 *        36 *
*       20 *       -11 *  52.40507 *      3084 * 0.0032222 *        46 *
************************************************************************

out of 129476491 total tracks

Questions/TODO:

Currently when a looping track is killed it is simply abandoned, but I'm not sure this is the correct approach -- should we be depositing the energy?
Also import the maximum substeps in the field propagator from Geant4 (G4PropagatorInField::GetMaxLoopCount())
The slightly worse performance (mainly in the problems where there are unconverged tracks) is likely due to increasing the max substeps in the FieldPropagator. Reducing that value from 1000 to 100 (and increasing the number of steps a track is allowed to take while looping to compensate) gives a small speedup:

regression GPU performance

regression CPU performance

so we may be able to play with those parameters to improve load balancing/performance.

There are strategies for reducing the incidence of looping tracks (described in the Geant4 manual) that we may want to try

sethrj · 2023-03-15T22:27:35Z

@amandalund @stognini identified one of the stuck track volumes in cms2018, vol ID 3084, as "ESPM": googling shows that CMS through Geant4 (with and without vecgeom) has this problem:

https://indico.cern.ch/event/629803/contributions/2820951/attachments/1573948/2484706/G4-Talk11.pdf

sethrj

Looks good overall, we may need to brainstorm for a few minutes about how to use the smaller value for "max substeps" while keeping the "kill looping particles" value the same. Maybe the value we save to the TransParameters should be "maximum substeps before killing" and build that factor of 1000 into the geant exporter? And maybe we change "immediately kill" to have a second threshold value from an implicit "1 step" to an explicit "number of unconverged steps for unimportant tracks"?

sethrj · 2023-03-16T12:25:51Z

src/celeritas/field/FieldPropagator.hh

@@ -62,7 +62,7 @@ class FieldPropagator
    inline CELER_FUNCTION result_type operator()(real_type dist);

    //! Limit on substeps
-    static CELER_CONSTEXPR_FUNCTION short int max_substeps() { return 128; }
+    static CELER_CONSTEXPR_FUNCTION short int max_substeps() { return 1000; }


As discussed, let's turn this down to 128 and adjust the looping kill threshold to match.

amandalund · 2023-03-20

Will do. One thing to keep in mind is that by lowering this threshold (and increasing the number of steps allowed while looping accordingly) we seem to be killing more "stuck" tracks in cms2018 than we were with the higher threshold. These were the tracks with an "abandoned looping" action with max_substeps set to 100:

************************************************************************
*    Row   *  particle * pre_energ * pre_volum * step_leng * track_ste *
************************************************************************
*        0 *        11 * 0.5121168 *      5065 * 26.678759 *        17 *
*        1 *        11 * 0.4632876 *      5082 * 41.447479 *        70 *
*        2 *        11 * 0.1013242 *      5093 * 23.316498 *       115 *
*        3 *        11 * 2.6904279 *      5063 * 46.786605 *        41 *
*        4 *        11 * 0.4679226 *      2221 * 0.0004375 *        30 *
*        5 *       -11 * 0.7619003 *      5069 * 28.127852 *       193 *
*        6 *        11 * 1.0009525 *      5063 * 30.970636 *        25 *
*        7 *        11 * 1.0137611 *      2555 * 0.0003759 *        19 *
*        8 *        11 * 12.893576 *      3084 * 0.0001773 *        20 *
*        9 *        11 * 0.5493182 *      5082 * 25.224891 *        31 *
*       10 *        11 * 5.7339849 *      3084 * 0.0001562 *        33 *
*       11 *        11 * 0.6566941 *      5082 * 47.337937 *        26 *
*       12 *        11 * 56.431254 *      3084 * 0.0001120 *        47 *
*       13 *       -11 * 16.367950 *      2411 * 0.0001352 *        36 *
*       14 *        11 * 5.9874424 *      3084 * 0.0002270 *        22 *
*       15 *        11 *  20.15793 *      3084 * 0.0001632 *        39 *
*       16 *       -11 * 0.6087991 *      2423 * 0.0001202 *        73 *
*       17 *        11 * 19.771174 *      2449 * 0.0002086 *        17 *
*       18 *        11 * 4.9902261 *      3084 * 0.0001984 *        47 *
*       19 *       -11 * 9.1363966 *      2603 * 0.0003791 *        67 *
*       20 *        11 * 6.1166736 *      3084 * 0.0001377 *        39 *
*       21 *        11 * 5.1025576 *      2803 * 0.0001873 *        12 *
*       22 *        11 * 10.336403 *      3084 * 0.0005061 *        19 *
*       23 *        11 * 1.6552678 *      2543 * 0.0002113 *        32 *
*       24 *        11 * 30.658153 *      2499 * 0.0001122 *        13 *
*       25 *        11 * 0.7747261 *      3084 * 0.0001552 *        45 *
*       26 *        11 * 10.615128 *      3084 * 0.0001538 *        59 *
*       27 *        11 * 9.8609919 *      2399 * 0.0021863 *        32 *
*       28 *       -11 * 16.588916 *      3084 * 0.0001552 *        21 *
*       29 *       -11 * 22.491110 *      3084 * 0.0003995 *        24 *
*       30 *        11 * 1.4691755 *      2373 * 0.0001356 *        17 *
*       31 *       -11 * 12.016336 *      2951 * 0.0001250 *        20 *
*       32 *        11 * 1.4753258 *      2540 * 0.0009912 *        34 *
*       33 *        11 * 0.3667352 *      2903 * 0.0014047 *        13 *
*       34 *        11 * 0.5869807 *      2540 * 0.0004919 *        27 *
*       35 *        11 * 0.3228921 *      3084 * 0.0008961 *        66 *
*       36 *        11 * 2.3988821 *      5082 * 44.248788 *        18 *
*       37 *        11 * 6.2385344 *      2523 * 0.0065271 *       534 *
*       38 *        11 * 1.9358633 *      3084 * 0.0002501 *        26 *
*       39 *        11 * 0.0978481 *      3084 * 0.0001941 *        29 *
*       40 *        11 * 1.8782077 *      2779 * 0.0005870 *        44 *
*       41 *        11 * 0.1482836 *      2449 * 0.0030499 *        49 *
*       42 *        11 * 20.478014 *      2535 * 0.0003603 *        39 *
*       43 *        11 * 0.3548805 *      2543 * 0.0002631 *        41 *
*       44 *       -11 * 8.5498498 *      2651 * 0.0002175 *        13 *
*       45 *       -11 * 8.5716421 *      3084 * 0.0001628 *        20 *
*       46 *        11 * 1.7660504 *      2543 * 0.0001756 *        36 *
*       47 *        11 * 0.1619487 *      5079 * 17.756033 *        37 *
*       48 *        11 * 0.2548802 *      3084 * 0.0097583 *        75 *
*       49 *        11 * 0.5579446 *      3084 * 0.0006049 *        25 *
*       50 *       -11 * 3.3970329 *      3084 * 0.0001787 *        18 *
*       51 *        11 * 0.1322118 *      2361 * 0.0002601 *        65 *
*       52 *       -11 * 5.8025104 *      3084 * 0.0001559 *        27 *
*       53 *       -11 * 10.213309 *      3084 * 0.0002972 *        17 *
*       54 *       -11 * 12.832119 *      3084 * 0.0002463 *        29 *
*       55 *        11 * 0.3618347 *      2540 * 0.0034016 *        29 *
*       56 *        11 * 0.5063281 *      3084 * 0.0001194 *        22 *
*       57 *       -11 * 11.250343 *      3084 * 0.0003123 *        18 *
*       58 *        11 * 0.6521067 *      3084 * 0.0002174 *        24 *
*       59 *        11 * 0.1810190 *      2543 * 0.0003782 *        71 *
*       60 *       -11 * 26.993893 *      3084 * 0.0003740 *        23 *
*       61 *        11 * 0.6104166 *      2540 * 0.0001526 *        47 *
*       62 *       -11 * 38.762917 *      3084 * 0.0006995 *        17 *
*       63 *       -11 * 14.415399 *      2727 * 0.0001256 *        47 *
*       64 *        11 * 6.8315401 *      3084 * 0.0001870 *        24 *
*       65 *        11 * 2.4796035 *      3084 * 0.0002482 *        21 *
*       66 *       -11 * 0.2472740 *      3084 * 0.0005049 *       113 *
*       67 *        11 * 1.5772067 *      3084 * 0.0002863 *        37 *
*       68 *       -11 * 7.7838493 *      3084 * 0.0002034 *        31 *
*       69 *        11 * 0.3152484 *      3084 * 0.0005012 *       707 *
************************************************************************

out of 129477153 total tracks

sethrj · 2023-03-20

Good point; for this to be more equivalent with the old implementation, instead of "kill immediately" we'd have to write "kill if it's looped for the (max input looping steps / field propagator internal loop count) consecutive time"... which is I guess about 8?

amandalund · 2023-03-20

Oh yes sorry, that's what I meant by "increasing the number of steps allowed while looping accordingly" -- these results (and the regression results above with 100 max_substeps) already had that modification in along-step.
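
The compensation discussed in this thread amounts to a round-up integer division. The sketch below is only an illustration under stated assumptions: `ceil_div` here is a local stand-in (the helper used in the branch may have a different signature), `scaled_looping_steps` is a hypothetical name, and multiplying the step threshold by the ratio is one plausible way to compensate, not necessarily what along-step does.

```cpp
#include <cassert>

// Round-up integer division for positive operands.
// Local stand-in: the project's own ceil_div() may differ.
constexpr int ceil_div(int top, int bottom)
{
    return (top + bottom - 1) / bottom;
}

// Hypothetical helper: scale a looping-step threshold so that a track is
// allowed roughly the same total number of substeps when the propagator's
// internal limit is lower than the reference (imported) substep limit.
inline int scaled_looping_steps(int reference_substeps,
                                int propagator_substeps,
                                int reference_steps)
{
    assert(reference_substeps > 0);
    assert(propagator_substeps > 0);
    assert(reference_steps > 0);
    return reference_steps * ceil_div(reference_substeps, propagator_substeps);
}

// The "about 8" estimate above: 1000 / 128 rounded up
static_assert(ceil_div(1000, 128) == 8, "1000 substeps in chunks of 128");
```

With the values from this thread, `ceil_div(1000, 128)` gives 8, so "kill immediately" (one looping step at a 1000-substep limit) would become "kill after 8 consecutive looping steps" at a 128-substep limit. For a limit of 100, `ceil_div(1000, 100)` gives 10.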

whokion · 2023-03-17T17:15:28Z

This is a very nice study! Since @sethrj has already scrutinized the proposed changes, I would rather add some general comments:

We may consider imposing a reasonable user step limit (for CMS, for example, a fixed default user step limit of 20 cm, roughly the depth/length of the EM calorimeter, or a limit based on the radiation length of the material associated with the volume, 10 * X_0, which is more general but still arbitrary).
We may start to optimize/study the default field parameters, especially the error tolerance, pgrow for the next chord, etc. The goal is to minimize the number of substeps within a reasonable maximum. In general, 128 is already too big and should be considered a rare or problematic case that really needs to be understood. This should also be treated differently from the real looping case, i.e., an almost circular trajectory that does not move much along the line of flight after a full turn.
Of course, we need to understand the small substep length problem (bounded by the error tolerance) within the context of the field driver/propagation algorithms, especially for the high momentum tracks with a small step in cms2018+field+msc-vecgeom-cpu.
Concerning how to deal with looping tracks when they are killed, depositing the energy at the local position may be a first/lazy choice if the particle energy is above a certain threshold. This is still a good approximation for the local energy measurement if the curvature of the trajectory is relatively smaller than the dimension of the associated volume. It may be the best one can do, or it may not matter from the tracking point of view, since the track will be killed anyway beyond the last hit point, likely not inducing any change deposition in the tracker readout. Neither will be correct anyway, so take the option with less possibility of wrongness (?).

sethrj

This looks great, @amandalund ! I have only one small suggestion to consider.

src/celeritas/track/SimParams.cc

sethrj · 2023-03-23T15:01:17Z

I merged this branch before waiting for hip-ndebug to finish, but after the last changes, one of the Field propagator tests fails on hip-ndebug:

SimpleCmsTest.vecgeom_failure

[ RUN      ] SimpleCmsTest.vecgeom_failure
/var/jenkins/workspace/Celeritas_PR-685/test/celeritas/field/FieldPropagator.test.cc:1275: Failure
Value of: calc_radius()
  Actual: 125.08080441423482
Expected: 125
Which is: 125
(Relative error 0.00064643531387855542 exceeds tolerance 0.0001)
/var/jenkins/workspace/Celeritas_PR-685/test/celeritas/field/FieldPropagator.test.cc:1279: Failure
Value of: result.distance
  Actual: 11.676851876556075
Expected: 14.946488966946923
Which is: 14.946488966946923
(Relative error -0.21875619736657978 exceeds tolerance 9.9999999999999998e-13)
/var/jenkins/workspace/Celeritas_PR-685/test/celeritas/field/FieldPropagator.test.cc:1281: Failure
Expected equality of these values:
  9984
  stepper.count()
    Which is: 7800
[  FAILED  ] SimpleCmsTest.vecgeom_failure (1 ms)

as does one of the stepper tests:

TestEm15MscField.device

[ RUN      ] TestEm15MscField.device
status: Reading and building Livermore PE model data
status: Reading and building Seltzer Berger model data
/var/jenkins/workspace/Celeritas_PR-685/test/celeritas/global/Stepper.test.cc:587: Failure
Expected equality of these values:
  17
  result.num_step_iters()
    Which is: 21
/var/jenkins/workspace/Celeritas_PR-685/test/celeritas/global/Stepper.test.cc:588: Failure
Value of: result.calc_avg_steps_per_primary()
  Actual: 41.625
Expected: 34
Which is: 34
(Relative error 0.22426470588235295 exceeds tolerance 9.9999999999999998e-13)
/var/jenkins/workspace/Celeritas_PR-685/test/celeritas/global/Stepper.test.cc:589: Failure
Expected equality of these values:
  5
  result.calc_emptying_step()
    Which is: 9
/var/jenkins/workspace/Celeritas_PR-685/test/celeritas/global/Stepper.test.cc:590: Failure
Expected equality of these values:
  RunResult::StepCount({1, 10})
    Which is: (1, 10)
  result.calc_queue_hwm()
    Which is: (7, 11)
[  FAILED  ] TestEm15MscField.device (30 ms)

Unfortunately, it looks like these do pass on the Crusher ndebug configuration (ROCM 5.1) even though they fail on the CI (ROCM 5.4).

It still passes on crusher with ROCM 5.4 (Clang 15.0.0). The only difference I immediately see is that Crusher's GPU is gfx90a whereas the CI is the slightly older gfx908?

amandalund added 5 commits March 14, 2023 01:51

Add simple-cms along-step test for unconverged track

8a9939b

Identify and kill looping tracks

c47871f

* Add SimParams with threshold values for detecting and killing loopers
* Increase maximum substeps in FieldPropagator to 1000 to match Geant4
* Kill looping tracks in along-step
* Update SimpleCmsAlongStep test

Import looping parameters from Geant4 Transportation process

f4f58ee

* Add from_import method to SimParams
* Update four-steel-slabs.root for tests

Add implicit action for killed looping tracks and add action ID to RS…

1e7187e

…W filter (this will make it easier to investigate the properties of abandoned looping tracks)

Fix along-step propagation logic (tracks can have a geo propagation l…

a9283e8

…imit even if they are not looping). Also:
* Rename looping action
* Clean up

amandalund added the field Magnetic field and propagation label Mar 15, 2023

amandalund requested review from sethrj and whokion March 15, 2023 21:23

sethrj added the enhancement New feature or request label Mar 15, 2023

sethrj mentioned this pull request Mar 15, 2023

Implement on-device error handling and reporting #687

Open

sethrj requested changes Mar 16, 2023

amandalund added 7 commits March 20, 2023 21:54

Address review feedback

6756456

Deposit energy of tracks killed while looping locally

c3701ba

Import max substeps in field propagator from Geant4

cd949b4

Reduce max substeps in field propagator and compensate when killing l…

2f818d9

…ooping tracks

Add basic sim test

562de6c

Merge remote-tracking branch 'upstream/master' into loopers

dd8dca4

Merge remote-tracking branch 'upstream/develop' into loopers

f256e06

sethrj approved these changes Mar 22, 2023

amandalund added 3 commits March 22, 2023 16:31

Fix diagnostic bin_energy() function when device is disabled

5541b81

Use ceil_div()

df5da8f

Update field propagator test where max substep limit is reached

3b3f85d

sethrj merged commit 70afb74 into celeritas-project:develop Mar 22, 2023

amandalund mentioned this pull request Mar 23, 2023

Add StreamId to allow thread-safe data access in Actions #693

Merged

sethrj mentioned this pull request Mar 23, 2023

Fix HIP test failure in field propagation #697

Merged

amandalund deleted the loopers branch March 24, 2023 12:50
