Identify and kill looping tracks #685

Merged: 15 commits, Mar 22, 2023

amandalund (Contributor) commented Mar 15, 2023

In many of our regression problems we're seeing unconverged tracks: a few tracks are still alive after a very large number of steps. These appear to be looping particles (low energy, in low-density materials, taking many substeps in the FieldPropagator). I added an along-step test that reproduces one of these tracks from simple-cms+field+msc-vecgeom-gpu: it was still alive after 100k steps in the transport loop, taking max_substeps = 128 iterations in the FieldPropagator at each step. (When I tried removing the max_substeps limit, it took ~150k substeps in the propagator before reaching a boundary.)

Geant4 will kill certain tracks if they are found to be looping:

  • The propagation in field will flag a track as looping if it takes more than some max number of substeps (default 1000)
  • If the particle is stable and its energy is below some threshold (the "important" energy), transportation will kill it immediately if it was flagged as looping
  • If the energy is above the threshold, the track is allowed to continue stepping. If it is still looping after some threshold number of steps, it will be killed

This implements the same strategy for killing looping tracks as Geant4. Changes include:

  • Importing the looping threshold parameters from Geant4
  • Increasing max_substeps in the FieldPropagator from 128 to the Geant4 default of 1000
  • Adding SimParams data/class and a sim track state to store the number of steps the particle has taken since it was flagged as looping
  • Killing looping tracks in along-step if the conditions described above are met
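
As a rough illustration of the kill conditions described above, here is a minimal sketch. All names (`LoopingThreshold`, `TrackState`, `update_and_check_looping`) are hypothetical and are not the Celeritas or Geant4 API. It also assumes that a track is flagged when it reaches the substep limit, that the counter resets once a step completes without hitting the limit, and that unstable particles below the energy threshold are handled like tracks above it; none of these details are spelled out in this description.

```cpp
#include <cassert>

// Hypothetical parameter set; field names are illustrative only.
struct LoopingThreshold
{
    int max_substeps;         // substeps in one propagation before flagging
    int max_looping_steps;    // flagged steps allowed before the track is killed
    double important_energy;  // [MeV] stable tracks below this die when flagged
};

// Hypothetical per-track state.
struct TrackState
{
    double energy;          // [MeV] kinetic energy
    bool is_stable;         // true if the particle does not decay
    int num_looping_steps;  // consecutive steps flagged as looping
};

// Update the looping counter after one field propagation and
// return true if the track should be killed.
inline bool update_and_check_looping(TrackState& track,
                                     int substeps_taken,
                                     LoopingThreshold const& thresh)
{
    // Assumption: reaching the substep limit is what flags the track
    bool const looping = substeps_taken >= thresh.max_substeps;
    if (!looping)
    {
        // Assumption: the count is of *consecutive* looping steps
        track.num_looping_steps = 0;
        return false;
    }
    ++track.num_looping_steps;

    // Stable, low-energy loopers are killed immediately
    if (track.is_stable && track.energy < thresh.important_energy)
    {
        return true;
    }
    // Everything else may keep stepping until the step threshold
    return track.num_looping_steps >= thresh.max_looping_steps;
}
```

With illustrative values such as `LoopingThreshold{1000, 10, 250.0}`, a flagged 1 MeV electron would be killed on its first looping step, while a flagged 500 MeV electron would be killed on its tenth consecutive looping step.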

I reran the regression suite with this branch (with timing results from runs with unconverged tracks discarded, and problems with unconverged tracks in red). All the regression problems now converge. Each plot shows the mean total time over four runs:

[Figure: regression GPU performance (regression-gpu)]

[Figure: regression CPU performance (regression-cpu)]

We can use the ROOT MC truth output filtered on the implicit "abandon looping" action to investigate which tracks are being killed:

testem3-flat+field+msc-vecgeom-cpu abandoned looping tracks
************************************************************************
*    Row   *  particle * pre_energ * pre_volum * step_leng * track_ste *
************************************************************************
*        0 *        11 * 17.410741 *        37 * 0.0021631 *        11 *
*        1 *       -11 * 20.725924 *        19 * 0.0027216 *        10 *
*        2 *        11 * 0.1155200 *       100 * 191.64264 *         7 *
*        3 *        11 * 2.0564479 *       100 * 410.76694 *         7 *
************************************************************************

out of 130261723 total tracks

simple-cms+field+msc-vecgeom-cpu abandoned looping tracks
************************************************************************
*    Row   *  particle * pre_energ * pre_volum * step_leng * track_ste *
************************************************************************
*        0 *        11 * 0.4284925 *         0 * 328.92998 *         7 *
*        1 *        11 * 0.5415748 *         6 * 336.01412 *        19 *
*        2 *        11 * 0.0328769 *         0 * 141.95439 *        17 *
*        3 *        11 * 8.7250492 *         6 * 1082.8599 *        24 *
*        4 *        11 * 0.3138696 *         6 * 281.41158 *        12 *
*        5 *        11 * 1.1964144 *         6 * 475.82615 *         5 *
*        6 *        11 * 0.2774569 *         6 * 219.11813 *         6 *
*        7 *        11 * 2.4181753 *         6 * 872.61418 *         4 *
*        8 *       -11 * 0.7633382 *         6 * 332.14011 *         8 *
*        9 *       -11 * 1.5476162 *         6 * 366.68110 *         6 *
*       10 *        11 * 6.6343545 *         6 * 1012.3510 *         7 *
*       11 *       -11 * 6.9846367 *         6 * 1171.6445 *        21 *
*       12 *       -11 * 3.8364578 *         6 * 641.42077 *        13 *
*       13 *        11 * 1.2989297 *         6 * 382.06274 *         5 *
*       14 *        11 * 9.3976182 *         6 * 1296.6854 *         2 *
*       15 *        11 * 1.5514753 *         6 * 402.37965 *         8 *
*       16 *        11 * 8.8955081 *         6 * 1302.1596 *         2 *
*       17 *       -11 * 2.8309174 *         6 * 1026.2208 *         2 *
*       18 *       -11 * 9.4258080 *         6 * 1170.6220 *         4 *
*       19 *       -11 * 2.8718643 *         6 * 660.33176 *         5 *
*       20 *        11 * 3.4736794 *         6 * 1081.4739 *         3 *
*       21 *       -11 * 0.9931597 *         6 * 364.33565 *        20 *
*       22 *        11 * 8.8151281 *         6 * 1003.4138 *        24 *
*       23 *       -11 * 9.3939252 *         6 * 1358.6969 *        15 *
*       24 *       -11 * 1.3479123 *         6 * 492.88287 *         4 *
*       25 *        11 * 5.6744346 *         6 * 1229.8893 *         9 *
*       26 *        11 * 3.0527115 *         6 * 625.72645 *         3 *
*       27 *        11 * 2.4160603 *         6 * 490.33345 *         4 *
*       28 *        11 * 0.2229918 *         0 * 224.05321 *        21 *
************************************************************************

out of 156691602 total tracks

cms2018+field+msc-vecgeom-cpu abandoned looping tracks
************************************************************************
*    Row   *  particle * pre_energ * pre_volum * step_leng * track_ste *
************************************************************************
*        0 *        11 * 0.5121168 *      5065 * 261.40531 *         8 *
*        1 *        11 * 0.1947578 *      5082 * 380.81933 *        55 *
*        2 *        11 * 0.2006966 *      5082 * 369.53779 *        77 *
*        3 *        11 * 0.0965214 *      5082 * 213.93839 *        23 *
*        4 *       -11 * 1.4034612 *      5082 * 375.11510 *        95 *
*        5 *       -11 * 2.6108996 *      5087 * 473.23694 *       146 *
*        6 *        11 * 1.1036962 *      5082 * 321.68408 *        36 *
*        7 *        11 * 9.0519847 *      3084 * 0.0020934 *         6 *
*        8 *       -11 * 14.733626 *      2363 * 0.0026204 *        35 *
*        9 *        11 * 27.626263 *      3084 * 0.0049625 *        11 *
*       10 *        11 *  5.879322 *      3084 * 0.0031583 *        42 *
*       11 *       -11 * 15.457228 *      3084 * 0.0058699 *         9 *
*       12 *        11 * 2.5345548 *      5082 * 503.93044 *        13 *
*       13 *       -11 * 2.6174246 *      5082 * 479.84149 *        76 *
*       14 *       -11 * 5.4691775 *      3084 * 0.0049343 *        17 *
*       15 *       -11 * 10.036764 *      3084 * 0.0053894 *        10 *
*       16 *       -11 * 8.9171694 *      3084 * 0.0028921 *        40 *
*       17 *       -11 * 9.7544970 *      3084 * 0.0030024 *        22 *
*       18 *        11 * 5.1466872 *      3084 * 0.0012914 *        12 *
*       19 *       -11 * 18.409088 *      3084 * 0.0077950 *        36 *
*       20 *       -11 *  52.40507 *      3084 * 0.0032222 *        46 *
************************************************************************

out of 129476491 total tracks

Questions/TODO:

  • Currently when a looping track is killed it is simply abandoned, but I'm not sure this is the correct approach -- should we be depositing the energy?
  • Also import the maximum substeps in the field propagator from Geant4 (G4PropagatorInField::GetMaxLoopCount())
  • The slightly worse performance (mainly in the problems where there are unconverged tracks) is likely due to increasing the max substeps in the FieldPropagator. Reducing that value from 1000 to 100 (and increasing the number of steps a track is allowed to take while looping to compensate) gives a small speedup, so we may be able to play with those parameters to improve load balancing/performance:

[Figure: regression GPU performance (regression-gpu-100iter)]

[Figure: regression CPU performance (regression-cpu-100iter)]

  • There are strategies for reducing the incidence of looping tracks (described in the Geant4 manual) that we may want to try

* Add SimParams with threshold values for detecting and killing loopers
* Increase maximum substeps in FieldPropagator to 1000 to match Geant4
* Kill looping tracks in along-step
* Update SimpleCmsAlongStep test
* Add from_import method to SimParams
* Update four-steel-slabs.root for tests
…W filter (this will make it easier to investigate the properties of abandoned looping tracks)
…imit even if they are not looping). Also:

* Rename looping action
* Clean up
@amandalund amandalund added the field Magnetic field and propagation label Mar 15, 2023
@amandalund amandalund requested review from sethrj and whokion March 15, 2023 21:23
@sethrj sethrj added the enhancement New feature or request label Mar 15, 2023
sethrj (Member) commented Mar 15, 2023

@amandalund @stognini identified one of the stuck track volumes in cms2018, vol ID 3084, as "ESPM": googling shows that CMS through Geant4 (with and without vecgeom) has this problem:

https://indico.cern.ch/event/629803/contributions/2820951/attachments/1573948/2484706/G4-Talk11.pdf

sethrj (Member) left a comment

Looks good overall, we may need to brainstorm for a few minutes about how to use the smaller value for "max substeps" while keeping the "kill looping particles" value the same. Maybe the value we save to the TransParameters should be "maximum substeps before killing" and build that factor of 1000 into the geant exporter? And maybe we change "immediately kill" to have a second threshold value from an implicit "1 step" to an explicit "number of unconverged steps for unimportant tracks"?

src/celeritas/em/msc/UrbanMsc.hh (resolved)
@@ -62,7 +62,7 @@ class FieldPropagator
     inline CELER_FUNCTION result_type operator()(real_type dist);
 
     //! Limit on substeps
-    static CELER_CONSTEXPR_FUNCTION short int max_substeps() { return 128; }
+    static CELER_CONSTEXPR_FUNCTION short int max_substeps() { return 1000; }
Member:

As discussed, let's turn this down to 128 and adjust the looping kill threshold to match.

Contributor Author:

Will do. One thing to keep in mind is that by lowering this threshold (and increasing the number of steps allowed while looping accordingly) we seem to be killing more "stuck" tracks in cms2018 than we were with the higher threshold. These were the tracks with an "abandoned looping" action with max_substeps set to 100:

************************************************************************
*    Row   *  particle * pre_energ * pre_volum * step_leng * track_ste *
************************************************************************
*        0 *        11 * 0.5121168 *      5065 * 26.678759 *        17 *
*        1 *        11 * 0.4632876 *      5082 * 41.447479 *        70 *
*        2 *        11 * 0.1013242 *      5093 * 23.316498 *       115 *
*        3 *        11 * 2.6904279 *      5063 * 46.786605 *        41 *
*        4 *        11 * 0.4679226 *      2221 * 0.0004375 *        30 *
*        5 *       -11 * 0.7619003 *      5069 * 28.127852 *       193 *
*        6 *        11 * 1.0009525 *      5063 * 30.970636 *        25 *
*        7 *        11 * 1.0137611 *      2555 * 0.0003759 *        19 *
*        8 *        11 * 12.893576 *      3084 * 0.0001773 *        20 *
*        9 *        11 * 0.5493182 *      5082 * 25.224891 *        31 *
*       10 *        11 * 5.7339849 *      3084 * 0.0001562 *        33 *
*       11 *        11 * 0.6566941 *      5082 * 47.337937 *        26 *
*       12 *        11 * 56.431254 *      3084 * 0.0001120 *        47 *
*       13 *       -11 * 16.367950 *      2411 * 0.0001352 *        36 *
*       14 *        11 * 5.9874424 *      3084 * 0.0002270 *        22 *
*       15 *        11 *  20.15793 *      3084 * 0.0001632 *        39 *
*       16 *       -11 * 0.6087991 *      2423 * 0.0001202 *        73 *
*       17 *        11 * 19.771174 *      2449 * 0.0002086 *        17 *
*       18 *        11 * 4.9902261 *      3084 * 0.0001984 *        47 *
*       19 *       -11 * 9.1363966 *      2603 * 0.0003791 *        67 *
*       20 *        11 * 6.1166736 *      3084 * 0.0001377 *        39 *
*       21 *        11 * 5.1025576 *      2803 * 0.0001873 *        12 *
*       22 *        11 * 10.336403 *      3084 * 0.0005061 *        19 *
*       23 *        11 * 1.6552678 *      2543 * 0.0002113 *        32 *
*       24 *        11 * 30.658153 *      2499 * 0.0001122 *        13 *
*       25 *        11 * 0.7747261 *      3084 * 0.0001552 *        45 *
*       26 *        11 * 10.615128 *      3084 * 0.0001538 *        59 *
*       27 *        11 * 9.8609919 *      2399 * 0.0021863 *        32 *
*       28 *       -11 * 16.588916 *      3084 * 0.0001552 *        21 *
*       29 *       -11 * 22.491110 *      3084 * 0.0003995 *        24 *
*       30 *        11 * 1.4691755 *      2373 * 0.0001356 *        17 *
*       31 *       -11 * 12.016336 *      2951 * 0.0001250 *        20 *
*       32 *        11 * 1.4753258 *      2540 * 0.0009912 *        34 *
*       33 *        11 * 0.3667352 *      2903 * 0.0014047 *        13 *
*       34 *        11 * 0.5869807 *      2540 * 0.0004919 *        27 *
*       35 *        11 * 0.3228921 *      3084 * 0.0008961 *        66 *
*       36 *        11 * 2.3988821 *      5082 * 44.248788 *        18 *
*       37 *        11 * 6.2385344 *      2523 * 0.0065271 *       534 *
*       38 *        11 * 1.9358633 *      3084 * 0.0002501 *        26 *
*       39 *        11 * 0.0978481 *      3084 * 0.0001941 *        29 *
*       40 *        11 * 1.8782077 *      2779 * 0.0005870 *        44 *
*       41 *        11 * 0.1482836 *      2449 * 0.0030499 *        49 *
*       42 *        11 * 20.478014 *      2535 * 0.0003603 *        39 *
*       43 *        11 * 0.3548805 *      2543 * 0.0002631 *        41 *
*       44 *       -11 * 8.5498498 *      2651 * 0.0002175 *        13 *
*       45 *       -11 * 8.5716421 *      3084 * 0.0001628 *        20 *
*       46 *        11 * 1.7660504 *      2543 * 0.0001756 *        36 *
*       47 *        11 * 0.1619487 *      5079 * 17.756033 *        37 *
*       48 *        11 * 0.2548802 *      3084 * 0.0097583 *        75 *
*       49 *        11 * 0.5579446 *      3084 * 0.0006049 *        25 *
*       50 *       -11 * 3.3970329 *      3084 * 0.0001787 *        18 *
*       51 *        11 * 0.1322118 *      2361 * 0.0002601 *        65 *
*       52 *       -11 * 5.8025104 *      3084 * 0.0001559 *        27 *
*       53 *       -11 * 10.213309 *      3084 * 0.0002972 *        17 *
*       54 *       -11 * 12.832119 *      3084 * 0.0002463 *        29 *
*       55 *        11 * 0.3618347 *      2540 * 0.0034016 *        29 *
*       56 *        11 * 0.5063281 *      3084 * 0.0001194 *        22 *
*       57 *       -11 * 11.250343 *      3084 * 0.0003123 *        18 *
*       58 *        11 * 0.6521067 *      3084 * 0.0002174 *        24 *
*       59 *        11 * 0.1810190 *      2543 * 0.0003782 *        71 *
*       60 *       -11 * 26.993893 *      3084 * 0.0003740 *        23 *
*       61 *        11 * 0.6104166 *      2540 * 0.0001526 *        47 *
*       62 *       -11 * 38.762917 *      3084 * 0.0006995 *        17 *
*       63 *       -11 * 14.415399 *      2727 * 0.0001256 *        47 *
*       64 *        11 * 6.8315401 *      3084 * 0.0001870 *        24 *
*       65 *        11 * 2.4796035 *      3084 * 0.0002482 *        21 *
*       66 *       -11 * 0.2472740 *      3084 * 0.0005049 *       113 *
*       67 *        11 * 1.5772067 *      3084 * 0.0002863 *        37 *
*       68 *       -11 * 7.7838493 *      3084 * 0.0002034 *        31 *
*       69 *        11 * 0.3152484 *      3084 * 0.0005012 *       707 *
************************************************************************

out of 129477153 total tracks

Member:

Good point; for this to be more equivalent with the old implementation, instead of "kill immediately" we'd have to write "kill if it's looped for the (max input looping steps / field propagator internal loop count) consecutive time"... which is I guess about 8?
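
The "about 8" estimate can be checked with a short sketch. The function name `equivalent_looping_steps` is hypothetical, and rounding up (rather than down) is an assumption made here so that the total substep count is never below the reference limit.

```cpp
#include <cassert>

// Number of consecutive propagations, each capped at
// `internal_max_substeps`, needed to accumulate at least
// `reference_max_substeps` substeps (ceiling division).
inline int equivalent_looping_steps(int reference_max_substeps,
                                    int internal_max_substeps)
{
    assert(reference_max_substeps > 0 && internal_max_substeps > 0);
    return (reference_max_substeps + internal_max_substeps - 1)
           / internal_max_substeps;
}
```

For the values discussed here, `equivalent_looping_steps(1000, 128)` gives 8 (1000 / 128 = 7.8125, rounded up), and `equivalent_looping_steps(1000, 100)` gives 10.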

Contributor Author:

Oh yes sorry, that's what I meant by "increasing the number of steps allowed while looping accordingly" -- these results (and the regression results above with 100 max_substeps) already had that modification in along-step.

src/celeritas/global/alongstep/AlongStep.hh (outdated, resolved)
src/celeritas/global/alongstep/AlongStep.hh (outdated, resolved)
src/celeritas/global/alongstep/AlongStep.hh (outdated, resolved)
src/celeritas/field/FieldPropagator.hh (outdated, resolved)
src/celeritas/io/ImportData.hh (outdated, resolved)
src/celeritas/track/SimData.hh (outdated, resolved)
src/celeritas/io/ImportData.hh (outdated, resolved)
src/celeritas/ext/GeantImporter.cc (outdated, resolved)
whokion (Contributor) commented Mar 17, 2023

This is a very nice study! Since @sethrj has already scrutinized the proposed changes, I will just add some general comments:

  • We may consider imposing a reasonable user step limit (for CMS, for example, a fixed default user step limit of 20 cm, roughly the depth/length of the EM calorimeter, or a limit based on the radiation length of the material associated with the volume, 10 * X_0, which is more general but still arbitrary).
  • We may start to optimize/study the default field parameters, especially the error tolerance, pgrow for the next chord, etc. The goal is to minimize the number of substeps within a reasonable maximum. In general, 128 is already too big and should be considered a rare or problematic case that really needs to be understood. This should also be treated differently from the real looping case, i.e., an almost circular trajectory that does not move much along the line of flight after a full turn.
  • Of course, we need to understand the small substep length problem (bounded by the error tolerance) within the context of the field driver/propagation algorithms, especially for the high-momentum tracks with a small step in cms2018+field+msc-vecgeom-cpu.
  • Concerning how to deal with looping tracks when they are killed, depositing the energy at the local position may be a first/lazy choice if the particle energy is above a certain threshold. This is still a good approximation for the local energy measurement if the curvature of the trajectory is small relative to the dimension of the associated volume. It may be the best that one can do, or it may not matter from the tracking point of view, as the track will be killed anyway beyond the last hit point, likely not inducing any change in deposition in the tracker readout. Neither choice will be correct anyway, so take the one less likely to be wrong (?).
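
To make the last point concrete, the two options (abandon the track, or deposit its energy locally when it is above a threshold) could be sketched as follows. The names `LoopingKillPolicy`, `KillResult`, and `kill_looping_track` are hypothetical, and the sketch only tracks kinetic energy; it is an illustration of the bookkeeping, not a proposal for the actual implementation.

```cpp
#include <cassert>

// Hypothetical policy switch for killed looping tracks.
enum class LoopingKillPolicy
{
    abandon,          // drop the track and its energy
    deposit_locally,  // score the energy where the track is killed
};

// Where the kinetic energy of the killed track ends up.
struct KillResult
{
    double deposited_energy;  // [MeV] scored in the current volume
    double lost_energy;       // [MeV] removed without being scored
};

// Kill a looping track: deposit its kinetic energy locally only if the
// policy asks for it and the energy exceeds `deposit_threshold`.
inline KillResult kill_looping_track(double kinetic_energy,
                                     double deposit_threshold,
                                     LoopingKillPolicy policy)
{
    assert(kinetic_energy >= 0);
    if (policy == LoopingKillPolicy::deposit_locally
        && kinetic_energy > deposit_threshold)
    {
        return {kinetic_energy, 0.0};
    }
    return {0.0, kinetic_energy};
}
```

Either way the sum of `deposited_energy` and `lost_energy` equals the track's kinetic energy, so the amount of unscored energy can be monitored when comparing the two choices.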

sethrj (Member) left a comment

This looks great, @amandalund ! I have only one small suggestion to consider.

src/celeritas/track/SimParams.cc (outdated, resolved)
@sethrj sethrj merged commit 70afb74 into celeritas-project:develop Mar 22, 2023
sethrj (Member) commented Mar 23, 2023

I merged this branch before waiting for hip-ndebug to finish, but after the last changes, one of the Field propagator tests fails on hip-ndebug:

SimpleCmsTest.vecgeom_failure
[ RUN      ] SimpleCmsTest.vecgeom_failure
/var/jenkins/workspace/Celeritas_PR-685/test/celeritas/field/FieldPropagator.test.cc:1275: Failure
Value of: calc_radius()
  Actual: 125.08080441423482
Expected: 125
Which is: 125
(Relative error 0.00064643531387855542 exceeds tolerance 0.0001)
/var/jenkins/workspace/Celeritas_PR-685/test/celeritas/field/FieldPropagator.test.cc:1279: Failure
Value of: result.distance
  Actual: 11.676851876556075
Expected: 14.946488966946923
Which is: 14.946488966946923
(Relative error -0.21875619736657978 exceeds tolerance 9.9999999999999998e-13)
/var/jenkins/workspace/Celeritas_PR-685/test/celeritas/field/FieldPropagator.test.cc:1281: Failure
Expected equality of these values:
  9984
  stepper.count()
    Which is: 7800
[  FAILED  ] SimpleCmsTest.vecgeom_failure (1 ms)
as does one of the stepper tests:
TestEm15MscField.device
[ RUN      ] TestEm15MscField.device
status: Reading and building Livermore PE model data
status: Reading and building Seltzer Berger model data
/var/jenkins/workspace/Celeritas_PR-685/test/celeritas/global/Stepper.test.cc:587: Failure
Expected equality of these values:
  17
  result.num_step_iters()
    Which is: 21
/var/jenkins/workspace/Celeritas_PR-685/test/celeritas/global/Stepper.test.cc:588: Failure
Value of: result.calc_avg_steps_per_primary()
  Actual: 41.625
Expected: 34
Which is: 34
(Relative error 0.22426470588235295 exceeds tolerance 9.9999999999999998e-13)
/var/jenkins/workspace/Celeritas_PR-685/test/celeritas/global/Stepper.test.cc:589: Failure
Expected equality of these values:
  5
  result.calc_emptying_step()
    Which is: 9
/var/jenkins/workspace/Celeritas_PR-685/test/celeritas/global/Stepper.test.cc:590: Failure
Expected equality of these values:
  RunResult::StepCount({1, 10})
    Which is: (1, 10)
  result.calc_queue_hwm()
    Which is: (7, 11)
[  FAILED  ] TestEm15MscField.device (30 ms)

Unfortunately, it looks like these do pass on the Crusher ndebug configuration (ROCM 5.1) even though they fail on the CI (ROCM 5.4).

It still passes on crusher with ROCM 5.4 (Clang 15.0.0). The only difference I immediately see is that Crusher's GPU is gfx90a whereas the CI is the slightly older gfx908?
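
For reading the failure logs above: the reported relative errors are consistent with (actual - expected) / expected. The helpers below are a simplified stand-in for that comparison, with hypothetical names; they are not the test harness's actual implementation, which may treat tolerances differently.

```cpp
#include <cassert>
#include <cmath>

// Signed relative error of `actual` with respect to a nonzero `expected`.
inline double relative_error(double actual, double expected)
{
    assert(expected != 0);
    return (actual - expected) / expected;
}

// True if the magnitude of the relative error is within `rel_tol`.
inline bool soft_equal(double actual, double expected, double rel_tol)
{
    return std::fabs(relative_error(actual, expected)) <= rel_tol;
}
```

For example, `relative_error(125.08080441423482, 125.0)` is about 6.46e-4, which exceeds the 1e-4 tolerance used for `calc_radius()`, and `relative_error(11.676851876556075, 14.946488966946923)` is about -0.219.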
