Improving performance: profiling scale_run simulations #286
After the changes made in #287, a corresponding profiling run as described above gives SnakeViz output that is significantly improved. There are still some bottlenecks, so I will keep this issue open for now to keep track of progress.
After the changes made in #308, a profiling run as described above gives further improved SnakeViz output. As running for longer simulation times is now tractable without being overly onerous, going forwards I'll use a 6 month simulation time run of […]
Summary of recent progress in reducing #313 (changing appointment footprint to sparse structure)
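To illustrate why a sparse appointment footprint can help, here is a minimal sketch. It is not the TLOmodel implementation: the appointment type names, the per-appointment minutes and both functions are hypothetical. It only shows that summing over the non-zero entries of a footprint gives the same time request as scanning a dense vector of mostly zeros.

```python
# All names and numbers here are hypothetical, for illustration only.
APPT_TYPES = [f"Appt{i}" for i in range(50)]
MINUTES_PER_APPT = {name: 5 + i for i, name in enumerate(APPT_TYPES)}


def time_request_dense(footprint):
    # Dense footprint: one count per appointment type, mostly zeros,
    # so every call scans all 50 types.
    return sum(
        count * MINUTES_PER_APPT[name]
        for name, count in zip(APPT_TYPES, footprint)
    )


def time_request_sparse(footprint):
    # Sparse footprint: dict holding only the appointment types actually used,
    # so the work scales with the number of non-zero entries.
    return sum(count * MINUTES_PER_APPT[name] for name, count in footprint.items())


sparse_footprint = {"Appt3": 1, "Appt17": 2}
dense_footprint = [sparse_footprint.get(name, 0) for name in APPT_TYPES]

dense_total = time_request_dense(dense_footprint)
sparse_total = time_request_sparse(sparse_footprint)
print(dense_total, sparse_total)
```

When most events use only one or two appointment types, the sparse form does a small fraction of the arithmetic per call, which matters for a function called over a million times per simulated month.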
A full 20 year / 20k population run of […]. The SnakeViz breakdown for a 20 year / 20k population run on current master (a691b72) is as follows. The ten event methods with the highest proportions of the overall run time are as follows:
To provide one additional datapoint, a 5 year run of […]
Thanks for all this @matt-graham; c. 5h for a 20k/20y simulation with all the bells and whistles feels really manageable to me!! I think going up to 50k would make sense, as it would provide a respectable c. 1800 persons per district.
Yes, there are now new completed disease modules that we should fold into the profiling scripts (notably […]). I'll also do a PR that updates the […].
Looking through the various things, I see that we rely on […].
But, apart from these thoughts, I don't think we need to change the spec of these runs. Let me know what you and @tamuri think and if you'd need me to do a PR suggesting these changes.
After the updates to […]. The ten event methods with the highest proportions of the overall run time are as follows:
Some initial thoughts about possible performance improvements:

The total time spent in functions / methods in the […]. The latter arises in the lines
TLOmodel/src/tlo/methods/malaria.py Lines 1226 to 1229 in 7812c64
which clear all the symptoms for a set of person IDs in a loop. There might therefore be some performance gain from extending […].

In […]
TLOmodel/src/tlo/methods/symptommanager.py Lines 499 to 505 in 7812c64

Running […]
TLOmodel/src/tlo/methods/alri.py Lines 1013 to 1019 in 7812c64
TLOmodel/src/tlo/methods/alri.py Lines 1531 to 1537 in 7812c64
TLOmodel/src/tlo/methods/diarrhoea.py Lines 1042 to 1053 in 7812c64
TLOmodel/src/tlo/methods/malaria.py Lines 482 to 490 in 7812c64
TLOmodel/src/tlo/methods/malaria.py Lines 524 to 531 in 7812c64
TLOmodel/src/tlo/methods/measles.py Lines 280 to 294 in 7812c64
TLOmodel/src/tlo/methods/symptommanager.py Lines 605 to 632 in 7812c64
TLOmodel/src/tlo/methods/symptommanager.py Lines 658 to 667 in 7812c64
which suggests there might also be some gain to either generalising […].

A considerable amount of time is being spent in […], both in operations in the function itself and in calling […].

Another possible target for optimization is the […].
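The idea of clearing symptoms for a whole set of person IDs at once, rather than in a per-person loop, can be sketched as follows. This is only an illustration under assumptions: the `has_symptom` store (a dict mapping each symptom name to a set of person IDs) and both function names are hypothetical, and this is not how SymptomManager stores or clears symptoms.

```python
def clear_symptoms_one_by_one(has_symptom, person_ids):
    # Loop form: handle each person separately, touching every symptom each time.
    for person_id in person_ids:
        for persons in has_symptom.values():
            persons.discard(person_id)


def clear_symptoms_batched(has_symptom, person_ids):
    # Batched form: one in-place set operation per symptom for the whole group.
    ids = set(person_ids)
    for persons in has_symptom.values():
        persons.difference_update(ids)


def make_store():
    # Hypothetical example data: symptom name -> IDs of persons with that symptom.
    return {"fever": {1, 2, 3, 4}, "headache": {2, 4, 5}, "vomiting": {3}}


looped = make_store()
clear_symptoms_one_by_one(looped, [2, 3])

batched = make_store()
clear_symptoms_batched(batched, [2, 3])

print(looped == batched)
```

Both forms leave the store in the same state. The batched form replaces a Python-level loop over persons with one bulk operation per symptom, which is the kind of gain an interface accepting multiple person IDs could offer.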
Thanks for this Matt. My first quick reactions:
Thanks @tbhallett
Okay it seems like this is probably a good first target for me to work on then!
Thanks, that's useful to know. I'll hold off on looking at specific optimisation of the spurious-symptoms-related events for now, until we've checked whether reducing the rate is sufficient to make these events non-performance-critical.
Ah, that's an interesting idea. Given the number of generic HSI events and also the large time spent scheduling them in […]
A 5 year / 50k initial population run of […], with the overall simulation time 25320s. There seem to be a few parts of some of the newer / rewritten modules that would benefit from some refactoring to improve performance:
Can we get a recent profile viz? I know at least the RTI fix was merged (#682). Thanks.
A 5 year / 40k initial population run of scale_run.py on 6b9a1d5 gives the following SnakeViz plot of the profiling results. The ten methods with the highest proportions of the overall time are the following:
SnakeViz shows larger overheads than […].
Please note that to make the HTML files accessible in their current format I had to make them public; let me know if this is an issue.
Thanks, Dimitra. My vote for the two worth exploring is the HealthSeekingBehaviourPoll (4) and then the SchistoMatureWorms event (2). The first seems to be a fairly lengthy LinearModel which is actually not called very often. It might improve if the LinearModel is set up using a custom function. The second looks like an individual event that is called many times. We need to discuss with @tbhallett (perhaps @tdm32?) whether that can be replaced by a population-level event without losing important behaviour of the model.
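The difference between many individual events and a single population-level event can be sketched in plain Python. This is a toy illustration under assumptions: the event queue, the daily population event and all names are hypothetical and do not reflect the tlo event classes or the schisto module. Whether such a replacement preserves the model's behaviour is exactly the open question raised above.

```python
import heapq


def run_individual_events(maturation_day_by_person):
    # One queued event per person: queue traffic grows with population size.
    queue = [(day, person) for person, day in maturation_day_by_person.items()]
    heapq.heapify(queue)
    matured, events_processed = [], 0
    while queue:
        day, person = heapq.heappop(queue)
        matured.append((day, person))
        events_processed += 1
    return matured, events_processed


def run_population_events(maturation_day_by_person, num_days):
    # One event per day that handles everyone due on that day.
    due_by_day = {}
    for person, day in maturation_day_by_person.items():
        due_by_day.setdefault(day, []).append(person)
    matured, events_processed = [], 0
    for day in range(num_days):
        events_processed += 1
        for person in sorted(due_by_day.get(day, [])):
            matured.append((day, person))
    return matured, events_processed


# Hypothetical data: 1000 persons whose worms mature on days 0 to 6.
maturation_days = {person: person % 7 for person in range(1000)}

individual_result, individual_count = run_individual_events(maturation_days)
population_result, population_count = run_population_events(maturation_days, num_days=7)

print(individual_count, population_count)
```

In this toy setting both approaches process the same persons in the same order, but the population-level form handles 7 events instead of 1000. The trade-off is that timing is resolved only to the granularity of the population event.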
On the second point about […]
Making a note that scale runs, which so far have focused on mode_appt_constraints = 1, should in future also consider mode_appt_constraints = 2 (see PR #986), as this may become more widely used by analysts. The performance issue already identified is related to the length of the HSI queue, which depends on assumptions around tclose (see Issue #999). However, note that since this issue was first written up, some changes were made in the way the HSI queue is queried under mode_appt_constraints = 2, which may have improved matters.
/run profiling
The current version of src/scripts/profiling/scale_run.py is running very slowly. After reducing the simulation time to 1 month it still took around 2.5 hours to complete a run on my laptop, so if this simulation rate remained the same, running for the full 20 years here would take around 25 days!

From some trial-and-error, the main causes of the slowdown compared to the previous script (which for comparison took around 30 minutes to do a 2 year simulation with the same population size, but a different set of modules / configuration) seem to be both setting spurious_symptoms=True in the initializer for the SymptomManager module (TLOmodel/src/scripts/profiling/scale_run.py, line 66 in 63f58f0) and setting mode_appt_constraints=2 and capabilities_coefficient=0.01 in the initializer for the HealthSystem module (TLOmodel/src/scripts/profiling/scale_run.py, lines 71 to 74 in 63f58f0), with each individually causing a large slowdown. The additional disease modules seem to add only a small overhead in comparison.
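The 25 day estimate follows from a simple linear extrapolation of the timings quoted above. As a small worked check, assuming the run time scales linearly with simulated time (an assumption, since population growth over 20 years would likely change the rate):

```python
# Timings quoted in the issue text.
hours_per_simulated_month = 2.5   # current script: 1 simulated month in ~2.5 hours
simulated_years = 20

simulated_months = simulated_years * 12
total_hours = hours_per_simulated_month * simulated_months
total_days = total_hours / 24
print(f"{total_hours:.0f} hours = {total_days:.0f} days")  # 600 hours = 25 days

# Previous script: ~30 minutes for a 2 year simulation (different modules / configuration).
previous_minutes_per_month = 30 / (2 * 12)
current_minutes_per_month = hours_per_simulated_month * 60
slowdown = current_minutes_per_month / previous_minutes_per_month
print(f"roughly {slowdown:.0f}x slower per simulated month")  # roughly 120x slower
```

The slowdown factor is only indicative, as the two scripts use different sets of modules and configuration.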
In a SnakeViz plot of a 1 month simulation run with the original configuration, just under 80% of the run time is being spent in get_appt_footprint_as_time_request, which is called 1351075 times.

In comparison, in the SnakeViz plot for a 1 month simulation run with spurious_symptoms=False in SymptomManager and mode_appt_constraints=0, capabilities_coefficient=1 in HealthSystem, the percentage of time spent in get_appt_footprint_as_time_request is much reduced though still significant (27%) and the number of calls is much lower (24344, ~2% of 1351075). The overall runtime is also around 6% of the previous runtime.
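Call counts and time fractions like those above can be read out of a profile programmatically as well as from a SnakeViz plot. The following is a minimal sketch using the standard library cProfile and pstats modules. The `simulate` function and the toy `get_appt_footprint_as_time_request` are hypothetical stand-ins for the real workload, not TLOmodel code.

```python
import cProfile
import pstats


def get_appt_footprint_as_time_request(n):
    # Hypothetical stand-in for the real method: just burns a little CPU.
    return sum(i * i for i in range(n))


def simulate(num_calls):
    # Hypothetical stand-in for a simulation run that calls the hot function.
    return [get_appt_footprint_as_time_request(200) for _ in range(num_calls)]


profiler = cProfile.Profile()
profiler.enable()
simulate(1000)
profiler.disable()

stats = pstats.Stats(profiler)
# stats.stats maps (filename, line, function name) to
# (primitive calls, total calls, total time, cumulative time, callers).
by_name = {key[2]: value for key, value in stats.stats.items()}
_, call_count, _, cumulative_time, _ = by_name["get_appt_footprint_as_time_request"]
fraction = cumulative_time / stats.total_tt

print(f"calls: {call_count}, share of profiled time: {fraction:.0%}")
```

To inspect the same profile interactively, `profiler.dump_stats("profile.prof")` writes a file that can be opened with `snakeviz profile.prof`.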