Caching appointment time requests #322
Conversation
Cache computed appointment time requests generated from `EXPECTED_APPT_FOOTPRINT` in the `HSI_Event` instance.
@matt-graham this is great. Really amazed at the high number of calls and the magnitude of this improvement.
Yes I was also surprised at the number of calls! I plan to do a bit of digging to figure out why so many calls are being generated in the first place and see if there are any potential optimizations to the scheduler which could help reduce this load, but this seemed a reasonable interim solution, and turned out to be surprisingly effective 😅.
Thanks, that is good to know! We could potentially semi-enforce this by making the `EXPECTED_APPT_FOOTPRINT` and `target` attributes immutable.
Thanks. Just to note that the configuration of that profiling script is designed to make the capacity of the healthsystem low, meaning that many appointments are forced to be held over (so it's a worst case, by design). It might be (I hope it's not) because many instances of a kind of appointment are scheduled that is somehow never possible, and the HSI has no `tclose` argument, meaning that it's tried, and frustrated, on every day for the rest of the simulation. Maybe the default for `tclose` should be less generous than that: e.g. one week after the `topen`? But I will be led by your findings, of course.
Cool. I'd welcome that enforcement. Thanks very much.
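One way such enforcement could look (a hypothetical sketch, not actual TLOmodel code — the class and footprint values here are illustrative) is to expose the footprint as a read-only property backed by a private attribute, so accidental reassignment fails loudly:

```python
# Hypothetical sketch (not actual TLOmodel code) of "semi-enforcing"
# immutability: expose EXPECTED_APPT_FOOTPRINT as a read-only property
# backed by a private attribute, so reassignment raises AttributeError.


class HSIEventSketch:
    def __init__(self, expected_appt_footprint):
        # frozenset makes the stored footprint itself immutable too
        self._expected_appt_footprint = frozenset(expected_appt_footprint)

    @property
    def EXPECTED_APPT_FOOTPRINT(self):
        return self._expected_appt_footprint


event = HSIEventSketch({("Over5OPD", 1)})
print(event.EXPECTED_APPT_FOOTPRINT)
try:
    event.EXPECTED_APPT_FOOTPRINT = frozenset()  # reassignment is blocked
except AttributeError as err:
    print("rejected:", err)
```

This would not stop in-place mutation of a mutable footprint object, which is why the sketch also freezes the stored value.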
From an attempt at making the `EXPECTED_APPT_FOOTPRINT` and `target` attributes immutable: there are a series of cases in the tests where these attributes are changed after creation, in

TLOmodel/tests/test_dxmanager.py Lines 253 to 259 in 7416d56

TLOmodel/tests/test_dxmanager.py Lines 263 to 271 in 7416d56

TLOmodel/tests/test_dxmanager.py Lines 310 to 324 in 7416d56
There is also a case in

TLOmodel/src/tlo/methods/postnatal_supervisor.py Lines 1164 to 1209 in 7416d56

where it appears that the `target` of an existing `HSI_Event` is changed after creation. In both sets of cases it seems an alternative solution that wouldn't require mutating the events should be possible. An alternative (or complementary) approach would be to make the caching mechanism more intelligent, by making the cache a dictionary keyed by a tuple of the relevant attributes of the event and the footprint used to generate the request (i.e. the `appt_footprint = hsi_event.EXPECTED_APPT_FOOTPRINT if actual_appt_footprint is None else actual_appt_footprint` value). This caching mechanism would probably have more overhead than the current simpler approach due to the dictionary indexing operations and construction of the key. Equally, it would potentially result in fewer cache misses (post #323 being merged) if we took the latter approach of allowing keying by the `actual_appt_footprint`.
Okay, obviously that is just me doing the testing in a clumsy/lazy way: this would be considered an abuse of the HSI system rather than its use! I suppose that what I should have done is create an HSI event specific to each person rather than "recycling" the same HSI event.
So, I'm not really clear on what is going on here, so I have to call on @joehcollins again. @joehcollins - is it intentional that the HSI target is switched between the mother and child in this way? Is it being done to overcome some constraint you hit in the DxManager, perhaps? It's hard to judge or think about the other solutions without understanding this use case. I have a feeling that this may not be necessary.
This sounds reasonable to me, but I defer to @matt-graham and @tamuri to make the judgement call (once it's been established that we actually need such a mechanism to accommodate the target being changed -- my instinct at the moment being that we don't).
@tbhallett - yes, this was because in this version of the code I was delivering interventions to newborns and mothers within the same HSI (assuming that postnatal care would be delivered to both individuals as they would attend together, and to keep the number of HSIs down), so I had to change the HSI target when calling the dx_test for it to work on the newborn (maybe that was a bit hacky). Conveniently, I've actually removed this logic in the version that I'm currently working on, and now I do have a separate PNC HSI for mothers and newborns (as coverage is actually different for mothers and newborns). I've checked my modules and there's nowhere else I'm changing the HSI target.
Thanks @joehcollins -- great that this doesn't happen anymore. We could have made the simple change discussed above to accommodate it, but it sounds like that's no longer needed. @matt-graham - what's the best way forward?
Thanks @joehcollins and @tbhallett for the explanations! My inclination would probably be to switch to the more complicated caching mechanism based on the event attributes, rather than trying to make the relevant attributes immutable. While (modulo needing to make the changes to the test cases identified above) the simpler mechanism would work for now, keying the cache on the attributes should be more robust to future changes.
Profiling results from a run with the updated cache mechanism suggested above indicate that the overhead from constructing the key and the dictionary accesses is minimal. Specifically, for a 1 year run of `scale_run.py`, and concentrating specifically on the time spent in `get_appt_footprint_as_time_request`: compared to the previous simpler caching mechanism, there appears to be a greater speedup in the overall breakdown, though as the time spent in `get_appt_footprint_as_time_request` is now small in either case, the difference is marginal.
Profiling of a 1 year run of `scale_run.py` shows that considerable time is still being spent in `HealthSystem.get_appt_footprint_as_time_request`, even with the optimizations in #287, #303 and #313. The SnakeViz output for a 1 year `scale_run.py` run shows an overall breakdown with `HealthSystem.get_appt_footprint_as_time_request` corresponding to the dark orange bar beneath the light blue `healthsystem.py:1394(<listcomp>)` bar on the far left, with the breakdown within `HealthSystem.get_appt_footprint_as_time_request` as follows. Most of the time seems to be spent in dataframe access operations in `HealthSystem.get_facility_info`.

The high number of calls to `get_appt_footprint_as_time_request` (149496460 in this 1 year run) suggests, I think, that it is being called multiple times for each `HSI_Event`, due to the event being put in the hold-over list for running on subsequent days by the health system scheduler. This suggests we can cache the appointment time request generated on the first call to `get_appt_footprint_as_time_request` for an HSI event within the event, and return it on any subsequent calls on the same event, this idea being the basis of this PR.

In general the generated time request can depend on both an HSI event and an appointment footprint which may differ from the `EXPECTED_APPT_FOOTPRINT` of the event. For the sake of simplicity, here I only cache time requests generated from the `EXPECTED_APPT_FOOTPRINT` footprint rather than any updated footprint passed via the `actual_appt_footprint` argument to `get_appt_footprint_as_time_request`, with a non-`None` value for this argument causing the time request to always be recomputed.

The caching mechanism here assumes that the `HSI_Event` is essentially static after creation - in particular, it assumes that neither the `EXPECTED_APPT_FOOTPRINT` nor `target` attributes are changed after initial creation, so that a previously computed time request remains valid. There is also an assumption that the district of residence of the target cannot change (as this would change the facility information returned by `get_facility_info`). As far as I can tell these assumptions are currently valid, backed up by the final population dataframe after a 1 year run being the same before and after the changes in this PR. These assumptions could however be invalidated by future changes to the code. While we could potentially make the `EXPECTED_APPT_FOOTPRINT` and `target` attributes of a `HSI_Event` immutable to guard against this, that would still not protect against the possibility of the target's district being updated in the dataframe (and explicitly checking that this value is unchanged to decide if the cached time request is valid would remove much of the performance gain, given the proportion of time currently spent in `get_facility_info`).

The SnakeViz output for a 1 year `scale_run.py` run after the changes made in this PR [1] gives the following overall breakdown, and looking specifically at `get_appt_footprint_as_time_request`, we can see that the time spent in `get_appt_footprint_as_time_request` has been drastically reduced (from 4330s to 120s).

1. The profiling run was in fact for a slightly different version of the code in this PR which didn't create a `_cached_time_request` attribute with value `None` in the `__init__` method of `HSI_Event`, but instead simply created this attribute after the first call to `get_appt_footprint_as_time_request` and checked for the presence of this attribute with `hasattr` when deciding whether to return a cached value, hence the presence of the `hasattr` block in the SnakeViz output and the slight mismatch with the line numbers.
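The per-event caching described in this PR can be sketched roughly as follows (illustrative class and a stub computation; the real logic lives in `HealthSystem.get_appt_footprint_as_time_request`, and the footprint values here are made up):

```python
# Rough sketch of the per-event cache: the first call with the default
# footprint computes and stores the time request on the event; later calls
# return the stored value. A non-None actual_appt_footprint bypasses the
# cache entirely and is always recomputed, as described in the PR.


def _compute_time_request(target, appt_footprint):
    # Stand-in for the expensive dataframe lookups in get_facility_info.
    return {appt_type: float(n) for appt_type, n in appt_footprint.items()}


class HSIEventSketch:
    def __init__(self, target, expected_appt_footprint):
        self.target = target
        self.EXPECTED_APPT_FOOTPRINT = expected_appt_footprint
        self._cached_time_request = None  # populated on first call


def get_appt_footprint_as_time_request(hsi_event, actual_appt_footprint=None):
    if actual_appt_footprint is not None:
        # Updated footprints are never cached.
        return _compute_time_request(hsi_event.target, actual_appt_footprint)
    if hsi_event._cached_time_request is None:
        hsi_event._cached_time_request = _compute_time_request(
            hsi_event.target, hsi_event.EXPECTED_APPT_FOOTPRINT
        )
    return hsi_event._cached_time_request


event = HSIEventSketch(target=7, expected_appt_footprint={"ANCSubsequent": 2})
first = get_appt_footprint_as_time_request(event)
second = get_appt_footprint_as_time_request(event)  # returned from the cache
```

As the PR notes, this sketch is only correct while `target`, `EXPECTED_APPT_FOOTPRINT`, and the target's district stay fixed after the event is created.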