Batch of runs submitted from PR #1308 before/after review & master merge inconsistent #1348
Hi @marghe-molaro. Nothing immediately comes to mind with regards to things that could have caused such a substantial change in model behaviour. As far as we're aware the changes to HSI events and generic first appointments in #1289, #1292 and #1315 shouldn't have changed the model output behaviour at all, and similarly changing to running on Python 3.11 on Azure in #1323 also should not have affected the model outputs. But obviously something has, so maybe I'm assuming wrongly 😬. Do you have the specific Git commit SHAs for the two sets of runs (I think these should be recorded under the key
I totally agree @matt-graham that none of these changes should have made a difference, which is why the difference in outputs is so frustrating! The two Git commit SHAs are as follows: and for If you compare the parameter choices in the JSON files for draws 0 and 6 respectively, you can see they are identical. These correspond to the "No growth" scenarios in the plots above, which clearly differ. I did a quick run locally for a very small population (100 individuals), and the rescaling factors at the moment of switching modes (2019) were different (around doubled in the more recent version of the PR); this seems to suggest that the difference starts to emerge before the switch to mode 2/rescaling. However, would the PRs on the generic first appointments have affected the sequence of random numbers? If so it might be quite hard to pin down whether the divergence is just down to the small population. Maybe it would be good to submit a run from the latest branch rolling back the "infinity check" commit, to double-check that it is not that creating issues. It's the only PR-specific change that was brought in.
Agree this is probably worth doing, as even if this is unlikely to be the source of the difference it would be good to exclude it. To check for consistency in simulations I'm running
with the two commits c618958 and a91d60c with some small changes to
Thank you so much Matt. Do you have any intuition as to how many years of simulation 500 events corresponds to? When running locally I was checking the divergence after 8 years. I was doing the local runs using the same scenario files from which I submitted the batches on Azure (with a reduced population). For completeness, let me just note that the "infinity check" would only affect things after a mode switch, so the fact that things were starting to diverge earlier would suggest that it's unlikely to be the problem. But I will check directly. Many thanks again!
The 500 events was less than a month of simulation time. This was very slow to run, partly because the event rate scales with population (and I was using a relatively large 50 000 initial population) and also because outputting the dataframe hash on every event introduces a lot of overhead, to the point that computing hashes ends up being the main computational cost. I've re-run with a more sensible configuration of an initial population of 5000 and outputting hashes only after every 100 events. So far, after 60 000 events / just over seven years of simulation time, the population dataframes still appear to be in sync from the reported hashes - I will continue letting this run just to check divergences don't start appearing around the 8 year mark. Assuming we don't get any later divergences, this suggests the broader changes to the framework in
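(A minimal sketch, assuming pandas and hashlib, of how such a per-event population dataframe checksum could be computed - this is only illustrative, not the exact helper used in TLOmodel:)

```python
import hashlib

import pandas as pd


def population_checksum(df: pd.DataFrame) -> str:
    """Digest of a population dataframe, for comparing two runs event by event."""
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    return hashlib.sha1(row_hashes.to_numpy().tobytes()).hexdigest()


# e.g. log population_checksum(sim.population.props) after every 100 events
# and diff the logged checksums from the two commits.
```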
Thank you Matt! On the "infinity check" side, I can confirm this couldn't have resulted in changes to model behaviour because infinities are already prevented from entering the 'Fraction_Time_Used' column at an earlier stage, namely:
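(The specific line is elided above; as a generic, hypothetical illustration of that kind of guard - not the TLOmodel code - the division that would produce infinities where capabilities are zero is replaced before the ratio ever reaches the column:)

```python
import numpy as np

# Hypothetical values: minutes of health-care worker time used and available.
minutes_used = np.array([30.0, 45.0, 10.0])
minutes_available = np.array([60.0, 0.0, 20.0])  # zero capability for one officer

with np.errstate(divide="ignore"):
    fraction_time_used = np.where(
        minutes_available > 0.0,
        minutes_used / minutes_available,
        0.0,  # guard: never let an infinite ratio enter the column
    )

print(fraction_time_used)  # [0.5 0.  0.5]
```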
Just as a quick follow-up, the simulations continued to remain in sync up to 10 years of simulation time, so I think that means we can relatively confidently exclude that the differences are arising due to the changes to HSI events / generic first appointments.
Ok, what's even stranger is that when I run locally from the two branches using the scenario files, enforcing a mode switch after only one year (2011) and assuming 1000 individuals, I already see a difference in the rescaling factors:
The commit leading to changes in the "Minutes Used" in this branch is commit d56042f, which merged master into the branch.
Thanks for the further 🕵️ work @marghe-molaro!
Yeah, I have only been checking for consistency in terms of the values in the population dataframe. Any logged outputs which don't 'feed back' into the model in terms of affecting the triggering or effects of events wouldn't be captured by what I was checking. I'll try to set something up to see if/when the logged capabilities start diverging.
Thanks for isolating this. Given this doesn't touch
Hi @marghe-molaro. Running
on a91d60c with a line print(summary_by_officer.loc["Clinical"]) (choosing
and running the equivalent on c618958 we get
which seems to be identical (and from some quick eyeballing I can't spot any divergences anywhere else in the output). Can I check what commands you were running @marghe-molaro when you got the differences you posted above in the logged capabilities?
I get the difference when running python src/scripts/profiling/scale_run.py. Admittedly, when running python src/scripts/profiling/scale_run.py --years 10 --disable-log-output-to-stdout --initial-population 5000 I get the same result (on the first day of the sim at least).
Thanks @marghe-molaro, that's helpful. The default initial population is much larger than 5000, so it's possible it's something to do with this. Currently away from keyboard but will check when back whether I see differences running with default arguments. If it's some rare event causing the discrepancy, that could explain why it's only evident at larger populations.
Thank you so much @matt-graham!
PS: before you do anything else, please check if you can recover the discrepancy between the commits on the first day of the sim when running with python src/scripts/profiling/scale_run.py --disable-log-output-to-stdout. So far I'm the only one seeing these discrepancies, so I'd like to make sure you see them too!
Hmm odd, I don't seem to get any discrepancies running
On a91d60c
and on c618958
Do you have a clean working tree where you're running these?
Interestingly the figures I get seem to be the same as for your
This is my commit log for molaro/first-component-comparison. Commit a6cfbd4 is on this branch before the master merge; the two "Test" commits just modify the healthsystem to print the summary log.
Just realised most of the checks for consistency I performed previously were bogus, as I was running one set of results from an additional working tree, but as I was using the same

After now trying to run

from the two commits with a line

On a91d60c
and on c618958
Repeating the above with the population dataframe hash also being printed out every 100 events, we get output on c618958
and on a91d60c
So the population dataframes are diverging. This does suggest that something in the merge commit d56042f, which @marghe-molaro previously identified as being where the differences appear, has caused a change in model behaviour.
To try to isolate which commit merged in from master caused the differences, I created a script
Then running
which uses 012ef6d as the 'good' commit (the commit on master that #1308 was initially branched off from) and 82a02e9 as the 'bad' commit (the commit on master at the point it was merged into #1308 in d56042f), with this identifying bd0a763 (merge of #1292) as the first 'bad' commit. So it looks like #1292 inadvertently introduced some changes to the model behaviour. @willGraham01 do you think you might be able to help with tracking down what in #1292 might have led to the differences? We could probably do a
Looks like some of the commits in
After some slight edits to @matt-graham's script above to account for the 125 exit codes (which git bisect run treats as meaning the commit cannot be tested and should be skipped),

```bash
#!/bin/bash
REFERENCE_CHECKSUM=dc273b2e0ef27a3d61afb954ede48b8ba038a9a7
YEARS=0
MONTHS=1
INITIAL_POPULATION=1000
RUN_CHECKSUM=$( \
    python src/scripts/profiling/scale_run.py \
        --years $YEARS --months $MONTHS --log-final-population-checksum \
        --initial-population $INITIAL_POPULATION | tail -n 1 \
        | grep -oP "(?<=checksum: ).*(?=\"])" \
)
if [ $? -ne 0 ]
then
    echo "Python script failed, commit cannot be tested"
    exit 125
fi
echo "Run checksum was: $RUN_CHECKSUM"
echo "Checksum to match: $REFERENCE_CHECKSUM"
if [[ "$RUN_CHECKSUM" == "$REFERENCE_CHECKSUM" ]]
then
    echo "That's a match"
else
    echo "No match, fail"
    exit 1
fi
```

and then bisecting along the branch with

```bash
git checkout wgraham/1237-speedup-concept
git bisect start
git bisect bad
git checkout 012ef6d
git bisect good
git bisect run <the-script-above>
```

we get that the first "bad" commit is Fixes
Thank you @willGraham01! Which PR are these fixes coming in from, #1292? It would be good to know when they have been merged into master!
Hi @willGraham01, thank you so much for this! The idea that this would be related to an issue with the ordering of HSIs in the queue makes sense, because the difference in population-wide health outputs (see top of this PR) is most visible in mode 2 - when being at the top of the queue "pays off" - rather than in mode 1. What I find very confusing though is that the order in which HSIs are added to the queue is explicitly randomised (see the randomise_queue parameter) exactly to prevent such artificial effects from emerging. The question is then why the randomisation fails to truly prevent such effects from occurring. Clarification: these artificial effects would be prevented at a statistical level in a large population; in a small population they might well lead to differences in outcomes for the few individuals considered. I will have a look and a think.
Sorry @willGraham01, is the "long version" example you are describing running in mode 1 or mode 2?
This does seem odd, but I think it could still be consistent with what we're seeing here? Swapping the order "Good code":
With the same RNG, "Refactored code":
I don't know if this is what's happening, but it's a possibility that's consistent with my understanding of
Everything here is running in mode 2.
Sorry, you're absolutely right. I've now clarified above that what I mean is that at a statistical level / for large populations this shouldn't make a difference - i.e. the outcome of a specific individual (or a few individuals) may be different (which is what you're seeing), but if considering large enough populations those differences should cancel out if the ordering of the queue for equal-priority events is truly random. Could this be an indication that we're not considering large enough populations?
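(As a purely hypothetical illustration of the swap being discussed - the modules and probabilities below are made up, this is not TLOmodel code: with the same seed, changing which module draws first changes an individual's realised outcomes, even though each draw's marginal distribution, and hence the large-population statistics, are unchanged.)

```python
import numpy as np

SEED = 0

# "Original" ordering: module A consumes the first draw, module B the second.
rng = np.random.default_rng(SEED)
a_event = rng.random() < 0.1  # hypothetical 10% probability event in module A
b_event = rng.random() < 0.5  # hypothetical 50% probability event in module B

# "Refactored" ordering: same seed, but module B now draws first.
rng = np.random.default_rng(SEED)
b_event_reordered = rng.random() < 0.5
a_event_reordered = rng.random() < 0.1

# The stream of random numbers is identical, but each module receives different
# values, so this individual's outcomes can differ between the two orderings.
print((a_event, b_event), (a_event_reordered, b_event_reordered))
```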
By the same token as @marghe-molaro says, I am wondering whether the tests being used for the
Just to point out @tbhallett that wrt (ii), squeeze factors will remain constant even as the HSI queue is changed, because squeeze factors are only meaningful in mode=1, where the refactoring of the queue doesn't lead to any differences anyway. As for (i), I can merge the changes that @willGraham01 has just made and then resubmit a new batch of runs in mode 2 (with and without rescaling, for curiosity) to check how things look. @willGraham01, have these changes already been merged into master? And @willGraham01, can I double-check first whether in these changes you are now forcing the order of the modules in the generic_first_appts to be the same as it used to be originally?
The branch I'm working on is #1366; it hasn't been merged into master. @marghe-molaro feel free to merge #1366 into an experimental branch, but it 100% shouldn't be merged into master. I can do this this afternoon & tomorrow morning, and @marghe-molaro I can notify you when you should be able to merge #1366 into an experimental branch and set off a batch job like you suggested, to see how things look?
Perfect, sounds great @willGraham01! I'll be on standby for #1366.
@marghe-molaro #1366 now has

Hopefully you should be able to merge this into your own branch 🤞 and run some scenarios to see if we've found the cause.
Hi @willGraham01 @matt-graham @tbhallett, The test run from #1366 has now completed. We do indeed seem to recover the behaviour that we were observing at the start (compare the red line here with the "No growth" case in the left plot at the start of this Issue), i.e. when rescaling in mode 2 there is no dramatic increase in AIDS DALYs after switching modes. While it is reassuring that we recover the original results, if the true cause of the discrepancy was the order in which the modules are called during the first generic appointment, it is concerning that this would be observed on runs with a population size of 100,000. However @willGraham01, if I understood correctly there were some other minor fixes you brought in with #1366, so maybe/hopefully the blame, for some as-yet-unidentified reason, lies with those? We could maybe resubmit a run after reshuffling the order in which modules are called in #1366 to check whether this really is the source of the discrepancy? If we confirmed this to be the case, this artificial effect would require some quite careful consideration.
I strongly support these sentiments and the desire to understand if/how the ordering of processing in the first appointments has such a big effect.
Will's away this week, but as far as I'm aware, other than the module order change in #1366, the other changes were mainly to ensure calls to the random number generator happened in the same order to give consistent results (which would hopefully also not have a major effect on the distribution of results with a large population). From quickly scanning the diff, the one change I can see that might affect model behaviour is a change to a conditional statement in the diarrhoea module to add an additional clause (I have just tagged you both in a comment on this line as I couldn't see an easy way to link from here).
Thank you @matt-graham! I was hoping you might be able to help me clarify something while @willGraham01 is away: in one of his previous replies he said that:
My understanding though is that prior to @willGraham01's revision, modules were not only using shared dataframe properties during _do_on_generic_first_appt, but also modifying them, hence the need for this loop to exist at the end of the function:
Moving to the use of "cached" values would therefore necessarily lead to differences, as modules would now be accessing the individual's original properties rather than those updated by the modules called earlier inside _do_on_generic_first_appt. Am I missing something?
It may be a moot point, but I also noticed that this section will lead to the updating not working as expected if more than one module is trying to update the same property (the updates from the last module will "win"): see TLOmodel/src/tlo/methods/hsi_generic_first_appts.py, lines 93 to 97 at 6494277.
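(A purely hypothetical illustration of the overwrite described above - the property name and module labels are made up, and this assumes the per-module update dictionaries are merged before being written back:)

```python
# Two modules both want to set the same (hypothetical) property for one person.
updates_from_module_a = {"is_on_treatment": True}
updates_from_module_b = {"is_on_treatment": False}  # same key, different value

merged_updates = {}
for module_updates in (updates_from_module_a, updates_from_module_b):
    merged_updates.update(module_updates)

print(merged_updates)  # {'is_on_treatment': False} -- module A's update is lost
```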
As was mentioned in the catch-up meeting, this is now a blocker on several analyses. I'm at a workshop today, but will have time tomorrow to dig into it. I think if the situation looks difficult to fix quickly we should roll back the refactoring changes so the epi team can continue with their work.
Hi @tamuri, do you have any suggestions as to how we should proceed? If we do roll back is that going to come in through master? Many thanks in advance!
Hi @marghe-molaro, sorry for the delay replying. You are completely correct here and this is something I missed when reviewing #1292. The intention had been for the previous model behaviour to be retained (beyond possibly calling the module-specific logic in a different order, as we hadn't envisaged this would lead to changes in distributional output even if specific realisations were different). We should be updating the data structure used to mediate access to the individual properties within the loop when a module makes an update, rather than always using the initially read cached values. The simplest way to restore the previous behaviour would be to move the code for updating the dataframe if there are updates inside the loop and also recreate the patient_details object:

```python
patient_details = self.sim.population.row_in_readonly_form(self.target)
for module in modules.values():
    module_patient_updates = getattr(module, self.MODULE_METHOD_ON_APPLY)(
        patient_id=self.target,
        patient_details=patient_details,
        symptoms=symptoms,
        diagnosis_function=self._diagnosis_function,
        consumables_checker=self.get_consumables,
        facility_level=self.ACCEPTED_FACILITY_LEVEL,
        treatment_id=self.TREATMENT_ID,
        random_state=self.module.rng,
    )
    if module_patient_updates:
        self.sim.population.props.loc[
            self.target, module_patient_updates.keys()
        ] = module_patient_updates.values()
        patient_details = self.sim.population.row_in_readonly_form(self.target)
```

Combined with the changes Will made in #1366 to fix the module ordering etc., this should give us identical behaviour to previously. Implementing exactly as above will however mean we may lose some of the performance benefit we were aiming for in #1292, as we'd now be updating the dataframe multiple times again. I think we can avoid this with a slightly more involved fix which will mean updating the

In the meantime I've created a new branch
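(For the "slightly more involved fix" mentioned above, one possible shape - purely an illustrative, self-contained sketch with made-up module hooks and property names, not the code in #1389 or on Matt's branch - is to overlay pending updates on the cached row so that later modules see earlier modules' changes while the dataframe is written only once:)

```python
import pandas as pd

# Hypothetical stand-in for the population dataframe and one person.
population = pd.DataFrame({"is_alive": [True], "is_on_treatment": [False]})
person_id = 0

cached_row = population.loc[person_id].to_dict()  # read the dataframe once
pending_updates: dict = {}


def module_a(view: dict) -> dict:
    # hypothetical module: starts treatment if the person is alive
    return {"is_on_treatment": True} if view["is_alive"] else {}


def module_b(view: dict) -> dict:
    # hypothetical module: reacts to the value written by module_a
    return {"was_referred": True} if view.get("is_on_treatment") else {}


for module in (module_a, module_b):
    view = {**cached_row, **pending_updates}  # later modules see earlier updates
    pending_updates.update(module(view))

# single dataframe write at the end
for key, value in pending_updates.items():
    population.loc[person_id, key] = value
```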
Thanks for also flagging this @tbhallett. In this case I think (beyond the issue with updates not being synced back to the cached values in
Thanks so much for this @matt-graham, @tamuri and @marghe-molaro. It's great that it seems like we'll be able to restore the original behaviour of the model and retain the performance benefits of the refactoring. I have only one remaining question: I was surprised that apparently the ordering of the modules being called in
Agreed, this is definitely something we should investigate - will create an issue to remind us to do this.
Hi @matt-graham @willGraham01 @tamuri @tbhallett, Here are the preliminary results from the tests on the sensitivity of the output to module order, based on 3 runs per draw with a population of 100,000. The cases I considered are as follows:
At this link I include the time evolution of total DALYs and of all causes. I am still sorting through them, but my first impression is that things broadly speaking look reasonable, as we wouldn't expect the runs to be identical in any case. But I wonder if some causes (esp. TB and depression, which are quite common) are indicative that a population of 100,000 may be a bit small. Let me know what you think.
Thanks very much @marghe-molaro, I think this is very convincing. We don't expect the order to make a difference and we see no evidence of it having an effect. I think we can be reassured by this and move on!
Thanks @marghe-molaro. Agree with @tbhallett, this looks like strong evidence of module order not having an effect, as expected. I'll link to your comment in the issue we raised to check about this and close it.
Thank you @matt-graham. I will also close this issue, as I believe we have now established the cause of the original discrepancy, which has been fixed by PR #1389.
PR #1308 allows the user to rescale HR capabilities to capture effective capabilities when transitioning to mode 2.
Two batches of runs were submitted to Azure from this PR, effect_of_policy-2024-04-09T132919Z (2 runs per draw) and effect_of_capabilities_scaling-2024-04-29T074640Z (10 runs per draw). While the rescaling seems to be occurring in both sets of runs, the evolution of DALYs recovered is very different even though all other factors (cons. availability, HR used, etc...) are identical, as shown in the plots below.
The two branches from which these jobs were submitted produce identical outputs for the test test_rescaling_capabilities_based_on_squeeze_factors, suggesting the difference does not arise from the stylistic changes in how rescaling is computed in the more recent branch suggested by @matt-graham. While the more recent branch includes checks on the rescaling factor not being infinite, which would have occurred in the case of some mental health care workers, this should not be affecting the DALYs related to HIV/AIDS shown here, which do not rely on this medical worker class.
Other differences between the branches include changes brought over from master, including:
@matt-graham any suggestions on things to double check would be very appreciated!