Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Batch of runs submitted from PR #1308 before/after review&master merge inconsistent #1348

Closed
marghe-molaro opened this issue May 15, 2024 · 57 comments
Assignees
Labels
bug Something isn't working epi

Comments

@marghe-molaro
Copy link
Collaborator

marghe-molaro commented May 15, 2024

PR #1308 allows user to rescale HR capabilities to capture effective capabilities when transitioning to mode 2.

Two batch of runs where submitted to Azure from this PR, effect_of_policy-2024-04-09T132919Z (2 runs per draw) and effect_of_capabilities_scaling-2024-04-29T074640Z (10 runs per draw). While the rescaling seems to be occurring in both sets of runs, the evolution of DALYs recovered is much different when all other factors (cons. availability, HR used, etc...) are identical, as shown in plots below.
Scaling of capabilities

The two branches from which these jobs were submitted produce identical outputs to the test test_rescaling_capabilities_based_on_squeeze_factors, suggesting difference does not arise from stylistic changes in how rescaling is computed in more recent branch suggested by @matt-graham. While the more recent branch includes checks on refactoring not being infinite, which would have occurred in the case of some mental health care workers, this should not be affecting DALYs related to HIV/AIDS shown here which do not rely on this medical worker class.

Other differences between branches include changes brought over by master including:

  • Refactoring of generic first appointments
  • Transition to python 3.11
  • new scenario switcher class

@matt-graham any suggestions on things to double check would be very appreciated!

@matt-graham
Copy link
Collaborator

Hi @marghe-molaro. Nothing immediately comes to mind with regards to things that could have caused such a substantial change in model behaviour. As far as we're aware the changes to HSI events and generic first appointments in #1289, #1292 and #1315 shouldn't have changed the model output behaviour at all, and similarly changing to running on Python 3.11 on Azure in #1323 also should not have effected the model outputs. But obviously something has so maybe I'm assuming wrongly 😬.

Do you have the specific Git commit SHAs for the two sets of runs (I think should be recorded under the key commit in the JSON file in the top-level directory containing the batch job results)? With those we could try running (shorter) runs locally and try to see where they start diverging

@marghe-molaro
Copy link
Collaborator Author

marghe-molaro commented May 15, 2024

I totally agree @matt-graham that none of these changes should have made a difference, that's why the difference in outputs is so frustrating!

The two Git commit SHAs are are follows:
for effect_of_policy-2024-04-09T132919Z
"commit": "c61895880abf8e942635839d7d62ed38c83f6f89",
"github": "https://github.com/UCL/TLOmodel/tree/c61895880abf8e942635839d7d62ed38c83f6f89"
[draw 0]

and for
effect_of_capabilities_scaling-2024-04-29T074640Z
"commit": "a91d60c518069ed11477f27fb3c81f1ee92302ca",
"github": "https://github.com/UCL/TLOmodel/tree/a91d60c518069ed11477f27fb3c81f1ee92302ca"
[matched by draw 6]

If you compare the parameter choice in the JSON file for the draws 0 and 6 respectively you can see they are identical. These correspond to the "No growth" scenarios in the plots above, which clearly differ.

I did a quick run locally for a very small population (100 individuals), and the rescaling factors at the moment of switching modes (2019) were different (around doubled in the more recent version of the PR); this seems to suggest that the difference is starting to emerge before the switch to mode 2/rescaling, however would the PRs on the generic first appointments have affected the sequence of random numbers? If so it might be quite hard to pin down whether the diversion is just down to the small population.

Maybe it would be good to submit a run from the latest branch rolling back the "infinity check" commit, to double check that it is not that creating issues. It's the only PR-specific change that was brought in.

@matt-graham
Copy link
Collaborator

matt-graham commented May 15, 2024

Maybe it would be good to submit a run from the latest branch rolling back the "infinity check" commit, to double check that it is not that creating issues. It's the only PR-specific change that was brought in.

Agree this is probably worth doing, as even if this is unlikely to be the source of the difference it would be good to exclude it.

To check for consistency in simulations I'm running

python src/scripts/profiling/scale_run.py --disable-log-output-to-stdout

with the two commits c618958 and a91d60c with some small changes to src/tlo/simulation.py to print a hash of the population dataframe after each event has been completed. As calculating the hash takes a little while this is unfortunately quite slow but the outputs after the first 500 events suggest the population dataframes have stayed exactly in sync. If this continues to be the case this suggests it might be something specific to the scenarios rather than the simulation, though it could well be that I just haven't reached the divergence point yet.

@marghe-molaro
Copy link
Collaborator Author

Thank you so much Matt. Do you have any intuition as to how many years in the simulation 500 events is? When running locally I was checking the divergence after 8 years. I was doing the local runs using the same scenario files from which I submitted the batches on Azure (with reduced population).

For completeness let me just note that the "infinity check" would only affect things after a mode switch, so the fact that things were starting to diverge earlier would suggest that it's unlikely to be the problem. But I will check directly.

Many thanks again!

@matt-graham
Copy link
Collaborator

The 500 events was less than a month of simulation time, this was very slow to run partly because event rate scales with population (and I was using a relatively large 50 000 initial population) and also because outputting dataframe hash on every event introduces a lot of overhead to the point that computing hashes ends up being the main computational cost. I've re-run with a more sensible configuration of an initial population of 5000 and outputting hashes after only every 100 events. So far after 60 000 events / just over seven years simulation time the population dataframes still appear to be in sync from the reported hashes - I will continue letting this run just to check divergences don't start appearing around the 8 year mark.

Assuming we don't get any later divergences, this suggests the broader changes to the framework in master that were merged in (particularly the refactoring of HSI events / generic first appointments) have not changed the model behaviour, which is what we expected. This is not immediately helpful in trying to identify what is causing the divergence here, but at least means we can probably ignore these changes when looking for the problem (unless there is some weird interaction with running simulations as part of a scenario that has meant these changes cause differences in model behaviour).

@marghe-molaro
Copy link
Collaborator Author

marghe-molaro commented May 16, 2024

Thank you Matt!

On the "infinity check" side, I can confirm this couldn't have resulted in changes to model behaviour because infinities are already prevented from entering the 'Fraction_Time_Used' column at an earlier stage, namely:
summary_by_officer['Fraction_Time_Used'] = (
summary_by_officer['Minutes_Used'] / summary_by_officer['Total_Minutes_Per_Day']
).replace([np.inf, -np.inf, np.nan], 0.0)
which means the added infinity check at the moment of rescaling is redundant.

@matt-graham
Copy link
Collaborator

Just as a quick follow up, the simulations continued to remain in sync up to 10 years simulation time, so I think that means we relatively confidently exclude that the differences are arising due to the changes to HSI events / generic first appointments.

@marghe-molaro
Copy link
Collaborator Author

marghe-molaro commented May 16, 2024

Ok, what's even stranger is that when I run locally from the two branches using the scenario files and enforcing a mode switch after only one year (2011) and assuming 1000 individuals I already see a difference in the rescaling factors:

@marghe-molaro
Copy link
Collaborator Author

more_earlier_2011_switch More_recent_2011_switch

@tamuri tamuri added bug Something isn't working epi labels May 16, 2024
@marghe-molaro
Copy link
Collaborator Author

When I run locally with the python src/scripts/profiling/scale_run.py scenario files, this is what I see on the first day in terms of minutes used:

  • Minutes used at level 2 appear to be consistent between two branches;
  • Minutes used at level 1a appear to be similar but not identical between the two branches;
    If you are seeing the event chain as identical, I wonder if the issue could be down to the assumed footprint? I.e. the sequence of events is identical, but the registered request in terms of minutes has changed.
original_branch_first_day later_branch_first_day

@marghe-molaro
Copy link
Collaborator Author

marghe-molaro commented May 17, 2024

The commit leading to changes in the "Minutes Used" in this branch is commit d56042f, which merged master into branch.

@matt-graham
Copy link
Collaborator

Thanks for the further 🕵️ work @marghe-molaro!

If you are seeing the event chain as identical, I wonder if the issue could be down to the assumed footprint? I.e. the sequence of events is identical, but the registered request in terms of minutes has changed.

Yeah I have only been checking for consistency in terms of the values in the population dataframe. For any logged outputs which don't 'feedback' into the model in terms of affecting triggerring or effects of events, then this wouldn't be captured by what I was checking. I'll try set something up to see if/when the logged capabilities start diverging.

The commit leading to changes in the "Minutes Used" in this branch is commit d56042f, which merged master into branch.

Thanks for isolating this. Given this doesn't touch healthsystem.py it not obvious to me what would be causing these differences, but I'll see if stepping through simulations reveals anything!

@matt-graham
Copy link
Collaborator

Hi @marghe-molaro.

Running

python src/scripts/profiling/scale_run.py --years 10 --disable-log-output-to-stdout --initial-population 5000

on a91d60c with a line

print(summary_by_officer.loc["Clinical"])

(choosing "Clinical" here just to keep output manageably sized and because there were differences evident for this officer type in output you recorded) added to Healthsystem.log_current_capabilities_and_usage I get output after the 10000th event

10000 2011-04-17 00:00:00 13314295d95889a198c06d870d40a548244f2124
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                         128.345357        380.24            2.962632
2                          270.021738        132.24            0.489738
3                          113.389117         40.00            0.352768
4                            1.158937          0.00            0.000000
5                            2.679878          0.00            0.000000
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                         128.345357        357.06            2.782025
2                          270.021738        259.12            0.959626
3                          113.389117         20.00            0.176384
4                            1.158937          0.00            0.000000
5                            2.679878          0.00            0.000000
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                         128.345357        435.56            3.393656
2                          270.021738        352.12            1.304043
3                          113.389117         60.00            0.529151
4                            1.158937          0.00            0.000000
5                            2.679878          0.00            0.000000
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                         128.345357        379.12            2.953905
2                          270.021738        341.00            1.262861
3                          113.389117         40.00            0.352768
4                            1.158937          0.00            0.000000
5                            2.679878          0.00            0.000000

and running the equivalent on c618958 we get

10000 2011-04-17 00:00:00 13314295d95889a198c06d870d40a548244f2124
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                         128.345357        380.24            2.962632
2                          270.021738        132.24            0.489738
3                          113.389117         40.00            0.352768
4                            1.158937          0.00            0.000000
5                            2.679878          0.00            0.000000
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                         128.345357        357.06            2.782025
2                          270.021738        259.12            0.959626
3                          113.389117         20.00            0.176384
4                            1.158937          0.00            0.000000
5                            2.679878          0.00            0.000000
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                         128.345357        435.56            3.393656
2                          270.021738        352.12            1.304043
3                          113.389117         60.00            0.529151
4                            1.158937          0.00            0.000000
5                            2.679878          0.00            0.000000
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                         128.345357        379.12            2.953905
2                          270.021738        341.00            1.262861
3                          113.389117         40.00            0.352768
4                            1.158937          0.00            0.000000
5                            2.679878          0.00            0.000000

which seems to be identical (and from some quick eyeballing I can't spot any divergences anywhere else in output).

Can I check what commands you were running @marghe-molaro where you got the differences you posted above in the logged capabilities?

@marghe-molaro
Copy link
Collaborator Author

I get the difference when running python src/scripts/profiling/scale_run.py.

Admittedly when running python src/scripts/profiling/scale_run.py --years 10 --disable-log-output-to-stdout --initial-population 5000 I get the same result (on the first day of the sim at least)

@marghe-molaro
Copy link
Collaborator Author

Just copying it here to double check I'm not going out of my mind. "first" and "second" component-comparison are just the names of the copies of the two commits I made.

You can see when using python src/scripts/profiling/scale_run.py --disable-log-output-to-stdout I get different results, whereas when using python src/scripts/profiling/scale_run.py --years 10 --disable-log-output-to-stdout --initial-population 5000 the output is the same.

Screenshot 2024-05-17 at 18 44 11 Screenshot 2024-05-17 at 18 45 16

@matt-graham
Copy link
Collaborator

Thanks @marghe-molaro that's helpful. The default initial population is much larger than 5000 so it's possible it's something to do with this. Currently away from keyboard but will check when back if I see differences running with default arguments. If it's some rare event causing the discrepancy that could explain why only evident at larger populations possibly?

@marghe-molaro
Copy link
Collaborator Author

Thank you so much @matt-graham!
Just to add to the general confusion, when I was running locally with a different scenario file and forcing the population to be quite small (<=1000) I was still seeing discrepancy on day one...so I'm really confused as to why it's still not showing up for you with a population of 5000.

@marghe-molaro
Copy link
Collaborator Author

PS before you do anything else, please check if you can recover the discrepancy between the commits on the first day of the sim when running with python src/scripts/profiling/scale_run.py --disable-log-output-to-stdout, so far I'm the only one seeing these discrepancy so I'd like to make sure you see them too!

@matt-graham
Copy link
Collaborator

matt-graham commented May 17, 2024

Hmm odd, I don't seem to get any discrepancies running

python src/scripts/profiling/scale_run.py --disable-log-output-to-stdout

On a91d60c

0 2010-01-01 00:00:00 4deda7ea3a711e16534247a1b63b93453cdf6bd5
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                        1283.453567        654.75            0.510147
2                         2700.217379       1184.00            0.438483
3                         1133.891165          0.00            0.000000
4                           11.589373          0.00            0.000000
5                           26.798780          0.00            0.000000
100 2010-01-02 00:00:00 768f2b3fff2c7973c0776758f3340c35415f6134
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                        1283.453567        802.75            0.625461
2                         2700.217379       1023.00            0.378858
3                         1133.891165       1195.50            1.054334
4                           11.589373          0.00            0.000000
5                           26.798780          0.00            0.000000

and on c618958

0 2010-01-01 00:00:00 4deda7ea3a711e16534247a1b63b93453cdf6bd5
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                        1283.453567        654.75            0.510147
2                         2700.217379       1184.00            0.438483
3                         1133.891165          0.00            0.000000
4                           11.589373          0.00            0.000000
5                           26.798780          0.00            0.000000
100 2010-01-02 00:00:00 768f2b3fff2c7973c0776758f3340c35415f6134
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                        1283.453567        802.75            0.625461
2                         2700.217379       1023.00            0.378858
3                         1133.891165       1195.50            1.054334
4                           11.589373          0.00            0.000000
5                           26.798780          0.00            0.000000

Do you have a clean working tree where you're running these?

@matt-graham
Copy link
Collaborator

Interestingly the figures I get seem to be the same as for your molaro/second-component-comparison branch.

@marghe-molaro
Copy link
Collaborator Author

This is my commit log for molaro/first-component-comparison. commit a6cfbd4 is on this branch before the master merge, the two "Test" commits just modify the healthsystem to print the summary log.

Screenshot 2024-05-17 at 19 57 05

@matt-graham
Copy link
Collaborator

Just realised most of the checks for consistency I performed previously were bogus as I was running one set of results from an additional working tree, but as I was using the same conda environment for both in which the tlo package was installed in editable mode by pointing to the main working tree, running a script from the additional working tree still used the tlo source code from the main working tree 🤦.

After now trying to run

python src/scripts/profiling/scale_run.py --disable-log-output-to-stdout

from the two commits with a line print(summary_by_officer.loc["Clinical"]) added to Healthsystem.log_current_capabilities_and_usage in both cases I get the same differences in output as @marghe-molaro has been reporting

On a91d60c

                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                        1283.453567        654.75            0.510147
2                         2700.217379       1184.00            0.438483
3                         1133.891165          0.00            0.000000
4                           11.589373          0.00            0.000000
5                           26.798780          0.00            0.000000
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                        1283.453567        802.75            0.625461
2                         2700.217379       1023.00            0.378858
3                         1133.891165       1195.50            1.054334
4                           11.589373          0.00            0.000000
5                           26.798780          0.00            0.000000

and on c618958

                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000           0.0            0.000000
1a                        1283.453567         618.0            0.481513
2                         2700.217379        1184.0            0.438483
3                         1133.891165           0.0            0.000000
4                           11.589373           0.0            0.000000
5                           26.798780           0.0            0.000000
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                        1283.453567        693.81            0.540581
2                         2700.217379       1023.00            0.378858
3                         1133.891165        797.00            0.702889
4                           11.589373          0.00            0.000000
5                           26.798780          0.00            0.000000

@matt-graham
Copy link
Collaborator

Repeating the above with population dataframe hash also being printed out every 100 events we get output on c618958

0 2010-01-01 00:00:00 4deda7ea3a711e16534247a1b63b93453cdf6bd5
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000           0.0            0.000000
1a                        1283.453567         618.0            0.481513
2                         2700.217379        1184.0            0.438483
3                         1133.891165           0.0            0.000000
4                           11.589373           0.0            0.000000
5                           26.798780           0.0            0.000000
100 2010-01-02 00:00:00 3ff9fc28a196f298b69314ac3c31ae1d95db5623
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                        1283.453567        693.81            0.540581
2                         2700.217379       1023.00            0.378858
3                         1133.891165        797.00            0.702889
4                           11.589373          0.00            0.000000
5                           26.798780          0.00            0.000000

and on a91d60c

0 2010-01-01 00:00:00 4deda7ea3a711e16534247a1b63b93453cdf6bd5
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                        1283.453567        654.75            0.510147
2                         2700.217379       1184.00            0.438483
3                         1133.891165          0.00            0.000000
4                           11.589373          0.00            0.000000
5                           26.798780          0.00            0.000000
100 2010-01-02 00:00:00 768f2b3fff2c7973c0776758f3340c35415f6134
                Total_Minutes_Per_Day  Minutes_Used  Fraction_Time_Used
Facility_Level
0                            0.000000          0.00            0.000000
1a                        1283.453567        802.75            0.625461
2                         2700.217379       1023.00            0.378858
3                         1133.891165       1195.50            1.054334
4                           11.589373          0.00            0.000000
5                           26.798780          0.00            0.000000

So the population dataframes are diverging. This does suggest something in the merge commit d56042f @marghe-molaro previously identified as being where the difference appear has caused a change in model behaviour.

@matt-graham
Copy link
Collaborator

matt-graham commented May 20, 2024

To try to isolate which commit to master that was merged in caused the differences, I created a script population_checksum_check_for_bisect.py which checks the population dataframe checksum against a reference value for a short simulation of 1 month with 1000 initial population (after finding this does give different checksums on the relevant commits)

REFERENCE_CHECKSUM=dc273b2e0ef27a3d61afb954ede48b8ba038a9a7
YEARS=0
MONTHS=1
INITIAL_POPULATION=1000
python src/scripts/profiling/scale_run.py --years $YEARS --months $MONTHS --log-final-population-checksum --initial-population $INITIAL_POPULATION | tail -n 1 | grep -q "Population checksum: ${REFERENCE_CHECKSUM}"

Then running git bisect with

git checkout 82a02e9740f4129cd5301b2be717bab4eab7a717
git bisect start
git bisect bad
git checkout 012ef6d7fd1e5635ce1987abd70cda0f8d5e4ae9
git bisect good
git bisect run ./population_checksum_check_for_bisect.py

which uses 012ef6d as 'good' commit (commit on master #1308 was initially branched off from) and 82a02e9 as 'bad' commit (commit on master at point it was merged in to #1308 in d56042f) with this identifying bd0a763 (merge of #1292) as the relevant first 'bad' commit.

So looks like #1292 introduced some changes to the model behaviour inadvertently. @willGraham01 do you think you might be able to help with tracking down what in #1292 might have led to the differences? We could probably do a git bisect similar to above on the branch #1292 was based on to isolate which commit in that branch the differences first arose.

@matt-graham
Copy link
Collaborator

matt-graham commented May 20, 2024

Looks like some of the commits in wgraham/1237-speedup-concept don't allow running the check script as there are errors in the code which means the scale_run.py script fails to run. Might need to adjust the check script to return a special exit code 125 if scale_run.py fails to complete without error, exit code 0 if it completes and gives correct checksum and exit code 1 if it completes but gives wrong checksum.

@willGraham01
Copy link
Collaborator

willGraham01 commented May 21, 2024

After some slight edits to @matt-graham's script above to account for the 125 exit codes,

#!/bin/bash

REFERENCE_CHECKSUM=dc273b2e0ef27a3d61afb954ede48b8ba038a9a7
YEARS=0
MONTHS=1
INITIAL_POPULATION=1000
RUN_CHECKSUM=$( \
    python src/scripts/profiling/scale_run.py \
    --years $YEARS --months $MONTHS --log-final-population-checksum \
    --initial-population $INITIAL_POPULATION | tail -n 1 \
    | grep -oP "(?<=checksum: ).*(?=\"])" \
)

if [ $? -ne 0 ]
then
    echo "Python script failed, commit cannot be tested"
    exit 125
fi

echo "Run checksum was:  $RUN_CHECKSUM"
echo "Checksum to match: $REFERENCE_CHECKSUM"

if [[ "$RUN_CHECKSUM" == "$REFERENCE_CHECKSUM" ]]
then
    echo "That's a match"
else
    echo "No match, fail"
    exit 1
fi

and then bisecting along the wgraham/1237-speedup-concept branch;

git checkout wgraham/1237-speedup-concept
git bisect start
git bisect bad
git checkout 012ef6d
git bisect good
git bisect run <the-script-above>

we get that the first "bad" commit is 3863c0b2c9120c274eea6033b255a694e0f0df9a. I'll take a look at this commit in more detail and report back!

Fixes

3863c0b2c9120c274eea6033b255a694e0f0df9a - some logic translation in check_if_fever_is_caused_by_maleria. Discrepancy fixed but tests need updating.

0fd30ba22 introduces a change in the number of random samples generated by the stunting module, which is also causing a divergence. Fixed this.

@marghe-molaro
Copy link
Collaborator Author

marghe-molaro commented May 21, 2024

Thank you @willGraham01! Which PR are these fixes coming in from, #1292? It would be good to know when they have been merged in master!

@marghe-molaro
Copy link
Collaborator Author

marghe-molaro commented May 23, 2024

Hi @willGraham01, thank you so much for this!

The idea that this would be related to an issue with the ordering of HSIs in the queue makes sense because the difference in population-wide health outputs (see top of this PR) is most visible in mode 2 - when being at the top of the queue "pays off" - rather than in mode 1.

What I find very confusing though is that the order in which HSIs are added to the queue is explicitly randomised (see randomise_queue parameter) exactly to prevent such artificial effects from emerging. The question is then why does the randomisation fail to truly prevent such effects from occurring.

Clarification: these artificial effects would be prevented at a statistical level in a large population; in a small population they might well lead to differences in outcomes for the few individuals considered.

I will have a look and a think.

@marghe-molaro
Copy link
Collaborator Author

Sorry @willGraham01, is the "long version" example you are describing running in mode 1 or mode 2?

@willGraham01
Copy link
Collaborator

What I find very confusing though is that the order in which HSIs are added to the queue is explicitly randomised (see randomise_queue parameter) exactly to prevent such artificial effects from emerging. The question is then why does the randomisation fail to truly prevent such effects from occurring.

This does seem odd, but I think this it could still be consistent with what we're seeing here? Swapping the order CMD and Malaria are added to the queue -> randomise_queue now assigns Malaria a higher "equal" priority than the HIV event -> starts the divergence. Since all the RNG is seeded separately we'd be doing:

"Good code":

  • Schedule HIV for person 466 (randomise_queue gives priority 2, for sake or argument)
  • Schedule Malaria for person 865 (randomise_queue gives priority 3)
  • Schedule CMD for person 865 (randomise_queue gives priority 1)
  • Resulting order: CMD, HIV, Malaria which doesn't run due to resource constraints.

With the same RNG,

"Refactored code":

  • Schedule HIV for person 466 (priority 2)
  • Schedule CMD for person 865 (priority 3)
  • Schedule Malaria for person 865 (priority 1)
  • Resulting order: Malaria, HIV which doesn't run due to resource constraints, CMD which runs because it doesn't need the same resources

I don't know if this is what's happening, but its a possibility that's consistent with my understanding of randomise_queue. It only works of course because it's the same person (865) being assigned two subsequent events.

@willGraham01
Copy link
Collaborator

Sorry @willGraham01, is the "long version" example you are describing running in mode 1 or mode 2?

Everything here is running mode 2

@marghe-molaro
Copy link
Collaborator Author

marghe-molaro commented May 23, 2024

What I find very confusing though is that the order in which HSIs are added to the queue is explicitly randomised (see randomise_queue parameter) exactly to prevent such artificial effects from emerging. The question is then why does the randomisation fail to truly prevent such effects from occurring.

This does seem odd, but I think this it could still be consistent with what we're seeing here? Swapping the order CMD and Malaria are added to the queue -> randomise_queue now assigns Malaria a higher "equal" priority than the HIV event -> starts the divergence.

Sorry, you're absolutely right, I've now clarified above that what I mean is that at a statistical level/for large populations this shouldn't make a difference - i.e. outcome of a specific/few individual(s) may be different (which is what you're seeing) but if considering large enough populations those differences should cancel out if the ordering of the queue for equal-priority events is truly random. Could this be an indication that we're not considering large enough populations ?

@tbhallett
Copy link
Collaborator

By the same token as @marghe-molaro says, I am wondering whether the tests being used for the git bisect is sufficiently capturing the unexpected behaviour that @marghe-molaro notes in her runs. Might it be useful to either (i) do a large run again with the changes made so far and see if we recover the original behaviour; (ii) write a super-discriminatory test (e.g. in a large population over a short-run, compare squeeze factors against a reference value). ...?

@marghe-molaro
Copy link
Collaborator Author

By the same token as @marghe-molaro says, I am wondering whether the tests being used for the git bisect is sufficiently capturing the unexpected behaviour that @marghe-molaro notes in her runs. Might it be useful to either (i) do a large run again with the changes made so far and see if we recover the original behaviour; (ii) write a super-discriminatory test (e.g. in a large population over a short-run, compare squeeze factors against a reference value). ...?

Just to point out @tbhallett that wrt (ii), squeeze factors will remain constant even as the HSI queue is changed, because squeeze factors are only meaningful in mode=1, where the refactoring of the queue doesn't lead to any differences anyway.

As for (i) I can merge the changes that @willGraham01 has just made and then resubmit a new batch of runs in mode 2 (with and without rescaling, for curiosity) to check how things look. @willGraham01, have these changes already been merged in master?

And @willGraham01, can I double check first whether in these changes you are now forcing the order of the modules in the generic_first_appts to be the same as it used to be originally?

@willGraham01
Copy link
Collaborator

willGraham01 commented May 23, 2024

As for (i) I can merge the changes that @willGraham01 has just made and then resubmit a new batch of runs in mode 2 (with and without rescaling, for curiosity) to check how things look. @willGraham01, have these changes already been merged in master?

And @willGraham01, can I double check first whether in these changes you are now forcing the order of the modules in the generic_first_appts to be the same as it used to be originally?

The branch I'm working on is #1366, it hasn't been merged into master yet. This commit forces the "old" order of the modules though, and everything seems to be consistent with 012ef6d (commit before #1292 was merged).

@marghe-molaro feel free to merge #1366 into an experimental branch, but it 100% shouldn't be merged into master yet 😅 I also haven't had a chance to work through the remaining commits in #1292 (to resolve the merge conflicts with master) and thus, you'll get a lot of merge conflicts too!

I can do this, this afternoon & tomorrow morning, and @marghe-molaro I can notify you when you should be able to merge #1366 into an experimental branch, and set off a batch job like you suggested to see how things look?

@marghe-molaro
Copy link
Collaborator Author

Perfect, sounds great @willGraham01! I'll be on standby for #1366.

@willGraham01
Copy link
Collaborator

@marghe-molaro #1366 now has master merged in and passes the script we've been using here. You can see the (now explicit) ordering of the modules in the generic appointments code.

Hopefully you should be able to merge this into your own branch 🤞 and run some scenarios to see if we've found the cause.

@marghe-molaro
Copy link
Collaborator Author

Hi @willGraham01 @matt-graham @tbhallett,

The test run from #1366 has now completed. We seem indeed to recover the behaviour that we were observing at the start (compare red line here with "No growth" case in left plot at the start of this Issue), i.e. when rescaling mode 2 there is no dramatic increase in AIDS DALYs after switching modes .

Yearly_DALYs_AIDS_After_Fix

While it is reassuring that we recover the original results, if the true cause of the discrepancy was the order in which the modules are called during the first generic appointment it is concerning that this would be observed on 100,000 population sized runs. However @willGraham01 if I understood correctly there were some other minor fixes you brought in with #1366, so maybe/hopefully the blame, for some for now non-identified reason, was with those?

We could maybe resubmit a run after reshuffling the order in which modules are called in #1366 to check whether this really is the source of the discrepancy? If we confirmed this to be the case this artificial effect would require some quite careful consideration.

@tbhallett
Copy link
Collaborator

While it is reassuring that we recover the original results, if the true cause of the discrepancy was the order in which the modules are called during the first generic appointment it is concerning that this would be observed on 100,000 population sized runs. However @willGraham01 if I understood correctly there were some other minor fixes you brought in with #1366, so maybe/hopefully the blame, for some for now non-identified reason, was with those?

We could maybe resubmit a run after reshuffling the order in which modules are called in #1366 to check whether this really is the source of the discrepancy? If we confirmed this to be the case this artificial effect would require some quite careful consideration.

Strong support these sentiments and the desire to understand if/how the ordering of processing in the first appointments has such a big effect.

@matt-graham
Copy link
Collaborator

However @willGraham01 if I understood correctly there were some other minor fixes you brought in with #1366, so maybe/hopefully the blame, for some for now non-identified reason, was with those?

Will's away this week, but as far as I'm aware other than the module order change in #1366, the other changes were mainly changes to ensure calls to random number generator happened in same order to give consistent results (which would hopefully also not have a major affect on distribution of results with a large population). From quickly scanning the diff, the one change I can see that might affect model behaviour is a change to a conditional statement in the diarrhoea module to add an additional clause (have just tagged you both in a comment on this line as couldn't see an easy way to link from here).

@marghe-molaro
Copy link
Collaborator Author

marghe-molaro commented May 28, 2024

Thank you @matt-graham! I was hoping you might be able to help me clarify something while @willGraham01 is away: in one of his previous replies he said that:

  • The failure cannot be because one of the modules in the main loop is editing any shared dataframe properties, since lines 164-180 occur after the dataframe update from the main loop is applied, but still use the "cached" values that the main loop utilised.

My understanding though is that prior @willGraham01's revision, modules were not only using shared dataframe properties during _do_on_generic_first_appt, but also modifying them, hence the need for this loop to exist at the end of the function:

        if proposed_patient_details_updates:
            df.loc[
                self.target, proposed_patient_details_updates.keys()
            ] = proposed_patient_details_updates.values()

Moving to the use of "cached" values would therefore necessarily lead to differences, as modules would now be accessing the individual's original parameters, rather than those updated by the modules called before inside _do_on_generic_first_appt. Am I missing something?

@tbhallett
Copy link
Collaborator

It may be a moot point, but I also noticed that this section, will lead to the updating not working as expected if more than module is trying to update the same property. (The updates from the last module will "win".)

if module_patient_updates:
proposed_patient_details_updates = {
**proposed_patient_details_updates,
**module_patient_updates,
}

@tamuri
Copy link
Collaborator

tamuri commented May 28, 2024

As was mentioned in the catch-up meeting, this is now a blocker on several analyses. I'm at a workshop today, but will have time tomorrow to dig into it. I think if the situation looks difficult to fix quickly we should rollback the refactoring changes so epi team can continue on with their work.

@marghe-molaro
Copy link
Collaborator Author

Hi @tamuri, do you have any suggestions as to how we should proceed? If we do roll back is that going to come in through master? Many thanks in advance!

@matt-graham
Copy link
Collaborator

Thank you @matt-graham! I was hoping you might be able to help me clarify something while @willGraham01 is away: in one of his previous replies he said that:

  • The failure cannot be because one of the modules in the main loop is editing any shared dataframe properties, since lines 164-180 occur after the dataframe update from the main loop is applied, but still use the "cached" values that the main loop utilised.

My understanding though is that prior @willGraham01's revision, modules were not only using shared dataframe properties during _do_on_generic_first_appt, but also modifying them, hence the need for this loop to exist at the end of the function:

        if proposed_patient_details_updates:
            df.loc[
                self.target, proposed_patient_details_updates.keys()
            ] = proposed_patient_details_updates.values()

Moving to the use of "cached" values would therefore necessarily lead to differences, as modules would now be accessing the individual's original parameters, rather than those updated by the modules called before inside _do_on_generic_first_appt. Am I missing something?

Hi @marghe-molaro, sorry for the delay replying. You are completely correct here and this is something I missed when reviewing #1292. The intention had been for the previous model behaviour to be retained (beyond possibly calling the module specific logic in a different order as we hadn't envisaged this would lead to changes in distributional output even if specific realisations were different). We should be updating the data structure used to mediate access to the individual properties within the loop when a module makes an update rather than always using the initially read cached values. The simplest way to restore the previous behaviour would be to move the code for updating the dataframe if there are updates inside the loop and also recreate the patient_details structure when doing this, that is something like

        patient_details = self.sim.population.row_in_readonly_form(self.target)
        for module in modules.values():
            module_patient_updates = getattr(module, self.MODULE_METHOD_ON_APPLY)(
                patient_id=self.target,
                patient_details=patient_details,
                symptoms=symptoms,
                diagnosis_function=self._diagnosis_function,
                consumables_checker=self.get_consumables,
                facility_level=self.ACCEPTED_FACILITY_LEVEL,
                treatment_id=self.TREATMENT_ID,
                random_state=self.module.rng,
            )
            if module_patient_updates:
                self.sim.population.props.loc[
                    self.target, module_patient_updates.keys()
                ] = module_patient_updates.values()
                patient_details = self.sim.population.row_in_readonly_form(self.target)

Combined with the changes Will made in #1366 to fix the module ordering etc. this should give us identical behaviour to previously.

Implementing exactly as above will however mean we may lose some of the performance benefit we were aiming for on #1292 as we'd now be updating the dataframe multiple times again. I think we can avoid this with a slightly more involved fix which will mean updating the PatientDetails data structure to allow both reading and writing (but only synchronizing the writes back to the dataframe when explicitly requested to allow us to still do this once). I think I can probably get a PR for this more involved submitted by tomorrow. After discussing with @tamuri what we're proposing is therefore that I'll aim to get a PR submitted to make this fix ASAP with a deadline of tomorrow midday. If for some reason this turns out to be more complicated than I expected we'll fall back to using the simpler (less performant) fix above, so either way we should have something ready to restore previous behaviour by tomorrow afternoon. Does that sound reasonable @marghe-molaro?

In the mean time I've created a new branch mmg/hsi-event-individual-updates-quickfix off from the branch in #1366 with the quick fix above for ensuring any module updates to the individual properties are reflected in the cached values immediately. You could potentially create a new branch from your analysis branch and merge in that branch and trigger a new Azure run to check if that does completely restore behaviour (or potentially run with a different module ordering to check if this is the origin of strong module ordering dependence).

It may be a moot point, but I also noticed that this section, will lead to the updating not working as expected if more than module is trying to update the same property. (The updates from the last module will "win".)

if module_patient_updates:
proposed_patient_details_updates = {
**proposed_patient_details_updates,
**module_patient_updates,
}

Thanks for also flagging this @tbhallett. In this case I think (beyond the issue with updates not being synced back to the cached values in patient_details dicussed above) this would work equivalently to the previous behaviour where if multiple modules update a particular property only the update from the last module would be reflected in the final population dataframe. This does mean there would be some module ordering dependence but we in this case would at least still be consistent with previous behaviour if using the same module ordering.

@tbhallett
Copy link
Collaborator

Thanks so much for this @matt-graham, @tamuri and @marghe-molaro

It's great that it seems like we'll be able to restore the original behaviour of the model and retain the performance benefits of the refactoring.

I have only one remaining question:

I was surprised that apparently the ordering of the modules being called in hsi_generic_first_appts.py seemed to have such a big effect. I would not have expected this. So, when we have the original behaviour of the model, I would keen to understand if that is right, and the cause of it. When we have the fix, please could we possibly discuss/investigate that too?

@matt-graham
Copy link
Collaborator

I have only one remaining question:

I was surprised that apparently the ordering of the modules being called in hsi_generic_first_appts.py seemed to have such a big effect. I would not have expected this. So, when we have the original behaviour of the model, I would keen to understand if that is right, and the cause of it. When we have the fix, please could we possibly discuss/investigate that too?

Agreed this is definitely something we should investigate - will create an issue to remind us to do this.

@marghe-molaro
Copy link
Collaborator Author

marghe-molaro commented Jun 6, 2024

Hi @matt-graham @willGraham01 @tamuri @tbhallett,

Here are the preliminary results from tests on the sensitivity of the output to module order, based on 3 runs per draw with a pop of 100,000. The cases I considered are as follows:

At this link I include the time evolution of total DALYs and of all causes. I am still sorting through them, but my first impression is that things broadly speaking look reasonable, as we wouldn't expect the runs to be identical in any case. But I wonder if some causes (esp. TB and depression, which are quite common) are indicative that a population of 100,000 may be a bit small. Let me know what you think.

@tbhallett
Copy link
Collaborator

Thanks very much @marghe-molaro

I think this is very convincing. We don't expect the order to make a difference and we see no evidence of it having an effect. I think we can be reassured by this and move on!

@matt-graham
Copy link
Collaborator

Thanks @marghe-molaro. Agree with @tbhallett, this looks like strong evidence module order not having an effect as expected. I'll link to your comment in the issue we raised to check about this and close it.

@marghe-molaro
Copy link
Collaborator Author

Thank you @matt-graham. I will also close this issue, as I believe we have now established the causes of the original discrepancy, which has been fixed by PR #1389.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working epi
Projects
None yet
Development

No branches or pull requests

5 participants