How to do a model analysis
This document explains how to do a simple model analysis, beginning with thinking of a model "experiment" and ending with the presentation of results as a plot.
Your system is set up correctly if you can achieve the following two things:

- Run all tests from the tip of branch `master` without errors. If this is not achieved, make sure you have run through all the steps on the Installation page.
- At the command line run `tlo batch-list`. You should see a list of jobs. If not, make sure you have run all the steps on the Setting up Azure Batch page.
A high level description of the key object types in `TLOmodel` and how they relate to each other is given in the entity relationship diagram below:

```mermaid
erDiagram
    Simulation ||--|{ Module : registers
    Module ||--|{ Event : schedules
    Module ||--|{ Parameter : defines
    Module ||--|{ Property : defines
    EventQueue ||--|{ Event : prioritizes
    Event }|--|| Population : updates
    Simulation ||--|| Population : contains
    Population ||--|{ Property : contains
    Simulation ||--|| EventQueue : controls
```
For the purposes of running a model analysis, the three key object types we need to know about are the simulation, modules and parameters:

```mermaid
erDiagram
    Simulation ||--|{ Module : registers
    Module ||--|{ Parameter : defines
    Simulation {
        int seed
        int initial_pop_size
        Date date
        Date end_date
        dict log_config
    }
    Module {
        str name
        RandomState rng
    }
    Parameter {
        Type type
        str description
    }
```
While we can interact directly with `Simulation` objects, typically we instead create a `Scenario`, which encapsulates multiple simulation runs across one or more sets of parameters corresponding to different settings of interest. By running simulations multiple times for each set of parameters we can capture the statistical variability in the simulation outcomes due to the randomness that enters the simulation through draws from (pseudo-)random number generators.
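As a toy illustration of this idea (not TLOmodel code; the outcome function here is invented), running the same stochastic computation with different seeds gives different results, while the same seed always reproduces the same result:

```python
import random

def toy_outcome(seed):
    # A stand-in for a simulation run: deterministic given its seed
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(100))

# Same seed reproduces the same outcome; different seeds give different
# outcomes, so averaging over several seeds captures the Monte Carlo variability
outcomes = [toy_outcome(seed) for seed in range(3)]
```

This is why each run within a draw is given its own seed, as described below.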
```mermaid
erDiagram
    Scenario ||--|{ Draw : contains
    Draw ||--|{ Sample : contains
    Scenario {
        int seed
        Date start_date
        Date end_date
        int pop_size
        dict log_configuration
        list modules
    }
    Draw {
        int draw_number
        int draw_seed
        dict parameters
    }
    Sample {
        int sample_number
        int simulation_seed
    }
```
The top-level `Scenario` object controls the global set-up of the simulations - for example the simulation start and end dates, initial population size, set of modules to register and configuration for the loggers used to record the simulation output.

A `Scenario` will constitute one or more draws, with each draw corresponding to a specific set of model parameters (associated with each module) to run simulations for. Each module sets default values for the parameters it defines (often by reading in values from external resource files), with each draw defining overrides to these default values (any parameters not specified are left at their default values).

Each draw is itself associated with one or more samples or runs, corresponding to running the simulation with the parameters for that draw and a specific seed for the random number generators used in the simulation. By having each run use a different random seed, we sample from the distribution of simulation outcomes / outputs given a set of parameter values.
For example, for a scenario with `number_of_draws = 2` and `runs_per_draw = 3` we would have something like the following:

```mermaid
graph TD;
    subgraph S[ ]
    Scenario --> d1[Draw 1];
    Scenario --> d2[Draw 2];
    subgraph D1[ ]
    d1 --> s11[Sample 1/1];
    d1 --> s12[Sample 1/2];
    d1 --> s13[Sample 1/3];
    end
    subgraph D2[ ]
    d2 --> s21[Sample 2/1];
    d2 --> s22[Sample 2/2];
    d2 --> s23[Sample 2/3];
    end
    end
```
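The draw/sample structure above can be sketched as an enumeration of (draw, sample) pairs (a simple illustration, not how TLOmodel itself indexes runs):

```python
from itertools import product

number_of_draws = 2
runs_per_draw = 3

# Every combination of draw index and sample index: six runs in total,
# e.g. (0, 0) corresponds to Sample 1/1 and (1, 2) to Sample 2/3
samples = list(product(range(number_of_draws), range(runs_per_draw)))
```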
Azure Batch is a service offered as part of the Microsoft Azure cloud platform for running batches of jobs in parallel on a pool of compute nodes.
The `TLOmodel` package contains a command line interface (CLI) tool `tlo` for submitting jobs corresponding to the runs (samples) of a particular `Scenario` to Azure Batch, checking the status of jobs running on Azure Batch and retrieving the results of completed jobs.
On Azure Batch, a submitted job is executed in parallel across a collection of nodes organized into a pool.
```mermaid
graph TD
    subgraph P["Pool"]
    subgraph N1["Node 1"]
    end
    subgraph N2["Node 2"]
    end
    subgraph N3["Node 3"]
    end
    subgraph N4["Node 4"]
    end
    subgraph N...["..."]
    end
    end
```
An Azure Batch job consists of a list of one or more tasks which are distributed across the nodes in a pool. The `tlo` CLI creates one task per scenario sample and requests a pool of size equal to the total number of samples, with one sample running on each node.
```mermaid
graph TD
    subgraph P["Pool"]
    subgraph N1["Node 1"]
    end
    subgraph N2["Node 2"]
    end
    subgraph N3["Node 3"]
    end
    subgraph N4["Node 4"]
    end
    subgraph N...["..."]
    end
    end
    subgraph Tasks
    s11
    s12
    s13
    s21
    s...[...]
    end
    N1 -.- s11[Sample 1/1]
    N2 -.- s12[Sample 1/2]
    N3 -.- s13[Sample 1/3]
    N4 -.- s21[Sample 2/1]
```
Say we wish to understand the loss in health outcomes in a population caused by some consumables being unavailable. We could examine this in the model by running the model under two different settings:
- A setting which mimics the real-world situation where some consumables are sometimes not available.
- A counterfactual setting in which all consumables are always available.
The difference in the health outcomes in the model between these two scenarios will be an estimate of the effect of consumables sometimes not being available.
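As a numerical sketch of this comparison (the run outcomes below are hypothetical numbers, not model output), the effect estimate is the difference in mean outcome between the two settings:

```python
# Hypothetical total deaths from three runs under each setting
deaths_some_unavailable = [105.0, 98.0, 102.0]  # consumables sometimes unavailable
deaths_all_available = [90.0, 88.0, 95.0]       # all consumables always available

def mean(values):
    return sum(values) / len(values)

# Estimated loss in health outcomes due to consumables being unavailable
effect_estimate = mean(deaths_some_unavailable) - mean(deaths_all_available)
```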
The example code is given below and is also found on the branch `hallett/example-analysis` in `src/scripts/analysis_example/`.
The file defining a scenario is accessed on the Azure Batch nodes by fetching a Git commit from the `TLOmodel` repository on GitHub. To be able to submit a scenario job to Azure Batch we therefore need to create a new local Git branch which will be used to track the files associated with the scenario on your computer, add and commit the changes to the scenario files to this branch, and push these changes to GitHub.

Typically we would want to create a branch from the latest commit on the `master` branch.
```mermaid
%%{init: { 'gitGraph': { 'mainBranchName': 'master'}} }%%
gitGraph
    commit
    commit
    branch user-prefix/name-of-branch
    checkout user-prefix/name-of-branch
    commit
```
In this case we would first switch to (checkout) the `master` branch locally. In PyCharm this can be done by going to Git > Branches... from the main menu bar (or clicking the branch selector on the status bar at the bottom of the interface), selecting the `master` branch from the Local branches section of the resulting list and then selecting the Checkout option.
Once we have checked out our local `master` branch we need to make sure it is up to date with any changes made on the corresponding `master` branch on the remote repository on GitHub since we last updated. To do this select Git > Pull... from the main menu bar, ensure that the dialog box shows `git pull origin master` (`pull` is the name of the Git command to fetch changes from a remote branch and merge them into the local branch; `origin` is the default name for the remote repository) and then click the Pull button.
Now that our `master` branch is up to date with the latest commits on GitHub we are ready to create a new branch. To do so, select Git > New Branch... from the main menu bar. In the resulting dialog enter the new branch name in the text field. As a convention, it can be useful to give all your branches a common prefix separated from the rest of the branch name by a forward slash, for example `user-prefix/name-of-branch`, as this will make it easier to find your branches in the PyCharm branch list and on GitHub. Once you have entered the branch name, ensure the Checkout branch tick box is ticked and then click the Create button.
For the purposes of this tutorial, rather than creating a new branch from `master` we will instead create a branch from the `hallett/example-analysis` branch, as this already contains the example scenario files we will use here.
```mermaid
%%{init: { 'gitGraph': { 'mainBranchName': 'master'}} }%%
gitGraph
    commit
    commit
    branch hallett/example-analysis
    checkout master
    commit
    commit
    checkout hallett/example-analysis
    commit
    commit
    branch user-prefix/example-analysis
    commit
```
To do this in PyCharm, first select Git > Update Project... from the main menu bar, check that the Merge incoming changes into the current branch option is selected in the dialog box that appears, then click OK. This will ensure changes on all branches on the remote repository are fetched to the local repository.
Then select Git > Branches... from the main menu bar, and under the Remote branches section in the resulting list select the `origin/hallett/example-analysis` branch and choose to create a new branch from it. In the New branch name field in the resulting dialog box enter the name `{user-prefix}/example-analysis`, where `{user-prefix}` is replaced with a short prefix unique to you - for example your initials or surname - then click the Create button.
Once we have created a new branch, the next step will typically be to create a Python file defining the scenario for the analysis we wish to run. Within the top-level `TLOmodel` directory, the `src` directory (shorthand for source code) contains two sub-directories: `tlo` (which contains the source code for the `tlo` Python package defining the model framework) and `scripts`. Any analysis-specific files you create, for example a file defining a scenario or a script to analyze the results of a scenario run, should be placed in a relevantly named subdirectory of `scripts`. Here we will use the `src/scripts/analysis_example` directory already created on the `hallett/example-analysis` branch.
To navigate to the `src/scripts/analysis_example` directory we can use the tree navigator in the Project tab on the left-hand side of the interface. Within the `analysis_example` directory you should see two `.py` Python source files:

- `analysis_impact_of_consumables_availability.py`
- `scenario_impact_of_consumables_availability.py`
It is the latter of these two files we will be interested in for now, which defines the scenario class we will use in our analysis. If we were creating a new analysis scenario from scratch we could instead right click on the relevant subdirectory within the `src/scripts` directory in the Project tree explorer in PyCharm, select New > Python file from the context menu, and then give the file an appropriate name (typically all lower case with underscore characters `_` as word separators).
We will now break down how we define a scenario programmatically for the analysis described above by explaining each part of the `scenario_impact_of_consumables_availability.py` file.
```python
from tlo import Date, logging
from tlo.methods.fullmodel import fullmodel
from tlo.scenario import BaseScenario
```
We first import the names of objects we need to use from other Python modules (libraries). Here we import various names from the `tlo` package which contains the `TLOmodel` code -

- `Date` - a type for representing dates in the simulation.
- `logging` - a module which defines the types and functions for logging simulation output.
- `fullmodel` - a function which returns a list of all `Module` instances that should be registered in a run of the full model.
- `BaseScenario` - an abstract type that custom scenario classes should inherit from, which implements some shared functionality.
```python
class ImpactOfConsumablesAvailability(BaseScenario):
```

We define a new custom subclass of `BaseScenario` called `ImpactOfConsumablesAvailability` - the name we use for the scenario class should be descriptive of what the scenario is meant to help analyze. Python class names should generally be written in upper camel case (CapWords).
```python
    def __init__(self):
        super().__init__(
            seed=0,
            start_date=Date(2010, 1, 1),
            end_date=Date(2015, 1, 1),
            initial_population_size=10_000,
            number_of_draws=2,
            runs_per_draw=2,
        )
```
The initializer method `__init__` of the class performs the initial set-up of the class, including setting class attribute values. The `self` argument here will be the instance of the newly created class. The block starting with the `super().__init__(` line calls the `__init__` method of the superclass of the object - here `BaseScenario`. The `BaseScenario.__init__` method accepts several keyword arguments which allow us to set attributes that need to be defined for any scenario:
- `seed`: The top-level integer seed used for generating per-run simulation seeds for the random number generators (here zero).
- `start_date`: Date to start the simulation at (here January 1st 2010).
- `end_date`: Date to end the simulation at (here January 1st 2015).
- `initial_population_size`: Number of individuals to initialize the population with (here 10 000).
- `number_of_draws`: Number of draws (distinct parameter sets) over which to run the simulation (here two).
- `runs_per_draw`: Number of independent model runs to perform per draw (here two).
```python
    def log_configuration(self):
        return {
            'filename': 'impact_of_consumables_availability',
            'directory': './outputs',
            'custom_levels': {
                '*': logging.WARNING,
                'tlo.methods.demography': logging.INFO,
                'tlo.methods.healthburden': logging.INFO,
            }
        }
```
The `log_configuration` method should return a dictionary configuring the loggers used to record the simulation output. A structured logging system is used with a hierarchy of loggers for the different modules, and it is possible to control the verbosity of the output of each logger by specifying custom levels - for more details on how to configure logging see the Structured logging wiki page. Here we specify the default logging level to be `WARNING` (a relatively high level which will give sparse output for most modules, as only log messages at level `ERROR` or `WARNING` will be outputted), but specify a lower `INFO` logging level for two key modules of interest, `Demography` and `HealthBurden`, to ensure we record the output we need for our analysis from these modules. As a general principle it is good to keep log levels as high as possible while still retaining the log output needed for a specific analysis: log files can grow very large for long simulations when log levels are set too low, which can lead to simulations exceeding the resources on nodes and jobs failing.
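The level names used here mirror the Python standard-library `logging` levels, which are ordered integers; a minimal sketch using only stdlib `logging` (not TLOmodel's structured logger) shows why a `WARNING` default filters out `INFO` messages:

```python
import logging

# Levels are ordered integers: DEBUG < INFO < WARNING < ERROR
assert logging.DEBUG < logging.INFO < logging.WARNING < logging.ERROR

# A logger set to WARNING suppresses INFO messages but lets ERROR through
logger = logging.getLogger("example.demography")
logger.setLevel(logging.WARNING)
assert not logger.isEnabledFor(logging.INFO)
assert logger.isEnabledFor(logging.ERROR)
```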
```python
    def modules(self):
        return fullmodel(resourcefilepath=self.resources)
```
The `modules` method should return a list of modules (instances of subclasses of the `tlo.Module` class corresponding to self-contained model components) to be registered in the simulation. This list can be constructed explicitly by importing module classes from the relevant Python modules (with module here instead referring to a single importable Python file) in `tlo.methods`, which can be useful if you wish to run a simulation with only a subset of the available modules included. If you wish to run a scenario with all currently 'complete' modules, a helper function `fullmodel` is provided in `tlo.methods.fullmodel`, which will return a list of instances of the relevant modules. The `fullmodel` function has one required argument - the path to the directory containing the resource files to use when initializing the modules, which is here set to the `resources` attribute automatically defined for instances of `BaseScenario` as a relative path `./resources`. There are also two optional arguments to `fullmodel`:
- `use_simplified_births`: A boolean argument specifying whether to use the `SimplifiedBirths` module in place of the full pregnancy related modules. This reduces the computational cost (and so run time) of the simulation at the cost of using a much simpler and less realistic model of birth events.
- `module_kwargs`: A dictionary mapping from module class names to dictionaries of keyword argument names and values to set for the module.
```python
    def draw_parameters(self, draw_number, rng):
        return {
            'HealthSystem': {'cons_availability': ['default', 'all'][draw_number]}
        }
```
The `draw_parameters` method should, given an integer draw number `draw_number` between 0 and `number_of_draws - 1` (inclusive) and a seeded random number generator `rng`, return a dictionary mapping from module names to dictionaries specifying override values for the parameters in that module for this particular draw. This method is what controls the different parameter sets used for each draw of the scenario. The dictionaries specifying the parameter overrides for each module should map from string parameter names to the parameter value to use in place of the default for this draw.
In the example here, a single parameter `cons_availability` (controlling consumable availability) in the `HealthSystem` module is overridden, with draw number 0 using the string value `'default'` (corresponding to the default behaviour of consumables not always being available) and draw number 1 using the string value `'all'` (corresponding to all consumables always being available). Indexing into a list of values to use for a parameter for each draw number is a simple way to set parameter values when only one parameter is being changed across draws. An alternative approach for generating grids corresponding to Cartesian products of sets of values for multiple parameters is demonstrated below. The `rng` random number generator can also be used to generate random values for parameters to use within a Monte Carlo method.
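As a sketch of the grid approach, one could build the Cartesian product of the value lists with `itertools.product` and index into it with the draw number (the second module and parameter below are hypothetical, purely for illustration):

```python
from itertools import product

cons_availability_values = ["default", "all"]
some_rate_values = [0.5, 1.0]  # hypothetical second parameter

# Draw numbers 0..3 map to the four combinations in the grid
parameter_grid = list(product(cons_availability_values, some_rate_values))

def draw_parameters(draw_number, rng):
    cons_availability, some_rate = parameter_grid[draw_number]
    return {
        "HealthSystem": {"cons_availability": cons_availability},
        "SomeOtherModule": {"some_rate": some_rate},  # hypothetical names
    }
```

With `number_of_draws` set to `len(parameter_grid)`, each draw then gets one combination of the parameter values.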
Currently the scenario is set to run for 5 years with an initial population of 10 000 and with the full set of pregnancy related modules, with this configuration taking tens of minutes for each simulation run. To allow the scenario to be run more quickly for testing purposes, update the definition of the scenario so that

- the initial population size is 1000,
- the simulation is run for 1 month from January 1st 2010,
- the `use_simplified_births` argument to `fullmodel` is set to `True` to use the `SimplifiedBirths` module in place of the full pregnancy related modules.
Once you have made the necessary changes, use the Commit tab in PyCharm (typically in the left sidebar) to add these changes to the staging area (select the `scenario_impact_of_consumables_availability.py` file using the tick box within the Default Changelist tree view) and commit to your personal `{user-prefix}/example-analysis` branch (enter a descriptive commit message in the text field below the change list and then click the Commit button).
To run scenarios we can use the `tlo` command line tool. PyCharm has a built-in terminal which can be used to run `tlo` commands - it should be available from the Terminal tab at the bottom of the interface or by going to View > Tool Windows > Terminal from the main menu bar.
We can run the scenario file we just edited

- locally by running

  ```shell
  tlo scenario-run src/scripts/analysis_example/scenario_impact_of_consumables_availability.py
  ```

- or as a batch job on Azure by running

  ```shell
  tlo batch-submit
  ```

To use the latter command, the changes we made to the scenario file need to be pushed to GitHub. Before we push these changes, however, we should test they work locally first.
Use the `tlo scenario-run` command to run the edited scenario locally from the terminal in PyCharm. If the scenario runs without errors you should see output like the following (some details such as file paths will differ):
```
Found class ImpactOfConsumablesAvailability in src/scripts/analysis_example/scenario_impact_of_consumables_availability.py
{"uuid": "8159d34ec0", "type": "header", "module": "tlo.scenario", "key": "message", "level": "INFO", "columns": {"message": "str"}, "description": null}
{"uuid": "8159d34ec0", "date": "0000-00-00T00:00:00", "values": ["Loaded scenario using /tmp/tmpqlv7e9xs"]}
{"uuid": "8159d34ec0", "date": "0000-00-00T00:00:00", "values": ["Found 2 draws; 2 runs/draw"]}
{"uuid": "8159d34ec0", "date": "0000-00-00T00:00:00", "values": ["Running draw 0, sample 0"]}
Processing log file outputs/impact_of_consumables_availability-2023-01-24T093713Z/0/0/impact_of_consumables_availability__2023-01-24T093713.log
Writing module-specific log files to outputs/impact_of_consumables_availability-2023-01-24T093713Z/0/0
Finished writing module-specific log files.
Processing log file outputs/impact_of_consumables_availability-2023-01-24T093713Z/0/1/impact_of_consumables_availability__2023-01-24T093821.log
Writing module-specific log files to outputs/impact_of_consumables_availability-2023-01-24T093713Z/0/1
Finished writing module-specific log files.
Processing log file outputs/impact_of_consumables_availability-2023-01-24T093713Z/1/0/impact_of_consumables_availability__2023-01-24T093927.log
Writing module-specific log files to outputs/impact_of_consumables_availability-2023-01-24T093713Z/1/0
Finished writing module-specific log files.
Processing log file outputs/impact_of_consumables_availability-2023-01-24T093713Z/1/1/impact_of_consumables_availability__2023-01-24T094031.log
Writing module-specific log files to outputs/impact_of_consumables_availability-2023-01-24T093713Z/1/1
Finished writing module-specific log files.
```
Once the scenario has completed successfully locally, we can push the changes on the current branch to GitHub. To do this in PyCharm select Git > Push... from the main menu bar. In the resulting dialog you should see the `{user-prefix}/example-analysis` branch in the left pane, and a single file with changes (`scenario_impact_of_consumables_availability.py`) showing in the right pane. If this all looks correct click the Push button to push these changes to GitHub.
Once all changes on our local branch are pushed to GitHub we are ready to submit a job to Azure Batch using the `tlo batch-submit` command. Do this now.
When running scenarios on Azure we can see a list of currently running (active) jobs by running

```shell
tlo batch-list --active
```

To show a list of recently completed jobs instead run

```shell
tlo batch-list --completed
```
For a job identifier `{job-id}` (which has the general format `{stem_of_scenario_filename}-{timestamp}`) we can get more details about the progress of the job's tasks by running

```shell
tlo batch-job --tasks {job-id}
```

Once a job has completed we can download the scenario outputs by running

```shell
tlo batch-download {job-id}
```

where `{job-id}` is the relevant identifier for the job. By default this command will look for jobs run under the username specified in your `tlo.conf` configuration file. To download the results from a job run by a different user we can instead run

```shell
tlo batch-download --username {username} {job-id}
```

where `{username}` is the relevant user name (typically an email) and `{job-id}` is the job identifier as previously.
As a job submitted with the above scenario file will take a while to complete, for now we can use the results of a previous completed run by Tim (tbh03@ic.ac.uk) with job identifier `scenario_impact_of_consumables_availability-2023-01-12T131247Z`. Retrieve the results of this job by running

```shell
tlo batch-download --username tbh03@ic.ac.uk scenario_impact_of_consumables_availability-2023-01-12T131247Z
```

in the terminal in PyCharm.
Once we have the results from a scenario run we need to write code to extract the subset of the logged output we wish to use for our analysis, compute the summary statistics of interest and plot or otherwise visualise these. Typically in `TLOmodel` we place this analysis code in a separate script in the same directory as the Python file defining the scenario which generates the results to be analyzed. In the current example, the relevant script is `src/scripts/analysis_example/analysis_impact_of_consumables_availability.py`.
The script extracts and plots key summary statistics from the scenario results. Below we describe some of the key parts of the script.
```python
import argparse
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from tlo.analysis.utils import (
    extract_params,
    extract_results,
    get_scenario_info,
    get_scenario_outputs,
    load_pickled_dataframes,
    make_age_grp_lookup,
    make_age_grp_types,
    summarize,
)
```
We first import the objects we will need. We will use the `argparse` module from the Python standard library to make a simple command line interface. Matplotlib, NumPy and Pandas are all third-party libraries that are key parts of the scientific Python ecosystem, providing plotting, numerical array and dataframe functionality respectively. Within the `tlo` package, the `tlo.analysis.utils` module defines various useful helper functions for use in analysis scripts, which we import here and will explain as we use them below.
```python
def extract_total_deaths(results_folder):

    def extract_deaths_total(df: pd.DataFrame) -> pd.Series:
        return pd.Series({"Total": len(df)})

    return extract_results(
        results_folder,
        module="tlo.methods.demography",
        key="death",
        custom_generate_series=extract_deaths_total,
        do_scaling=True
    )
```
We first define a function to get a coarse total deaths summary statistic (accumulated across the whole population and simulation time). The key function we use here is `extract_results`, which parses the log output for a specific module (here `tlo.methods.demography`), extracts the values associated with a particular key (here `"death"`) and applies a user provided function to generate the summary statistics of interest from the parsed log output dataframe (here the `extract_deaths_total` function). The `do_scaling` boolean argument, when `True`, scales the values associated with each simulation sample (run) by a per-run scale factor recorded in the log which adjusts for the relative size of the initial simulated population.
```python
def plot_summarized_total_deaths(summarized_total_deaths, param_strings):
    fig, ax = plt.subplots()
    number_of_draws = len(param_strings)
    statistic_values = {
        s: np.array(
            [summarized_total_deaths[(d, s)].values[0] for d in range(number_of_draws)]
        )
        for s in ["mean", "lower", "upper"]
    }
    ax.bar(
        param_strings,
        statistic_values["mean"],
        yerr=[
            statistic_values["mean"] - statistic_values["lower"],
            statistic_values["upper"] - statistic_values["mean"]
        ]
    )
    ax.set_ylabel("Total number of deaths")
    fig.tight_layout()
    return fig, ax
```
The function here uses the Matplotlib `bar` function to generate a bar plot summarizing the difference in the total deaths summary statistics across the different scenario draws. The variation across the samples of each draw is shown as error bars.
```python
def compute_difference_in_deaths_across_runs(total_deaths, scenario_info):
    deaths_difference_by_run = [
        total_deaths[0][run_number]["Total"] - total_deaths[1][run_number]["Total"]
        for run_number in range(scenario_info["runs_per_draw"])
    ]
    return np.mean(deaths_difference_by_run)
```
The function here gets the total deaths associated with the runs for each draw (set of parameters), computes the per-run differences between the two draws, and takes the mean across the runs using the NumPy `mean` function.
```python
def extract_deaths_by_age(results_folder):

    def extract_deaths_by_age_group(df: pd.DataFrame) -> pd.Series:
        _, age_group_lookup = make_age_grp_lookup()
        df["Age_Grp"] = df["age"].map(age_group_lookup).astype(make_age_grp_types())
        df = df.rename(columns={"sex": "Sex"})
        return df.groupby(["Age_Grp"])["person_id"].count()

    return extract_results(
        results_folder,
        module="tlo.methods.demography",
        key="death",
        custom_generate_series=extract_deaths_by_age_group,
        do_scaling=True
    )
```
To give a finer grained view of the difference between the different settings, the function here extracts total deaths for each age group in the population, again using the `extract_results` helper function on the logger outputs with the `"death"` key from the `tlo.methods.demography` module, but using a different function to extract the necessary data from the parsed logger output dataframe.
```python
def plot_summarized_deaths_by_age(deaths_summarized_by_age, param_strings):
    fig, ax = plt.subplots()
    for i, param in enumerate(param_strings):
        central_values = deaths_summarized_by_age[(i, "mean")].values
        lower_values = deaths_summarized_by_age[(i, "lower")].values
        upper_values = deaths_summarized_by_age[(i, "upper")].values
        ax.plot(
            deaths_summarized_by_age.index, central_values,
            color=f"C{i}",
            label=param
        )
        ax.fill_between(
            deaths_summarized_by_age.index, lower_values, upper_values,
            alpha=0.5,
            color=f"C{i}",
            label="_"
        )
    ax.set(xlabel="Age-Group", ylabel="Total deaths")
    ax.set_xticks(deaths_summarized_by_age.index)
    ax.set_xticklabels(labels=deaths_summarized_by_age.index, rotation=90)
    ax.legend()
    fig.tight_layout()
    return fig, ax
```
This function uses the Matplotlib line plot function `plot` and filled area plot function `fill_between` to show the variation in the total deaths across age groups for the two different scenario settings, with the mean across the runs shown as a solid line and the variation across runs represented by the filled area.
```python
if __name__ == "__main__":
```

This conditional `if` statement is a common pattern in Python scripts that allows the script to also be imported as a module without executing the script body, so that any functions or other objects defined in the module can be reused. The code inside this block will only be executed when the file is run as a script.
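A minimal sketch of the pattern (a hypothetical standalone file, not part of the analysis script):

```python
# contents of a hypothetical example_module.py
def greet(name):
    # Reusable when this file is imported as a module
    return f"Hello, {name}"

if __name__ == "__main__":
    # Only runs when the file is executed directly, e.g. `python example_module.py`
    print(greet("world"))
```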
```python
    # Parse command line arguments
    parser = argparse.ArgumentParser(
        "Analyse scenario results for testing impact of consumables availability"
    )
    parser.add_argument(
        "--scenario-outputs-folder",
        type=Path,
        required=True,
        help="Path to folder containing scenario outputs",
    )
    parser.add_argument(
        "--show-figures",
        action="store_true",
        help="Whether to interactively show generated Matplotlib figures",
    )
    parser.add_argument(
        "--save-figures",
        action="store_true",
        help="Whether to save generated Matplotlib figures to results folder",
    )
    args = parser.parse_args()
```
Here we use the `argparse` module from the Python standard library to define some command line arguments that can be used to control the script, and to parse the arguments passed when invoking the script.
```python
    # Find results_folder associated with a given batch_file and get most recent
    results_folder = get_scenario_outputs(
        "scenario_impact_of_consumables_availability.py", args.scenario_outputs_folder
    )[-1]
```
The `get_scenario_outputs` helper function gets all the scenario results associated with a given scenario script, ordered by the timestamp of when the scenario was submitted. Here we get the last (that is, the most recent) set of results.
```python
    # Load log (useful for checking what can be extracted)
    log = load_pickled_dataframes(results_folder)

    # Get basic information about the results
    scenario_info = get_scenario_info(results_folder)

    # Get the parameters that have varied over the set of simulations
    params = extract_params(results_folder)
```
We use three helper functions from `tlo.analysis.utils` to get various information about the scenario run:

- `load_pickled_dataframes` is used to (lazily) load the dataframes corresponding to the full log output. Having access to the full logging data can be useful when initially exploring the scenario outputs in the first stages of analysis.
- `get_scenario_info` is used to get basic information about the scenario - specifically the number of draws `number_of_draws` and number of runs per draw `runs_per_draw` - as a dictionary.
- `extract_params` is used to get the overridden parameters for each draw of the scenario as a dataframe.
```python
    # Create a list of strings summarizing the parameter values in the different draws
    param_strings = [f"{row.module_param}={row.value}" for _, row in params.iterrows()]
```
To give meaningful labels to the plots we construct strings of the form `{module_name}:{parameter_name}={value}` for each value of the (here) single parameter (`HealthSystem:cons_availability`) overridden in each draw of the scenario.
```python
    # We first look at total deaths in the scenario runs
    total_deaths = extract_total_deaths(results_folder)

    # Compute and print the difference between the deaths across the scenario draws
    mean_deaths_difference_by_run = compute_difference_in_deaths_across_runs(
        total_deaths, scenario_info
    )
    print(f"Mean difference in total deaths = {mean_deaths_difference_by_run:.3g}")
```
Using the helper functions defined previously we extract the total death statistics across the scenario draws and runs and compute and print the mean differences across runs.
```python
    # Plot the total deaths across the two scenario draws as a bar plot with error bars
    fig_1, ax_1 = plot_summarized_total_deaths(summarize(total_deaths), param_strings)
```
We then plot the total deaths for the two scenario draws as a bar plot with error bars to represent the variability across runs.
```python
    # Now we look at things in more detail with an age breakdown
    deaths_by_age = extract_deaths_by_age(results_folder)

    # Plot the deaths by age across the two scenario draws as line plot with error bars
    fig_2, ax_2 = plot_summarized_deaths_by_age(summarize(deaths_by_age), param_strings)
```
Using the previously defined helper functions, we extract and plot a breakdown of total deaths per age group for the two scenario draws.
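The age breakdown amounts to binning deaths into age groups; here is a sketch with hypothetical ages at death and an assumed five-year bin width (the actual grouping used by the helper may differ):

```python
# Toy record of ages at death; bin into five-year age groups
ages_at_death = [0, 3, 7, 12, 66, 71, 4]
bin_width = 5

counts = {}
for age in ages_at_death:
    group_start = (age // bin_width) * bin_width  # e.g. 7 -> the 5-9 group
    counts[group_start] = counts.get(group_start, 0) + 1
```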
```python
if args.show_figures:
    plt.show()
if args.save_figures:
    fig_1.savefig(results_folder / "total_deaths_across_scenario_draws.pdf")
    fig_2.savefig(results_folder / "death_by_age_across_scenario_draws.pdf")
```
Depending on the command-line arguments passed, we show the interactive figure windows and/or save the figures to the results folder.
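The flag handling can be sketched with `argparse`; the flag names follow this tutorial, but the real script's parser may differ in detail:

```python
import argparse
from pathlib import Path

# Sketch of the argument parsing the analysis script relies on
parser = argparse.ArgumentParser()
parser.add_argument("--scenario-outputs-folder", type=Path, required=True)
parser.add_argument("--show-figures", action="store_true")
parser.add_argument("--save-figures", action="store_true")

# Parse an example command line (normally parse_args() reads sys.argv)
args = parser.parse_args(["--scenario-outputs-folder", "outputs", "--show-figures"])
```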
To run the analysis script on the scenario results we downloaded earlier we can run

```shell
python src/scripts/analysis_example/analysis_impact_of_consumables_availability.py --scenario-outputs-folder outputs/tbh03@ic.ac.uk --show-figures
```

in the terminal in PyCharm. This will show the figures as interactive windows. To instead save the figures to files in the results folder we can run

```shell
python src/scripts/analysis_example/analysis_impact_of_consumables_availability.py --scenario-outputs-folder outputs/tbh03@ic.ac.uk --save-figures
```
The following functions in the `tlo.analysis.utils` module can be useful when analyzing the output from a scenario run:

- `extract_params` - gets the overridden parameters from scenario runs in the form of a dataframe summarizing the parameters that change across the draws.
- `extract_results` - unpacks the results of the successful runs in the form of a dataframe with a column multi-index for the draw/run.
- `summarize` - computes summary statistics: the mean value and 95% interval across the runs for each draw.
- `compute_mean_across_runs` - computes the mean across scenario runs of a dictionary of counters keyed by draw and run.
- `plot_stacked_bar_chart` - plots a stacked bar chart using count data binned over two levels of grouping.
- `get_grid` - creates the arrays needed to plot a heatmap; returns the grid as a dictionary.
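As a sketch of the kind of summary `summarize` produces, here is a simple mean-plus-interval computation over toy per-run values (the real helper computes a 95% interval; for simplicity this sketch just takes the min/max across runs):

```python
import statistics

# Toy results: one total-deaths value per run, for each of two draws
results = {0: [5000, 5200, 5100], 1: [4600, 4700, 4650]}

summary = {}
for draw, values in results.items():
    ordered = sorted(values)
    summary[draw] = {
        "mean": statistics.mean(ordered),
        # Crude interval: min/max across runs stands in for the 95% interval
        "lower": ordered[0],
        "upper": ordered[-1],
    }
```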
The above describes a very basic workflow. Here are some other things that you can do.
The `make_cartesian_parameter_grid` helper function in `tlo.scenario` can be used to generate a parameter grid corresponding to the Cartesian product of a set of values for each of a collection of module-specific parameters. This can be useful to create a scenario which uses draws corresponding to all combinations of a (small) collection of parameter values.
The function accepts a single dictionary argument, which should map from string module names to dictionaries mapping from string parameter names (for that module) to iterables (for example lists, sets, arrays) of parameter values to iterate across. It returns a list of nested dictionaries corresponding to the module specific parameter sets formed as the Cartesian product of all the sets of values specified for each parameter.
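The grid construction can be sketched with `itertools.product`; this version is simplified to a single module, and the actual `make_cartesian_parameter_grid` helper may differ in detail:

```python
from itertools import product

# Parameter values to combine, in the nested-dictionary form described above
param_values = {
    "HealthSystem": {
        "cons_availability": ["default", "all"],
        "beds_availability": ["default", "all"],
    }
}

# Build one nested dictionary per combination of parameter values
grid = []
for module, params in param_values.items():
    names = list(params)
    for combo in product(*(params[name] for name in names)):
        grid.append({module: dict(zip(names, combo))})
```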
For example, if we wished to extend our previous analysis to also consider the impact of bed availability and its interaction with consumable availability, we might wish to run a scenario studying four different settings
- A setting where both consumables and beds are sometimes not available.
- A setting where consumables are always available but beds are sometimes not.
- A setting where beds are always available but consumables are sometimes not.
- A setting where both consumables and beds are always available.
We could use the `make_cartesian_parameter_grid` helper in a scenario to generate the 'grid' of four different parameter sets corresponding to these settings as follows
```python
from tlo import Date, logging
from tlo.methods.fullmodel import fullmodel
from tlo.scenario import BaseScenario, make_cartesian_parameter_grid


class ImpactOfConsumablesAndBedAvailability(BaseScenario):
    def __init__(self):
        self._parameter_grid = make_cartesian_parameter_grid(
            {
                "HealthSystem": {
                    "cons_availability": ["default", "all"],
                    "beds_availability": ["default", "all"],
                }
            }
        )
        super().__init__(
            seed=0,
            start_date=Date(2010, 1, 1),
            end_date=Date(2015, 1, 1),
            initial_population_size=10_000,
            number_of_draws=len(self._parameter_grid),
            runs_per_draw=2,
        )

    def log_configuration(self):
        return {
            'filename': 'impact_of_consumables_and_bed_availability',
            'directory': './outputs',
            'custom_levels': {
                '*': logging.WARNING,
                'tlo.methods.demography': logging.INFO,
                'tlo.methods.healthburden': logging.INFO,
            }
        }

    def modules(self):
        return fullmodel(resourcefilepath=self.resources)

    def draw_parameters(self, draw_number, rng):
        return self._parameter_grid[draw_number]
```
A breadth of information, including tutorials at beginner, intermediate and advanced levels, can be found at matplotlib.org
- The class used to define the experiment is documented here: Scenario.
- The documentation for the command line interface tool is here: Interacting with Azure Batch