
Developer onboarding guide


Getting set up

For setting up a development environment via a terminal, see the instructions on the TLOmodel documentation site (you only really need Miniconda rather than a full Anaconda distribution). There is also a more detailed installation guide in the TLOmodel repository wiki aimed at setting up a development environment with PyCharm.
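As a rough sketch, a terminal-based setup might look like the following (the documentation site instructions are authoritative; the Python version here matches the currently supported py38 environment, and the requirements file name and editable install are assumptions about the repository layout):

conda create -n tlo python=3.8   # Miniconda environment with a supported Python version
conda activate tlo
pip install -r requirements/dev.txt   # pinned development dependencies (file name assumed)
pip install -e .   # install the tlo package in editable mode (assumed packaging setup)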

Potentially useful resources

  • Overview of framework - Word document. Top-level overview of how the model is organized.
  • Tutorial introduction to individual-based modelling - video, slides. Overview of the modelling approach.
  • How to do a model analysis - wiki page. Specifically the first part, giving a diagrammatic overview of the model structure and how the different parts relate to each other.
  • Coding conventions wiki page. Overview of some of the conventions for file organization and naming used in the model.

Repository directory structure

The main directories of interest in the repository are

  • .github/workflows - contains GitHub Actions YAML files defining workflows for continuous integration (run on all pushes to pull-request branches, merges into the master branch, and on a nightly cron schedule) and workflow files managing additional comment-triggered workflows.
  • docs - static reStructuredText files used by Sphinx for building the HTML documentation hosted at tlomodel.org, some associated Python scripts for generating parts of the documentation, and write-ups of each of the model modules as docx files in the writeups subdirectory.
  • outputs - default directory used for any outputs produced by simulations.
  • requirements - the input files used by pip-compile to produce pinned requirements files, together with the corresponding generated requirements files. base refers to the dependencies for installing and running the model; dev adds the dependencies needed to also run the tests, use tox for automation and build the requirements files.
  • resources - files (largely Excel .xlsx spreadsheets or comma-separated values files) containing data used to set default parameters for the model and used in calibrating it. Git LFS is primarily used with the files in this directory - if you get an error indicating a file in the resources directory cannot be found, most likely you only have the pointer file present and need to set up Git LFS and run git lfs fetch --all (see the example commands after this list).
  • src/scripts - scripts written by modellers / users for running analyses with the model. These tend to be mainly used by the individual researchers who wrote them, and we don't have any testing to ensure they stay up to date with changes in the code.
  • src/tlo - the top-level directory for the tlo Python package defining the modelling framework and model components.
  • src/tlo/methods - directory containing the individual modules of the model (in TLO terms a self-contained component of the model, often but not exclusively associated with a particular disease, rather than the usual Python meaning of module). Most of these modules have been primarily developed by one or more members of the modelling team.
  • tests - the pytest test modules defining the test functions. For the most part there is a one-to-one mapping from test modules to Python modules in the src/tlo package and subpackages, though some test modules aren't primarily associated with any one module (for example test_determinism.py and test_maternal_health_helper_and_analysis_functions.py).
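For example, if files under resources appear to contain only short pointer text rather than the real data, the following standard Git LFS commands (not TLOmodel-specific tooling) set up LFS for your user account and pull down the actual file contents:

git lfs install       # enable Git LFS for your Git configuration
git lfs fetch --all   # download all LFS objects for the repository
git lfs checkout      # replace pointer files in the working tree with the real contents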

Using tox

We use tox for automating some of the common tasks we perform with the TLOmodel code. The tox.ini file in the root of the repository defines various 'environments' - each of these specifies both a set of dependencies and commands to run. Running tox -e {environment_name} will create a clean virtual environment, install the dependencies for {environment_name} and then run the associated commands. Some of the more useful environments we have set up are

  • py38-pandas12 - run tests with Python v3.8 and Pandas v1.2 (current versions we support).
  • py311-pandas20 - run tests with Python v3.11 and Pandas v2.0 (versions we are hopefully going to move to soon).
  • docs - build documentation using Sphinx.
  • check - run checks on source code (flake8, isort, check-manifest) - this is run as part of GitHub Actions CI so it's useful to run this locally before pushing to catch any errors, particularly with import ordering.
  • profile - run the src/scripts/profiling/scale_run.py script with arguments specifying to simulate 5 years and show a progress bar, using pyinstrument to gather profiling data.
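For example, to run the source-code checks locally before pushing and then run the test suite against the currently supported Python and Pandas versions:

tox -e check
tox -e py38-pandas12

Both of these environments are also run as part of the GitHub Actions CI, so passing them locally is a good indication a pull request will pass the automated checks.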

Profiling

We typically use the scale_run.py script in src/scripts/profiling as the target for profiling runs; by default this performs a run of the full model with an initial population size and total simulation time currently judged to be reflective of what we would want to use in model analysis runs.
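The simplest way to start such a profiled run is via the tox environment described above, which wraps the script in pyinstrument with a standard set of arguments:

tox -e profile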

Some of the behaviour of the model (for example the availability of resources) is appropriately scaled by a factor controlled by the ratio of the simulated initial population size to the real initial population size, as computed in the Demography.compute_initial_model_to_data_popsize_ratio method, so runs with smaller population sizes can still be usefully interpreted. However, a larger initial population size will generally be expected to make the model better reflect the population being simulated. The default initial population size used in scale_run.py is therefore based on trading off model fidelity against ensuring runs can be completed in a reasonable time (as a ballpark, roughly 24 hours or less for a full run), and has been increased over time as the model has been made more performant.

Doing profiled runs of the model helps to identify where the key bottlenecks are, with the profiling output recording how much time is being spent in different parts of the call graph. Different profiling tools use different approaches to gathering this information. The results of profiling the scale_run.py script over time, along with some analysis of identified bottlenecks, are currently tracked in a GitHub issue.

Deterministic profilers like the built-in profile and cProfile modules trace all function calls; this gives a high degree of granularity to the recorded profiling statistics, but with the tradeoff that the overhead arising from the profiler's recording of statistics can distort the results, particularly for small functions that are called many times, where the overhead has a proportionally larger effect.

Statistical profilers such as pyinstrument instead record where the program is in the call stack at some regular interval (by default 1 ms for pyinstrument). This significantly reduces the overhead associated with profiling, reducing the bias in the results compared to non-profiled runs, at the cost of introducing some variance. Compared to the built-in profile and cProfile modules, pyinstrument also has the advantage of recording the full call stack rather than just the specific function being called: this is particularly useful in helping to identify where functions are being called from, as there are many functions in TLOmodel which are called from multiple different parts of the code.

The default behaviour of the built-in cProfile and profile modules is to output the profile results to stdout as a table, with one row per called function and columns showing the number of calls, the total time spent in the function (excluding calls to sub-functions), the cumulative time spent in the function (including calls to sub-functions) and the filename plus line number reference for the function. For complex codebases like TLOmodel, which have very large call graphs, this output is not always that interpretable. Alternatively the profiling data can be output to a file using the -o {filename} option, for example

python -m cProfile -o scale_run.prof src/scripts/profiling/scale_run.py

This can then be visualised using other applications. For example, SnakeViz allows viewing the profiling results in a browser as an interactive 'icicle' or 'sunburst' visualization, which represents the time spent in different functions in a more visual form. Pyinstrument also allows outputting to the same pstats output format used by cProfile using the option -r pstats, for example
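For example, assuming SnakeViz has been installed (e.g. with pip install snakeviz), the profile file written by the cProfile command above can be opened in the browser with:

snakeviz scale_run.prof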

pyinstrument -i 0.01 -r pstats -o scale_run.prof src/scripts/profiling/scale_run.py

where -i 0.01 sets the sampling interval to 0.01 seconds. Pyinstrument also has several other useful output formats, including the default behaviour of rendering as a labelled text-based call graph (-r text), rendering to an interactive HTML file (-r html) and rendering to a file that can be used with speedscope (-r speedscope).
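For example, the following commands write an interactive HTML report and a file that can be loaded into the speedscope web application respectively (the output file names here are arbitrary choices):

pyinstrument -r html -o scale_run.html src/scripts/profiling/scale_run.py
pyinstrument -r speedscope -o scale_run_speedscope.json src/scripts/profiling/scale_run.py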
