Skip to content

matsengrp/gapml

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GAPML: GESTALT analysis using penalized maximum likelihood

Associated paper: Feng, et al. (In press). "Estimation of cell lineage trees by maximum-likelihood phylogenetics". Annals of Applied Statistics. Preprint available on BioRxiv.

Installation

You will need to have PHYLIP mix installed so that you can call mix on the command line. http://evolution.genetics.washington.edu/phylip/

We use pip to install things into a python virtual environment. FIRST: if you want tree visualizations from ete3, you will need to do something special since we are using pip (rather than conda). Install PyQt5 (sudo apt-get on linux, brew install on mac. For mac, see https://stackoverflow.com/questions/39821177/python-pyqt-on-macos-sierra). Then run pip install PyQt5 in your virtual environment. AFTERWARDS: Install the libraries into your virtual environment through the requirements.txt file.

We use nestly + SCons to run simulations/analyses. You should install scons outside the virtual environment, for a python 2.* or a 3.5+ environment. Then activate the virtual environment and then run scons ___.

You need to create a file constant_paths.py in the gestalt folder that provides the paths to the different executables. (mix and bhv distance calculators) For example:

MIX_PATH = "~/my_path/mix" # download from phylip website
BHV_PATH = "~/my_path/gtp.jar" # download from http://comet.lehman.cuny.edu/owen/code.html
RSPR_PATH = "~/my_path/rspr" # download from https://github.com/cwhidden/rspr

GESTALT pipeline

generate_data.py: make data

restrict_observed_barcodes.py: restrict to observing the first K alleles

get_parsimony_topologies.py or get_collapsed_oracle.py: create tree topologies to fit to

tune_topology.py: to select the best topology given a set of possible topologies

  • --obs-file: Pickle file with observations generated by read_gestalt_data.py or generate_data.py
  • --topology-file: Pickle file with candidate topologies from get_parsimony_topologies.py
  • --out-model-file: Name of output file.
  • --log-file: Name of log file
  • --branch-pen-params: Candidate penalty parameters for penalizing differences between branch lengths. We will tune over these using a variant of cross-validation.
  • --target-lam-pen-params: Candidate penalty parameters for penalizing differences between target cut rates. We will tune over these using a variant of cross-validation.
  • --num-penalty-tune-iters: Number of iterations we should spend on tuning the penalty parameters.
  • --num-penalty-tune-splits: Number of data splits to create (typically 3 - 5)
  • --max-fit-splits: Maximum number of data splits to use for fitting and tuning the penalty parameters (typically same value as --num-penalty-tune-splits, but you can use a smaller number to speed up the computation)
  • --num-chad-tune-iters: Number of iterations to spend on tuning the tree topology with subtree prune and regraft (SPR) moves.
  • --num-chad-stop: If we don't change the tree topology for this many steps, then stop tuning the topology.
  • --max-chad-tune-search: Maximum number of SPR moves to consider at each iteration
  • --max-extra-steps: Maximum number of hidden cuts we should consider when approximating the likelihood. (A large number means more computation time.)
  • --max-sum-states: Maximum number of ancestral states to sum over when computing the likelihood. (A large number means more computation time.)
  • --max-iters: Maximum number of iterations for tuning branch lengths and mutation parameters.
  • --num-inits: Number of initializations to try when minimizing penalized log likelihood with respect to the branch lengths and mutation parameters

convert_to_newick.py: output the fitted tree in newick format

Fitting the trees via alternative methods

fit_chronos.py: uses Sanderson 2002, the chronos function in R package ape

How to get things running quickly and how to run analyses in the paper

The full pipelines are specified in the sconscript files. To run it, we use scons. For example: scons simulation_topol_consist/ The sconscript files provides the full pipeline for generating data and analyzing it (see simulation_topol_consist/sconscript as an example) as well as processing data from GESTALT and analyzing it (see analyze_gestalt/sconscript as an example).

Running tests

To run all the tests:

python3 -m unittest

To run specific test module tests/test_me.py:

python3 -m unittest tests.<test_me>

Experimental results

The fitted trees for the two adult fish in the paper is available at gestalt/analyze_gestalt/_output/ADR*_PMLE_tree.json.