Please use https://github.com/nangalialab/rtreefit
You can install rtreefit like so:
devtools::install_github("NickWilliamsSanger/rtreefit",build_vignettes=TRUE)
This package estimates time-based (“ultrametric”) trees, wherein the y-axis of phylogenetic somatic mutation trees is converted from mutations to time. The method jointly fits wild type rates, mutant rates and absolute time branch lengths using a Bayesian per individual tree-based model under the assumption that the observed branch lengths are Poisson or Negative Binomial distributed with
Mean = Duration × Sensitivity × Mutation Rate
The method works with at most one change point per branch and supports heterochronous sampling. See the rtreefit vignette for slightly fuller mathematical details:
browseVignettes("rtreefit")
We consider a rooted tree where each edge i consists of an observed mutation count mi and a true duration ti. We refer to a given edge and its child node interchangeably by the same label. Now let D(i) be the set of terminal nodes (tips) that descend from node i and let A(i) be its corresponding set of ancestral nodes excluding the root. We assume that each tip of the tree k has a known corresponding time Tk (e.g. the post conception age in years of the patient at sampling of the cell) and so we therefore have the following constraint:
Tk = ∑i ∈ A(k)ti
and
Tk > ti > 0
We incorporate this constraint by performing the optimisation over the interior branches of the tree with reparameterised branch durations xi transformed to be in the range 0 < xi < 1. If j is an edge whose parent node is the root then:
tj = xjmin(Tk : k ∈ D(j))
For other interior edges, i, we have
ti = (min{Tk:k∈D(i)}−∑j ∈ A(i)tj)xi
The duration of the terminal edges is fixed by the values of ti on the interior edges and the overall duration constraint:
ti = min{Tk:k∈D(i)} − ∑j ∈ A(i)tj
We assume that there are p − 1 change points in the tree corresponding to the acquisition of driver mutations. This results in p mutation rates λj applying throughout the tree where we allow at most one change point per branch and the initial ancestral (or wild type) rate is λ0 and additional rate change points occur a fraction αj along branch j and descendent branches have the rate λj unless there are additional change points in descendant branches. The effective rate on branches with a change point going from λl to λj is just the weighted average αjλl + (1 − αj)λj where we use a uniform unit interval prior for the α's.
We assume the underlying mutation process follows a Negative Binomial Distribution with the above piecewise constant driver specific mutation rates, the number of mutations accrued on branch i in time ti measured in years:
Mi ∼ NB(λ×ti,λ×ti×ϕ)
The number of observed mutations is:
mi ∼ Binomial(Mi, si)
Where we are using a per-branch estimated sensitivity si that indirectly depends on the depth of sample and the number of samples sharing a branch (see ?). This is equivalent too:
mi ∼ NB(λ×ti×si,λ×ti×ϕ)
with priors 1/ϕ ∼ HalfNormal(0, 10), λ ∼ 𝒩(Λ, 0.25Λ) where Λ is the naive estimation of a single rate λ as the per patient median of the ratio of the root to tip mutation count and the tip sampling age, and finally we use the weakly informative prior for the stick breaking fractions:
xi ∼ Beta(α = pi/(1 − ∑j ∈ A(i)pj),β = 1)
where the pi is an initial approximation of the duration of the branch length expressed as a fraction of the sampling time: pi = minj ∈ D(i){(mj+1)/(∑k ∈ A(j)(mk+1))}
Note that the overdispersion parameter is rescaled so that it is comparable across branches with different mutation burden.
Here we assume the underlying mutation process follows a Poisson Distribution again with the above piecewise constant driver specific mutation rates, the number of observed mutations accrued on branch i in time ti measured in years:
mi ∼ Poisson(λ × ti × Si)
where
Si ∼ Beta(α=c,β=c(1−si)/si)
Where we have chosen the concentration parameter c = 100. This reflects only modest uncertainty in our estimates in sensitivity and also allows the model to mitigate larger than expected variability in the branch lengths. In other respects the priors are the same as for the Negative Binomial Model.
First lets simulate a neutral tree using rsimpop and fit the tree..
library("rtreefit")## Loads rsimpop as well
NYEARS=25
RATE=18
get_agedf_from_sim=function(simtree){
st=get_elapsed_time_tree(simtree)## Gets "Real Time" ultrametric tree
nh=nodeHeights(st)
out=data.frame(tip.label=st$tip.label,age=nh[match(1:length(st$tip.label),st$edge[,2]),2]/365)
out$age=ifelse(out$age<1e-6,1e-6,out$age)
out
}
testing=run_neutral_sim(0.1,1/365,nyears=NYEARS)
#> n_sim_days: 9125
#> b_stop_if_empty: 0
#> b_stop_at_pop_size: 1
#> maxt: 0
#> driver_rate_per_cell_per_day: 0
#> MAX_EVENTS= 18250
#> MAX_SIZE= 300003
#> n_sim_days: 9125
#> b_stop_if_empty: 0
#> b_stop_at_pop_size: 0
#> maxt: 114.76592783799
#> driver_rate_per_cell_per_day: 0
#> MAX_EVENTS= 18250
#> MAX_SIZE= 300003
st=get_subsampled_tree(testing,30)
#> Starting checking the validity of tmp...
#> Found number of tips: n = 31
#> Found number of nodes: m = 30
#> Done.
st=get_elapsed_time_tree(st,mutrateperdivision = 0,backgroundrate = RATE/365,odf=1)
plot_tree(st)
#>
#> Phylogenetic tree with 31 tips and 30 internal nodes.
#>
#> Tip labels:
#> s1, s2, s3, s4, s5, s6, ...
#>
#> Rooted; includes branch lengths.
st$agedf=get_agedf_from_sim(st)
res=fit_tree(tree=st,switch_nodes = c(),xcross = c(),niter = 10000,model = "poisson_tree",early_growth_model_on = 0.0)
#> Warning in fit_tree(tree = st, switch_nodes = c(), xcross = c(), niter =
#> 10000, : No sensitivity supplied: assuming 99%
#> Median lambda estimate=18.10
print(res$lambda)
#> $mean
#> [1] 18.08432
#>
#> $sd
#> [1] 0.1722527
#>
#> $lb
#> [1] 17.75389
#>
#> $ub
#> [1] 18.42925
#>
#> $median
#> [1] 18.08249
par(mfcol=c(1,2))
ut=get_elapsed_time_tree(st)
ut$edge.length=ut$edge.length/365
plot_tree(ut,cex.label = 0);title("True Ultrametric Tree")
#>
#> Phylogenetic tree with 31 tips and 30 internal nodes.
#>
#> Tip labels:
#> s1, s2, s3, s4, s5, s6, ...
#>
#> Rooted; includes branch lengths.
plot_tree(res$ultratree,cex.label = 0);title("Inferred Tree")
#>
#> Phylogenetic tree with 31 tips and 30 internal nodes.
#>
#> Tip labels:
#> s1, s2, s3, s4, s5, s6, ...
#>
#> Rooted; includes branch lengths.
NYEARS=40
RATE=15
selsim=run_selection_sim(0.1,1/365,target_pop_size = 1e5,nyears_driver_acquisition = 5,nyears = NYEARS,fitness=0.3,minprop = 0.05)
#> n_sim_days: 1825
#> b_stop_if_empty: 0
#> b_stop_at_pop_size: 1
#> maxt: 0
#> driver_rate_per_cell_per_day: 0
#> MAX_EVENTS= 3650
#> MAX_SIZE= 300003
#> n_sim_days: 1825
#> b_stop_if_empty: 0
#> b_stop_at_pop_size: 0
#> maxt: 116.727721368761
#> driver_rate_per_cell_per_day: 0
#> MAX_EVENTS= 3650
#> MAX_SIZE= 300003
#> No driver found: tries= 0
#> val population fitness id driver1
#> 1 0 1 0.0 0 0
#> 2 1 100027 0.0 0 0
#> 21 1 1 0.3 1 1
#> n_sim_days: 14600
#> b_stop_if_empty: 1
#> b_stop_at_pop_size: 0
#> maxt: 1825.00040223652
#> driver_rate_per_cell_per_day: 0
#> MAX_EVENTS= 29200
#> MAX_SIZE= 300087
#> No driver found: tries= 1
#> val population fitness id driver1
#> 1 0 1 0.0 0 0
#> 2 1 100027 0.0 0 0
#> 21 1 1 0.3 1 1
#> n_sim_days: 14600
#> b_stop_if_empty: 1
#> b_stop_at_pop_size: 0
#> maxt: 1825.00040223652
#> driver_rate_per_cell_per_day: 0
#> MAX_EVENTS= 29200
#> MAX_SIZE= 300087
#> No driver found: tries= 2
#> val population fitness id driver1
#> 1 0 1 0.0 0 0
#> 2 1 100027 0.0 0 0
#> 21 1 1 0.3 1 1
#> n_sim_days: 14600
#> b_stop_if_empty: 1
#> b_stop_at_pop_size: 0
#> maxt: 1825.00040223652
#> driver_rate_per_cell_per_day: 0
#> MAX_EVENTS= 29200
#> MAX_SIZE= 300087
st=get_subsampled_tree(selsim,30)
#> Starting checking the validity of tmp...
#> Found number of tips: n = 31
#> Found number of nodes: m = 30
#> Done.
st=get_elapsed_time_tree(st,mutrateperdivision = 0,backgroundrate = RATE/365,odf=1)
st$agedf=get_agedf_from_sim(st)
node=st$events$node[which(st$events$driverid==1)]
res=fit_tree(tree=st,switch_nodes = node,xcross = c(),niter = 10000,model = "poisson_tree",early_growth_model_on = 0.0)
#> Warning in fit_tree(tree = st, switch_nodes = node, xcross = c(), niter =
#> 10000, : No sensitivity supplied: assuming 99%
#> Median lambda estimate=15.38
print(res$lambda)
#> $mean
#> lambda[1] lambda[2]
#> 15.44703 15.21443
#>
#> $sd
#> lambda[1] lambda[2]
#> 0.1891049 0.3896968
#>
#> $lb
#> lambda[1] lambda[2]
#> 15.07746 14.47648
#>
#> $ub
#> lambda[1] lambda[2]
#> 15.82462 16.02395
#>
#> $median
#> lambda[1] lambda[2]
#> 15.44616 15.20701
ut=get_elapsed_time_tree(st)
ut$edge.length=ut$edge.length/365
par(mfcol=c(1,2))
plot_tree(ut,cex.label = 0);title("True Ultrametric Tree")
#>
#> Phylogenetic tree with 31 tips and 30 internal nodes.
#>
#> Tip labels:
#> s1, s2, s3, s4, s5, s6, ...
#>
#> Rooted; includes branch lengths.
plot_tree(res$ultratree,cex.label = 0);title("Inferred Tree")
#>
#> Phylogenetic tree with 31 tips and 30 internal nodes.
#>
#> Tip labels:
#> s1, s2, s3, s4, s5, s6, ...
#>
#> Rooted; includes branch lengths.