-
Notifications
You must be signed in to change notification settings - Fork 10
/
akerborg_08_birthdeath_796654.pdf.txt
1538 lines (1242 loc) · 56.2 KB
/
akerborg_08_birthdeath_796654.pdf.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>1471-2148-8-77.fm</title>
<meta name="Author" content="abdulkadir.sufi"/>
<meta name="Creator" content="FrameMaker 7.0"/>
<meta name="Producer" content="Acrobat Distiller 5.0.5 (Windows)"/>
<meta name="CreationDate" content=""/>
</head>
<body>
<pre>
BMC Evolutionary Biology
BioMed Central
Open Access
Research article
Birth-death prior on phylogeny and speed dating
Örjan Åkerborg*1,2, Bengt Sennblad1 and Jens Lagergren1,2
Address: 1Stockholm Bioinformatics Centre (SBC), Albanova, Stockholm University, SE-10691 Stockholm, Sweden and 2School for Computer
Science and Communication (CSC), Royal Institute of Technology (KTH), SE-10044 Stockholm, Sweden
Email: Örjan Åkerborg* - osv@sbc.su.se; Bengt Sennblad - bens@sbc.su.se; Jens Lagergren - jensl@csc.kth.se
* Corresponding author
Published: 4 March 2008
BMC Evolutionary Biology 2008, 8:77
doi:10.1186/1471-2148-8-77
Received: 22 August 2007
Accepted: 4 March 2008
This article is available from: http://www.biomedcentral.com/1471-2148/8/77
© 2008 Åkerborg et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Background: In recent years there has been a trend of leaving the strict molecular clock in order
to infer dating of speciations and other evolutionary events. Explicit modeling of substitution rates
and divergence times makes formulation of informative prior distributions for branch lengths
possible. Models with birth-death priors on tree branching and auto-correlated or iid substitution
rates among lineages have been proposed, enabling simultaneous inference of substitution rates and
divergence times. This problem has, however, mainly been analysed in the Markov chain Monte
Carlo (MCMC) framework, an approach requiring computation times of hours or days when
applied to large phylogenies.
Results: We demonstrate that a hill-climbing maximum a posteriori (MAP) adaptation of the
MCMC scheme results in considerable gain in computational efficiency. We demonstrate also that
a novel dynamic programming (DP) algorithm for branch length factorization, useful both in the hillclimbing and in the MCMC setting, further reduces computation time. For the problem of inferring
rates and times parameters on a fixed tree, we perform simulations, comparisons between hillclimbing and MCMC on a plant rbcL gene dataset, and dating analysis on an animal mtDNA dataset,
showing that our methodology enables efficient, highly accurate analysis of very large trees.
Datasets requiring a computation time of several days with MCMC can with our MAP algorithm be
accurately analysed in less than a minute. From the results of our example analyses, we conclude
that our methodology generally avoids getting trapped early in local optima. For the cases where
this nevertheless can be a problem, for instance when we in addition to the parameters also infer
the tree topology, we show that the problem can be evaded by using a simulated-annealing like
(SAL) method in which we favour tree swaps early in the inference while biasing our focus towards
rate and time parameter changes later on.
Conclusion: Our contribution leaves the field open for fast and accurate dating analysis of
nucleotide sequence data. Modeling branch substitutions rates and divergence times separately
allows us to include birth-death priors on the times without the assumption of a molecular clock.
The methodology is easily adapted to take data from fossil records into account and it can be used
together with a broad range of rate and substitution models.
Page 1 of 14
(page number not for citation purposes)
BMC Evolutionary Biology 2008, 8:77
Background
Since Felsenstein brought the maximum likelihood (ML)
framework to phylogenetic inference [1] the number of
supporters of likelihood based estimation has steadily
increased, and it is now widely considered the most accurate approach. Although ML phylogenetic inference is
generally quicker than stochastic optimization inference
(MCMC), the time complexity also of ML-algorithms has
been prohibitive for analysis of large phylogenies.
Recently, however, the field has seen considerable
advances in speed of the best methods, among which
PHYML [2] and RAxML [3] are notable, seemingly without sacrificing much accuracy. With these methods phylogenies with tens or even hundreds of taxa [2] are readily
examined. After its introduction a decade ago [4-6], the
use of Bayesian methods in phylogenetic inference has
been a field of active research in which not only the phylogeny itself has been sought, but also additional issues
have been addressed, such as substitution rate hypotheses,
accuracy of ancestral state inference, and the rooting problem, see [7] for a review. In particular, a number of alternatives to the much debated molecular clock hypothesis
[8] have been suggested. Among these are models with
molecular clocks operating locally [9,10] and also a range
of models with rate variation over lineages; these include
auto-correlated models, i.e., the rate distribution for a particular branch depends on the rate value of the parent
branch [11-15], and uncorrelated models where rates are
drawn independently from a common underlying distribution [16] (Sennblad et al.: Parental guidance vs. mutual
independence – evaluation of bayesian models of substitution rate evolution, submitted).
A fundamental question for the accuracy of Bayesian phylogenetic inference is the selection of a prior distribution
on trees and branch lengths [17], and several types of such
have been suggested. An intuitively appealing choice is to
assume that the tree branching follows a birth-death process which is indeed what is exploited by Yang and Rannala
[4,18]. In [18], the authors assumed a molecular clock
and used integration over branching times to calculate the
posterior distribution of tree topologies, a procedure they
reported infeasible for trees with more than five leaves. In
order to overcome this, they applied an MCMC methodology to the problem [4], but the time complexity of the
algorithm still severely limited the number of taxa that
could be analyzed.
One might think that a model containing a birth-death
prior on the tree branching would necessarily be consistent with a molecular clock, since the birth-death process
generates ultrametric trees. The molecular clock can be
avoided, however, by modeling the substitution rates and
branching times separately. Biologically, the ability to separately infer rates and times is of importance, since the
http://www.biomedcentral.com/1471-2148/8/77
former can pinpoint periods of molecular function
changes and the latter enables dating of speciations and
other evolutionary events. Moreover, as was recognized by
Arvestad et al. [19,20], explicit time modeling facilitates
integration of separate time dependent evolutionary models into a unified framework.
In a paper on MCMC-estimation of phylogenies and
divergence times [16], the authors performed simulations
comparing uncorrelated and clock-like models, and concluded that clock-like models only perform well on clocklike data, while uncorrelated models always perform
acceptably. Furthermore, they investigated parameters
inferred from two viral and one marsupial dataset with
the result that no significant auto-correlation was
detected. One conclusion in the paper was, therefore, that
auto-correlated models are not generally suitable. This
contrasts with the results presented in a recent paper by
Lepage et al. [21]. They evaluate two auto-correlated models and concludes that they clearly outperform the uncorrelated models tested.
In the work presented here, we make use of a Bayesian
framework with informative priors on branch lengths. We
have investigated solutions to the parameter inference problem, i.e., inference of rates and times, as well as the phylogeny inference problem, where we also infer the tree
topology. These problems are analysed with a hill-climbing maximum a posteriori (MAP) as well as an MCMC
methodology. We further introduce a dynamic programming (DP) algorithm for optimal factorization of branch
lengths into rates and times, thereby considerably reducing the computation time needed for hill-climbing as well
as MCMC-algorithms. The nucleotide substitution is
modeled by a continuous-time Markov process and we
use a birth-death process to obtain an a priori distribution
of phylogenies' branching times. The substitution rates
are drawn iid from a Γ-distribution [22]. We perform simulations to show that with our method fast simultaneous
inference of substitution rates and branching times for a
given tree topology is not only feasible on large trees but
also largely unaffected by local-optima problems. By comparing our results with MCMC-algorithms based on, in
one case, the same model (Sennblad et al.: Parental guidance vs. mutual independence – evaluation of bayesian
models of substitution rate evolution, submitted), and in
another case, a similar model [23], we show that despite
its uncomplicated nature, the presented algorithm delivers parameter estimations with high accuracy. Finally, we
show that also the phylogeny inference problem is manageable in acceptable time. By using a simulated annealing-inspired methodology, the simulated annealing like
(SAL)-method, where tree topologies are swapped often in
the beginning, but henceforth more rarely, we can avoid
getting stuck on a particular local optimum tree.
Page 2 of 14
(page number not for citation purposes)
BMC Evolutionary Biology 2008, 8:77
http://www.biomedcentral.com/1471-2148/8/77
Methods
General notation
Let s be the number of aligned sequences and n their common length with columns of indels omitted. The data will
be represented by an s × n matrix D = {dij}. The topology
of the phylogenetic tree relating the sequences is denoted
by T. Assuming the nucleotide substitution rate to be constant on any particular edge in the tree, we will denote the
length, the rate, and the time of the edge connecting node
v with its parent by lv, rv, and tv, respectively, where lv ≡ rvtv.
We will refer to the complete vectors of lengths, rates, and
times, for all edges as l, r, and t, respectively.
The discussion in this paper centers around P [r, t|D, T],
the probability of the rates and the times given the
sequences and the tree topology, and P [r, t, T|D], the
probability of the rates, the times, and the tree topology
given the sequences. P [r, t|D, T] is defined by:
P[D|r , t ,T ]p[r]p[t|T ]drdt
P[r , t | D, T ] = ∫
.
P[D]
(1)
We will refer to the factor P [D|r, t, T] as the data probability
and to factors p [r] and p [t|T] as rates prior and times prior,
respectively. The probability P [r, t|D, T] will be referred to
as the posterior. The value of the data probability depends
solely on the tree's branch lengths whereas the product of
the rates and the times prior probabilities, which can be
viewed as informative length priors, depends on how
these lengths are factorized into rates and times.
Three solutions to the parameter inference problem
The dimensionality of the integral in (1) is (2s - 2) + (s 2) for a tree inferred from s sequences, making its numerical evaluation feasible only for relatively small s [18]. Performing the integration by means of MCMC is the natural
remedy, but this is a notoriously computationally intensive methodology. In the following, we will therefore
study several aspects of the alternative problem of finding
the corresponding MAP solution:
optimal data probability but a better prior probability).
Unfortunately, the r × t-method is impractically slow. To
overcome the weaknesses of these two methods, we have
developed a DP algorithm, to be described below, which
finds the best possible factorization of any given length
vector l under the rates and times priors.
By applying our DP algorithm, we can optimize the factorization of the lengths resulting from the l-space search.
Even so, we may very well end up in a situation where we
have found an optimal l but where there is another l with
nearly as good data probability but prior value so superior
that the posterior is better. That solution cannot be found
with the l-method. By instead combining the DP and the
r × t-algorithms, it is possible to achieve a better result in
a reasonable time. We search the r × t-space as before, but
at each iteration, the search is interrupted with probability
p (we have found p = 0.001 to be suitable). If so, we factorize the current l optimally, and make a desirable jump
in the search space that would otherwise have been
impossible. This method will be referred to as the combined method.
A DP algorithm for branch length factorization
Given lengths for each edge in the tree, the objective of the
DP algorithm is to factorize the lengths into optimal rate
and time parameter estimates for the edges. To facilitate
this, a discretization of the time interval from the leaves to
the root is made. We scale this interval in order to give the
leaf times and the root times values zero and one, respectively. All s - 2 non-root inner nodes are assigned intermediate values corresponding to the equidistant grid that is
the result of the discretization (see Figure 1). The number
of grid intervals is N (we have found N = 100 to be suitable). For a given node u with children v and w, fu(tu),
denoting p [r]p [τu|Tu], is calculated for each possible discretized divergence time τu of u, using all possible discretized divergence times for v and w. For the node u, fu(τu) is
given by the maximum value over τv, τw of:
p[l v / t v ]p[l w / t w ] p[tu | Tu , tv , tw ] f (tv ) f (tw )
arg max P[D | r , t , T ]p[r]p[t | T ],
r ,t
(2)
namely, the most likely complete set of rates and times
given the data and the model. We have implemented
three algorithms for finding MAP solutions. As a first alternative, the r × t-method, we explore the entire r × t-space
and seek the optimum in that space. As a second alternative, the l-method, we search the l-space and when a supposed l-optimum is found, we factorize the lengths into
rates and times. Neglecting issues regarding non-global
optima, we expect the l-method to find a solution that is
optimal only with regard to the data probability, while the
r × t-method should eventually find the optimal solution
including the priors on rates and times (i.e. with the same
rate priors
time prior
contribution
(3)
where tv ≡ τu - τv and tw ≡ τu - τw. The values of the rate priors
are dependent on the gamma distribution and the quotient between the lengths and the times for the edges leading to u. We use the prior densities for branching times
given a tree, derived in [24]. This yields values dependent
on the birth-death process, τu, and the number of leaves of
the subtree Tu in which u is the root. We work from the
leaves up and, when we reach the root, the optimal
froot(1.0) ≡ p [r]p [t|T] and the rate-time factorization of the
lengths that gave rise to the optimal value are retrieved.
Page 3 of 14
(page number not for citation purposes)
BMC Evolutionary Biology 2008, 8:77
http://www.biomedcentral.com/1471-2148/8/77
the evolving model were found to be negligible. In [16],
the authors argue that over short time scales inherited factors become small relative to stochastic factors, and that
over very long time scales the variation causes the autocorrelation from lineage to lineage to break down.
In the recent paper by Lepage et al. [21], the authors perform benchmarking of relaxed clock models. They test
various choices of divergence time priors (birth-death,
Dirichlet, uniform) and variants of both auto-correlated
and uncorrelated rate models. This test favors the autocorrelated rate model alternative while the optimal choice
of time prior is found to be more data dependent.
Figure
zation 1
A dynamic programming algorithm for branch length factoriA dynamic programming algorithm for branch
length factorization. Values fu(τu) are calculated for all
possible choices of τu. τu is u's divergence time counted from
the leaves and it takes values corresponding to the equidistant grid. fu(τu) is a product of rate priors for the edges leading to u, u's contribution to the time prior, and the
corresponding probabilities for u's children v and w: fv(τu) and
fw(τu) respectively.
Rate model and hyperparameter values
In this study, we use a rate model described in (Sennblad
et al.: Parental guidance vs. mutual independence – evaluation of bayesian models of substitution rate evolution,
submitted). Briefly, this model averages the rate for each
branch as independent and identically distributed (iid)
stochastic variables drawn from an underlying Γ-distribution. Apart from the edge rates themselves, the model has
two hyperparameters, namely the mean m and the variance
v of the Γ-distribution, which are both assigned uniform
priors in the interval [0, 1000]. Variants of the iid model
with respect to the underlying distribution (e.g., 3, Log
Normal, Inverse Gaussian and Exponential distributions)
have been described [16]. A comparative study in (Sennblad et al.: Parental guidance vs. mutual independence –
evaluation of bayesian models of substitution rate evolution, submitted) found no difference in performance
between these variants, while results in [21] suggested that
the Γ-distribution might be more flexible in describing
rate variation. An alternative approach suggested by Sanderson [12,13] and Thorne et al. [14], is to model rates as
evolving over the tree, such that there is an auto-correlation between rates at adjacent nodes. In (Sennblad et al.:
Parental guidance vs. mutual independence – evaluation
of bayesian models of substitution rate evolution, submitted), the differences in performance of the iid model and
In this study, we have chosen the iid-Γ model as an example distribution; it is, however, straight-forward to adapt
our methodology to accommodate any of the models
described above, with a possible exception for the Dirichlet time prior which we have not tested.
The birth-death process generating tree branching times
have hyperparameters for birth rate, λ, and death rate, µ,
respectively. That is, the set of hyperparameters used in
the inference are λ, µ, m, and v. For the simulation analyses, we generated sequences using λ = 1.0, µ = 1.0, m = 0.5,
v = 0.01. We then used these same values during inference.
We used the simple Jukes-Cantor [25] substitution model.
When analyzing biological data, we estimated values for
λ, µ, m, v during inference and the substitution parameters
were estimated with a maximum likelihood analysis using
PAUP [26] and the GTR+Γ model [27].
Proposal distributions
The proposal scheme used was developed for the MCMCalgorithm described in (Sennblad et al.: Parental guidance
vs. mutual independence – evaluation of bayesian models
of substitution rate evolution, submitted). Each parameter (i.e. rates, times, and model hyperparameters) has
equal probability of being updated, and the new value s'
of the updated parameter s is LogNormal(s, σ). The factor
σ is a prespecified constant related to the parameter type
which was calibrated to optimize mixing and convergence
of the MCMC chain.
We present here results obtained with MCMC, as well as a
hill-climbing MAP adaptation of the MCMC-algorithm.
In the MAP case we only accept proposals increasing the
posterior, while in the MCMC case the standard Metropolis-Hastings proposal-acceptance scheme is used.
We further tested a deterministic scheme where a search
was performed in all directions, i.e., we perturb each
parameter, store all results and then choose the perturbation giving the optimal log-likelihood. In this case, the
results were qualitatively the same as with the randomized
Page 4 of 14
(page number not for citation purposes)
BMC Evolutionary Biology 2008, 8:77
algorithm described above, although computation times
were longer.
The phylogeny inference problem and the SAL-method
When we extend our ambition to also do phylogeny inference, the most severe problem with the hill-climbing
approach is that the algorithm can get stuck on a particular non-global optimum tree. This problem can, at least to
some extent, be resolved by the SAL-method, i.e., a methodology where tree topologies are swapped often in the
beginning, but henceforth more rarely. For simplicity, we
use a linear scheme where the probability of a tree swap is
originally k times higher than that of a parameter change
(we have found k = 1000 to be suitable). For each tree
swap attempt, we decrease k with one unit until k equals
unity. The types of tree rearrangements that we use are rerooting, nearest neighbor interchange (NNI) and subtree
pruning and regrafting (SPR), see [7].
The applicability of the DP-algorithm as well as the SALmethod are independent of the way in which we exploit
the search space, i.e., they are useful in both the MAP and
the MCMC setting.
http://www.biomedcentral.com/1471-2148/8/77
Results
Parameter inference simulations
We first evaluated the three inference methods described
above on fixed trees with respect to their capacity to infer
parameters. We generated a tree with 100 leaves and generated nucleotide sequences of length 1000 by evolving
them on the tree. Figure 2 illustrates the performance of
the l-method, i.e., MAP over lengths, relative to the r × tmethod and the combined method, i.e., MAP over rates
and times without and with DP, respectively. We note that
finding r and t that optimize the prior probabilities
requires more than 200,000 iterations for the r × tmethod to converge although the data probability has
reached optimum after only 50,000 iterations approximately. The speedup achieved with the combined
method, as compared to the r × t-method, is considerable.
We note also that the combined algorithm obtains a solution with higher log-likelihood than does the otherwise
fast l-method. When the performances of these methods
are compared on a larger set of trees, the conclusions
stated above become even clearer. We generated 100 trees
with 10 leaves and 100 trees with 100 leaves, and generated nucleotide sequences of length 1000 on each tree.
For each of our three methods, we further made two sep-
Figure 2
Three MAP inference methods – a comparison
Three MAP inference methods – a comparison. The plot illustrates the fact that optimization over the full r × t parameter space (red curves) is much slower than both the equivalent optimization over l with subsequent DP-partition of l into r
and t (blue curves), and the method combining r × t search with occasional DP-optimization interruptions (green curves). The
solid curves show log-likelihoods including the priors p [r]p [t|T], whereas the dashed curves show the same results, but
excluding that factor. Approximate point of convergence, i.e., the first point when 1000 subsequent steps has not resulted in
any improvement, is marked with an asterisk.
Page 5 of 14
(page number not for citation purposes)
BMC Evolutionary Biology 2008, 8:77
http://www.biomedcentral.com/1471-2148/8/77
arate parameter estimations on each tree. The aim was to
compare the two results and from that evaluate the respective methods' reproducibility. The results in Table 1 show
the methods' performances with respect to optimality,
speed, and reproducibility of likelihood, rates, and times.
On average the combined method delivers the result with
best likelihood, the r × t-method is second, while it is
again clear that the r × t-method is by far the slowest. We
note in particular that the l-methods' rate and time variance is almost negligible; but this is simply a consequence
of the fact that the l-method consistently finds very similar
l-optima, and that the optimal DP-partitioning of these loptima results in very similar r and t. To summarize, the
combined method outperforms the others with respect to
quickly finding optimal values for branches' rates and
times.
Evaluation of parameter recovery
To get an indication of what accuracy we can expect for
solutions obtained with the hill-climbing approach, we
compared the inferred rates and times parameters with the
corresponding values used when simulating the sequence
data. We generated 10 trees with 10 leaves and 10 trees
with 100 leaves, then generated nucleotide sequences of
length 1000 on each tree. We recorded the true rates and
times for each edge, and we ran the combined method
and the r × t-method on each tree while recording the
rates and times thereby inferred. We repeated the procedure using different values of v, the rate variance hyperparameter, when generating the sequence data. The rate
mean hyperparameter was kept constant at 0.5.
From Figure 3 can be seen that for 10-leaf trees and m =
0.5, v = 0.001, the inferred rates and times are on average
around 5% off the true value. For these m and v the quotient between the highest and the lowest true rate in the
tree is approximately 1.25, showing that the data is fairly
clock-like. For m = 0.5, v = 0.1, on the other hand, the rate
parameters are as much as 45% off and the time parameters around 30% off. Here, though, the highest true rate is
more than 5 times greater than the lowest. The same pattern is seen also for large 100-leaf trees, with the accuracy
being very high for fairly clock-like data and declining as
the data becomes less clock-like. Interestingly, we note
that for the 100-leaf trees, the combined method is much
more accurate than the r × t-method with respect to both
rates and times. With the combined method, the times, in
particular, and the rates, are very well estimated even for v
as large as 0.05.
Hill-climbing vs. MCMC – a parameter inference
performance comparison on a rbcL dataset
We also made a parameter inference comparison on a
dataset consisting of rbcL genes from eudicots, a group of
flowering plants. We used the tree presented in [28]. We
compared the rates and times inferred using the combined method, i.e., hill-climbing MAP with the DP factorization speedup, with the rates and times posterior
distributions obtained from an MCMC analysis using the
r × t-method (i.e. without DP). The MCMC-chain was run
100,000,000 iterations, and was sampled at regular intervals (Sennblad et al.: Parental guidance vs. mutual independence – evaluation of bayesian models of substitution
rate evolution, submitted). To evaluate the convergence of
the MCMC analysis, the Gelman and Rubin [29] convergence tests, as implemented in the R package Coda [30],
was used. The MCMC-analysis required a computation
time of several days, while the MAP-optimum is reached
after some 10.000 steps, requiring less than a minute on
the same type of computer, the difference mainly being
due to MCMC requiring more iterations to accurately
describe the posterior distribution while MAP results in
point estimates. Figure 4 depicts the rates and times posterior distributions inferred using MCMC, together with
the corresponding values from the MAP-analysis. We note
that for the 62 rate and time variables that we attempt to
Table 1: Parameter inference simulations. Results from pairwise MAP runs on 100 generated tree topologies using the combined
method, the r × t-method, and the l-method.
Tree size
Method1
MAP value
average
10 leaves
combined
r×t
l
combined
r×t
l
-8046.54
-8046.72
-8049.34
-65249.12
-65257.40
-65258.41
100 leaves
MAP diff2
average
worst
0.48
0.32
0.012
1.61
4.66
0.16
3.39
13.03
0.20
8.60
23.10
1.54
Rates diff3
average
worst
Times diff4
average
worst
0.58
0.36
0.00068
4.72
14.08
0.034
0.26
0.17
0.0015
1.24
5.11
0.060
1.46
4.61
0.0084
10.64
21.66
0.071
0.72
1.73
0.05
3.39
9.19
0.41
Comp time5
relative
1
2
1
2
10
1
1The
combined method and the r × t-method infer optimal substitution rates and divergence times. The l-method infer optimal branch lengths
which are subsequently partitioned into rates and times. The latter problem is much easier and the inferred rates and times less optimal.
2Difference in log-likelihood between two MAP runs on the same tree.
3Difference in substitution rate norm between two MAP runs on the same tree.
4Difference in divergence time norm between two MAP runs on the same tree.
5Approximate computation time relative to the l-method.
Page 6 of 14
(page number not for citation purposes)
BMC Evolutionary Biology 2008, 8:77
http://www.biomedcentral.com/1471-2148/8/77
Figure of
Degree 3 parameter recovery
Degree of parameter recovery. Rates and times inferred using the combined method (green curves), and the r × t-method
(red curves) are compared with the corresponding true parameters that were used to generate the sequence data. Results for
10-leaf (upper figure), and 100-leaf trees (lower figure) are shown. We plot the deviation of the inferred parameter from the
true value against different rate variances v. Rates and times are shown as stars and circles, respectively.
infer, only one has a MAP estimate outside of the borders
marked out by the 25% and 75% quantiles from the
MCMC analysis, and that one is only slightly outside. This
is good performance, since it seems clear from the figure
that the MCMC analysis gives rather tight parameter inference, i.e., the quantile intervals are small.
Phylogeny inference simulations
Compared to the parameter inference problem discussed
in the previous three sections, the phylogeny inference
problem is significantly more difficult. Given a set of
sequences, the objective is to perform simultaneous inference of tree topology, rates, and times. In Figure 5 and
Table 2, we present a simulation study comparing hillclimbing and MCMC, with the result, as expected, that
hill-climbing converges quickly but is less reliable than
MCMC. In both the hill-climbing and the MCMC case, a
run is considered successful when it first visits a state with
log-likelihood at least as good as the optimal log-likelihood for the tree generating the data. For MCMC this is
despite the fact that if the posterior distribution is nonuniform it cannot have been reached. That is, we underestimate the time required.
When doing phylogeny inference the local optima problem really becomes an issue. What often happens both in
the hill-climbing MAP and in the MCMC case, is that the
algorithm gets stuck on one particular sub-optimal tree,
and by optimizing rate and time parameters for that tree
we make a move to another tree unlikely. To circumvent
this problem, we use the SAL-method described in Methods; that is, we do tree swaps often early in the chain,
while focusing more on fine-tuned parameter changes
later. For small trees the effect of the SAL-method is
minor. Out of the 100 MAP estimations we carried out for
the 10-leaf trees with the SAL-method and the standard
method (i.e. perturbing each parameter equally often), we
finished successfully in 98 and 87 cases, respectively (see
Table 2). When using MCMC on 10-leaf trees, we finished
successfully in 99 cases with the SAL-method and in 100
Page 7 of 14
(page number not for citation purposes)
BMC Evolutionary Biology 2008, 8:77
http://www.biomedcentral.com/1471-2148/8/77
,%*
*+%+ *)%)*,%'
-%+ '%,
-%*
,%,
,%)
)%$
,%, -%* '%,
-%* '%$
,%)
-%+ '%+
,%&
-%* '%$
&%, .%&
,%& $%-
-%* -%(
-%) '%+
'%* &%)
$%$ -%,
)%(
,%) $%, -%+
,%+ $%- -%'
,%)
$%.
-%(
-%-
)%'
$%- -%&
$%'
&%) .%,
$%-
$%)
,%.
.%, (%&
,%)
,%, -%) '%$
'%,
'%& &%(
'%+ &%+
$%- -%$
)%)
$%+ -%*
)%. ,%,
#
*%(
)%,
)%$ )%.
)%&
,%)
$%+
,%&
'%* &%$
$%. -%.
$%)
-%' '%(
-%(
-%+
)%& ,%+
,%- $%-
!
"
*%&
*%.
&%. .%(
-%. &%) .%*
)%*
$%- -%'
,%& -%- '%&
'%) .%* (%,
&%) .%'
,%& -%- '%&
'%* &%,
-%* &%, .%&
,%+
,%& $%$
$%&
)%, )%&
'%)
'%(
Figure hill-climbing vs. MCMC comparison
An rbcL4
An rbcL hill-climbing vs. MCMC comparison. The speciation times inferred by using MCMC are shown by the tree
nodes (means) and the blue bars (25% and 75% quantiles), respectively. The positions of the tree nodes, and the 25% and 75%
bars, are relative to the position of the root, which has time value 1.0. Substitution rates are explicitly written out scaled by
100 on branches and are shown by yellow bars (mean, 25% and 75% quantiles). In both cases, corresponding MAP estimates
are indicated by red stars.
Page 8 of 14
(page number not for citation purposes)
BMC Evolutionary Biology 2008, 8:77
http://www.biomedcentral.com/1471-2148/8/77
Figure 5
Phylogeny inference simulations
Phylogeny inference simulations. Phylogeny inference simulations performed on 30-leaf trees are shown. An inference is
considered successful at the first visit to the true tree, i.e., the tree generating the sequence data, or to a state with log-likelihood value at least as good as the true tree. In the left figure, we plot percent successes as a function of number of iterations
for 100 runs on one tree with, respectively, MAP using the combined method, i.e., with DP (solid green line with stars indicating that at least one run has reached the success zone during the last 1000 steps), MAP using the r × t-method, i.e., without DP
(solid red line with circles) and MCMC with (dotted green line with stars) and without DP (dotted red line with circles). We
also plot results obtained with MAP and the combined method but without the SAL-method (solid green line with triangles). In
the right figure, the comparison is between the same MAP runs with (again solid green line with stars) and without DP (again
red solid line with circles) when a success is recorded as before, and the same methods (dashed green and red lines with stars
and circles respectively) where a success is recorded when optimum for the true tree is reached.
cases with the standard method. For 10-leaf trees, both the
standard method and the SAL-method require a similar
number of iterations to converge successfully, and we conclude that the difference in computation time between the
two methods for these small trees is minor.
For inference on larger trees (results from simulations
using 30-leaf trees are shown), it is clear that favoring of
early tree moves, as in the SAL-method, is necessary. Out
of 100 simulations that were performed with the standard
method, after 50,000 iterations only 25 MAP runs have
reached a likelihood similar to that of the true tree (i.e. the
tree generating the data). This should be compared with
the SAL-method which finishes successfully in 74% of the
cases. Also for MCMC, where we expect optimal values to
be eventually achieved in all cases, we note the same pattern with the SAL-method having much shorter time to
success. We finally conclude that the effect of our DP-algorithm is less pronounced when it comes to phylogeny
inference than is the case for parameter inference. A comparison between the combined method (i.e. with the DP
augmentation) and the r × t-method is shown in the right
part of Figure 5. We record the percentage of runs finding
first the true tree and second the optimal log-likelihood
Page 9 of 14
(page number not for citation purposes)
BMC Evolutionary Biology 2008, 8:77
http://www.biomedcentral.com/1471-2148/8/77
Table 2: Phylogeny inference simulations. 100 times MAP and MCMC phylogeny inference on 10- and 30-leaf trees, using the
combined, the r × t-method, the standard and the simulated annealing-like algorithms respectively.
Tree size
Combined
MAP
MCMC
r×t
MAP
MCMC
10
30
10
30
10
30
10
30
# Iterations1
SAL
Std
2000
6000
4200
18000
1900
8000
5000
22000
1500
29000
1700
32000
3300
>50000
4600
>50000
Time2
SAL
Std
50 s
11 m
90 s
20 m
45 s
14 m
100 s
20 m
15 s
20 m
20 s
25 m
25 s
>30 m
30 s
>30 m
Success3
SAL
Std
98
74
99
80
99
83
96
54
87
25
100
60
98
0
100
0
1Median
number of iterations for successful runs (i.e. runs finding a tree with likelihood at least as good as the true tree).
computation time needed for successful runs.
3Percentage of runs being successful.
2Median
for that tree. The combined method finds the true tree in
74% and the r × t-method in 83% of the cases. The combined method is slightly faster needing on average 6000
iterations while the r × t-method needs 8000 iterations.
Now, to find the optimum the combined method needs
8000 iterations, i.e., another 2000 iterations after the true
tree is found. This should be compared to the r × tmethod which needs 8000 iterations to find the optimum
after the true tree is found, in total 16000 iterations on
average.
over trees in an seemingly unexpected manner. We thus
record a higher percentage of successes for the dataset
including 31 sequences in trees a + b + c than for the
smaller subset in trees a + b (22 sequences) and a (11
sequences). A plausible reason for this is that the tree used
to compare with (i.e. the one shown in Figure 6) is not
necessarily optimal and that the relative effect of this is
larger for the smaller subtrees. The distribution of results
inferred by our method seems however to be the expected
one with wider spread for bigger trees.
Hill-climbing vs. MCMC – a phylogeny inference
performance comparison on a mtDNA dataset
To test the phylogeny inference method on biological
data, we used a mitochondrial DNA dataset originally presented in [23]. It consists of the complete cytochrome oxidase II and cytochrome b genes, altogether around 1800
nucleotides, in 40 species. The authors utilize this and
other datasets together with calibration times obtained
from fossil records to infer divergence times among the
lemurs of Madagascar. The model they used is that of
Thorne et al. [14] which, again, is based on auto-correlated rates. Our use of this data was twofold. Firstly, we
tested whether, and if so how often, we could find a tree
with similar or higher likelihood than the one used in
[23] (see Figure 6). Secondly, we used their tree in order
to obtain divergence times for comparison.
Hill-climbing vs. MCMC – a dating performance
comparison on a mtDNA dataset
We finally made a comparison between divergence times
resulting from our method and those reported in [23].
The results in Table 4 imply that our inference differs considerably from the calibration times used by Yoder et al..
All times inferred by us on the Laurasiatheria clade (C4–
C7) are wildly underestimated as compared to the
reported calibrations, although the relative edge lengths
on the clade seem to agree. On the other hand, we overestimate the time point for the divergence between Pan/
Homo and Gorilla (C1). Our results agree with what has