-
Notifications
You must be signed in to change notification settings - Fork 10
/
Copy pathagrawal_09_pairwise_798006.pdf.txt
1148 lines (881 loc) · 38.8 KB
/
agrawal_09_pairwise_798006.pdf.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>1471-2105-10-S3-S1.fm</title>
<meta name="Author" content="productiontraining"/>
<meta name="Creator" content="FrameMaker 8.0"/>
<meta name="Producer" content="Acrobat Distiller 8.1.0 (Windows)"/>
<meta name="CreationDate" content=""/>
</head>
<body>
<pre>
BMC Bioinformatics
BioMed Central
Open Access
Proceedings
Pairwise statistical significance of local sequence alignment using
multiple parameter sets and empirical justification of parameter set
change penalty
Ankit Agrawal* and Xiaoqiu Huang
Address: Department of Computer Science, Iowa State University, 226 Atanasoff Hall, Ames, IA 50011-1041, USA
Email: Ankit Agrawal* - ankitag@iastate.edu; Xiaoqiu Huang - xqhuang@iastate.edu
* Corresponding author
from Second International Workshop on Data and Text Mining in Bioinformatics (DTMBio) 2008
Napa Valley, CA, USA. 30 October 2008
Published: 19 March 2009
BMC Bioinformatics 2009, 10(Suppl 3):S1
doi:10.1186/1471-2105-10-S3-S1
<supplement> <title> <p>Second International Workshop on Data and Text Mining in Bioinformatics (DTMBio) 2008</p> </title> <editor>Min Song, Doheon Lee and Jun Huan</editor> <note>Proceedings</note> <url>http://www.biomedcentral.com/content/pdf/1471-2105-10-S3-info.pdf</url> </supplement>
This article is available from: http://www.biomedcentral.com/1471-2105/10/S3/S1
© 2009 Agrawal and Huang; licensee BioMed Central Ltd.
This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Background: Accurate estimation of statistical significance of a pairwise alignment is an important
problem in sequence comparison. Recently, a comparative study of pairwise statistical significance
with database statistical significance was conducted. In this paper, we extend the earlier work on
pairwise statistical significance by incorporating with it the use of multiple parameter sets.
Results: Results for a knowledge discovery application of homology detection reveal that using
multiple parameter sets for pairwise statistical significance estimates gives better coverage than
using a single parameter set, at least at some error levels. Further, the results of pairwise statistical
significance using multiple parameter sets are shown to be significantly better than database
statistical significance estimates reported by BLAST and PSI-BLAST, and comparable and at times
significantly better than SSEARCH. Using non-zero parameter set change penalty values give better
performance than zero penalty.
Conclusion: The fact that the homology detection performance does not degrade when using
multiple parameter sets is a strong evidence for the validity of the assumption that the alignment
score distribution follows an extreme value distribution even when using multiple parameter sets.
Parameter set change penalty is a useful parameter for alignment using multiple parameter sets.
Pairwise statistical significance using multiple parameter sets can be effectively used to determine
the relatedness of a (or a few) pair(s) of sequences without performing a time-consuming database
search.
Background
Local sequence alignment plays a major role in the analysis of DNA and protein sequences [1-3]. It is the basic step
of many other applications like detecting homology, find-
ing protein structure and function, deciphering evolutionary relationships, etc. There exist several local sequence
alignment programs that use well-known algorithms [4,5]
or their heuristic versions [3,6,7]. Database search is a spePage 1 of 9
(page number not for citation purposes)
BMC Bioinformatics 2009, 10(Suppl 3):S1
cial case of pairwise local sequence alignment where the
second sequence is a database in itself consisting of many
sequences. Recently, there have been many enhancements
in alignment program features [8,9] using difference
blocks and multiple scoring matrices, in an attempt to
incorporate more biological features in the alignment
algorithm.
Why statistical significance?
The local sequence alignment programs report alignment
scores for the alignments constructed, and related
(homologous) sequences will have higher alignment
scores. But the definition of high depends strongly on the
alignment score distribution, which gives importance to
the concept of statistical significance. An alignment score
is considered statistically significant if it has a low probability of occurring by chance. Since the alignment score
distribution depends on various factors like alignment
program, scoring scheme, sequence lengths, sequence
compositions [10], it implies that it is possible to have
two alignments of different sequence pairs with scores x
and y with x <y, but x more significant than y. Therefore,
instead of using the alignment score for detecting homology, the statistical significance of an alignment score is
more widely accepted as a metric to comment on the relatedness of the two sequences being aligned. Of course, it is
important to emphasize here that although statistical significance is a good preliminary indicator of biological significance, it does not necessarily imply biological
significance [10,11].
Accurate statistical theory for the ungapped alignment
score distribution is available [12]. However, no precise
statistical theory currently exists for the gapped alignment
score distribution and for score distributions from alignment programs using additional features. Accurate estimation of statistical significance of gapped sequence
alignment scores has attracted a lot of attention in the
recent years [13-21].
Although there exists some understanding of the statistics
of gapped alignment score distributions for simple scoring schemes [22,23], but a complete mathematical
description of the optimal score distribution remains far
from reach [23]. There exist some excellent reviews on statistical significance in sequence comparison in the literature [10,24-26].
Database statistical significance versus pairwise statistical
significance
Recently, a study of pairwise statistical significance and its
comparison with database statistical significance [27] was
conducted. In summary, the database statistical significance reported by most database search programs like
SSEARCH, FASTA, PSI-BLAST is database-dependent, and
http://www.biomedcentral.com/1471-2105/10/S3/S1
hence, the same alignment of two sequences with the
same alignment score can be evaluated as having different
significance values in database searches with different
databases, and even with the same database at different
times, as the database size can be variable. On the other
hand, pairwise statistical significance is specific to the
sequence pair being aligned, and is database-independent. In [27], various approaches to estimate pairwise statistical significance were compared to find that maximum
likelihood fitting with censoring left of peak is the most
accurate method for estimating pairwise statistical significance. Further, this method was compared with database
statistical significance in a homology detection experiment to find that pairwise statistical significance performs
comparably to and sometimes significantly better than
database statistical significance.
Accurate statistical significance estimates for pairwise
alignments can be very useful to comment on the relatedness of a pair of sequences aligned by an alignment program independent of any database. And thus, it can also
be used to compare different combination of alignment
parameters – like the alignment program itself, substitution matrices, gap costs. A comparison of different gap
opening penalties for four commonly used BLOSUM
matrices using pairwise statistical significance was presented in [28]. In addition to the standard local alignment
algorithms [4,5], some recent algorithms have been developed [8,9] that take into account other desirable biological features in addition to gaps – like difference blocks or
the use of multiple parameter sets (substitution matrices,
gap penalties). These features of the alignment programs
enhance the sequence alignment of real sequences by better suiting to different conservation rates at different spatial locations of the sequences. As pointed out earlier,
accurate statistical theory for alignment score distribution
is available only for ungapped alignment, and not even
for its simplest extension, i.e., alignment with gaps. Accurate statistics of the alignment score distribution from
newer and more sophisticated alignment programs therefore is not expected to be straightforward. For comparing
the performance of newer alignment programs, accurate
estimates of pairwise statistical significance can be very
useful. Further, quick and accurate estimates of pairwise
statistical significance can also be helpful for applications
like multiple sequence alignment and phylogenetic tree
construction which are based on pairwise sequence alignment to select most related pairs of sequences, for example, in a progressive multiple sequence alignment.
The extreme value distribution for ungapped and gapped
alignments
Just as the distribution of the sum of a large number of
independent identically distributed (i.i.d.) random variables tends to a normal distribution (central limit theo-
Page 2 of 9
(page number not for citation purposes)
BMC Bioinformatics 2009, 10(Suppl 3):S1
rem), the distribution of the maximum of a large number
of i.i.d. random variables tends to an extreme value distribution (EVD) [29]. This is an important and useful fact,
because in principle it allows us to fit an EVD to the score
distribution from any local alignment program and use it
for estimating statistical significance of scores from that
program. The distribution of Smith-Waterman local alignment score between random, unrelated sequences is
approximately a Gumbel-type EVD [12]. In the limit of
sufficiently large sequence lengths m and n, the statistics
of HSP (High-scoring Segment Pairs which correspond to
ungapped local alignments) scores are characterized by
two parameters, K and λ. The probability (P-value) that
the optimal local alignment score S exceeds x is estimated
by:
Pr(S > x) ~1 - e-E,
where E is the E-value and is given by
E = Kmne-λx.
For E-values less than 0.01, both E-value and P-values are
very close to each other. The above formulae are valid for
ungapped alignments [12], and the parameters K and λ
can be computed analytically from the substitution scores
and sequence compositions. For gapped alignments, no
perfect statistical theory has yet been developed, although
there exist some good starting points for the problem as
mentioned before [22,23]. Recently, researchers have also
looked closely at the low probability tail distribution, and
the work in [30] applied a rare-event sampling technique
earlier used in [31] and suggested a Gaussian correction to
the Gumbel distribution to better describe the rare event
tail, resulting in a considerable change in the reported significance values. However, for most practical purposes,
the original Gumbel distribution has been widely used to
describe gapped alignment score distribution [9,1315,17,27,32,33]. From an empirically generated score distribution, we can directly observe the E-value E for a particular score x, by counting the number of times a score x
or higher was attained. Since this number would be different for different number of random shuffles N (or number
of sequences in the database in case of database search), a
normalized E-value is defined as
E normalized =
E
N
In theory, this normalized E-value is same as the P-value
(for large N).
Contributions
In this paper, we extend the existing work on pairwise statistical significance [27] to incorporate in it the use of
http://www.biomedcentral.com/1471-2105/10/S3/S1
multiple parameter sets, and evaluate it on an important
knowledge discovery application-homology detection.
We conducted similar experiments as reported in [34],
and later in [27] on a subset of the CATH 2.3 database to
compare pairwise statistical significance with single and
multiple parameter sets. This benchmark database was
earlier created in [34] to evaluate seven protein structure
comparison methods and two sequence comparison programs: SSEARCH and PSI-BLAST. SSEARCH uses the original Smith-Waterman algorithm [4], and is considered the
most sensitive algorithm in terms of retrieval accuracy,
better than the heuristic versions like BLAST and FASTA
[35,36]. PSI-BLAST is a modification to the BLAST program, where position specific scoring matrices are constructed over multiple iterations of BLAST algorithm.
Comparison of pairwise statistical significance results
using multiple parameter sets with pairwise statistical significance using a single parameter set shows that at least
for some error levels, using multiple parameter sets is significantly better than using a single parameter set. This is
because sequences can have different conservation rates at
different spatial locations, which can be better aligned
using multiple parameter sets (substitution matrices, gap
penalties, etc.). Comparison with database statistical significance results show that pairwise statistical significance
with multiple parameter sets gives significantly better performance than the statistical significance estimates
reported by BLAST and PSI-BLAST, and comparable and at
times significantly better performance than the SSEARCH
program. Further, the results also give concrete evidence
that for the practical application of homology detection,
the score distribution from alignment program using multiple parameter sets can also be assumed to follow an
extreme value distribution. This is an important and useful finding since it is in general difficult to accurately
determine statistics of alignment scores from enhanced
alignment programs. Finally, experiments with different
values of parameter set change penalties indicate that it is
indeed important to use a non-zero parameter set change
penalty while performing alignment using multiple
parameter sets.
Methods
Pairwise statistical significance estimation
Consider the pairwise statistical significance described in
[27] to be obtainable by the following function: PairwiseStatSig(Seq1, Seq2, SC, N) where Seq1 is the first sequence,
Seq2 is the second sequence, SC is the scoring scheme
(substitution matrix, gap penalties), and N is the number
of shuffles. The function PairwiseStatSig, therefore generates a score distribution by aligning Seq1 with N shuffled
versions of Seq2, fits the distribution to an extreme value
distribution using censored maximum likelihood fitting
to obtain the statistical parameters K and λ, and returns
the pairwise statistical significance estimate of the pair-
Page 3 of 9
(page number not for citation purposes)
BMC Bioinformatics 2009, 10(Suppl 3):S1
wise alignment score between Seq1 and Seq2 using the
parameters K and λ in the P-value formula. More details
on pairwise statistical significance can be found in [27]. In
this paper, we dynamically use multiple parameter sets
instead of a single scoring scheme SC for estimation of
pairwise statistical significance.
Dynamic use of multiple parameter sets in sequence
alignment
Usually, pairwise sequence alignment is done with a single parameter set (substitution matrix, gap penalties). But
to suit the different levels of conservation between
sequences, there exists an algorithm [9] which can dynamically use multiple parameter sets and generate a single
optimal alignment with possibly different parameter sets
used in different regions of the alignment. The algorithm
is implemented in a program named GAP4. The algorithm uses a dynamic programming approach as
explained in [9]. Consider alignment of two sequences A
= a1, a2,..., am and B = b1, b2,... bn using p parameter sets P1,
P2,..., Pp. Let Ai and Bj be the subsequences a1, a2,... ai and
b1, b2,..., bj respectively. For each alignment position (i, j)
and each parameter set Pk, the algorithm keeps track of the
optimal alignment score of the subsequences Ai and Bj
where the last component (substitution, gap, or difference
block) is scored using Pk. Dynamic programming is used
to get optimal alignment for progressive alignment positions, until i becomes m and j becomes n. Appropriate
modification of the algorithm also allows it to calculate
the optimal local alignment. More details about using
multiple parameter sets for pairwise sequence alignment
can be found in [9].
Evaluation methodology
To evaluate the performance of pairwise statistical significance using multiple parameter sets, we used a nonredundant subset of the CATH 2.3 database (Class, Architecture, Topology, and Hierarchy, [37]) provided by [34]
and available at ftp://ftp.ebi.ac.uk/pub/software/unix/
fasta/prot_sci_04/. This database was selected in [34] to
evaluate seven structure comparison programs and two
sequence comparison programs. As described in [34], this
dataset consists of 2771 domain sequences and includes
86 selected test query sequences. This domain set is considered as a valid benchmark for testing protein comparison algorithms [38].
We used this database and query set for experimenting
with pairwise statistical significance using multiple
parameter sets. For each of the 86 × 2771 comparisons, we
used the maximum likelihood method with censoring left
of peak with 1000 shuffles to fit the score distribution
from the GAP4 program with substitution matrices
BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, and
their all possible combinations (24 - 1 = 15 in number).
http://www.biomedcentral.com/1471-2105/10/S3/S1
All matrices were used in 1/3 bit scale. The gap opening
penalties for each of these matrices was set to the values
empirically determined to be the best for this database in
[28]. These are listed in Table 1. The gap extension penalties were set to 2 for all the four matrices. Following
[27,34], Error per Query (EPQ) versus Coverage plots
were used to present the results. To create these plots, the
list of pairwise comparisons was sorted based on decreasing statistical significance (increasing P-values). Going
down the list, the coverage count is increased by one if the
two sequences of the pair are homologs, and the error
count is increased by one if they are not. At a given point
in the list, Errors Per Query (EPQ) is the total number of
errors incurred so far, divided by the number of queries;
and coverage is the fraction of total homolog pairs so far
detected. In the ideal case, the curve would go from 0% to
100% coverage, without incurring any errors, which
would correspond to a straight line on the x-axis. Therefore, the more the curve is towards the right, the better the
curve is.
Just as gap opening and gap extension penalties are
dynamically charged during the alignment process whenever a gap is inserted and extended respectively, the GAP4
[9] program allows the use of a parameter set change penalty, which is dynamically charged whenever the parameter set mapping is changed during the alignment process.
To see the effect of parameter set change penalty on the
coverage performance, we conducted a series of homology
detection experiments with one of the substitution matrix
combinations (BLOSUM45 and BLOSUM62) with different parameter set change penalties. Coverage vs. parameter set change penalty curves were plotted at different error
levels to find the usefulness of the parameter set change
penalty, as reported in the next section.
Results
Comparison with pairwise statistical significance using
single parameter set
Out of the 15 substitution matrix combinations, 4 are
using single parameter sets, 6 are using two parameter
sets, 4 are using three parameter sets, and 1 is using all
four parameter sets. The EPQ vs. Coverage curves using
pairwise statistical significance with two, three, and four
parameter sets are presented in Figures 1, 2, and 3 respecTable 1: Effective gap opening penalties for commonly used
BLOSUM matrices determined for the benchmark database
used
BLOSUM matrix
Gap opening penalty
BLOSUM45
BLOSUM50
BLOSUM62
BLOSUM80
7
9
11
13
Page 4 of 9
(page number not for citation purposes)
BMC Bioinformatics 2009, 10(Suppl 3):S1
http://www.biomedcentral.com/1471-2105/10/S3/S1
0.24
0.26
0.28
0.1
0.3
0.32
BL45
BL50
BL45,BL50
0.01
1
0.22
0.24
0.26
0.1
0.28
0.1
0.3
0.32
BL50
BL62
BL50,BL62
Coverage
Errors per Query
Errors per Query
BLOSUM50,
BLOSUM62
0.26
1
0.22
0.24
0.26
0.3
0.32
BL45
BL80
BL45,BL80
Coverage
(c)
10
BLOSUM50,
BLOSUM80
1
0.22
0.28
0.1
0.01
10
0.24
BLOSUM45,
BLOSUM80
(b)
10
0.01
0.32
Coverage
(a)
1
0.22
0.3
BL45
BL62
BL45,BL62
0.01
Coverage
0.28
0.24
0.26
0.28
0.1
0.01
(d)
0.3
0.32
BL50
BL80
BL50,BL80
Coverage
(e)
Errors per Query
1
0.22
10
BLOSUM45,
BLOSUM62
Errors per Query
10
BLOSUM45,
BLOSUM50
Errors per Query
Errors per Query
10
BLOSUM62,
BLOSUM80
1
0.22
0.24
0.26
0.28
0.1
0.01
0.3
0.32
BL62
BL80
BL62,BL80
Coverage
(f)
Figure
Pairwise1statistical significance using two parameter sets
Pairwise statistical significance using two parameter sets. Errors per Query vs. Coverage plot for pairwise statistical
significance using two parameter sets, along with the curves using corresponding single parameter sets. (a) BLOSUM45,
BLOSUM50; (b) BLOSUM45, BLOSUM62; (c) BLOSUM45, BLOSUM80; (d) BLOSUM50, BLOSUM62; (e) BLOSUM50,
BLOSUM80; (f) BLOSUM62, BLOSUM80. In 5 panels (b) through (f) out of 6, using two parameter sets leads to better coverage than using a single parameter set at most error levels.
tively. For comparison purposes, the EPQ vs. Coverage
curves using corresponding single parameter sets are also
presented in the same figures. The y-axis (error-axis) in all
these graphs is in log-scale, and hence there is more information in the upper part of the graphs. These figures suggest that pairwise statistical significance using multiple
parameter sets performs comparably to and sometimes
significantly better than pairwise statistical significance
using a single parameter set for most instances of using a
single parameter set, and at most error levels.
Comparison with database statistical significance
Since the EPQ vs. Coverage curves on the complete dataset
can be distorted due to poor performance by one or two
queries (if those queries produce many errors at low coverage levels) [34], to compare the performance of pairwise
statistical significance using multiple parameter sets with
database statistical significance, we examined the performance of the methods with individual queries, following the work in [34]. The coverage of each of the 86
queries at the 1st, 3rd, 10th, 30th, and 100th error was
recorded, and the median coverage for each error level
across the 86 queries was plotted to obtain EPQ vs. Coverage curves for the sequence comparison method to be
evaluated. Figure 4 shows the median coverage level at the
1st, 3rd, 10th, 30th, and 100th false positive for homologs
(i.e. 43 of the queries have worse coverage, and 43 have
better coverage). The curve for SSEARCH in Figure 4 is
derived from the figure 2A in [34]. The curves for BLAST
and PSI-BLAST were obtained by experimentation. It is
clear that the proposed pairwise statistical significance
using multiple parameter sets performs significantly better than BLAST and PSI-BLAST at all error levels, comparable to SSEARCH at low error levels, and significantly
better than SSEARCH at higher error levels.
Empirical justification of parameter set change penalty
The coverage vs. parameter set change penalty plot for the
substitution matrix combination of BLOSUM45 and
BLOSUM62 is illustrated in Figure 5. The curve shows a
Page 5 of 9
(page number not for citation purposes)
BMC Bioinformatics 2009, 10(Suppl 3):S1
http://www.biomedcentral.com/1471-2105/10/S3/S1
1
0.22
0.24
0.26
0.3
0.32
BL45
BL50
BL62
BL45,BL50,BL62
0.1
0.01
(a)
0.24
0.26
0.28
0.01
0.3
0.32
BL45
BL62
BL80
BL45,BL62,BL80
0.1
1
0.22
0.24
0.26
Coverage
0.28
0.3
0.32
BL45
BL50
BL80
BL45,BL50,BL80
0.1
Coverage
(b)
10
BLOSUM45,
BLOSUM62,
BLOSUM80
1
0.22
BLOSUM45,
BLOSUM50,
BLOSUM80
0.01
Coverage
10
Errors per Query
0.28
Errors per Query
10
BLOSUM45,
BLOSUM50,
BLOSUM62
Errors per Query
Errors per Query
10
BLOSUM50,
BLOSUM62,
BLOSUM80
1
0.22
0.24
0.26
0.28
(c)
0.32
BL50
BL62
BL80
BL50,BL62,BL80
0.1
0.01
0.3
Coverage
(d)
Figure
Pairwise2statistical significance using three parameter sets
Pairwise statistical significance using three parameter sets. Errors per Query vs. Coverage plot for pairwise statistical
significance using three parameter sets, along with the curves using corresponding single parameter sets. (a) BLOSUM45,
BLOSUM50, BLOSUM62; (b) BLOSUM45, BLOSUM50, BLOSUM80; (c) BLOSUM45, BLOSUM62, BLOSUM80; (d)
BLOSUM50, BLOSUM62, BLOSUM80. In all 4 panels, using three parameter sets leads to better coverage than using a single
parameter set at most error levels for at least two instances of using a single parameter set.
poor coverage performance for the case when the parameter set change penalty is not charged, i.e., when the alignment algorithm is freely allowed to change the parameter
set during alignment without charging any penalty. This
can be explained by the fact that the algorithm would try
to mathematically maximize the alignment score by
changing the parameter set as frequently as possible,
which may produce more biologically irrelevant alignments. A similar phenomenon is also observed when very
low gap penalty is used [28]. The coverage performance
clearly improves for non-zero values of parameter set
change penalty, which provides its empirical justification.
Discussion
As pointed out earlier, SSEARCH employs the original
Smith-Waterman algorithm for alignment, and is considered more sensitive than its heuristic implementations
like BLAST and FASTA. PSI-BLAST uses an iterative
approach with query-specific substitution matrices, and
its performance mainly depends on the quality of position-specific scoring matrices (PSSMs) constructed iteratively. The results show that PSI-BLAST gave poorer
performance than pairwise statistical significance using
multiple parameter sets, even with PSSMs constructed
against the benchmark CATH database used in our experiments. However, using PSSMs derived against BLAST
non-redundant protein database has been shown to give
better results [34] as it uses much more information than
Page 6 of 9
(page number not for citation purposes)
BMC Bioinformatics 2009, 10(Suppl 3):S1
0.35
BLOSUM45,
BLOSUM50,
BLOSUM62,
BLOSUM80
1
0.22
0.24
0.3
0.26
0.28
0.3
0.32
BL45
BL50
BL62
BL80
BL45,BL50,BL62,BL80
0.1
0.01
0
just a pair of sequences. Comparable and at times significantly better results than SSEARCH using pairwise statistical significance with multiple parameter sets implies that
statistical significance estimates at least as good as database statistical significance can be obtained by pairwise
100
Comparison with
database statistical
significance
Errors per Query
60
BL45,BL62
BL50,BL80
BL45,BL50,BL80
BL45,BL50,BL62,BL80
BLAST
PSI-BLAST
SSEARCH
20
0
0.2
0.3
Coverage
0.4
0.1 EPQ (9 errors)
1 EPQ (86 errors)
5 EPQ (430 errors)
10 EPQ (860 errors)
BLOSUM45,
BLOSUM62
0.15
Figure
Pairwise3statistical significance using four parameter sets
Pairwise statistical significance using four parameter
sets. Errors per Query vs. Coverage plot for pairwise statistical significance using four parameter sets BLOSUM45,
BLOSUM50, BLOSUM62, BLOSUM80 along with the curves
using corresponding single parameter sets. Using four parameter sets results in better coverage than using a single parameter set at most error levels for at least three instances of
using a single parameter set.
40
0.25
0.2
Coverage
80
Coverage
Errors per Query
10
http://www.biomedcentral.com/1471-2105/10/S3/S1
0.5
Figure 4
Comparison with database statistical significance
Comparison with database statistical significance.
Comparison of pairwise statistical significance (using multiple
parameter sets) and database statistical significance. All
parameter combinations are significantly better than database
statistical significance estimates reported by BLAST and PSIBLAST at all error levels, and better than SSEARCH especially at higher error levels.
1
2
3
4
5
6
7
8
Parameter Set Change Penalty
9
10
Figure 5
Empirical justification of parameter set change penalty
Empirical justification of parameter set change penalty. Coverage vs. Parameter Set Change Penalty plots at different errors per query for the substitution matrix
combination of BLOSUM45 and BLOSUM62. Poor coverage
is obtained if the parameter set change penalty is zero. The
coverage is better and steady for non-zero values of parameter set change penalty.
statistical significance using multiple parameter sets without having to do a time-consuming database search. This
can be very useful to estimate accurate pairwise statistical
significance of two (or a few) sequences, which is a common scenario in many pairwise alignment based applications like phylogenetic tree construction, progressive
multiple sequence alignment.
It is important to note that the proposed method is not a
database search method but statistical significance estimation method for pairwise local alignment, and the comparison with database search programs like BLAST,
SSEARCH, and PSI-BLAST is of their statistical significance
estimation strategies. The proposed method, as of now is
not scalable to a database search, but can be used to refine
the results from a fast database search program like
BLAST.
Since pairwise alignment using multiple parameter sets
takes more computational time than using a single
parameter set, pairwise statistical significance estimation
using multiple parameter sets also takes more time than
pairwise statistical significance estimation using a single
parameter set. In general, using k parameter sets increases
the computation time by a factor little more than k. Therefore, faster methods for significance estimation can be
very helpful.
Conclusion
This paper extends the work on pairwise statistical significance by incorporating in it the use of multiple parameter
Page 7 of 9
(page number not for citation purposes)
BMC Bioinformatics 2009, 10(Suppl 3):S1
sets (substitution matrices, gap penalties, etc.), and compares it with database statistical significance for the
knowledge discovery application of homology detection.
The results show that pairwise statistical significance using
multiple parameter sets performs better than pairwise statistical significance using a single parameter set. It also
performs significantly better than database statistical significance using BLAST and PSI-BLAST, and comparable
and at times significantly better than database statistical
significance using SSEARCH. Further, an empirical justification of the use of parameter set change penalty is provided.
Since PSI-BLAST results can be improved by using better
quality PSSMs derived from larger protein databases, we
believe that the performance of pairwise statistical significance can also be improved using sequence-specific/position-specific substitution matrices, which is a significant
part of our future work. Another important contribution
can be to estimate the pairwise statistical significance
accurately in less time, since using multiple parameter sets
increases the significance estimation time. Faster methods
for determining pairwise statistical significance would be
very useful.
http://www.biomedcentral.com/1471-2105/10/S3/S1
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
AA conceived the study based on the initial idea given by
XH. Both AA and XH did the programming, AA carried out
the experiments and analysis, and drafted the initial manuscript. Both AA and XH read and approved the final version of the manuscript.
Acknowledgements
The authors would like to thank Dr. Volker Brendel for helpful discussions
and providing links to the data.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 3, 2009: Second International Workshop on Data and Text Mining
in Bioinformatics (DTMBio) 2008. The full contents of the supplement are
available online at http://www.biomedcentral.com/1471-2105/10?issue=S3.
20.
21.
22.
23.
24.
25.
26.
References
1.
2.
3.
4.
5.
Pearson WR, Lipman DJ: Improved Tools for Biological
Sequence Comparison. Proc Natl Acad Sci U S A 1988,