<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>1471-2105-7-484.fm</title>
<meta name="Author" content="petere"/>
<meta name="Creator" content="FrameMaker 7.0"/>
<meta name="Producer" content="Acrobat Distiller 7.0 (Windows)"/>
<meta name="CreationDate" content=""/>
</head>
<body>
<pre>
BMC Bioinformatics
BioMed Central
Open Access
Methodology article
A statistical score for assessing the quality of multiple sequence
alignments
Virpi Ahola*1,2, Tero Aittokallio3,6, Mauno Vihinen4,5 and Esa Uusipaikka2
Address: 1Biotechnology and Food Research, MTT Agrifood Research Finland, Jokioinen, Finland, 2Department of Statistics, University of Turku,
Turku, Finland, 3Department of Mathematics, University of Turku, Turku, Finland, 4Institute of Medical Technology, University of Tampere,
Tampere, Finland, 5Research Unit, Tampere University Hospital, Tampere, Finland and 6Systems Biology Unit, Institut Pasteur, Paris, France
Email: Virpi Ahola* - virpi.ahola@mtt.fi; Tero Aittokallio - tero.aittokallio@utu.fi; Mauno Vihinen - mauno.vihinen@uta.fi;
Esa Uusipaikka - esa.uusipaikka@utu.fi
* Corresponding author
Published: 03 November 2006
BMC Bioinformatics 2006, 7:484
doi:10.1186/1471-2105-7-484
Received: 11 April 2006
Accepted: 03 November 2006
This article is available from: http://www.biomedcentral.com/1471-2105/7/484
© 2006 Ahola et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Background: Multiple sequence alignment is the foundation of many important applications in
bioinformatics that aim at detecting functionally important regions, predicting protein structures,
building phylogenetic trees etc. Although the automatic construction of a multiple sequence
alignment for a set of remotely related sequences is a very challenging and error-prone task,
many downstream analyses still rely heavily on the accuracy of the alignments.
Results: To address the need for an objective evaluation framework, we introduce a statistical
score that assesses the quality of a given multiple sequence alignment. The quality assessment is
based on counting the number of significantly conserved positions in the alignment using the
importance sampling method in conjunction with a statistical profile analysis framework. We first
evaluate a novel objective function used in the alignment quality score for measuring the positional
conservation. The results for the Src homology 2 (SH2) domain, Ras-like proteins, peptidase M13,
subtilase and β-lactamase families demonstrate that the score can distinguish sequence patterns
with different degrees of conservation. Secondly, we evaluate the quality of the alignments
produced by several widely used multiple sequence alignment programs using the novel alignment
quality score and the commonly used sum of pairs method. According to these results, the Mafft
strategy L-INS-i outperforms the other methods, although the differences among Probcons,
TCoffee and Muscle are mostly insignificant. The novel alignment quality score provides results
similar to those of the sum of pairs method.
Conclusion: The results indicate that the proposed statistical score is useful in assessing the
quality of multiple sequence alignments.
Background
A wealth of molecular data concerning the linear structure
of proteins and nucleic acids is available in the form of
DNA, RNA and protein sequences. Multiple sequence
alignment has become an essential and widely used tool
for understanding the structure and function of these molecules. The results of annotation of gene/protein
sequences, prediction of protein structures or building of
phylogenetic trees, for instance, are critically dependent
on the quality of the given alignment. It has been recognized
that the automatic construction of a multiple
sequence alignment for a set of remotely related
sequences can be a very demanding task. Therefore, there
is a need for an objective approach to evaluate the alignments produced by alignment programs.
Two popular measures for scoring entire multiple alignments are the sum of pairs (SP) score and the column
score (CS) [1]. These scores can, however, only be used if
a reference alignment of the same sequences is available.
The SP score calculates the proportion of identically
aligned residue pairs in the test and the reference alignments, whereas the CS score measures the fraction of
identically aligned positions. Several modifications have
been made to the SP score [2,3]. The APDB (Analyze
alignments with PDB) quality measure evaluates the quality of an alignment by using available tertiary structures of
the sequences in the alignment [4]. The recently introduced multiple overlap score (MOS) is a promising
approach, which does not need a reference alignment [5].
The MOS searches for identically aligned regions in many
alignments and presumes that the alignment with the
highest number of such residues also has the highest quality.
We introduce a statistical alignment quality score which
first quantifies the degree of conservation at each alignment position and then counts the number of significantly conserved positions over the alignment. For
measuring the degree of conservation, we use a type of Z-score
that is based on profile analysis [6]. After deriving
the maximum Z-score for positional conservation, the statistical significance of an observed score value is estimated
using the importance sampling method [7]. The full alignment quality score is defined in terms of positional significance levels, where the multiple comparison problem is
addressed with false discovery rates (FDR) [8]. The practical performance of the maxZ score is demonstrated using
the SH2 domain, Ras-like proteins, peptidase M13, subtilase and β-lactamase families. The alignment quality score
is finally applied to evaluate the alignment programs
Clustal [9], TCoffee [10], Dialign2 [11], Probcons [12],
Muscle [13], and Mafft [14,15].
Related work
Several approaches have been proposed for the conservation analysis of multiple sequence alignments to quantify
the degree of conservation at each aligned position using
column-specific score values [16]. Valdar reviewed a wide
range of such score types developed during the last two
decades for protein sequence analysis [17]. He also introduced the following three criteria that a positional conservation score should fulfill: (i) the score should be a
mathematical mapping from an alignment position into a
bounded interval of real values which (ii) takes into
account the relative symbol frequencies in the column,
and (iii) their stereo-chemical properties. Additional
requirements for a good conservation score include the
possibility to incorporate (iv) the effect of gaps and (v)
sequence weighting into (vi) a simple scoring strategy.
Existing positional scoring approaches can be roughly
divided into two categories with respect to the second and
third criteria. In the first category, the positional conservation is characterized based on the symbol frequencies
only. Such frequency-based methods include, for
instance, the information-content score that quantifies the
variability among the observed symbols at a particular
position by means of Shannon's entropy [18,19]. A popular variation of the information-content (IC) score measures the Kullback-Leibler distance (relative entropy)
between the observed symbol distribution and a background distribution of a priori symbol probabilities [20].
The background probability of an individual symbol may
be calculated from the complete alignment, possibly supplemented with symbol-dependent pseudo-counts [21].
Alternatively, the a priori distribution can be determined
using overall relative frequencies of symbols within the
sequences of the organism or protein family under investigation.
In the second category of scoring approaches, the positional conservation is characterized based on both symbol
frequencies and their similarity properties. Such similarity-based scores address the fact that some symbol combinations occur more frequently than others mainly because
of the chemical and physical properties. The most
straightforward strategy is to group all the symbols
according to their physicochemical properties before
applying a particular scoring scheme. For instance, Taylor
presented a classification of amino acids based on their
synthesis in the Dayhoff mutation data matrix [22,23].
Subsequently, the degree of positional conservation with
respect to each overlapping group of symbols can be
quantified using any frequency-based scoring approach,
such as the information content [24]. Different conservation scores accounting for the stereochemical sensitivity
can be obtained using different symbol properties [25].
In general, the symbol properties can be considered by
predefining an appropriate matrix where entries represent
the similarity or dissimilarity between a symbol pair. Frequently used symbol scoring matrices for amino acids
include the BLOSUM and Gonnet series of substitution
matrices and PAM distance matrices [26-28]. Perhaps the
most widely used scoring approach, 'sum-of-pairs', characterizes the positional conservation by calculating the
sum of all pairwise similarities between the symbols in the
particular column [29]. It should be noted that this 'sum
of pairs' score is different from the SP score mentioned
earlier in the Background section. The SP score in [1] is
used to measure alignment quality with respect to the reference alignment, whereas the score by Carrillo and Lipman [29] is more generally applicable. In this work, we
only use the reference alignment-based SP score. A similar
but more complex mean distance (MD) score is used as an
objective function in the multiple alignment software
Clustal [9]. This normalized MD score also considers the
fraction of gaps [30]. A number of variations can be made
by using different similarity matrices on symbols or
weighting schemes on sequences [31].
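The column-wise 'sum-of-pairs' idea described above can be illustrated with a minimal sketch. The similarity values below are toy numbers chosen for illustration only (they are not taken from BLOSUM, Gonnet or PAM), the function names are hypothetical, and gaps, normalization and sequence weighting are deliberately left out.

```python
from itertools import combinations

# Toy symmetric similarity values, for illustration only (assumption, not a real matrix).
TOY_SIMILARITY = {
    ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 6,
    ("I", "L"): 2, ("D", "L"): -4, ("D", "I"): -3,
}

def similarity(a, b, matrix=TOY_SIMILARITY):
    """Look up the similarity of a symbol pair, in either order."""
    key = (a, b) if (a, b) in matrix else (b, a)
    return matrix[key]

def sum_of_pairs_column(column, matrix=TOY_SIMILARITY):
    """Sum the similarities over all unordered symbol pairs in one column."""
    return sum(similarity(a, b, matrix) for a, b in combinations(column, 2))

if __name__ == "__main__":
    for column in ("LLL", "LLI", "LID"):
        print(column, sum_of_pairs_column(column))
```

With these toy values an invariant column scores highest and a column mixing dissimilar residues scores lowest, which is the behaviour the approach relies on.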
The present work is a continuation of our previous work
on a statistical (Dunn-Sidak) framework for detecting
conserved residues in the positions of a multiple sequence
alignment [32]. Here, we allow for the incorporation of
any symbol similarity matrix into the framework that was
based on a simple frequency-based scoring function. We
have previously demonstrated the usefulness of this score
in the automatic detection of the conserved residues in a
multiple sequence alignment, and compared its results on
the SH2 domain with functionally and structurally important positions of the alignment [32]. Another application
of the conservation scores includes the improvement of
the reliability of HMMs in the sequence similarity search
by decreasing the number of false positive search results
[33]. In the present study, the emphasis is on positional
conservation rather than on individual residues with the
aim of assessing the quality of the full alignment.
Results
Evaluating the maxZ score for positional conservation
In this section, we study the practical performance of the
maxZ score in SH2 domain, Ras-like proteins, peptidase
M13, subtilase and β-lactamase families. We first demonstrate the effect of five different scoring matrices and then
we compare the performance of maxZ score with those of
information content (IC) and Mean Distance (MD) score
[20,9]. Finally, we demonstrate how the maxZ score can
be used to generate a consensus sequence.
Multiple sequence alignments
We used the multiple sequence alignments of the SH2
domains, Ras-like proteins, peptidase M13, subtilase and
β-lactamase families to evaluate the maxZ score. The
alignments for the SH2 domain, peptidase M13, subtilase
and β-lactamase families were obtained from the Pfam
database [34]. The seed alignments of the SH2 domain,
peptidase M13, subtilases and β-lactamases consist of 58,
24, 45 and 128 sequences, respectively. These alignments
also include gaps. The sequence alignment of the Ras-like
proteins was downloaded from the web page of an article
by Oliveira et al. [35]. The alignment was built with a
two-step alignment procedure [36]. First they classified
sequences into groups with approximately 90% pairwise
sequence identity. Sequences within each subgroup were
aligned against the profile, then the groups were aligned,
excluding positions with low sequence identity. The positions with gaps were also excluded from the final alignment. We used only the first sequence of each subgroup in
order to avoid over-representation of profiles with many
very similar sequences. This was necessary because the
current maxZ score does not take the pairwise identity of
the sequences into account or otherwise weight the
sequences. The alignment of Ras-like proteins consists of
334 sequences.
Upper panels of Figures 1, 2, 3 illustrate parts of the
alignments of the Ras-like proteins, SH2 domain, peptidase M13, subtilase and β-lactamase families. The complete alignments of the Ras-like proteins and SH2 domain
can be found as additional files (Additional files 1, 2, 3, 4,
5, 6, 7, 8, 9). The figures were generated using the MultiDisp
graphics program developed to visualize multiple
sequence alignments [37] (Riikonen et al., in preparation). The lower parts of the alignments include the maxZ,
MD and IC score values. The Blosum62 and grouping of
amino acids were used as a scoring matrix in the maxZ
score.
Effect of the scoring matrices
One advantage of the maxZ score is that it can consider
the physicochemical relationships of amino acids. The
user is able to choose an arbitrary scoring matrix or classification of the amino acids, which can be incorporated
into the calculation of the maxZ score. In addition to the
identity matrix, we demonstrate the use of three different
scoring matrices: Blosum62, Gonnet250 and PAM250
[26-28]. Additionally, we classify amino acids into six
physicochemically related groups as follows: hydrophobic {V, I, L, F, M, W, Y, C}, negatively charged {D, E}, positively charged {R, K}, conformational {G, P}, polar {N,
Q, S} and {A, T}. This classification has been used, for
example, by Shen and Vihinen [38]. Figure 1 shows the
scaled -log(p)-values for the Ras-like proteins using the
five different scoring schemata.
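The six-group classification listed above can be written down as a lookup table, which is one way to see what "scoring by class instead of by residue" means in practice. The sketch below only reduces a column to group frequencies; it does not compute the maxZ score. The group labels, the name "A/T" for the unnamed sixth group and the function name are our own, and histidine is left ungrouped because it does not appear in the list as given in the text.

```python
from collections import Counter

# Groups exactly as listed in the text; labels are ours. H is absent from the listed groups.
GROUPS = {
    "hydrophobic": "VILFMWYC",
    "negative": "DE",
    "positive": "RK",
    "conformational": "GP",
    "polar": "NQS",
    "A/T": "AT",
}
RESIDUE_TO_GROUP = {res: name for name, members in GROUPS.items() for res in members}

def group_frequencies(column):
    """Relative frequency of each group among the grouped residues of a column.

    Gaps and residues outside the six groups are skipped (an assumption of this sketch).
    """
    labels = [RESIDUE_TO_GROUP[r] for r in column if r in RESIDUE_TO_GROUP]
    counts = Counter(labels)
    total = len(labels)
    return {name: counts[name] / total for name in counts} if total else {}

if __name__ == "__main__":
    # A column of V, I, L, L, D and one gap: four of five grouped residues are hydrophobic.
    print(group_frequencies("VILLD-"))
```

A column such as V, I, L, L looks variable residue by residue but is fully conserved at the group level, which is why the grouped scheme can score it highly.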
The residue positions in the alignment of Ras-like proteins
were divided into five groups according to the entropy
and variability [35]. The parameter values of the classification algorithm were chosen such that the groups represent
the known structural and/or functional roles of the residue positions. A rough overview of the categories is the
following:
• Box 11 contains positions with low entropy and variability. The positions in this group form a main functional
site.
• Box 12 consists of positions with low variability and
moderate entropy. These positions are located in the core
of the structure next to the residues in Box 11.
• Box 22 contains positions with moderate entropy and
variability. These residue positions are located in the core
structure but are not adjacent to the residues in the Box
11. The positions are involved in the structure of the protein, but also in signal transmission between the modulators and the main functional site.
• Box 23 consists of the positions with high entropy and
moderate variability. These positions are located at the
surface or in the core of the protein and are involved in
modulator interaction.
• Box 33 contains highly variable positions with high
entropy. These positions are mainly located at the surface
of the protein.
Figure 1
MultiDisp visualization of part of the Ras-like proteins (upper) and the corresponding scaled -log(p)-values
(lower). The curves show the p-values calculated using (red) Blosum62, (green) Gonnet250, (black) PAM250, (magenta) identity scoring matrices and (blue) classification of the amino acids for the Ras-like proteins.
Figure 2
MultiDisp visualization of the a) βB-strand, b) βD-strand and c) αB-helix of the SH2 domain (upper) and the corresponding conservation scores (lower). The curves show (red) the scaled -log(p)-values, (blue) Mean Distance and
(green) Information content scores for the alignment. Consensus sequence for the alignment positions in c) is F P S L P E L V E
H Y.
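The Box categories above rest on two per-position quantities, entropy and variability. As a rough illustration, the sketch below assumes that entropy means the Shannon entropy of the residue distribution in a column and that variability means the number of distinct residue types observed there; the exact definitions and the thresholds separating the boxes are those of [35] and are not reproduced here. The function names are hypothetical.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the residue distribution in one column, gaps ignored."""
    residues = [r for r in column if r != "-"]
    counts = Counter(residues)
    total = len(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def column_variability(column):
    """Number of distinct residue types in one column, gaps ignored (assumed definition)."""
    return len({r for r in column if r != "-"})

if __name__ == "__main__":
    for column in ("LLLL", "LLII", "LIVM"):
        print(column, column_entropy(column), column_variability(column))
```

An invariant column has zero entropy and a variability of one; both quantities grow as more residue types share the column more evenly.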
For a more detailed description of the categories, see the
original paper [35]. Table 1 shows the median (lower and
upper quartile) values of the -log(p)-values of the maxZ
scores with different scoring matrices, along with MD and
IC scores in each of the five groups. As expected, all conservation scores decreased gradually when moving from
the positions with low entropy and variability to those
with high entropy and variability. The performance of the
MD and maxZ scores was very similar. The maxZ score
with groups of amino acids distinguished the moderately
conserved positions (Boxes 12–23) from the highly
conserved positions (Box 11) and the unconserved ones
(Box 33) slightly better than the other scores (Table 1, Figure 1).
Table 1: Median (lower and upper quartiles) of the -log(p)-values with different residue scoring schema together with the MD and IC
scores.
Score          Box11            Box12            Box22            Box23            Box33
LogP Blosum62  708 (708, 708)   611 (198, 708)   208 (161, 547)   120 (99, 177)    75 (47, 123)
LogP Gonnet    708 (708, 708)   190 (164, 708)   158 (131, 189)   98 (78, 136)     64 (56, 106)
LogP Indep     708 (708, 708)   202 (158, 708)   171 (108, 202)   75 (63, 113)     57 (35, 96)
LogP PAM       708 (212, 708)   201 (166, 708)   153 (125, 201)   94 (81, 133)     66 (56, 105)
LogP 6 groups  644 (631, 683)   312 (300, 333)   279 (241, 341)   216 (77, 240)    43 (26, 91)
MD             92 (86, 97)      43 (29, 55)      34 (24, 42)      24 (19, 31)      20 (15, 25)
IC             57 (55, 59)      39 (34, 48)      31 (27, 35)      21 (19, 23)      13 (10, 19)
Box11 and Box33 represent positions with low and high entropy and variability, respectively. The three middle columns represent the moderately
conserved positions. More detailed description of the categories can be found in Oliveira et al. [35].
Figure 3
MultiDisp visualization of the a) I, b) II, c) III and d) IV motifs of the peptidase M13, e) I, f) II and g) III motifs of
the subtilase, and h) I and i) II motifs of the β-lactamase families and the table of the conservation scores. MD =
mean distance, IC = information content scores and maxZ = scaled -log(p)-values for the alignment.
In both Ras-like protein and SH2 domain examples, all
the scoring schemes tend to provide very similar results
(see Additional files 1, 2, 3, 4, 5, 6, 7, 8, 9 for Blosum62
and grouping of amino acids). The results with Blosum,
Gonnet and PAM matrices all rely heavily on the diagonal
values of the scoring matrices. For instance, a position
with highly or moderately conserved leucine obtains a relatively low maxZ score (Figure 1), whereas a position with
an unconserved cysteine may also be scored as highly
conserved. This is especially critical when the Gonnet
scoring matrix is used. The results with six amino acid
groups differed most from the other scoring schemes since
this scheme calculates the maxZ score for the amino acid classes
instead of single residues. The grouping of amino acids
tends to give high scores for the positions where the
majority of the residues belong to the same class. The use
of the identity matrix corresponds to the special case
where similarities among the symbols are ignored, and
the amino acids are handled as if they were unrelated.
The corresponding score is thus based solely on the relative frequencies of the residues and background probabilities. The scoring based on the identity matrix gives results quite
similar to those of the Blosum62 and Gonnet matrices.
For some positions, however, the identity matrix fails to
detect the conserved positions. Similar behavior was seen
with the PAM matrix (Figure 1, position 10).
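The dependence on diagonal values noted above can be made concrete by looking at the self-substitution entries of the standard BLOSUM62 matrix. The sketch below only lists and ranks those raw matrix entries; it is not the maxZ computation, and the function name is hypothetical.

```python
# Diagonal (self-substitution) entries of the standard BLOSUM62 matrix.
BLOSUM62_DIAGONAL = {
    "A": 4, "R": 5, "N": 6, "D": 6, "C": 9, "Q": 5, "E": 5, "G": 6, "H": 8, "I": 4,
    "L": 4, "K": 5, "M": 5, "F": 6, "P": 7, "S": 4, "T": 5, "W": 11, "Y": 7, "V": 4,
}

def rank_by_diagonal(diagonal=BLOSUM62_DIAGONAL):
    """Residues sorted from the highest to the lowest diagonal value (ties alphabetical)."""
    return sorted(diagonal, key=lambda r: (-diagonal[r], r))

if __name__ == "__main__":
    ranked = rank_by_diagonal()
    print("highest:", ranked[:3])
    print("lowest:", ranked[-5:])
    print("C vs L:", BLOSUM62_DIAGONAL["C"], BLOSUM62_DIAGONAL["L"])
```

Tryptophan, cysteine and histidine carry the largest self-scores, while leucine is among the smallest, which is consistent with a well conserved leucine position receiving a lower matrix-based score than a position containing cysteine.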
Comparisons with other scores
The results of the maxZ score were compared with those
of the MD and IC. Figures 2 and 3 show the MD and IC
scores together with the -log(p)-values of the maxZ scores
for the SH2 domain, peptidase M13, subtilase and β-lactamase family sequences. Scaling of the -log(p)-values was
performed using zero as a minimum. The maximum value
was obtained by calculating the -log(p)-values for each
possible invariant position and defining the 5% percentile
value to be the maximum. Blosum62 was used as a scoring matrix in the maxZ score. The default multiple
sequence alignment parameters of ClustalX were used to
calculate the MD score.
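The scaling just described can be sketched as follows, under stated assumptions: zero is kept as the minimum, the maximum is the 5th percentile of the -log(p)-values obtained for the possible invariant positions, and values above that maximum are clipped to 1. The clipping and the linear-interpolation percentile are our assumptions, since the text does not specify them, and the function names are hypothetical.

```python
def percentile(values, q):
    """q-th percentile (0 to 100) using linear interpolation between sorted values."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

def scale_neg_log_p(neg_log_p, invariant_neg_log_p):
    """Scale -log(p) values to [0, 1].

    Zero maps to 0; the 5th percentile of the invariant-position values maps to 1;
    larger values are clipped to 1 (assumption of this sketch).
    """
    maximum = percentile(invariant_neg_log_p, 5)
    return [min(v / maximum, 1.0) for v in neg_log_p]

if __name__ == "__main__":
    # Hypothetical -log(p) values for invariant positions and for observed columns.
    invariant = [100, 200, 300, 400, 500]
    observed = [0, 60, 120, 240]
    print(scale_neg_log_p(observed, invariant))
```

Taking a low percentile rather than the largest invariant value means that most invariant columns reach the top of the scale regardless of which residue they contain.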
SH2 domain SH2 domains are binding modules recognizing phosphotyrosines and surrounding residues in
polypeptides and proteins [39,40]. Many SH2 domains
recognize especially residues +1 and +3 following the
phosphotyrosine and form binding pockets for these
amino acids [41]. All known SH2 domains share the same
architecture, consisting of a central antiparallel β-sheet
flanked by two α-helices. The central β-sheet (strands B, C
and D) forms the core of the structure and includes most
of the conserved residues.
All scores consider the positions forming the binding
pocket as highly conserved (> 0.4). These include invariant βB5, which interacts with phosphotyrosine, and βD4
and αA2 (data not shown), which form the binding
pocket for the phosphotyrosine [42] (Figure 2ab). Position βD6, which is also involved in forming the binding
pocket, obtains lower conservation score values (≈ 0.2)
indicating moderate conservation. The binding pockets
for phosphotyrosine-following residues are formed by the
αB-helix; in particular, positions αB5 – 6 are involved in
forming the hydrophobic core for residue +3 [43]. Positions βB2, αB9 and βF3 are occupied with aromatic residues. The maxZ and IC scores determine these five
positions as highly conserved, whereas the MD score (0.2
– 0.4) determines positions αB9 and βF3 as moderately
conserved (Figure 2c). The binding site for ligand residue
+1 includes positions βD3 and βD5 [42]. While the maxZ
and IC scores determine position βD5 as moderately conserved, the MD score (< 0.2) rather considers that position
as unconserved (Figure 2b).
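The cut-offs quoted in the SH2 discussion above (above 0.4, between 0.2 and 0.4, below 0.2) can be collected into a small helper. This is only a reading aid: the text uses these values descriptively for the SH2 domain, the handling of the exact boundary values is our assumption, and the function name is hypothetical.

```python
def conservation_class(scaled_score):
    """Label a scaled conservation score with the cut-offs quoted for the SH2 domain."""
    if scaled_score > 0.4:
        return "highly conserved"
    if scaled_score >= 0.2:
        return "moderately conserved"
    return "unconserved"

if __name__ == "__main__":
    for score in (0.9, 0.3, 0.2, 0.1):
        print(score, conservation_class(score))
```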
Peptidase family M13 Peptidase family M13, also known
as the neprilysin family, consists of type II integral transmembrane proteins with a short N-terminal cytoplasmic
domain, a hydrophobic transmembrane region, and a
large ectodomain containing the active site [44]. Three conserved motifs characterize all known M13 endopeptidases
(Pfam alignment positions are given in parentheses): I: vNAfY (0–4),
II: XXHEXXH--XX (63–73), III: EXXXD (147–151) (Figures 3abc).
Additionally, IV: HXXXXXR (217–223) is conserved in neprilysins (Figure 3d).
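Motifs written in this X-notation translate directly into regular expressions, with each X read as "any residue". The sketch below searches an ungapped sequence for the core of motifs II, III and IV; the example sequence is invented for illustration, the dictionary keys and function name are hypothetical, and the offsets are 0-based positions in the given sequence, not Pfam alignment positions.

```python
import re

# X in the motif notation becomes "." (any single residue) in the regular expression.
MOTIFS = {
    "II (HEXXH)": re.compile(r"HE..H"),
    "III (EXXXD)": re.compile(r"E...D"),
    "IV (HXXXXXR)": re.compile(r"H.....R"),
}

def find_motifs(sequence, motifs=MOTIFS):
    """Return {motif name: [0-based start offsets]} of non-overlapping matches."""
    return {name: [m.start() for m in pattern.finditer(sequence)]
            for name, pattern in motifs.items()}

if __name__ == "__main__":
    toy_sequence = "AAHEVGHAAEAAADAAHAAAAAR"  # invented sequence, not a real peptidase
    print(find_motifs(toy_sequence))
```

Patterns this short match by chance in many unrelated sequences, so a hit is only meaningful together with the alignment context discussed in the text.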
All measures scored the residues H65, H69 and E147, which are ligands for Zn2+, and E66 and H217,
which are involved in catalysis, as highly conserved (Figure 3bcd). The maxZ
score values varied from 0.68 to 1 in the invariant positions occupied with different amino acids, whereas the
corresponding MD score values were more stable. This
was due to different diagonal values of the scoring matrix.
Similar behavior was found at position 219 of
motif IV, where proline was the most frequent residue.
The maxZ score determined that position as highly conserved (0.84), whereas the other scores only considered it
as moderately conserved (0.38 and 0.52). For the other
important side-chains of N1, A2, D215, H217 and R223,
which have a role in substrate binding, the behavior of the
three scores was mostly very similar (Figure 3ad). The only
exception was position D215, which was considered
as moderately conserved by the maxZ and IC scores (0.22
and 0.44), while the MD score considered it as unconserved. Another difference between the scores was in
positions 70 and 71 of motif II, where the IC score
did not identify these positions as inserts but
instead gave considerably high conservation score values.
Subtilisins Pfam subtilase is a family of serine proteases
consisting of S8 and S53 peptidase families of the
MEROPS database. The S8 peptidases are divided into two
subfamilies: S8A (e.g. subtilisin) and S8B (e.g. kexin). The
sequences in the S8 family have a catalytic triad Asp/His/
Ser. In the subfamily S8A, the active site residues occur (in