<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>The Giardia lamblia vsp gene repertoire: characteristics, genomic organization, and evolution</title>
<meta name="Subject" content="BMC Genomics 2010, 11:424. doi: 10.1186/1471-2164-11-424"/>
<meta name="Author" content="Rodney D Adam, Anuranjini Nigam, Vishwas Seshadri, Craig A Martens, Gregory A Farneth, Hilary G Morrison, Theodore E Nash, Stephen F Porcella, Rima Patel"/>
<meta name="Creator" content="FrameMaker 8.0"/>
<meta name="Producer" content="Acrobat Distiller 9.0.0 (Windows)"/>
<meta name="CreationDate" content=""/>
</head>
<body>
<pre>
Adam et al. BMC Genomics 2010, 11:424
http://www.biomedcentral.com/1471-2164/11/424
Open Access
RESEARCH ARTICLE
The Giardia lamblia vsp gene repertoire:
characteristics, genomic organization, and
evolution
Rodney D Adam*1, Anuranjini Nigam2, Vishwas Seshadri2, Craig A Martens3, Gregory A Farneth3, Hilary G Morrison4,
Theodore E Nash5, Stephen F Porcella3 and Rima Patel6
Abstract
Background: Giardia lamblia trophozoites colonize the intestines of susceptible mammals and cause diarrhea, which
can be prolonged despite an intestinal immune response. The variable expression of the variant-specific surface
protein (VSP) genes may contribute to this prolonged infection. Only one is expressed at a time, and switching
expression from one gene to another occurs by an epigenetic mechanism.
Results: The WB Giardia isolate has been sequenced at 10× coverage and assembled into 306 contigs as large as 870
kb in size. We have used this assembly to evaluate the genomic organization and evolution of the vsp repertoire. We
have identified 228 complete and 75 partial vsp gene sequences for an estimated repertoire of 270 to 303, making up
about 4% of the genome. The vsp gene diversity includes 30 genes containing tandem repeats, and 14 vsp pairs of
identical genes present in either head to head or tail to tail configurations (designated as inverted pairs), where the two
genes are separated by 2 to 4 kb of non-coding DNA. Interestingly, over half the total vsp repertoire is present in the
form of linear gene arrays that can contain up to 10 vsp gene members. Lastly, evidence for recombination within and
across minor clades of vsp genes is provided.
Conclusions: The data presented here constitute the first comprehensive analysis of the vsp gene family from the Genotype
A1 WB isolate, with an emphasis on vsp characterization, function, evolution and contributions to the pathogenesis of this
important pathogen.
Background
Giardia lamblia (syn. G. duodenalis, G. intestinalis) is an
anaerobic protist that is medically important as a common cause of intestinal infection and diarrhea [1].
Humans and other susceptible mammals become
infected when cysts are ingested from contaminated
water or food and excyst into trophozoites in the proximal small intestine. These trophozoites replicate and
cause the symptoms of diarrhea. Infections with Giardia
are frequently prolonged and malabsorption with weight
loss may last for months in the absence of treatment,
despite an immune response that would be expected to
eradicate the infection. One of the possible reasons for
the persistence of infection is antigenic variation of the
variant-specific surface proteins (VSPs).
* Correspondence: adamr@u.arizona.edu
1 Departments of Medicine and Immunobiology, University of Arizona College
of Medicine, Tucson, AZ, USA
A single trophozoite expresses only a single member of
this protein family at any one time [2], but may switch
expression from one VSP to another in vitro at a rate
which has been estimated at once every 6 to 13 generations [3]. Antigenic variation has also been identified in
gerbils [4], mice [5,6], and humans [7], and occurs during
the process of encystation/excystation [8,9]. Antigenic
variation occurs in the absence of alteration of DNA
sequence or chromosomal location of the vsp genes [10-12], and likely occurs by epigenetic mechanisms involving
histone acetylation status [12] and/or RNAi [13].
The VSP proteins are all cysteine-rich and have frequent CXXC motifs [1]. They have a 14 to 17 amino acid
N-terminal signal peptide which is predicted to direct the
protein to the trophozoite surface, upon which it is
cleaved from the remainder of the peptide [14,15]. Most
of the mature VSP protein diffusely coats the outside of
the trophozoite [16].
Full list of author information is available at the end of the article
© 2010 Adam et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
The C-terminus concludes with a nearly invariant
CRGKA motif that most likely remains in the cytoplasm
[17]. Immediately adjacent to the CRGKA motif is a
hydrophobic domain that likely forms the membrane
anchor for the VSP [17,18]. The entire C-terminal conserved region is about 38 amino acids in length, but the
upstream portions of the VSPs can be highly different,
allowing the organism to expose very different antigens to
the host.
Despite the similarities in the N-terminal signal peptide, the CXXC motifs, and the conserved C-terminus,
other features of the vsp genes vary substantially, such as
the presence or absence of tandem repeats. VspA6 is
about 5.6 kb in length, with the initial 99 bp followed by
23+ copies of a 195 bp tandem repeat, then about 1.3 kb
of additional sequence that ends with the highly conserved 3' end [10]. Tandem repeats have also been
reported in other vsps [11,19-21]. The vspC5 gene is over
2 kb in length and begins with 66 bp at the 5' end followed
by over 20 copies of a 105 bp tandem repeat that
extends to the conserved 3' end [11]. Thus, almost the
entire portion of this molecule that is assumed to be
exposed to the extracellular environment consists of tandem repeats.
One vsp gene has been described that is encoded as a
duplicated pair. The two copies of the vsp1267 ORF exist
as an inverted pair in a tail to tail orientation with approximately 3 kb of intervening DNA [17].
Previous studies have shown that vsp genes fall into
related groups or families. One example is vspA6 and the
related genes, vspA6-S1 and vspA6-S2, all sequenced from
the WB isolate [22]. VspA6-S2 differs from vspA6 by its
possession of a 201 bp tandem repeat that is similar to the
195 bp repeat of vspA6. VspA6-S1 is nearly identical to
vspA6 with only scattered substitutions outside the
repeat region, but has only a bit more than one copy of
the 195 bp tandem repeat, rather than the approximately
23 copies found in vspA6. Interestingly, vspA6-S1 has also
been sequenced from a Genotype A1 Peruvian isolate
called G3M [23], and both sequences were identical, suggesting that across Genotype A1 isolates, some vsp genes
may not be changing rapidly.
An additional example of vsp genes from Genotype A1
isolates falling within related families includes several
genes that are similar to TSA417 [24], including tsp11
[25-27], such that homology is apparent throughout the
entire sequence. In addition, there are vsps with near
identity in the upstream non-coding region followed by
substantial divergence within the coding region [28].
A genomic evaluation of the vsp genes has been made
possible by the sequencing of the genome of the WB isolate [29] at 10× coverage (greater than 98% coverage) [30].
The G. lamblia isolates obtained from humans were originally divided into three groups; 1, 2, and 3. These groups
are now designated as Genotypes/assemblages A1, A2,
and B, respectively. Genotype A1 is a highly homogeneous group and is about 98-99% identical to Genotype
A2. Relatively little sequence data is available for Genotype A2. Genotype B is only about 80-90% identical to
Genotype A1 and has been proposed as a separate species [31-33]. Although the genome of the Genotype B isolate, GS, has been reported, the coverage was not
sufficient to fully describe the vsp repertoire. Therefore,
the current manuscript will focus on the Genotype A1
vsp repertoire from the WB genome, which was completed at 10× coverage and assembled into 306 contigs,
the largest of which is nearly 1 Mb in size [30].
Results and Discussion
Identification of vsp genes
All the vsp genes are cysteine-rich with frequent CXXC
motifs present throughout their length [34]. However, the
Giardia genome also encodes other cysteine-rich proteins including three cyst-wall protein (CWP) genes [35]
and 61 High Cysteine Membrane Proteins (HCMPs) [36].
The HCMPs also have CXXC motifs, but in addition, they
have frequent CXC motifs, which are distinctly uncommon in the VSPs. Most notably, the vsp genes differ from
these other genes encoding cysteine-rich proteins by
their possession of a highly conserved 3' end that concludes with the translated amino acid sequence, CRGKA. For
vsps that were complete at the 3' end, a great majority
contained the encoded CRGKA motif in its entirety,
while a few genes matched at three or four residues.
Using these criteria, we identified a total of 303 vsp
genes in the current WB assembly, of which 228 were
complete and 75 were partial or incomplete ORFs. The 228 complete genes include 10 inverted gene pairs
with identical coding sequences, so many of the analyses of
complete genes included 218 sequences. A complete list
of the vsp genes along with their features is shown in
Additional file 1. Of the 75 partial genes that were considered in some of the analyses, 32 were incomplete at the
3' end and 43 were incomplete at the 5' end. Since the
genome coverage is >98%, it is unlikely that any vsp genes
were missed completely, although it is possible that some
were excluded from the assembly because of the potential
for complete or near 100% sequence identity to other vsp
genes. The total number of 303 vsp genes is probably a
high estimate for the total genomic repertoire since it is
likely that some of the genes that are incomplete at the 5'
end should or could have been paired with genes that are
incomplete at the 3' end. Therefore, for the purposes of
discussion, we have estimated the total number of vsp
genes to be 228 + 43 or 271 total. These estimates are
consistent with earlier range estimates of 150 to 300 that
were based on DNA hybridization studies [3].
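As a quick consistency check on the counts above, the arithmetic can be written out explicitly. This is only an illustrative sketch: the variable names are ours, and every number is taken from the text.

```python
# Counts stated in the text for the current WB assembly.
complete = 228           # complete vsp genes
partial_3prime = 32      # partial genes incomplete at the 3' end
partial_5prime = 43      # partial genes incomplete at the 5' end
inverted_pairs = 10      # inverted pairs with identical coding sequences

partial = partial_3prime + partial_5prime     # all partial genes
upper_bound = complete + partial              # every partial gene counted separately
# Working estimate used in the text: complete genes plus the partial
# genes that are incomplete at the 5' end.
working_estimate = complete + partial_5prime
unique_complete = complete - inverted_pairs   # one representative per inverted pair

print(partial, upper_bound, working_estimate, unique_complete)  # 75 303 271 218
```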
The 303 genes have been numbered sequentially, based
upon their contig number and location. The contigs were
numbered sequentially by decreasing size. Therefore, large
numbers in vsp gene names indicate that the genes are
found on smaller contigs. The vsps that occur as duplicated,
inverted pairs, and that therefore have identical coding regions,
share the same numerical designation, followed by a ".1" or ".2"
suffix to distinguish the two copies.
The vsp coding regions comprise approximately 3% of
the total genome. The inclusion of the upstream intergenic areas, which may be tasked with control of vsp gene
expression, increased the value to greater than 4%. Thus,
the vsp genes comprise at least 3-4% of the total Giardia
genome, which is consistent with estimates from a previous Giardia genome sampling survey [37].
The presence of a candidate Inr
All of the vsp genes previously described in the literature
have had the following DNA consensus sequence, PyAatgTT, at the beginning of the coding region, where atg represents the initiation codon. We have shown in transient
transfection studies that this consensus sequence is
required for efficient expression of luciferase from a vsp
promoter region (unpublished). Therefore, this consensus region surrounding the start codon is a candidate Initiator Element (Inr). We analyzed the vsp gene sequences
for the presence of the PyAatgTT sequence and found
that fewer than half of the genes contained this Inr
sequence: 107 (41%) of the 260 vsp
genes that were complete at the 5' end contained the
PyAatgTT sequence. Of these, 65 vsp genes contained
the putative Inr "CAatgTT", while 42 contained
the sequence "TAatgTT". As described above and below,
many of the vsp genes were found in linear arrays in the
genome. Of the 25 vsp genes that represented the first vsp
gene within the linear arrays, 12 contained the putative
Inr. In surprising contrast, none of the downstream members of the linear arrays contained this sequence. In total,
107 (72%) of the 148 complete vsp genes that were not
present as downstream members of linear arrays contained the conserved Inr sequence. For those vsps containing the putative Inr, the Inr usually (74%) included the first,
most 5' ATG of the gene. None of the downstream in-frame
ATGs was embedded in a complete PyAatgTT
sequence. Further investigation and analysis will be
required to determine the role the Inr plays in vsp expression across the variety of vsp genes and whether the
presence or absence of the Inr plays a role in in vivo vsp
expression.
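The Inr criterion described above can be expressed as a simple pattern match. The following Python sketch is an illustration under stated assumptions, not the procedure used in this study: the function name and its two arguments (the sequence immediately upstream of the start codon, and the coding sequence beginning at the ATG) are hypothetical. The last lines recompute the proportions reported in the text.

```python
import re

# PyAatgTT: a pyrimidine (C or T), then A, the ATG start codon, then TT.
INR = re.compile(r"[CT]AATGTT")

def has_inr(upstream: str, cds: str) -> bool:
    """Return True if the start codon of `cds` sits inside a PyAatgTT match.

    `upstream` is the sequence immediately 5' of the start codon and `cds`
    begins with ATG; both are hypothetical inputs for this sketch.
    """
    if len(upstream) < 2 or not cds.upper().startswith("ATG"):
        return False
    # Two upstream bases plus ATG and the two bases that follow it.
    window = (upstream[-2:] + cds[:5]).upper()
    return INR.fullmatch(window) is not None

# Proportions reported in the text.
with_inr = 65 + 42   # CAatgTT plus TAatgTT
print(with_inr, round(100 * with_inr / 260), round(100 * with_inr / 148))  # 107 41 72
```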
vsp genes without an ATG start codon may represent
pseudogenes
Three of the 218 vsp genes with complete sequences
available had no detectable in-frame ATG. These
three genes were identified by their possession of the conserved 3' sequence concluding with an encoded CRGKA
and were all found in the middle of large contigs. They
were relatively short vsp-encoding sequences, with coding
regions of 224, 329, and 902 bp following the upstream stop codon. There were no other obvious differences between these three genes and those containing
an ATG. Because of their internal contig location and the
absence of any evidence of mis-assembly, we believe these
three genes may represent inactive, truncated vsp
pseudogenes.
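One way to screen for such genes is to walk upstream from the terminal stop codon in codon steps and look for an in-frame ATG before the next in-frame stop. The sketch below is a minimal illustration of that idea; the function name and the choice to report the most 5' ATG are our assumptions, not a description of the pipeline used here.

```python
STOPS = {"TAA", "TAG", "TGA"}

def first_inframe_atg(seq: str, stop_index: int):
    """Return the index of the most 5' in-frame ATG, or None.

    `seq` is a DNA string and `stop_index` is the position of the terminal
    stop codon of the putative vsp ORF. The scan moves upstream one codon
    at a time and ends at the first in-frame stop codon it meets.
    """
    seq = seq.upper()
    best = None
    pos = stop_index - 3
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOPS:
            break          # upstream boundary of the open reading frame
        if codon == "ATG":
            best = pos     # keep the most 5' ATG seen so far
        pos -= 3
    return best
```

A return value of None corresponds to the situation described above: an open stretch ending in the conserved 3' sequence but lacking any in-frame start codon.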
Vsp size range, CXXC motifs, tandem encoded repeats and
recombination analysis of vsps
Vsp size range and CXXC motifs were analyzed within
the 218 unique vsps. The data demonstrated a wide range
in vsp size, from 222 to 6777 bp of coding sequence. CXXC
motif analysis found an average of 12.6 encoded CXXC
motifs per kb (range 3.6 to 18.6 per kb). The wide range in
vsp size, along with differing numbers of CXXC motifs,
suggests a diverse range of function or performance within
the vsp family. More work is needed to relate VSP protein structure,
function and/or surface expression to vsp
gene size and to the presence of high or low numbers of
CXXC motifs.
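A motif density of this kind can be computed by counting CXXC occurrences in the translated protein and dividing by the length of the coding sequence in kb. The sketch below is illustrative only; in particular, counting overlapping motifs (as in CXXCXXC) is our assumption, since the text does not say how overlaps were handled.

```python
import re

# Lookahead so that overlapping motifs (for example CXXCXXC) are all counted.
CXXC = re.compile(r"(?=C..C)")

def cxxc_per_kb(protein: str, cds_length_bp: int) -> float:
    """CXXC motifs in the translated protein per kb of coding sequence."""
    n_motifs = len(CXXC.findall(protein.upper()))
    return n_motifs / (cds_length_bp / 1000)
```

For example, a hypothetical 13-residue peptide `MCAACGGCTTCKK` contains three overlapping CXXC motifs.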
Tandem repeats (TR) are a feature of some VSPs and
have been shown to be immunodominant in the case of
the 195 bp repeat domain found in VspA6 [19]. Thus, the
number and type of repeats encoded in vsps are likely to
play an important role in modulating, directing or avoiding the host immune response.
We identified a total of 30 complete vsp genes containing TR (Fig 1). The sequences of the TR were different
and unique for each of the 30 genes identified (see the
phylogenetic alignment of the tandem repeats in Additional file 2). The nucleotide sequences of the individual
repeats ranged from 105 to 516 bp in length. The
number of copies of the TR ranged from two to more
than 20 per vsp. It is important to note that the repeat
copy numbers should be viewed as approximate because
vsp genes with larger numbers of repeats may have been
artificially condensed by the genome assembly process. In
addition, there is evidence of allelic variability in repeat
copy number for at least two of the previously identified
repeat-containing vsp genes, vspA6 and vspC5 [10,11,28].
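To make the notion of repeat copy number concrete, the sketch below counts consecutive exact copies of a repeat unit starting at a given position. It is a deliberately simplified illustration: real tandem repeats in vsp genes are often imperfect, so an exact-match count like this would underestimate copy number, and the function name and arguments are hypothetical.

```python
def tandem_copies(seq: str, start: int, unit_len: int) -> int:
    """Count consecutive exact copies of the unit seq[start:start + unit_len]."""
    unit = seq[start:start + unit_len]
    if unit_len == 0 or len(unit) != unit_len:
        return 0
    copies = 0
    pos = start
    # Advance one unit at a time while the next block equals the first unit.
    while seq[pos:pos + unit_len] == unit:
        copies += 1
        pos += unit_len
    return copies
```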
Figure 1 Examples of vsp genes with tandem repeats. Three examples of vsp genes with tandem repeats are shown (scale bar, 0.5 kb). The ORFs are divided into 3 or 4 sections, numbered 1 through 4 on the diagrams. For
these 3 genes, the first section consists of 66 to 99 bp beginning with
the start codon and encoded leader peptide, and extending to the tandem repeats. The second section consists of multiple tandem repeats,
which for two of the vsp genes (B and C) extends up to the conserved
3' region. The third section (A only) is nonrepetitive DNA extending
from the tandem repeats to the conserved 3' region. (A) Vsp67 (also
CRP170, vspA6) has approximately 22 copies of a 195 bp tandem repeat [10]. (B) vsp122 has three copies of a 516 bp repeat, by far the largest repeat found. (C) vsp90 (vspC5) has approximately 25 copies of a
105 bp tandem repeat [11].
vsp Conserved 3' domain
A mechanism by which expression status of multiple vsps
could be monitored simultaneously would be valuable in
the continued study of this interesting gene family. The
conserved 3' end has enough similarity among vsps to
allow the development of "universal amplification primers". Therefore, we analyzed the conserved 117 nucleotides of the 3' terminus in the 254 vsp genes for which
this sequence was available to determine whether there
was sufficient diversity within the 3' end to allow unique
identification of most vsps. Among these 254 genes, there
were 190 distinct 3' sequences. These 190 sequences
encode 99 different amino acid sequences. Thus, the
great majority of vsp genes have unique 3' conserved
regions. The nucleotide alignments are shown in Additional file 3 and the amino acid alignments are shown in
Additional file 4.
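The comparison of distinct 3' nucleotide sequences with distinct encoded peptides can be sketched as follows. This is an assumed, minimal implementation for illustration: the function names are ours, `genes` is a hypothetical mapping of gene name to coding sequence, and the standard genetic code is used for translation.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna: str) -> str:
    """Translate a DNA string in frame 0; unknown codons become X."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def distinct_tails(genes: dict, tail_len: int = 117) -> tuple:
    """Return (distinct nucleotide tails, distinct translated tails)."""
    tails = {seq[-tail_len:].upper() for seq in genes.values() if len(seq) >= tail_len}
    peptides = {translate(t) for t in tails}
    return len(tails), len(peptides)
```

Because synonymous codons collapse on translation, the number of distinct peptides can be much smaller than the number of distinct nucleotide tails, as with the 190 nucleotide sequences and 99 amino acid sequences reported above.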
DNA alignments and evidence for DNA recombination
between vsp genes
It was previously proposed that gene duplication followed
by divergence was one of the mechanisms by which the
current number and diversity of the vsp repertoire was
generated [22]. The diversity of the DNA
sequences is illustrated by the phylogenetic tree shown in
Additional file 5. Since prior studies suggested recombination among vsps [28], we performed a systematic
search for evidence of horizontal movement of DNA
sequences in the form of recombination between vsp
genes. Recombination analysis of all 218 full length vsps
was not possible due to the wide range of sequence divergence across the entire gene family (data not shown).
However, a more focused approach, in which we (1) used the C-terminus conserved region as an alignment reference
point for all vsps and (2) gradually added longer upstream sequences, allowed us to identify minor
clades of vsps for running alignments, generating trees,
and testing for evidence of recombination. We used the
Sawyers test in the Geneconv package of algorithms for
these minor clades analyses. For each incremental
increase of adjacent upstream 3' sequence, alignments
and trees produced greater diversity and longer branching patterns, which correlated with the increased diversity seen in the 5' coding regions of the vsps. Clades and
subgroups within clades were identified for various
lengths of upstream sequence, all anchored with the 3'
conserved region (data not shown). The first minor clade
analyzed for recombination involved a group of vsp genes
with similarity along the 234 bp comprising the 3' terminus. In the phylogenetic tree, these 12 vsp sequences
actually separated into two related minor groups or
branches (all within Clade II of the three major clades of
vsp genes; see Alignments and phylogenetic tree analysis
section and Fig 2). The percent polymorphism for these
12 genes as a group was 74.36% (Table 1). The two subgroups consisted of eight and four vsp genes with 64.52%
and 25.21% polymorphisms, respectively (Table 1).
Recombination analysis demonstrated significant recombination at Bonferroni corrected values of 0.0311 and
0.0071 respectively, within each of the two subgroups. No
significant recombination could be detected between the
group of eight and the group of four genes.
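The percent polymorphism values quoted above are consistent with the number of variable alignment columns divided by the alignment length. The sketch below shows that calculation under this assumption; the function name is ours, and it is not the software used to produce Table 1.

```python
def percent_polymorphic(alignment: list) -> tuple:
    """Return (polymorphic sites, percent) for equal-length aligned sequences.

    A column counts as polymorphic when it holds more than one distinct
    base. Gaps are not treated specially, since Table 1 reports zero gaps.
    """
    length = len(alignment[0])
    if any(len(s) != length for s in alignment):
        raise ValueError("sequences must be aligned to the same length")
    sites = sum(1 for column in zip(*alignment) if len(set(column)) > 1)
    return sites, round(100 * sites / length, 2)

# Check against the 12-gene group: 174 variable sites in 234 bp.
print(round(100 * 174 / 234, 2))  # 74.36
```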
The longest sequence that we could find extending
upstream from the 3' terminus, with a sufficient number
of vsp genes containing related sequence was 657 bp,
found in 13 genes. These 13 vsp genes demonstrated
23.28% polymorphism and showed significant recombination between them (p = 0.0365) in this region.
We then asked whether, by shortening the upstream sequence
from 657 bp, we could find minor clades
containing at least two subgroups of vsps that could then
be tested in the same way as the
first 234 bp region (in other words, for
recombination across subgroups). Shortening the
sequence to 546 bases allowed us to find seven vsp
genes that separated into two subgroups of three and four
members each. Here, a Bonferroni-corrected (BC) P-value for recombination was found within the clade of
seven sequences (0.0114), demonstrating evidence for
recombination across the two minor groups. Also, evidence of recombination was found among the members
of the three-member clade (BC value 0.0002) and
within the four-member subgroup (0.0195). The Dualbrothers recombination algorithm confirmed the recombination data for these seven members and allowed
discovery of several high mutation frequency regions at
~25 bp and 250 bp (Additional file 6A). These high frequency mutation regions could be bracketing DNA fragments and assisting in their horizontal movement, while
keeping the intervening sequence intact. A 552 base pair
fragment was also discovered for seven other vsp genes,
and in this analysis, a BC value of 0.0001 for the seven
members again demonstrated recombination across two
minor subgroups. In addition, a value of 0.0000 within the
four-member group demonstrated recombination within
this subgroup. Within the three-member
subgroup, no recombination could be detected, probably
due to the low amount of sequence polymorphism
observed across these three members (7.24%). Dualbrothers analysis confirmed evidence of recombination
across the 552 bases of these seven genes (Additional file
6B). In summary, evidence exists that recombination has
occurred within conserved but divergent regions of the
vsp genes. Evidence also exists that discrete high-mutation-frequency regions bracket low-frequency regions,
implying protection or movement of functionally important regions during recombination.

Table 1: Recombination analysis

DNA polymorphism analysis
Group        VSPs  BP   Gaps  Polymorphisms (%)  Single var sites  Pi-value
All (1&2)    12    234  0     174 (74.36)        42                0.3403
clade1       8     234  0     151 (64.52)        85                0.2599
clade2       4     234  0     59 (25.21)         50                0.1467
longClade    13    657  0     153 (23.28)        17                0.0865
All (3 & 4)  7     546  0     143 (26.19)        14                0.1347
clade3       3     546  0     95 (17.4)          95                0.1160
clade4       4     546  0     110 (20.14)        72                0.1178

Sawyers test for recombination
Group        Inner frags  Best inner p-val  Best inner BC-val  Outer frags  Best outer p-val  Best outer BC-val
All (1&2)    26           0.0000            0.0007             0            N/A               N/A
clade1       16           0.0000            0.0002             1            0.0027            0.0311
clade2       4            0.0008            0.0073             4            0.0011            0.0071
longClade    41           0.0000            0.0000             1            0.0365            0.3755
All (3 & 4)  10           0.0000            0.0002             1            0.0013            0.0114
clade3       2            0.0000            0.0002             2            0.0000            0.0002
clade4       6            0.0048            0.0268             2            0.0195            0.1069

Clade 1 contains vsp genes: 41, 180, 49, 48.1, 204, 31, 43, and 38.
Clade 2 contains vsp genes: 95, 18, 21 and 23.
longClade contains vsp genes: 279, 235, 272, 111, 248, 223, 191, 83, 234, 271, 222, 216 and 137.
Clade 3 contains vsp genes: 217, 139 and 138.
Clade 4 contains vsp genes: 270, 261, 290 and 62.
Clade 5 contains vsp genes: 150, 140, 153 and 151.
Clade 6 contains vsp genes: 245, 82 and 210.

Figure 2 Unrooted phylogenetic tree of the translated amino
acid sequences of all 218 vsp genes with complete sequences. Full
length vsp proteins were aligned, the alignment manually corrected,
and an unrooted tree produced. Predominant clades are listed as I, II,
and III and colored Green, Blue and Red. Many of the 218 vsp names are
not shown on this tree due to confined space at the ends of the branch
points. Representative members of each clade are shown. The bar at
the bottom of the figure (0.5) signifies branch length related to number of
substitutions per residue analyzed.

Alignments and phylogenetic tree analysis of 218 complete
VSP proteins and vsp DNA sequences
The 218 full length Giardia WB VSP proteins were
aligned and an unrooted tree produced (Fig 2). The
majority of the 218 VSP protein sequences fall into three,
well-defined clades labeled I, II and III (Fig 2). Three VSP
proteins, whose branches are colored black (vsps 3, 13,
and 122) lie between clades II and III and contain
sequences common to both of those clades. This tree
appears to suggest an expansion of three functional
groupings of VSP proteins within the WB genome. Of the
30 vsp genes containing tandem repeats, 27 fell within
clades II and III, while the remaining three consisted of
the three vsp genes that were between clades II and III.
We also performed a protein alignment using two copies
of the tandem repeats, taken from each of the 30 TR-containing genes and discovered that they clustered into several distinct clades (Additional file 2). Since a previous
report described a vsp gene (vspG3M/vspA6-S1; vsp175)
that was highly similar along the entire gene with vspA6
(vsp67), but contained only one copy of the 195 bp repeat
[19], we theorized that degenerated sequence forms of
the conserved tandem repeats may exist within non-repeat-containing vsp genes. We performed a BLAST
analysis using single copies of each of the 30 TR to query
the non-repeat-containing vsp genes. Nine of the 30 demonstrated >80% nucleotide identity over a stretch of at
least 80 bp to 10 of the non-repeat-containing genes. This
result suggests that additional divergent homologies
exist, that other vsp genes may have contained similar
repetitive motifs that diverged over time, or that recombination between repetitive motifs and non-repeat-containing vsp genes might be occurring.
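The similarity threshold used above (more than 80% identity over at least 80 bp) can be illustrated with a brute-force, ungapped window scan. This sketch is only a stand-in for BLAST, which uses seeded, gapped local alignment; the function name and parameters are hypothetical, and the scan is quadratic in sequence length.

```python
def best_window_identity(query: str, subject: str, window: int = 80) -> float:
    """Highest ungapped percent identity of any `window`-long stretch.

    Every window of `query` is compared, position by position, with every
    window of the same length in `subject`.
    """
    query, subject = query.upper(), subject.upper()
    best = 0.0
    for i in range(len(query) - window + 1):
        q = query[i:i + window]
        for j in range(len(subject) - window + 1):
            s = subject[j:j + window]
            matches = sum(a == b for a, b in zip(q, s))
            best = max(best, 100 * matches / window)
    return best
```

Under this simplified criterion, a repeat unit would be flagged as related to a non-repeat-containing gene when the returned value exceeds 80 for a window of 80 bp.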
In order to compare protein to DNA phylogenetic relatedness, and to determine if the DNA sequences differed
substantially from the protein sequences, the DNA
sequences encoding these 218 vsp genes were aligned and
a Neighbor Joining tree with bootstrap values at significant nodes was produced (Additional file 5). A color coding scheme illustrating the three dominant clades
produced from the protein alignment along with those
associated numerical clade designations is shown in
Additional file 5. Much of the conservation and clade designation seen in the protein-based tree is apparent in the
DNA sequence-based tree. Further characterization and
finishing of the partial vsp sequences will determine if
these predominant grouping trends hold true for all vsp
genes and their protein encoding sequences within the
WB genome.
Protein motifs may imply potential VSP function
The protein sequences of all complete vsp genes were
analyzed for the presence of motifs that might provide
clues to their function. A zinc finger motif has been
described as being encoded by a subset of the vsp genes
[38], and zinc binding has been documented for VSPs,
although in substoichiometric amounts [39]. Therefore,
we were especially interested to see how commonly the
zinc finger motif was found across the entire repertoire of
218 complete WB VSPs. Using Pfam 22.0 to search the
218 complete VSPs for protein motifs, we found 36 VSPs
that had weak hits (E value between 1 and 1e-5) to the
C3HC4 type zinc finger (RING finger), a
cysteine-rich domain of 40-60 residues that coordinates two zinc ions and has the consensus sequence C-X2-C-X(9-39)-C-X(1-3)-H-X(2-3)-C-X2-C-X(4-48)-C-X2-C, where X is any amino acid. Since we found only weak hits to a
zinc RING finger motif, we hypothesized that the VSPs
contain a novel Giardia-specific zinc finger. To elucidate
this further we searched for the previously described
motifs [39], namely, CxxCHxxCxxC and CxxCxxxCxxC.
Table 2 shows data for the number of these two motifs
found within the 218 complete WB VSPs and the alternative amino acids found in the 5th position of the CxxCxxxCxxC motif. One or both of these motifs was present in
151 vsps, with 73 copies of the CxxCHxxCxxC motif in 67
VSPs and 255 copies of the CxxCxxxCxxC motif present
in 149 VSPs. Only two vsps had the CxxCHxxCxxC but
not the CxxCxxxCxxC motif. More experiments are
needed to determine whether the CxxCHxxCxxC and/or CxxCxxxCxxC motif is required for zinc binding or whether there is
a difference in their zinc binding properties. Interestingly,
the CxxCHxxCxxC motif does not seem to overlap with
the weak RING finger domain identified by Pfam, providing further evidence of a putative novel domain within those
VSPs that binds zinc, another metal, or another substrate.

Table 2: Zinc finger motifs

Fifth AA    #      Charge
H           73     basic
N           60     polar
D           54     acidic
G           50     polar
A           33     nonpolar
P           18     nonpolar
S           17     polar
T           15     polar
V           5      nonpolar
Q           2      polar
E           1      acidic
Total       328
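The motif tally summarized in Table 2 can be sketched as a regular-expression scan. The following is a minimal illustration only, not the authors' pipeline: the function name and the toy sequence are hypothetical, and a lookahead is used so that overlapping copies of the motif would all be counted.

```python
import re
from collections import Counter

# C-x-x-C-x-x-x-C-x-x-C; group 1 captures the fifth residue.
# The lookahead (?=...) lets overlapping motif copies be reported.
MOTIF = re.compile(r"(?=C..C(.)..C..C)")

def fifth_position_counts(protein):
    """Tally the residue at position 5 of every CxxCxxxCxxC match."""
    return Counter(m.group(1) for m in MOTIF.finditer(protein))

# Hypothetical toy sequence (not a real VSP) with one CxxCHxxCxxC copy
# and one copy carrying N in the fifth position.
toy = "MA" + "CAACHAACAAC" + "GG" + "CTTCNTTCTTC" + "K"
counts = fifth_position_counts(toy)

# Copies with H in position 5 correspond to the CxxCHxxCxxC motif.
h_copies = counts["H"]
other_copies = sum(counts.values()) - h_copies
```

Under this convention the 73 H-containing copies and the 255 copies with other fifth residues add up to the 328 total in Table 2.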
Of the 218 complete VSPs, 182 (83%) have a GGCY
motif. The average number of GGCYs in these 182 genes
is 1.4 per VSP, with a maximum of five occurring in one
vsp. Most GGCY motifs are in the carboxyl portion of
VSPs but some exist closer to the N terminus. The function of the GGCY motif is not known, but some data suggest
that it may be immunogenic. For example, a 12 amino
acid epitope containing this motif induced late-appearing
antibodies in neonatal mouse infections [40], and antibodies to a larger fragment containing the GGCY motif
caused trophozoite detachment and aggregation [41].
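Summary figures of the kind quoted above (fraction of VSPs carrying GGCY, mean copies among carriers, maximum per protein) could be derived as in the following sketch. The protein names and sequences are hypothetical placeholders, and the function is an illustration rather than the method used in the study.

```python
def ggcy_summary(proteins):
    """Return (fraction with GGCY, mean copies among carriers, max copies)."""
    # GGCY cannot overlap itself, so str.count finds every copy.
    copies = {name: seq.count("GGCY") for name, seq in proteins.items()}
    carriers = [c for c in copies.values() if c > 0]
    fraction = len(carriers) / len(copies)
    mean = sum(carriers) / len(carriers) if carriers else 0.0
    return fraction, mean, max(copies.values())

# Hypothetical toy proteins, not real VSP sequences.
toy = {"vspA": "MGGCYAAAGGCY", "vspB": "MAAAA", "vspC": "MKGGCYK"}
fraction, mean, most = ggcy_summary(toy)
```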
In addition to the weak RING finger domain, three
VSPs had very good (E value <1e-20) hits to the BmKX
domain, a short bioactive peptide present in the venom of
scorpions and thought to act as a potassium channel
blocker [42]. Five additional VSPs showed moderate (E
value between 1e-5 and 1e-20) hits to the Lamin_EGF
domain (2 VSPs) and the EGF_CA domain (3 VSPs).
Many other VSPs had weak hits to these three domains
and several additional domains (TIL and Furin-like) that
are also cysteine rich. Whether these similarities are of
biological significance remains to be determined, but it is
interesting to speculate that some of these data imply
diversity of vsp function.
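The three hit categories used in this section can be written as a simple binning rule. This is a sketch under stated assumptions: the text gives the E value ranges but not how the boundary values themselves were treated, so the inclusive and exclusive edges below are assumptions, as are the function name and the "not reported" label.

```python
def classify_hit(e_value):
    """Bin a Pfam E value into the categories used in the text."""
    if e_value < 1e-20:
        return "very good"   # E value < 1e-20
    if e_value < 1e-5:
        return "moderate"    # between 1e-5 and 1e-20
    if e_value <= 1:
        return "weak"        # between 1 and 1e-5 (edges assumed)
    return "not reported"    # assumed label for E values above 1

labels = [classify_hit(e) for e in (1e-25, 1e-10, 0.01, 5.0)]
```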
The VSPs share a high degree of similarity across the approximately 38 amino acids encoded by the 3' terminus of the
genes. Hidden Markov Modeling (HMM) to identify
membrane-spanning regions suggested that the C-terminal CRGKA motif is inside the membrane, while the 23
amino acids upstream and adjacent to the CRGKA span
the membrane, and the remainder of the VSP is extracellular (data not shown). This result is consistent with prior
hydrophilicity analyses [17]. As discussed
above, the CRGKA encoding C-terminal tail was used to
identify all known vsp members within the current WB
genome assembly and is found in all reported vsp genes,
suggesting an important function in VSP biology. Many
VSPs are palmitoylated [43,44], and in vivo, palmitic acid
binds specifically to the cysteine of the CRGKA tail, controlling segregation of the VSPs to the detergent-resistant
domains within the plasma membrane [45]. Arginine
deiminase binds specifically in vivo to the arginine of the
CRGKA and catalyzes its citrullination [46]. These biochemical studies in addition to the highly conserved
nature of the CRGKA motif suggest an important biological role. Thus, it is of interest that eight vsps have carboxyl
termini that differ from the CRGKA at one or two positions: CHKKA, CRGKL, CRSKA, GRRKA (2 examples),
FCGKA, YRGKA, and WRGKA (Additional file 1), and
six of these eight lack either the cysteine or the arginine
or both. Experimental studies will be required to determine whether these alternate C-termini are compatible
with observed or hypothesized VSP function.
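The counts given for the variant tails can be checked directly from the sequences listed above. The sketch below compares each tail with CRGKA position by position; treating "lacks the cysteine or the arginine" as a positional test (Cys at position 1, Arg at position 2) is an assumption, and the helper names are hypothetical.

```python
CANONICAL = "CRGKA"

# Variant tails as listed in the text; GRRKA occurs in two genes.
variants = ["CHKKA", "CRGKL", "CRSKA", "GRRKA", "GRRKA",
            "FCGKA", "YRGKA", "WRGKA"]

def mismatches(tail, ref=CANONICAL):
    """Number of positions at which the tail differs from CRGKA."""
    return sum(a != b for a, b in zip(tail, ref))

def lacks_cys_or_arg(tail):
    """True if Cys (position 1) or Arg (position 2) is replaced."""
    return tail[0] != "C" or tail[1] != "R"

n_mismatch = [mismatches(v) for v in variants]
n_lacking = sum(lacks_cys_or_arg(v) for v in variants)
```

Every variant differs at one or two positions, and six of the eight fail the positional test, in agreement with the text.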
Genomic organization and context of the vsp genes in the
WB genome
The vsp genes were distributed among all five chromosomes (Table 3). Of the 196 vsp genes that could be
assigned to a specific chromosome, the density ranged
from 8 vsp genes per Mb on chromosome 1 to 29 vsp
genes per Mb on chromosome 4 (based on the total size
determined by optical mapping). The distribution of vsp
genes along chromosome 4 is shown in Fig 3. Many of the
vsp genes existed independently on the chromosome with
no other vsp genes in close proximity. However, the
majority of vsp genes were not randomly distributed, but
were clustered into head to head or tail to tail arrangements, or as linear arrays (LAs) of vsp genes (see Fig 4 for
examples).
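The per-chromosome densities follow from dividing the number of assigned vsp genes by the chromosome size. A short worked check using the values in Table 3 (the variable names are illustrative):

```python
# (chromosome, size in Mb from optical mapping, assigned vsp genes),
# values taken from Table 3.
table3 = [(5, 4.43, 65), (4, 2.79, 81), (3, 1.94, 18),
          (2, 1.50, 20), (1, 1.46, 12)]

# vsp genes per Mb, rounded to the nearest whole number.
density = {chrom: round(n / size) for chrom, size, n in table3}

# Total number of vsp genes assigned to a chromosome.
total = sum(n for _, _, n in table3)
```

This reproduces the range of 8 (chromosome 1) to 29 (chromosome 4) vsp genes per Mb and the total of 196 assigned genes.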
Genes present in inverted or tandem pairs (H-H, T-T, H-T)
Vsp gene pairs in Giardia can appear as an inverted pair
of genes with or without identical reading frames (Table
4). Fourteen of the 19 inverted pairs consisted of two
identical ORFs. Conversely, there were no examples of
identical genes that were not in this inverted pair
arrangement. The one example reported in the literature
(vsp1267) consisted of two identical ORFs in tail to tail
orientation, separated by approximately 3 kb [17]. In the
current study, we discovered 13 additional pairs of identical genes present in head to head (4) or tail to tail (9) configurations with 2190 to 3847 bp of intervening DNA
between the paired ORFs (Table 5). Outside the coding
regions of these identical pairs, the 5' and 3' flanking
sequences of the two members of a pair were nearly identical up to a point of abrupt divergence which ranged
from 139 bp to 729 bp away from the coding regions
(Table 5).
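The pair configurations discussed here can be derived from gene coordinates and strand. The sketch below is illustrative only: the Gene record, the coordinate convention (1-based, inclusive, start below end on either strand) and the example coordinates are assumptions, not data from the assembly.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive; start < end on either strand
    end: int
    strand: str  # "+" or "-"

def orientation(a, b):
    """Classify two neighboring genes by the ends that face each other."""
    left, right = sorted((a, b), key=lambda g: g.start)
    if left.strand == right.strand:
        return "head to tail"
    # Left gene on "-" and right gene on "+": the 5' ends face each other.
    return "head to head" if left.strand == "-" else "tail to tail"

def intervening_bp(a, b):
    """Number of bases between the two ORFs."""
    left, right = sorted((a, b), key=lambda g: g.start)
    return right.start - left.end - 1

# Hypothetical pair: convergent genes separated by 3 kb.
g1 = Gene("vspX", 1000, 3000, "+")
g2 = Gene("vspY", 6001, 8001, "-")
pair_type = orientation(g1, g2)
spacer = intervening_bp(g1, g2)
```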
The occurrence of identical pairs of genes was especially problematic during genome assembly in that this
redundancy prematurely terminated the contigs. In fact,
vsp1267 (vsp98.1) was not identified as a pair by the
genome assembly, but was found as a single gene at the
end of a contig. Only two of the 14 pairs with identical
members were greater than 10 kb away from the end of
any given contig. Whether other inverted pairs were
missed by the assembly is not known, but it is quite possible.
An additional five pairs of genes where the two paired
members contained different ORFs were discovered in
head to head (4) or tail to tail (1) orientation and were
separated by approximately 3 kb of intervening DNA in
all cases. For four of these five pairs, the two members of
the pair were from different clades, implying translocation from other, perhaps singular, genomic locations into
this paired arrangement. We believe that the 14 identical pairs have arisen purely by gene duplication and
transposition events. It is possible that these gene duplication events are relatively recent in the evolutionary history of the vsp gene family and that, over time, genetic
diversity through recombination between identical or
distant vsp genes may produce new vsp genes. The paired
identical vsps and nonidentical paired vsps remain an
interesting but cryptic feature of the vsp gene family.
vsp genes present as linear arrays (LA)
Over half the vsp genes found in the genome lie within
the genomic context of vsp linear arrays that consist of
two or more adjacent vsp genes (Table 4). The intervening
regions between the genes in these linear arrays averaged
60 bp (ranging from overlapping coding regions to 241
bp of intergenic sequence). Apart from these linear
arrays, there were only two other head to tail pairs of vsp
genes that were less than 5 kb apart, but these contained
more than 1 kb of intergenic sequence between the
two genes. For these two non-linear array pairs, the two
members of a pair had no significant sequence similarity.
Therefore, this paired association may be random.
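One way to delineate linear arrays from an annotation is to chain adjacent genes on the same strand whose intergenic distance falls below a cutoff. The sketch below assumes that arrays are head to tail runs and uses a hypothetical cutoff of 1000 bp, chosen only because the arrays described here had at most 241 bp between genes while the two non-array pairs had more than 1 kb; the gene tuples are illustrative.

```python
def linear_arrays(genes, max_gap=1000):
    """Group genes into arrays of two or more adjacent same-strand genes.

    genes: list of (name, start, end, strand) on one chromosome or contig,
    with 1-based inclusive coordinates. max_gap is an assumed cutoff.
    """
    arrays, current = [], []
    for gene in sorted(genes, key=lambda g: g[1]):
        if current:
            prev = current[-1]
            gap = gene[1] - prev[2] - 1  # negative when ORFs overlap
            if gene[3] == prev[3] and gap <= max_gap:
                current.append(gene)
                continue
            if len(current) >= 2:
                arrays.append([g[0] for g in current])
        current = [gene]
    if len(current) >= 2:
        arrays.append([g[0] for g in current])
    return arrays

# Hypothetical contig: v1-v3 form an array (v2 and v3 overlap),
# v4 is isolated, and v5 lies on the opposite strand.
toy = [("v1", 100, 2000, "+"), ("v2", 2061, 4000, "+"),
       ("v3", 3990, 6000, "+"), ("v4", 9000, 11000, "+"),
       ("v5", 11100, 13000, "-")]
arrays = linear_arrays(toy)
```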
Most of the linear arrays were in very short contigs,
such that they frequently included only two vsp genes,
and often terminated one or both ends of the contig on
which they were located. In fact, 97 of the 102 vsp genes
in contigs <10 kb in length were in linear arrays. In terms
of the contigs themselves, vsp linear arrays accounted for
the majority of coding capacity of 56 of the contigs < 10
kb in length, even after some were joined by directed,
manual sequencing efforts. In the larger contigs, there are
several examples of linear arrays at the ends of adjacent
contigs (see Fig 3), suggesting that these adjacent linear
arrays may indeed represent even longer arrays than those
observed here. The possibility of longer arrays is also
supported by the observation that the longest array discovered to date (10 vsp genes) is on the end of contig 12
that is adjacent to the end of contig 32 (optical mapping
data not shown), which has two vsp genes in a linear array
(Fig 3). Therefore, while the Giardia linear arrays
described here are impressive in their quantity, length
and gene numbers, much longer arrays may exist in the
genome.
Table 3: Vsp chromosomal distribution

Chromosome (size, Mb)    vsps    #/Mb
5 (4.43)                 65      15
4 (2.79)                 81      29
3 (1.94)                 18      9
2 (1.50)                 20      13
1 (1.46)                 12      8

As noted above, the first vsp of a linear array often
showed the typical features of vsp genes that have been
described in the literature, including an extended noncoding region upstream and a putative initiator element
(Inr). Fig 3 shows the spacing within the linear array and
the variable placing of the first ATG after the nearest
upstream stop codon. In fact, vsp14 provides an example
of a sequence that contains a typically conserved 3'
region, but which lacks an initiation codon. Vsp13 is the
first member of a linear array on chromosome 4. The
coding region for Vsp13 is followed by a stop codon, 38