<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Genomic lineages of Rhizobium etli revealed by the extent of nucleotide polymorphisms and low recombination</title>
<meta name="Subject" content="BMC Evolutionary Biology 2011, 11:305. doi:10.1186/1471-2148-11-305"/>
<meta name="Keywords" content=" "/>
<meta name="Author" content="José L Acosta"/>
<meta name="Creator" content="Arbortext Advanced Print Publisher 10.0.1082/W Unicode"/>
<meta name="Producer" content="Acrobat Distiller 9.4.2 (Windows)"/>
<meta name="CreationDate" content=""/>
</head>
<body>
<pre>
Acosta et al. BMC Evolutionary Biology 2011, 11:305
http://www.biomedcentral.com/1471-2148/11/305
RESEARCH ARTICLE
Open Access
Genomic lineages of Rhizobium etli revealed by
the extent of nucleotide polymorphisms and low
recombination
José L Acosta1*, Luis E Eguiarte2, Rosa I Santamaría1, Patricia Bustos1, Pablo Vinuesa1, Esperanza Martínez-Romero1,
Guillermo Dávila1 and Víctor González1
Abstract
Background: Most of the DNA variations found in bacterial species are in the form of single nucleotide
polymorphisms (SNPs), but there is some debate regarding how much of this variation comes from mutation
versus recombination. The nitrogen-fixing symbiotic bacterium Rhizobium etli is highly variable in both genomic
structure and gene content. However, no previous report has provided a detailed genomic analysis of this variation
at nucleotide level or the role of recombination in generating diversity in this bacterium. Here, we compared draft
genomic sequences versus complete genomic sequences to obtain reliable measures of genetic diversity and then
estimated the role of recombination in the generation of genomic diversity among Rhizobium etli.
Results: We identified high levels of DNA polymorphism in R. etli, and found that there was an average divergence
of 4% to 6% among the tested strain pairs. DNA recombination events were estimated to affect 3% to 10% of the
genomic sample analyzed. In most instances, the nucleotide diversity (π) was greater in DNA segments with
recombinant events than in non-recombinant segments. However, this degree of recombination was not
sufficiently large to disrupt the congruence of the phylogenetic trees, and further evaluation of recombination in
strain quartets indicated that the recombination levels in this species are proportionally low.
Conclusion: Our data suggest that R. etli is a species composed of separated lineages with low homologous
recombination among the strains. Horizontal gene transfer, particularly via the symbiotic plasmid characteristic of
this species, seems to play an important role in diversity but the lineages maintain their evolutionary cohesiveness.
* Correspondence: jlacosta@ccg.unam.mx
1 Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México,
Av. Universidad N/C Col. Chamilpa, Apdo. Postal 565-A, Cuernavaca 62210,
México
Full list of author information is available at the end of the article
Background
Bacterial species typically contain large amounts of
genetic variation in the form of single nucleotide polymorphisms (SNPs), which originate by mutation and
have dynamics that depend on the balance between natural selection and genetic drift [1,2]. There is some
debate on whether or not most of these polymorphisms
are selectively neutral at the molecular level [3]. Species
have been genetically defined through the analysis of
DNA variation using comparative techniques such as
hybridization, the sequencing of gene markers, and
(more recently) complete genome sequences [4,5]. It has
been proposed that similarity values greater than 70%
obtained in DNA-DNA hybridization experiments are
sufficient to define a coherent group of organisms as
belonging to the same species [6]. These estimates are
very rough, subject to experimental variation, and they
only indirectly measure similarity (i.e. via hybridization
efficiency) [7]. A comparative analysis of complete genomes minimizes most of these limitations. Several measures of genomic relatedness, such as the Average
Nucleotide Identity (ANI) and the Maximal Unique
Matches (MUM) have been proposed for such analyses
[8,9]. Both ANI and MUM are based on pairwise
nucleotide comparisons of complete genomes, and several reports have shown good correlations between the
results from these analyses and other measures of
genetic relatedness, such as those based on Multilocus
Sequence Typing (MLST), 16S sequencing, and gene
content [10]. However, these comparative methods rely
on the availability of complete genome sequences and
are affected by the quality of the DNA sequencing data,
which in the case of draft genomes might not be optimal [10]. The latter issue has not been thoroughly
addressed in past studies. One exception is the comparison made by Richter and Roselló-Mora [10], who
suggested that low genome sequence coverage can be
sufficient for inferring DNA similarity values comparable
to ANI obtained with complete genomes.
© 2011 Acosta et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
Bacterial species have mechanisms for gene exchange
(transformation, conjugation and transduction), and
genetic recombination is believed to play a prominent
role in diversifying species by distributing variation and
generating new allele combinations [11]. Horizontal
gene transfer is an important source of genomic variation within and between species [12-16], and homologous recombination frequently results in the exchange
of small genomic regions between members of the same
or closely related species [17]. The estimated rates of
homologous recombination vary widely among bacteria;
in some instances, recombination seems to have contributed to species diversification to a greater extent than
even point mutations, whereas in other species homologous recombination appears to be rare [18].
Recombination has typically been assessed by molecular techniques such as Multilocus Enzyme Electrophoresis (MLEE), Amplified Fragment Length Polymorphism
(AFLP), or Multilocus Sequence Typing (MLST)
[19-21]. These methods primarily measure linkage disequilibrium (LD), and are based on the degree of allele
association at different housekeeping loci. For example,
E. coli strains show strong LD, reflecting infrequent
genetic mixing within local populations [22]. More
recently, the availability of complete genomic sequences
has allowed recombination to be assessed more accurately [23]. Interestingly, genomic sequencing combined
with analyses of population genetics have shown that
the recombination rates within E. coli are higher than
the mutation rates, but not to the extent that the phylogenetic signal is distorted [24]. Despite frequent recombination between strains, therefore, the genes seem to
coexist in an organized genome, resulting in a chromosomal plasticity that accelerates the adaptation of E. coli
to various environments.
In this work, we studied the intraspecific variability
and recombination in Rhizobium etli, a soil bacterium
that associates with bean roots to fix nitrogen. Previous
studies have noted that this species has a variable gene
content and high genomic divergence [24], as well as a
low rate of recombination (in housekeeping genes)
among isolates from the same geographical site
[22,25,26]. However, in isolates (from the same geographical site) of Sinorhizobium medicae, the frequency of
recombination was found to be higher in plasmids and
megaplasmids than in the chromosome [27].
The first purpose of this work was to perform a detailed
genomic analysis of the nucleotide variation in this species. Accordingly, we used stringent methods to identify
SNPs from a set of complete and draft genomes of R.
etli, assessed the value of draft genomes and low coverage data when seeking to obtain global measures of
genetic relatedness, and then examined the nucleotide
differences among various strains of R. etli. The second
purpose was to assess the role of recombination in generating genomic diversity in R. etli. Our results confirm
and extend the previous estimations on the genomic
diversity of R. etli, and indicate that recombination
might play only a minor role in generating such diversity. Therefore, we conclude that the species R. etli is
composed of separate genomic lineages that share a low
rate of recombination but have a common symbiotic
phenotype.
Results
Nucleotide variation assessment in complete and draft
genomes
Since accurate SNP identification relies largely on the
quality of the sequence data, the use of draft genome
sequences could potentially introduce errors into the
variation estimates. Therefore, stringent parameters (see
Methods) were used to identify high-quality SNPs in a
set of two complete R. etli genomes, CFN42 and
CIAT652, isolated from México and Costa Rica, respectively, and six draft genome sequences from strains isolated in different parts of the world: BRASIL5 (Brazil),
CIAT894 (Colombia), IE4771 (México), KIM5
(USA), and 8C-3 and GR56 (Spain) [24]. All the Sanger
reads collected from the draft genomes (about
13,000 reads of 1000 nucleotides in length per genome
on average) were aligned against the predicted ORFs of
the CFN42 or CIAT652 genomes, and the alignments
were evaluated using Polybayes (additional file 1 Figure
S1), which determined the probability that a nucleotide
site was polymorphic, based on the Phred quality of the
read. A Phred value of Q20 and a probability greater
than 0.90 are generally considered acceptable for the
detection of SNPs [28]. Most of the SNPs in our data
set had probability scores > 0.975, indicating that more
than 100,000 SNPs per genome had Phred qualities over
Q45 (additional file 1 Figure S1). To avoid the possible
inclusion of false positives (on average, 27,000 SNPs per
strain), we used only SNPs with a minimum Phred
score of Q45 and the highest Bayesian probabilities (>
0.99) throughout this work [29].
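The filtering rule described above can be illustrated with a short Python sketch. It is not the Polybayes implementation: the CandidateSnp record, its field names, and the example values are hypothetical, and only the two thresholds (Phred Q45 and probability > 0.99) are taken from the text. The helper phred_to_error uses the standard Phred definition, Q = -10 log10(P_error).

```python
from dataclasses import dataclass

MIN_PHRED = 45          # minimum base quality used in this work
MIN_PROBABILITY = 0.99  # minimum Bayesian SNP probability used in this work


@dataclass(frozen=True)
class CandidateSnp:
    """Hypothetical record for one candidate polymorphic site."""
    position: int        # position in the reference ORF
    phred: int           # Phred quality of the read base at that site
    probability: float   # probability that the site is polymorphic


def phred_to_error(q: float) -> float:
    """Standard Phred definition: Q = -10 * log10(P_error)."""
    return 10 ** (-q / 10)


def is_high_quality(snp: CandidateSnp) -> bool:
    """Keep a site only if it passes both thresholds."""
    return snp.phred >= MIN_PHRED and snp.probability > MIN_PROBABILITY


def filter_snps(candidates):
    return [snp for snp in candidates if is_high_quality(snp)]


# Hypothetical candidates
candidates = [
    CandidateSnp(position=120, phred=50, probability=0.995),
    CandidateSnp(position=340, phred=20, probability=0.93),  # passes the general Q20/0.90 rule only
    CandidateSnp(position=910, phred=47, probability=0.98),  # quality passes, probability fails
]
kept = filter_snps(candidates)
print([snp.position for snp in kept])
print(phred_to_error(20))
```

Only the first candidate survives both thresholds. A Q20 base corresponds to an expected error rate of about 1 in 100, whereas Q45 corresponds to roughly 3 in 100,000, which is why the stricter cut-off removes most potential false positives.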
Additional errors in SNP determination might arise
from poorly aligned regions. Since R. etli genomes have
a high proportion of paralogous sequences [24,30], a
stringent identification of orthologous segments of genes
was performed. We aligned the contigs of each draft
genome sequence against the ORFs from the complete
genomes of either CFN42 or CIAT652, using both
ungapped and gapped alignments, along with the reciprocal best hit criterion. We considered DNA gene segments as being orthologous to the reference sequence if
they had nucleotide identities higher than 85% and coverage higher than 60% of the reference gene. Various
numbers of orthologous segments were identified from
the draft genomes, covering about 40% of the total gene
content of the reference strains. The total amount of
data collected by this procedure is about 2 to 2.5 Mb
per draft genome (additional file 1 Table S1).
To determine the robustness of the above-described
procedure, we simulated a draft assembly by using Sanger read samples of the complete genomes of different
E. coli strains at low coverage (1×) (see Methods). The
contigs of the simulated assembly were aligned with the
genome of E. coli K12, and SNPs were detected as
described above. On average, the obtained nucleotide
variation ranged from about 1% to 2% (SNPs/alignment
length) (Figure 1). There was no significant difference
(at a significance level of 0.05, according to Mann-Whitney and
Kolmogorov-Smirnov tests performed with Predictive
Analytics Software PASW Statistics 18 (SPSS Inc.,
Chicago, IL)) when we compared the results obtained at
1× coverage versus those obtained with the complete
genome assembled at about 10× coverage, indicating
that 1× coverage of the genome sequence could be considered a robust proxy of full variation at the genomic
level in this species.
Figure 1 SNP assessment on lower coverage. Paired comparisons
between E. coli strains and the K12 strain as reference genome. We used
two sets of E. coli strains: complete genomes obtained from Sanger
reads (green box) and Sanger reads simulated from complete
genomes with an approximate coverage of 1× (yellow box). For each
alignment, we determined the percentage of SNPs by gene
fragment (SNP number/length of contig; Y axis) by our
methodology. Boxes inside the graphic include the median values
(middle line) and the first and third quartiles (lower and upper lines)
of the distribution. Abscissa: E. coli strains (101-1, 53638, B171, E1100019, F11, HS and O157H7).
Figure 2 Paired comparisons between R. etli strains. We
performed four paired comparisons using our methodology: draft
genomes of R. etli against CFN42 (gray boxes), draft genomes of R.
etli against CIAT652 (blue boxes), CIAT652 against CFN42 (red boxes)
and finally R. leguminosarum bv viciae 3841 against CFN42 (green
boxes). For all comparisons the Y axis is the percentage of SNPs by
gene fragment (SNP number/length of contig). Boxes inside the
graphic include the median values (middle line) and the first and
third quartiles (lower and upper lines) of the distribution. Abscissa:
R. etli and R. leguminosarum bv viciae 3841 strains.
SNP frequencies among the R. etli strains
We quantified the SNPs in R. etli by computing the
pairwise nucleotide differences between individual draft
genomes and the complete genomes of strains CFN42
or CIAT652. More SNPs were found in comparisons
made versus the CFN42 genome (Figure 2, gray boxes)
than the CIAT652 genome (Figure 2, blue boxes). For
example, the BRASIL5 strain had a median of 5% SNPs
per aligned fragment when compared with CFN42 but
only 2% compared to CIAT652, indicating that BRASIL5
is more closely related to CIAT652 than to CFN42. Similarly, variance was higher when BRASIL5 was compared
with CFN42 rather than CIAT652 (Figure 2). A very
similar pattern was found for strain 8C-3. The other
strains showed similar levels of variation, on the order
of 6% (CFN42) and 4% (CIAT652), with the latter comparison always showing a lower variance. Comparison
between the complete genomes of CFN42 and CIAT652
(Figure 2, red box) resulted in a median variation of 9%,
which is high but still lower than that of the comparison
between CFN42 and R. leguminosarum bv viciae 3841
(Figure 2 green box). Moreover, when we compared R.
leguminosarum bv. viciae 3841 with all of the R. etli
strains (complete and draft genomes) (additional file 1
Figure S2), the greatest difference in SNP percentage
(median 11%) was seen in the comparison with strain
CFN42 (Figure 2 green boxes, and discussion section).
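The quantity plotted in Figure 2, the percentage of SNPs per aligned fragment and its median for each comparison, can be sketched as follows. The fragment counts below are hypothetical and were chosen only so that the medians match the 5% and 2% reported for BRASIL5. The function names are illustrative and do not correspond to the pipeline used in this work.

```python
from statistics import median


def snp_percentage(snp_count: int, aligned_length: int) -> float:
    """Percentage of SNPs in one aligned fragment (SNP number / alignment length)."""
    if aligned_length == 0:
        raise ValueError("aligned_length must be positive")
    return 100.0 * snp_count / aligned_length


def median_snp_percentage(fragments) -> float:
    """Median of the per-fragment SNP percentages for one strain-versus-reference comparison."""
    return median(snp_percentage(count, length) for count, length in fragments)


# Hypothetical (snp_count, aligned_length) pairs for one draft genome
vs_cfn42 = [(10, 200), (30, 500), (12, 300)]
vs_ciat652 = [(4, 200), (10, 500), (9, 300)]

print(median_snp_percentage(vs_cfn42))    # 5.0
print(median_snp_percentage(vs_ciat652))  # 2.0
```

Using the median rather than the mean keeps a few highly divergent fragments from dominating the summary, which is consistent with the box-plot presentation of the data.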
We sought to obtain a single measure of the nucleotide
variation across the whole set of genomes. To this end,
we averaged the medians of the SNP distributions for
each alignment (i.e., the number of SNPs/alignment
length of each draft genome with respect to CFN42 or
CIAT652) and generated an average confidence interval
(obtained and adjusted by the distribution of gene-size
medians) using Predictive Analytics Software PASW Statistics 18 (SPSS Inc., Chicago, IL). This statistical test of
proportions compares the observed proportions of an
event (here, SNPs) in k samples (here, strains), uses a
chi-squared test to seek significant differences among
the proportions, and subsequently adjusts the confidence intervals for each sample. The generated measure,
herein called the average nucleotide variation (ANV),
might represent the species-level variation. We obtained
ANV values of 4% and 6% when we compared all the
analyzed strains against CIAT652 and CFN42, respectively (Figure 3). Although the largest numbers of SNPs
were found in comparisons with the CFN42 genome, all
strains were similarly divergent according to the 95%
confidence intervals with respect to the median (blue
lines in Figure 3). This observation indicates that
CFN42 is almost equally divergent with respect to all
other strains. Comparisons with the CIAT652 genome
showed that strains BRASIL5 and 8C-3 were closer to
this strain than to CFN42. Moreover, the CIAT894
strain yielded the highest number of SNPs, causing its
average SNP proportion to fall outside the average confidence interval (red lines in Figure 3). Strains CIAT894
and IE4771 showed greater divergences than the rest of
the strains, regardless of the reference strain (CFN42 or
CIAT652) used in the comparison.
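As a rough illustration of the two ingredients named above, the sketch below averages per-strain medians into an ANV value and computes a Pearson chi-squared statistic for the equality of k proportions. It is a generic textbook formulation, not the PASW procedure, and it does not reproduce the confidence-interval adjustment. The strain labels and all numbers are hypothetical.

```python
from statistics import mean


def average_nucleotide_variation(median_by_strain: dict) -> float:
    """ANV sketch: mean of the per-strain median SNP proportions."""
    return mean(median_by_strain.values())


def chi_squared_proportions(counts):
    """Pearson chi-squared statistic for equality of k proportions.

    counts is a list of (snps, sites) pairs, one per strain.
    Returns (statistic, degrees_of_freedom).
    """
    total_snps = sum(x for x, _ in counts)
    total_sites = sum(n for _, n in counts)
    pooled = total_snps / total_sites
    statistic = sum(
        (x - n * pooled) ** 2 / (n * pooled * (1 - pooled)) for x, n in counts
    )
    return statistic, len(counts) - 1


# Hypothetical per-strain medians (proportions) against one reference genome
medians = {"strainA": 0.04, "strainB": 0.06, "strainC": 0.05}
anv = average_nucleotide_variation(medians)
print(round(anv, 4))

# Hypothetical (snps, sites) counts with identical proportions:
# the statistic is then close to zero, with k - 1 = 2 degrees of freedom
statistic, dof = chi_squared_proportions([(50, 1000), (100, 2000), (25, 500)])
print(round(statistic, 6), dof)
```

A large statistic relative to the chi-squared distribution with k - 1 degrees of freedom would indicate that at least one strain differs in its SNP proportion, which is the situation described for CIAT894 in the comparison against CIAT652.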
Figure 3 Average nucleotide variation. We calculated the
average nucleotide variation (middle line of each graphic) from
the median SNP percentages (dots indicated by dashed lines) for
each aligned comparison (Y axis) of test strain versus the reference
strains, CFN42 (blue) or CIAT652 (red). The average confidence interval
was adjusted (arrows with dashed lines) to the medians of the
length distributions of the aligned fragments (genes). Median
SNP values that exceeded the average confidence interval were considered outliers.
Abscissa: BRASIL5, CIAT894, GR56, IE4771, KIM5 and 8C-3.
Plot labels: CFN42 reference, ANV 0.0612 (range 0.0379 to 0.0846); CIAT652 reference, ANV 0.0455 (range 0.0251 to 0.0659).
Nucleotide variation profiles in homologous genomic
segments from different R. etli strains
To explore how SNPs are distributed in the R. etli genomes, we first identified orthologous segments for which
we had sequence information in all eight studied strains
(Figure 4). A total of 240 segments with a median size
of 275 bp were common to all strains, and spanned a
total of about 71,630 bp that represent about 1% of the
genome length. These sequences mapped mainly to the
chromosomes of CFN42 and CIAT652 (92%), with a
lower proportion (8%) distributing to plasmids. We generated a concatenated alignment of these shared segments according to the gene order found in the CFN42
genome, and then inferred a consensus sequence and
computed the number of nucleotide differences across
windows of 250 bp. Using this procedure, we detected
the patterns of shared and unique (singleton) SNPs particular to each strain. As shown in Figure 4, we were
able to distinguish two classes of shared SNPs: biallelic
SNPs (Figure 4 gray smoothed areas), which showed
only one nucleotide difference with respect to the consensus; and polyallelic (Figure 4, white bars), which
showed multiple differences at the same nucleotide site
with respect to the consensus. Some of these SNP patterns were shared in some strains but not others. For
example, as shown in Figure 4, pattern A was shared by
strains CIAT652, CIAT894 and 8C-3, whereas pattern B
was found in strains GR56, IE4771 and KIM5. Further
shared patterns were identified through a careful inspection of the plot. In addition, a large number of polymorphisms were not shared, but instead appeared to be
strain-specific variants. Interestingly, strain CFN42 was
found to have the greatest number of differences with
respect to the consensus (Figure 4, black bars). Even
though this approach is limited by the number of common segments among the eight strains, we were able to
cover 3.7% (223) of the total gene content (5,963) of the
CFN42 reference strain, including the main COG
categories and subcategories (see Methods): for
instance, metabolism (transport and metabolism of
sugar, amino acids, and carbohydrates); cellular processes and signaling (envelope biogenesis, signal transduction); information storage and processing
(transcription, replication, and recombination); and
poorly characterized proteins (function unknown). A
detailed annotation of the gene segments can be seen in
additional file 2 Table S1.
Figure 4 SNP distribution profiles. Panels: CFN42, CIAT652, BRASIL5, CIAT894, GR56, IE4771, KIM5 and 8C-3; Y axis: SNP number by shared region.
Alignments were performed on a total of 240 sequence segments available for all tested strains of
Rhizobium etli. Each nucleotide position in the alignment is represented by a consensus. In instances where half of the strains had the same
nucleotide and the other half a different nucleotide, the consensus was defined as the nucleotide present in R. etli CIAT652. Common segments
were concatenated according to the gene order found in the CFN42 genome (chromosome first, then plasmids), yielding 71,630 aligned base
pairs. The numbers of nucleotides differing from the consensus are plotted as bars, across independent windows of 250 nucleotides. The black
bars (running downwards) show SNPs present in a single strain; the gray areas indicate when the same SNP pattern was present in at least two
strains at the same position within the alignment (patterns A and B); and the white bars indicate polymorphic sites where at least three alleles
were present in at least two strains, again within the alignment. Segments showing significant recombination events are indicated by bars at the
bottom of the plot, together with bars indicating the genomic location of segments with respect to CFN42 (chromosome, white; plasmids, black).
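The consensus rule and the window counts used for the SNP distribution profiles can be sketched as follows. This is a minimal illustration under stated assumptions: the toy sequences are hypothetical, the window is 4 nucleotides instead of 250 for brevity, and the function names are not taken from the original analysis. Ties in a column are resolved with the CIAT652 nucleotide, as stated in the Figure 4 legend.

```python
from collections import Counter


def consensus(alignment: dict, tie_breaker: str) -> str:
    """Majority nucleotide per column; ties go to the tie_breaker strain."""
    length = len(next(iter(alignment.values())))
    columns = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment.values())
        best = max(counts.values())
        winners = [base for base, n in counts.items() if n == best]
        if len(winners) == 1:
            columns.append(winners[0])
        else:
            columns.append(alignment[tie_breaker][i])
    return "".join(columns)


def differences_per_window(seq: str, cons: str, window: int = 250) -> list:
    """Number of sites differing from the consensus in consecutive windows."""
    return [
        sum(a != b for a, b in zip(seq[start:start + window], cons[start:start + window]))
        for start in range(0, len(cons), window)
    ]


def singleton_sites(alignment: dict, cons: str) -> dict:
    """Sites where exactly one strain differs from the consensus, per strain."""
    result = {name: [] for name in alignment}
    for i, base in enumerate(cons):
        differing = [name for name, seq in alignment.items() if seq[i] != base]
        if len(differing) == 1:
            result[differing[0]].append(i)
    return result


# Toy alignment (hypothetical sequences)
toy = {
    "CFN42":   "ACGTACGA",
    "CIAT652": "ACGTACGT",
    "KIM5":    "ACGAACGT",
    "8C-3":    "ACGAACGT",
}
cons = consensus(toy, tie_breaker="CIAT652")
print(cons)                                              # ACGTACGT
print(differences_per_window(toy["CFN42"], cons, window=4))  # [0, 1]
print(singleton_sites(toy, cons)["CFN42"])               # [7]
```

In the toy data, column 3 is a two-against-two tie resolved in favor of CIAT652, so the difference shared by KIM5 and 8C-3 counts as a shared pattern, whereas the last column is a singleton specific to CFN42.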
Phylogenetic congruence
Since recombination can distort phylogenetic trees in such
a way that no two individual trees are topologically
equivalent, we decided to perform phylogenetic reconstructions using a) a neighbor-joining network [31]; and
b) a comparison of a consensus tree with individual
trees constructed using the 187 segments common to
the eight studied R. etli genomes and R. leguminosarum
bv viciae 3841 (RLEG). The consensus trees obtained
from the concatenated alignments had identical topologies when constructed by maximum likelihood, Bayesian, and neighbor joining network methods (see
Methods). Only the tree based on neighbor joining network is shown in Figure 5. This tree was found to contain six internal branches (denoted by split numbers).
There are two main clusters in the tree, separated by
branches 2 and 3 that group the most closely related
strains: one containing KIM5, IE4771, and GR56
(branch 2) and another grouping BRASIL5, 8C-3, and
CIAT652 (branch 3). These branches are internal in
relation to branch 5, which separates CFN42, CIAT894,
and RLEG, the strains with the longest branches
(greatest number of nucleotide substitutions per site). A
few inconsistencies were found among the topologies
recovered from reconstructions based on individual
gene segments (187), as compared to the topology of
the consensus tree (not shown). These alternative topologies are mainly due to the position of CIAT894 and
RLEG, whereas splits 2, 3, and 5 were consistently
recovered. Thirty out of 187 trees supported the placement of RLEG as the most distant strain, 39 trees supported placement of CIAT894 as the external strain,
whereas the most frequent topology shows that these
strains are equally distant from the rest of the strains (Figure
5). These alternative topologies could be the result of
shared ancestral polymorphisms, as suggested by the
long branches coupled with low frequency of recombination. Altogether, the phylogenetic reconstructions suggested that the levels of recombination were insufficient
to erase the phylogenetic signal, thus allowing for the
identification of the most probable strain tree. Consistent with this conclusion, only nine (3.75%) of the 223
gene segments common among the eight R. etli strains
(Figure 4) showed at least one recombination event.
Figure 5 Genetic relatedness. Neighbor-joining network phylogeny
inferred from 186 concatenated regions shared among strains of
Rhizobium etli and Rhizobium leguminosarum bv viciae 3841 (see
Methods). The tree is unrooted and has six internal branches,
indicated by split numbers on each internal branch. The scale bar (0.01)
denotes the expected number of nucleotide substitutions per site.
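One simple way to express the comparison between individual gene trees and the consensus tree is to count how often each split of the consensus is recovered. The sketch below assumes that every unrooted tree has already been reduced to its list of non-trivial splits (bipartitions of the taxon set). The reference splits follow the description of branches 2 and 3 in the text; the three gene trees are hypothetical, and the functions are illustrative rather than the method used in this work.

```python
def normalize_split(side, taxa, anchor):
    """Represent a bipartition by the side that does not contain the anchor taxon."""
    side = frozenset(side)
    return frozenset(taxa) - side if anchor in side else side


def split_support(reference_splits, gene_trees, taxa, anchor):
    """Fraction of gene trees that recover each split of the reference tree."""
    support = {}
    for split in reference_splits:
        target = normalize_split(split, taxa, anchor)
        hits = sum(
            target in {normalize_split(s, taxa, anchor) for s in tree}
            for tree in gene_trees
        )
        support[tuple(sorted(target))] = hits / len(gene_trees)
    return support


taxa = ["CFN42", "CIAT652", "BRASIL5", "8C-3", "KIM5", "IE4771", "GR56", "CIAT894", "RLEG"]
# Splits 2 and 3 as described in the text
reference = [{"KIM5", "IE4771", "GR56"}, {"BRASIL5", "8C-3", "CIAT652"}]
# Three hypothetical gene trees, each given as a list of its non-trivial splits
gene_trees = [
    [{"KIM5", "IE4771", "GR56"}, {"BRASIL5", "8C-3", "CIAT652"}],
    [{"KIM5", "IE4771", "GR56"}, {"BRASIL5", "8C-3"}],
    [{"CFN42", "CIAT894", "RLEG", "BRASIL5", "8C-3", "CIAT652"}, {"BRASIL5", "8C-3", "CIAT652"}],
]
support = split_support(reference, gene_trees, taxa, anchor="RLEG")
for split, fraction in support.items():
    print(split, round(fraction, 2))
```

Writing each split as the side that excludes a fixed anchor taxon makes the two equivalent descriptions of one bipartition compare equal, as happens for the first split of the third hypothetical tree.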
Extent of recombination
To evaluate the extent of the probable recombination
events among strains of R. etli, we performed a recombination analysis in orthologous quartets (see Methods).
We aligned the shared gene segments from each draft
genome with the corresponding segments of the ORFs
from CFN42, CIAT652, and the R. leguminosarum bv
viciae 3841 complete genomes, yielding six different
groups of quartets (one group for each incomplete genome; Figure 6). The proportion of aligned segments varied across the six groups of quartets, from ~2,781
segments in the group containing BRASIL5, to ~3,672
in the group containing CIAT894. The segments ranged
from 200 to 4651 bp in length and covered approximately 50% of the genome (additional file 1 Table S2).
For each group of quartets, we performed four different
recombination tests (see Methods), and determined the
number of recombination events (only those that were
detected by at least two methods) for each quartet
(described above) (Figure 6). The lowest proportions of
recombination events were detected for the quartets
containing strains BRASIL5 and 8C-3, which showed
4.42% (123 out of 2781) and 3.57% (102 out of 2854) recombination events, respectively. The other groups showed
approximately twice as many recombination events, with
frequencies ranging from 8.67% (KIM5 quartets) to
10.86% (GR56). In addition, for each group of recombinant quartets, we determined the number of events of
recombination between pairs of strains (Figure 6). In
7%
25%
CIAT652
5%
25%
2900 quartets in total
(20%),(CIAT652,RLEG)
BRASIL5
(18%),(CFN42,BRASIL5)
315 recombinant quartets (10.86%)
CFN42
9%
24%
CIAT652
RLEG
6%
20%
(13%),(CIAT652,RLEG)
3134 quartets in total
RLEG
GR56
CFN42
11%
22%
CIAT652
(17%),(CIAT652,RLEG)
RLEG
6%
21%
KIM5
(23%),(CFN42,KIM5)
CFN42
10%
11%
CIAT652
RLEG
9%
23%
CIAT894
(34%),(CFN42,CIAT894)
264 recombinant quartets (8.94%)
CFN42
8%
24%
CIAT652
RLEG
8%
25%
(9%),(CIAT652,RLEG)
(28%),(CFN42,GR56)
272 recombinant quartets (8.67%)
380 recombinant quartets (10.34%)
(13%),(CIAT652,RLEG)
2953 quartets in total
CFN42
3672 quartets in total
123 recombinant quartets (4.42%)
Page 7 of 13
2854 quartets in total
2781 quartets in total
Acosta et al. BMC Evolutionary Biology 2011, 11:305
http://www.biomedcentral.com/1471-2148/11/305
IE4771
(26%),(CFN42,IE4771)
102 recombinant quartets (3.57%)
CFN42
9%
28%
CIAT652
(18%),(CIAT652,RLEG)
RLEG
5%
29%
8C-3
(11%),(CFN42,8C-3)
Figure 6 Detection of recombination in quartets. Six groups of quartets of orthologous segments were built with the Mauve program; they
included shared sequences from CFN42, CIAT652, Rhizobium leguminosarum bv viciae 3841, and each one of the six test strains (incomplete
genomes). Tests of recombination were performed for all the quartets as described in the Methods section. The number (as well as the percentage)
of recombinant segments predicted for each group of quartets is indicated above the quartet diagram. The total number of quartets analyzed in
each group is indicated at the left side of the diagrams. The percentages of recombinant quartets between pairs of strains are shown inside the
diagram and by dashed and continuous lines defined below the diagram. To facilitate searching, each test strain (incomplete genome) has its
own color.
general, recombination events were more frequently predicted between pairs of R. etli strains than between any
given R. etli strain and R. leguminosarum bv viciae 3841
(Figure 6). For instance, in the group of quartets containing BRASIL5, the percentage of recombinant segments is about 7% in CFN42-RLEG, 5% in BRASIL5-RLEG, and 20% in CIAT652-RLEG pairs, whereas
recombinant segments were detected more frequently
between pairs of R. etli strains: 18% (CFN42-BRASIL5),
25% (CFN42-CIAT652), and 25% (CIAT652-BRASIL5).
The same pattern was seen for the other five groups of
quartets. This is expected because homologous recombination depends on high nucleotide identity, and greater
divergence is associated with less homologous recombination [32]. Therefore, recombination might be more
frequent between strains (populations) that are closely
related. Indeed, we observed the same recombination
events in different groups of quartets (of different
strains), as indicated by a presence/absence matrix. In
general, the number of common recombination events
(small number of events) was related to the phylogenetic
proximity of the strains; for instance, BRASIL5 and 8C-3
share the most recombination events in common (data
not shown).
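The tallying rule described in this paragraph (an event counts only if at least two methods detect it, and each group is summarized as the percentage of recombinant quartets) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names, the quartet identifiers and the method labels are hypothetical, and only the BRASIL5 and 8C-3 counts are taken from the text.

```python
from collections import defaultdict


def recombinant_quartets(detections, min_methods=2):
    """Return the quartet ids flagged by at least `min_methods` distinct methods.

    `detections` is an iterable of (quartet_id, method_name) pairs.
    """
    methods_by_quartet = defaultdict(set)
    for quartet_id, method in detections:
        methods_by_quartet[quartet_id].add(method)
    return {q for q, methods in methods_by_quartet.items()
            if len(methods) >= min_methods}


def percent_recombinant(n_recombinant, n_total):
    """Percentage of quartets in a group that carry a recombination event."""
    return 100.0 * n_recombinant / n_total


# Hypothetical detections: quartet ids and method labels are placeholders.
detections = [
    ("q1", "method_A"), ("q1", "method_B"),   # two methods agree: kept
    ("q2", "method_A"),                       # one method only: dropped
    ("q3", "method_B"), ("q3", "method_C"),   # two methods agree: kept
    ("q3", "method_B"),                       # repeated call, counted once
]
flagged = recombinant_quartets(detections)

# Counts reported in the text for the BRASIL5 and 8C-3 groups.
brasil5_pct = percent_recombinant(123, 2781)
strain_8c3_pct = percent_recombinant(102, 2854)
print(sorted(flagged), round(brasil5_pct, 2), round(strain_8c3_pct, 2))
```

Storing the methods per quartet in a set means that a repeated call by the same method cannot satisfy the two-method threshold on its own.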
To explore whether recombination acts preferentially on
some classes of genes, we assigned the
Genetic diversity
Together, the above-described data suggest that recombination may not be a major driver of genomic diversification in R. etli, but rather might have relatively limited
effects. To directly examine this point, we estimated the
mean nucleotide diversity per nucleotide site (π) for the
recombinant and non-recombinant gene segments of
each strain (Figure 7a). In general, recombinant segments showed higher π values than non-recombinant
segments. These differences were significant only for
strains CIAT894, GR56, IE4771 and KIM5 (Student’s t-test, p < 0.001), but the combined data for the π values
of the 240 recombinant and non-recombinant gene segments common to the eight strains showed the lowest π
values (0.06 on average). Although there was no significant difference between recombinant (red circles) and
non-recombinant segments (blue circles) with regard to
the regions common to all eight strains (Figure 7b),
most of the recombinant segments had higher-than-average π values and generally showed the highest transition/transversion ratios (indicated by the size of the
circles in Figure 7b). Since transitions are more probable than transversions [34], high transition/transversion ratios suggest that these segments were under strong
purifying selection, because transitions at the third ‘wobble’ position are more likely to be synonymous than
transversions [35].
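To make the two quantities used above concrete, the sketch below computes π as the mean number of pairwise differences per site and tallies transitions and transversions for one aligned segment. It is an illustration under stated assumptions rather than the procedure described in Methods: the sequences are toy data, the function names are hypothetical, the alignment is assumed to be gap-free and of equal length, and sites with gaps or ambiguous bases are simply skipped without adjusting the segment length.

```python
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def pairwise_differences(seq_a, seq_b):
    """Count (transitions, transversions) between two aligned sequences."""
    transitions = transversions = 0
    for a, b in zip(seq_a, seq_b):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue  # identical site, gap or ambiguous base: not counted
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    return transitions, transversions


def nucleotide_diversity(seqs):
    """Return (pi, ts_total, tv_total) for a list of aligned sequences.

    pi is the number of differences summed over all sequence pairs,
    divided by (number of pairs * alignment length).
    """
    length = len(seqs[0])
    if len(seqs) < 2 or any(len(s) != length for s in seqs):
        raise ValueError("need at least two aligned sequences of equal length")
    pairs = list(combinations(seqs, 2))
    ts_total = tv_total = 0
    for a, b in pairs:
        ts, tv = pairwise_differences(a, b)
        ts_total += ts
        tv_total += tv
    pi = (ts_total + tv_total) / (len(pairs) * length)
    return pi, ts_total, tv_total


# Toy quartet (hypothetical sequences, 10 aligned sites).
quartet = ["ACGTACGTAC", "ACGTACGTAT", "ACATACGTAC", "ACGTACGGAC"]
pi, ts, tv = nucleotide_diversity(quartet)
print(pi, ts, tv, ts / tv)
```

For this toy quartet the six pairwise comparisons contain six transitions and three transversions over ten sites, so π is 9/60 = 0.15 and the ts/tv ratio is 2.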
Discussion
In the present work, we used a genomic approach to
detect and measure variation in the form of SNPs, and
to analyze the contribution of recombination to the
genomic diversification of R. etli strains. Our results
demonstrated that draft genomic sequence samples
representing ~1× of the genome can be used to measure
[Figure 7 plot labels. A) Phi by gene segments (quartets) for 8C-3, KIM5, IE4771, GR56, BRASIL5 and CIAT894; legend: Recombinant, Non Recombinant. B) Shared regions by all strains; circle size: ratio ts/tv; Y axis: % SNPs by gene segment.]
recombinant segments to COGs (see Methods), as
shown in additional file 1, Figure S3. All the functional
classes annotated in the CFN42 genome are present in
the draft genomes, but they are represented unevenly in
the recombinant segments. For instance, the categories
amino acid transport and metabolism, carbohydrate
transport and metabolism, energy production and conversion, lipid transport and metabolism, general function
prediction only, and function unknown appear overrepresented among the recombinant segments. Conversely, some other categories, such as transcription and signal
transduction mechanisms, occur at lower frequency among
the recombinant segments than in CFN42. Even though
we performed chi-square and Range tests [33] to
assess the significance of these differences, the incomplete nature of the draft genomes does not allow us to conclude whether recombination is biased toward certain
classes of genes.
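As an illustration of the kind of comparison described here, the following sketch computes a chi-square goodness-of-fit statistic for category counts among recombinant segments against the category proportions of a reference genome. It assumes, without confirmation from the text, that the test was set up as a goodness-of-fit comparison against CFN42; the counts and proportions are invented for the example, the function name is hypothetical, and the Range test is not covered.

```python
def chi_square_statistic(observed_counts, expected_proportions):
    """Chi-square goodness-of-fit statistic, sum((O - E)^2 / E).

    observed_counts: category -> count among recombinant segments.
    expected_proportions: category -> proportion in the reference genome
    (must sum to 1 and be strictly positive).
    """
    if abs(sum(expected_proportions.values()) - 1.0) > 1e-9:
        raise ValueError("expected proportions must sum to 1")
    total = sum(observed_counts.values())
    statistic = 0.0
    for category, proportion in expected_proportions.items():
        expected = total * proportion
        observed = observed_counts.get(category, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Invented numbers, for illustration only (not data from the study).
observed = {
    "amino acid transport and metabolism": 30,
    "transcription": 10,
    "other": 60,
}
reference = {
    "amino acid transport and metabolism": 0.2,
    "transcription": 0.2,
    "other": 0.6,
}
stat = chi_square_statistic(observed, reference)
degrees_of_freedom = len(reference) - 1
print(stat, degrees_of_freedom)
```

The statistic would then be compared with a chi-square distribution with the printed degrees of freedom; that step is left out so that the sketch needs no external library.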
[Figure 7B plot labels. Annotation: n=8, mean pi 0.06, mean SNPs (%) by gene segment 12.89. X axis: Nucleotide diversity by gene segment (0 to 0.15). Legend: Shared segment, Shared segment with recombination.]
Figure 7 Genetic diversity in recombinant segments. A) For
each homologous segment (quartet; regardless of evidence of
recombination), we calculated the nucleotide diversity phi (Y axis;
see Methods). The dots indicate the distribution means and the bars
represent the 95% confidence intervals. Blue and red dots indicate
recombinant and non-recombinant segments, respectively.
Moreover, we determined the nucleotide diversity of the sequence
regions shared across all of the tested strains of R. etli (green dot).
Abscissa: BRASIL5, CIAT894, GR56, IE4771, KIM5, 8C-3 and shared
regions. B) Magnification of the results from the 240 common
sequence segments shared by all tested strains. The average
percentage of SNPs was 12.89% per segment. The circle sizes
indicate the ts/tv ratios for both recombinant segments
(red circles) and non-recombinant segments (blue circles). The Y axis
denotes the SNP percentage by gene segment, while the X axis
shows the nucleotide diversity.
variation at the whole-genome level in this species. In R.
etli we found a great amount of variation (more than
161,998 SNPs) when any draft genome was compared to
the complete genomes of CFN42 and CIAT652. To
assess the reliability of this method for identifying SNPs,
we quantified the SNPs in E. coli genomes at 1× and in
complete genomes assembled at about 10× coverage.
We found the same variation level using either draft or
complete E. coli genomes, indicating that draft genomes
produced estimations of DNA variability comparable to
those generated using complete genomes even at only
1× coverage. Richter and Roselló-Mora [10] previously
reported on the use of partial sequences representing
about 20% of the genomes of several bacterial species to
infer reliable values of DNA divergence between strains.
These authors showed that ANI values
obtained with these samples correlated well with the
DDH values, indicating that draft genome sequences are
an acceptable data source. At present, the rapid
improvement of DNA sequencing technology is allowing
researchers to use multiplex sequencing to simultaneously process an increasing number of genomic
sequences. These experiments will produce additional
draft genome sequences of different qualities, and the
approach proposed herein should prove useful for their
early analysis.
We identified a higher proportion of SNPs in R. etli
strains than in E. coli strains, and the differences
between the various R. etli strains and Rhizobium leguminosarum bv viciae 3841 ranged from 7% to 11%
(median; additional file 1 Figure S2), with the latter figure corresponding to the CFN42 comparison. R. etli and
R. leguminosarum are different species according to 16S
comparison; however, they share a common genomic
core and are distinguished by variable accessory components (e.g., plasmids) [24,36,37]. Therefore, an ANV
range of 7-11% might be a good indicator of speciation
within Rhizobium. Despite the variability in ANV
among the tested strains of R. etli (about 4-6%), none
had ANV values comparable to those obtained with
respect to R. leguminosarum. The levels of ANV were
higher for comparisons using CFN42 than those done
with CIAT652. For taxonomic purposes, CFN42 is the
type strain of R. etli [38]. In the present analysis, however, we found that CFN42 was the most differentiated
of the studied samples, had the highest proportion of