<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange DTD v2.3 20070202//EN" "archivearticle.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" article-type="research-article"><?properties open_access?><front><journal-meta><journal-id journal-id-type="nlm-ta">Nucleic Acids Res</journal-id><journal-id journal-id-type="publisher-id">Nucleic Acids Research</journal-id><journal-title>Nucleic Acids Research</journal-title><issn pub-type="ppub">0305-1048</issn><issn pub-type="epub">1362-4962</issn><publisher><publisher-name>Oxford University Press</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="pmid">16835308</article-id><article-id pub-id-type="pmc">PMC1500873</article-id><article-id pub-id-type="doi">10.1093/nar/gkl433</article-id><article-categories><subj-group subj-group-type="heading"><subject>Article</subject></subj-group></article-categories><title-group><article-title>Detecting non-orthology in the COGs database and other approaches grouping orthologs using genome-specific best hits</article-title></title-group><contrib-group><contrib contrib-type="author"><name><surname>Dessimoz</surname><given-names>Christophe</given-names></name><xref ref-type="corresp" rid="cor1">*</xref></contrib><contrib contrib-type="author"><name><surname>Boeckmann</surname><given-names>Brigitte</given-names></name><xref rid="au1" ref-type="aff">1</xref></contrib><contrib contrib-type="author"><name><surname>Roth</surname><given-names>Alexander C. J.</given-names></name></contrib><contrib contrib-type="author"><name><surname>Gonnet</surname><given-names>Gaston H.</given-names></name></contrib><aff><institution>ETH Zurich, Institute of Computational Science</institution><addr-line>CH-8092 Z&#x000fc;rich</addr-line></aff><aff id="au1"><sup>1</sup><institution>Swiss Institute of Bioinformatics, CMU</institution><addr-line>Michel-Servet 1, CH-1211 Gen&#x000e8;ve</addr-line></aff></contrib-group><author-notes><corresp id="cor1"><sup>*</sup>To whom correspondence should be addressed. 
Tel: +41 44 6327472; Fax: +41 44 6321172; Email: <email>cdessimoz@inf.ethz.ch</email></corresp></author-notes><!--For NAR: both ppub and collection dates generated for PMC processing 1/27/05 beck--><pub-date pub-type="collection"><year>2006</year></pub-date><pub-date pub-type="ppub"><year>2006</year></pub-date><pub-date pub-type="epub"><day>11</day><month>7</month><year>2006</year></pub-date><volume>34</volume><issue>11</issue><fpage>3309</fpage><lpage>3316</lpage><history><date date-type="received"><day>14</day><month>3</month><year>2006</year></date><date date-type="rev-recd"><day>23</day><month>5</month><year>2006</year></date><date date-type="accepted"><day>01</day><month>6</month><year>2006</year></date></history><copyright-statement>&#x000a9; 2006 The Author(s)</copyright-statement><copyright-year>2006</copyright-year><license license-type="openaccess"><p>This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/2.0/uk/"/>) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.</p></license><abstract><p>Correct orthology assignment is a critical prerequisite of numerous comparative genomics procedures, such as function prediction, construction of phylogenetic species trees and genome rearrangement analysis. We present an algorithm for the detection of non-orthologs that arise by mistake in current orthology classification methods based on genome-specific best hits, such as the COGs database. The algorithm works with pairwise distance estimates, rather than computationally expensive and error-prone tree-building methods. The accuracy of the algorithm is evaluated through verification of the distribution of predicted cases, case-by-case phylogenetic analysis and comparisons with predictions from other projects using independent methods. 
Our results show that a very significant fraction of the COG groups include non-orthologs: using conservative parameters, the algorithm detects non-orthology in a third of all COG groups. Consequently, sequence analysis sensitive to correct orthology assignments will greatly benefit from these findings.</p></abstract></article-meta></front><body><sec><title>INTRODUCTION</title><p>The identification of orthologous genes is a central problem in bioinformatics. Orthologs are genes that evolve from a common ancestor through speciation events, as opposed to paralogs, that result from gene duplication (<xref ref-type="bibr" rid="b1">1</xref>). Discriminating orthologs from paralogs is an important, but non-trivial task. It is important, because function conservation is considerably higher among orthologs (<xref ref-type="bibr" rid="b2">2</xref>), and also because only orthologs reflect the history of their species (<xref ref-type="bibr" rid="b1">1</xref>), meaning that phylogeny inferences must be based on orthologs. It is non-trivial because this distinction requires precise estimates of evolutionary distances from data that are often noisy. Other complications include gene deletion, variations in evolutionary rates, lateral gene transfer (LGT), or simply the fact that orthology and paralogy are non-transitive relations, meaning that the relation of every pair of genes must be analyzed separately.</p><p>So far, several projects have addressed this problem systematically. Of those, the COGs database (<xref ref-type="bibr" rid="b3">3</xref>,<xref ref-type="bibr" rid="b4">4</xref>) is by far the best established, probably due to its early inception, its wide scope, its reasonable performance and its presence on the NCBI website. The significance of COG in the community is reflected by hundreds of references in scientific articles. 
Even more importantly, most current initiatives for the identification of orthologs use ideas derived from the methodology of COG, in particular the idea of the genome-specific best hit (<xref ref-type="bibr" rid="b5">5</xref>&#x02013;<xref ref-type="bibr" rid="b7">7</xref>). Of all those projects depending either on the methods or results from COG, few question their accuracy.</p><p>In its last accessible release (2003), the COGs database groups 138&#x02009;458 proteins from 66 prokaryotes into 4873 groups that consist of orthologs and in-paralogs. The term in-paralog was coined by Remm and coworkers (<xref ref-type="bibr" rid="b6">6</xref>) and describes in this context paralogs inside the same species (&#x02018;trivial paralogs&#x02019;), as opposed to out-paralogs that result from a duplication event prior to the last speciation event. [Strictly speaking, in/out-paralogy is a relation defined over two sequences and a speciation event of reference. When that event is omitted, it is here the last speciation event that is implied.] The inclusion of in-paralogs is usually justified by the fact that such sequences are orthologous to every other sequence within their group. Consequently, the relation of every pair of sequences inside the same COG is unambiguous: pairs of sequences from the same species are paralogs; otherwise, they are expected to be orthologous. The construction of COG groups is based on the fact that orthologous genes almost always have a higher level of sequence conservation than paralogs. Hence, genome-specific best hits (&#x02018;BeTs&#x02019;) are likely to be formed between orthologs. Yet, if the corresponding ortholog is missing, a BeT might link paralogous sequences. That problem is partly taken care of by COG's approach: BeTs are only grouped when they form triangles, and triangles are merged only when they have a common side. 
However, if more than one species has lost the corresponding ortholog, the construction over triangles will not suffice to prevent paralogs from being clustered together. This scenario is far from unlikely, because losses occurring before speciation events get replicated, and therefore the problem becomes very significant as more species and strains are included for analysis. In fact, simple situations, such as the one illustrated in <xref ref-type="fig" rid="fig1">Figure 1</xref>, are sufficient to have paralogs clustered together. It is then up to the human curation step at the end of the COG building process (<xref ref-type="bibr" rid="b3">3</xref>) to resolve all such cases.</p><p>The difficulty caused by a single missing ortholog can be easily avoided by requiring that all BeTs be symmetrical, which is what most other projects do. However, if the corresponding ortholog is missing in both genomes, even a symmetrical BeT will link paralogs. Therefore, BeTs, even symmetrical ones, do not necessarily link orthologs.</p><p>This problem could be solved through phylogenetic analysis of the relevant gene families, in particular tree reconciliation (<xref ref-type="bibr" rid="b8">8</xref>), but this procedure is not yet practical in large-scale, automated contexts (<xref ref-type="bibr" rid="b2">2</xref>). In the following, we present an algorithm that detects non-orthology without the need for gene tree construction, then report its application to the last version of the COGs database. The algorithm was developed in the context of our own orthology classification project OMA (<xref ref-type="bibr" rid="b9">9</xref>), in which it is used to verify every predicted orthologous relation.</p></sec><sec sec-type="materials|methods"><title>MATERIALS AND METHODS</title><p>The algorithm presented here is designed to detect non-trivial paralogous relations within groups of orthologs such as COG groups. 
Knowing that a paralogous relation within a group is likely to be caused by the loss of the corresponding ortholog in both species, the algorithm looks for a third-party species, which we call the &#x02018;witness of non-orthology&#x02019;, in which both corresponding orthologs are present (<xref ref-type="fig" rid="fig2">Figure 2</xref>). Under the assumptions of good and complete data, and similar evolutionary rates among orthologs, such a situation is characterized by the following three requirements on the evolutionary distances: (i) In <italic>Z</italic>, <italic>z</italic><sub>3</sub> is the closest protein to <italic>x</italic><sub>1</sub> and <italic>z</italic><sub>4</sub> is the closest protein to <italic>y</italic><sub>2</sub>. (ii) The pair (<italic>x</italic><sub>1</sub>, <italic>z</italic><sub>3</sub>) must be significantly closer than (<italic>x</italic><sub>1</sub>, <italic>z</italic><sub>4</sub>), and conversely, (<italic>y</italic><sub>2</sub>, <italic>z</italic><sub>4</sub>) must be significantly closer than (<italic>y</italic><sub>2</sub>, <italic>z</italic><sub>3</sub>). That excludes cases where <italic>z</italic><sub>3</sub> and <italic>z</italic><sub>4</sub> are in-paralogs (<xref ref-type="fig" rid="fig3">Figure 3</xref>, left), because for in-paralogs to fulfill those conditions, convergent evolution at the sequence level would be required, a phenomenon that is so unlikely that we ignore it (<xref ref-type="bibr" rid="b10">10</xref>). (iii) The distance of the pair (<italic>x</italic><sub>1</sub>, <italic>z</italic><sub>4</sub>) must be similar to that of (<italic>y</italic><sub>2</sub>, <italic>z</italic><sub>3</sub>). 
That excludes cases where <italic>X</italic> (respectively <italic>Y</italic>) speciated before the duplication event, in which case <italic>x</italic><sub>1</sub> (respectively <italic>y</italic><sub>2</sub>) is orthologous to all three other genes (<xref ref-type="fig" rid="fig3">Figure 3</xref>, right).</p><p>We finish this overview of the algorithm by considering the impact of LGT and gene fusion/fission. Clearly, the algorithm presented here was not designed to detect LGT events between <italic>x</italic><sub>1</sub> and <italic>y</italic><sub>2</sub>, an interesting problem in itself that remains largely unsolved. More importantly here, an LGT in a third-party species <italic>Z</italic> can lead to a situation where <italic>Z</italic> wrongly appears to be a witness of non-orthology: consider three orthologous proteins <italic>x</italic><sub>1</sub>, <italic>y</italic><sub>2</sub> and <italic>z</italic><sub>3</sub> in three species <italic>X</italic>, <italic>Y</italic> and <italic>Z</italic>. At some point, <italic>Z</italic> acquires through LGT a member of that orthologous family, which we now refer to as <italic>z</italic><sub>4</sub>. <italic>Z</italic> keeps both copies <italic>z</italic><sub>3</sub> and <italic>z</italic><sub>4</sub>. Furthermore, <italic>Z</italic> happens to be closer to <italic>X</italic> than to <italic>Y</italic>, while the donor of <italic>z</italic><sub>4</sub> is closer to <italic>Y</italic> than to <italic>X</italic>. This situation leads to a misclassification by our algorithm. Although such cases cannot be ruled out, we did not encounter any among the numerous case-by-case analyses performed on the results. It could be that orthologous gene displacement of <italic>z</italic><sub>3</sub> by <italic>z</italic><sub>4</sub> through homologous recombination is a much more likely scenario, and besides, the frequency of LGT appears to be higher among closely related species (<xref ref-type="bibr" rid="b11">11</xref>). 
As for gene fusion or gene fission, the units for amino acid sequence analysis are no longer proteins but domains. Even though the analysis of homologous domains from distinct proteins is scientifically meaningful, our analysis remains at the level of entire proteins to simplify matters.</p><p>Note that the complications caused by LGT events and, probably to a lesser extent, by gene fusion/fission are not specific to our method and pose challenges to other approaches as well, in particular tree reconciliation.</p><sec><title>Input data</title><p>The algorithm uses two inputs: the COGs database and pairwise sequence alignments between all proteins involved in the analysis. As introduced above, the orthology of two sequences is verified through an exhaustive search for the corresponding sequences in complete third-party genomes. Therefore, a large number of genomes is desirable. However, since the relation between every pair of sequences is needed, such searches require the computation of a very large number of pairwise alignments. For practical reasons, all results presented here rely on the Smith&#x02013;Waterman (<xref ref-type="bibr" rid="b12">12</xref>) all-against-all protein alignments precomputed in the scope of the OMA project (<xref ref-type="bibr" rid="b9">9</xref>).</p><p>For each alignment, a PAM distance estimate and the corresponding variance are computed using maximum likelihood and numeric integration (<xref ref-type="bibr" rid="b13">13</xref>,<xref ref-type="bibr" rid="b14">14</xref>).</p></sec><sec><title>Comparison of evolutionary distances</title><p>The algorithm uses evolutionary distances to detect paralogs. However, the distance estimates are subject to perturbation, which must be taken into account when comparing them. 
Therefore, assuming that errors are normally distributed, the difference &#x00394;(<italic>d</italic><sub>1</sub>, <italic>d</italic><sub>2</sub>) of two distances <italic>d</italic><sub>1</sub>, <italic>d</italic><sub>2</sub> has expected value:
<disp-formula><mml:math id="M1"><mml:mrow><mml:mi>E</mml:mi><mml:mo>[</mml:mo><mml:mi>&#x00394;</mml:mi><mml:mo>(</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>)</mml:mo><mml:mo>]</mml:mo><mml:mo>=</mml:mo><mml:mi>E</mml:mi><mml:mo>(</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>)</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mi>E</mml:mi><mml:mo>(</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>
with variance
<disp-formula><mml:math id="M2"><mml:mrow><mml:msup><mml:mi>&#x003c3;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>[</mml:mo><mml:mi>&#x00394;</mml:mi><mml:mo>(</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>)</mml:mo><mml:mo>]</mml:mo><mml:mo>=</mml:mo><mml:msup><mml:mi>&#x003c3;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>(</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>)</mml:mo><mml:mo>+</mml:mo><mml:msup><mml:mi>&#x003c3;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>(</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>)</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mn>2</mml:mn><mml:mtext>Cov</mml:mtext><mml:mo>(</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>
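</p><p>The comparison of two distance estimates described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Darwin implementation used for the article: the function names difference_sigma and significantly_closer are hypothetical, the covariance is treated as a given input because its estimation is not described in this article, and the default k = 1.96 is the value adopted in the Algorithm section.</p><preformat>
```python
import math


def difference_sigma(var1, var2, cov=0.0):
    """Standard deviation of d1 - d2, from Var(d1) + Var(d2) - 2 Cov(d1, d2)."""
    variance = var1 + var2 - 2.0 * cov
    # Assumption: clamp at zero in case a rough covariance estimate
    # pushes the variance slightly negative.
    return math.sqrt(max(variance, 0.0))


def significantly_closer(d_near, var_near, d_far, var_far, cov=0.0, k=1.96):
    """True when d_far exceeds d_near by more than k standard deviations."""
    sigma = difference_sigma(var_near, var_far, cov)
    return (d_far - d_near) > k * sigma


# Illustrative numbers only (not taken from the article).
print(significantly_closer(30.0, 16.0, 41.0, 25.0))            # False
print(significantly_closer(30.0, 16.0, 41.0, 25.0, cov=10.0))  # True
```
</preformat><p>In this made-up example the variances are 16 and 25. With the covariance ignored, a gap of 11 PAM units falls short of 1.96 standard deviations of the difference (about 12.6); with a covariance of 10 taken into account, the threshold drops to about 9.0 and the same gap becomes significant.</p><p>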
If the two distances are independent, the covariance term disappears and the variance of the difference can be obtained directly from the individual variances. But more often than not, <italic>d</italic><sub>1</sub> and <italic>d</italic><sub>2</sub> involve a common protein and are therefore not independent, meaning that not taking the covariance into account overestimates the error. We have developed a method to approximate the covariance of two evolutionary distances, which will be the subject of a separate article.</p></sec><sec><title>Algorithm</title><p>The algorithm goes through each COG group and verifies that every two genes <italic>x</italic><sub>1</sub>, <italic>y</italic><sub>2</sub> in the group that come from different species have a significant alignment and are indeed orthologs. Alignments are considered significant if the score is above 130 (47 bits, which typically corresponds to an <italic>E</italic>-value around 2e&#x02212;6) and the alignment covers at least 50% of the length of the shorter sequence. The verification of orthology is performed through the search, in each third-party genome <italic>Z</italic>, of two genes <italic>z</italic><sub>3</sub> and <italic>z</italic><sub>4</sub> that fulfill the three conditions (i&#x02013;iii) presented at the beginning of this section:
<disp-formula id="e1"><label>1</label><mml:math id="M3"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mo>&#x02200;</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02260;</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mn>3</mml:mn></mml:msub><mml:mo>:</mml:mo><mml:mi>&#x00394;</mml:mi><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msub><mml:mi>z</mml:mi><mml:mn>3</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msub><mml:mi>z</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>)</mml:mo><mml:mo>&#x0003c;</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x022c5;</mml:mo><mml:mi>&#x003c3;</mml:mi><mml:mo>[</mml:mo><mml:mi>&#x00394;</mml:mi><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msub><mml:mi>z</mml:mi><mml:mn>3</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msub><mml:mi>z</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>)</mml:mo><mml:mo>]</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>&#x02200;</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x02260;</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mn>4</mml:mn></mml:msub><mml:mo>:</mml:mo><mml:mi>&#x00394;</mml:mi><mml:mo>(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:msub><mml:mi>z</mml:mi><mml:mn>4</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:msub><mml:mi>z</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>)</mml:mo><mml:mo>&#x0003c;</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x022c5;</mml:mo><mml:mi>&#x003c3;</mml:mi><mml:mo>[</mml:mo><mml:mi>&#x00394;</mml:mi><mml:mo>(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:msub><mml:mi>z</mml:mi><mml:mn>4</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:msub><mml:mi>z</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>)</mml:mo><mml:mo>]</
mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="e2"><label>2</label><mml:math id="M4"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x00394;</mml:mi><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msub><mml:mi>z</mml:mi><mml:mn>4</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msub><mml:mi>z</mml:mi><mml:mn>3</mml:mn></mml:msub><mml:mo>)</mml:mo><mml:mo>&#x0003e;</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x022c5;</mml:mo><mml:mi>&#x003c3;</mml:mi><mml:mo>[</mml:mo><mml:mi>&#x00394;</mml:mi><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msub><mml:mi>z</mml:mi><mml:mn>4</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msub><mml:mi>z</mml:mi><mml:mn>3</mml:mn></mml:msub><mml:mo>)</mml:mo><mml:mo>]</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>&#x00394;</mml:mi><mml:mo>(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:msub><mml:mi>z</mml:mi><mml:mn>3</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:msub><mml:mi>z</mml:mi><mml:mn>4</mml:mn></mml:msub><mml:mo>)</mml:mo><mml:mo>&#x0003e;</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x022c5;</mml:mo><mml:mi>&#x003c3;</mml:mi><mml:mo>[</mml:mo><mml:mi>&#x00394;</mml:mi><mml:mo>(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:msub><mml:mi>z</mml:mi><mml:mn>3</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:msub><mml:mi>z</mml:mi><mml:mn>4</mml:mn></mml:msub><mml:mo>)</mml:mo><mml:mo>]</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="e3"><label>3</label><mml:math id="M5"><mml:mrow><mml:mrow><mml:mo>&#x02223;</mml:mo><mml:mrow><mml:mi>&#x00394;</mml:mi><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msub><mml:mi>z</mml:mi><mml:mn>4</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:msub><mml:mi>z</mml:mi><mml:mn>3</mml:mn></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x02223;</mml:mo></mml:mrow><mml:mo>&#x0003c;</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x022c5;</mml:mo><mml:msqrt><mml:mrow><mml:mo>[</mml:mo><mml:msup><mml:mi>&#x003c3;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msub><mml:mi>z</mml:mi><mml:mn>4</mml:mn></mml:msub><mml:mo>)</mml:mo><mml:mo>+</mml:mo><mml:msup><mml:mi>&#x003c3;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:msub><mml:mi>z</mml:mi><mml:mn>3</mml:mn></mml:msub><mml:mo>)</mml:mo><mml:mo>]</mml:mo></mml:mrow></mml:msqrt><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>
where <italic>k</italic> is the confidence level, which we set to 1.96. If the quartet (<italic>x</italic><sub>1</sub>, <italic>y</italic><sub>2</sub>, <italic>z</italic><sub>3</sub>, <italic>z</italic><sub>4</sub>) fulfills all three conditions, there is enough evidence to consider <italic>x</italic><sub>1</sub> and <italic>y</italic><sub>2</sub> paralogs. The algorithm was implemented in the programming environment Darwin (<xref ref-type="bibr" rid="b15">15</xref>).</p><p>A note about parameter choice. As mentioned previously, the classification of protein pairs into orthologs and non-orthologs can be very difficult or even impossible, especially when a speciation event immediately follows a duplication event, or in the situation of frequent gene gain and gene loss, as is observed in certain groups of proteins, such as metabolic enzymes. Here, the choice of <italic>k</italic> = 1.96 standard deviations was established empirically such that the false-positive rate (orthologs misclassified as non-orthologs) is much smaller than the false-negative rate (missed non-orthologs). In other words, we expect that our algorithm reports only clear-cut cases of paralogy.</p></sec><sec><title>Phylogenetic analysis</title><p>To verify individual cases reported by the algorithm, phylogenetic trees were constructed using independent, common software packages, as follows: sequences were aligned using Muscle (<xref ref-type="bibr" rid="b16">16</xref>) and ClustalW (<xref ref-type="bibr" rid="b17">17</xref>). Whenever the two alignments differed, the one that seemed more plausible was selected. Short sequences, suspicious regions and most gap-containing columns were removed. Distance matrices (JTT, gamma) generated with protdist (<xref ref-type="bibr" rid="b18">18</xref>) were used to construct phylogenetic trees using neighbor (<xref ref-type="bibr" rid="b18">18</xref>). Clusters of interest were selected for detailed analysis. 
Alignments of the selected data were performed using Tcoffee (<xref ref-type="bibr" rid="b19">19</xref>), and the result was subsequently edited as described above, taking into account the Tcoffee CORE (consistency of overall residue evaluation) values for the alignment. The stability of the tree topology was assessed by building an extended majority rule consensus tree using consense (<xref ref-type="bibr" rid="b18">18</xref>) from BIONJ (<xref ref-type="bibr" rid="b20">20</xref>) searches performed on 1000 bootstrap replicates, which were constructed with seqboot (<xref ref-type="bibr" rid="b18">18</xref>). Protein trees of the data subset were constructed using the Bayesian tree-building method MrBayes (<xref ref-type="bibr" rid="b21">21</xref>) (JTT; invgamma-4; 1&#x02009;000&#x02009;000 generations). The trees were rooted using an outgroup whenever a suitable ancient paralog could be found. Note that since the analysis aims to cluster homologs into clans, and not to predict their hierarchical order, placement of the root is not critical here.</p></sec><sec><title>Validation</title><p>The performance of the algorithm was evaluated using the HAMAP database (<xref ref-type="bibr" rid="b22">22</xref>), a collection of orthologous microbial protein families generated manually by expert curators in the Swiss&#x02013;Prot group. The database was retrieved on November 23, 2005. Proteins from the 99 most represented species also present in our OMA project were used in the analysis: of all 29 245 proteins, there were 21&#x02009;831 proteins (75.6%), grouped in 1189 orthologous families. That yielded 309&#x02009;829 pairwise relations to be verified by our procedure.</p><p>The algorithm classified 279&#x02009;568 (90.2%) relations as orthologous and 9420 (3.0%) as paralogous. The remaining 20&#x02009;841 (6.7%) relations had alignments below our significance threshold and could therefore not be processed. 
The accuracy of the algorithm, in particular its very low false-positive rate, was confirmed by the following observations:</p><p>First, paralogy is often reflected by different Swiss&#x02013;Prot ID names (e.g. GREA/GREB) (<xref ref-type="bibr" rid="b23">23</xref>). Of the 9420 predicted paralogs, only 2728 (29.0%) have identical ID names. Second, the distribution of the paralogs among HAMAP families was investigated: all 9420 cases of paralogy found by the algorithm are concentrated in only 150 (12.6%) of the 1189 HAMAP families. This is consistent with the fact that the inclusion of just one paralogous protein into an orthologous family is likely to result in several paralogous relations inside that family. And indeed, in all except 8 of these 150 families, more than one paralogous pair was detected. Third, these 8 improbable cases were inspected individually using phylogenetic analysis, which confirmed that they are bona fide paralogs (possibly xenologs). Fourth, the predicted cases of paralogy were compared to the gene trees over HAMAP families built by the group of Laurent Duret (<ext-link ext-link-type="uri" xlink:href="http://pbil.univ-lyon1.fr/help/HAMAP.html"/>), in a way similar to HOBACGEN (<xref ref-type="bibr" rid="b24">24</xref>). Of the predicted cases, 7217 could be mapped to those trees. In 6418 (88.9%) instances, paralogy was confirmed by the trees, a remarkably high level of consistency considering that the two methods are very different. 
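</p><p>As a consistency check on the figures quoted in this section, the short Python sketch below recomputes the reported percentages from the raw counts. The helper name share and the variable names are illustrative and are not part of the original analysis; the counts themselves are those given in the text.</p><preformat>
```python
def share(part, whole):
    """Percentage of `whole` made up by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# Counts as reported in the Validation section.
relations = 309829
orthologous, paralogous, unprocessed = 279568, 9420, 20841
same_id = 2728                      # predicted paralogs with identical ID names
families_hit, families = 150, 1189  # HAMAP families containing predicted paralogs
mapped, confirmed = 7217, 6418      # predictions mapped to, and confirmed by, gene trees

# The three classes partition the full set of pairwise relations.
assert orthologous + paralogous + unprocessed == relations

print(share(orthologous, relations))  # 90.2
print(share(paralogous, relations))   # 3.0
print(share(unprocessed, relations))  # 6.7
print(share(same_id, paralogous))     # 29.0
print(share(families_hit, families))  # 12.6
print(share(confirmed, mapped))       # 88.9
print(mapped - confirmed)             # 799 predictions not confirmed by the trees
```
</preformat><p>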
As for the conflicting 799 cases, which are distributed among 51 families, we believe that most of them are caused by inaccuracies in the gene trees, which are constructed using a variant of Neighbor Joining on observed divergence, a rather crude measure of evolutionary distance.</p></sec></sec><sec><title>RESULTS AND DISCUSSION</title><p>The algorithm was run on the current release of the COGs database (<xref ref-type="bibr" rid="b4">4</xref>) (<ext-link ext-link-type="uri" xlink:href="http://www.biomedcentral.com/1471-2105/4/41"/>). We used the precomputed all-against-all results from 107 complete genomes, of which 52 are represented in COGs, whereas the remaining 55 genomes were only used as potential witnesses of non-orthology. [The complete list is available in the Supplementary Data.] Across all 4654 COGs, there is a total of 5&#x02009;537&#x02009;713 pairwise relations. Pairs between proteins from the same species (484 043) were not considered further. Additionally, 2&#x02009;733&#x02009;371 relations involve at least one protein from a species outside our set of 107 genomes. Consequently, the following results were obtained through the verification of 2&#x02009;320&#x02009;199 relations, 45.9% of all potential orthologous relations.</p><p>The results are presented in <xref ref-type="table" rid="tbl1">Table 1</xref>. Surprisingly, 44% of the relations had alignment scores below our significance threshold of 130, which corresponds to an <italic>E</italic>-value of about 2e&#x02212;6, and could therefore not be verified. This implies that a substantial fraction of relations within COGs cannot be, on the basis of pairwise alignments, reliably considered homologous.</p><p>The other notable result is the significant proportion of non-orthologous relations found by the algorithm, more than a quarter of the pairs that could be verified. They are distributed among about a third of all COGs. 
The list of such groups, along with all detected non-orthology cases, is available in the Supplementary Data.</p><p>If we require the presence of at least two witnesses of non-orthology for a pair to be considered non-orthologous, the algorithm still finds 251 391 (19.4%) such pairs within 1146 (24.6%) COGs. When the sequence with the most non-orthologous relations is removed from each COG group, the total number of non-orthologous pairs decreases by only 24&#x02009;868 (1.9%).</p><p>The majority (70%) of the groups containing predominantly non-orthologs are involved in metabolic processes, according to the functional description of the COGs database, although they only constitute a minority of all COGs. In contrast, groups involved in information storage and processing (8%) or cellular processing and signaling (11%) less frequently include non-orthologs. The remaining 11% are poorly characterized proteins. This result is in agreement with previous studies, which state that in prokaryotes, metabolic functions are under high evolutionary pressure from changing environments (<xref ref-type="bibr" rid="b25">25</xref>).</p><sec><title>Phylogenetic analysis of selected COG groups</title><p>The presence of non-orthology in some COG groups is hardly a surprise and was in fact recently acknowledged by Koonin, coauthor of COG, in a review article (<xref ref-type="bibr" rid="b2">2</xref>). What is surprising here is rather the extent of non-orthology detected by the algorithm. That prompted us to verify, in addition to the validation work reported in the previous section, a number of our predictions using detailed phylogenetic analysis. In this section, we report the conclusions of such analyses on three COGs, for which we could build Bayesian likelihood trees of high confidence, confirmed by consensus NJ trees with high bootstrap values. Clan assignments were made based on those trees, and considering lineage and function, whenever reliable annotations could be found. 
We strongly expect pairs of proteins from different clans to be non-orthologous, and use these results to evaluate the accuracy of the predictions made by the algorithm.</p><p>COG0508 consists of complex-forming acyltransferases that are composed of an N-terminal biotin or lipoic acid attachment domain and a central protein&#x02013;protein interaction domain, followed by the catalytic 2-oxoacid dehydrogenases acyltransferase domain. The phylogenetic analysis of roughly half of the proteobacterial sequence data from COG0508 suggests the existence of at least four distinct subgroups (see <xref ref-type="fig" rid="fig4">Figure 4</xref>): clan 1 is formed by sequences from gammaproteobacteria, including the dihydrolipoyllysine-residue acetyltransferase component of the pyruvate dehydrogenase complex (EC 2.3.1.12) (AceF) from <italic>Escherichia coli</italic>. Clan 2 consists of proteins highly similar to the <italic>Bacillus subtilis</italic> lipoamide acyltransferase component of the branched-chain alpha-keto acid dehydrogenase complex (EC 2.3.1.168). All sequences in clan 2 are alphaproteobacterial, except for <italic>Pseudomonas aeruginosa</italic> proteins, which are found in both clan 1 and clan 2. As mentioned in section 2, such a situation could arise through lateral gene transfer from an alphaproteobacterium to <italic>P.aeruginosa</italic>. If that were the case, there would be strong evidence that clans 1 and 2 should be merged. However, in the present case, it is possible to populate both clans with additional sequences from more distant species (data not shown), which justifies the separation into two clans. Additionally, the long distance between the two clans and the distinct function of at least one family member of each subgroup support this conclusion. Clan 3 includes the dihydrolipoyllysine-residue succinyltransferase component of 2-oxoglutarate dehydrogenase complex (EC 2.3.1.61) (SucB) of <italic>E.coli</italic>. 
Note that clan 3 includes two protein sequences of <italic>Rhizobium meliloti</italic>, but those are clearly ancient duplicates, and thus sequence 3b is likely to form yet another clan on its own. Finally, clan 4 is formed by what is presumably a further dehydrogenase component from alphaproteobacteria. The algorithm predicted 382 cases of non-orthologous relations within the sequences considered here. An extract of the result list is given in <xref ref-type="table" rid="tbl2">Table 2</xref> (the full list of predicted non-orthologous relations is available in the Supplementary Data). A total of 379 predictions are consistent with the clan assignment, while the remaining three predictions support the exclusion of <italic>R.meliloti</italic> 3b from clan 3. Furthermore, comparison with the clan assignment reveals that the algorithm missed 24 non-orthologous relations, which implies a false-negative rate of 6.0%.</p><p>COG0513 includes various DEAD-box-containing RNA helicases. The phylogenetic analysis of the proteobacterial data from this group suggests the existence of six clans (see <xref ref-type="fig" rid="fig5">Figure 5</xref>), of which five are formed around the following proteins from <italic>E.coli</italic>: (i) the ATP-dependent RNA helicase SrmB, which is involved in an early assembly step of 50S ribosomal subunits (<xref ref-type="bibr" rid="b26">26</xref>); (ii) the cold-shock DEAD-box protein A (DeaD), required for cell division and normal cell growth at low temperature (<xref ref-type="bibr" rid="b27">27</xref>); (iii) the DEAD-box RNA helicase B (RhlB), a component of the RNA degradosome, which seems to have little activity unless activated by the endoribonuclease RNase E (<xref ref-type="bibr" rid="b28">28</xref>); (iv) the putative RNA helicase RhlE, which has been shown to be non-essential for normal cell growth (<xref ref-type="bibr" rid="b29">29</xref>); (v) the ATP-independent RNA 3&#x02032;&#x02192;5&#x02032; helicase DbpA (<xref ref-type="bibr" rid="b30">30</xref>) 
and (vi) a sixth subgroup, which includes RNA helicases that are conserved in some alphaproteobacteria. The algorithm predicted 408 cases of non-orthology, 88.9% of the 459 non-orthologous relations that can be deduced from the clan assignment. In this case, there were no false-positive predictions.</p><p>COG1113 consists of members of the amino acid-polyamine-organo-cation (APC) superfamily from bacteria, specifically those integral membrane proteins that are involved in the transport of amino acids in prokaryotes. The phylogenetic analysis of this group suggests the existence of various clans (see <xref ref-type="fig" rid="fig6">Figure 6</xref>), including those formed around the seven proteins found in <italic>E.coli</italic>: (i) phenylalanine-specific permease (PheP), (ii) aromatic amino acid transport protein (AroP), (iii) probable transport protein YifK, (iv) proline-specific permease (ProY), (v) <sc>d</sc>-serine/<sc>d</sc>-alanine/glycine transporter (CycA), (vi) <sc>l</sc>-asparagine permease (AnsP) and (vii) GABA (4-aminobutyrate) permease (GabP). The seven clans were predicted with high probability and their clustering was confirmed by significant bootstrap values (99&#x02013;100%) except for one (92%). The analyzed dataset includes members from closely related organisms, but most clans can already be populated with further members from other species of COG1113. The algorithm predicted 257 pairs of non-orthologs, of which 254 are consistent with the phylogenetic analysis. That represents 97.7% of the 260 non-orthologous relations that can be deduced from the clan assignment. The three conflicting predictions suggest that <italic>P.aeruginosa</italic> 4a is non-orthologous to <italic>E.coli K12</italic> ProY and to <italic>E.coli H7 EDL933</italic> 4, and that <italic>P.aeruginosa</italic> 4b is non-orthologous to <italic>Yersinia pestis</italic> 4b. 
But here too, the extension of the phylogenetic analysis using additional sequences from the UniProtKB database supports the division of clan 4 into further subgroups (data not shown).</p></sec></sec><sec><title>CONCLUSION</title><p>We present here a new algorithm for the detection of non-orthologous relations caused by the limitations of genome-specific best hit methods, such as the one underlying the COGs database. The algorithm, rather than building gene trees, a process both computationally expensive and error-prone, works with pairwise distance estimates. The accuracy of the algorithm was evaluated through verification of the distribution of predicted cases, case-by-case phylogenetic analysis and comparisons with predictions from other projects using independent methods. Using conservative parameters, the algorithm detected non-orthology in about a third of the COG groups. Methods sensitive to the correctness of orthology assignments, such as function prediction, phylogenetic tree reconstruction or genome rearrangement analysis, will profit from both the algorithm and the results presented here.</p></sec><sec><title>SUPPLEMENTARY DATA</title><p>Supplementary Data are available at NAR Online.</p></sec></body><back><ack><p>The authors thank G. Cannarozzi, D. Margadant, A. 
Schneider and two anonymous reviewers for their comments and suggestions on the manuscript.</p><p><italic>Conflict of interest statement.</italic> None declared.</p></ack><ref-list><title>REFERENCES</title><ref id="b1"><label>1</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fitch</surname><given-names>W.M.</given-names></name></person-group><article-title>Distinguishing homologous from analogous proteins</article-title><source>Syst Zool.</source><year>1970</year><volume>19</volume><fpage>99</fpage><lpage>113</lpage><pub-id pub-id-type="pmid">5449325</pub-id></citation></ref><ref id="b2"><label>2</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Koonin</surname><given-names>E.V.</given-names></name></person-group><article-title>Orthologs, paralogs, and evolutionary genomics</article-title><source>Annu. Rev. Genet.</source><year>2005</year><volume>39</volume><fpage>309</fpage><lpage>338</lpage><pub-id pub-id-type="pmid">16285863</pub-id></citation></ref><ref id="b3"><label>3</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tatusov</surname><given-names>R.L.</given-names></name><name><surname>Koonin</surname><given-names>E.V.</given-names></name><name><surname>Lipman</surname><given-names>D.J.</given-names></name></person-group><article-title>A genomic perspective on protein families</article-title><source>Science</source><year>1997</year><volume>278</volume><fpage>631</fpage><lpage>637</lpage><pub-id pub-id-type="pmid">9381173</pub-id></citation></ref><ref id="b4"><label>4</label><citation citation-type="journal"><person-group 
person-group-type="author"><name><surname>Tatusov</surname><given-names>R.L.</given-names></name><name><surname>Fedorova</surname><given-names>N.D.</given-names></name><name><surname>Jackson</surname><given-names>J.D.</given-names></name><name><surname>Jacobs</surname><given-names>A.R.</given-names></name><name><surname>Kiryutin</surname><given-names>B.</given-names></name><name><surname>Koonin</surname><given-names>E.V.</given-names></name><name><surname>Krylov</surname><given-names>D.M.</given-names></name><name><surname>Mazumder</surname><given-names>R.</given-names></name><name><surname>Mekhedov</surname><given-names>S.L.</given-names></name><name><surname>Nikolskaya</surname><given-names>A.N.</given-names></name><etal/></person-group><article-title>The cog database: an updated version includes eukaryotes</article-title><source>BMC Bioinformatics</source><year>2003</year><volume>4</volume><fpage>41</fpage><pub-id pub-id-type="pmid">12969510</pub-id></citation></ref><ref id="b5"><label>5</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fujibuchi</surname><given-names>W.</given-names></name><name><surname>Ogata</surname><given-names>H.</given-names></name><name><surname>Matsuda</surname><given-names>H.</given-names></name><name><surname>Kanehisa</surname><given-names>M.</given-names></name></person-group><article-title>Automatic detection of conserved gene clusters in multiple genomes by graph comparison and p-quasi grouping</article-title><source>Nucleic Acids Res.</source><year>2000</year><volume>28</volume><fpage>4029</fpage><lpage>4036</lpage><pub-id pub-id-type="pmid">11024184</pub-id></citation></ref><ref id="b6"><label>6</label><citation citation-type="journal"><person-group 
person-group-type="author"><name><surname>Remm</surname><given-names>M.</given-names></name><name><surname>Storm</surname><given-names>C.</given-names></name><name><surname>Sonnhammer</surname><given-names>E.</given-names></name></person-group><article-title>Automatic clustering of orthologs and in-paralogs from pairwise species comparisons</article-title><source>J. Mol. Biol.</source><year>2001</year><volume>314</volume><fpage>1041</fpage><lpage>1052</lpage><pub-id pub-id-type="pmid">11743721</pub-id></citation></ref><ref id="b7"><label>7</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname><given-names>Y.</given-names></name><name><surname>Sultana</surname><given-names>R.</given-names></name><name><surname>Pertea</surname><given-names>G.</given-names></name><name><surname>Cho</surname><given-names>J.</given-names></name><name><surname>Karamycheva</surname><given-names>S.</given-names></name><name><surname>Tsai</surname><given-names>J.</given-names></name><name><surname>Parvizi</surname><given-names>B.</given-names></name><name><surname>Cheung</surname><given-names>F.</given-names></name><name><surname>Antonescu</surname><given-names>V.</given-names></name><name><surname>White</surname><given-names>J.</given-names></name><etal/></person-group><article-title>Cross-referencing eukaryotic genomes: TIGR Orthologous Gene Alignments (TOGA)</article-title><source>Genome Res.</source><year>2002</year><volume>12</volume><fpage>493</fpage><lpage>502</lpage><pub-id pub-id-type="pmid">11875039</pub-id></citation></ref><ref id="b8"><label>8</label><citation citation-type="journal"><person-group 
person-group-type="author"><name><surname>Goodman</surname><given-names>M.</given-names></name><name><surname>Czelusniak</surname><given-names>J.</given-names></name><name><surname>Moore</surname><given-names>G.W.</given-names></name><name><surname>Romero-Herrera</surname><given-names>A.E.</given-names></name></person-group><article-title>Fitting the gene lineage into its species lineage: a parsimony strategy illustrated by cladograms constructed from globin sequences</article-title><source>Syst. Zool.</source><year>1979</year><volume>28</volume><fpage>132</fpage><lpage>168</lpage></citation></ref><ref id="b9"><label>9</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Dessimoz</surname><given-names>C.</given-names></name><name><surname>Cannarozzi</surname><given-names>G.</given-names></name><name><surname>Gil</surname><given-names>M.</given-names></name><name><surname>Margadant</surname><given-names>D.</given-names></name><name><surname>Roth</surname><given-names>A.</given-names></name><name><surname>Schneider</surname><given-names>A.</given-names></name><name><surname>Gonnet</surname><given-names>G.H.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>McLysaght</surname><given-names>A.</given-names></name><name><surname>Huson</surname><given-names>D.H.</given-names></name></person-group><article-title>OMA, a comprehensive, automated project for the identification of orthologs from complete genome data: introduction and first achievements</article-title><source>Lecture Notes in Computer Science</source><year>2005</year><volume>Vol. 
3678</volume><publisher-name>Springer-Verlag</publisher-name><fpage>61</fpage><lpage>72</lpage></citation></ref><ref id="b10"><label>10</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Doolittle</surname><given-names>R.F.</given-names></name></person-group><article-title>Convergent evolution: the need to be explicit</article-title><source>Trends Biochem. Sci.</source><year>1994</year><volume>19</volume><fpage>15</fpage><lpage>18</lpage><pub-id pub-id-type="pmid">8140615</pub-id></citation></ref><ref id="b11"><label>11</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lawrence</surname><given-names>J.G.</given-names></name><name><surname>Hendrickson</surname><given-names>H.</given-names></name></person-group><article-title>Lateral gene transfer: when will adolescence end?</article-title><source>Mol. Microbiol.</source><year>2003</year><volume>50</volume><fpage>739</fpage><lpage>749</lpage><pub-id pub-id-type="pmid">14617137</pub-id></citation></ref><ref id="b12"><label>12</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Smith</surname><given-names>T.F.</given-names></name><name><surname>Waterman</surname><given-names>M.S.</given-names></name></person-group><article-title>Identification of common molecular subsequences</article-title><source>J. Mol. Biol.</source><year>1981</year><volume>147</volume><fpage>195</fpage><lpage>197</lpage><pub-id pub-id-type="pmid">7265238</pub-id></citation></ref><ref id="b13"><label>13</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Gonnet</surname><given-names>G.H.</given-names></name></person-group><article-title>A <italic>Tutorial Introduction to Computational Biochemistry Using Darwin. 
Technical Report Informatik</italic></article-title><year>1994</year><publisher-loc>Switzerland</publisher-loc><publisher-name>ETH Zurich</publisher-name></citation></ref><ref id="b14"><label>14</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Muller</surname><given-names>T.</given-names></name><name><surname>Vingron</surname><given-names>M.</given-names></name></person-group><article-title>Modeling amino acid replacement</article-title><source>J. Comput. Biol.</source><year>2000</year><volume>7</volume><fpage>761</fpage><lpage>776</lpage><pub-id pub-id-type="pmid">11382360</pub-id></citation></ref><ref id="b15"><label>15</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gonnet</surname><given-names>G.H.</given-names></name><name><surname>Hallett</surname><given-names>M.T.</given-names></name><name><surname>Korostensky</surname><given-names>C.</given-names></name><name><surname>Bernardin</surname><given-names>L.</given-names></name></person-group><article-title>Darwin v. 
2.0: an interpreted computer language for the biosciences</article-title><source>Bioinformatics</source><year>2000</year><volume>16</volume><fpage>101</fpage><lpage>103</lpage><pub-id pub-id-type="pmid">10842729</pub-id></citation></ref><ref id="b16"><label>16</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Edgar</surname><given-names>R.C.</given-names></name></person-group><article-title>Muscle: a multiple sequence alignment method with reduced time and space complexity</article-title><source>BMC Bioinformatics</source><year>2004</year><volume>5</volume><fpage>113</fpage><pub-id pub-id-type="pmid">15318951</pub-id></citation></ref><ref id="b17"><label>17</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chenna</surname><given-names>R.</given-names></name><name><surname>Sugawara</surname><given-names>H.</given-names></name><name><surname>Koike</surname><given-names>T.</given-names></name><name><surname>Lopez</surname><given-names>R.</given-names></name><name><surname>Gibson</surname><given-names>T.J.</given-names></name><name><surname>Higgins</surname><given-names>D.G.</given-names></name><name><surname>Thompson</surname><given-names>J.D.</given-names></name></person-group><article-title>Multiple sequence alignment with the clustal series of programs</article-title><source>Nucleic Acids Res.</source><year>2003</year><volume>31</volume><fpage>3497</fpage><lpage>3500</lpage><pub-id pub-id-type="pmid">12824352</pub-id></citation></ref><ref id="b18"><label>18</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Felsenstein</surname><given-names>J.</given-names></name></person-group><year>1993</year><comment>Phylip (phylogeny inference package) version 3.5c. 
distributed by the author</comment></citation></ref><ref id="b19"><label>19</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Poirot</surname><given-names>O.</given-names></name><name><surname>O'Toole</surname><given-names>E.</given-names></name><name><surname>Notredame</surname><given-names>C.</given-names></name></person-group><article-title>Tcoffee@igs: a web server for computing, evaluating and combining multiple sequence alignments</article-title><source>Nucleic Acids Res.</source><year>2003</year><volume>31</volume><fpage>3503</fpage><lpage>3506</lpage><pub-id pub-id-type="pmid">12824354</pub-id></citation></ref><ref id="b20"><label>20</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gascuel</surname><given-names>O.</given-names></name></person-group><article-title>Bionj: an improved version of the nj algorithm based on a simple model of sequence data</article-title><source>Mol. Biol. Evol.</source><year>1997</year><volume>14</volume><fpage>685</fpage><lpage>695</lpage><pub-id pub-id-type="pmid">9254330</pub-id></citation></ref><ref id="b21"><label>21</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ronquist</surname><given-names>F.</given-names></name><name><surname>Huelsenbeck</surname><given-names>J.P.</given-names></name></person-group><article-title>Mrbayes 3: Bayesian phylogenetic inference under mixed models</article-title><source>Bioinformatics</source><year>2003</year><volume>19</volume><fpage>1572</fpage><lpage>1574</lpage><pub-id pub-id-type="pmid">12912839</pub-id></citation></ref><ref id="b22"><label>22</label><citation citation-type="journal"><person-group 
person-group-type="author"><name><surname>Gattiker</surname><given-names>A.</given-names></name><name><surname>Michoud</surname><given-names>K.</given-names></name><name><surname>Rivoire</surname><given-names>C.</given-names></name><name><surname>Auchincloss</surname><given-names>A.H.</given-names></name><name><surname>Coudert</surname><given-names>E.</given-names></name><name><surname>Lima</surname><given-names>T.</given-names></name><name><surname>Kersey</surname><given-names>P.</given-names></name><name><surname>Pagni</surname><given-names>M.</given-names></name><name><surname>Sigrist</surname><given-names>C.J.A.</given-names></name><name><surname>Lachaize</surname><given-names>C.</given-names></name><etal/></person-group><article-title>Automated annotation of microbial proteomes in swiss-prot</article-title><source>Comput. Biol. Chem.</source><year>2003</year><volume>27</volume><fpage>49</fpage><lpage>58</lpage><pub-id pub-id-type="pmid">12798039</pub-id></citation></ref><ref id="b23"><label>23</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Boeckmann</surname><given-names>B.</given-names></name><name><surname>Bairoch</surname><given-names>A.</given-names></name><name><surname>Apweiler</surname><given-names>R.</given-names></name><name><surname>Blatter</surname><given-names>M.-C.</given-names></name><name><surname>Estreicher</surname><given-names>A.</given-names></name><name><surname>Gasteiger</surname><given-names>E.</given-names></name><name><surname>Martin</surname><given-names>M.J.</given-names></name><name><surname>Michoud</surname><given-names>K.</given-names></name><name><surname>O'Donovan</surname><given-names>C.</given-names></name><name><surname>Phan</surname><given-names>I.</given-names></name><etal/></person-group><article-title>The swiss-prot protein knowledgebase and its supplement trembl in 2003</article-title><source>Nucleic Acids 
Res.</source><year>2003</year><volume>31</volume><fpage>365</fpage><lpage>370</lpage><pub-id pub-id-type="pmid">12520024</pub-id></citation></ref><ref id="b24"><label>24</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Perriere</surname><given-names>G.</given-names></name><name><surname>Duret</surname><given-names>L.</given-names></name><name><surname>Gouy</surname><given-names>M.</given-names></name></person-group><article-title>Hobacgen: database system for comparative genomics in bacteria</article-title><source>Genome Res.</source><year>2000</year><volume>10</volume><fpage>379</fpage><lpage>385</lpage><pub-id pub-id-type="pmid">10720578</pub-id></citation></ref><ref id="b25"><label>25</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pal</surname><given-names>C.</given-names></name><name><surname>Papp</surname><given-names>B.</given-names></name><name><surname>Lercher</surname><given-names>M.J.</given-names></name></person-group><article-title>Adaptive evolution of bacterial metabolic networks by horizontal gene transfer</article-title><source>Nature Genet.</source><year>2005</year><volume>37</volume><fpage>1372</fpage><lpage>1375</lpage><pub-id pub-id-type="pmid">16311593</pub-id></citation></ref><ref id="b26"><label>26</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Charollais</surname><given-names>J.</given-names></name><name><surname>Pflieger</surname><given-names>D.</given-names></name><name><surname>Vinh</surname><given-names>J.</given-names></name><name><surname>Dreyfus</surname><given-names>M.</given-names></name><name><surname>Iost</surname><given-names>I.</given-names></name></person-group><article-title>The DEAD-box RNA helicase srmb is involved in the assembly of 50s ribosomal subunits in <italic>Escherichia coli</italic></article-title><source>Mol. 
Microbiol.</source><year>2003</year><volume>48</volume><fpage>1253</fpage><lpage>1265</lpage><pub-id pub-id-type="pmid">12787353</pub-id></citation></ref><ref id="b27"><label>27</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jones</surname><given-names>P.G.</given-names></name><name><surname>Mitta</surname><given-names>M.</given-names></name><name><surname>Kim</surname><given-names>Y.</given-names></name><name><surname>Jiang</surname><given-names>W.</given-names></name><name><surname>Inouye</surname><given-names>M.</given-names></name></person-group><article-title>Cold shock induces a major ribosomal-associated protein that unwinds double-stranded RNA in <italic>Escherichia coli</italic></article-title><source>Proc. Natl Acad. Sci. USA</source><year>1996</year><volume>93</volume><fpage>76</fpage><lpage>80</lpage><pub-id pub-id-type="pmid">8552679</pub-id></citation></ref><ref id="b28"><label>28</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Carpousis</surname><given-names>A.J.</given-names></name></person-group><article-title>The <italic>Escherichia coli</italic> RNA degradosome: structure, function and relationship in other ribonucleolytic multienzyme complexes</article-title><source>Biochem. Soc. Trans.</source><year>2002</year><volume>30</volume><fpage>150</fpage><lpage>155</lpage><pub-id pub-id-type="pmid">12035760</pub-id></citation></ref><ref id="b29"><label>29</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ohmori</surname><given-names>H.</given-names></name></person-group><article-title>Structural analysis of the rhle gene of <italic>Escherichia coli</italic></article-title><source>Jpn. J. 
Genet.</source><year>1994</year><volume>69</volume><fpage>1</fpage><lpage>12</lpage><pub-id pub-id-type="pmid">8037924</pub-id></citation></ref><ref id="b30"><label>30</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Diges</surname><given-names>C.M.</given-names></name><name><surname>Uhlenbeck</surname><given-names>O.C.</given-names></name></person-group><article-title><italic>Escherichia coli</italic> dbpa is a 3&#x02032;&#x02192;5&#x02032; RNA helicase</article-title><source>Biochemistry</source><year>2005</year><volume>44</volume><fpage>7903</fpage><lpage>7911</lpage><pub-id pub-id-type="pmid">15910005</pub-id></citation></ref></ref-list><sec sec-type="display-objects"><title>Figures and Tables</title><fig id="fig1" position="float"><label>Figure 1</label><caption><p>A simple evolutionary scenario under which the COG algorithm groups paralogous sequences.</p></caption><graphic xlink:href="gkl433f1"/></fig><fig id="fig2" position="float"><label>Figure 2</label><caption><p>Suitable case of a witness. A duplication occurred before all speciations and <italic>Z</italic> is a witness of the non-orthology between the sequences <italic>x</italic><sub>1</sub> and <italic>y</italic><sub>2</sub>.</p></caption><graphic xlink:href="gkl433f2"/></fig><fig id="fig3" position="float"><label>Figure 3</label><caption><p>Unsuitable cases of witnesses. To the left, duplication occurred only in <italic>Z</italic>, and therefore <italic>z</italic><sub>3</sub> and <italic>z</italic><sub>4</sub> are in-paralogs with respect to (<italic>X</italic>, <italic>Y</italic>) and cannot act as witness of non-orthology. To the right, <italic>X</italic> speciated before the duplication event. 
Hence, <italic>x</italic><sub>1</sub> is orthologous to all three other proteins and cannot act as witness of non-orthology.</p></caption><graphic xlink:href="gkl433f3"/></fig><fig id="fig4" position="float"><label>Figure 4</label><caption><p>Unrooted phylogenetic consensus tree constructed from a Bayesian analysis of a subgroup from COG0508. Posterior probabilities are indicated to the right of the nodes and clan-supporting bootstrap values are indicated below the probability value. Predicted clans are indicated by the vertical bars on the right side. The leaf labels correspond to the following COG identifiers: <italic>Agrobacterium tumefaciens</italic> (2: AGl2719, 3: AGc4775, 4: AGc2641), <italic>Brucella melitensis</italic> (2: BMEII0746, 3: BMEI0141, 4: BMEI0856), <italic>Buchnera sp.</italic> (1: BU206, 3: BU303), <italic>E.coli</italic> K12 (COG identifier corresponds to the gene name: aceF, sucB), <italic>E.coli</italic> H7 (1: ECs0119, 3: ECs0752), <italic>Haemophilus influenzae</italic> (1: HI1232, 3: HI1661), <italic>Neisseria meningitidis</italic> (1: NMB1342, 3: NMB0956), <italic>Pasteurella multocida</italic> (1: PM0894, 3: PM0278), <italic>Pseudomonas aeruginosa</italic> (1: PA5016, 2: PA2249, 3: PA1586), <italic>Rhizobium loti</italic> (2: mll4471, 3: mll4300, 4a: mlr0385, 4b: mll3627), <italic>Rhizobium meliloti</italic> (2: SMc03203, 3a: SMc02483, 3b: SMb20019, 4: SMc01032), <italic>Rickettsia conorii</italic> (3: RC0226, 4: RC0764), <italic>Rickettsia prowazekii</italic> (3: RP179, 4: RP530), <italic>Vibrio cholerae</italic> (1: VC2413, 3: VC2086), <italic>Y.pestis</italic> (1: YPO3418, 3: YPO1114).</p></caption><graphic xlink:href="gkl433f4"/></fig><fig id="fig5" position="float"><label>Figure 5</label><caption><p>Unrooted phylogenetic consensus tree for COG0513, constructed from a Bayesian analysis. Posterior probabilities are drawn to the right of the nodes and clan-supporting bootstrap values are below the relevant nodes. 
The vertical bars to the right indicate the predicted clans. The leaf labels correspond to the COG identifiers: <italic>A.tumefaciens</italic> (2: AGl1362, 5: AGc4238, 6: AGc3366), <italic>B.melitensis</italic> (2: BMEI1824, 5: BMEI0934, 6: BMEI1035), <italic>E.coli</italic> K12 (COG identifier corresponds to the gene name: dbpA, deaD, rhlB, rhlE, srmB), <italic>H.influenzae</italic> (1: HI0422, 3: HI0231, 4: HI0892), <italic>P.multocida</italic> (1: PM1840, 3: PM1112, 4: PM1921), <italic>P.aeruginosa</italic> (2: PA0455, 3: PA2840, 4: PA3861, 5: PA0428), <italic>R.loti</italic> (2: mlr4393, 5: mlr0349, 6: mll0224), <italic>R.meliloti</italic> (2: SMc01090, 5: SMb20880, 6: SMc00522), <italic>V.cholerae</italic> (1: VC0660, 2: VC2564, 4: VC0305, 5: VCA0204), <italic>Y.pestis</italic> (1: YPO2708, 2: YPO1776, 3: YPO3488, 4: YPO3869).</p></caption><graphic xlink:href="gkl433f5"/></fig><fig id="fig6" position="float"><label>Figure 6</label><caption><p>Phylogenetic consensus tree, rooted by outgroups, constructed from a Bayesian analysis of a data subgroup from COG1113. Posterior probabilities of the Bayesian analysis are drawn to the right of the nodes and clan-supporting bootstrap values below relevant nodes. Predicted clans are indicated by vertical bars to the right. 
The leaf labels correspond to the COG identifiers: <italic>A.tumefaciens</italic> C58 (6: AGl2082), <italic>Bacillus halodurans</italic> (out: BH2171), <italic>B.melitensis</italic> (5: BMEII0038), <italic>E.coli</italic> K12 (COG identifier corresponds to the gene name: ansP, aroP, cycA, gabP, pheP, proY, yifK), <italic>E.coli</italic> H7 EDL933 (1: ZpheP, 2: ZaroP, 3: ZyifK, 4: ZproY, 5: ZcycA, 6: ZansP, 7: ZgabP), <italic>E.coli</italic> H7 (1: ECs0614, 2: ECs0116, 3: ECs4729, 4: ECs0452, 5: ECs5186, 6: ECs2057, 7: ECs3524), <italic>P.aeruginosa</italic> (2a: PA3000, 2b: PA0866, 4a: PA5097, 4b: PA0789, 7: PA0129, out: PA2079), <italic>Salmonella typhimurium</italic> LT2 (1: STM0568, 2: STM0150, 3: STM3930, 4: STM0400, 5: STM4398, 6: STM1584, 7: STM2793), <italic>Y.pestis</italic> (2a: YPO3421, 2b: YPO1743, 3: YPO3854, 4a: YPO3201, 4b: YPO4015, 5: YPO1859, 6: YPO1937).</p></caption><graphic xlink:href="gkl433f6"/></fig><table-wrap id="tbl1" position="float"><label>Table 1</label><caption><p>Results of the algorithm on the COGs database</p></caption><table frame="hsides" rules="groups"><thead><tr><th rowspan="1" colspan="1"/><th align="center" rowspan="1" colspan="1">#</th><th align="left" rowspan="1" colspan="1">%</th></tr></thead><tbody><tr><td align="left" rowspan="1" colspan="1">Pairs with score below threshold, not tested</td><td align="right" rowspan="1" colspan="1">1&#x02009;021&#x02009;764</td><td align="left" rowspan="1" colspan="1">44.0</td></tr><tr><td align="left" rowspan="1" colspan="1">Pairs with score above threshold</td><td align="right" rowspan="1" colspan="1">1&#x02009;298&#x02009;435</td><td align="left" rowspan="1" colspan="1">56.0</td></tr><tr><td align="left" rowspan="1" colspan="1">Non-orthologous pairs</td><td align="right" rowspan="1" colspan="1">360&#x02009;856</td><td align="left" rowspan="1" colspan="1">27.8</td></tr><tr><td align="left" rowspan="1" colspan="1">Orthologous pairs</td><td align="right" rowspan="1" 
colspan="1">937&#x02009;579</td><td align="left" rowspan="1" colspan="1">72.2</td></tr><tr><td align="left" rowspan="1" colspan="1">COG groups with non-orthology</td><td align="right" rowspan="1" colspan="1">1604</td><td align="left" rowspan="1" colspan="1">34.5</td></tr><tr><td align="left" rowspan="1" colspan="1">COG groups without non-orthology</td><td align="right" rowspan="1" colspan="1">3050</td><td align="left" rowspan="1" colspan="1">65.5</td></tr></tbody></table></table-wrap><table-wrap id="tbl2" position="float"><label>Table 2</label><caption><p>Predicted non-orthologous relations for the data shown in <xref ref-type="fig" rid="fig4">Figure 4</xref></p></caption><table frame="hsides" rules="groups"><thead><tr><th rowspan="1" colspan="1"/><th align="left" rowspan="1" colspan="1">Predicted non-orthologs</th><th align="left" rowspan="1" colspan="1">Pair of witnesses</th></tr></thead><tbody><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>Buchnera sp. 
1</italic></td><td align="left" rowspan="1" colspan="1"><italic>P.aeruginosa 2 + 1</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>E.coli H7 1</italic></td><td align="left" rowspan="1" colspan="1"><italic>P.aeruginosa 2 + 1</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>E.coli K12 acef</italic></td><td align="left" rowspan="1" colspan="1"><italic>P.aeruginosa 2 + 1</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>H.influenzae 1</italic></td><td align="left" rowspan="1" colspan="1"><italic>P.aeruginosa 2 + 1</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>Neisseria meningitidis 1</italic></td><td align="left" rowspan="1" colspan="1"><italic>P.aeruginosa 2 + 1</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>P.multocida 1</italic></td><td align="left" rowspan="1" colspan="1"><italic>P.aeruginosa 2 + 1</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>R.loti 4a</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 2 + 4</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>R.meliloti 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 2 + 4</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>V.cholerae 1</italic></td><td align="left" 
rowspan="1" colspan="1"><italic>P.aeruginosa 2 + 1</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>Y.pestis 1</italic></td><td align="left" rowspan="1" colspan="1"><italic>P.aeruginosa 2 + 1</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>Buchnera sp. 3 + 1</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>Buchnera sp. 1</italic></td><td align="left" rowspan="1" colspan="1"><italic>P.aeruginosa 3 + 1</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>P.aeruginosa 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 3 + 2</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>P.aeruginosa 1</italic></td><td align="left" rowspan="1" colspan="1"><italic>Buchnera sp. 3 + 1</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>P.multocida 1</italic></td><td align="left" rowspan="1" colspan="1"><italic>Buchnera sp. 
3 + 1</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>R.conorii 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 3 + 4</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>R.loti 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 3 + 2</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>R.loti 4a</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 3 + 4</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>R.meliloti 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 3 + 2</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>R.meliloti 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 3 + 4</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>R.prowazekii 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 3 + 4</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>R.loti 4a + 2</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>Buchnera sp. 
1</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 4 + 2</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>E.coli H7 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 4 + 3</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>E.coli K12 sucB</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 4 + 3</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>H.influenzae 1</italic></td><td align="left" rowspan="1" colspan="1"><italic>R.loti 4a + 2</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>H.influenzae 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 4 + 3</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>N.meningitidis 1</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 4 + 2</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>N.meningitidis 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 4 + 3</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>P.aeruginosa 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 4 + 2</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>P.aeruginosa 3</italic></td><td align="left" 
rowspan="1" colspan="1"><italic>B.melitensis 4 + 3</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>P.multocida 1</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 4 + 2</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>P.multocida 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 4 + 3</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>R.conorii 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 4 + 3</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>R.loti 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 4 + 2</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>R.loti 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 4 + 3</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>R.meliloti 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 4 + 2</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>R.meliloti 3a</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 4 + 3</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>R.prowazekii 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 4 + 
3</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>V.cholerae 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 4 + 3</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>Y.pestis 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 4 + 3</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>E.coli H7 1</italic></td><td align="left" rowspan="1" colspan="1"><italic>P.aeruginosa 2 + 1</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>E.coli H7 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 2 + 3</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>E.coli K12 acef</italic></td><td align="left" rowspan="1" colspan="1"><italic>P.aeruginosa 2 + 1</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>E.coli K12 sucB</italic></td><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 2 + 3</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>H.influenzae 1</italic></td><td align="left" rowspan="1" colspan="1"><italic>P.aeruginosa 2 + 1</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>H.influenzae 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 2 + 3</italic></td></tr><tr><td align="left" rowspan="1" 
colspan="1"><italic>B.melitensis 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>N.meningitidis 1</italic></td><td align="left" rowspan="1" colspan="1"><italic>P.aeruginosa 2 + 1</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>N.meningitidis 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 2 + 3</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>P.aeruginosa 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 2 + 3</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>P.multocida 1</italic></td><td align="left" rowspan="1" colspan="1"><italic>P.aeruginosa 2 + 1</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>P.multocida 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 2 + 3</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>R.conorii 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 2 + 3</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>R.conorii 4</italic></td><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 2 + 4</italic></td></tr><tr><td align="left" rowspan="1" colspan="1"><italic>B.melitensis 2</italic></td><td align="left" rowspan="1" colspan="1"><italic>R.loti 3</italic></td><td align="left" rowspan="1" colspan="1"><italic>A.tumefaciens 2 + 3</italic></td></tr></tbody></table><table-wrap-foot><fn><p>The sequences in the first two columns 
are predicted to be non-orthologous by the pair of witnesses in the third column.</p></fn></table-wrap-foot></table-wrap></sec></back></article>