[Suggestion] Include a way to wash and sort lineage output from MMSeqs2 #361

Prunoideae · 2024-09-14T12:32:42Z

Currently, I'm using mmseqs taxonomy and mmseqs create_tsv to make draft taxonomy predictions to run taxometer, however, I noticed that the lineage order from mmseqs usually violates vamb's requirement. My MMSeqs database is built from NCBI nt, so it would be a problem for many other people if they're unable to use taxonomy from similar databases directly.

A good strategy is to get the taxonomic ranks from all names output by mmseqs, then find the most specific one and use the lineage of that specific tax id as the taxonomic prediction. I think the mmseqs output is not directly usable because there are parallel taxonomic labels (e.g. environmental samples), which might be unordered when being printed out with the lineage. So replacing it with the lineage of the most specific taxa will solve the problem.

I've attached a script that implements the strategy I mentioned:

import pandas as pd
from ete3 import NCBITaxa

ncbi = NCBITaxa()

# Load the data
df = pd.read_csv(
    "input.taxa.tsv",
    sep="\t",
    header=None,
    names=["contigs", "unknown", "rank", "last_taxa", "predictions"],
)
df["predictions"] = df["predictions"].astype(str)

# validate the data
from tqdm import tqdm

progress = tqdm(total=len(df))

# a list of ncbi taxonomic ranks
# need to encode the ranks so we don't need to sort out the
# topology (which is slow like hell)
taxa_ranks = [
    "superkingdom",
    "kingdom",
    "subkingdom",
    "superphylum",
    "phylum",
    "subphylum",
    "superclass",
    "class",
    "subclass",
    "infraclass",
    "cohort",
    "subcohort",
    "superorder",
    "order",
    "suborder",
    "infraorder",
    "parvorder",
    "superfamily",
    "family",
    "subfamily",
    "tribe",
    "subtribe",
    "genus",
    "subgenus",
    "species group",
    "species subgroup",
    "species",
    "subspecies",
    "varietas",
    "forma",
]

lineage_cache: dict[int, list[int]] = {}

# despite we have last_taxa here I don't trust it will always output tax id
# need to prevent mismatch and name changing due to old nt vs new ete3 taxa db
def sort_predictions(predictions: str) -> str:
    global lineage_cache
    if predictions == "nan" or not predictions:
        return "1" # root
    predictions = predictions.split(";")
    parsed_names: list[str] = []

    for prediction in predictions:
        if prediction[1] == "_":
            prediction = prediction[2:] # no x_ prefix
        parsed_names.append(prediction)

    translated_ids = [v[0] for v in ncbi.get_name_translator(parsed_names).values()]
    ranks = ncbi.get_rank(translated_ids)
    # get the rank that is most specific
    most_specific_id = None
    most_specific_index = -1
    for id, rank in ranks.items():
        if rank in taxa_ranks:
            index = taxa_ranks.index(rank)
            if index > most_specific_index:
                most_specific_index = index
                most_specific_id = id
    # tax id's rank is only "no rank"
    if most_specific_id is None: 
        progress.update(1)
        return "1"

    if most_specific_id not in lineage_cache:
        lineage = ncbi.get_lineage(most_specific_id)
        lineage_cache[most_specific_id] = lineage
    lineage = lineage_cache.get(most_specific_id)

    progress.update(1)
    return ";".join([str(taxid) for taxid in lineage])


df["predictions"] = df["predictions"].apply(sort_predictions)

# no empty columns in the predictions column
assert not df["predictions"].isnull().values.any()

# retain only the contigs and predictions columns
df = df[["contigs", "predictions"]]

# Write to output.taxa.tsv
df.to_csv("output.taxa.tsv", sep="\t", index=False)

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Suggestion] Include a way to wash and sort lineage output from MMSeqs2 #361

[Suggestion] Include a way to wash and sort lineage output from MMSeqs2 #361

Prunoideae commented Sep 14, 2024

[Suggestion] Include a way to wash and sort lineage output from MMSeqs2 #361

[Suggestion] Include a way to wash and sort lineage output from MMSeqs2 #361

Comments

Prunoideae commented Sep 14, 2024