Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Suggestion] Include a way to wash and sort lineage output from MMSeqs2 #361

Open
Prunoideae opened this issue Sep 14, 2024 · 0 comments
Open

Comments

@Prunoideae
Copy link

Currently, I'm using mmseqs taxonomy and mmseqs create_tsv to make draft taxonomy predictions to run taxometer, however, I noticed that the lineage order from mmseqs usually violates vamb's requirement. My MMSeqs database is built from NCBI nt, so it would be a problem for many other people if they're unable to use taxonomy from similar databases directly.

A good strategy is to get the taxonomic ranks from all names output by mmseqs, then find the most specific one and use the lineage of that specific tax id as the taxonomic prediction. I think the mmseqs output is not directly usable because there are parallel taxonomic labels (e.g. environmental samples), which might be unordered when being printed out with the lineage. So replacing it with the lineage of the most specific taxa will solve the problem.

I've attached a script that implements the strategy I mentioned:

import pandas as pd
from ete3 import NCBITaxa

ncbi = NCBITaxa()

# Load the data
df = pd.read_csv(
    "input.taxa.tsv",
    sep="\t",
    header=None,
    names=["contigs", "unknown", "rank", "last_taxa", "predictions"],
)
df["predictions"] = df["predictions"].astype(str)

# validate the data
from tqdm import tqdm

progress = tqdm(total=len(df))

# a list of ncbi taxonomic ranks
# need to encode the ranks so we don't need to sort out the
# topology (which is slow like hell)
taxa_ranks = [
    "superkingdom",
    "kingdom",
    "subkingdom",
    "superphylum",
    "phylum",
    "subphylum",
    "superclass",
    "class",
    "subclass",
    "infraclass",
    "cohort",
    "subcohort",
    "superorder",
    "order",
    "suborder",
    "infraorder",
    "parvorder",
    "superfamily",
    "family",
    "subfamily",
    "tribe",
    "subtribe",
    "genus",
    "subgenus",
    "species group",
    "species subgroup",
    "species",
    "subspecies",
    "varietas",
    "forma",
]

lineage_cache: dict[int, list[int]] = {}

# despite we have last_taxa here I don't trust it will always output tax id
# need to prevent mismatch and name changing due to old nt vs new ete3 taxa db
def sort_predictions(predictions: str) -> str:
    global lineage_cache
    if predictions == "nan" or not predictions:
        return "1" # root
    predictions = predictions.split(";")
    parsed_names: list[str] = []

    for prediction in predictions:
        if prediction[1] == "_":
            prediction = prediction[2:] # no x_ prefix
        parsed_names.append(prediction)

    translated_ids = [v[0] for v in ncbi.get_name_translator(parsed_names).values()]
    ranks = ncbi.get_rank(translated_ids)
    # get the rank that is most specific
    most_specific_id = None
    most_specific_index = -1
    for id, rank in ranks.items():
        if rank in taxa_ranks:
            index = taxa_ranks.index(rank)
            if index > most_specific_index:
                most_specific_index = index
                most_specific_id = id
    # tax id's rank is only "no rank"
    if most_specific_id is None: 
        progress.update(1)
        return "1"

    if most_specific_id not in lineage_cache:
        lineage = ncbi.get_lineage(most_specific_id)
        lineage_cache[most_specific_id] = lineage
    lineage = lineage_cache.get(most_specific_id)

    progress.update(1)
    return ";".join([str(taxid) for taxid in lineage])


df["predictions"] = df["predictions"].apply(sort_predictions)

# no empty columns in the predictions column
assert not df["predictions"].isnull().values.any()

# retain only the contigs and predictions columns
df = df[["contigs", "predictions"]]

# Write to output.taxa.tsv
df.to_csv("output.taxa.tsv", sep="\t", index=False)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant