Currently, I'm using `mmseqs taxonomy` and `mmseqs createtsv` to make draft taxonomy predictions for running `taxometer`. However, I noticed that the lineage order from `mmseqs` usually violates `vamb`'s requirement. My MMseqs2 database is built from NCBI nt, so many other people will hit the same problem if taxonomy from similar databases can't be used directly.
A good strategy is to look up the taxonomic rank of every name in the `mmseqs` lineage, pick the most specific one, and use the lineage of that tax id as the taxonomic prediction. I think the `mmseqs` output is not directly usable because it contains parallel taxonomic labels (e.g. `environmental samples`) that can appear out of order when printed with the lineage. Replacing the output with the lineage of the most specific taxon should solve the problem.
I've attached a script that implements the strategy I mentioned:
import pandas as pd
from ete3 import NCBITaxa

ncbi = NCBITaxa()

# Load the data
df = pd.read_csv(
    "input.taxa.tsv",
    sep="\t",
    header=None,
    names=["contigs", "unknown", "rank", "last_taxa", "predictions"],
)
df["predictions"] = df["predictions"].astype(str)

# validate the data
from tqdm import tqdm

progress = tqdm(total=len(df))
# a list of ncbi taxonomic ranks
# need to encode the ranks so we don't need to sort out the
# topology (which is very slow)
taxa_ranks = [
    "superkingdom",
    "kingdom",
    "subkingdom",
    "superphylum",
    "phylum",
    "subphylum",
    "superclass",
    "class",
    "subclass",
    "infraclass",
    "cohort",
    "subcohort",
    "superorder",
    "order",
    "suborder",
    "infraorder",
    "parvorder",
    "superfamily",
    "family",
    "subfamily",
    "tribe",
    "subtribe",
    "genus",
    "subgenus",
    "species group",
    "species subgroup",
    "species",
    "subspecies",
    "varietas",
    "forma",
]
lineage_cache: dict[int, list[int]] = {}


# although we have last_taxa here, I don't trust it to always output a tax id
# need to prevent mismatches and name changes due to old nt vs new ete3 taxa db
def sort_predictions(predictions: str) -> str:
    # count this row as processed on every return path
    progress.update(1)
    if predictions == "nan" or not predictions:
        return "1"  # root
    parsed_names: list[str] = []
    for prediction in predictions.split(";"):
        # guard the length so one-character labels don't raise IndexError
        if len(prediction) > 1 and prediction[1] == "_":
            prediction = prediction[2:]  # no x_ prefix
        parsed_names.append(prediction)
    translated_ids = [v[0] for v in ncbi.get_name_translator(parsed_names).values()]
    ranks = ncbi.get_rank(translated_ids)
    # get the rank that is most specific
    most_specific_id = None
    most_specific_index = -1
    for taxid, rank in ranks.items():
        if rank in taxa_ranks:
            index = taxa_ranks.index(rank)
            if index > most_specific_index:
                most_specific_index = index
                most_specific_id = taxid
    # every tax id's rank is "no rank" (or no name could be translated)
    if most_specific_id is None:
        return "1"
    if most_specific_id not in lineage_cache:
        lineage_cache[most_specific_id] = ncbi.get_lineage(most_specific_id)
    lineage = lineage_cache[most_specific_id]
    return ";".join(str(t) for t in lineage)


df["predictions"] = df["predictions"].apply(sort_predictions)
progress.close()
# no empty values in the predictions column
assert not df["predictions"].isnull().values.any()
# retain only the contigs and predictions columns
df = df[["contigs", "predictions"]]
# Write to output.taxa.tsv
df.to_csv("output.taxa.tsv", sep="\t", index=False)
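
For anyone who wants to check the rank-selection step without downloading the NCBI taxonomy database, here is a minimal sketch of the same idea on toy data. The names, tax ids, and ranks in the two dictionaries are made up for illustration (they are not real NCBI records) and stand in for `ncbi.get_name_translator` and `ncbi.get_rank`. The helper names `strip_prefix` and `most_specific_taxid` are hypothetical, and the shortened rank list is only there to keep the example small. It also assumes lineage labels look like `d_Bacteria`, as the prefix stripping in the script above implies.

```python
from typing import Optional

# Shortened rank order, least to most specific (illustrative subset only).
RANK_ORDER = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]

# Toy stand-ins for the ete3 lookups; these are NOT real NCBI records.
TOY_NAME_TO_TAXID = {
    "Bacteria": 10,
    "Examplephylum": 20,
    "environmental samples": 30,
    "Examplegenus": 40,
}
TOY_TAXID_TO_RANK = {10: "superkingdom", 20: "phylum", 30: "no rank", 40: "genus"}


def strip_prefix(label: str) -> str:
    """Drop a one-character rank prefix such as 'd_' or 'p_' if present."""
    if len(label) > 1 and label[1] == "_":
        return label[2:]
    return label


def most_specific_taxid(lineage: str) -> Optional[int]:
    """Return the tax id with the deepest known rank, ignoring label order."""
    best_taxid, best_index = None, -1
    for label in lineage.split(";"):
        taxid = TOY_NAME_TO_TAXID.get(strip_prefix(label))
        rank = TOY_TAXID_TO_RANK.get(taxid)
        # unranked labels ("no rank") and unknown names are skipped
        if rank in RANK_ORDER:
            index = RANK_ORDER.index(rank)
            if index > best_index:
                best_taxid, best_index = taxid, index
    return best_taxid


# The genus is picked even though it is printed before the phylum.
shuffled = "d_Bacteria;g_Examplegenus;-_environmental samples;p_Examplephylum"
picked = most_specific_taxid(shuffled)
```

Because the choice depends only on each label's rank and not on its position in the string, an out-of-order lineage gives the same result as an ordered one. In the real script, the lineage of the picked tax id then replaces the original prediction.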