From e9ddcd4cc7453fc67f79569aa8af8130b21c208e Mon Sep 17 00:00:00 2001 From: planemo-autoupdate <71008622+planemo-autoupdate@users.noreply.github.com> Date: Tue, 19 Apr 2022 11:09:44 +0200 Subject: [PATCH] Update pangolin to version 4 (#4494) --- tools/pangolin/pangolin.xml | 292 ++++++++++++++++++++++++++++++------ 1 file changed, 246 insertions(+), 46 deletions(-) diff --git a/tools/pangolin/pangolin.xml b/tools/pangolin/pangolin.xml index d378fb1bd79..c329d5e5e85 100644 --- a/tools/pangolin/pangolin.xml +++ b/tools/pangolin/pangolin.xml @@ -1,55 +1,65 @@ Phylogenetic Assignment of Outbreak Lineages - 3.1.20 + 4.0.5 pangolin scorpio csvtk + '$output1' - #if $alignment - && mv sequences.aln.fasta '$align1' - #end if ]]> - - - - - - + + + + + + + + + + + - + - - + + @@ -63,18 +73,31 @@ - + + - - - + + + + + + + - + + + + + + + + + @@ -84,42 +107,55 @@ + - + - + - - - + - + + - + @@ -133,41 +169,205 @@ + + + + + + + + + + + + + + + + + + + + - + + + - + + + `_ +(Phylogenetic Assignment of Named Global Outbreak LINeages) is used to assign a +SARS-CoV-2 genome sequence the most likely lineage based on the PANGO +nomenclature system. -`Pangolin `_ (Phylogenetic Assignment of Named Global Outbreak LINeages) -is used to assign a SARS-CoV-2 genome sequence the most likely lineage based on the PANGO nomenclature system. -Pangolin uses the `pangoLEARN `_ stored model for lineage assignment. This -model is updated more frequently than the pangolin tool is. In general one should use the most recent model for lineage -assignment, and the default option for this tool is to download the latest version of the model before the pangolin -tool runs. A pangoLEARN data manager exists so that the Galaxy admin can download specific versions of the pangoLEARN -model as required. Finally the pangolin tool can use its default built-in model, but this is **not recommended** as the -default model rapidly becomes out of date. 
+**Data sources/versioning**
+
+Pangolin uses the
+`pangolin-data `_ repository as
+a source of its required model, protobuf, designation hash and alias files, and
+the `constellations `_
+repository for `scorpio `_ -based
+assignment of lineages of concern.
+The tool ships with a copy of this data, but the data gets updated more
+frequently than the tool! In general one should use the most recent model for
+lineage assignment, and the default option for this tool is to download the
+latest versions of pangolin-data and constellations before embarking on
+analysis.
+A pangoLEARN data manager exists so that the Galaxy admin can download specific
+versions of the pangolin-data/constellations as required. Finally the pangolin
+tool can use its default built-in data packages, but this is
+**not recommended** as it will almost certainly be out of date.
+
+.. class:: infomark
+
+    The exact combination of pangolin, inference engine (UShER/pangoLEARN),
+    scorpio, and data packages used for a particular run of the tool can be
+    extracted from the four "version" columns in the output (see below for
+    details).
 .. class:: warningmark
-    The "Download latest from web" updates the pangolin database but not the pangolin (and scorpio) software. If
-    the database format changes this can cause the pangolin job and the tool to fail. The solution to this kind of
-    failure is to update the pangolin tool installed in the Galaxy server.
+    The "Download latest from web" updates the *pangolin-data* and
+    *constellations* packages but not the software (pangolin and scorpio) using
+    these data packages.
+    If the data package format changes upstream, this can cause the tool run to
+    fail. Cached data packages (or, in the worst case, the built-in data) can
+    serve as a fallback until switching to an updated pangolin tool
+    version.
+
+
+**Output**
+
+The main output of the tool is a tabular file with one line per input sequence
+and with columns providing the
+`following information `_:
+
+taxon:
+    The name of the input sequence
+
+lineage:
+    The most likely lineage assigned to a given sequence based on the inference
+    engine used and the SARS-CoV-2 diversity designated.
+    This assignment is sensitive to missing data at key sites.
+
+conflict:
+    In the pangoLEARN model, a given sequence gets assigned to the most likely
+    category based on known diversity.
+    If a sequence can fit into more than one category, the conflict score will
+    be greater than 0 and reflect the number of categories the sequence could
+    fit into.
+    If the conflict score is 0, this means that within the current decision
+    tree there is only one category that the sequence could be assigned to.
+
+ambiguity_score:
+    This score is a function of the quantity of missing data in a sequence.
+    It represents the proportion of relevant sites in a sequence which were
+    imputed to the reference values.
+    A score of 1 indicates that no sites were imputed, while a score of 0
+    indicates that more sites were imputed than were not imputed.
+    This score only includes sites which are used by the decision tree to
+    classify a sequence.
+
+scorpio_call:
+    If a query is assigned a constellation by scorpio this call is output in
+    this column.
+    The full set of constellations searched by default can be found at the
+    constellations repository.
+
+scorpio_support:
+    The support score is the proportion of defining variants which have the
+    alternative allele in the sequence.
+
+scorpio_conflict:
+    The conflict score is the proportion of defining variants which have the
+    reference allele in the sequence. Ambiguous/other non-ref/alt bases at each
+    of the variant positions contribute only to the denominators of these
+    scores.
+
+scorpio_notes:
+    A notes column specific to the scorpio output.
+
+version:
+    A version number that represents both the inference method and the
+    pangolin-data version number, which as of pangolin 4.0 corresponds to the
+    pango-designation version used to prepare the inference files. For example:
+
+    PANGO-1.2 indicates an identical sequence has been previously designated
+    this lineage, and so has gone through manual curation.
+    The number 1.2 indicates the version of pango-designation that this
+    assignment is based on. These hashes and pango-designation version are
+    bundled with the pangoLEARN and UShER models.
+
+    PLEARN-1.2 indicates that this sequence is different from any previously
+    designated and that the pangoLEARN model was used as an inference engine to
+    predict the most likely lineage based on the given version of
+    pango-designation upon which the pangoLEARN model was trained.
+
+    PUSHER-1.2 indicates that this sequence is different from any previously
+    designated and that UShER was used as an inference engine with fast tree
+    placement and parsimony-based lineage assignment, based on a guide tree
+    (protobuf) file built from the data in a given pango-designation release
+    version.
+
+pangolin_version:
+    The version of pangolin software running.
+
+scorpio_version:
+    The version of the scorpio software installed.
+
+constellation_version:
+    The version of constellations that scorpio has used to curate the lineage
+    assignment.
+
+is_designated:
+    A boolean (True/False) column indicating whether that particular sequence
+    has been officially designated a lineage.
+
+qc_status:
+    Indicates whether the sequence passed the QC thresholds for minimum length
+    and maximum N content.
+
+qc_notes:
+    Notes specific to the QC checks run on the sequences.
+
+note:
+    If there are any conflicts from the decision tree, this field will output
+    the alternative assignments. If the sequence failed QC this field will
+    describe why.
+    If the sequence met the SNP thresholds for scorpio to call a constellation,
+    it’ll describe the exact SNP counts of Alt, Ref and Amb (Alternative,
+    reference and ambiguous) alleles for that call.
 ]]>
 10.1093/ve/veab064
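
The `version` column scheme and the scorpio support/conflict scores described in the help text can be sketched in Python as below. This is only an illustration of the documented conventions, not code from pangolin, scorpio, or the Galaxy wrapper: the names `ENGINES`, `VersionInfo`, `parse_version` and `scorpio_scores` are hypothetical, and the sketch assumes the value is always a known prefix, a hyphen, and a data version, and that the two scores share one denominator that includes ambiguous bases.

```python
from typing import NamedTuple, Tuple

# Prefix -> meaning, taken from the "version" column description in the help.
ENGINES = {
    "PANGO": "designation hash match",
    "PLEARN": "pangoLEARN inference",
    "PUSHER": "UShER inference",
}


class VersionInfo(NamedTuple):
    """Hypothetical container for a parsed "version" value."""
    method: str
    data_version: str


def parse_version(value: str) -> VersionInfo:
    """Split a value such as 'PUSHER-1.2' into method and data version."""
    # Split on the first hyphen only, so the data version may contain hyphens.
    prefix, sep, data_version = value.strip().partition("-")
    if not sep or not data_version or prefix not in ENGINES:
        raise ValueError(f"unrecognised version value: {value!r}")
    return VersionInfo(ENGINES[prefix], data_version)


def scorpio_scores(alt: int, ref: int, amb: int) -> Tuple[float, float]:
    """Return (support, conflict) from Alt/Ref/Amb allele counts.

    Assumption: ambiguous/other bases count only towards the denominator,
    as the scorpio_conflict description states.
    """
    total = alt + ref + amb
    if total == 0:
        raise ValueError("no defining variants to score")
    return alt / total, ref / total


if __name__ == "__main__":
    print(parse_version("PUSHER-1.2"))
    print(scorpio_scores(alt=8, ref=1, amb=1))
```

Keeping the prefix table separate from the parsing logic means a new inference method would need only one extra dictionary entry, which seems a reasonable design given that the help text lists the prefixes as examples.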