Merge pull request #77 from pha4ge/update_parsers

Update parsers
pha4ge · Aug 21, 2022 · d40f2f9 · d40f2f9
2 parents 849b98e + ad6f193
commit d40f2f9
Show file tree

Hide file tree

Showing 69 changed files with 1,265 additions and 1,277 deletions.
diff --git a/README.md b/README.md
@@ -5,18 +5,18 @@
 This repo contains the hAMRonization module and CLI parser tools combine the outputs of 
 disparate antimicrobial resistance gene detection tools into a single unified format.
 
-This is an implementation of the [hAMRonization AMR detection specification scheme](docs/hAMRonization_specification_details.csv).
+This is an implementation of the [hAMRonization AMR detection specification scheme](docs/hAMRonization_specification_details.csv) which supports gene presence/absence resistance and mutational resistance (if supported by the underlying tool).
 
 This supports a variety of summary options including an [interactive summary](https://finlaymagui.re/assets/interactive_report_demo.html).
 
 ![hAMRonization overview](https://github.com/pha4ge/hAMRonization/blob/master/docs/overview_figure.png?raw=true)
 
+
 ## Installation
 
 This tool requires python>=3.7 and [pandas](https://pandas.pydata.org/)
 and the latest release can be installed directly from pip, conda, docker, this repository, or from the galaxy toolshed:
 
-```
 pip install hAMRonization
 ```
 [![PyPI version](https://badge.fury.io/py/hAMRonization.svg)](https://badge.fury.io/py/hAMRonization)
@@ -47,6 +47,10 @@ Alternatively, hAMRonization can also be installed and used in [galaxy](https://
 
 ## Usage
 
+**NOTE**: Only the output format used in the "last updated" version of the AMR prediction tool has been tested for accuracy. Older tool versions or updates which lead to a change in output format may not work. 
+In theory, this should only be a problem with major version changes but not all tools follow semantic versioning.
+If you encounter any issues with newer tool versions then please create an issue in this repository.
+
 ```
 >hamronize -h
 usage: hamronize <tool> <options>
@@ -58,13 +62,12 @@ optional arguments:
   -v, --version         show program's version number and exit
 
 Tools with hAMRonizable reports:
-  {abricate,amrfinderplus,ariba,rgi,resfinder,resfinder4,srax,deeparg,kmerresistance,srst2,staramr,csstar,amrplusplus,resfams,groot}
+  {abricate,amrfinderplus,ariba,rgi,resfinder,srax,deeparg,kmerresistance,srst2,staramr,csstar,amrplusplus,resfams,groot}
     abricate            hAMRonize abricate's output report i.e., OUTPUT.tsv
     amrfinderplus       hAMRonize amrfinderplus's output report i.e., OUTPUT.tsv
     ariba               hAMRonize ariba's output report i.e., OUTDIR/OUTPUT.tsv
     rgi                 hAMRonize rgi's output report i.e., OUTPUT.txt or OUTPUT_bwtoutput.gene_mapping_data.txt
-    resfinder           hAMRonize resfinder's output report i.e., data_resfinder.json
-    resfinder4          hAMRonize resfinder4's tabular output report i.e., ResFinder_results_tab.txt
+    resfinder          hAMRonize resfinder's tabular output report (includes pointfinder as of v4) i.e., ResFinder_results_tab.txt
     srax                hAMRonize srax's output report i.e., sraX_detected_ARGs.tsv
     deeparg             hAMRonize deeparg's output report i.e., OUTDIR/OUTPUT.mapping.ARG
     kmerresistance      hAMRonize kmerresistance's output report i.e., OUTPUT.KmerRes
@@ -173,44 +176,32 @@ If you want to write multiple reports to one file, this `.write` method can acce
 
 Parsers needing tested (both automated and just sanity checking output), see [test.sh](parsers/test.sh) for example invocations.
 `
-1. [abricate](hAMRonization/AbricateIO.py)
-2. [ariba](hAMRonization/AribaIO.py)
-3. [NCBI AMRFinderPlus](hAMRonization/AmrFinderPlusIO.py)
-4. [RGI](hAMRonization/RgiIO.py) (includes RGI-BWT)
-5. [resfinder](hAMRonization/ResFinderIO.py)
-6. [resfinder4](hAMRonization/ResFinder4IO.py)
-7. [sraX](hAMRonization/SraxIO.py)
-8. [deepARG](hAMRonization/DeepArgIO.py)
-9. [kmerresistance](hAMRonization/KmerResistanceIO.py) 
-10. [srst2](hAMRonization/Srst2IO.py)
-11. [staramr](hAMRonization/StarAmrIO.py)
-12. [c-sstar](hAMRonization/CSStarIO.py)
-13. [amrplusplus](hAMRonization/AmrPlusPlusIO.py)
-14. [resfams](hAMRonization/ResFamsIO.py)
-15. [groot](hAMRonization/GrootIO.py)
-
-Parsers excluded as needing variant specification to implement:
-1. [mykrobe](test/data/raw_outputs/mykrobe/report.json)
-2. [pointfinder](test/data/raw_outputs/pointfinder/report.tsv)
+1. [abricate](hAMRonization/AbricateIO.py): last updated for v1.0.0
+2. [ariba](hAMRonization/AribaIO.py): last updated for v2.14.6
+3. [NCBI AMRFinderPlus](hAMRonization/AmrFinderPlusIO.py): last updated for v3.10.40
+4. [RGI](hAMRonization/RgiIO.py) (includes RGI-BWT) last updated for v5.2.0
+5. [resfinder](hAMRonization/ResFinderIO.py): last updated for v4.1.0
+6. [pointfinder](hAMRonization/PointFinderIO.py): last updated for v4.1.0
+7. [sraX](hAMRonization/SraxIO.py): last updated for v1.5
+8. [deepARG](hAMRonization/DeepArgIO.py): last updated for v1.0.2
+9. [kmerresistance](hAMRonization/KmerResistanceIO.py): late updated for v2.2.0
+10. [srst2](hAMRonization/Srst2IO.py): last updated for v0.2.0
+11. [staramr](hAMRonization/StarAmrIO.py): last updated for v0.8.0
+12. [c-sstar](hAMRonization/CSStarIO.py): last updated for v2.1.0
+13. [amrplusplus](hAMRonization/AmrPlusPlusIO.py): last updated for c6b097a
+14. [resfams](hAMRonization/ResFamsIO.py): last updated for hmmer v3.3.2
+15. [groot](hAMRonization/GrootIO.py): last updated for v1.1.2
+16. [mykrobe](test/data/raw_outputs/mykrobe/report.json): last updated for v0.8.1
+17. [tbprofilder](test/data/raw_outputs/tbprofiler/tbprofiler.json): last updated for v3.0.8
 
 ### Issues
 
-#### Coding
-
-- sanity checks need done
-
-- automated tests need added
-
-- output to file options (with appending and check headers) should be added
-
 #### Specification
 
 - mandatory fields: `gene_symbol` and `gene_name` are confusing and not usually both present (only consistently used in AFP). Means tools either need 1:2 mapping i.e. single output field maps to both `gene_symbol` and `gene_name` OR have fragile text splitting of single field that won't be robust to databases changes.  Current solution is 1:2 mapping e.g. staramr
 
 - inconsistent nomenclature of terms being used in specification fields: target, query, subject, reference. Need to stick to one name for sequence with which the database is being searched, and one the hit that results from that search.
 
-- variant specification needed to fully exploit ariba (or make mykrobe and pointfinder worth implementing): *discard these tools for now*
-
 - `sequence_identity`: is sequence type specific %id amino acids != %id nucleotide but does this matter?
 
 - `coverage_depth` seems to include both tool fields that are average depth of read and just plain overall read-count, 

diff --git a/hAMRonization/AbricateIO.py b/hAMRonization/AbricateIO.py
@@ -2,6 +2,7 @@
 
 import csv
 from .Interfaces import hAMRonizedResultIterator
+from hAMRonization.constants import GENE_PRESENCE
 
 required_metadata = ['analysis_software_version',
                      'reference_database_version']
@@ -11,6 +12,7 @@ class AbricateIterator(hAMRonizedResultIterator):
 
     def __init__(self, source, metadata):
         metadata['analysis_software_name'] = 'abricate'
+        metadata['genetic_variation_type'] = GENE_PRESENCE
         self.metadata = metadata
 
         self.field_mapping = {
@@ -24,7 +26,7 @@ def __init__(self, source, metadata):
                 '%COVERAGE': 'coverage_percentage',
                 'COVERAGE': None,
                 '%IDENTITY': 'sequence_identity',
-                'DATABASE': 'reference_database_id',
+                'DATABASE': 'reference_database_name',
                 'ACCESSION': 'reference_accession',
                 'RESISTANCE': 'drug_class',
                 'COVERAGE_MAP': None,

diff --git a/hAMRonization/AmrFinderPlusIO.py b/hAMRonization/AmrFinderPlusIO.py
@@ -1,8 +1,10 @@
 #!/usr/bin/env python
 
 import csv
+import re
 import warnings
 from .Interfaces import hAMRonizedResultIterator
+from hAMRonization.constants import NUCLEOTIDE_VARIANT, AMINO_ACID_VARIANT, GENE_PRESENCE
 
 required_metadata = ['analysis_software_version',
                      'reference_database_version',
@@ -13,14 +15,12 @@ class AmrFinderPlusIterator(hAMRonizedResultIterator):
 
     def __init__(self, source, metadata):
         metadata['analysis_software_name'] = 'amrfinderplus'
-        metadata['reference_database_id'] = 'NCBI Reference Gene Database'
+        metadata['reference_database_name'] = 'NCBI Reference Gene Database'
         self.metadata = metadata
 
         # check source for whether AMFP has been run in protein or nt mode
-        with open(source) as fh:
-            header = next(fh)
-            if 'Contig id' in header.strip().split('\t'):
-                self.field_mapping = {
+
+        nucleotide_field_mapping = {
                     'Protein identifier': None,
                     'Contig id': 'input_sequence_id',
                     'Start': 'input_gene_start',
@@ -42,11 +42,12 @@ def __init__(self, source, metadata):
                     'Accession of closest sequence': 'reference_accession',
                     'Name of closest sequence': None,
                     'HMM id': None,
-                    'HMM description': None
+                    'HMM description': None,
+                    'AA Mutation': 'amino_acid_mutation',
+                    'Nucleotide Mutation': 'nucleotide_mutation',
+                    'genetic_variation_type': 'genetic_variation_type'
                }
-            else:
-                self.field_mapping = {
-                    'Protein identifier': 'input_sequence_id',
+        protein_field_mapping = {'Protein identifier': 'input_sequence_id',
                     'Gene symbol': 'gene_symbol',
                     'Sequence name': 'gene_name',
                     'Scope': None,
@@ -64,29 +65,55 @@ def __init__(self, source, metadata):
                     'Name of closest sequence': None,
                     'HMM id': None,
                     'HMM description': None,
+                    'AA Mutation': 'amino_acid_mutation',
+                    'genetic_variation_type': 'genetic_variation_type'
                     }
 
+        with open(source) as fh:
+            header = next(fh)
+            try:
+                first_result = next(fh)
+                if first_result.strip().split('\t')[0] == 'NA':
+                    self.field_mapping = nucleotide_field_mapping
+                else:
+                    self.field_mapping = protein_field_mapping
+            except StopIteration:
+                # doesn't really matter which mapping as this error indicates
+                # this is an empty results file
+                self.field_mapping = nucleotide_field_mapping
+
         super().__init__(source, self.field_mapping, self.metadata)
 
+
     def parse(self, handle):
         """
         Read each and return it
         """
-        # skip any manually specified fields for later
-        skipped_mutational = 0
         reader = csv.DictReader(handle, delimiter='\t')
         for result in reader:
             # replace NA value with None for consitency
             for field, value in result.items():
                 if value == "NA":
                     result[field] = None
 
-            # "point" for mutational variants in this field, homolog are "AMR"
-            if result['Element subtype'] != 'AMR':
-                skipped_mutational += 1
+            # "POINT" indicates mutational resistance
+            # amrfinderplus has no special fields but the mutation itself is
+            # appended to the symbol name so we want to split this
+            result['AA Mutation'] = None
+            result['Nucleotide Mutation'] = None
+            result['genetic_variation_type'] = GENE_PRESENCE
 
-            yield self.hAMRonize(result, self.metadata)
+            if result['Element subtype'] == 'POINT':
+                gene_symbol, mutation = result["Gene symbol"].rsplit('_', 1)
+                result['Gene symbol'] = gene_symbol
+                _, ref, pos, alt, _ = re.split(r"(\D+)(\d+)(\D+)", mutation)
+                # this means it is a protein mutation
+                if result['Method'] in ['POINTX', 'POINTP']:
+                    result['AA Mutation'] = f"p.{ref}{pos}{alt}"
+                    result['genetic_variation_type'] = AMINO_ACID_VARIANT
+                elif result['Method'] == 'POINTN':
+                    # e.g., 23S_G2032G ampC_C-11C -> c.2032G>G
+                    result['Nucleotide Mutation'] = f"c.{pos}{ref}>{alt}"
+                    result['genetic_variation_type'] = NUCLEOTIDE_VARIANT
 
-        if skipped_mutational > 0:
-            warnings.warn(f"Skipping {skipped_mutational} mutational AMR "
-                          f"records from {self.metadata['input_file_name']}")
+            yield self.hAMRonize(result, self.metadata)
diff --git a/hAMRonization/AmrPlusPlusIO.py b/hAMRonization/AmrPlusPlusIO.py
@@ -2,7 +2,7 @@
 
 import csv
 from .Interfaces import hAMRonizedResultIterator
-
+from hAMRonization.constants import GENE_PRESENCE
 
 required_metadata = ['analysis_software_version',
                      'reference_database_version',
@@ -13,7 +13,9 @@ class AmrPlusPlusIterator(hAMRonizedResultIterator):
 
     def __init__(self, source, metadata):
         metadata['analysis_software_name'] = 'amrplusplus'
-        metadata['reference_database_id'] = 'megares'
+        metadata['reference_database_name'] = 'megares'
+        metadata['genetic_variation_type'] = GENE_PRESENCE
+
         self.metadata = metadata
         self.field_mapping = {
                 # Sample  Gene    Hits    Gene Fraction

diff --git a/hAMRonization/AribaIO.py b/hAMRonization/AribaIO.py
@@ -2,13 +2,13 @@
 
 import csv
 from .Interfaces import hAMRonizedResultIterator
+from hAMRonization.constants import NUCLEOTIDE_VARIANT, AMINO_ACID_VARIANT, GENE_PRESENCE
 
 required_metadata = ['analysis_software_version',
                      'reference_database_version',
-                     'reference_database_id',
+                     'reference_database_name',
                      'input_file_name']
 
-
 class AribaIterator(hAMRonizedResultIterator):
 
     def __init__(self, source, metadata):
@@ -40,13 +40,16 @@ def __init__(self, source, metadata):
             "ref_end": None,    # for variant
             "ref_nt": None,
             "ctg_start": None,
-            "ctig_end": None,
+            "ctg_end": None,
             "smtls_total_depth": None,
             "smtls_nts": None,
             "smtls_nts_depth": None,
             "var_description": None,
             "free_text": None,
-            '_gene_symbol': 'gene_symbol'
+            '_gene_symbol': 'gene_symbol',
+            '_nucleotide_mutation': 'nucleotide_mutation',
+            '_amino_acid_mutation': 'amino_acid_mutation',
+            '_genetic_variation_type': 'genetic_variation_type'
         }
         super().__init__(source, self.field_mapping, self.metadata)
 
@@ -59,4 +62,20 @@ def parse(self, handle):
         for result in reader:
             _gene_symbol = result['ref_name'].split(".")[0]
             result['_gene_symbol'] = _gene_symbol
+            # default valuej
+            result['_genetic_variation_type'] = GENE_PRESENCE
+            result['_nucleotide_mutation'] = None
+            result['_amino_acid_mutation'] = None
+
+            if str(result['known_var']) == '1' and str(result['has_known_var']) == '1':
+                if result['var_seq_type'] == 'n':
+                    _, ref, pos, alt, _ = re.split(r"(\D+)(\d+)(\D+)", result['known_var_change'])
+                    result['_genetic_variation_type'] = NUCLEOTIDE_VARIANT
+                    result['_nucleotide_mutation'] = f"n.{pos}{ref}>{alt}"
+                    result['_amino_acid_mutation'] = None
+                elif result['var_seq_type'] == 'p':
+                    result['_genetic_variation_type'] = AMINO_ACID_VARIANT
+                    result['_nucleotide_mutation'] = f"n.{result['ctg_start']}{result['ref_nt']}>{result['ctg_nt']}"
+                    result['_amino_acid_mutation'] = f"p.{result['known_var_change']}"
+
             yield self.hAMRonize(result, self.metadata)
diff --git a/hAMRonization/CSStarIO.py b/hAMRonization/CSStarIO.py
@@ -3,18 +3,19 @@
 import csv
 from collections import OrderedDict
 from .Interfaces import hAMRonizedResultIterator
-
+from hAMRonization.constants import GENE_PRESENCE
 
 required_metadata = ['analysis_software_version',
                      'reference_database_version',
                      'input_file_name',
-                     'reference_database_id']
+                     'reference_database_name']
 
 
 class CSStarIterator(hAMRonizedResultIterator):
 
     def __init__(self, source, metadata):
         metadata['analysis_software_name'] = 'csstar'
+        metadata['genetic_variation_type'] = GENE_PRESENCE
         self.metadata = metadata
         self.field_mapping = OrderedDict([
             (0, "gene_symbol"),

diff --git a/hAMRonization/DeepArgIO.py b/hAMRonization/DeepArgIO.py
@@ -2,6 +2,7 @@
 
 import csv
 from .Interfaces import hAMRonizedResultIterator
+from hAMRonization.constants import GENE_PRESENCE
 
 required_metadata = ['analysis_software_version',
                      'reference_database_version',
@@ -12,7 +13,8 @@ class DeepArgIterator(hAMRonizedResultIterator):
 
     def __init__(self, source, metadata):
         metadata['analysis_software_name'] = 'deeparg'
-        metadata['reference_database_id'] = 'deeparg_db'
+        metadata['reference_database_name'] = 'deeparg_db'
+        metadata['genetic_variation_type'] = GENE_PRESENCE
         self.metadata = metadata
 
         self.field_mapping = {
@@ -29,7 +31,7 @@ def __init__(self, source, metadata):
             'alignment-bitscore': None,
             'alignment-evalue': None,
             'counts': None,
-            # gather from splititng besthit
+            # gather from splitting besthit
             '_reference_accession': 'reference_accession'}
 
         super().__init__(source, self.field_mapping, self.metadata)

diff --git a/hAMRonization/GrootIO.py b/hAMRonization/GrootIO.py
@@ -3,9 +3,10 @@
 import csv
 from collections import OrderedDict
 from .Interfaces import hAMRonizedResultIterator
+from hAMRonization.constants import GENE_PRESENCE
 
 required_metadata = ['analysis_software_version',
-                     'reference_database_id',
+                     'reference_database_name',
                      'reference_database_version',
                      'input_file_name']
 
@@ -14,6 +15,7 @@ class GrootIterator(hAMRonizedResultIterator):
 
     def __init__(self, source, metadata):
         metadata['analysis_software_name'] = 'groot'
+        metadata['genetic_variation_type'] = GENE_PRESENCE
         self.metadata = metadata
 
         self.field_mapping = OrderedDict([