Skip to content

Commit

Permalink
Merge pull request #77 from pha4ge/update_parsers
Browse files Browse the repository at this point in the history
Update parsers
  • Loading branch information
fmaguire authored Aug 21, 2022
2 parents 849b98e + ad6f193 commit d40f2f9
Show file tree
Hide file tree
Showing 69 changed files with 1,265 additions and 1,277 deletions.
59 changes: 25 additions & 34 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,18 +5,18 @@
This repo contains the hAMRonization module and CLI parser tools combine the outputs of
disparate antimicrobial resistance gene detection tools into a single unified format.

This is an implementation of the [hAMRonization AMR detection specification scheme](docs/hAMRonization_specification_details.csv).
This is an implementation of the [hAMRonization AMR detection specification scheme](docs/hAMRonization_specification_details.csv) which supports gene presence/absence resistance and mutational resistance (if supported by the underlying tool).

This supports a variety of summary options including an [interactive summary](https://finlaymagui.re/assets/interactive_report_demo.html).

![hAMRonization overview](https://github.com/pha4ge/hAMRonization/blob/master/docs/overview_figure.png?raw=true)


## Installation

This tool requires python>=3.7 and [pandas](https://pandas.pydata.org/)
and the latest release can be installed directly from pip, conda, docker, this repository, or from the galaxy toolshed:

```
pip install hAMRonization
```
[![PyPI version](https://badge.fury.io/py/hAMRonization.svg)](https://badge.fury.io/py/hAMRonization)
Expand Down Expand Up @@ -47,6 +47,10 @@ Alternatively, hAMRonization can also be installed and used in [galaxy](https://
## Usage
**NOTE**: Only the output format used in the "last updated" version of the AMR prediction tool has been tested for accuracy. Older tool versions or updates which lead to a change in output format may not work.
In theory, this should only be a problem with major version changes but not all tools follow semantic versioning.
If you encounter any issues with newer tool versions then please create an issue in this repository.
```
>hamronize -h
usage: hamronize <tool> <options>
Expand All @@ -58,13 +62,12 @@ optional arguments:
-v, --version show program's version number and exit

Tools with hAMRonizable reports:
{abricate,amrfinderplus,ariba,rgi,resfinder,resfinder4,srax,deeparg,kmerresistance,srst2,staramr,csstar,amrplusplus,resfams,groot}
{abricate,amrfinderplus,ariba,rgi,resfinder,srax,deeparg,kmerresistance,srst2,staramr,csstar,amrplusplus,resfams,groot}
abricate hAMRonize abricate's output report i.e., OUTPUT.tsv
amrfinderplus hAMRonize amrfinderplus's output report i.e., OUTPUT.tsv
ariba hAMRonize ariba's output report i.e., OUTDIR/OUTPUT.tsv
rgi hAMRonize rgi's output report i.e., OUTPUT.txt or OUTPUT_bwtoutput.gene_mapping_data.txt
resfinder hAMRonize resfinder's output report i.e., data_resfinder.json
resfinder4 hAMRonize resfinder4's tabular output report i.e., ResFinder_results_tab.txt
resfinder hAMRonize resfinder's tabular output report (includes pointfinder as of v4) i.e., ResFinder_results_tab.txt
srax hAMRonize srax's output report i.e., sraX_detected_ARGs.tsv
deeparg hAMRonize deeparg's output report i.e., OUTDIR/OUTPUT.mapping.ARG
kmerresistance hAMRonize kmerresistance's output report i.e., OUTPUT.KmerRes
Expand Down Expand Up @@ -173,44 +176,32 @@ If you want to write multiple reports to one file, this `.write` method can acce
Parsers needing tested (both automated and just sanity checking output), see [test.sh](parsers/test.sh) for example invocations.
`
1. [abricate](hAMRonization/AbricateIO.py)
2. [ariba](hAMRonization/AribaIO.py)
3. [NCBI AMRFinderPlus](hAMRonization/AmrFinderPlusIO.py)
4. [RGI](hAMRonization/RgiIO.py) (includes RGI-BWT)
5. [resfinder](hAMRonization/ResFinderIO.py)
6. [resfinder4](hAMRonization/ResFinder4IO.py)
7. [sraX](hAMRonization/SraxIO.py)
8. [deepARG](hAMRonization/DeepArgIO.py)
9. [kmerresistance](hAMRonization/KmerResistanceIO.py)
10. [srst2](hAMRonization/Srst2IO.py)
11. [staramr](hAMRonization/StarAmrIO.py)
12. [c-sstar](hAMRonization/CSStarIO.py)
13. [amrplusplus](hAMRonization/AmrPlusPlusIO.py)
14. [resfams](hAMRonization/ResFamsIO.py)
15. [groot](hAMRonization/GrootIO.py)

Parsers excluded as needing variant specification to implement:
1. [mykrobe](test/data/raw_outputs/mykrobe/report.json)
2. [pointfinder](test/data/raw_outputs/pointfinder/report.tsv)
1. [abricate](hAMRonization/AbricateIO.py): last updated for v1.0.0
2. [ariba](hAMRonization/AribaIO.py): last updated for v2.14.6
3. [NCBI AMRFinderPlus](hAMRonization/AmrFinderPlusIO.py): last updated for v3.10.40
4. [RGI](hAMRonization/RgiIO.py) (includes RGI-BWT) last updated for v5.2.0
5. [resfinder](hAMRonization/ResFinderIO.py): last updated for v4.1.0
6. [pointfinder](hAMRonization/PointFinderIO.py): last updated for v4.1.0
7. [sraX](hAMRonization/SraxIO.py): last updated for v1.5
8. [deepARG](hAMRonization/DeepArgIO.py): last updated for v1.0.2
9. [kmerresistance](hAMRonization/KmerResistanceIO.py): late updated for v2.2.0
10. [srst2](hAMRonization/Srst2IO.py): last updated for v0.2.0
11. [staramr](hAMRonization/StarAmrIO.py): last updated for v0.8.0
12. [c-sstar](hAMRonization/CSStarIO.py): last updated for v2.1.0
13. [amrplusplus](hAMRonization/AmrPlusPlusIO.py): last updated for c6b097a
14. [resfams](hAMRonization/ResFamsIO.py): last updated for hmmer v3.3.2
15. [groot](hAMRonization/GrootIO.py): last updated for v1.1.2
16. [mykrobe](test/data/raw_outputs/mykrobe/report.json): last updated for v0.8.1
17. [tbprofilder](test/data/raw_outputs/tbprofiler/tbprofiler.json): last updated for v3.0.8
### Issues
#### Coding

- sanity checks need done

- automated tests need added

- output to file options (with appending and check headers) should be added

#### Specification
- mandatory fields: `gene_symbol` and `gene_name` are confusing and not usually both present (only consistently used in AFP). Means tools either need 1:2 mapping i.e. single output field maps to both `gene_symbol` and `gene_name` OR have fragile text splitting of single field that won't be robust to databases changes. Current solution is 1:2 mapping e.g. staramr
- inconsistent nomenclature of terms being used in specification fields: target, query, subject, reference. Need to stick to one name for sequence with which the database is being searched, and one the hit that results from that search.
- variant specification needed to fully exploit ariba (or make mykrobe and pointfinder worth implementing): *discard these tools for now*

- `sequence_identity`: is sequence type specific %id amino acids != %id nucleotide but does this matter?
- `coverage_depth` seems to include both tool fields that are average depth of read and just plain overall read-count,
Expand Down
4 changes: 3 additions & 1 deletion hAMRonization/AbricateIO.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@

import csv
from .Interfaces import hAMRonizedResultIterator
from hAMRonization.constants import GENE_PRESENCE

required_metadata = ['analysis_software_version',
'reference_database_version']
Expand All @@ -11,6 +12,7 @@ class AbricateIterator(hAMRonizedResultIterator):

def __init__(self, source, metadata):
metadata['analysis_software_name'] = 'abricate'
metadata['genetic_variation_type'] = GENE_PRESENCE
self.metadata = metadata

self.field_mapping = {
Expand All @@ -24,7 +26,7 @@ def __init__(self, source, metadata):
'%COVERAGE': 'coverage_percentage',
'COVERAGE': None,
'%IDENTITY': 'sequence_identity',
'DATABASE': 'reference_database_id',
'DATABASE': 'reference_database_name',
'ACCESSION': 'reference_accession',
'RESISTANCE': 'drug_class',
'COVERAGE_MAP': None,
Expand Down
63 changes: 45 additions & 18 deletions hAMRonization/AmrFinderPlusIO.py
Original file line number Diff line number Diff line change
@@ -1,8 +1,10 @@
#!/usr/bin/env python

import csv
import re
import warnings
from .Interfaces import hAMRonizedResultIterator
from hAMRonization.constants import NUCLEOTIDE_VARIANT, AMINO_ACID_VARIANT, GENE_PRESENCE

required_metadata = ['analysis_software_version',
'reference_database_version',
Expand All @@ -13,14 +15,12 @@ class AmrFinderPlusIterator(hAMRonizedResultIterator):

def __init__(self, source, metadata):
metadata['analysis_software_name'] = 'amrfinderplus'
metadata['reference_database_id'] = 'NCBI Reference Gene Database'
metadata['reference_database_name'] = 'NCBI Reference Gene Database'
self.metadata = metadata

# check source for whether AMFP has been run in protein or nt mode
with open(source) as fh:
header = next(fh)
if 'Contig id' in header.strip().split('\t'):
self.field_mapping = {

nucleotide_field_mapping = {
'Protein identifier': None,
'Contig id': 'input_sequence_id',
'Start': 'input_gene_start',
Expand All @@ -42,11 +42,12 @@ def __init__(self, source, metadata):
'Accession of closest sequence': 'reference_accession',
'Name of closest sequence': None,
'HMM id': None,
'HMM description': None
'HMM description': None,
'AA Mutation': 'amino_acid_mutation',
'Nucleotide Mutation': 'nucleotide_mutation',
'genetic_variation_type': 'genetic_variation_type'
}
else:
self.field_mapping = {
'Protein identifier': 'input_sequence_id',
protein_field_mapping = {'Protein identifier': 'input_sequence_id',
'Gene symbol': 'gene_symbol',
'Sequence name': 'gene_name',
'Scope': None,
Expand All @@ -64,29 +65,55 @@ def __init__(self, source, metadata):
'Name of closest sequence': None,
'HMM id': None,
'HMM description': None,
'AA Mutation': 'amino_acid_mutation',
'genetic_variation_type': 'genetic_variation_type'
}

with open(source) as fh:
header = next(fh)
try:
first_result = next(fh)
if first_result.strip().split('\t')[0] == 'NA':
self.field_mapping = nucleotide_field_mapping
else:
self.field_mapping = protein_field_mapping
except StopIteration:
# doesn't really matter which mapping as this error indicates
# this is an empty results file
self.field_mapping = nucleotide_field_mapping

super().__init__(source, self.field_mapping, self.metadata)


def parse(self, handle):
"""
Read each and return it
"""
# skip any manually specified fields for later
skipped_mutational = 0
reader = csv.DictReader(handle, delimiter='\t')
for result in reader:
# replace NA value with None for consitency
for field, value in result.items():
if value == "NA":
result[field] = None

# "point" for mutational variants in this field, homolog are "AMR"
if result['Element subtype'] != 'AMR':
skipped_mutational += 1
# "POINT" indicates mutational resistance
# amrfinderplus has no special fields but the mutation itself is
# appended to the symbol name so we want to split this
result['AA Mutation'] = None
result['Nucleotide Mutation'] = None
result['genetic_variation_type'] = GENE_PRESENCE

yield self.hAMRonize(result, self.metadata)
if result['Element subtype'] == 'POINT':
gene_symbol, mutation = result["Gene symbol"].rsplit('_', 1)
result['Gene symbol'] = gene_symbol
_, ref, pos, alt, _ = re.split(r"(\D+)(\d+)(\D+)", mutation)
# this means it is a protein mutation
if result['Method'] in ['POINTX', 'POINTP']:
result['AA Mutation'] = f"p.{ref}{pos}{alt}"
result['genetic_variation_type'] = AMINO_ACID_VARIANT
elif result['Method'] == 'POINTN':
# e.g., 23S_G2032G ampC_C-11C -> c.2032G>G
result['Nucleotide Mutation'] = f"c.{pos}{ref}>{alt}"
result['genetic_variation_type'] = NUCLEOTIDE_VARIANT

if skipped_mutational > 0:
warnings.warn(f"Skipping {skipped_mutational} mutational AMR "
f"records from {self.metadata['input_file_name']}")
yield self.hAMRonize(result, self.metadata)
6 changes: 4 additions & 2 deletions hAMRonization/AmrPlusPlusIO.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

import csv
from .Interfaces import hAMRonizedResultIterator

from hAMRonization.constants import GENE_PRESENCE

required_metadata = ['analysis_software_version',
'reference_database_version',
Expand All @@ -13,7 +13,9 @@ class AmrPlusPlusIterator(hAMRonizedResultIterator):

def __init__(self, source, metadata):
metadata['analysis_software_name'] = 'amrplusplus'
metadata['reference_database_id'] = 'megares'
metadata['reference_database_name'] = 'megares'
metadata['genetic_variation_type'] = GENE_PRESENCE

self.metadata = metadata
self.field_mapping = {
# Sample Gene Hits Gene Fraction
Expand Down
27 changes: 23 additions & 4 deletions hAMRonization/AribaIO.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,13 +2,13 @@

import csv
from .Interfaces import hAMRonizedResultIterator
from hAMRonization.constants import NUCLEOTIDE_VARIANT, AMINO_ACID_VARIANT, GENE_PRESENCE

required_metadata = ['analysis_software_version',
'reference_database_version',
'reference_database_id',
'reference_database_name',
'input_file_name']


class AribaIterator(hAMRonizedResultIterator):

def __init__(self, source, metadata):
Expand Down Expand Up @@ -40,13 +40,16 @@ def __init__(self, source, metadata):
"ref_end": None, # for variant
"ref_nt": None,
"ctg_start": None,
"ctig_end": None,
"ctg_end": None,
"smtls_total_depth": None,
"smtls_nts": None,
"smtls_nts_depth": None,
"var_description": None,
"free_text": None,
'_gene_symbol': 'gene_symbol'
'_gene_symbol': 'gene_symbol',
'_nucleotide_mutation': 'nucleotide_mutation',
'_amino_acid_mutation': 'amino_acid_mutation',
'_genetic_variation_type': 'genetic_variation_type'
}
super().__init__(source, self.field_mapping, self.metadata)

Expand All @@ -59,4 +62,20 @@ def parse(self, handle):
for result in reader:
_gene_symbol = result['ref_name'].split(".")[0]
result['_gene_symbol'] = _gene_symbol
# default valuej
result['_genetic_variation_type'] = GENE_PRESENCE
result['_nucleotide_mutation'] = None
result['_amino_acid_mutation'] = None

if str(result['known_var']) == '1' and str(result['has_known_var']) == '1':
if result['var_seq_type'] == 'n':
_, ref, pos, alt, _ = re.split(r"(\D+)(\d+)(\D+)", result['known_var_change'])
result['_genetic_variation_type'] = NUCLEOTIDE_VARIANT
result['_nucleotide_mutation'] = f"n.{pos}{ref}>{alt}"
result['_amino_acid_mutation'] = None
elif result['var_seq_type'] == 'p':
result['_genetic_variation_type'] = AMINO_ACID_VARIANT
result['_nucleotide_mutation'] = f"n.{result['ctg_start']}{result['ref_nt']}>{result['ctg_nt']}"
result['_amino_acid_mutation'] = f"p.{result['known_var_change']}"

yield self.hAMRonize(result, self.metadata)
5 changes: 3 additions & 2 deletions hAMRonization/CSStarIO.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,18 +3,19 @@
import csv
from collections import OrderedDict
from .Interfaces import hAMRonizedResultIterator

from hAMRonization.constants import GENE_PRESENCE

required_metadata = ['analysis_software_version',
'reference_database_version',
'input_file_name',
'reference_database_id']
'reference_database_name']


class CSStarIterator(hAMRonizedResultIterator):

def __init__(self, source, metadata):
metadata['analysis_software_name'] = 'csstar'
metadata['genetic_variation_type'] = GENE_PRESENCE
self.metadata = metadata
self.field_mapping = OrderedDict([
(0, "gene_symbol"),
Expand Down
6 changes: 4 additions & 2 deletions hAMRonization/DeepArgIO.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@

import csv
from .Interfaces import hAMRonizedResultIterator
from hAMRonization.constants import GENE_PRESENCE

required_metadata = ['analysis_software_version',
'reference_database_version',
Expand All @@ -12,7 +13,8 @@ class DeepArgIterator(hAMRonizedResultIterator):

def __init__(self, source, metadata):
metadata['analysis_software_name'] = 'deeparg'
metadata['reference_database_id'] = 'deeparg_db'
metadata['reference_database_name'] = 'deeparg_db'
metadata['genetic_variation_type'] = GENE_PRESENCE
self.metadata = metadata

self.field_mapping = {
Expand All @@ -29,7 +31,7 @@ def __init__(self, source, metadata):
'alignment-bitscore': None,
'alignment-evalue': None,
'counts': None,
# gather from splititng besthit
# gather from splitting besthit
'_reference_accession': 'reference_accession'}

super().__init__(source, self.field_mapping, self.metadata)
Expand Down
4 changes: 3 additions & 1 deletion hAMRonization/GrootIO.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,9 +3,10 @@
import csv
from collections import OrderedDict
from .Interfaces import hAMRonizedResultIterator
from hAMRonization.constants import GENE_PRESENCE

required_metadata = ['analysis_software_version',
'reference_database_id',
'reference_database_name',
'reference_database_version',
'input_file_name']

Expand All @@ -14,6 +15,7 @@ class GrootIterator(hAMRonizedResultIterator):

def __init__(self, source, metadata):
metadata['analysis_software_name'] = 'groot'
metadata['genetic_variation_type'] = GENE_PRESENCE
self.metadata = metadata

self.field_mapping = OrderedDict([
Expand Down
Loading

0 comments on commit d40f2f9

Please sign in to comment.