Commit

Merge pull request #221 from phac-nml/development
Merging Development into Main (v0.11.0 Release)
emarinier authored Dec 16, 2024
2 parents 8d59498 + 8fe219a commit a81e915
Showing 22 changed files with 1,519 additions and 1,427 deletions.
6 changes: 4 additions & 2 deletions .github/workflows/ci-test.yml
Original file line number Diff line number Diff line change
@@ -18,7 +18,7 @@ jobs:
runs-on: ubuntu-latest

env:
DATABASE_COMMITS: '--resfinder-commit fa32d9a3cf0c12ec70ca4e90c45c0d590ee810bd --pointfinder-commit 8c694b9f336153e6d618b897b3b4930961521eb8 --plasmidfinder-commit c18e08c17a5988d4f075fc1171636e47546a323d'
DATABASE_COMMITS: '--resfinder-commit d1e607b8989260c7b6a3fbce8fa3204ecfc09022 --pointfinder-commit 694919f59a38980204009e7ade76bf319cb7df0b --plasmidfinder-commit c18e08c17a5988d4f075fc1171636e47546a323d'

strategy:
fail-fast: False
@@ -61,4 +61,6 @@ jobs:

- name: Run Tests
shell: bash -l {0}
run: python setup.py test
run: |
pip install pytest
pytest
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,11 @@
# Version 0.11.0

* Updated the Resfinder database to use the 2024-08-06 release.
* Updated the Pointfinder database to use the 2024-08-08 release.
* The Resfinder and Pointfinder databases now use FASTA record IDs with accession numbers (ex: `pmrA` -> `pmrA_1_CP055130.1`). StarAMR has been updated to support this, but database backwards compatibility is unlikely.
* The `genes_to_exclude` is affected by the above change and must now use gene IDs that exactly match the new FASTA record IDs with accession IDs.
* Removed ARG drug key entries with "None" or missing resistance.

# Version 0.10.0

* Updated the Plasmidfinder database to use the January 18th 2023 release.
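The `genes_to_exclude` note in the 0.11.0 entry can be illustrated with a minimal sketch. It assumes the exclusion file is a plain list with a `gene_id` header line, as the README hunk in this commit shows. `load_excluded_ids` and `is_excluded` are hypothetical helpers written for illustration, not StarAMR functions.

```python
import io


def load_excluded_ids(handle):
    """Read an exclusion list whose first line is the 'gene_id' header (hypothetical helper)."""
    lines = [line.strip() for line in handle if line.strip()]
    if not lines or lines[0] != 'gene_id':
        raise ValueError("exclusion list must start with a 'gene_id' header line")
    return set(lines[1:])


def is_excluded(record_id, excluded_ids):
    # Exact string comparison: no prefix or bare gene-name matching is attempted.
    return record_id in excluded_ids


# Two IDs taken from the README example in this commit.
example = io.StringIO("gene_id\ngyrA_1_CP073768.1\npmrB_1_CP051284.1\n")
excluded = load_excluded_ids(example)
```

Under this reading, an old-style entry such as `pmrB` would no longer exclude anything, because it does not equal the full record ID `pmrB_1_CP051284.1`.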
10 changes: 5 additions & 5 deletions README.md
@@ -145,7 +145,7 @@ If you wish to update to the latest ResFinder, PointFinder, and PlasmidFinder da
staramr db update --update-default
```

If you wish to switch to specific git commits of either ResFinder, PointFinder, or PlasmidFinder databases you may also pass `--resfinder-commit [COMMIT]`, `--pointfinder-commit [COMMIT]`, and `--plasmidfinder-commit [COMMIT]`.
If you wish to switch to specific git commits of either ResFinder, PointFinder, or PlasmidFinder databases you may also pass `--resfinder-commit [COMMIT]`, `--pointfinder-commit [COMMIT]`, and `--plasmidfinder-commit [COMMIT]`. However, please note that because of compatibility issues arising from changes in the source databases, this functionality is largely unsupported and is unlikely to work for versions of the databases that StarAMR wasn't released with.

## Restore Database

@@ -245,11 +245,11 @@ By default, the ResFinder/PointFinder/PlasmidFinder genes listed in [genes_to_ex

```
gene_id
aac(6')-Iaa_1_NC_003197
ColpVC_1__JX133088
gyrA_1_CP073768.1
pmrB_1_CP051284.1
```

Please make sure to include `gene_id` in the first line. The default exclusion list can also be disabled with `--no-exclude-genes`.
Please make sure to include `gene_id` in the first line. The default exclusion list can also be disabled with `--no-exclude-genes`. Gene IDs must exactly match the FASTA record IDs provided in the source databases.

## Complex Mutations

@@ -504,7 +504,7 @@ Restores the default database for `staramr`.

# Caveats

This software is still a work-in-progress. In particular, not all organisms stored in the PointFinder database are supported (only *salmonella*, *campylobacter* are currently supported). Additionally, the predicted phenotypes are for microbiological resistance and *not* clinical resistance. Phenotype/drug resistance predictions are an experimental feature which is continually being improved.
This software is still a work-in-progress. In particular, not all organisms stored in the PointFinder database are supported (only *enterococcus_faecalis*, *helicobacter_pylori*, *enterococcus_faecium*, *campylobacter*, *escherichia_coli*, *salmonella* are currently supported). Additionally, the predicted phenotypes are for microbiological resistance and *not* clinical resistance. Phenotype/drug resistance predictions are an experimental feature which is continually being improved.

`staramr` only works on assembled genomes and not directly on reads. A quick genome assembler you could use is [Shovill][shovill]. Or, you may also wish to try out the [ResFinder webservice][resfinder-web], or the command-line tools [rgi][] or [ariba][] which will work on sequence reads as well as genome assemblies. You may also wish to check out the [CARD webservice][card-web].

14 changes: 11 additions & 3 deletions scripts/data-conversion/pointfinder-drug-resistance.ipynb
@@ -17,7 +17,7 @@
{
"data": {
"text/plain": [
"dict_keys(['Salmonella', 'Shigella E. coli', 'Campylobacter'])"
"dict_keys(['Salmonella', 'Shigella E. coli', 'Campylobacter', 'NCBI AMRfinder'])"
]
},
"execution_count": 1,
@@ -28,7 +28,7 @@
"source": [
"import pandas as pd\n",
"\n",
"pointfinder_file = '../../drug-key-update/pointfinder 072621.xlsx'\n",
"pointfinder_file = '../../pointfinder.xlsx'\n",
"\n",
"pointfinder_excel = pd.ExcelFile(pointfinder_file)\n",
"sheets_df_map_orig = {n: pd.read_excel(pointfinder_excel, sheet_name=n, header=None) for n in pointfinder_excel.sheet_names}\n",
@@ -407,6 +407,14 @@
"source": [
"pointfinder_df_reduced.to_csv('../../staramr/databases/resistance/data/ARG_drug_key_pointfinder.tsv', sep='\\t', index=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "938df9d4-d16a-4855-aee0-5208fadfad68",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
@@ -425,7 +433,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
"version": "3.12.3"
}
},
"nbformat": 4,
101 changes: 56 additions & 45 deletions scripts/data-conversion/resfinder-drug-resistance.ipynb

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion staramr/__init__.py
@@ -1 +1 @@
__version__ = '0.10.0'
__version__ = '0.11.0'
4 changes: 2 additions & 2 deletions staramr/blast/plasmidfinder/PlasmidfinderBlastDatabase.py
@@ -75,8 +75,8 @@ def get_config_table(cls, database_dir: str) -> pd.DataFrame:
A Class Method to get the config table from the plasmidfinder database root directory.
:return: A DataFrame containing the config table.
"""
config = pd.read_csv(path.join(database_dir, 'config'), sep='\t', comment='#', header=None,
names=['db_prefix', 'name', 'description'])
config = pd.read_csv(path.join(database_dir, 'config'), sep=r'\t|\s{4,}', comment='#', header=None,
names=['db_prefix', 'name', 'description'], engine='python')
return config

@classmethod
4 changes: 2 additions & 2 deletions staramr/blast/pointfinder/PointfinderBlastDatabase.py
@@ -167,8 +167,8 @@ def get_organisms(cls, database_dir):
:param database_dir: The PointFinder database root directory.
:return: A list of organisms.
"""
config = pd.read_csv(path.join(database_dir, 'config'), sep='\t', comment='#', header=None,
names=['db_prefix', 'name', 'description'])
config = pd.read_csv(path.join(database_dir, 'config'), sep=r'\t|\s{4,}', comment='#', header=None,
names=['db_prefix', 'name', 'description'], engine='python')
return config['db_prefix'].tolist()

@classmethod
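The separator change in the two `config` readers above (and again in `PointfinderDatabaseInfo.from_file`) treats either a tab or a run of four or more whitespace characters as a column delimiter. pandas interprets such a multi-character separator as a regular expression, which requires the Python parsing engine, hence `engine='python'`. The sketch below is a stand-in that applies the same pattern with `re.split` from the standard library rather than `pd.read_csv`. `parse_config` is a hypothetical helper and the sample rows are invented for illustration.

```python
import re

# Same pattern as the new `sep` argument: a tab, or four or more whitespace characters.
SEPARATOR = re.compile(r'\t|\s{4,}')


def parse_config(text):
    """Split config-style lines into fields (hypothetical stand-in for pd.read_csv)."""
    rows = []
    for line in text.splitlines():
        line = line.split('#', 1)[0]  # drop comments, similar to comment='#'
        if not line.strip():
            continue
        rows.append(SEPARATOR.split(line.strip()))
    return rows


sample = (
    "#db_prefix\tname\tdescription\n"
    "salmonella\tSalmonella\tSalmonella enterica\n"
    "escherichia_coli    E. coli    Escherichia coli\n"
)
rows = parse_config(sample)
```

Single spaces inside a field (for example `E. coli`) are left alone, because only runs of at least four whitespace characters count as a delimiter.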
17 changes: 13 additions & 4 deletions staramr/blast/pointfinder/PointfinderDatabaseInfo.py
@@ -37,11 +37,12 @@ def from_file(cls, file):

with open(file) as f:
line = f.readline()

line = line.lstrip("#")
column_names = line.split()

pointfinder_info = pd.read_csv(file, sep='\t', index_col=False, comment='#', header=None, names=column_names)
pointfinder_info = pd.read_csv(file, sep=r'\t|\s{4,}', index_col=False, comment='#', header=None, names=column_names, engine='python')
pointfinder_info["PMID"] = pointfinder_info["PMID"].astype(str)

return cls(pointfinder_info, file)

@@ -53,7 +54,7 @@ def from_pandas_table(cls, database_info_dataframe):
:return: A new PointfinderDatabaseInfo.
"""
return cls(database_info_dataframe)

@staticmethod
def to_codons(regex_match):
# Sometimes, the regex will match a string with a comma and return multiple matches.
@@ -107,6 +108,14 @@ def _resistance_table_hacks(self, table):
& (table['Gene_ID'].str.contains('16S') == False)
& (table['Gene_ID'].str.contains('23S') == False), "Res_codon"].str.replace('[A-Z,]+', self.to_codons, regex=True)

# We need to correct for the mismatch between the promoter's FASTA record ID
# and the associated gene ID in the resistens-overview.txt file. For example:
# FASTA Record ID: >ampC-promoter_1_CP037449.1
# Gene ID: ampC_promoter_size_53bp
# The gene ID seems to be derived from the file name (ampC-promoter-size-53bp.fsa),
# although that still contains mismatching hyphens and underscores.
table[['Gene_ID']] = table[['Gene_ID']].replace({r'(.+promoter).+' : r'\1'}, regex=True)

def _get_resistance_codon_match(self, gene, codon_mutation):
table = self._pointfinder_info

@@ -120,7 +129,7 @@ def _get_resistance_codon_match(self, gene, codon_mutation):
# so we need to convert to nucleotide coordinates before making the comparison.
& (table['Ref_codon'] == codon_mutation.get_database_amr_gene_mutation())
& (table['Res_codon'].str.contains(codon_mutation.get_input_genome_mutation(), regex=False))]

# We need to handle codon insertions as a special case:
# Pointfinder mis-reports the position of codon insertions. For example:
# ref: ACG --- ACG
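The promoter hack in `_resistance_table_hacks` above trims the gene ID from `resistens-overview.txt`, while the promoter hit class later in this commit trims the FASTA record ID. A minimal sketch of how both sides might converge on the same key follows. It uses `re.sub` rather than `DataFrame.replace`, and the two function names are hypothetical.

```python
import re

# Same pattern as in the commit: keep everything up to and including "promoter".
PROMOTER_SUFFIX = re.compile(r'(.+promoter).+')


def overview_gene_id_to_key(gene_id):
    # ampC_promoter_size_53bp -> ampC_promoter
    return PROMOTER_SUFFIX.sub(r'\1', gene_id)


def fasta_record_id_to_key(record_id):
    # ampC-promoter_1_CP037449.1 -> ampC-promoter -> ampC_promoter
    return PROMOTER_SUFFIX.sub(r'\1', record_id).replace('-', '_')
```

IDs that do not contain `promoter` do not match the pattern, so `overview_gene_id_to_key` returns them unchanged.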
11 changes: 9 additions & 2 deletions staramr/blast/results/BlastResultsParser.py
@@ -91,9 +91,16 @@ def _get_out_file_name(self, in_file):
pass

def _handle_blast_hit(self, in_file, database_name, blast_file, results, hit_seq_records):
ignore_database = ["all", "campylobacter",
"enterococcus_faecalis", "enterococcus_faecium",
"escherichia_coli", "helicobacter_pylori", "klebsiella",
"mycobacterium_tuberculosis", "neisseria_gonorrhoeae",
"plasmodium_falciparum", "salmonella",
"staphylococcus_aureus"]

blast_table = pd.read_csv(blast_file, sep='\t', header=None, names=JobHandler.BLAST_COLUMNS,
index_col=False).astype(
dtype={'qseqid': np.unicode_, 'sseqid': np.unicode_})
dtype={'qseqid': np.str_, 'sseqid': np.str_})
partitions = BlastHitPartitions()

blast_table['plength'] = (blast_table.length / blast_table.qlen) * 100.0
@@ -107,7 +114,7 @@ def _handle_blast_hit(self, in_file, database_name, blast_file, results, hit_seq
for hits_non_overlapping in partitions.get_hits_nonoverlapping_regions():
for hit in self._select_hits_to_include(hits_non_overlapping):
blast_results = self._get_result_rows(hit, database_name)
if blast_results is not None:
if blast_results is not None and database_name not in ignore_database:
logger.debug("record = %s", blast_results)
results.extend(blast_results)
hit_seq_records.append(hit.get_seq_record())
Original file line number Diff line number Diff line change
@@ -48,7 +48,8 @@ def __init__(self, file_blast_map, blast_database, pid_threshold, plength_thresh

def _create_hit(self, file, database_name, blast_record):
logger.debug("database_name=%s", database_name)
if (database_name == '16S_rrsD') or (database_name == '23S'):
if (database_name.startswith('16S_rrs') or database_name.startswith('16S-rrs') \
or (database_name == '23S')):
return PointfinderHitHSPRNA(file, blast_record)
elif ('promoter' in database_name):
return PointfinderHitHSPPromoter(file, blast_record, database_name)
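The dispatch in `_create_hit` above can be sketched on its own as a classifier over database names. This is an illustration only: `classify_database_name` is a hypothetical function, and the `'other'` label stands in for whatever branch follows in the real method, which this hunk does not show.

```python
def classify_database_name(database_name):
    """Mirror the branch conditions of _create_hit (hypothetical helper)."""
    if (database_name.startswith('16S_rrs')
            or database_name.startswith('16S-rrs')
            or database_name == '23S'):
        return 'rna'
    elif 'promoter' in database_name:
        return 'promoter'
    # Remaining branch of the real method is not shown in the hunk.
    return 'other'
```

As written, the `23S` test is an exact comparison, so a name such as `23S-rRNA-a1` would not be classified as RNA by this condition.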
Original file line number Diff line number Diff line change
@@ -60,12 +60,12 @@ def _get_result(self, hit, db_mutation):
else:
mutation_position = db_mutation.get_mutation_position()

arg_drug = self._arg_drug_table.get_drug(self._blast_database.get_organism(), hit.get_amr_gene_id(),
arg_drug = self._arg_drug_table.get_drug(self._blast_database.get_organism(), hit.get_amr_gene_name(),
mutation_position)

cge_drug = self._blast_database.get_cge_phenotype(hit.get_amr_gene_id(), db_mutation)
cge_drug = self._blast_database.get_cge_phenotype(hit.get_amr_gene_name(), db_mutation)

gene_name = hit.get_amr_gene_id() + " (" + db_mutation.get_mutation_string_short() + ")"
gene_name = hit.get_amr_gene_name() + " (" + db_mutation.get_mutation_string_short() + ")"

if arg_drug is None:
arg_drug = 'unknown[' + gene_name + ']'
@@ -87,10 +87,10 @@ def _get_result(self, hit, db_mutation):
hit.get_genome_contig_start(),
hit.get_genome_contig_end(),
db_mutation.get_pointfinder_mutation_string(),
self._blast_database.get_cge_notes(hit.get_amr_gene_id(), db_mutation),
self._blast_database.get_cge_required_mutation(hit.get_amr_gene_id(), db_mutation),
self._blast_database.get_cge_mechanism(hit.get_amr_gene_id(), db_mutation),
self._blast_database.get_cge_pmid(hit.get_amr_gene_id(), db_mutation),
self._blast_database.get_cge_notes(hit.get_amr_gene_name(), db_mutation),
self._blast_database.get_cge_required_mutation(hit.get_amr_gene_name(), db_mutation),
self._blast_database.get_cge_mechanism(hit.get_amr_gene_name(), db_mutation),
self._blast_database.get_cge_pmid(hit.get_amr_gene_name(), db_mutation),
]

return result
12 changes: 11 additions & 1 deletion staramr/blast/results/pointfinder/PointfinderHitHSP.py
@@ -28,7 +28,17 @@ def get_amr_gene_name(self):
Gets the particular gene name for the PointFinder hit.
:return: The gene name.
"""
return self._blast_record['qseqid']

# As far back as 2020, CGE has been editing the FASTA file
# record headings used by Pointfinder to include accession
# numbers (pmrA -> pmrA_1_CP055130.1). This seems to be part
# of a larger initiative by CGE to move from the
# resistens-overview.txt file to a new phenotypes.txt file.
# However, we need to use the new FASTA record headers with
# the legacy resistens-overview.txt names, which unfortunately
# requires modifying the gene names back
# (pmrA_1_CP055130.1 -> pmrA).
return self._blast_record['qseqid'].split("_")[0]

def _get_mutation_positions(self, start):
mutation_positions = []
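The new `get_amr_gene_name` above keeps only the text before the first underscore. A minimal sketch of that rule, with `strip_accession` as a hypothetical name:

```python
def strip_accession(record_id):
    """Return the part of a FASTA record ID before the first underscore."""
    return record_id.split('_')[0]
```

For `pmrA_1_CP055130.1` this yields `pmrA`. The same rule would reduce an ID that begins with `16S_rrsD` to `16S`, which appears to be why the RNA and promoter hit classes in this commit override the method with their own handling.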
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
from staramr.blast.results.pointfinder.PointfinderHitHSP import PointfinderHitHSP
from staramr.blast.results.pointfinder.codon.CodonMutationPosition import CodonMutationPosition
from staramr.blast.results.pointfinder.nucleotide.NucleotideMutationPosition import NucleotideMutationPosition
import re


class PointfinderHitHSPPromoter(PointfinderHitHSP):
@@ -106,15 +107,32 @@ def _parse_database_name(self, database_name):
Parses the name of the database in order to obtain the promoter offset.
The database name is expected to have the following format:
[GENENAME]_promoter_size_[SIZE]bp
[GENENAME]-promoter-size-[SIZE]bp
example:
embA_promoter_size_115bp
embA-promoter-size-115bp
"""

tokens = database_name.split("_") # split the name into tokens
tokens = database_name.split("-") # split the name into tokens
size_string = tokens[len(tokens) - 1] # get the last token
size = int(size_string.replace('bp', '')) # remove the 'bp' and convert to an int

self.offset = size

def get_amr_gene_name(self):
"""
Gets the particular gene name for the PointfinderHitHSPPromoter hit.
:return: The gene name.
"""
name = self._blast_record['qseqid']

# CGE has been changing FASTA record headers to include accession
# numbers, which need to be removed. See PointfinderHitHSP.get_amr_gene_name()
# for more information. Furthermore, promoters seem to use the form
# "ampC-promoter", whereas the resistens-overview.txt file expects the form
# "ampC_promoter_size_53bp".
# Ex: ampC-promoter_1_CP037449.1 -> ampC_promoter
name = re.sub(r'(.+promoter).+', r'\1', name) # Grab everything up to and including "promoter"
name = name.replace('-', "_")
return name
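The change to `_parse_database_name` above switches the token separator from underscores to hyphens. A minimal sketch of the new parsing, with `parse_promoter_offset` as a hypothetical standalone function:

```python
def parse_promoter_offset(database_name):
    """Extract the promoter size from a name like 'embA-promoter-size-115bp'."""
    size_token = database_name.split('-')[-1]  # last hyphen-separated token, e.g. '115bp'
    return int(size_token.replace('bp', ''))   # strip the unit and convert to int
```

With this rule the old underscore form `embA_promoter_size_115bp` contains no hyphen, so the whole name becomes the last token and `int()` raises `ValueError`. That is consistent with the changelog's warning that backwards compatibility with older database versions is unlikely.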
Original file line number Diff line number Diff line change
@@ -12,6 +12,25 @@ def __init__(self, file, blast_record):
"""
super().__init__(file, blast_record)

def get_amr_gene_name(self):
"""
Gets the particular gene name for the PointfinderHitHSPRNA hit.
:return: The gene name.
"""
name = self._blast_record['qseqid']

# CGE has been changing FASTA record headers to include accession
# numbers, which need to be removed. See PointfinderHitHSP.get_amr_gene_name()
# for more information. Naming schemes are also inconsistent:
# pointfinder/campylobacter/23S.fsa -> 23S_1_LR134511.1
# pointfinder/neisseria_gonorrhoeae/23S-rRNA-a1.fsa -> 23S-rRNA-a1_1_AE004969.1
if name.startswith("16S_rrs"): name = name.split("_")[0] + "_" + name.split("_")[1]
elif name.startswith("16S-rrs"): name = name.split("_")[0].replace("-", "_", 1) # Ex: 16S-rrsD_1_CP049983.1
elif name.startswith("23S"): name = "23S"
else: name = name.split("_")[0]

return name

def _get_mutation_positions(self, start):
mutation_positions = []
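The RNA naming rules in `PointfinderHitHSPRNA.get_amr_gene_name` above can be sketched as a standalone function. `rna_gene_name` is a hypothetical name, and the underscore-form ID used in the usage note is illustrative rather than taken from the database.

```python
def rna_gene_name(record_id):
    """Normalise an rRNA FASTA record ID to its legacy gene name (hypothetical helper)."""
    parts = record_id.split('_')
    if record_id.startswith('16S_rrs'):
        # Keep the first two underscore-separated tokens: 16S + rrsX
        return parts[0] + '_' + parts[1]
    if record_id.startswith('16S-rrs'):
        # 16S-rrsD_1_CP049983.1 -> 16S-rrsD -> 16S_rrsD
        return parts[0].replace('-', '_', 1)
    if record_id.startswith('23S'):
        # Every 23S variant collapses to the same name.
        return '23S'
    return parts[0]
```

Both `16S-rrsD_1_CP049983.1` and a hypothetical `16S_rrsD_1_CP049983.1` map to `16S_rrsD`, and both `23S_1_LR134511.1` and `23S-rRNA-a1_1_AE004969.1` map to `23S`.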

4 changes: 2 additions & 2 deletions staramr/databases/AMRDatabasesManager.py
@@ -14,8 +14,8 @@ class AMRDatabasesManager:
# Update to commits corresponding to dates listed on <https://cge.cbs.dtu.dk/services/ResFinder/> (and PlasmidFinder)
# As of May 26, 2022
DEFAULT_COMMITS = {
'resfinder': 'fa32d9a3cf0c12ec70ca4e90c45c0d590ee810bd', # 2022-05-24
'pointfinder': '8c694b9f336153e6d618b897b3b4930961521eb8', # 2021-02-01
'resfinder': 'd1e607b8989260c7b6a3fbce8fa3204ecfc09022', # 2024-08-06
'pointfinder': '694919f59a38980204009e7ade76bf319cb7df0b', # 2024-08-08
'plasmidfinder': 'c18e08c17a5988d4f075fc1171636e47546a323d', # 2023-01-18
}
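The `DEFAULT_COMMITS` values above are repeated in the CI workflow's `DATABASE_COMMITS` variable and in the integration test constants, so all three have to change together. One way to check them against each other is to render the dictionary as the command-line flags documented in the README. `commit_flags` is a hypothetical helper, not part of StarAMR.

```python
DEFAULT_COMMITS = {
    'resfinder': 'd1e607b8989260c7b6a3fbce8fa3204ecfc09022',      # 2024-08-06
    'pointfinder': '694919f59a38980204009e7ade76bf319cb7df0b',    # 2024-08-08
    'plasmidfinder': 'c18e08c17a5988d4f075fc1171636e47546a323d',  # 2023-01-18
}


def commit_flags(commits):
    """Render a {database: commit} mapping as '--<database>-commit <sha>' flags."""
    return ' '.join(f'--{name}-commit {sha}' for name, sha in commits.items())


flags = commit_flags(DEFAULT_COMMITS)
```

The resulting string matches the new `DATABASE_COMMITS` value in `.github/workflows/ci-test.yml`.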

10 changes: 4 additions & 6 deletions staramr/databases/resistance/data/ARG_drug_key_pointfinder.tsv
@@ -47,7 +47,6 @@ salmonella parE 514 ciprofloxacin I/R,nalidixic acid
salmonella parE 521 ciprofloxacin I/R,nalidixic acid
salmonella 16S_rrsD 1065 spectinomycin
salmonella 16S_rrsD 1192 spectinomycin
salmonella parC 57 None
salmonella acrB 717 azithromycin
escherichia_coli gyrA 51 ciprofloxacin I/R,nalidixic acid
escherichia_coli gyrA 67 ciprofloxacin I/R,nalidixic acid
@@ -62,7 +61,6 @@ escherichia_coli gyrB 136 aminocoumarin
escherichia_coli gyrB 426 ciprofloxacin I/R,nalidixic acid
escherichia_coli gyrB 447 ciprofloxacin I/R,nalidixic acid
escherichia_coli parC 56 ciprofloxacin I/R,nalidixic acid
escherichia_coli parC 57 None
escherichia_coli parC 60 ciprofloxacin I/R,nalidixic acid
escherichia_coli parC 78 ciprofloxacin I/R,nalidixic acid
escherichia_coli parC 80 ciprofloxacin I/R,nalidixic acid
@@ -110,10 +108,10 @@ escherichia_coli 16S_rrsC 794 kasugamicin
escherichia_coli 16S_rrsC 926 kasugamicin
escherichia_coli 16S_rrsC 1519 kasugamicin
escherichia_coli 16S_rrsH 1192 spectinomycin
escherichia_coli ampC_promoter_size_53bp -42 ampicillin,amoxicillin/clavulanic acid,cefoxitin
escherichia_coli ampC_promoter_size_53bp -32 ampicillin,amoxicillin/clavulanic acid,cefoxitin
escherichia_coli ampC_promoter_size_53bp -13 ampicillin,amoxicillin/clavulanic acid,cefoxitin
escherichia_coli ampC_promoter_size_53bp -12 ampicillin,amoxicillin/clavulanic acid,cefoxitin
escherichia_coli ampC_promoter -42 ampicillin,amoxicillin/clavulanic acid,cefoxitin
escherichia_coli ampC_promoter -32 ampicillin,amoxicillin/clavulanic acid,cefoxitin
escherichia_coli ampC_promoter -13 ampicillin,amoxicillin/clavulanic acid,cefoxitin
escherichia_coli ampC_promoter -12 ampicillin,amoxicillin/clavulanic acid,cefoxitin
campylobacter gyrA 70 ciprofloxacin,nalidixic acid
campylobacter gyrA 85 ciprofloxacin,nalidixic acid
campylobacter gyrA 86 ciprofloxacin,nalidixic acid
2,609 changes: 1,305 additions & 1,304 deletions staramr/databases/resistance/data/ARG_drug_key_resfinder.tsv

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions staramr/databases/resistance/data/info.ini
@@ -1,3 +1,3 @@
[Versions]
pointfinder_gene_drug_version = 072621.2
resfinder_gene_drug_version = 072621
pointfinder_gene_drug_version = 070623
resfinder_gene_drug_version = 072423
6 changes: 3 additions & 3 deletions staramr/subcommand/Search.py
@@ -194,13 +194,13 @@ def _print_dataframes_to_excel(self, outfile_path, summary_dataframe, resfinder_
for name in ['Summary', 'Detailed_Summary', 'ResFinder', 'PointFinder', 'PlasmidFinder', 'MLST_Summary']:
if name in sheetname_dataframe:
if name == 'Summary':
sheetname_dataframe[name].to_excel(writer, name, freeze_panes=[1, 2], float_format="%0.2f",na_rep=self.BLANK)
sheetname_dataframe[name].to_excel(writer, sheet_name=name, freeze_panes=[1, 2], float_format="%0.2f",na_rep=self.BLANK)
else:
sheetname_dataframe[name].to_excel(writer, name, freeze_panes=[1, 1], float_format="%0.2f",na_rep=self.BLANK)
sheetname_dataframe[name].to_excel(writer, sheet_name=name, freeze_panes=[1, 1], float_format="%0.2f",na_rep=self.BLANK)

self._resize_columns(sheetname_dataframe, writer, max_width=50)

settings_dataframe.to_excel(writer, 'Settings')
settings_dataframe.to_excel(writer, sheet_name='Settings')
self._resize_columns({'Settings': settings_dataframe}, writer, max_width=75, text_wrap=False)

writer.close()
Original file line number Diff line number Diff line change
@@ -7,8 +7,8 @@


class AMRDatabasesManagerIT(unittest.TestCase):
RESFINDER_DEFAULT_COMMIT = 'fa32d9a3cf0c12ec70ca4e90c45c0d590ee810bd'
POINTFINDER_DEFAULT_COMMIT = '8c694b9f336153e6d618b897b3b4930961521eb8'
RESFINDER_DEFAULT_COMMIT = 'd1e607b8989260c7b6a3fbce8fa3204ecfc09022'
POINTFINDER_DEFAULT_COMMIT = '694919f59a38980204009e7ade76bf319cb7df0b'
PLASMIDFINDER_DEFAULT_COMMIT = 'c18e08c17a5988d4f075fc1171636e47546a323d'

def setUp(self):
60 changes: 30 additions & 30 deletions staramr/tests/integration/detection/test_AMRDetection.py

Large diffs are not rendered by default.
