Merge pull request #336 from gbouras13/dev
v1.7.1
gbouras13 authored Mar 12, 2024
2 parents 2ef31d3 + a1c6cb9 commit 40c78f6
Showing 13 changed files with 2,158 additions and 17 deletions.
11 changes: 10 additions & 1 deletion HISTORY.md
@@ -1,7 +1,16 @@
History
=======

1.7.0 (2024-01-17)
1.7.1 (2024-03-13)
------------------

* Adds Google Colab notebook that can run pharokka and [phold](https://github.com/gbouras13/phold).
* The notebook is [https://colab.research.google.com/github/gbouras13/pharokka/blob/master/run_pharokka_and_phold.ipynb](https://colab.research.google.com/github/gbouras13/pharokka/blob/master/run_pharokka_and_phold.ipynb)
* Fixes #334: issues with contig IDs that were written in scientific notation or had leading zeros.
* Fixes issues with `pharokka_proteins.py` not outputting PHROG annotations.


1.7.0 (2024-03-04)
------------------

* Adds `pharokka_multiplotter.py` to plot multiple phage contigs at once
17 changes: 17 additions & 0 deletions README.md
@@ -1,3 +1,5 @@
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gbouras13/pharokka/blob/master/run_pharokka_and_phold.ipynb)

[![Paper](https://img.shields.io/badge/paper-Bioinformatics-teal.svg?style=flat-square&maxAge=3600)](https://doi.org/10.1093/bioinformatics/btac776)
[![CI](https://github.com/gbouras13/pharokka/actions/workflows/ci.yaml/badge.svg)](https://github.com/gbouras13/pharokka/actions/workflows/ci.yaml)
[![BioConda Install](https://img.shields.io/conda/dn/bioconda/pharokka.svg?style=flag&label=BioConda%20install)](https://anaconda.org/bioconda/pharokka)
@@ -23,10 +25,25 @@ Extra special thanks to Ghais Houtak for making Pharokka's logo.

If you are looking for rapid standardised annotation of bacterial genomes, please use [Bakta](https://github.com/oschwengers/bakta). [Prokka](https://github.com/tseemann/prokka), which inspired the creation and naming of `pharokka`, is another good option, but Bakta is [Prokka's worthy successor](https://twitter.com/torstenseemann/status/1565471892840259585).

# phold

If you like `pharokka`, you will probably love [phold](https://github.com/gbouras13/phold). `phold` uses structural homology to improve phage annotation. Benchmarking is ongoing but `phold` strongly outperforms `pharokka` in terms of annotation, particularly for less characterised phages such as those from metagenomic datasets.

`pharokka` still has features `phold` lacks for now (identifying tRNA, tmRNA, CRISPR repeats, and INPHARED taxonomy search), so it is recommended to run `phold` after running `pharokka`.

`phold` takes the Genbank output of Pharokka as input. Therefore, if you have already annotated your phage(s) with Pharokka, you can easily update the annotation with more functional predictions with [phold](https://github.com/gbouras13/phold).

# Google Colab Notebooks

If you don't want to install `pharokka` or `phold` locally, you can run `pharokka` and `phold`, or only `pharokka`, without any code using the Google Colab notebook [https://colab.research.google.com/github/gbouras13/pharokka/blob/master/run_pharokka_and_phold.ipynb](https://colab.research.google.com/github/gbouras13/pharokka/blob/master/run_pharokka_and_phold.ipynb)


# Table of Contents

- [pharokka](#pharokka)
- [Fast Phage Annotation Tool](#fast-phage-annotation-tool)
- [phold](#phold)
- [Google Colab Notebooks](#google-colab-notebooks)
- [Table of Contents](#table-of-contents)
- [Quick Start](#quick-start)
- [Documentation](#documentation)
4 changes: 2 additions & 2 deletions bin/input_commands.py
@@ -61,8 +61,8 @@ def get_input():
"-g",
"--gene_predictor",
action="store",
help='User specified gene predictor. Use "-g phanotate" or "-g prodigal" or "-g prodigal-gv" or "-g genbank". \nDefaults to phanotate (not required unless prodigal is desired).',
default="phanotate",
help='User specified gene predictor. Use "-g phanotate" or "-g prodigal" or "-g prodigal-gv" or "-g genbank". \nDefaults to phanotate usually and prodigal-gv in meta mode.',
default="default",
)
parser.add_argument(
"-m",
6 changes: 6 additions & 0 deletions bin/pharokka.py
@@ -65,6 +65,12 @@ def main():

# set the gene_predictor
gene_predictor = args.gene_predictor
# set to phanotate by default and prodigal-gv in meta mode
if gene_predictor == "default":
if args.meta is True:
gene_predictor = "prodigal-gv"
else:
gene_predictor = "phanotate"

# instantiate outdir
out_dir = instantiate_dirs(args.outdir, args.meta, args.force)
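
The hunk above replaces the hard-coded `phanotate` default with a `"default"` sentinel that is resolved only once the meta flag is known. A minimal sketch of that pattern, assuming a stripped-down parser with just these two options; `build_parser` and `resolve_gene_predictor` are hypothetical names rather than functions in pharokka, and treating `-m/--meta` as a `store_true` flag is an assumption:

```python
import argparse


def build_parser():
    """Hypothetical, reduced parser: only the two options discussed here."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-g", "--gene_predictor", action="store", default="default")
    # Assumption: --meta is a boolean switch.
    parser.add_argument("-m", "--meta", action="store_true")
    return parser


def resolve_gene_predictor(gene_predictor, meta):
    """Turn the "default" sentinel into a concrete gene predictor name."""
    if gene_predictor == "default":
        return "prodigal-gv" if meta else "phanotate"
    # An explicit user choice is passed through unchanged.
    return gene_predictor


if __name__ == "__main__":
    args = build_parser().parse_args(["-m"])
    print(resolve_gene_predictor(args.gene_predictor, args.meta))
```

Because only the sentinel is rewritten, an explicit `-g prodigal` combined with `-m` still selects `prodigal` in this sketch.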
2 changes: 1 addition & 1 deletion bin/pharokka_proteins.py
@@ -82,7 +82,7 @@ def main():
logger.info("Checking dependencies.")
check_dependencies(False) # to check pharokka_proteins.py, don't need mash

# instantiation/checking fasta and gene_predictor
# instantiation/checking fasta
validate_fasta(args.infile)
validate_threads(args.threads)

21 changes: 20 additions & 1 deletion bin/post_processing.py
@@ -175,7 +175,26 @@ def process_results(self):
# left join mmseqs top hits to cds df
# read in the cds cdf
cds_file = os.path.join(self.out_dir, "cleaned_" + self.gene_predictor + ".tsv")
cds_df = pd.read_csv(cds_file, sep="\t", index_col=False)

col_list = ["start", "stop", "frame", "contig", "score", "gene"]
dtype_dict = {
"start": int,
"stop": int,
"frame": str,
"contig": str,
"score": str,
"gene": str,
}

cds_df = pd.read_csv(
cds_file,
sep="\t",
index_col=False,
names=col_list,
dtype=dtype_dict,
skiprows=1,
)
cds_df["contig"] = cds_df["contig"].astype(str)

###########################################
# add the sequence to the df for the genbank conversion later on
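
The explicit `dtype` mapping in this hunk matters because `pd.read_csv` otherwise infers a numeric type for contig IDs that merely look like numbers. The sketch below illustrates the effect on a made-up two-row table; the sample data and the `read_cds_table` helper are hypothetical and only mimic the column layout used above:

```python
import io

import pandas as pd

# Made-up table with contig IDs that resemble numbers.
SAMPLE_TSV = (
    "start\tstop\tframe\tcontig\tscore\tgene\n"
    "1\t300\t+\t007\t-5.2\tg1\n"
    "350\t900\t-\t1e5\t-3.1\tg2\n"
)


def read_cds_table(handle, force_str):
    """Read the table, optionally forcing the contig column to str."""
    dtype = {"contig": str} if force_str else None
    return pd.read_csv(handle, sep="\t", index_col=False, dtype=dtype)


inferred = read_cds_table(io.StringIO(SAMPLE_TSV), force_str=False)
explicit = read_cds_table(io.StringIO(SAMPLE_TSV), force_str=True)

if __name__ == "__main__":
    print(inferred["contig"].tolist())  # [7.0, 100000.0]: the IDs are mangled
    print(explicit["contig"].tolist())  # ['007', '1e5']: the IDs survive
```

With inference left on, the leading zeros and the scientific-notation spelling are both lost, so a later merge on `contig` would no longer match the FASTA headers.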
36 changes: 34 additions & 2 deletions bin/processes.py
@@ -364,8 +364,22 @@ def tidy_phanotate_output(out_dir):
"""
phan_file = os.path.join(out_dir, "phanotate_out.txt")
col_list = ["start", "stop", "frame", "contig", "score"]
dtype_dict = {
"start": int,
"stop": int,
"frame": str,
"contig": str,
"score": float,
}

phan_df = pd.read_csv(
phan_file, delimiter="\t", index_col=False, names=col_list, skiprows=2
phan_file,
delimiter="\t",
index_col=False,
names=col_list,
skiprows=2,
dtype=dtype_dict,
comment="#", # to skip the headers
)
# get rid of the headers and reset the index
phan_df = phan_df[phan_df["start"] != "#id:"]
@@ -408,8 +422,25 @@ def tidy_prodigal_output(out_dir, gv_flag):
"phase",
"description",
]
dtype_dict = {
"contig": str,
"prod": str,
"orf": str,
"start": int,
"stop": int,
"score": float,
"frame": str,
"phase": str,
"description": str,
}

prod_df = pd.read_csv(
prod_file, delimiter="\t", index_col=False, names=col_list, skiprows=3
prod_file,
delimiter="\t",
index_col=False,
names=col_list,
dtype=dtype_dict,
comment="#", # to skip the headers
)

# meta mode brings in some Nas so remove them
@@ -485,6 +516,7 @@ def tidy_genbank_output(out_dir, genbank_file, coding_table):
data = {"start": starts, "stop": stops, "frame": frames, "contig": contigs}

gen_df = pd.DataFrame(data)
# add fake score
gen_df["score"] = "No_score"

# get the gene
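
Both tidy functions above now pass `comment="#"` so that header lines are dropped by the parser itself, which in turn lets the integer and string dtypes be enforced at read time. A minimal sketch of that behaviour, assuming a PHANOTATE-like layout in which every contig block starts with two `#` lines; the sample text is invented for illustration and is not real tool output:

```python
import io

import pandas as pd

# Invented output: two contigs, each preceded by two '#' header lines.
SAMPLE = (
    "#id:\tcontig_1\n"
    "#START\tSTOP\tFRAME\tCONTIG\tSCORE\n"
    "1\t300\t+\tcontig_1\t-5.2\n"
    "#id:\t0002\n"
    "#START\tSTOP\tFRAME\tCONTIG\tSCORE\n"
    "10\t400\t-\t0002\t-3.1\n"
)

col_list = ["start", "stop", "frame", "contig", "score"]
dtype_dict = {"start": int, "stop": int, "frame": str, "contig": str, "score": float}

phan_df = pd.read_csv(
    io.StringIO(SAMPLE),
    delimiter="\t",
    index_col=False,
    names=col_list,
    dtype=dtype_dict,
    comment="#",  # lines that start with '#' are skipped entirely
)

if __name__ == "__main__":
    print(phan_df)
```

Only the two data rows remain, `start` and `stop` are integers, and the contig ID `0002` keeps its leading zeros.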
36 changes: 32 additions & 4 deletions bin/proteins.py
@@ -457,8 +457,10 @@ def process_dataframes(self):
phrog_annot_df["phrog"] = phrog_annot_df["phrog"].astype(str)

# merge phrog
# get only the contig id not the full description from mmseqs2 output
tophits_df["gene"] = tophits_df["gene"].str.split(" ").str[0]

tophits_df = tophits_df.merge(phrog_annot_df, on="phrog", how="left")
tophits_df = tophits_df.replace(np.nan, "No_PHROG", regex=True)
# convert no phrog to hyp protein
tophits_df["annot"] = tophits_df["annot"].str.replace(
"No_PHROG", "hypothetical protein"
@@ -468,6 +470,14 @@ def process_dataframes(self):
)

# # replace with No_PHROG if nothing found
# cast columns to str
tophits_df["mmseqs_phrog"] = tophits_df["mmseqs_phrog"].astype(str)
tophits_df["mmseqs_alnScore"] = tophits_df["mmseqs_alnScore"].astype(str)
tophits_df["mmseqs_seqIdentity"] = tophits_df["mmseqs_seqIdentity"].astype(str)
tophits_df["mmseqs_eVal"] = tophits_df["mmseqs_eVal"].astype(str)
tophits_df["mmseqs_top_hit"] = tophits_df["mmseqs_top_hit"].astype(str)
tophits_df["color"] = tophits_df["color"].astype(str)
tophits_df = tophits_df.replace(np.nan, "No_PHROG", regex=True)
tophits_df.loc[
tophits_df["mmseqs_phrog"] == "No_PHROG", "mmseqs_phrog"
] = "No_PHROG"
@@ -487,15 +497,16 @@ def process_dataframes(self):

# get phrog
tophits_df["phrog"] = tophits_df["phrog"].astype(str)
phrog_annot_df["phrog"] = phrog_annot_df["phrog"].astype(str)

# drop existing color annot category cols and
tophits_df = tophits_df.drop(columns=["color", "annot", "category"])
tophits_df = tophits_df.merge(phrog_annot_df, on="phrog", how="left")
tophits_df["annot"] = tophits_df["annot"].replace(
nan, "hypothetical protein", regex=True
np.nan, "hypothetical protein", regex=True
)
tophits_df["category"] = tophits_df["category"].replace(
nan, "unknown function", regex=True
np.nan, "unknown function", regex=True
)
tophits_df["color"] = tophits_df["color"].replace(nan, "None", regex=True)

@@ -517,7 +528,8 @@ def process_dataframes(self):
)
self.length_df = length_df

# merge the length df into the tophits
# convert nas to things

tophits_df = length_df.merge(tophits_df, on="gene", how="left")

# process vfdb results
@@ -547,6 +559,22 @@ def process_dataframes(self):

# save

tophits_df["phrog"].fillna("No_PHROG", inplace=True)
tophits_df["annot"].fillna("hypothetical protein", inplace=True)
tophits_df["category"].fillna("unknown function", inplace=True)
tophits_df["mmseqs_phrog"].fillna("No_MMseqs", inplace=True)
tophits_df["mmseqs_alnScore"].fillna("No_MMseqs", inplace=True)
tophits_df["mmseqs_seqIdentity"].fillna("No_MMseqs", inplace=True)
tophits_df["mmseqs_eVal"].fillna("No_MMseqs", inplace=True)
tophits_df["mmseqs_top_hit"].fillna("No_MMseqs_PHROG_hit", inplace=True)
tophits_df["pyhmmer_phrog"].fillna("No_PHROGs_HMM", inplace=True)
tophits_df["pyhmmer_bitscore"].fillna("No_PHROGs_HMM", inplace=True)
tophits_df["pyhmmer_evalue"].fillna("No_PHROGs_HMM", inplace=True)
tophits_df["color"].fillna("None", inplace=True)

# merge the length df into the tophits
print(tophits_df.tail())

tophits_df.to_csv(
os.path.join(self.out_dir, f"{self.prefix}_full_merged_output.tsv"),
sep="\t",
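
The per-column `fillna(..., inplace=True)` calls in the last hunk give every protein without a hit a readable placeholder after the left merge. Recent pandas releases warn about in-place methods called on a selected column, so one possible alternative is a single `DataFrame.fillna` call with a column-to-value mapping. This is a sketch under that assumption, not the commit's code; `PLACEHOLDERS` lists only a few of the columns and `fill_missing_annotations` is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

# A subset of the placeholder values used in the hunk above.
PLACEHOLDERS = {
    "phrog": "No_PHROG",
    "annot": "hypothetical protein",
    "category": "unknown function",
    "mmseqs_phrog": "No_MMseqs",
    "color": "None",
}


def fill_missing_annotations(df, placeholders=PLACEHOLDERS):
    """Return a copy of df with NaNs replaced column by column."""
    # Only keep the columns that actually exist in this frame.
    present = {col: val for col, val in placeholders.items() if col in df.columns}
    return df.fillna(value=present)


if __name__ == "__main__":
    demo = pd.DataFrame(
        {"gene": ["g1", "g2"], "phrog": ["123", np.nan], "annot": ["terminase", np.nan]}
    )
    print(fill_missing_annotations(demo))
```

The function returns a new frame and leaves its input untouched, which avoids relying on chained in-place assignment.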
2 changes: 1 addition & 1 deletion bin/version.py
@@ -1 +1 @@
__version__ = "1.7.0"
__version__ = "1.7.1"
12 changes: 12 additions & 0 deletions docs/index.md
@@ -4,6 +4,18 @@

![Image](pharokka_logo.png)

# phold

If you like `pharokka`, you will probably love [phold](https://github.com/gbouras13/phold). `phold` uses structural homology to improve phage annotation. Benchmarking is ongoing but `phold` strongly outperforms `pharokka` in terms of annotation, particularly for less characterised phages such as those from metagenomic datasets.

`pharokka` still has features `phold` lacks for now (identifying tRNA, tmRNA, CRISPR repeats, and INPHARED taxonomy search), so it is recommended to run `phold` after running `pharokka`.

`phold` takes the Genbank output of Pharokka as input. Therefore, if you have already annotated your phage(s) with Pharokka, you can easily update the annotation with more functional predictions with [phold](https://github.com/gbouras13/phold).

# Google Colab Notebooks

If you don't want to install `pharokka` or `phold` locally, you can run `pharokka` and `phold`, or only `pharokka`, without any code using the Google Colab notebook [https://colab.research.google.com/github/gbouras13/pharokka/blob/master/run_pharokka_and_phold.ipynb](https://colab.research.google.com/github/gbouras13/pharokka/blob/master/run_pharokka_and_phold.ipynb)

## Overview

`pharokka` uses [PHANOTATE](https://github.com/deprekate/PHANOTATE), the only gene prediction program tailored to bacteriophages, as the default program for gene prediction. [Prodigal](https://github.com/hyattpd/Prodigal) implemented with [pyrodigal](https://github.com/althonos/pyrodigal) and [Prodigal-gv](https://github.com/apcamargo/prodigal-gv) implemented with [pyrodigal-gv](https://github.com/althonos/pyrodigal-gv) are also available as alternatives. Following this, functional annotations are assigned by matching each predicted coding sequence (CDS) to the [PHROGs](https://phrogs.lmge.uca.fr), [CARD](https://card.mcmaster.ca) and [VFDB](http://www.mgc.ac.cn/VFs/main.htm) databases using [MMseqs2](https://github.com/soedinglab/MMseqs2). As of v1.4.0, `pharokka` will also match each CDS to the PHROGs database using more sensitive Hidden Markov Models using [PyHMMER](https://github.com/althonos/pyhmmer). Pharokka's main output is a GFF file suitable for using in downstream pangenomic pipelines like [Roary](https://sanger-pathogens.github.io/Roary/). `pharokka` also generates a `cds_functions.tsv` file, which includes counts of CDSs, tRNAs, tmRNAs, CRISPRs and functions assigned to CDSs according to the PHROGs database. See the full [usage](#usage) and check out the full [documentation](https://pharokka.readthedocs.io) for more details.
2 changes: 1 addition & 1 deletion setup.py
@@ -19,7 +19,7 @@ def package_files(directory):

setup(
name="Pharokka",
version="1.7.0",
version="1.7.1",
author="George Bouras",
author_email="george.bouras@adelaide.edu.au",
description="Fast phage annotation tool",