v1.7.1 #336

Merged
merged 10 commits into from
Mar 12, 2024
11 changes: 10 additions & 1 deletion HISTORY.md
@@ -1,7 +1,16 @@
History
=======

1.7.0 (2024-01-17)
1.7.1 (2024-03-13)
------------------

* Adds Google Colab notebook that can run pharokka and [phold](https://github.com/gbouras13/phold).
* The notebook is [https://colab.research.google.com/github/gbouras13/pharokka/blob/master/run_pharokka_and_phold.ipynb](https://colab.research.google.com/github/gbouras13/pharokka/blob/master/run_pharokka_and_phold.ipynb)
* Fixes #334: issues with contig ids that were in scientific notation or had leading 0s.
* Fixes issues with `pharokka_proteins.py` not outputting PHROG annotations.


1.7.0 (2024-03-04)
------------------

* Adds `pharokka_multiplotter.py` to plot multiple phage contigs at once
17 changes: 17 additions & 0 deletions README.md
@@ -1,3 +1,5 @@
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gbouras13/pharokka/blob/master/run_pharokka_and_phold.ipynb)

[![Paper](https://img.shields.io/badge/paper-Bioinformatics-teal.svg?style=flat-square&maxAge=3600)](https://doi.org/10.1093/bioinformatics/btac776)
[![CI](https://github.com/gbouras13/pharokka/actions/workflows/ci.yaml/badge.svg)](https://github.com/gbouras13/pharokka/actions/workflows/ci.yaml)
[![BioConda Install](https://img.shields.io/conda/dn/bioconda/pharokka.svg?style=flag&label=BioConda%20install)](https://anaconda.org/bioconda/pharokka)
@@ -23,10 +25,25 @@ Extra special thanks to Ghais Houtak for making Pharokka's logo.

If you are looking for rapid standardised annotation of bacterial genomes, please use [Bakta](https://github.com/oschwengers/bakta). [Prokka](https://github.com/tseemann/prokka), which inspired the creation and naming of `pharokka`, is another good option, but Bakta is [Prokka's worthy successor](https://twitter.com/torstenseemann/status/1565471892840259585).

# phold

If you like `pharokka`, you will probably love [phold](https://github.com/gbouras13/phold). `phold` uses structural homology to improve phage annotation. Benchmarking is ongoing but `phold` strongly outperforms `pharokka` in terms of annotation, particularly for less characterised phages such as those from metagenomic datasets.

`pharokka` still has features `phold` lacks for now (identifying tRNA, tmRNA, CRISPR repeats, and INPHARED taxonomy search), so it is recommended to run `phold` after running `pharokka`.

`phold` takes the Genbank output of Pharokka as input. Therefore, if you have already annotated your phage(s) with Pharokka, you can easily update the annotation with more functional predictions with [phold](https://github.com/gbouras13/phold).

# Google Colab Notebooks

If you don't want to install `pharokka` or `phold` locally, you can run `pharokka` and `phold`, or only `pharokka`, without any code using the Google Colab notebook [https://colab.research.google.com/github/gbouras13/pharokka/blob/master/run_pharokka_and_phold.ipynb](https://colab.research.google.com/github/gbouras13/pharokka/blob/master/run_pharokka_and_phold.ipynb)


# Table of Contents

- [pharokka](#pharokka)
- [Fast Phage Annotation Tool](#fast-phage-annotation-tool)
- [phold](#phold)
- [Google Colab Notebooks](#google-colab-notebooks)
- [Table of Contents](#table-of-contents)
- [Quick Start](#quick-start)
- [Documentation](#documentation)
4 changes: 2 additions & 2 deletions bin/input_commands.py
@@ -61,8 +61,8 @@ def get_input():
"-g",
"--gene_predictor",
action="store",
help='User specified gene predictor. Use "-g phanotate" or "-g prodigal" or "-g prodigal-gv" or "-g genbank". \nDefaults to phanotate (not required unless prodigal is desired).',
default="phanotate",
help='User specified gene predictor. Use "-g phanotate" or "-g prodigal" or "-g prodigal-gv" or "-g genbank". \nDefaults to phanotate usually and prodigal-gv in meta mode.',
default="default",
)
parser.add_argument(
"-m",
6 changes: 6 additions & 0 deletions bin/pharokka.py
@@ -65,6 +65,12 @@ def main():

# set the gene_predictor
gene_predictor = args.gene_predictor
# set to phanotate by default and prodigal-gv in meta mode
if gene_predictor == "default":
    if args.meta is True:
        gene_predictor = "prodigal-gv"
    else:
        gene_predictor = "phanotate"
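
The sentinel-default pattern in this hunk can be sketched in isolation. This is a minimal illustration, not pharokka's actual parser: `resolve_gene_predictor` and `build_parser` are hypothetical helpers, and `--meta` is assumed to be a `store_true` flag.

```python
import argparse


def resolve_gene_predictor(gene_predictor: str, meta: bool) -> str:
    """Map the "default" sentinel to a concrete predictor name."""
    if gene_predictor == "default":
        return "prodigal-gv" if meta else "phanotate"
    # an explicit user choice is passed through untouched
    return gene_predictor


def build_parser() -> argparse.ArgumentParser:
    # hypothetical, trimmed-down parser: only the two options needed here
    parser = argparse.ArgumentParser()
    parser.add_argument("-g", "--gene_predictor", action="store", default="default")
    parser.add_argument("-m", "--meta", action="store_true")  # assumed flag type
    return parser


args = build_parser().parse_args(["-m"])
chosen = resolve_gene_predictor(args.gene_predictor, args.meta)
print(chosen)  # prodigal-gv
```

Using a sentinel rather than `default="phanotate"` lets the code tell "the user asked for phanotate" apart from "the user gave no `-g` at all". With the old default those two cases looked identical, so meta mode could not choose its own predictor.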

# instantiate outdir
out_dir = instantiate_dirs(args.outdir, args.meta, args.force)
2 changes: 1 addition & 1 deletion bin/pharokka_proteins.py
@@ -82,7 +82,7 @@ def main():
logger.info("Checking dependencies.")
check_dependencies(False) # to check pharokka_proteins.py, don't need mash

# instantiation/checking fasta and gene_predictor
# instantiation/checking fasta
validate_fasta(args.infile)
validate_threads(args.threads)

21 changes: 20 additions & 1 deletion bin/post_processing.py
@@ -175,7 +175,26 @@ def process_results(self):
# left join mmseqs top hits to cds df
# read in the cds cdf
cds_file = os.path.join(self.out_dir, "cleaned_" + self.gene_predictor + ".tsv")
cds_df = pd.read_csv(cds_file, sep="\t", index_col=False)

col_list = ["start", "stop", "frame", "contig", "score", "gene"]
dtype_dict = {
"start": int,
"stop": int,
"frame": str,
"contig": str,
"score": str,
"gene": str,
}

cds_df = pd.read_csv(
cds_file,
sep="\t",
index_col=False,
names=col_list,
dtype=dtype_dict,
skiprows=1,
)
cds_df["contig"] = cds_df["contig"].astype(str)
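
The explicit `dtype` mapping is what appears to address the contig-id problem from #334. A small sketch of the failure mode and the fix, using a made-up three-column table rather than pharokka's real intermediate file:

```python
import io

import pandas as pd

# made-up table: contig ids that a type-inferring parser reads as numbers
tsv = "start\tstop\tcontig\n1\t90\t1e5\n100\t250\t007\n"
col_list = ["start", "stop", "contig"]

# no dtype given: pandas infers a numeric column and rewrites the ids
inferred = pd.read_csv(io.StringIO(tsv), sep="\t", names=col_list, skiprows=1)

# explicit dtype: the contig column is read as text, exactly as written
typed = pd.read_csv(
    io.StringIO(tsv),
    sep="\t",
    names=col_list,
    skiprows=1,
    dtype={"start": int, "stop": int, "contig": str},
)

print(inferred["contig"].tolist())  # numeric values, no longer the original ids
print(typed["contig"].tolist())  # ['1e5', '007']
```

Once the ids are parsed as numbers, the original spelling is lost, and a later `astype(str)` cannot bring back the leading zeros or the `1e5` form. The string type therefore has to be requested at read time.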

###########################################
# add the sequence to the df for the genbank conversion later on
36 changes: 34 additions & 2 deletions bin/processes.py
@@ -364,8 +364,22 @@ def tidy_phanotate_output(out_dir):
"""
phan_file = os.path.join(out_dir, "phanotate_out.txt")
col_list = ["start", "stop", "frame", "contig", "score"]
dtype_dict = {
"start": int,
"stop": int,
"frame": str,
"contig": str,
"score": float,
}

phan_df = pd.read_csv(
phan_file, delimiter="\t", index_col=False, names=col_list, skiprows=2
phan_file,
delimiter="\t",
index_col=False,
names=col_list,
skiprows=2,
dtype=dtype_dict,
comment="#", # to skip the headers
)
# get rid of the headers and reset the index
phan_df = phan_df[phan_df["start"] != "#id:"]
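
A sketch of why `comment="#"` matters once `start` is declared as `int`. In multi-contig output the `#` header lines repeat for each contig, and an integer column cannot hold a value like `#id:`. The input below is made up to resemble that layout and is not real PHANOTATE output.

```python
import io

import pandas as pd

# made-up two-contig output: each contig block starts with '#' header lines
raw = (
    "#id:\tcontig_1\n"
    "#START\tSTOP\tFRAME\tCONTIG\tSCORE\n"
    "1\t90\t+\tcontig_1\t-1.5\n"
    "#id:\tcontig_2\n"
    "#START\tSTOP\tFRAME\tCONTIG\tSCORE\n"
    "5\t200\t-\tcontig_2\t-3.0\n"
)

col_list = ["start", "stop", "frame", "contig", "score"]
dtype_dict = {"start": int, "stop": int, "frame": str, "contig": str, "score": float}

df = pd.read_csv(
    io.StringIO(raw),
    delimiter="\t",
    index_col=False,
    names=col_list,
    dtype=dtype_dict,
    comment="#",  # drop every line that starts with '#', wherever it occurs
)
print(df)
```

According to the pandas documentation, `skiprows` still counts fully commented lines. Keeping `skiprows=2` alongside `comment="#"`, as the phanotate hunk does, should therefore only skip header lines that the comment filter would have dropped anyway.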
@@ -408,8 +422,25 @@ def tidy_prodigal_output(out_dir, gv_flag):
"phase",
"description",
]
dtype_dict = {
"contig": str,
"prod": str,
"orf": str,
"start": int,
"stop": int,
"score": float,
"frame": str,
"phase": str,
"description": str,
}

prod_df = pd.read_csv(
prod_file, delimiter="\t", index_col=False, names=col_list, skiprows=3
prod_file,
delimiter="\t",
index_col=False,
names=col_list,
dtype=dtype_dict,
comment="#", # to skip the headers
)

# meta mode brings in some Nas so remove them
@@ -485,6 +516,7 @@ def tidy_genbank_output(out_dir, genbank_file, coding_table):
data = {"start": starts, "stop": stops, "frame": frames, "contig": contigs}

gen_df = pd.DataFrame(data)
# add fake score
gen_df["score"] = "No_score"

# get the gene
36 changes: 32 additions & 4 deletions bin/proteins.py
@@ -457,8 +457,10 @@ def process_dataframes(self):
phrog_annot_df["phrog"] = phrog_annot_df["phrog"].astype(str)

# merge phrog
# get only the contig id not the full description from mmseqs2 output
tophits_df["gene"] = tophits_df["gene"].str.split(" ").str[0]
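
The id-trimming step can be sketched on made-up data. It assumes, as the comment above suggests, that the query name may carry the full FASTA description while other tables are keyed on the bare id. The `length_df` contents here are invented for illustration.

```python
import pandas as pd

# made-up hits: the first query name carries a description after the id
tophits_df = pd.DataFrame(
    {"gene": ["prot_1 putative tail fibre", "prot_2"], "phrog": ["10", "No_PHROG"]}
)
# made-up length table keyed on the bare id
length_df = pd.DataFrame({"gene": ["prot_1", "prot_2"], "length": [300, 120]})

# keep only the text before the first space so both tables share one key
tophits_df["gene"] = tophits_df["gene"].str.split(" ").str[0]

merged = length_df.merge(tophits_df, on="gene", how="left")
print(merged)
```

Without the trim, `prot_1 putative tail fibre` would not match `prot_1` in the left join, and that row's PHROG fields would come back empty. That would be consistent with the missing-annotation symptom noted in the changelog.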

tophits_df = tophits_df.merge(phrog_annot_df, on="phrog", how="left")
tophits_df = tophits_df.replace(np.nan, "No_PHROG", regex=True)
# convert no phrog to hyp protein
tophits_df["annot"] = tophits_df["annot"].str.replace(
"No_PHROG", "hypothetical protein"
@@ -468,6 +470,14 @@ def process_dataframes(self):
)

# # replace with No_PHROG if nothing found
# cast columns to str
tophits_df["mmseqs_phrog"] = tophits_df["mmseqs_phrog"].astype(str)
tophits_df["mmseqs_alnScore"] = tophits_df["mmseqs_alnScore"].astype(str)
tophits_df["mmseqs_seqIdentity"] = tophits_df["mmseqs_seqIdentity"].astype(str)
tophits_df["mmseqs_eVal"] = tophits_df["mmseqs_eVal"].astype(str)
tophits_df["mmseqs_top_hit"] = tophits_df["mmseqs_top_hit"].astype(str)
tophits_df["color"] = tophits_df["color"].astype(str)
tophits_df = tophits_df.replace(np.nan, "No_PHROG", regex=True)
tophits_df.loc[
tophits_df["mmseqs_phrog"] == "No_PHROG", "mmseqs_phrog"
] = "No_PHROG"
@@ -487,15 +497,16 @@ def process_dataframes(self):

# get phrog
tophits_df["phrog"] = tophits_df["phrog"].astype(str)
phrog_annot_df["phrog"] = phrog_annot_df["phrog"].astype(str)

# drop existing color annot category cols and
tophits_df = tophits_df.drop(columns=["color", "annot", "category"])
tophits_df = tophits_df.merge(phrog_annot_df, on="phrog", how="left")
tophits_df["annot"] = tophits_df["annot"].replace(
nan, "hypothetical protein", regex=True
np.nan, "hypothetical protein", regex=True
)
tophits_df["category"] = tophits_df["category"].replace(
nan, "unknown function", regex=True
np.nan, "unknown function", regex=True
)
tophits_df["color"] = tophits_df["color"].replace(nan, "None", regex=True)

@@ -517,7 +528,8 @@ def process_dataframes(self):
)
self.length_df = length_df

# merge the length df into the tophits
# convert nas to things

tophits_df = length_df.merge(tophits_df, on="gene", how="left")

# process vfdb results
@@ -547,6 +559,22 @@ def process_dataframes(self):

# save

tophits_df["phrog"].fillna("No_PHROG", inplace=True)
tophits_df["annot"].fillna("hypothetical protein", inplace=True)
tophits_df["category"].fillna("unknown function", inplace=True)
tophits_df["mmseqs_phrog"].fillna("No_MMseqs", inplace=True)
tophits_df["mmseqs_alnScore"].fillna("No_MMseqs", inplace=True)
tophits_df["mmseqs_seqIdentity"].fillna("No_MMseqs", inplace=True)
tophits_df["mmseqs_eVal"].fillna("No_MMseqs", inplace=True)
tophits_df["mmseqs_top_hit"].fillna("No_MMseqs_PHROG_hit", inplace=True)
tophits_df["pyhmmer_phrog"].fillna("No_PHROGs_HMM", inplace=True)
tophits_df["pyhmmer_bitscore"].fillna("No_PHROGs_HMM", inplace=True)
tophits_df["pyhmmer_evalue"].fillna("No_PHROGs_HMM", inplace=True)
tophits_df["color"].fillna("None", inplace=True)
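
The same per-column placeholders can also be applied in one call by passing a dict to `DataFrame.fillna`. This is an alternative sketch on made-up data, not what the PR does. One reason to consider it: recent pandas releases warn that `df["col"].fillna(..., inplace=True)` acts on an intermediate object and will stop updating `df` under Copy-on-Write.

```python
import numpy as np
import pandas as pd

# made-up merged table: the second protein had no hit, so its fields are NaN
tophits_df = pd.DataFrame(
    {
        "gene": ["prot_1", "prot_2"],
        "phrog": ["10", np.nan],
        "annot": ["tail fibre protein", np.nan],
        "mmseqs_top_hit": ["phrog_10_p1", np.nan],
    }
)

# one placeholder per column, applied in a single non-inplace call
placeholders = {
    "phrog": "No_PHROG",
    "annot": "hypothetical protein",
    "mmseqs_top_hit": "No_MMseqs_PHROG_hit",
}
tophits_df = tophits_df.fillna(value=placeholders)
print(tophits_df)
```

Columns not named in the dict are left untouched, so the mapping doubles as a readable list of which placeholder belongs to which field.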

# merge the length df into the tophits
print(tophits_df.tail())

tophits_df.to_csv(
os.path.join(self.out_dir, f"{self.prefix}_full_merged_output.tsv"),
sep="\t",
2 changes: 1 addition & 1 deletion bin/version.py
@@ -1 +1 @@
__version__ = "1.7.0"
__version__ = "1.7.1"
12 changes: 12 additions & 0 deletions docs/index.md
@@ -4,6 +4,18 @@

![Image](pharokka_logo.png)

# phold

If you like `pharokka`, you will probably love [phold](https://github.com/gbouras13/phold). `phold` uses structural homology to improve phage annotation. Benchmarking is ongoing but `phold` strongly outperforms `pharokka` in terms of annotation, particularly for less characterised phages such as those from metagenomic datasets.

`pharokka` still has features `phold` lacks for now (identifying tRNA, tmRNA, CRISPR repeats, and INPHARED taxonomy search), so it is recommended to run `phold` after running `pharokka`.

`phold` takes the Genbank output of Pharokka as input. Therefore, if you have already annotated your phage(s) with Pharokka, you can easily update the annotation with more functional predictions with [phold](https://github.com/gbouras13/phold).

# Google Colab Notebooks

If you don't want to install `pharokka` or `phold` locally, you can run `pharokka` and `phold`, or only `pharokka`, without any code using the Google Colab notebook [https://colab.research.google.com/github/gbouras13/pharokka/blob/master/run_pharokka_and_phold.ipynb](https://colab.research.google.com/github/gbouras13/pharokka/blob/master/run_pharokka_and_phold.ipynb)

## Overview

`pharokka` uses [PHANOTATE](https://github.com/deprekate/PHANOTATE), the only gene prediction program tailored to bacteriophages, as the default program for gene prediction. [Prodigal](https://github.com/hyattpd/Prodigal) implemented with [pyrodigal](https://github.com/althonos/pyrodigal) and [Prodigal-gv](https://github.com/apcamargo/prodigal-gv) implemented with [pyrodigal-gv](https://github.com/althonos/pyrodigal-gv) are also available as alternatives. Following this, functional annotations are assigned by matching each predicted coding sequence (CDS) to the [PHROGs](https://phrogs.lmge.uca.fr), [CARD](https://card.mcmaster.ca) and [VFDB](http://www.mgc.ac.cn/VFs/main.htm) databases using [MMseqs2](https://github.com/soedinglab/MMseqs2). As of v1.4.0, `pharokka` will also match each CDS to the PHROGs database using more sensitive Hidden Markov Models using [PyHMMER](https://github.com/althonos/pyhmmer). Pharokka's main output is a GFF file suitable for using in downstream pangenomic pipelines like [Roary](https://sanger-pathogens.github.io/Roary/). `pharokka` also generates a `cds_functions.tsv` file, which includes counts of CDSs, tRNAs, tmRNAs, CRISPRs and functions assigned to CDSs according to the PHROGs database. See the full [usage](#usage) and check out the full [documentation](https://pharokka.readthedocs.io) for more details.
2 changes: 1 addition & 1 deletion setup.py
@@ -19,7 +19,7 @@ def package_files(directory):

setup(
name="Pharokka",
version="1.7.0",
version="1.7.1",
author="George Bouras",
author_email="george.bouras@adelaide.edu.au",
description="Fast phage annotation tool",