gbouras13 · Mar 10, 2024 · Mar 12, 2024
diff --git a/HISTORY.md b/HISTORY.md
@@ -1,7 +1,16 @@
 History
 =======
 
-1.7.0 (2024-01-17)
+1.7.1 (2024-03-13)
+------------------
+
+* Adds a Google Colab notebook that can run pharokka and [phold](https://github.com/gbouras13/phold).
+* The notebook is available at [https://colab.research.google.com/github/gbouras13/pharokka/blob/master/run_pharokka_and_phold.ipynb](https://colab.research.google.com/github/gbouras13/pharokka/blob/master/run_pharokka_and_phold.ipynb).
+* Fixes #334, where contig IDs that were in scientific notation or had leading zeros caused issues.
+* Fixes issues with `pharokka_proteins.py` not outputting PHROG annotations.
+
+
+1.7.0 (2024-03-04)
 ------------------
 
 * Adds `pharokka_multiplotter.py` to plot multiple phage contigs at once

diff --git a/README.md b/README.md
@@ -1,3 +1,5 @@
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gbouras13/pharokka/blob/master/run_pharokka_and_phold.ipynb)
+
 [![Paper](https://img.shields.io/badge/paper-Bioinformatics-teal.svg?style=flat-square&maxAge=3600)](https://doi.org/10.1093/bioinformatics/btac776)
 [![CI](https://github.com/gbouras13/pharokka/actions/workflows/ci.yaml/badge.svg)](https://github.com/gbouras13/pharokka/actions/workflows/ci.yaml)
 [![BioConda Install](https://img.shields.io/conda/dn/bioconda/pharokka.svg?style=flag&label=BioConda%20install)](https://anaconda.org/bioconda/pharokka)
@@ -23,10 +25,25 @@ Extra special thanks to Ghais Houtak for making Pharokka's logo.
 
 If you are looking for rapid standardised annotation of bacterial genomes, please use [Bakta](https://github.com/oschwengers/bakta). [Prokka](https://github.com/tseemann/prokka), which inspired the creation and naming of `pharokka`, is another good option, but Bakta is [Prokka's worthy successor](https://twitter.com/torstenseemann/status/1565471892840259585).
 
+# phold
+
+If you like `pharokka`, you will probably love [phold](https://github.com/gbouras13/phold). `phold` uses structural homology to improve phage annotation. Benchmarking is ongoing but `phold` strongly outperforms `pharokka` in terms of annotation, particularly for less characterised phages such as those from metagenomic datasets.
+
+`pharokka` still has features `phold` lacks for now (identifying tRNA, tmRNA, CRISPR repeats, and INPHARED taxonomy search), so it is recommended to run `phold` after running `pharokka`.
+
+`phold` takes the GenBank output of Pharokka as input. Therefore, if you have already annotated your phage(s) with Pharokka, you can easily update the annotation with additional functional predictions using [phold](https://github.com/gbouras13/phold).
+
+# Google Colab Notebooks
+
+If you don't want to install `pharokka` or `phold` locally, you can run `pharokka` and `phold`, or only `pharokka`, without writing any code using the Google Colab notebook at [https://colab.research.google.com/github/gbouras13/pharokka/blob/master/run_pharokka_and_phold.ipynb](https://colab.research.google.com/github/gbouras13/pharokka/blob/master/run_pharokka_and_phold.ipynb).
+
+
 # Table of Contents
 
 - [pharokka](#pharokka)
   - [Fast Phage Annotation Tool](#fast-phage-annotation-tool)
+- [phold](#phold)
+- [Google Colab Notebooks](#google-colab-notebooks)
 - [Table of Contents](#table-of-contents)
 - [Quick Start](#quick-start)
 - [Documentation](#documentation)

diff --git a/bin/input_commands.py b/bin/input_commands.py
@@ -61,8 +61,8 @@ def get_input():
         "-g",
         "--gene_predictor",
         action="store",
-        help='User specified gene predictor. Use "-g phanotate" or "-g prodigal" or "-g prodigal-gv" or "-g genbank". \nDefaults to phanotate (not required unless prodigal is desired).',
-        default="phanotate",
+        help='User specified gene predictor. Use "-g phanotate" or "-g prodigal" or "-g prodigal-gv" or "-g genbank". \nDefaults to phanotate, or to prodigal-gv in meta mode.',
+        default="default",
     )
     parser.add_argument(
         "-m",

diff --git a/bin/pharokka.py b/bin/pharokka.py
@@ -65,6 +65,12 @@ def main():
 
     # set the gene_predictor
     gene_predictor = args.gene_predictor
+    # set to phanotate by default and prodigal-gv in meta mode
+    if gene_predictor == "default":
+        if args.meta is True:
+            gene_predictor = "prodigal-gv"
+        else:
+            gene_predictor = "phanotate"
 
     # instantiate outdir
     out_dir = instantiate_dirs(args.outdir, args.meta, args.force)
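
The two hunks above replace a fixed default with a `"default"` sentinel that is resolved once the meta flag is known. A minimal sketch of that pattern follows. The helper `resolve_gene_predictor`, the trimmed-down `build_parser`, and the assumption that `-m/--meta` is a boolean `store_true` flag are illustrative and not taken from the patch.

```python
import argparse


def resolve_gene_predictor(requested: str, meta: bool) -> str:
    """Map the "default" sentinel to a concrete gene predictor (hypothetical helper)."""
    if requested != "default":
        return requested  # an explicit -g choice always wins
    return "prodigal-gv" if meta else "phanotate"


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser with only the two options needed for this sketch.
    parser = argparse.ArgumentParser()
    parser.add_argument("-g", "--gene_predictor", action="store", default="default")
    parser.add_argument("-m", "--meta", action="store_true")  # assumed to be a boolean flag
    return parser


if __name__ == "__main__":
    for argv in ([], ["-m"], ["-m", "-g", "phanotate"]):
        args = build_parser().parse_args(argv)
        print(argv, "->", resolve_gene_predictor(args.gene_predictor, args.meta))
```

Using a sentinel rather than `default="phanotate"` lets the code tell "the user asked for phanotate" apart from "the user did not choose", which is what makes a mode-dependent default possible.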

diff --git a/bin/pharokka_proteins.py b/bin/pharokka_proteins.py
@@ -82,7 +82,7 @@ def main():
     logger.info("Checking dependencies.")
     check_dependencies(False)  # to check pharokka_proteins.py, don't need mash
 
-    # instantiation/checking fasta and gene_predictor
+    # instantiation/checking fasta
     validate_fasta(args.infile)
     validate_threads(args.threads)
 

diff --git a/bin/post_processing.py b/bin/post_processing.py
@@ -175,7 +175,26 @@ def process_results(self):
         # left join mmseqs top hits to cds df
         # read in the cds cdf
         cds_file = os.path.join(self.out_dir, "cleaned_" + self.gene_predictor + ".tsv")
-        cds_df = pd.read_csv(cds_file, sep="\t", index_col=False)
+
+        col_list = ["start", "stop", "frame", "contig", "score", "gene"]
+        dtype_dict = {
+            "start": int,
+            "stop": int,
+            "frame": str,
+            "contig": str,
+            "score": str,
+            "gene": str,
+        }
+
+        cds_df = pd.read_csv(
+            cds_file,
+            sep="\t",
+            index_col=False,
+            names=col_list,
+            dtype=dtype_dict,
+            skiprows=1,
+        )
+        cds_df["contig"] = cds_df["contig"].astype(str)
 
         ###########################################
         # add the sequence to the df for the genbank conversion later on
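
Pinning `contig` to `str` matters because pandas infers column types when none are given. The sketch below uses an invented three-column table (not a real pharokka file) to show how contig IDs with leading zeros or an exponent-like form lose their original text under inference and keep it when the dtype is pinned.

```python
import io

import pandas as pd

# Made-up table whose contig IDs look numeric.
tsv = "start\tstop\tcontig\n1\t90\t0012\n5\t300\t1e5\n"

# Type inference: the contig column is parsed as numbers.
inferred = pd.read_csv(io.StringIO(tsv), sep="\t")

# Pinned dtype: the contig column keeps the exact text from the file.
pinned = pd.read_csv(io.StringIO(tsv), sep="\t", dtype={"contig": str})

print(inferred["contig"].tolist())
print(pinned["contig"].tolist())
```

Once an ID has been converted to a number, the original spelling cannot be recovered, so later joins against the FASTA headers would no longer match.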

diff --git a/bin/processes.py b/bin/processes.py
@@ -364,8 +364,22 @@ def tidy_phanotate_output(out_dir):
     """
     phan_file = os.path.join(out_dir, "phanotate_out.txt")
     col_list = ["start", "stop", "frame", "contig", "score"]
+    dtype_dict = {
+        "start": int,
+        "stop": int,
+        "frame": str,
+        "contig": str,
+        "score": float,
+    }
+
     phan_df = pd.read_csv(
-        phan_file, delimiter="\t", index_col=False, names=col_list, skiprows=2
+        phan_file,
+        delimiter="\t",
+        index_col=False,
+        names=col_list,
+        skiprows=2,
+        dtype=dtype_dict,
+        comment="#",  # to skip the headers
     )
     # get rid of the headers and reset the index
     phan_df = phan_df[phan_df["start"] != "#id:"]
@@ -408,8 +422,25 @@ def tidy_prodigal_output(out_dir, gv_flag):
         "phase",
         "description",
     ]
+    dtype_dict = {
+        "contig": str,
+        "prod": str,
+        "orf": str,
+        "start": int,
+        "stop": int,
+        "score": float,
+        "frame": str,
+        "phase": str,
+        "description": str,
+    }
+
     prod_df = pd.read_csv(
-        prod_file, delimiter="\t", index_col=False, names=col_list, skiprows=3
+        prod_file,
+        delimiter="\t",
+        index_col=False,
+        names=col_list,
+        dtype=dtype_dict,
+        comment="#",  # to skip the headers
     )
 
     # meta mode brings in some Nas so remove them
@@ -485,6 +516,7 @@ def tidy_genbank_output(out_dir, genbank_file, coding_table):
     data = {"start": starts, "stop": stops, "frame": frames, "contig": contigs}
 
     gen_df = pd.DataFrame(data)
+    # add a placeholder score, since GenBank input carries no gene predictor score
     gen_df["score"] = "No_score"
 
     # get the gene
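
The `comment="#"` argument is what allows strict `int` and `float` dtypes here: pandas drops fully commented lines before parsing, so header lines never reach the typed columns. A minimal sketch on invented, PHANOTATE-like text follows. The exact header layout is an assumption, including a header that recurs mid-file.

```python
import io

import pandas as pd

# Made-up text: "#" header lines, including one in the middle, around data rows.
raw = (
    "#id:\tcontig_0012\n"
    "#START\tSTOP\tFRAME\tCONTIG\tSCORE\n"
    "1\t90\t+\t0012\t-1.5\n"
    "120\t300\t-\t0012\t-22.0\n"
    "#id:\tcontig_0013\n"
    "5\t200\t+\t0013\t-3.0\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    delimiter="\t",
    index_col=False,
    names=["start", "stop", "frame", "contig", "score"],
    dtype={"start": int, "stop": int, "frame": str, "contig": str, "score": float},
    comment="#",  # lines starting with "#" are skipped wherever they occur
)

print(df)
```

A fixed `skiprows` count only handles headers at the top of the file, whereas `comment="#"` also covers headers that appear between records.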

diff --git a/bin/proteins.py b/bin/proteins.py
@@ -457,8 +457,10 @@ def process_dataframes(self):
         phrog_annot_df["phrog"] = phrog_annot_df["phrog"].astype(str)
 
         # merge phrog
+        # keep only the sequence id (text before the first space), not the full description from the mmseqs2 output
+        tophits_df["gene"] = tophits_df["gene"].str.split(" ").str[0]
+
         tophits_df = tophits_df.merge(phrog_annot_df, on="phrog", how="left")
-        tophits_df = tophits_df.replace(np.nan, "No_PHROG", regex=True)
         # convert no phrog to hyp protein
         tophits_df["annot"] = tophits_df["annot"].str.replace(
             "No_PHROG", "hypothetical protein"
@@ -468,6 +470,14 @@ def process_dataframes(self):
         )
 
         # # replace with No_PHROG if nothing found
+        # cast columns to str
+        tophits_df["mmseqs_phrog"] = tophits_df["mmseqs_phrog"].astype(str)
+        tophits_df["mmseqs_alnScore"] = tophits_df["mmseqs_alnScore"].astype(str)
+        tophits_df["mmseqs_seqIdentity"] = tophits_df["mmseqs_seqIdentity"].astype(str)
+        tophits_df["mmseqs_eVal"] = tophits_df["mmseqs_eVal"].astype(str)
+        tophits_df["mmseqs_top_hit"] = tophits_df["mmseqs_top_hit"].astype(str)
+        tophits_df["color"] = tophits_df["color"].astype(str)
+        tophits_df = tophits_df.replace(np.nan, "No_PHROG", regex=True)
         tophits_df.loc[
             tophits_df["mmseqs_phrog"] == "No_PHROG", "mmseqs_phrog"
         ] = "No_PHROG"
@@ -487,15 +497,16 @@ def process_dataframes(self):
 
         # get phrog
         tophits_df["phrog"] = tophits_df["phrog"].astype(str)
+        phrog_annot_df["phrog"] = phrog_annot_df["phrog"].astype(str)
 
         # drop existing color annot category cols and
         tophits_df = tophits_df.drop(columns=["color", "annot", "category"])
         tophits_df = tophits_df.merge(phrog_annot_df, on="phrog", how="left")
         tophits_df["annot"] = tophits_df["annot"].replace(
-            nan, "hypothetical protein", regex=True
+            np.nan, "hypothetical protein", regex=True
         )
         tophits_df["category"] = tophits_df["category"].replace(
-            nan, "unknown function", regex=True
+            np.nan, "unknown function", regex=True
         )
         tophits_df["color"] = tophits_df["color"].replace(nan, "None", regex=True)
 
@@ -517,7 +528,8 @@ def process_dataframes(self):
         )
         self.length_df = length_df
 
-        # merge the length df into the tophits
+        # merge the tophits into the length df, keeping every row of length_df
+
         tophits_df = length_df.merge(tophits_df, on="gene", how="left")
 
         # process vfdb results
@@ -547,6 +559,22 @@ def process_dataframes(self):
 
         # save
 
+        tophits_df["phrog"].fillna("No_PHROG", inplace=True)
+        tophits_df["annot"].fillna("hypothetical protein", inplace=True)
+        tophits_df["category"].fillna("unknown function", inplace=True)
+        tophits_df["mmseqs_phrog"].fillna("No_MMseqs", inplace=True)
+        tophits_df["mmseqs_alnScore"].fillna("No_MMseqs", inplace=True)
+        tophits_df["mmseqs_seqIdentity"].fillna("No_MMseqs", inplace=True)
+        tophits_df["mmseqs_eVal"].fillna("No_MMseqs", inplace=True)
+        tophits_df["mmseqs_top_hit"].fillna("No_MMseqs_PHROG_hit", inplace=True)
+        tophits_df["pyhmmer_phrog"].fillna("No_PHROGs_HMM", inplace=True)
+        tophits_df["pyhmmer_bitscore"].fillna("No_PHROGs_HMM", inplace=True)
+        tophits_df["pyhmmer_evalue"].fillna("No_PHROGs_HMM", inplace=True)
+        tophits_df["color"].fillna("None", inplace=True)
+
+        # debug output: preview the last rows of the final table before saving
+        print(tophits_df.tail())
+
         tophits_df.to_csv(
             os.path.join(self.out_dir, f"{self.prefix}_full_merged_output.tsv"),
             sep="\t",
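
The changes in this file combine three steps: trimming the `gene` key to its ID, left-joining onto the length table, and filling the gaps with placeholders. A compact sketch of that flow on invented data follows. The column names mirror the diff, but the values and the single dict-based `fillna` call are illustrative rather than the patch's own code.

```python
import pandas as pd

# Made-up stand-ins for the two tables.
length_df = pd.DataFrame({"gene": ["prot_1", "prot_2"], "length": [300, 120]})
tophits_df = pd.DataFrame(
    {
        "gene": ["prot_1 some free-text description"],
        "phrog": ["42"],
        "annot": ["terminase"],
    }
)

# Keep only the ID before the first space so the merge keys can match.
tophits_df["gene"] = tophits_df["gene"].str.split(" ").str[0]

# Left join: every protein in length_df survives, with or without a hit.
merged = length_df.merge(tophits_df, on="gene", how="left")

# One fillna call with a column -> placeholder mapping; it returns a new frame.
merged = merged.fillna({"phrog": "No_PHROG", "annot": "hypothetical protein"})

print(merged)
```

Passing a mapping to `DataFrame.fillna` avoids calling `fillna(..., inplace=True)` on a selected column, a pattern that newer pandas versions warn may not update the parent DataFrame once Copy-on-Write is enabled.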

diff --git a/bin/version.py b/bin/version.py
@@ -1 +1 @@
-__version__ = "1.7.0"
+__version__ = "1.7.1"
diff --git a/docs/index.md b/docs/index.md
@@ -4,6 +4,18 @@
 
 ![Image](pharokka_logo.png)
 
+# phold
+
+If you like `pharokka`, you will probably love [phold](https://github.com/gbouras13/phold). `phold` uses structural homology to improve phage annotation. Benchmarking is ongoing but `phold` strongly outperforms `pharokka` in terms of annotation, particularly for less characterised phages such as those from metagenomic datasets.
+
+`pharokka` still has features `phold` lacks for now (identifying tRNA, tmRNA, CRISPR repeats, and INPHARED taxonomy search), so it is recommended to run `phold` after running `pharokka`.
+
+`phold` takes the GenBank output of Pharokka as input. Therefore, if you have already annotated your phage(s) with Pharokka, you can easily update the annotation with additional functional predictions using [phold](https://github.com/gbouras13/phold).
+
+# Google Colab Notebooks
+
+If you don't want to install `pharokka` or `phold` locally, you can run `pharokka` and `phold`, or only `pharokka`, without writing any code using the Google Colab notebook at [https://colab.research.google.com/github/gbouras13/pharokka/blob/master/run_pharokka_and_phold.ipynb](https://colab.research.google.com/github/gbouras13/pharokka/blob/master/run_pharokka_and_phold.ipynb).
+
 ## Overview
 
 `pharokka` uses [PHANOTATE](https://github.com/deprekate/PHANOTATE), the only gene prediction program tailored to bacteriophages, as the default program for gene prediction. [Prodigal](https://github.com/hyattpd/Prodigal) implemented with [pyrodigal](https://github.com/althonos/pyrodigal) and [Prodigal-gv](https://github.com/apcamargo/prodigal-gv) implemented with [pyrodigal-gv](https://github.com/althonos/pyrodigal-gv) are also available as alternatives. Following this, functional annotations are assigned by matching each predicted coding sequence (CDS) to the [PHROGs](https://phrogs.lmge.uca.fr), [CARD](https://card.mcmaster.ca) and [VFDB](http://www.mgc.ac.cn/VFs/main.htm) databases using [MMseqs2](https://github.com/soedinglab/MMseqs2). As of v1.4.0, `pharokka` will also match each CDS to the PHROGs database using more sensitive Hidden Markov Models using [PyHMMER](https://github.com/althonos/pyhmmer). Pharokka's main output is a GFF file suitable for using in downstream pangenomic pipelines like [Roary](https://sanger-pathogens.github.io/Roary/). `pharokka` also generates a `cds_functions.tsv` file, which includes counts of CDSs, tRNAs, tmRNAs, CRISPRs and functions assigned to CDSs according to the PHROGs database. See the full [usage](#usage) and check out the full [documentation](https://pharokka.readthedocs.io) for more details.  

diff --git a/setup.py b/setup.py
@@ -19,7 +19,7 @@ def package_files(directory):
 
 setup(
     name="Pharokka",
-    version="1.7.0",
+    version="1.7.1",
     author="George Bouras",
     author_email="george.bouras@adelaide.edu.au",
     description="Fast phage annotation tool",