Skip to content

Commit

Permalink
add notebook
Browse files Browse the repository at this point in the history
  • Loading branch information
gbouras13 committed Mar 12, 2024
2 parents db9e7e9 + 2ef31d3 commit ca1867f
Showing 1 changed file with 394 additions and 0 deletions.
394 changes: 394 additions & 0 deletions run_pharokka_and_phold.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,394 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"gpuType": "T4",
"authorship_tag": "ABX9TyNnwN/naVC1sG/14rX68U/2",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/github/gbouras13/pharokka/blob/master/run_pharokka_and_phold.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"##Pharokka + Phold\n",
"\n",
"[pharokka](https://github.com/gbouras13/pharokka) is a rapid standardised annotation tool for bacteriophage genomes and metagenomes. You can read more about pharokka in the [documentation](https://pharokka.readthedocs.io/).\n",
"\n",
"[phold](https://github.com/gbouras13/phold) is a sensitive annotation tool for bacteriophage genomes and metagenomes using protein structural homology. You can read more about phold in the [documentation](https://phold.readthedocs.io/).\n",
"\n",
"phold uses the [ProstT5](https://github.com/mheinzinger/ProstT5) protein language model to translate protein amino acid sequences to the 3Di token alphabet used by [Foldseek](https://github.com/steineggerlab/foldseek). Foldseek is then used to search these against a database of 803k protein structures mostly predicted using [Colabfold](https://github.com/sokrypton/ColabFold).\n",
"\n",
"The tools are best run sequentially, as Pharokka conducts extra annotation steps like tRNA, tmRNA, CRISPR and inphared searches that Phold lacks (for now at least). Pharokka will also (rarely) annotate CDS that Phold can miss.\n",
"\n",
"* To run the cells, press the play button on the left side\n",
"* Cells 1 and 2 install pharokka, phold and download the databases\n",
"* Once they have been run, you can re-run Cell 3 (to run Pharokka) and Cell 4 (to run Phold) as many times as you would like\n",
"* Please make sure you change the runtime to T4 GPU, as this will make Phold run faster.\n",
"* To do this, go to the top toolbar, then to Runtime -> Change runtime type -> Hardware accelerator\n",
"\n"
],
"metadata": {
"id": "QGd2GEI3N-02"
}
},
{
"cell_type": "code",
"source": [
"#@title 1. Install pharokka and phold\n",
"\n",
"#@markdown This cell installs pharokka and phold. It will take a few minutes. Please be patient\n",
"\n",
"%%time\n",
"import os\n",
"from sys import version_info\n",
"python_version = f\"{version_info.major}.{version_info.minor}\"\n",
"PYTHON_VERSION = python_version\n",
"\n",
"print(PYTHON_VERSION)\n",
"\n",
"if not os.path.isfile(\"MAMBA_READY\"):\n",
" print(\"installing mamba...\")\n",
" os.system(\"wget -qnc https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh\")\n",
" os.system(\"bash Mambaforge-Linux-x86_64.sh -bfp /usr/local\")\n",
" os.system(\"mamba config --set auto_update_conda false\")\n",
" os.system(\"touch MAMBA_READY\")\n",
"\n",
"if not os.path.isfile(\"PHAROKKA_PHOLD_READY\"):\n",
" print(\"installing pharokka and phold...\")\n",
" os.system(f\"mamba install -y -c conda-forge -c bioconda pharokka python='{PYTHON_VERSION}' phold==0.1.2 pytorch=*=cuda*\")\n",
" os.system(\"touch PHAROKKA_PHOLD_READY\")\n",
"\n",
"\n",
"\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Ii39RG8eOZUx",
"outputId": "771f2b7e-a2a3-47c5-a838-2630cd3995a4"
},
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"3.10\n",
"installing pharokka and phold...\n",
"CPU times: user 758 ms, sys: 114 ms, total: 873 ms\n",
"Wall time: 3min 11s\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"#@title 2. Download pharokka phold databases\n",
"\n",
"#@markdown This cell downloads the pharokka then the phold database. It will take a few minutes. Please be patient.\n",
"\n",
"\n",
"%%time\n",
"print(\"Downloading pharokka database. This will take a few minutes. Please be patient :)\")\n",
"os.system(\"install_databases.py -o pharokka_db\")\n",
"print(\"Downloading phold database. This will take a few minutes. Please be patient :)\")\n",
"os.system(\"phold install\")\n",
"\n",
"\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "tfltpbZ_QLfZ",
"outputId": "c78c8d4e-0525-4352-bc6a-f1ab8899b01f"
},
"execution_count": 2,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Downloading pharokka database. This will take a few minutes. Please be patient :)\n",
"Downloading phold database. This will take a few minutes. Please be patient :)\n",
"CPU times: user 1.1 s, sys: 177 ms, total: 1.28 s\n",
"Wall time: 4min 51s\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0"
]
},
"metadata": {},
"execution_count": 2
}
]
},
{
"cell_type": "code",
"source": [
"#@title 3. Run Pharokka\n",
"\n",
"#@markdown First, upload your phage(s) as a nucleotide input FASTA file\n",
"\n",
"#@markdown Click on the folder icon to the left and use file upload button.\n",
"\n",
"#@markdown Once it is uploaded, write the file name in the INPUT_FILE field on the right.\n",
"\n",
"#@markdown Then provide a directory for pharokka's output using PHAROKKA_OUT_DIR.\n",
"#@markdown The default is 'output_pharokka'.\n",
"\n",
"#@markdown Then type in a gene prediction tool for pharokka.\n",
"#@markdown Please choose either 'phanotate', 'prodigal', or 'prodigal-gv'.\n",
"\n",
"#@markdown You can also provide a prefix for your output files with PHAROKKA_PREFIX.\n",
"#@markdown If you provide nothing it will default to 'pharokka'.\n",
"\n",
"#@markdown You can also provide a locus tag for your output files.\n",
"#@markdown If you provide nothing it will generate a random locus tag.\n",
"\n",
"#@markdown You can click FAST to turn off --fast.\n",
"#@markdown By default it is True so that Pharokka runs faster in the Colab environment.\n",
"\n",
"#@markdown You can click META to turn on --meta if you have multiple phages in your input.\n",
"\n",
"#@markdown You can click META_HMM to turn on --meta_hmm.\n",
"\n",
"#@markdown You can click FORCE to overwrite the output directory.\n",
"#@markdown This may be useful if your earlier pharokka run has crashed for whatever reason.\n",
"\n",
"#@markdown The results of Pharokka will be in the folder icon on the left hand panel.\n",
"#@markdown Additionally, it will be zipped so you can download the whole directory.\n",
"\n",
"#@markdown The file to download is PHAROKKA_OUT_DIR.zip, where PHAROKKA_OUT_DIR is what you provided\n",
"\n",
"#@markdown If you do not see the output directory,\n",
"#@markdown refresh the window by either clicking the folder with the refresh icon below \"Files\"\n",
"#@markdown or double click and select \"Refresh\".\n",
"\n",
"\n",
"%%time\n",
"import os\n",
"import sys\n",
"import subprocess\n",
"import zipfile\n",
"INPUT_FILE = '' #@param {type:\"string\"}\n",
"\n",
"if os.path.exists(INPUT_FILE):\n",
" print(f\"Input file {INPUT_FILE} exists\")\n",
"else:\n",
" print(f\"Error: File {INPUT_FILE} does not exist\")\n",
" print(f\"Please check the spelling and that you have uploaded it correctly\")\n",
" sys.exit(1)\n",
"\n",
"PHAROKKA_OUT_DIR = 'output_pharokka' #@param {type:\"string\"}\n",
"GENE_PREDICTOR = 'phanotate' #@param {type:\"string\"}\n",
"allowed_gene_predictors = ['phanotate', 'prodigal', 'prodigal-gv']\n",
"# Check if the input parameter is valid\n",
"if GENE_PREDICTOR.lower() not in allowed_gene_predictors:\n",
" raise ValueError(\"Invalid GENE_PREDICTOR. Please choose from: 'phanotate', 'prodigal', 'prodigal-gv'.\")\n",
"\n",
"PHAROKKA_PREFIX = 'pharokka' #@param {type:\"string\"}\n",
"LOCUS_TAG = 'Default' #@param {type:\"string\"}\n",
"FAST = True #@param {type:\"boolean\"}\n",
"META = False #@param {type:\"boolean\"}\n",
"META_HMM = False #@param {type:\"boolean\"}\n",
"FORCE = False #@param {type:\"boolean\"}\n",
"\n",
"\n",
"# Construct the command\n",
"command = f\"pharokka.py -d pharokka_db -i {INPUT_FILE} -t 4 -o {PHAROKKA_OUT_DIR} -p {PHAROKKA_PREFIX} -l {LOCUS_TAG} -g {GENE_PREDICTOR}\"\n",
"\n",
"if FORCE is True:\n",
" command = f\"{command} -f\"\n",
"\n",
"if FAST is True:\n",
" command = f\"{command} --fast\"\n",
"\n",
"if META is True:\n",
" command = f\"{command} -m\"\n",
"\n",
"if META_HMM is True:\n",
" command = f\"{command} --meta_hmm\"\n",
"\n",
"# Execute the command\n",
"try:\n",
" print(\"Running pharokka\")\n",
" subprocess.run(command, shell=True, check=True)\n",
" print(\"pharokka completed successfully.\")\n",
" print(f\"Your output is in {PHAROKKA_OUT_DIR}.\")\n",
" print(f\"Zipping the output directory so you can download it all in one go.\")\n",
"\n",
" zip_filename = f\"{PHAROKKA_OUT_DIR}.zip\"\n",
"\n",
" # Zip the contents of the output directory\n",
" with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:\n",
" for root, dirs, files in os.walk(PHAROKKA_OUT_DIR):\n",
" for file in files:\n",
" zipf.write(os.path.join(root, file), os.path.relpath(os.path.join(root, file), PHAROKKA_OUT_DIR))\n",
" print(f\"Output directory has been zipped to {zip_filename}\")\n",
"\n",
"\n",
"except subprocess.CalledProcessError as e:\n",
" print(f\"Error occurred: {e}\")\n",
"\n",
"\n",
"\n",
"\n",
"\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Sjdpu6R-Kig9",
"outputId": "f8a912f9-0dd0-4740-e89a-245688eed76f"
},
"execution_count": 24,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Input file SAOMS1.fasta exists\n",
"Running pharokka\n",
"pharokka completed successfully.\n",
"Your output is in output_pharokka.\n",
"Zipping the output directory so you can download it all in one go.\n",
"Output directory has been zipped to output_pharokka.zip\n",
"CPU times: user 660 ms, sys: 82.8 ms, total: 743 ms\n",
"Wall time: 2min 1s\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"#@title 4. Run phold\n",
"\n",
"#@markdown This cell will run phold on the output of cell 3's Pharokka run\n",
"\n",
"#@markdown You do not need to provide any further input files\n",
"\n",
"#@markdown You can now provide a directory for phold's output with PHOLD_OUT_DIR.\n",
"#@markdown The default is 'output_phold'.\n",
"\n",
"#@markdown You can also provide a prefix for your output files with PHOLD_PREFIX.\n",
"#@markdown If you provide nothing it will default to 'phold'.\n",
"\n",
"#@markdown You can click FORCE to overwrite the output directory with .\n",
"#@markdown This may be useful if your earlier phold run has crashed for whatever reason.\n",
"\n",
"#@markdown If your input has multiple phages, you can click SEPARATE.\n",
"#@markdown This will output separate GenBank files in the output directory.\n",
"\n",
"#@markdown The results of Phold will be in the folder icon on the left hand panel.\n",
"#@markdown Additionally, it will be zipped so you can download the whole directory.\n",
"\n",
"#@markdown The file to download is PHOLD_OUT_DIR.zip, where PHOLD_OUT_DIR is what you provided\n",
"\n",
"#@markdown If you do not see the output directory,\n",
"#@markdown refresh the window by either clicking the folder with the refresh icon below \"Files\"\n",
"#@markdown or double click and select \"Refresh\".\n",
"\n",
"\n",
"%%time\n",
"import os\n",
"import subprocess\n",
"import zipfile\n",
"\n",
"# phold input is pharokka output\n",
"PHOLD_INPUT = f\"{PHAROKKA_OUT_DIR}/{PHAROKKA_PREFIX}.gbk\"\n",
"PHOLD_OUT_DIR = 'output_phold' #@param {type:\"string\"}\n",
"PHOLD_PREFIX = 'phold' #@param {type:\"string\"}\n",
"FORCE = False #@param {type:\"boolean\"}\n",
"SEPARATE = False #@param {type:\"boolean\"}\n",
"\n",
"# Construct the command\n",
"command = f\"phold run -i {PHOLD_INPUT} -t 4 -o {PHOLD_OUT_DIR} -p {PHOLD_PREFIX}\"\n",
"\n",
"if FORCE is True:\n",
" command = f\"{command} -f\"\n",
"if SEPARATE is True:\n",
" command = f\"{command} --separate\"\n",
"\n",
"\n",
"# Execute the command\n",
"try:\n",
" print(\"Running phold\")\n",
" subprocess.run(command, shell=True, check=True)\n",
" print(\"phold completed successfully.\")\n",
" print(f\"Your output is in {PHOLD_OUT_DIR}.\")\n",
" print(f\"Zipping the output directory so you can download it all in one go.\")\n",
"\n",
" zip_filename = f\"{PHOLD_OUT_DIR}.zip\"\n",
"\n",
" # Zip the contents of the output directory\n",
" with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:\n",
" for root, dirs, files in os.walk(PHOLD_OUT_DIR):\n",
" for file in files:\n",
" zipf.write(os.path.join(root, file), os.path.relpath(os.path.join(root, file), PHOLD_OUT_DIR))\n",
" print(f\"Output directory has been zipped to {zip_filename}\")\n",
"\n",
"\n",
"except subprocess.CalledProcessError as e:\n",
" print(f\"Error occurred: {e}\")\n",
"\n",
"\n",
"\n",
"\n",
"\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "9QfjP3q-Q04f",
"outputId": "36a9722e-9bdf-4aee-ad31-0c05bae81e27"
},
"execution_count": 18,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Running phold\n",
"phold completed successfully.\n",
"Your output is in output_phold.\n",
"Zipping the output directory so you can download it all in one go.\n",
"Output directory has been zipped to output_phold.zip\n",
"CPU times: user 1.25 s, sys: 153 ms, total: 1.4 s\n",
"Wall time: 4min 17s\n"
]
}
]
}
]
}

0 comments on commit ca1867f

Please sign in to comment.