updating to add multiple phages support and recheck completeness

linsalrob · Jun 9, 2024 · 06a0373 · 06a0373
1 parent 561e2e0
commit 06a0373
Show file tree

Hide file tree

Showing 22 changed files with 666 additions and 407 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,3 +1,5 @@
+problem_test/
+genbank_to.log
 sphae-*.err.txt
 sphae-*.out.txt
 sphae-Bu*

diff --git a/Changes.md b/Changes.md
@@ -4,9 +4,11 @@
 I am updating the default snakemake to v8.11. There have been some changes from v7 to v8, which for now has required me to update 
 - how snakemake can submit jobs to the local cluster - https://snakemake.readthedocs.io/en/v8.4.0/executing/cli.html 
 - The paired end reads have to have the pattern _R1 and _R2 in the sample names, made a note in readme.
+- Adding documentation on how to add specific number of base pairs to include in the analysis if phage genome length is known by defining the config file handling multiple phages assembled from one sample
+- Checking for DTRs from checkv results to confirm if the contig is an assembled genome.
 
 ## v1.4.1
-- added the option to run `sphae annotae` so assembled genomes or reoriented genomes from dnaapler can be run through only the annotation steps
+- added the option to run `sphae annotate` so assembled genomes or reoriented genomes from dnaapler can be run through only the annotation steps
 - Cleaning up code, removed use of Attrmap package
 - Updating the summary file to include phage plot from phynteny output
 - adding phold results to the summary as needed - making note is phold identified any similar integrase like genes

diff --git a/README.md b/README.md
@@ -32,6 +32,7 @@ This snakemake workflow was built using Snaketool [https://doi.org/10.1371/journ
 A complete list of programs used for each step is mentioned in the `sphae.CITATION` file. 
 
 ### Install 
+## Abeey broke me here - need to add more instructions on creating new environments and installing etc.... 
 
 **Pip install**
 
@@ -48,7 +49,7 @@ conda install sphae
 Setting up a new conda environment 
 
 ```bash
-conda create -n sphae python=3.11
+conda create -n sphae python=3.12
 conda activate sphae
 conda install -n base -c conda-forge mamba #if you don't already have mamba installed
 ```
@@ -172,7 +173,7 @@ Genome summary file includes the following information to help,
 
    [Phynteny](https://github.com/susiegriggo/Phynteny), the tool used for this prediction, assigns a confidence score to each gene prediction. If this score falls below a certain threshold (typically 90%), the gene remains classified as having an unknown function. To further investigate these genes, advanced techniques such as folding using tools like [ColabFold](https://github.com/sokrypton/ColabFold) and [Foldseek](https://github.com/steineggerlab/foldseek) can be employed. Analyzing the structure of these genes may provide additional insights into their functionality and potential role in biological processes.
 
-1. **How do I visualize the phages and gene annotations?**
+5. **How do I visualize the phages and gene annotations?**
    To visualize the phages and gene annotations, I recommend using [Clinker](https://github.com/gamcil/clinker). First, gather all the sample genbank files from `sphae.out/RESULTS` and place them in a new directory. Then, execute the clinker command to generate clinker plots, which compare the genes in each genome to each other.
 
    Additionally, for enhanced visualization, consider running [dnaapler](https://github.com/gbouras13/dnaapler) on the genomes in fasta format obtained from 
@@ -181,16 +182,23 @@ Genome summary file includes the following information to help,
 
    Please note that dnaapler may fail if terminase genes are not found, particularly when working with novel phages. The reason these steps haven't been added to sphae. If you encounter any challenges during this process, please feel free to leave an issue, and I'll provide improved documentation to assist you further with the command on how to install and run the command different commands. 
 
-2. **Where are the intermediate files being saved?**
+6. **Where are the intermediate files being saved?**
    These files are being saved in `sphae.out/PROCESSING`. If you need more information on the file structure here, or have ideas of better organization then leave an issue and I will make a note to have more documentation. 
 
-3. **Just run annotation on already assembled genomes?**
+7. **Just run annotation on already assembled genomes?**
 
     `sphae annotate --input <input genomes>`
     This command runs only Pharokka, Phold and Phynteny to annotate the assembled genomes. The results are saved to a new directory labeled `sphae.out/annotation`. 
 
     Note: Currently, Sphae runs Phold in CPU mode, but efforts are underway to support Phold GPU mode for faster processing of this step.
 
+8. Adding new tools to worklow
+9. How to change the number of base pairs to subsample for a sample?
+    Run the command `sphae config`
+    This copies the config file within the workflow to the current directory. Open this file and update the line `bases: 10000000` to for instance `bases: 300000`
+    Then run sphae run with the command `sphae run --input tests/data/illumina-subset --output example -k --config <path to the config file with the change>`
+
+
 ## Issues and Questions
 
 If you come across any issues or errors, report them under [Issues](https://github.com/linsalrob/sphae/issues). 

diff --git a/sphae/__main__.py b/sphae/__main__.py
@@ -34,6 +34,10 @@ def default_to_output(ctx, param, value):
         return os.path.join(ctx.params["output"], value)
     return value
 
+# This is to ensure that the output directory gets generated if it hasnt already been generated
+def ensure_directory_exists(directory):
+    if not os.path.exists(directory):
+        os.makedirs(directory)
 
 def common_options(func):
     """Common command line args
@@ -169,13 +173,14 @@ def annotate(genome, output, db_dir, temp_dir, configfile, **kwargs):
 def run(_input, output, db_dir, sequencing, temp_dir, configfile, **kwargs):
     """Run sphae"""
     copy_config(configfile, system_config=snake_base(os.path.join('config', 'config.yaml')))
-    
+
     merge_config = {
         "args": {
             "input": _input, 
             "output": output, 
             "db_dir": db_dir,  
-            "sequencing": sequencing, 
+            "sequencing": sequencing,
+            "configfile": configfile,
             "temp_dir": temp_dir,
         }
     }

diff --git a/sphae/config/config.yaml b/sphae/config/config.yaml
@@ -14,7 +14,7 @@ params:
   genomeSize: 44k 
   min_mean_quality: 9 
   min_length: 1000 
-  bases: 1000000
+  bases: 10000000
 
 db:
   pfam: "ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/"

diff --git a/sphae/workflow/rules/1.preflight-annot.smk b/sphae/workflow/rules/1.preflight-annot.smk
@@ -29,7 +29,7 @@ if config['args']['db_dir'] is None:
 else:
     dir_db = config['args']['db_dir']
 
-dir_annot = os.path.join(dir_out,'PROCESSING')
+dir_annot = os.path.join(dir_out,'PROCESSING', "genome-annotate")
 dir_final = os.path.join(dir_out,'final-annotate')
 dir_log = os.path.join(dir_out, 'logs')
 
@@ -81,7 +81,7 @@ onstart:
             os.unlink(os.path.join(dir["log"], logfile))
 onsuccess:
     """Print a success message"""
-    sys.stderr.write('\n\nSphaehost ran successfully!\n\n')
+    sys.stderr.write('\n\nSphae ran successfully!\n\n')
 onerror:
     """Print an error message"""
-    sys.stderr.write('\n\nSphaehost run failed\n\n')
+    sys.stderr.write('\n\nSphae run failed\n\n')
diff --git a/sphae/workflow/rules/1.preflight.smk b/sphae/workflow/rules/1.preflight.smk
@@ -28,7 +28,7 @@ dir_assembly = os.path.join(dir_out, 'PROCESSING','assembly')
 dir_megahit = os.path.join(dir_assembly, 'megahit')
 dir_flye = os.path.join(dir_assembly, 'flye')
 dir_genome = os.path.join(dir_out, 'PROCESSING','genome')
-dir_pharokka = os.path.join(dir_out, 'PROCESSING','annotate')
+dir_annotate = os.path.join(dir_out, 'PROCESSING','annotate')
 dir_log = os.path.join(dir_out, 'logs')
 dir_final = os.path.join(dir_out, 'RESULTS')
 
@@ -87,7 +87,7 @@ onstart:
             os.unlink(os.path.join(dir["log"], logfile))
 onsuccess:
     """Print a success message"""
-    sys.stderr.write('\n\nSphaehost ran successfully!\n\n')
+    sys.stderr.write('\n\nSphae ran successfully!\n\n')
 onerror:
     """Print an error message"""
-    sys.stderr.write('\n\nSphaehost run failed\n\n')
+    sys.stderr.write('\n\nSphae run failed\n\n')