Skip to content

Commit

Permalink
updating to add multiple phages support and recheck completeness
Browse files Browse the repository at this point in the history
  • Loading branch information
npbhavya committed Jun 9, 2024
1 parent 561e2e0 commit 06a0373
Show file tree
Hide file tree
Showing 22 changed files with 666 additions and 407 deletions.
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
problem_test/
genbank_to.log
sphae-*.err.txt
sphae-*.out.txt
sphae-Bu*
Expand Down
4 changes: 3 additions & 1 deletion Changes.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,9 +4,11 @@
I am updating the default snakemake to v8.11. There have been some changes from v7 to v8, which for now has required me to update
- how snakemake can submit jobs to the local cluster - https://snakemake.readthedocs.io/en/v8.4.0/executing/cli.html
- The paired end reads have to have the pattern _R1 and _R2 in the sample names, made a note in readme.
- Adding documentation on how to add specific number of base pairs to include in the analysis if phage genome length is known by defining the config file handling multiple phages assembled from one sample
- Checking for DTRs from checkv results to confirm if the contig is an assembled genome.

## v1.4.1
- added the option to run `sphae annotae` so assembled genomes or reoriented genomes from dnaapler can be run through only the annotation steps
- added the option to run `sphae annotate` so assembled genomes or reoriented genomes from dnaapler can be run through only the annotation steps
- Cleaning up code, removed use of Attrmap package
- Updating the summary file to include phage plot from phynteny output
- adding phold results to the summary as needed - making note is phold identified any similar integrase like genes
Expand Down
16 changes: 12 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,7 @@ This snakemake workflow was built using Snaketool [https://doi.org/10.1371/journ
A complete list of programs used for each step is mentioned in the `sphae.CITATION` file.

### Install
## Abeey broke me here - need to add more instructions on creating new environments and installing etc....

**Pip install**

Expand All @@ -48,7 +49,7 @@ conda install sphae
Setting up a new conda environment

```bash
conda create -n sphae python=3.11
conda create -n sphae python=3.12
conda activate sphae
conda install -n base -c conda-forge mamba #if you don't already have mamba installed
```
Expand Down Expand Up @@ -172,7 +173,7 @@ Genome summary file includes the following information to help,

[Phynteny](https://github.com/susiegriggo/Phynteny), the tool used for this prediction, assigns a confidence score to each gene prediction. If this score falls below a certain threshold (typically 90%), the gene remains classified as having an unknown function. To further investigate these genes, advanced techniques such as folding using tools like [ColabFold](https://github.com/sokrypton/ColabFold) and [Foldseek](https://github.com/steineggerlab/foldseek) can be employed. Analyzing the structure of these genes may provide additional insights into their functionality and potential role in biological processes.

1. **How do I visualize the phages and gene annotations?**
5. **How do I visualize the phages and gene annotations?**
To visualize the phages and gene annotations, I recommend using [Clinker](https://github.com/gamcil/clinker). First, gather all the sample genbank files from `sphae.out/RESULTS` and place them in a new directory. Then, execute the clinker command to generate clinker plots, which compare the genes in each genome to each other.

Additionally, for enhanced visualization, consider running [dnaapler](https://github.com/gbouras13/dnaapler) on the genomes in fasta format obtained from
Expand All @@ -181,16 +182,23 @@ Genome summary file includes the following information to help,

Please note that dnaapler may fail if terminase genes are not found, particularly when working with novel phages. The reason these steps haven't been added to sphae. If you encounter any challenges during this process, please feel free to leave an issue, and I'll provide improved documentation to assist you further with the command on how to install and run the command different commands.

2. **Where are the intermediate files being saved?**
6. **Where are the intermediate files being saved?**
These files are being saved in `sphae.out/PROCESSING`. If you need more information on the file structure here, or have ideas of better organization then leave an issue and I will make a note to have more documentation.

3. **Just run annotation on already assembled genomes?**
7. **Just run annotation on already assembled genomes?**

`sphae annotate --input <input genomes>`
This command runs only Pharokka, Phold and Phynteny to annotate the assembled genomes. The results are saved to a new directory labeled `sphae.out/annotation`.

Note: Currently, Sphae runs Phold in CPU mode, but efforts are underway to support Phold GPU mode for faster processing of this step.

8. Adding new tools to worklow
9. How to change the number of base pairs to subsample for a sample?
Run the command `sphae config`
This copies the config file within the workflow to the current directory. Open this file and update the line `bases: 10000000` to for instance `bases: 300000`
Then run sphae run with the command `sphae run --input tests/data/illumina-subset --output example -k --config <path to the config file with the change>`


## Issues and Questions

If you come across any issues or errors, report them under [Issues](https://github.com/linsalrob/sphae/issues).
Expand Down
9 changes: 7 additions & 2 deletions sphae/__main__.py
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,10 @@ def default_to_output(ctx, param, value):
return os.path.join(ctx.params["output"], value)
return value

# This is to ensure that the output directory gets generated if it hasnt already been generated
def ensure_directory_exists(directory):
if not os.path.exists(directory):
os.makedirs(directory)

def common_options(func):
"""Common command line args
Expand Down Expand Up @@ -169,13 +173,14 @@ def annotate(genome, output, db_dir, temp_dir, configfile, **kwargs):
def run(_input, output, db_dir, sequencing, temp_dir, configfile, **kwargs):
"""Run sphae"""
copy_config(configfile, system_config=snake_base(os.path.join('config', 'config.yaml')))

merge_config = {
"args": {
"input": _input,
"output": output,
"db_dir": db_dir,
"sequencing": sequencing,
"sequencing": sequencing,
"configfile": configfile,
"temp_dir": temp_dir,
}
}
Expand Down
2 changes: 1 addition & 1 deletion sphae/config/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ params:
genomeSize: 44k
min_mean_quality: 9
min_length: 1000
bases: 1000000
bases: 10000000

db:
pfam: "ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/"
Expand Down
6 changes: 3 additions & 3 deletions sphae/workflow/rules/1.preflight-annot.smk
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ if config['args']['db_dir'] is None:
else:
dir_db = config['args']['db_dir']

dir_annot = os.path.join(dir_out,'PROCESSING')
dir_annot = os.path.join(dir_out,'PROCESSING', "genome-annotate")
dir_final = os.path.join(dir_out,'final-annotate')
dir_log = os.path.join(dir_out, 'logs')

Expand Down Expand Up @@ -81,7 +81,7 @@ onstart:
os.unlink(os.path.join(dir["log"], logfile))
onsuccess:
"""Print a success message"""
sys.stderr.write('\n\nSphaehost ran successfully!\n\n')
sys.stderr.write('\n\nSphae ran successfully!\n\n')
onerror:
"""Print an error message"""
sys.stderr.write('\n\nSphaehost run failed\n\n')
sys.stderr.write('\n\nSphae run failed\n\n')
6 changes: 3 additions & 3 deletions sphae/workflow/rules/1.preflight.smk
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@ dir_assembly = os.path.join(dir_out, 'PROCESSING','assembly')
dir_megahit = os.path.join(dir_assembly, 'megahit')
dir_flye = os.path.join(dir_assembly, 'flye')
dir_genome = os.path.join(dir_out, 'PROCESSING','genome')
dir_pharokka = os.path.join(dir_out, 'PROCESSING','annotate')
dir_annotate = os.path.join(dir_out, 'PROCESSING','annotate')
dir_log = os.path.join(dir_out, 'logs')
dir_final = os.path.join(dir_out, 'RESULTS')

Expand Down Expand Up @@ -87,7 +87,7 @@ onstart:
os.unlink(os.path.join(dir["log"], logfile))
onsuccess:
"""Print a success message"""
sys.stderr.write('\n\nSphaehost ran successfully!\n\n')
sys.stderr.write('\n\nSphae ran successfully!\n\n')
onerror:
"""Print an error message"""
sys.stderr.write('\n\nSphaehost run failed\n\n')
sys.stderr.write('\n\nSphae run failed\n\n')
Loading

0 comments on commit 06a0373

Please sign in to comment.