Busco #987
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,41 @@ | ||||||
process BUSCO { | ||||||
tag "$meta.id" | ||||||
label 'process_medium' | ||||||
conda (params.enable_conda ? "bioconda::busco=5.2.2" : null) | ||||||
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? | ||||||
'https://depot.galaxyproject.org/singularity/busco:5.2.2--pyhdfd78af_0' : | ||||||
'quay.io/biocontainers/busco:5.2.2--pyhdfd78af_0' }" | ||||||
|
||||||
input: | ||||||
tuple val(meta), path(fasta) | ||||||
path(augustus_config) | ||||||
val(lineage) | ||||||
|
||||||
output: | ||||||
tuple val(meta), path("${meta.id}/run_*/full_table.tsv"), emit: tsv | ||||||
tuple val(meta), path("${meta.id}/run_*/short_summary.txt"), emit: txt | ||||||
priyanka-surana marked this conversation as resolved.
|
||||||
path "versions.yml", emit: versions | ||||||
|
||||||
script: | ||||||
def args = task.ext.args ?: '' | ||||||
def prefix = task.ext.prefix ?: "${meta.id}" | ||||||
if (lineage) args += " --lineage_dataset $lineage" | ||||||
""" | ||||||
# Ensure the input is uncompressed | ||||||
gzip -cdf $fasta > __UNCOMPRESSED_FASTA_FILE__ | ||||||
# Copy the image's AUGUSTUS config directory if it was not provided to the module | ||||||
[ ! -e augustus_config ] && cp -a /usr/local/config augustus_config | ||||||
What's this `/usr/local/config` directory? Is it part of the container? What happens if the tool is running with conda? |
||||||
AUGUSTUS_CONFIG_PATH=augustus_config \\ | ||||||
busco \\ | ||||||
$args \\ | ||||||
--augustus \\ | ||||||
--cpu $task.cpus \\ | ||||||
--in __UNCOMPRESSED_FASTA_FILE__ \\ | ||||||
If it works, this is preferable, since the uncompressed file never gets written to disk.
Suggested change
If the tool reads the input file multiple times, this will fail and there's no better way than what you are already doing. |
||||||
--out $meta.id | ||||||
priyanka-surana marked this conversation as resolved.
|
||||||
|
||||||
cat <<-END_VERSIONS > versions.yml | ||||||
"${task.process}": | ||||||
busco: \$( busco --version 2>&1 | sed 's/^BUSCO //' ) | ||||||
END_VERSIONS | ||||||
""" | ||||||
} |
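A note on the `gzip -cdf $fasta` line in the script block above: with `-c` (write to stdout) and `-d` (decompress), the `-f` flag makes gzip copy data that is not gzip-compressed through unchanged, so the same command accepts both plain and gzipped FASTA. The following is a minimal sketch of that behaviour using a made-up two-line FASTA file; the file names are illustrative and not part of the module.

```shell
set -eu

workdir="$(mktemp -d)"

# A tiny, made-up FASTA file and a gzipped copy of it
printf '>seq1\nACGT\n' > "$workdir/plain.fasta"
gzip -c "$workdir/plain.fasta" > "$workdir/packed.fasta.gz"

# Same command for both inputs:
#   -c  write to stdout
#   -d  decompress
#   -f  with -c, pass through input that is not in gzip format
gzip -cdf "$workdir/plain.fasta"     > "$workdir/from_plain.fasta"
gzip -cdf "$workdir/packed.fasta.gz" > "$workdir/from_packed.fasta"

# Both results are byte-identical to the original plain file
cmp "$workdir/plain.fasta" "$workdir/from_plain.fasta"
cmp "$workdir/plain.fasta" "$workdir/from_packed.fasta"
```

The trade-off raised in the review still applies: this approach writes an uncompressed copy to disk, which is the safe choice if the tool reads its input more than once.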
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,52 @@ | ||
name: busco | ||
description: Benchmarking Universal Single Copy Orthologs | ||
keywords: | ||
- quality control | ||
- genome | ||
- transcriptome | ||
- proteome | ||
tools: | ||
- busco: | ||
description: BUSCO provides measures for quantitative assessment of genome assembly, gene set, and transcriptome completeness based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs selected from OrthoDB. | ||
homepage: https://busco.ezlab.org/ | ||
documentation: https://busco.ezlab.org/busco_userguide.html | ||
tool_dev_url: https://gitlab.com/ezlab/busco | ||
doi: "10.1007/978-1-4939-9173-0_14" | ||
licence: ['MIT'] | ||
|
||
input: | ||
- meta: | ||
type: map | ||
description: | | ||
Groovy Map containing sample information | ||
e.g. [ id:'test', single_end:false ] | ||
- fasta: | ||
type: file | ||
description: Nucleic or amino acid sequence file in FASTA format | ||
pattern: "*.{fasta}" | ||
Are you sure it cannot handle compressed files? It might be worth adding conditional unzipping to allow zipped input! Because modules should output zipped
I just verified that BUSCO crashes on compressed FASTA input. Following recent discussions on Slack, transparent decompression can be achieved with the following patch. I chose
I verified that it works with the following change to the test file.
I also opened an issue upstream to request transparent decompression of input files.
Hm, I rather thought to add in the
(code not tested! I am pretty sure there are examples in existing modules!) |
||
- augustus_config: | ||
type: directory | ||
description: AUGUSTUS config directory | ||
|
||
output: | ||
- meta: | ||
type: map | ||
description: | | ||
Groovy Map containing sample information | ||
e.g. [ id:'test', single_end:false ] | ||
- versions: | ||
type: file | ||
description: File containing software versions | ||
pattern: "versions.yml" | ||
- tsv: | ||
type: file | ||
description: Full summary table | ||
pattern: "*.{tsv}" | ||
- txt: | ||
type: file | ||
description: Short summary text | ||
pattern: "*.{txt}" | ||
Comment on lines +41 to +48
As far as I know, BUSCO has several modes, one of which allows for several output summaries. You can add optional outputs such as here to the main.nf.
I tested with
Is this what you were thinking or something more?
Each of the files in your example (4 in total) should have its own output channel. That makes the output more predictable and therefore easier to use. |
||
|
||
authors: | ||
- "@charles-plessy" | ||
- "@priyanka-surana" |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -114,8 +114,8 @@ params { | |
genome_bed_gz_tbi = "${test_data_dir}/genomics/homo_sapiens/genome/genome.bed.gz.tbi" | ||
transcriptome_fasta = "${test_data_dir}/genomics/homo_sapiens/genome/transcriptome.fasta" | ||
genome2_fasta = "${test_data_dir}/genomics/homo_sapiens/genome/genome2.fasta" | ||
genome_chain_gz = "${test_data_dir}/genomics/homo_sapiens/genome/genome.chain.gz" | ||
|
||
genome_chain_gz = "${test_data_dir}/genomics/homo_sapiens/genome/genome.chain.gz" | ||
chr22_odb10_tar_gz = "${test_data_dir}/genomics/homo_sapiens/genome/BUSCO/chr22_odb10.tar.gz" | ||
Comment on lines +117 to +118
Please fix the indentation. |
||
dbsnp_146_hg38_vcf_gz = "${test_data_dir}/genomics/homo_sapiens/genome/vcf/dbsnp_146.hg38.vcf.gz" | ||
dbsnp_146_hg38_vcf_gz_tbi = "${test_data_dir}/genomics/homo_sapiens/genome/vcf/dbsnp_146.hg38.vcf.gz.tbi" | ||
gnomad_r2_1_1_vcf_gz = "${test_data_dir}/genomics/homo_sapiens/genome/vcf/gnomAD.r2.1.1.vcf.gz" | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
#!/usr/bin/env nextflow | ||
|
||
nextflow.enable.dsl = 2 | ||
|
||
include { BUSCO as BUSCO_BACTE } from '../../../modules/busco/main.nf' | ||
include { BUSCO as BUSCO_CHR22 } from '../../../modules/busco/main.nf' | ||
include { UNTAR } from '../../../modules/untar/main.nf' | ||
|
||
// This tests genome decompression, empty input channels and data download | ||
workflow test_busco_bacteroidales { | ||
input = [ [ id:'test' ], file(params.test_data['bacteroides_fragilis']['genome']['genome_fna_gz'], checkIfExists: true) ] | ||
BUSCO_BACTE ( input, | ||
[], | ||
[] ) | ||
} | ||
|
||
// This tests uncompressed genome, BUSCO lineage file provided via input channel, and offline mode | ||
workflow test_busco_chr22 { | ||
input = [ [ id:'test' ], file(params.test_data['homo_sapiens']['genome']['genome_fasta'], checkIfExists: true) ] | ||
lineage_dataset = [ file(params.test_data['homo_sapiens']['genome']['chr22_odb10_tar_gz'], checkIfExists: true) ] | ||
UNTAR(lineage_dataset) | ||
BUSCO_CHR22 ( input, | ||
[], | ||
UNTAR.out.untar ) | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
process { | ||
|
||
publishDir = { "${params.outdir}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" } | ||
|
||
withName: BUSCO_BACTE { | ||
ext.args = '--mode genome --lineage_dataset bacteroidales_odb10' | ||
} | ||
|
||
withName: BUSCO_CHR22 { | ||
ext.args = '--mode genome --offline' | ||
} | ||
|
||
} | ||
|
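The `publishDir` closure in the config above derives the output sub-directory from the process name: it keeps the part after the last `:`, then the part before the first `_`, and lowercases it. That is why both `BUSCO_BACTE` and `BUSCO_CHR22` publish under `busco/`. A rough shell equivalent is sketched below; `publish_subdir` is a hypothetical helper name and `NFCORE_TEST:` is a made-up workflow prefix, neither is part of the module.

```shell
# Hypothetical helper mirroring
#   task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()
publish_subdir() {
    name="${1##*:}"      # keep the text after the last ':'
    name="${name%%_*}"   # keep the text before the first '_'
    printf '%s\n' "$name" | tr '[:upper:]' '[:lower:]'
}

publish_subdir "BUSCO_BACTE"              # prints: busco
publish_subdir "NFCORE_TEST:BUSCO_CHR22"  # prints: busco
```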
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
- name: busco test_busco_bacteroidales | ||
command: nextflow run tests/modules/busco -entry test_busco_bacteroidales -c tests/config/nextflow.config | ||
tags: | ||
- busco | ||
files: | ||
- path: output/busco/test/run_bacteroidales_odb10/full_table.tsv | ||
md5sum: 8d7b401d875ecd9291b01bf4485bf080 | ||
- path: output/busco/test/run_bacteroidales_odb10/short_summary.txt | ||
contains: ['Complete BUSCOs (C)'] | ||
|
||
- name: busco test_busco_chr22 | ||
command: nextflow run tests/modules/busco -entry test_busco_chr22 -c tests/config/nextflow.config | ||
tags: | ||
- busco | ||
files: | ||
- path: output/busco/test/run_chr22_odb10/full_table.tsv | ||
md5sum: 83f20e8996c591338ada73b6ab0eb269 | ||
- path: output/busco/test/run_chr22_odb10/short_summary.txt | ||
contains: ['Complete BUSCOs (C)'] | ||
|
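The test definitions above use two kinds of file checks: `md5sum` for `full_table.tsv` and `contains` for `short_summary.txt`. A checksum only matches a byte-identical file, while a substring check tolerates any other variation in the file. The sketch below illustrates the difference with a made-up summary file; the file content is invented for illustration and is not real BUSCO output, and the variable names are assumptions.

```shell
set -eu

workdir="$(mktemp -d)"
summary="$workdir/short_summary.txt"

# Invented content, for illustration only
printf '# hypothetical summary\n10\tComplete BUSCOs (C)\n' > "$summary"

# "contains"-style check: fixed-string search for the expected text
contains_ok=no
if grep -qF 'Complete BUSCOs (C)' "$summary"; then
    contains_ok=yes
fi

# "md5sum"-style check: record the hash, then change the file slightly
recorded="$(md5sum "$summary" | cut -d' ' -f1)"
printf '# one extra line\n' >> "$summary"
current="$(md5sum "$summary" | cut -d' ' -f1)"

# The substring is still present, but the checksum no longer matches
still_contains=no
if grep -qF 'Complete BUSCOs (C)' "$summary"; then
    still_contains=yes
fi
echo "contains before: $contains_ok, after: $still_contains"
echo "checksum unchanged: $([ "$recorded" = "$current" ] && echo yes || echo no)"
```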
I am missing handling of a database, such as in the nf-core/mag local module.
This is especially needed for offline mode. Some compute infrastructures have no internet access to download the reference dataset, so in those cases the datasets have to be pre-downloaded and channeled into the module. Therefore a database input is required, in my opinion.
In this module, we give the database as one of the arguments rather than as an input. In tests, it works with both a local database and a downloaded database. This makes the module more generic, rather than strictly requiring a database.
I agree with @d4straub that it is important to support offline use. I just submitted a pull request for a minimised lineage dataset that can be used to test the module in offline mode. The offline lineage datasets can be passed to the module via an extra channel, with the possibility to keep it empty like for `path(augustus_config)`. I was using something like below in some local modules.
Shall we follow this approach, using `--download_path` instead?
Where? Which arguments? Do you mean `option.args`? If yes, then that's possible, but this is typically solved by a separate input channel. E.g.:
and then using that in the `script:` block:
`def database = database_path ? "--download_path $database_path" : ""`
And then use
I think so, yes, because it is more flexible.
edit: that code might not be perfect, it's not tested!
All input files, including databases, should be passed as an input channel. Only then can Nextflow take care of mounting files into containers, uploading them to AWS worker nodes, etc.
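The optional-database pattern discussed in this thread (add `--download_path` only when a database path was actually provided, otherwise add nothing) can be sketched in shell as follows. This is an illustration of the logic only, not the module's code: `database_arg` is a hypothetical helper name, `/data/busco_downloads` is a made-up path, and in the module itself the decision would be made in the Groovy part of the `script:` block.

```shell
# Hypothetical helper: print the option only if a path was given
database_arg() {
    if [ -n "${1:-}" ]; then
        printf '%s %s\n' '--download_path' "$1"
    fi
}

# With a path (made-up location): the option is emitted
database_arg "/data/busco_downloads"   # prints: --download_path /data/busco_downloads

# With an empty value: nothing is emitted, so the tool keeps its default behaviour
database_arg ""
```

Keeping the path itself in an input channel, as the last comment asks, is what lets the workflow engine stage the directory; the helper above only covers how the command-line option is assembled from it.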