-
Notifications
You must be signed in to change notification settings - Fork 719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Busco #987
Busco #987
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, looks quite good! But I do have some comments.
input: | ||
tuple val(meta), path(fasta) | ||
path(augustus_config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am missing handling of a database such as in the nf-core/mag local module.
This is also especially needed for the offline mode. Some compute infrastructures do not have internet access to download the reference dataset, those have to be pre-downloaded and channeled into the module in that cases. Therefore a database input in required in my opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in this module, we give database as one of the arguments rather than an input. In tests, it works with both local database and downloaded database. This makes it more generic, rather than absolutely needing a database.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @d4straub that it is important to support offline use. I just submitted a pull request for a minimised lineage dataset that can be used to test the module in offline mode. The offline lineage datasets can be passed to the module via an extra channel, with the possibility to keep it empty like for path(augustus_config)
. I was using something like below in some local modules.
if (lineage) options.args += "--offline --lineage_dataset $lineage"
Shall we follow this approach, using --download_path
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in this module, we give database as one of the arguments rather than an input. In tests, it works with both local database and downloaded database. This makes it more generic, rather than absolutely needing a database.
Where? Which arguments? Do you mean option.args
? If yes, than thats possible but this is typically solved by a separate input channel. E.g.:
input: | |
tuple val(meta), path(fasta) | |
path(augustus_config) | |
input: | |
tuple val(meta), path(fasta) | |
path(augustus_config) | |
path(database_path) |
and using than in the script:
block
def database = database_path ? "--download_path $database_path" : ""
And than use
busco \\
$options.args \\
--augustus \\
$database \\
--cpu $task.cpus \\
--in $fasta \\
--out $meta.id
Shall we follow this approach, using --download_path instead?
I think so, yes, because it is more flexible.
edit: that code might not be perfect, its not tested!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all input files, including databases, should be passed as an input channel - only like that can nextflow take care of mounting files into containers, upload them to AWS worker nodes etc.
- fasta: | ||
type: file | ||
description: Nucleic or amino acid sequence file in FASTA format | ||
pattern: "*.{fasta}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure it cannot handle compressed files? It might be worth to add a conditional unzipping to allow zipped input! Because modules should output zipped fasta
(and therefore allow compressed input), according to the module guidelines.
Also, does it require the fasta extension? If not, remove it or give more choice (fa,fa.gz
,etc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just verified that BUSCO crashes on compressed FASTA input. Following recent discussions on Slack, transparent decompression can be achieved with the following patch. I chose __UNCOMPRESSED_FASTA_FILE__
as output name to make it extremely unlikely that it clashes with an existing file.
index 6a1a5644..8f3b1249 100644
--- a/modules/busco/main.nf
+++ b/modules/busco/main.nf
@@ -33,11 +33,13 @@ process BUSCO {
# Copy the image's AUGUSTUS config directory if it was not provided to the module
[ ! -e augustus_config ] && cp -a /usr/local/config augustus_config
AUGUSTUS_CONFIG_PATH=augustus_config \\
+ # Ensure the input is uncompressed
+ gzip -cdf $fasta > __UNCOMPRESSED_FASTA_FILE__
busco \\
$options.args \\
--augustus \\
--cpu $task.cpus \\
- --in $fasta \\
+ --in __UNCOMPRESSED_FASTA_FILE__ \\
--out $meta.id
I verified that it works with the following change to the the test file.
diff --git a/tests/modules/busco/main.nf b/tests/modules/busco/main.nf
index bf03cf10..1cc8fbdb 100644
--- a/tests/modules/busco/main.nf
+++ b/tests/modules/busco/main.nf
@@ -8,8 +8,7 @@ include { BUSCO } from '../../../modules/busco/main.nf' addParams( options: [arg
workflow test_busco {
compressed_genome_file = file(params.test_data['bacteroides_fragilis']['genome']['genome_fna_gz'], checkIfExists: true)
- GUNZIP ( compressed_genome_file )
- input = GUNZIP.out.gunzip.map { row -> [ [ id:'test' ], row] }
+ input = [ [ id:'test' ], compressed_genome_file ]
I also opened an issue upstream to request transparent decompression of input files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm I rather thought to add in the script
block about something like
if [[ $file != *.gz ]]; then
gzip -n $fasta > $fasta.gz
fi
(code not tested! I am pretty sure there are examples in existing modules!)
- tsv: | ||
type: file | ||
description: Full summary table | ||
pattern: "*.{tsv}" | ||
- txt: | ||
type: file | ||
description: Short summary text | ||
pattern: "*.{txt}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I know BUSCO has several modes, one of which allows for several output summaries, you can add optional output such as here to the main.nf
I am referring to automated lineage selection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested with --auto-lineage
and it already works. The output folder contains files from all the different databases:
$ tree output/busco
output/busco
└── test
├── run_bacteria_odb10
│ ├── full_table.tsv
│ └── short_summary.txt
└── run_bacteroidales_odb10
├── full_table.tsv
└── short_summary.txt
3 directories, 4 files
Is this what you were thinking or something more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each of that files in your example (4 in total) should have their own output channel. That makes the output more predictable, therefore easier to use.
I tried the prefix suggestion but it started failing the tests, so I changed it back. |
@d4straub I also tried to substitute
Interestingly, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@priyanka-surana For pytest to succeed, the paths and MD5 sum in the test.yml
file have to be correct as in the patch below. (I am less sure if the two first changes of the patch are important, but this is what create-test-yml
has output.
@d4straub The 'Edit' option is grayed and unavailable to me; do I need to be added to some kind of whitelist ?
diff --git a/tests/modules/busco/test.yml b/tests/modules/busco/test.yml
index aecccc45..a1363738 100644
--- a/tests/modules/busco/test.yml
+++ b/tests/modules/busco/test.yml
@@ -1,9 +1,9 @@
-- name: busco
- command: nextflow run ./tests/modules/busco -entry test_busco -c tests/config/nextflow.config
+- name: busco test_busco
+ command: nextflow run tests/modules/busco -entry test_busco -c tests/config/nextflow.config
tags:
- busco
files:
- - path: work/*/*/test/run_bacteria_odb10/full_table.tsv
- md5sum: e02aff5ce681637afaec99dbb845b1dc
- - path: work/*/*/test/short_summary*.txt
+ - path: output/busco/test/run_bacteroidales_odb10/full_table.tsv
+ md5sum: 8d7b401d875ecd9291b01bf4485bf080
+ - path: output/busco/test/run_bacteroidales_odb10/short_summary.txt
contains: ['Complete BUSCOs (C)']
It seems to me you are not part of the nf-core group at github, so please habe a look at https://nf-co.re/join#github
That doesn't give me anything to work with. More details on the error might help.
I guess my suggestion was bugged. The lines
seem to not work, just adjust as needed so that the output files are found. Sorry for giving incomplete/inaccurate suggestion here. |
@charles-plessy I have included all of our discussed changes. Please take a look and merge if possible. The tests worked for me. |
Removed from all modules in 9d0cad5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @priyanka-surana as you may have noticed there were major changed in the structure of the modules. I tried to apply them to busco
using samtools/sort
as a template. Can you check it works ?
hi @charles-plessy this is not working anymore. both the tests failed with the error: The parameter "mode (--mode)" was not provided. "--mode" is included in the optional arguments so I think something else broke. I am going to work on this later this week and get back to you. I might just start from a new template to keep things clean. |
They will be added in a new `nextflow.config` file.
Hi @priyanka-surana I figured out that the structure of the tests also changed. Options are now passed in a
I did not find how to create a new file in this PR using the review system. Can you create it ? With this, |
@charles-plessy I have added the |
@charles-plessy pytest checks successful. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this module!
I added a few comments! Most importantly, please add an entry to the pytest_modules.yml
file, currently no tests are run.
input: | ||
tuple val(meta), path(fasta) | ||
path(augustus_config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all input files, including databases, should be passed as an input channel - only like that can nextflow take care of mounting files into containers, upload them to AWS worker nodes etc.
$args \\ | ||
--augustus \\ | ||
--cpu $task.cpus \\ | ||
--in __UNCOMPRESSED_FASTA_FILE__ \\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it works this is preferrable, as the uncompressed file never gets written to disk.
--in __UNCOMPRESSED_FASTA_FILE__ \\ | |
--in <(gzip -cdf $fasta) \\ |
If the tool reads the input file multiple times, this will fail and there's no better way than what you are already doing.
# Ensure the input is uncompressed | ||
gzip -cdf $fasta > __UNCOMPRESSED_FASTA_FILE__ | ||
# Copy the image's AUGUSTUS config directory if it was not provided to the module | ||
[ ! -e augustus_config ] && cp -a /usr/local/config augustus_config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this /usr/local/config` directory? Is it part of the container? What happens if the tool is running with conda?
Maybe it's safest to just mandate providing the config directory as input channel.
genome_chain_gz = "${test_data_dir}/genomics/homo_sapiens/genome/genome.chain.gz" | ||
chr22_odb10_tar_gz = "${test_data_dir}/genomics/homo_sapiens/genome/BUSCO/chr22_odb10.tar.gz" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please fix the indentation
Closes #569
Worked on this module with @charles-plessy
This BUSCO module clears all manual testing. Fails with pytest, as it doesn't copy relevant output files outside of nextflow work directory.
$ PROFILE=singularity nextflow run tests/modules/busco -entry test_busco -c tests/config/nextflow.config
Currently emits the
full_table.tsv
andshort_summary.txt
files outside the nextflow work directory. Let us know if we should make it more generic and output the entirerun_*
folder.