Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Busco #987

Closed
wants to merge 28 commits into from
Closed

Busco #987

wants to merge 28 commits into from

Conversation

priyanka-surana
Copy link
Contributor

Closes #569

Worked on this module with @charles-plessy

This BUSCO module clears all manual testing. Fails with pytest, as it doesn't copy relevant output files outside of nextflow work directory.
$ PROFILE=singularity nextflow run tests/modules/busco -entry test_busco -c tests/config/nextflow.config

Currently emits the full_table.tsv and short_summary.txt files outside the nextflow work directory. Let us know if we should make it more generic and output the entire run_* folder.

Copy link
Contributor

@d4straub d4straub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, looks quite good! But I do have some comments.

modules/busco/main.nf Show resolved Hide resolved
modules/busco/main.nf Show resolved Hide resolved
Comment on lines +21 to +23
input:
tuple val(meta), path(fasta)
path(augustus_config)
Copy link
Contributor

@d4straub d4straub Nov 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am missing handling of a database such as in the nf-core/mag local module.
This is also especially needed for the offline mode. Some compute infrastructures do not have internet access to download the reference dataset, those have to be pre-downloaded and channeled into the module in that cases. Therefore a database input in required in my opinion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in this module, we give database as one of the arguments rather than an input. In tests, it works with both local database and downloaded database. This makes it more generic, rather than absolutely needing a database.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @d4straub that it is important to support offline use. I just submitted a pull request for a minimised lineage dataset that can be used to test the module in offline mode. The offline lineage datasets can be passed to the module via an extra channel, with the possibility to keep it empty like for path(augustus_config). I was using something like below in some local modules.

if (lineage) options.args += "--offline --lineage_dataset $lineage"

Shall we follow this approach, using --download_path instead?

Copy link
Contributor

@d4straub d4straub Nov 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in this module, we give database as one of the arguments rather than an input. In tests, it works with both local database and downloaded database. This makes it more generic, rather than absolutely needing a database.

Where? Which arguments? Do you mean option.args? If yes, than thats possible but this is typically solved by a separate input channel. E.g.:

Suggested change
input:
tuple val(meta), path(fasta)
path(augustus_config)
input:
tuple val(meta), path(fasta)
path(augustus_config)
path(database_path)

and using than in the script: block
def database = database_path ? "--download_path $database_path" : ""
And than use

    busco \\
        $options.args \\
        --augustus \\
        $database \\
        --cpu $task.cpus \\
        --in  $fasta \\
        --out $meta.id

Shall we follow this approach, using --download_path instead?

I think so, yes, because it is more flexible.
edit: that code might not be perfect, its not tested!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all input files, including databases, should be passed as an input channel - only like that can nextflow take care of mounting files into containers, upload them to AWS worker nodes etc.

- fasta:
type: file
description: Nucleic or amino acid sequence file in FASTA format
pattern: "*.{fasta}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure it cannot handle compressed files? It might be worth to add a conditional unzipping to allow zipped input! Because modules should output zipped fasta (and therefore allow compressed input), according to the module guidelines.
Also, does it require the fasta extension? If not, remove it or give more choice (fa,fa.gz,etc).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just verified that BUSCO crashes on compressed FASTA input. Following recent discussions on Slack, transparent decompression can be achieved with the following patch. I chose __UNCOMPRESSED_FASTA_FILE__ as output name to make it extremely unlikely that it clashes with an existing file.

index 6a1a5644..8f3b1249 100644
--- a/modules/busco/main.nf
+++ b/modules/busco/main.nf
@@ -33,11 +33,13 @@ process BUSCO {
     # Copy the image's AUGUSTUS config directory if it was not provided to the module
     [ ! -e augustus_config ] && cp -a /usr/local/config augustus_config
     AUGUSTUS_CONFIG_PATH=augustus_config \\
+    # Ensure the input is uncompressed
+    gzip -cdf $fasta > __UNCOMPRESSED_FASTA_FILE__
     busco \\
         $options.args \\
         --augustus \\
         --cpu $task.cpus \\
-        --in  $fasta \\
+        --in __UNCOMPRESSED_FASTA_FILE__ \\
         --out $meta.id

I verified that it works with the following change to the the test file.

diff --git a/tests/modules/busco/main.nf b/tests/modules/busco/main.nf
index bf03cf10..1cc8fbdb 100644
--- a/tests/modules/busco/main.nf
+++ b/tests/modules/busco/main.nf
@@ -8,8 +8,7 @@ include { BUSCO } from '../../../modules/busco/main.nf' addParams( options: [arg
 workflow test_busco {
     
        compressed_genome_file = file(params.test_data['bacteroides_fragilis']['genome']['genome_fna_gz'], checkIfExists: true)
-       GUNZIP ( compressed_genome_file )
-       input = GUNZIP.out.gunzip.map { row -> [ [ id:'test' ], row] }
+       input = [ [ id:'test' ], compressed_genome_file ]

I also opened an issue upstream to request transparent decompression of input files.

Copy link
Contributor

@d4straub d4straub Nov 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm I rather thought to add in the script block about something like

if [[ $file != *.gz ]]; then
    gzip -n $fasta > $fasta.gz
fi

(code not tested! I am pretty sure there are examples in existing modules!)

Comment on lines +41 to +48
- tsv:
type: file
description: Full summary table
pattern: "*.{tsv}"
- txt:
type: file
description: Short summary text
pattern: "*.{txt}"
Copy link
Contributor

@d4straub d4straub Nov 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I know BUSCO has several modes, one of which allows for several output summaries, you can add optional output such as here to the main.nf
I am referring to automated lineage selection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested with --auto-lineage and it already works. The output folder contains files from all the different databases:

$ tree output/busco
output/busco
└── test
    ├── run_bacteria_odb10
    │   ├── full_table.tsv
    │   └── short_summary.txt
    └── run_bacteroidales_odb10
        ├── full_table.tsv
        └── short_summary.txt

3 directories, 4 files

Is this what you were thinking or something more?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each of that files in your example (4 in total) should have their own output channel. That makes the output more predictable, therefore easier to use.

priyanka-surana and others added 4 commits November 10, 2021 15:16
Co-authored-by: Daniel Straub <42973691+d4straub@users.noreply.github.com>
Co-authored-by: Daniel Straub <42973691+d4straub@users.noreply.github.com>
@priyanka-surana
Copy link
Contributor Author

I tried the prefix suggestion but it started failing the tests, so I changed it back.

@charles-plessy
Copy link
Contributor

@d4straub I also tried to substitute meta.id by prefix as you suggested and I also ran into error like Priyanka. The error message suggests that prefix evaluates to null when searching for the result files in the output: section:

Missing output file(s) `null/run_*/full_table.tsv` expected by process `test_busco:BUSCO (test)`

Interestingly, prefix does evaluate the same as meta.id in the script: section. I can not find where the error is...

Copy link
Contributor

@charles-plessy charles-plessy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@priyanka-surana For pytest to succeed, the paths and MD5 sum in the test.yml file have to be correct as in the patch below. (I am less sure if the two first changes of the patch are important, but this is what create-test-yml has output.

@d4straub The 'Edit' option is grayed and unavailable to me; do I need to be added to some kind of whitelist ?

diff --git a/tests/modules/busco/test.yml b/tests/modules/busco/test.yml
index aecccc45..a1363738 100644
--- a/tests/modules/busco/test.yml
+++ b/tests/modules/busco/test.yml
@@ -1,9 +1,9 @@
-- name: busco
-  command: nextflow run ./tests/modules/busco -entry test_busco -c tests/config/nextflow.config
+- name: busco test_busco
+  command: nextflow run tests/modules/busco -entry test_busco -c tests/config/nextflow.config
   tags:
     - busco
   files:
-    - path: work/*/*/test/run_bacteria_odb10/full_table.tsv
-      md5sum: e02aff5ce681637afaec99dbb845b1dc
-    - path: work/*/*/test/short_summary*.txt
+    - path: output/busco/test/run_bacteroidales_odb10/full_table.tsv
+      md5sum: 8d7b401d875ecd9291b01bf4485bf080
+    - path: output/busco/test/run_bacteroidales_odb10/short_summary.txt
       contains: ['Complete BUSCOs (C)']

@d4straub
Copy link
Contributor

d4straub commented Nov 11, 2021

@d4straub The 'Edit' option is grayed and unavailable to me; do I need to be added to some kind of whitelist ?

It seems to me you are not part of the nf-core group at github, so please habe a look at https://nf-co.re/join#github

I tried the prefix suggestion but it started failing the tests, so I changed it back.

That doesn't give me anything to work with. More details on the error might help.

Interestingly, prefix does evaluate the same as meta.id in the script: section. I can not find where the error is...

I guess my suggestion was bugged. The lines

    tuple val(meta), path("${prefix}/run_*/full_table.tsv"), emit: tsv
    tuple val(meta), path("${prefix}/run_*/short_summary.txt"), emit: txt

seem to not work, just adjust as needed so that the output files are found. Sorry for giving incomplete/inaccurate suggestion here.

@priyanka-surana priyanka-surana added WIP Work in progress new module Adding a new module labels Nov 12, 2021
@priyanka-surana
Copy link
Contributor Author

@charles-plessy I have included all of our discussed changes. Please take a look and merge if possible. The tests worked for me.

Copy link
Contributor

@charles-plessy charles-plessy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @priyanka-surana as you may have noticed there were major changed in the structure of the modules. I tried to apply them to busco using samtools/sort as a template. Can you check it works ?

@priyanka-surana
Copy link
Contributor Author

priyanka-surana commented Dec 6, 2021

hi @charles-plessy this is not working anymore. both the tests failed with the error: The parameter "mode (--mode)" was not provided. "--mode" is included in the optional arguments so I think something else broke. I am going to work on this later this week and get back to you. I might just start from a new template to keep things clean.

They will be added in a new `nextflow.config` file.
@charles-plessy
Copy link
Contributor

Hi @priyanka-surana I figured out that the structure of the tests also changed. Options are now passed in a nextflow.config file. Following the last commit I made, we need tests/modules/busco/nextflow.config to contain:

process {

    publishDir = { "${params.outdir}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" }

    withName: BUSCO_BACTE {
        ext.args = '--mode genome --lineage_dataset bacteroidales_odb10'
    }

    withName: BUSCO_CHR22 {
        ext.args = '--mode genome --offline'
    }

}

I did not find how to create a new file in this PR using the review system. Can you create it ?

With this, pytest succeeds on my side after merging on the master branch (I did not test if it was strictly required) and using the nf-core tools as of December 6th.

@priyanka-surana
Copy link
Contributor Author

@charles-plessy I have added the nextflow.config file and testing without pytest works. Pytest gives syntax errors, I posted on nf-core help channel, will have to wait till it resolves.

@priyanka-surana
Copy link
Contributor Author

@charles-plessy pytest checks successful.

Copy link
Member

@grst grst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this module!

I added a few comments! Most importantly, please add an entry to the pytest_modules.yml file, currently no tests are run.

Comment on lines +21 to +23
input:
tuple val(meta), path(fasta)
path(augustus_config)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all input files, including databases, should be passed as an input channel - only like that can nextflow take care of mounting files into containers, upload them to AWS worker nodes etc.

$args \\
--augustus \\
--cpu $task.cpus \\
--in __UNCOMPRESSED_FASTA_FILE__ \\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it works this is preferrable, as the uncompressed file never gets written to disk.

Suggested change
--in __UNCOMPRESSED_FASTA_FILE__ \\
--in <(gzip -cdf $fasta) \\

If the tool reads the input file multiple times, this will fail and there's no better way than what you are already doing.

# Ensure the input is uncompressed
gzip -cdf $fasta > __UNCOMPRESSED_FASTA_FILE__
# Copy the image's AUGUSTUS config directory if it was not provided to the module
[ ! -e augustus_config ] && cp -a /usr/local/config augustus_config
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this /usr/local/config` directory? Is it part of the container? What happens if the tool is running with conda?
Maybe it's safest to just mandate providing the config directory as input channel.

Comment on lines +117 to +118
genome_chain_gz = "${test_data_dir}/genomics/homo_sapiens/genome/genome.chain.gz"
chr22_odb10_tar_gz = "${test_data_dir}/genomics/homo_sapiens/genome/BUSCO/chr22_odb10.tar.gz"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please fix the indentation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

new module: BUSCO
5 participants