Busco #987

Closed
wants to merge 28 commits into from
28 commits
fc7a152
Added busco modules and modified the associated files.
priyanka-surana Oct 26, 2021
578537b
Updating the busco module
priyanka-surana Oct 27, 2021
476aedc
Fixed typo in output
priyanka-surana Oct 28, 2021
b0e43d5
Made change to the genome file location
priyanka-surana Oct 28, 2021
fd64125
Added gunzip feature in testing
priyanka-surana Oct 29, 2021
795fe67
New testing with lineage dataset specified
priyanka-surana Oct 29, 2021
4488dfe
Modified how test data is referenced
priyanka-surana Oct 29, 2021
e9935e5
Merge branch 'master' into busco
priyanka-surana Oct 29, 2021
79832bc
Changes meta.id to prefix for out
priyanka-surana Nov 10, 2021
f6a0a79
Changes meta.id to prefix for output files
priyanka-surana Nov 10, 2021
a88f19b
Revert "Changes meta.id to prefix for output files"
priyanka-surana Nov 10, 2021
658f319
Revert "Changes meta.id to prefix for out"
priyanka-surana Nov 10, 2021
53588a2
Added compressed genome functionality
priyanka-surana Nov 18, 2021
621520c
Modified lineage feature
priyanka-surana Nov 18, 2021
048b11e
Added lineage information to test/main.nf
priyanka-surana Nov 18, 2021
a0c5294
Tried fixing lineage optional input again
priyanka-surana Nov 18, 2021
a13ce07
Merge branch 'master' into busco
rpetit3 Nov 18, 2021
0293e10
Merge branch 'nf-core:master' into busco
priyanka-surana Nov 22, 2021
c7efae3
Edited testing for pre-downloaded lineage
priyanka-surana Nov 22, 2021
a6bb035
Merge branch 'nf-core:master' into busco
priyanka-surana Nov 29, 2021
4f7a94c
Included @CharlesPlessy suggestions
priyanka-surana Nov 29, 2021
2e696e1
Merge branch 'nf-core:master' into busco
priyanka-surana Dec 2, 2021
82e81b7
modified test yml file format
priyanka-surana Dec 2, 2021
e3d4ef3
Delete functions.nf
charles-plessy Dec 6, 2021
5a418e4
Update to latest nf-core standards
charles-plessy Dec 6, 2021
ad421b9
Remove options from test nf file
charles-plessy Dec 7, 2021
0c5b9a5
Merge branch 'master' into busco
priyanka-surana Dec 7, 2021
0d9123c
Added tests nextflow.config file
priyanka-surana Dec 7, 2021
41 changes: 41 additions & 0 deletions modules/busco/main.nf
@@ -0,0 +1,41 @@
process BUSCO {
tag "$meta.id"
label 'process_medium'
conda (params.enable_conda ? "bioconda::busco=5.2.2" : null)
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/busco:5.2.2--pyhdfd78af_0' :
'quay.io/biocontainers/busco:5.2.2--pyhdfd78af_0' }"

input:
tuple val(meta), path(fasta)
path(augustus_config)
Comment on lines +9 to +11
Contributor

@d4straub, Nov 10, 2021

I am missing handling of a database, such as in the nf-core/mag local module.
This is especially needed for offline mode. Some compute infrastructures do not have internet access to download the reference dataset; it has to be pre-downloaded and channeled into the module in that case. Therefore a database input is required, in my opinion.

Contributor Author

So in this module, we give database as one of the arguments rather than an input. In tests, it works with both local database and downloaded database. This makes it more generic, rather than absolutely needing a database.

Contributor

I agree with @d4straub that it is important to support offline use. I just submitted a pull request for a minimised lineage dataset that can be used to test the module in offline mode. The offline lineage datasets can be passed to the module via an extra channel, with the possibility to keep it empty like for path(augustus_config). I was using something like below in some local modules.

if (lineage) options.args += " --offline --lineage_dataset $lineage"

Shall we follow this approach, using --download_path instead?

Contributor

@d4straub, Nov 11, 2021

> So in this module, we give database as one of the arguments rather than an input. In tests, it works with both local database and downloaded database. This makes it more generic, rather than absolutely needing a database.

Where? Which arguments? Do you mean options.args? If yes, then that's possible, but this is typically solved by a separate input channel. E.g.:

Suggested change
-    input:
-    tuple val(meta), path(fasta)
-    path(augustus_config)
+    input:
+    tuple val(meta), path(fasta)
+    path(augustus_config)
+    path(database_path)

and then using that in the script: block:
def database = database_path ? "--download_path $database_path" : ""
And then use

    busco \\
        $options.args \\
        --augustus \\
        $database \\
        --cpu $task.cpus \\
        --in  $fasta \\
        --out $meta.id

> Shall we follow this approach, using --download_path instead?

I think so, yes, because it is more flexible.
edit: that code might not be perfect, it's not tested!

Member

All input files, including databases, should be passed as an input channel. Only then can Nextflow take care of mounting files into containers, uploading them to AWS worker nodes, etc.
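
A minimal shell sketch of the optional-database idea from this thread, assuming the pre-downloaded dataset has already been staged into the working directory. `build_busco_cmd` and the file names are hypothetical; the sketch only prints the command line, so it runs without BUSCO installed, and it uses no flags beyond `--in`, `--out` and `--download_path` mentioned above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the module): assemble the BUSCO command
# line, adding --download_path only when a database directory was provided.
build_busco_cmd() {
    local fasta="$1"
    local out="$2"
    local database_path="${3:-}"   # empty when no database was staged
    local database_args=""

    if [ -n "$database_path" ]; then
        database_args=" --download_path $database_path"
    fi

    # Print instead of execute, so the sketch works without BUSCO installed.
    echo "busco --in $fasta --out $out$database_args"
}

# Without a database: BUSCO would be left to download what it needs.
build_busco_cmd genome.fna test

# With a staged database directory (hypothetical name).
build_busco_cmd genome.fna test busco_downloads
```

In the module itself the same decision would be made in the Groovy part of the script: block, as in the `def database = ...` line suggested above; the point of the separate input channel is that Nextflow stages the directory before this command line is built.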

val(lineage)

output:
tuple val(meta), path("${meta.id}/run_*/full_table.tsv"), emit: tsv
tuple val(meta), path("${meta.id}/run_*/short_summary.txt"), emit: txt
priyanka-surana marked this conversation as resolved.
path "versions.yml", emit: versions

script:
def args = task.ext.args ?: ''
def prefix = task.ext.prefix ?: "${meta.id}"
if (lineage) args += " --lineage_dataset $lineage"
"""
# Ensure the input is uncompressed
gzip -cdf $fasta > __UNCOMPRESSED_FASTA_FILE__
# Copy the image's AUGUSTUS config directory if it was not provided to the module
[ ! -e augustus_config ] && cp -a /usr/local/config augustus_config
Member

What's this `/usr/local/config` directory? Is it part of the container? What happens if the tool is running with conda?
Maybe it's safest to just mandate providing the config directory as an input channel.

AUGUSTUS_CONFIG_PATH=augustus_config \\
busco \\
$args \\
--augustus \\
--cpu $task.cpus \\
--in __UNCOMPRESSED_FASTA_FILE__ \\
Member

If it works, this is preferable, as the uncompressed file never gets written to disk.

Suggested change
-    --in __UNCOMPRESSED_FASTA_FILE__ \\
+    --in <(gzip -cdf $fasta) \\

If the tool reads the input file multiple times, this will fail and there's no better way than what you are already doing.
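
The caveat about multiple reads can be illustrated with a small shell sketch. It assumes bash (process substitution is not POSIX sh), and `read_twice.sh` is a hypothetical stand-in for a tool that opens its input twice; it is not BUSCO and says nothing about how BUSCO actually reads its input.

```shell
#!/usr/bin/env bash
# Sketch: a process substitution is a one-shot stream, a regular file is not.
workdir="$(mktemp -d)"
printf '>seq1\nACGT\n' | gzip -c > "$workdir/genome.fna.gz"

# Hypothetical stand-in tool: opens its input twice and prints the total
# number of lines it managed to read over both passes.
cat > "$workdir/read_twice.sh" <<'EOF'
#!/usr/bin/env bash
cat "$1" "$1" | wc -l | tr -d ' '
EOF

# Regular file: both passes see the full two-line FASTA.
gzip -cdf "$workdir/genome.fna.gz" > "$workdir/genome.fna"
file_total="$(bash "$workdir/read_twice.sh" "$workdir/genome.fna")"

# Process substitution: the first pass drains the pipe, the second finds it empty.
pipe_total="$(bash "$workdir/read_twice.sh" <(gzip -cdf "$workdir/genome.fna.gz"))"

echo "regular file, lines over two passes: $file_total"
echo "process substitution, lines over two passes: $pipe_total"

rm -rf "$workdir"
```

If the two totals differ for a given tool, writing the uncompressed file to disk, as the module currently does, is the safer choice.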

--out $meta.id
priyanka-surana marked this conversation as resolved.

cat <<-END_VERSIONS > versions.yml
"${task.process}":
busco: \$( busco --version 2>&1 | sed 's/^BUSCO //' )
END_VERSIONS
"""
}
52 changes: 52 additions & 0 deletions modules/busco/meta.yml
@@ -0,0 +1,52 @@
name: busco
description: Benchmarking Universal Single Copy Orthologs
keywords:
- quality control
- genome
- transcriptome
- proteome
tools:
- busco:
description: BUSCO provides measures for quantitative assessment of genome assembly, gene set, and transcriptome completeness based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs selected from OrthoDB.
homepage: https://busco.ezlab.org/
documentation: https://busco.ezlab.org/busco_userguide.html
tool_dev_url: https://gitlab.com/ezlab/busco
doi: "10.1007/978-1-4939-9173-0_14"
licence: ['MIT']

input:
- meta:
type: map
description: |
Groovy Map containing sample information
e.g. [ id:'test', single_end:false ]
- fasta:
type: file
description: Nucleic or amino acid sequence file in FASTA format
pattern: "*.{fasta}"
Contributor

Are you sure it cannot handle compressed files? It might be worth adding conditional unzipping to allow zipped input, because modules should output zipped FASTA (and therefore allow compressed input), according to the module guidelines.
Also, does it require the fasta extension? If not, remove it or give more choice (fa, fa.gz, etc.).

Contributor

I just verified that BUSCO crashes on compressed FASTA input. Following recent discussions on Slack, transparent decompression can be achieved with the following patch. I chose __UNCOMPRESSED_FASTA_FILE__ as output name to make it extremely unlikely that it clashes with an existing file.

index 6a1a5644..8f3b1249 100644
--- a/modules/busco/main.nf
+++ b/modules/busco/main.nf
@@ -33,11 +33,13 @@ process BUSCO {
     # Copy the image's AUGUSTUS config directory if it was not provided to the module
     [ ! -e augustus_config ] && cp -a /usr/local/config augustus_config
     AUGUSTUS_CONFIG_PATH=augustus_config \\
+    # Ensure the input is uncompressed
+    gzip -cdf $fasta > __UNCOMPRESSED_FASTA_FILE__
     busco \\
         $options.args \\
         --augustus \\
         --cpu $task.cpus \\
-        --in  $fasta \\
+        --in __UNCOMPRESSED_FASTA_FILE__ \\
         --out $meta.id

I verified that it works with the following change to the test file.

diff --git a/tests/modules/busco/main.nf b/tests/modules/busco/main.nf
index bf03cf10..1cc8fbdb 100644
--- a/tests/modules/busco/main.nf
+++ b/tests/modules/busco/main.nf
@@ -8,8 +8,7 @@ include { BUSCO } from '../../../modules/busco/main.nf' addParams( options: [arg
 workflow test_busco {
     
        compressed_genome_file = file(params.test_data['bacteroides_fragilis']['genome']['genome_fna_gz'], checkIfExists: true)
-       GUNZIP ( compressed_genome_file )
-       input = GUNZIP.out.gunzip.map { row -> [ [ id:'test' ], row] }
+       input = [ [ id:'test' ], compressed_genome_file ]

I also opened an issue upstream to request transparent decompression of input files.
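
A small shell sketch of why `gzip -cdf` gives transparent decompression: with `-c` (write to stdout), `-d` (decompress) and `-f` (force), gzip copies input that is not in gzip format to stdout unchanged, so the same command handles both compressed and plain FASTA. The file names here are made up for the demonstration.

```shell
#!/usr/bin/env bash
# Sketch: gzip -cdf decompresses gzip input and passes plain input through.
workdir="$(mktemp -d)"
printf '>seq1\nACGT\n' > "$workdir/plain.fna"
gzip -c "$workdir/plain.fna" > "$workdir/compressed.fna.gz"

# The same command is applied to both kinds of input.
gzip -cdf "$workdir/plain.fna" > "$workdir/from_plain.fna"
gzip -cdf "$workdir/compressed.fna.gz" > "$workdir/from_gz.fna"

# Both results should be byte-identical to the original plain FASTA.
if cmp -s "$workdir/plain.fna" "$workdir/from_plain.fna"; then plain_ok=yes; else plain_ok=no; fi
if cmp -s "$workdir/plain.fna" "$workdir/from_gz.fna"; then gz_ok=yes; else gz_ok=no; fi

echo "plain input passed through unchanged: $plain_ok"
echo "gzip input decompressed correctly: $gz_ok"

rm -rf "$workdir"
```

This avoids any test on the file extension, at the cost of always writing a second copy of the input.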

Contributor

@d4straub, Nov 11, 2021

Hm, I was rather thinking of adding something like this to the script block:

if [[ $fasta != *.gz ]]; then
    gzip -cn $fasta > $fasta.gz
fi

(code not tested! I am pretty sure there are examples in existing modules!)

- augustus_config:
type: directory
description: AUGUSTUS config directory

output:
- meta:
type: map
description: |
Groovy Map containing sample information
e.g. [ id:'test', single_end:false ]
- versions:
type: file
description: File containing software versions
pattern: "versions.yml"
- tsv:
type: file
description: Full summary table
pattern: "*.{tsv}"
- txt:
type: file
description: Short summary text
pattern: "*.{txt}"
Comment on lines +41 to +48
Contributor

@d4straub, Nov 10, 2021

As far as I know, BUSCO has several modes, one of which allows for several output summaries. You can add optional outputs, such as here, to the main.nf.
I am referring to automated lineage selection.

Contributor Author

I tested with --auto-lineage and it already works. The output folder contains files from all the different databases:

$ tree output/busco
output/busco
└── test
    ├── run_bacteria_odb10
    │   ├── full_table.tsv
    │   └── short_summary.txt
    └── run_bacteroidales_odb10
        ├── full_table.tsv
        └── short_summary.txt

3 directories, 4 files

Is this what you were thinking or something more?

Contributor

Each of the files in your example (4 in total) should have its own output channel. That makes the output more predictable and therefore easier to use.


authors:
- "@charles-plessy"
- "@priyanka-surana"
4 changes: 2 additions & 2 deletions tests/config/test_data.config
@@ -114,8 +114,8 @@ params {
genome_bed_gz_tbi = "${test_data_dir}/genomics/homo_sapiens/genome/genome.bed.gz.tbi"
transcriptome_fasta = "${test_data_dir}/genomics/homo_sapiens/genome/transcriptome.fasta"
genome2_fasta = "${test_data_dir}/genomics/homo_sapiens/genome/genome2.fasta"
-genome_chain_gz = "${test_data_dir}/genomics/homo_sapiens/genome/genome.chain.gz"
-
+genome_chain_gz = "${test_data_dir}/genomics/homo_sapiens/genome/genome.chain.gz"
+chr22_odb10_tar_gz = "${test_data_dir}/genomics/homo_sapiens/genome/BUSCO/chr22_odb10.tar.gz"
Comment on lines +117 to +118
Member

Please fix the indentation.

dbsnp_146_hg38_vcf_gz = "${test_data_dir}/genomics/homo_sapiens/genome/vcf/dbsnp_146.hg38.vcf.gz"
dbsnp_146_hg38_vcf_gz_tbi = "${test_data_dir}/genomics/homo_sapiens/genome/vcf/dbsnp_146.hg38.vcf.gz.tbi"
gnomad_r2_1_1_vcf_gz = "${test_data_dir}/genomics/homo_sapiens/genome/vcf/gnomAD.r2.1.1.vcf.gz"
26 changes: 26 additions & 0 deletions tests/modules/busco/main.nf
@@ -0,0 +1,26 @@
#!/usr/bin/env nextflow

nextflow.enable.dsl = 2

include { BUSCO as BUSCO_BACTE } from '../../../modules/busco/main.nf'
include { BUSCO as BUSCO_CHR22 } from '../../../modules/busco/main.nf'
include { UNTAR } from '../../../modules/untar/main.nf'

// This tests genome decompression, empty input channels and data download
workflow test_busco_bacteroidales {
input = [ [ id:'test' ], file(params.test_data['bacteroides_fragilis']['genome']['genome_fna_gz'], checkIfExists: true) ]
BUSCO_BACTE ( input,
[],
[] )
}

// This tests uncompressed genome, BUSCO lineage file provided via input channel, and offline mode
workflow test_busco_chr22 {
input = [ [ id:'test' ], file(params.test_data['homo_sapiens']['genome']['genome_fasta'], checkIfExists: true) ]
lineage_dataset = [ file(params.test_data['homo_sapiens']['genome']['chr22_odb10_tar_gz'], checkIfExists: true) ]
UNTAR(lineage_dataset)
BUSCO_CHR22 ( input,
[],
UNTAR.out.untar )
}

14 changes: 14 additions & 0 deletions tests/modules/busco/nextflow.config
@@ -0,0 +1,14 @@
process {

publishDir = { "${params.outdir}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" }

withName: BUSCO_BACTE {
ext.args = '--mode genome --lineage_dataset bacteroidales_odb10'
}

withName: BUSCO_CHR22 {
ext.args = '--mode genome --offline'
}

}

20 changes: 20 additions & 0 deletions tests/modules/busco/test.yml
@@ -0,0 +1,20 @@
- name: busco test_busco_bacteroidales
command: nextflow run tests/modules/busco -entry test_busco_bacteroidales -c tests/config/nextflow.config
tags:
- busco
files:
- path: output/busco/test/run_bacteroidales_odb10/full_table.tsv
md5sum: 8d7b401d875ecd9291b01bf4485bf080
- path: output/busco/test/run_bacteroidales_odb10/short_summary.txt
contains: ['Complete BUSCOs (C)']

- name: busco test_busco_chr22
command: nextflow run tests/modules/busco -entry test_busco_chr22 -c tests/config/nextflow.config
tags:
- busco
files:
- path: output/busco/test/run_chr22_odb10/full_table.tsv
md5sum: 83f20e8996c591338ada73b6ab0eb269
- path: output/busco/test/run_chr22_odb10/short_summary.txt
contains: ['Complete BUSCOs (C)']