Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optional skipping of short-read input to Filtlong for large datasets #691

Open
ddomman opened this issue Oct 14, 2024 · 1 comment
Open
Labels
enhancement New feature or request

Comments

@ddomman
Copy link

ddomman commented Oct 14, 2024

Description of the bug

Hi all - long time listener first time caller:

I have a rather large set of Illumina data along with some nanopore reads on which I was trying to run the hybrid assembly option. After 10+ hours, filtlong was still processing the nanopore reads. I did some digging and the current command utilizes the short-read data as part of the reference option. I think that is fine for small-ish datasets but seems impractical for larger ones.

Once I edited the filtlong.nf code to no longer use the short-reads, the filtlong process took less than 5 minutes and the pipeline has proceeded as expected. Maybe there could be a flag to turn on/off that feature?

filtlong.nf:

process FILTLONG {
    tag "$meta.id"

    conda "bioconda::filtlong=0.2.0"
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/filtlong:0.2.0--he513fc3_3' :
        'biocontainers/filtlong:0.2.0--he513fc3_3' }"

    input:
    tuple val(meta), path(long_reads), path(short_reads_1), path(short_reads_2)

    output:
    tuple val(meta), path("${meta.id}_lr_filtlong.fastq.gz"), emit: reads
    path "versions.yml"                                     , emit: versions

    script:
    """
    filtlong \
        -1 ${short_reads_1} \
        -2 ${short_reads_2} \
        --min_length ${params.longreads_min_length} \
        --keep_percent ${params.longreads_keep_percent} \
        --trim \
        --length_weight ${params.longreads_length_weight} \
        ${long_reads} | gzip > ${meta.id}_lr_filtlong.fastq.gz

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        filtlong: \$(filtlong --version | sed -e "s/Filtlong v//g")
    END_VERSIONS
    """
}

Edited working solution:

process FILTLONG {
    tag "$meta.id"

    conda "bioconda::filtlong=0.2.0"
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/filtlong:0.2.0--he513fc3_3' :
        'biocontainers/filtlong:0.2.0--he513fc3_3' }"

    input:
    tuple val(meta), path(long_reads), path(short_reads_1), path(short_reads_2)

    output:
    tuple val(meta), path("${meta.id}_lr_filtlong.fastq.gz"), emit: reads
    path "versions.yml"                                     , emit: versions

    script:
    """
    filtlong \
        --min_length ${params.longreads_min_length} \
        --keep_percent ${params.longreads_keep_percent} \
        --length_weight ${params.longreads_length_weight} \
        ${long_reads} | gzip > ${meta.id}_lr_filtlong.fastq.gz

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        filtlong: \$(filtlong --version | sed -e "s/Filtlong v//g")
    END_VERSIONS
    """
}

Command used and terminal output

No response

Relevant files

No response

System information

No response

@ddomman ddomman added the bug Something isn't working label Oct 14, 2024
@jfy133
Copy link
Member

jfy133 commented Oct 14, 2024

Hi @ddomman !

Thanks for this! This is great you have a solution already :)

Within the module we could make it optional by inserting the short_reads1/2 into the ocmmand if supplied, something along the lines of:

process FILTLONG {
    tag "$meta.id"

    conda "bioconda::filtlong=0.2.0"
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/filtlong:0.2.0--he513fc3_3' :
        'biocontainers/filtlong:0.2.0--he513fc3_3' }"

    input:
    tuple val(meta), path(long_reads), path(short_reads_1), path(short_reads_2)

    output:
    tuple val(meta), path("${meta.id}_lr_filtlong.fastq.gz"), emit: reads
    path "versions.yml"                                     , emit: versions

    script:
    def sr_command = short_reads_1 ? "-1 ${short_reads_1} -2 ${short_reads_2} \\" : ""
    """
    filtlong \
        ${sr_command}
        --min_length ${params.longreads_min_length} \
        --keep_percent ${params.longreads_keep_percent} \
        --length_weight ${params.longreads_length_weight} \
        ${long_reads} | gzip > ${meta.id}_lr_filtlong.fastq.gz

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        filtlong: \$(filtlong --version | sed -e "s/Filtlong v//g")
    END_VERSIONS
    """
}

If you want to contribute to the module and pipeline, you can make these changes via a PR to nf-core/moduels, and we can update in the pipeline (With credit to you!) - the contributions will be gratefully recieved :)

Note that @muabnezor is currently in the process of overhauling the long-read/nanopore preprocessing tools anyway, we just merged into the dev branch porechop_abi as a faster replacment for porechop and next we plan to add nanoq as an alternative to Filtlong. So if you prefer that, you could wait for that instead

That said, I think updating filtlong would still be very helpful to the community as a whole. Let me know what you think!

@jfy133 jfy133 added enhancement New feature or request and removed bug Something isn't working labels Oct 14, 2024
@jfy133 jfy133 changed the title Filtlong takes forever with large hybrid datasets Optional skipping of short-read input to Filtlong for large datasets Oct 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants