Optional skipping of short-read input to Filtlong for large datasets #691

ddomman · 2024-10-14T00:07:12Z

Description of the bug

Hi all - long time listener first time caller:

I have a rather large set of Illumina data along with some nanopore reads on which I was trying to run the hybrid assembly option. After 10+ hours, filtlong was still processing the nanopore reads. I did some digging and the current command utilizes the short-read data as part of the reference option. I think that is fine for small-ish datasets but seems impractical for larger ones.

Once I edited the filtlong.nf code to no longer use the short-reads, the filtlong process took less than 5 minutes and the pipeline has proceeded as expected. Maybe there could be a flag to turn on/off that feature?

filtlong.nf:

process FILTLONG {
    tag "$meta.id"

    conda "bioconda::filtlong=0.2.0"
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/filtlong:0.2.0--he513fc3_3' :
        'biocontainers/filtlong:0.2.0--he513fc3_3' }"

    input:
    tuple val(meta), path(long_reads), path(short_reads_1), path(short_reads_2)

    output:
    tuple val(meta), path("${meta.id}_lr_filtlong.fastq.gz"), emit: reads
    path "versions.yml"                                     , emit: versions

    script:
    """
    filtlong \
        -1 ${short_reads_1} \
        -2 ${short_reads_2} \
        --min_length ${params.longreads_min_length} \
        --keep_percent ${params.longreads_keep_percent} \
        --trim \
        --length_weight ${params.longreads_length_weight} \
        ${long_reads} | gzip > ${meta.id}_lr_filtlong.fastq.gz

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        filtlong: \$(filtlong --version | sed -e "s/Filtlong v//g")
    END_VERSIONS
    """
}

Edited working solution:

process FILTLONG {
    tag "$meta.id"

    conda "bioconda::filtlong=0.2.0"
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/filtlong:0.2.0--he513fc3_3' :
        'biocontainers/filtlong:0.2.0--he513fc3_3' }"

    input:
    tuple val(meta), path(long_reads), path(short_reads_1), path(short_reads_2)

    output:
    tuple val(meta), path("${meta.id}_lr_filtlong.fastq.gz"), emit: reads
    path "versions.yml"                                     , emit: versions

    script:
    """
    filtlong \
        --min_length ${params.longreads_min_length} \
        --keep_percent ${params.longreads_keep_percent} \
        --length_weight ${params.longreads_length_weight} \
        ${long_reads} | gzip > ${meta.id}_lr_filtlong.fastq.gz

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        filtlong: \$(filtlong --version | sed -e "s/Filtlong v//g")
    END_VERSIONS
    """
}

Command used and terminal output

No response

Relevant files

No response

System information

No response


jfy133 · 2024-10-14T07:28:13Z

Hi @ddomman !

Thanks for this! It's great that you already have a solution :)

Within the module we could make it optional by inserting short_reads_1/short_reads_2 into the command only if they are supplied, something along the lines of:

process FILTLONG {
    tag "$meta.id"

    conda "bioconda::filtlong=0.2.0"
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/filtlong:0.2.0--he513fc3_3' :
        'biocontainers/filtlong:0.2.0--he513fc3_3' }"

    input:
    tuple val(meta), path(long_reads), path(short_reads_1), path(short_reads_2)

    output:
    tuple val(meta), path("${meta.id}_lr_filtlong.fastq.gz"), emit: reads
    path "versions.yml"                                     , emit: versions

    script:
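    // Keep the line continuation in the script itself, so an empty sr_command does not cut the command short
    def sr_command = short_reads_1 ? "-1 ${short_reads_1} -2 ${short_reads_2}" : ""
    """
    filtlong \
        ${sr_command} \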
        --min_length ${params.longreads_min_length} \
        --keep_percent ${params.longreads_keep_percent} \
        --length_weight ${params.longreads_length_weight} \
        ${long_reads} | gzip > ${meta.id}_lr_filtlong.fastq.gz

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        filtlong: \$(filtlong --version | sed -e "s/Filtlong v//g")
    END_VERSIONS
    """
}
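To make the intended behaviour concrete, here is a minimal shell sketch of the same decision outside Nextflow: the `-1`/`-2` reference is added only when both short-read files are given. `build_filtlong_args` is a hypothetical helper written for illustration, not part of Filtlong or the pipeline, and the file names and threshold values used with it are placeholders.

```shell
# Hypothetical helper (not part of Filtlong or the pipeline): prints the
# Filtlong arguments, one per line, for the given thresholds and read files.
# Usage: build_filtlong_args MIN_LENGTH KEEP_PERCENT LENGTH_WEIGHT LONG_READS [SR1 SR2]
build_filtlong_args() {
    min_length="$1"; keep_percent="$2"; length_weight="$3"
    long_reads="$4"; sr1="${5:-}"; sr2="${6:-}"

    # Arguments that are always present, with the long reads last.
    set -- --min_length "$min_length" \
           --keep_percent "$keep_percent" \
           --length_weight "$length_weight" \
           "$long_reads"

    # Prepend the short-read reference only if both mates were supplied.
    if [ -n "$sr1" ] && [ -n "$sr2" ]; then
        set -- -1 "$sr1" -2 "$sr2" "$@"
    fi

    printf '%s\n' "$@"
}
```

Calling it with only the long reads yields a command line without `-1`/`-2`, which matches the edited working solution above; supplying both short-read files restores the reference-based behaviour.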

If you want to contribute to the module and pipeline, you can make these changes via a PR to nf-core/modules, and we can then update the pipeline (with credit to you!). The contributions will be gratefully received :)

Note that @muabnezor is currently overhauling the long-read/nanopore preprocessing tools anyway: we just merged porechop_abi into the dev branch as a faster replacement for porechop, and next we plan to add nanoq as an alternative to Filtlong. So if you prefer, you could wait for that instead.

That said, I think updating filtlong would still be very helpful to the community as a whole. Let me know what you think!

ddomman added the "bug" label (Something isn't working) Oct 14, 2024

jfy133 added the "enhancement" label (New feature or request) and removed the "bug" label (Something isn't working) Oct 14, 2024

jfy133 changed the title ~~Filtlong takes forever with large hybrid datasets~~ Optional skipping of short-read input to Filtlong for large datasets Oct 14, 2024

jfy133 mentioned this issue Oct 15, 2024

Add chopper and nanoq options for longread preprocessing #692

Open

11 tasks
