Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding shortread deduplication feature with fastp #439

Merged
merged 14 commits into from
Feb 1, 2024
Merged

Conversation

maxibor
Copy link
Member

@maxibor maxibor commented Jan 25, 2024

This PR adds deduplication of reads with fastp

PR checklist

  • This comment contains a description of changes (with reason).
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • CHANGELOG.md is updated.

Copy link

github-actions bot commented Jan 25, 2024

nf-core lint overall result: Failed ❌

Posted for pipeline commit 7e3f119

+| ✅ 183 tests passed       |+
#| ❔   1 tests were ignored |#
-| ❌  11 tests failed       |-

❌ Test failures:

  • files_exist - File must be removed: lib/nfcore_external_java_deps.jar
  • nextflow_config - Config default value incorrect: params.igenomes_base is set as s3://ngi-igenomes/igenomes in nextflow_schema.json but is s3://ngi-igenomes/igenomes/ in nextflow.config.
  • files_unchanged - .github/workflows/branch.yml does not match the template
  • files_unchanged - .github/workflows/linting_comment.yml does not match the template
  • files_unchanged - .github/workflows/linting.yml does not match the template
  • files_unchanged - assets/email_template.html does not match the template
  • files_unchanged - assets/email_template.txt does not match the template
  • files_unchanged - assets/nf-core-taxprofiler_logo_light.png does not match the template
  • files_unchanged - docs/images/nf-core-taxprofiler_logo_light.png does not match the template
  • files_unchanged - docs/images/nf-core-taxprofiler_logo_dark.png does not match the template
  • files_unchanged - pyproject.toml does not match the template

❔ Tests ignored:

✅ Tests passed:

Run details

  • nf-core/tools version 2.12
  • Run at 2024-02-01 10:33:30

@maxibor maxibor requested review from sofstam and jfy133 and removed request for sofstam January 25, 2024 16:48
@maxibor
Copy link
Member Author

maxibor commented Jan 25, 2024

An easy PR for you @jfy133 to offset #548 😉

Copy link
Member

@jfy133 jfy133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed this is much easier 😬

Doc improvements but otherwise LGTM. @sofstam @LilyAnderssonLee should we add here as 1.5 or to bouncy basenji? It's not a bug fix exactly... but it's not a fully fledged new functionality addition?

We do have potentially teh centrifuge fix coming up (if I can fix it...) so there likely will be a 1.1.5

CHANGELOG.md Outdated Show resolved Hide resolved
nextflow_schema.json Outdated Show resolved Hide resolved
nextflow_schema.json Outdated Show resolved Hide resolved
@sofstam
Copy link
Collaborator

sofstam commented Jan 26, 2024

I would say it is 1.1.5.

@sofstam
Copy link
Collaborator

sofstam commented Jan 26, 2024

Getting back with a review later today :)

maxibor and others added 3 commits January 26, 2024 09:09
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
@Midnighter
Copy link
Collaborator

I'm somewhat concerned with this feature. If I remember correctly, FASTQC and fastp use the first 50 and 75 bp, respectively, to judge read duplication. Using longer sequences would drive up memory requirements and take longer. So the first question is, are we truly only removing identical reads with this?

My second question comes from my inexperience with sequencing: If you have a dominant species in your metagenomic sample, how unlikely is it to have an identical read?

@jfy133
Copy link
Member

jfy133 commented Jan 28, 2024

sing longer sequences would drive up memory requirements and take longer. So the first question is, are we truly only removing identical reads with this?

Does it really? the README at least seems to implies it's some condensed hash of the whole read: https://github.com/OpenGene/fastp#duplication-rate-evaluation. That said, it's opt-in so it's still up to the user to decide if it's a suitable algorithm

My second question comes from my inexperience with sequencing: If you have a dominant species in your metagenomic sample, how unlikely is it to have an identical read?

An absolutely exact duplicate is quite unlikely, as

  1. fragmentation protocols should be random (with a slight preference breakages around GCs IIRC), so that in combination with (relatively) longer reads it's unlikely due to sequence diversity.

  2. Exact duplicates are much more likely from lab-based amplicons as they use the same priming sequence, and given the number of amplification cycles also very likely to have copies from artifical duplicate rather than naturally occuring. At least in Illumina short-read protocols that is.

@Midnighter
Copy link
Collaborator

Thank you for your response, sounds good to me then. 👍🏼

nextflow_schema.json Outdated Show resolved Hide resolved
@jfy133 jfy133 merged commit 0792d91 into nf-core:dev Feb 1, 2024
8 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants