Enable download of files from 10X Genomics experiments #145

Closed
wants to merge 14 commits into from

Conversation

FelixKrueger

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/fetchngs branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

We previously opened an issue (#144) because fetchngs currently fails to download data from 10X Genomics experiments. That issue describes in much more detail what goes wrong and how to fix it.

Changes

In essence, we have changed the following things (for the PREFETCH_FASTERQDUMP_SRATOOLS workflow only):

  • increased the number of files that can be downloaded from 1 (single-end) or 2 (paired-end) to potentially 3 or 4 FastQ files (e.g. with single or dual indexing, when the index reads are marked as 'technical reads')
  • added the options --split-files and --include-technical to fasterq-dump
  • changed the way file names are recognised in the main workflow (the data structure changes to a list object if 3 or more files are present)
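
To illustrate the file-name handling described above, here is a minimal Python sketch of how split output files could be grouped per run and labelled by file count. It is not the pipeline's code (the actual change lives in the Nextflow module and workflow): the helper names `group_reads` and `describe_layout`, the layout labels, and the accession `SRR0000001` are hypothetical, and the regular expression only mirrors the `*_{1,2,3,4}.fastq.gz` glob used in the module.

```python
import re
from typing import Dict, List

# Hypothetical helper, not part of fetchngs: mirrors the
# '*_{1,2,3,4}.fastq.gz' glob by accepting read indices 1 to 4 only.
READ_FILE = re.compile(r"^(?P<prefix>.+)_(?P<index>[1-4])\.fastq\.gz$")


def group_reads(filenames: List[str]) -> Dict[str, List[str]]:
    """Map each file prefix to its split read files, ordered by read index."""
    grouped: Dict[str, List[str]] = {}
    for name in filenames:
        match = READ_FILE.match(name)
        if match is None:
            continue  # not a split read file, e.g. 'PREFIX.fastq.gz'
        grouped.setdefault(match.group("prefix"), []).append(name)
    for files in grouped.values():
        files.sort()  # same prefix, single-digit index: plain sort is enough
    return grouped


def describe_layout(files: List[str]) -> str:
    """Label a group by its file count (labels are illustrative assumptions)."""
    count = len(files)
    if count == 2:
        return "paired-end"
    if count in (3, 4):
        return "reads plus technical (index) reads"
    return "unexpected"


if __name__ == "__main__":
    # 'SRR0000001' is a placeholder accession, not a real dataset.
    outputs = [
        "SRR0000001_2.fastq.gz",
        "SRR0000001_1.fastq.gz",
        "SRR0000001_3.fastq.gz",
    ]
    for prefix, files in group_reads(outputs).items():
        print(prefix, describe_layout(files), files)
```

The point of the sketch is only that downstream code can no longer assume one or two files per run; anything that indexes into the file list has to cope with a third and fourth entry.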

Tests

We have carried out tests using single-end as well as paired-end files, both using the ENA (default) and SRATOOLS options; the pipelines and resulting files are all in working order, as before.

For a 10X Genomics test dataset, the new version produces 3 output files via the SRATOOLS route (see #144 for additional details). In ENA mode, only a single bulk (and meaningless) file is produced, as before.

I am afraid I am not able to add any meaningful CI tests (don't really know how to), but maybe you would be able to find a minimal test case that works?

NOTE:

We have not changed anything in the workflow that downloads data from the ENA (which is the default of fetchngs). The ENA does not serve reads that are marked as 'technical' at all, so all 10X Genomics data will appear as a single FastQ file, which means that the read carrying the cell-ID and UMI is missing. Thus, for 10X data you have to force downloads via the sratoolkit route, or end up with one single, bulk file.

Many thanks to @wzheng0520 for figuring this out, and the nf-core community for their constant support!

Comment on lines +29 to +37
fastq = meta.single_end ? '*.fastq.gz' : '*_{1,2,3,4}.fastq.gz'
def outfile = meta.single_end ? "${prefix}.fastq" : prefix
"""
export NCBI_SETTINGS="\$PWD/${ncbi_settings}"

fasterq-dump \\
$args \\
--split-files \\
--include-technical \\
Member

modules should be modified in nf-core modules, or patched in the pipeline.

@maxulysse
Member

how is your PR coming from your master and not your dev branch?

@FelixKrueger
Author

Hmm, it seems I only have a master branch in my private fork....

@drpatelh
Member

drpatelh commented Apr 26, 2023

Will be fixed in #146

@drpatelh drpatelh closed this Apr 26, 2023