Add support to download 10X Genomics data #144

Closed
FelixKrueger opened this issue Apr 24, 2023 · 0 comments
Labels
enhancement Improvement for existing functionality
Milestone

Comments

@FelixKrueger
Description of the bug

In its current form, fetchngs does not download the files required for re-processing single-cell experiments from the 10X Genomics platforms.

As discussed on the Slack channel, 10X data currently gets downloaded only as a single FastQ file. However, 10X data typically contains the cell ID and UMI in Read 1 (~28 bp), while Read 2 is the RNA insert (~91 bp). Read 3 tends to be the Illumina multiplexing index (mostly irrelevant, as the reads should all belong to a single sample anyway). Read 1 is flagged as technical, so fasterq-dump currently does not include it, which turns the single-cell experiment into one big bulk RNA-seq dataset.

Note:

It is also worth noting that the ENA does not serve out technical reads at all, so 10X raw data can only be obtained via the SRA (prefetch, or fasterq-dump + accession).

Here is a description of the bug:

This is the command run by fetchngs with a 10X sample accession SRR9320616:

fasterq-dump --threads 6 SRR9320616 --outfile SRR9320616.fastq

It gives the following output:

SRR9320616.fastq

This output is arguably useless for single-cell (re-)analysis.

Proposal:

This is the command required for 10X data. It uses both --split-files and --include-technical:

fasterq-dump --threads 6 --split-files --include-technical SRR9320616 --outfile SRR9320616.fastq --progress

It gives the following output:

SRR9320616_1.fastq
SRR9320616_2.fastq
SRR9320616_3.fastq
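
As a rough illustration of how the two invocations differ, here is a minimal Python sketch that assembles the argument list. The helper name `build_fasterq_dump_cmd` and its `tenx` switch are hypothetical; only the flags already shown above are used, and the command is printed rather than executed.

```python
def build_fasterq_dump_cmd(accession, threads=6, tenx=False):
    """Assemble a fasterq-dump argument list for one SRA run accession.

    `tenx` is a hypothetical switch for this sketch: when set, the two
    options proposed in this issue are added to the default command.
    """
    cmd = ["fasterq-dump", "--threads", str(threads)]
    if tenx:
        # Keep the technical read (barcode + UMI) and write one file per read.
        cmd += ["--split-files", "--include-technical"]
    cmd += [accession, "--outfile", f"{accession}.fastq"]
    return cmd


# Print the command instead of running it, so the sketch has no side effects.
print(" ".join(build_fasterq_dump_cmd("SRR9320616", tenx=True)))
```

The resulting list could be handed to `subprocess.run` on a machine with the SRA Toolkit installed.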

Read 1 is the cell barcode + UMI:

@SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=26
NCACCTTCTGCTGTCGCCGATGTTGT
+SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=26
#AAFFJJJJJJJJJJJJJJJJJJJJJ

Read 2 is the RNA insert read:

@SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=98
NGTTACGCTAGTAATCCCTCTACCTTTAGCCACTCACTTGGCCCTAGGTAACTAAGACCCTGACATCACTTTGCCTCTTAGGGCACAAGGAGGAACTA
+SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=98
#A<FAFAAJFF-<FAJFF<--FFAJ-7F-7<--7-<--7-777-7<77-7F<AJJ7J-----A7-A-FFF7<-7--7F<JF---AAAJ7<J---7--F

Read 3 is the multiplexing index read (not strictly required, but it doesn't hurt and can always be deleted afterwards if desired):

@SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=8
NTTGAGAA
+SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=8
#AA-FFJF
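
To make the role of each read concrete, here is a small Python sketch that splits a Read 1 sequence into barcode and UMI and guesses a read's role from its length. It assumes a 16 bp cell barcode followed by the UMI, which is not stated in this issue, and the length cut-offs are illustrative values chosen to fit the three records above; `split_read1` and `guess_role` are hypothetical helper names.

```python
# Assumption (not stated in the issue): the first 16 bases of Read 1 are the
# cell barcode and the remainder is the UMI.
BARCODE_LEN = 16


def split_read1(seq, barcode_len=BARCODE_LEN):
    """Split a Read 1 sequence into (cell_barcode, umi)."""
    return seq[:barcode_len], seq[barcode_len:]


def guess_role(read_length):
    """Guess a read's role from its length; the cut-offs are illustrative only."""
    if read_length <= 10:
        return "index"
    if read_length <= 30:
        return "barcode_umi"
    return "insert"


# Read 1 of the first record shown above (length=26).
barcode, umi = split_read1("NCACCTTCTGCTGTCGCCGATGTTGT")
print(barcode, umi)

# Lengths of the three reads shown above.
print([guess_role(n) for n in (26, 98, 8)])
```

Without Read 1, neither the barcode nor the UMI is available, which is why a single merged FastQ file cannot be assigned back to individual cells.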

Adding these options to the pipeline, either via a config file or directly within the fasterq-dump process, works fine.

process {
    withName: 'SRATOOLS_FASTERQDUMP' {
        ext.args = '--split-files --include-technical'
    }
}

Download, extraction into 3 files as well as the pigz compression appear to have worked well:

2023-04-24 10:47:08          0
2023-04-24 10:50:35          6 .command.begin
2023-04-24 11:33:02         90 .command.err
2023-04-24 11:35:08         90 .command.log
2023-04-24 11:33:01          0 .command.out
2023-04-24 10:47:08      13370 .command.run
2023-04-24 10:47:08        527 .command.sh
2023-04-24 11:33:02        261 .command.trace
2023-04-24 11:35:06          1 .exitcode
2023-04-24 11:33:03 3133859956 SRX6088086_SRR9320616_1.fastq.gz
2023-04-24 11:33:03 8441509889 SRX6088086_SRR9320616_2.fastq.gz
2023-04-24 11:33:03 1496357946 SRX6088086_SRR9320616_3.fastq.gz
2023-04-24 11:33:03        124 versions.yml

I have changed the file pattern recognition to:

fastq = meta.single_end ? '*.fastq.gz' : '*_{1,2,3,4}.fastq.gz'
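
As a quick sanity check of what the widened pattern is meant to pick up, here is a Python sketch using a regular expression that approximates the glob `*_{1,2,3,4}.fastq.gz`. The regex is my own approximation and not how Nextflow resolves globs internally; the file names are taken from the listing above.

```python
import re

# Approximation of the glob '*_{1,2,3,4}.fastq.gz' as a regular expression.
MULTI_READ = re.compile(r".*_[1-4]\.fastq\.gz")

names = [
    "SRX6088086_SRR9320616_1.fastq.gz",
    "SRX6088086_SRR9320616_2.fastq.gz",
    "SRX6088086_SRR9320616_3.fastq.gz",
    "versions.yml",
]

# Keep only the names the pattern would select.
matched = [name for name in names if MULTI_READ.fullmatch(name)]
print(matched)
```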

However, the files then never get published, and I suspect it has to do with how the read names are extracted afterwards:

https://github.com/FelixKrueger/fetchngs/blob/62b2bc840b14465a0ff551f614d613a15fdef582/workflows/sra.nf#L120-L132

sra.nf

SRA_FASTQ_FTP
           .out
           .fastq
           .mix(FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS.out.reads)
           .map { 
               meta, fastq ->
                   def reads = meta.single_end ? [ fastq ] : fastq
                   def meta_clone = meta.clone()
                   meta_clone.fastq_1 = reads[0] ? "${params.outdir}/fastq/${reads[0].getName()}" : ''
                   meta_clone.fastq_2 = reads[1] && !meta.single_end ? "${params.outdir}/fastq/${reads[1].getName()}" : ''
                   return meta_clone
           }
           .set { ch_sra_metadata }

This is the error message that brings the whole process down:

Unknown method invocation `getName` on ArrayList type
-- Check script '.nextflow/assets/FelixKrueger/fetchngs/./workflows/sra.nf' at line: 128 or see 'nf-62eTOEybyloWFq.log' file for more details
WARN: Failed to publish file: s3://altos-lab-nextflow/scratch/5c32VUHOyVZskM/aa/b062914e17b4b9d68ae187ffb920a7/SRX6088086_SRR9320616_2.fastq.gz; to: s3://testbucket/results/fastq/SRX6088086_SRR9320616_2.fastq.gz [copy] -- See log file for details

It could be really trivial to get the getName() method to work with the new data structure, but I am currently at a loss as to how to fix it.
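
One plausible reading of the error, not confirmed in this thread: if `meta.single_end` is true for this accession, then `[ fastq ]` wraps what is now a list of three files in another list, so `reads[0]` is itself a list and has no `getName()`. The Python sketch below models that idea by flattening the input before taking file names. The pipeline itself is Nextflow/Groovy, so this only illustrates the logic; `flatten` and `published_paths` are hypothetical names.

```python
from pathlib import PurePosixPath


def flatten(items):
    """Flatten arbitrarily nested lists/tuples into one flat list."""
    if not isinstance(items, (list, tuple)):
        return [items]
    flat = []
    for item in items:
        flat.extend(flatten(item))
    return flat


def published_paths(meta, fastq, outdir):
    """Return a copy of `meta` with fastq_1/fastq_2 set to published paths.

    Works whether `fastq` is a single path, a flat list, or a nested list,
    which is the case that appears to trigger the getName error.
    """
    reads = flatten(fastq)
    out = dict(meta)  # copy, so the caller's meta is left untouched

    def target(read):
        return f"{outdir}/fastq/{PurePosixPath(read).name}"

    out["fastq_1"] = target(reads[0]) if reads else ""
    # Mirrors the original logic: fastq_2 is only filled for paired-end data.
    out["fastq_2"] = target(reads[1]) if len(reads) > 1 and not meta["single_end"] else ""
    return out


meta = {"id": "SRX6088086_SRR9320616", "single_end": True}
files = [f"work/SRX6088086_SRR9320616_{i}.fastq.gz" for i in (1, 2, 3)]
print(published_paths(meta, [files], "results"))
```

How the second and third files should be recorded in the metadata for 10X runs is left open here; the sketch only avoids calling a file-name method on a list.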

Many thanks for your kind attention!

Command used and terminal output

No response

Relevant files

No response

System information

No response

@FelixKrueger added the "bug" (Something isn't working) label Apr 24, 2023
@drpatelh added this to the 1.10 milestone Apr 25, 2023
@drpatelh changed the title from "FetchNGS does currently not support download 10X Genomics data" to "Add support to download 10X Genomics data" Apr 25, 2023
@drpatelh added the "enhancement" (Improvement for existing functionality) label and removed the "bug" (Something isn't working) label Apr 25, 2023