Add support to download 10X Genomics data #144

FelixKrueger · 2023-04-24T17:04:48Z

Description of the bug

In its current form, fetchngs does not download the files required for re-processing single-cell experiments from the 10X Genomics platform.

As discussed on the Slack channel, 10X data currently gets downloaded only as a single FastQ file. However, 10X data typically contains the cell ID and UMI in Read 1 (~28 bp), while Read 2 is the RNA insert (~91 bp). Read 3 tends to be the Illumina multiplexing index (mostly irrelevant, as the reads should all belong to a single sample anyway). Read 1 is flagged as a technical read, so fasterq-dump currently does not include it, which turns the single-cell experiment into one big bulk RNA-seq dataset.

Note:

It is also worth noting that the ENA does not serve out technical reads at all, so 10X raw data can only be obtained via the SRA (prefetch, or fasterq-dump + accession).

Here is a description of the bug:

This is the command run by fetchngs with a 10X sample accession SRR9320616:

fasterq-dump --threads 6 SRR9320616 --outfile SRR9320616.fastq

It gives the following output:

SRR9320616.fastq

This output is arguably useless for single-cell (re-)analysis.

Proposal:

This is the command required for 10X data. It uses both --split-files and --include-technical:

fasterq-dump --threads 6 --split-files --include-technical SRR9320616 --outfile SRR9320616.fastq --progress

It gives the following output:

SRR9320616_1.fastq
SRR9320616_2.fastq
SRR9320616_3.fastq

Read 1 is the cell barcode + UMI:

@SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=26
NCACCTTCTGCTGTCGCCGATGTTGT
+SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=26
#AAFFJJJJJJJJJJJJJJJJJJJJJ
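To illustrate what Read 1 carries, here is a minimal sketch that splits the sequence above into a cell barcode and a UMI. It assumes a 16 bp barcode followed by a 10 bp UMI, which fits the 26 bp read shown here but is an assumption about the chemistry of this library, not something confirmed in this report. The function name `split_read1` is made up for the example.

```shell
#!/bin/sh
# Split a Read 1 sequence into cell barcode and UMI (tab-separated).
# ASSUMPTION: the first 16 bases are the cell barcode and the rest is the UMI.
# Other 10X chemistries may use different lengths, so adjust the offsets accordingly.
split_read1() {
    barcode=$(printf '%s\n' "$1" | cut -c1-16)
    umi=$(printf '%s\n' "$1" | cut -c17-)
    printf '%s\t%s\n' "$barcode" "$umi"
}

# Sequence taken from the Read 1 record above.
split_read1 NCACCTTCTGCTGTCGCCGATGTTGT
```

Without Read 1, none of this information is available, which is why the default download cannot be used for single-cell analysis.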

Read 2 is the RNA insert read:

@SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=98
NGTTACGCTAGTAATCCCTCTACCTTTAGCCACTCACTTGGCCCTAGGTAACTAAGACCCTGACATCACTTTGCCTCTTAGGGCACAAGGAGGAACTA
+SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=98
#A<FAFAAJFF-<FAJFF<--FFAJ-7F-7<--7-<--7-777-7<77-7F<AJJ7J-----A7-A-FFF7<-7--7F<JF---AAAJ7<J---7--F

Read 3 is the multiplexing index read (not strictly required, but it doesn't hurt and can always be deleted afterwards if desired):

@SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=8
NTTGAGAA
+SRR9320616.1 K00125:67:HHJF7BBXX:1:1101:2777:998 length=8
#AA-FFJF
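A quick way to tell which split file holds which read is to look at the length of the first sequence in each. The sketch below does that with awk. The helper name `first_read_length` is made up for the example, and it only inspects the first record, so it is a sanity check rather than a proper validation.

```shell
#!/bin/sh
# Print the length of the first sequence in a FASTQ file
# (the sequence is line 2 of the first four-line record).
first_read_length() {
    awk 'NR == 2 { print length($0); exit }' "$1"
}

# Report the first read length for each split file that is present.
for fq in SRR9320616_1.fastq SRR9320616_2.fastq SRR9320616_3.fastq; do
    if [ -f "$fq" ]; then
        printf '%s\t%s\n' "$fq" "$(first_read_length "$fq")"
    fi
done
```

For the records shown above, this should report 26, 98 and 8 for files 1, 2 and 3 respectively.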

Adding these options to the pipeline, either via a config file or directly within the fasterq-dump process, works fine.

process {
    withName: 'SRATOOLS_FASTERQDUMP' {
        ext.args = '--split-files --include-technical'
    }
}

The download, the extraction into 3 files and the pigz compression all appear to have worked well:

2023-04-24 10:47:08          0
2023-04-24 10:50:35          6 .command.begin
2023-04-24 11:33:02         90 .command.err
2023-04-24 11:35:08         90 .command.log
2023-04-24 11:33:01          0 .command.out
2023-04-24 10:47:08      13370 .command.run
2023-04-24 10:47:08        527 .command.sh
2023-04-24 11:33:02        261 .command.trace
2023-04-24 11:35:06          1 .exitcode
2023-04-24 11:33:03 3133859956 SRX6088086_SRR9320616_1.fastq.gz
2023-04-24 11:33:03 8441509889 SRX6088086_SRR9320616_2.fastq.gz
2023-04-24 11:33:03 1496357946 SRX6088086_SRR9320616_3.fastq.gz
2023-04-24 11:33:03        124 versions.yml
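Beyond file sizes, one way to check that the three files belong together is to count the records in each. The sketch below is a rough check and assumes well-formed four-line FASTQ records. The helper name `count_reads` is made up for the example.

```shell
#!/bin/sh
# Count the reads in a gzipped FASTQ file.
# ASSUMPTION: every record spans exactly four lines.
count_reads() {
    gzip -cd < "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Print the read count for each compressed file that is present.
for fq in SRX6088086_SRR9320616_1.fastq.gz \
          SRX6088086_SRR9320616_2.fastq.gz \
          SRX6088086_SRR9320616_3.fastq.gz; do
    if [ -f "$fq" ]; then
        printf '%s\t%s\n' "$fq" "$(count_reads "$fq")"
    fi
done
```

One would expect the three counts to be identical, since each spot should contribute one read to every file. Differing counts would suggest a truncated or incomplete extraction.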

I have changed the file pattern recognition to:

fastq = meta.single_end ? '*.fastq.gz' : '*_{1,2,3,4}.fastq.gz'

However, the files then never get published, and I suspect it has to do with how the read names are extracted afterwards:

https://github.com/FelixKrueger/fetchngs/blob/62b2bc840b14465a0ff551f614d613a15fdef582/workflows/sra.nf#L120-L132

sra.nf

SRA_FASTQ_FTP
           .out
           .fastq
           .mix(FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS.out.reads)
           .map { 
               meta, fastq ->
                   def reads = meta.single_end ? [ fastq ] : fastq
                   def meta_clone = meta.clone()
                   meta_clone.fastq_1 = reads[0] ? "${params.outdir}/fastq/${reads[0].getName()}" : ''
                   meta_clone.fastq_2 = reads[1] && !meta.single_end ? "${params.outdir}/fastq/${reads[1].getName()}" : ''
                   return meta_clone
           }
           .set { ch_sra_metadata }

This is the error message that brings the whole process down:

Unknown method invocation `getName` on ArrayList type
-- Check script '.nextflow/assets/FelixKrueger/fetchngs/./workflows/sra.nf' at line: 128 or see 'nf-62eTOEybyloWFq.log' file for more details
WARN: Failed to publish file: s3://altos-lab-nextflow/scratch/5c32VUHOyVZskM/aa/b062914e17b4b9d68ae187ffb920a7/SRX6088086_SRR9320616_2.fastq.gz; to: s3://testbucket/results/fastq/SRX6088086_SRR9320616_2.fastq.gz [copy] -- See log file for details

It may well be trivial to get the getName() method to work with the new data structure, but I am currently at a loss as to how to fix it.

Many thanks for your kind attention!

Command used and terminal output

No response

Relevant files

No response

System information

No response

FelixKrueger added the bug (Something isn't working) label Apr 24, 2023

drpatelh added this to the 1.10 milestone Apr 25, 2023

drpatelh mentioned this issue Apr 25, 2023

Errors downloading scrnaseq data using Amazon Genomics CLI (version 1.5.2) #130

Closed

FelixKrueger mentioned this issue Apr 25, 2023

Enable download of files form 10X Genomics experiments #145

Closed

7 tasks

drpatelh changed the title ~~FetchNGS does currently not support download 10X Genomics data~~ Add support to download 10X Genomics data Apr 25, 2023

drpatelh added the enhancement (Improvement for existing functionality) label and removed the bug (Something isn't working) label Apr 25, 2023

drpatelh mentioned this issue Apr 26, 2023

Add support to download 10X Genomics data #146

Merged

drpatelh closed this as completed Apr 26, 2023

drpatelh mentioned this issue May 5, 2023

fetchngs pipeline not working on scRNAseq #32

Closed

drpatelh mentioned this issue Jan 30, 2024

Add ability to download more than 2 FastQ files via FTP and Aspera #260

Open
