Cellranger / spaceranger input file handling #241

grst · 2023-06-14T15:53:14Z

Description of feature

Cellranger (and also spaceranger, and probably other 10x pipelines) relies on the following inputs:

--fastqs, to point to a directory with fastq files that are named according to {sample_name}_S{i}_L00{i}_{R1,R2}_001.fastq.gz.
--sample, with a sample name (that is the prefix of all associated fastq files)
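As a rough illustration of the naming convention above, the following sketch parses a file name into its components. The pattern and the function name `parse_fastq_name` are assumptions made for illustration; in particular it assumes three-digit lane numbers (L001, L002, ...) and is not taken from cellranger itself.

```python
import re

# Pattern for {sample_name}_S{i}_L00{i}_{R1,R2}_001.fastq.gz.
# Assumption: lane numbers are always written with three digits.
FASTQ_PATTERN = re.compile(
    r"^(?P<sample>.+)_S(?P<sample_index>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read>R[12])_001\.fastq\.gz$"
)


def parse_fastq_name(filename):
    """Return the name components as a dict, or None if the name does not match."""
    match = FASTQ_PATTERN.match(filename)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_fastq_name("test_sample_S1_L001_R1_001.fastq.gz"))
    print(parse_fastq_name("SRR1234567_1.fastq.gz"))
```

A name that returns None here would need renaming before it can be passed to cellranger via --fastqs and --sample.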

In the easiest case, --fastqs points to a directory that contains all fastqs of a single sample that already follow the naming conventions. However, it's not always easy:

the directory may contain fastqs for multiple samples
- this is not a problem for cellranger, it will automatically choose the correct fastq files via the --sample flag as long as they follow the naming convention, but
- if we stage the whole folder (or all files in that folder) using nextflow, it breaks caching in that if an additional sample (or any other file) gets added to the folder, the cache gets invalidated for all samples.
the sample has been sequenced across multiple flow cells
- In this case, we need to specify multiple input folders. Cellranger allows passing a comma-separated list of directories, e.g. --fastqs=/path/dir1,/path/dir2
- Staging all files in these folders into a single directory using nextflow doesn't do the job, as there may be duplicate file names across the different folders.
the raw sequencing data may have been downloaded from a sequence database and doesn't follow the naming convention anymore.
- In that case we need to rename the files to follow the bcl2fastq naming convention.

The symptoms are seen in the following issues

Possible solutions

Instead of staging an entire directory, stage only the files that are required for this sample. This is already done in #scrnaseq, because files (instead of folders) are specified in the input sample sheet. In #spatialtranscriptomics, this is still under discussion, with the current preference leaning towards specifying folders and filtering for the required files automatically.
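Filtering a folder listing for the files of one sample could look roughly like the sketch below. The helper `select_sample_fastqs` is hypothetical and assumes the files already follow the {sample_name}_S{i}_L00{i}_{R1,R2}_001.fastq.gz convention; it is not how either pipeline currently does it.

```python
import re


def select_sample_fastqs(filenames, sample):
    """Return only the file names that belong to `sample`, sorted by name."""
    # re.escape guards against sample names containing regex metacharacters.
    pattern = re.compile(
        re.escape(sample) + r"_S\d+_L\d{3}_R[12]_001\.fastq\.gz"
    )
    return sorted(name for name in filenames if pattern.fullmatch(name))


if __name__ == "__main__":
    listing = [
        "A_S1_L001_R1_001.fastq.gz",
        "A_S1_L001_R2_001.fastq.gz",
        "AB_S2_L001_R1_001.fastq.gz",
        "notes.txt",
    ]
    print(select_sample_fastqs(listing, "A"))
```

Because only the matching files would be staged, adding another sample or an unrelated file to the folder should not change the inputs of existing tasks.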

For a sample spread across multiple folders, possible solutions are:
a) stage files into subdirectories
b) rename files. It shouldn't matter if we have

folder1/test_sample_S1_L001_R1_001.fastq.gz
folder2/test_sample_S1_L001_R1_001.fastq.gz

or this

folder1/test_sample_S1_L001_R1_001.fastq.gz
folder1/test_sample_S1_L002_R2_001.fastq.gz

Rename the files to follow the naming convention
```
sample_S1_L00{i}_R{1,2}_001.fastq.gz
```
where {i} refers to an incrementing integer number for each pair of fastq files.
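A minimal sketch of this renaming scheme, assuming the fastq files have already been grouped into (R1, R2) pairs. The function name `plan_renames` and the example paths are made up for illustration, and the lane number is zero-padded to three digits, which matches L00{i} only for fewer than ten pairs.

```python
def plan_renames(pairs, sample="sample"):
    """Map each original fastq path to a name that follows the convention.

    `pairs` is a list of (r1_path, r2_path) tuples; the i-th pair (counting
    from 1) is assigned lane number i.
    """
    renames = {}
    for i, (r1, r2) in enumerate(pairs, start=1):
        renames[r1] = f"{sample}_S1_L{i:03d}_R1_001.fastq.gz"
        renames[r2] = f"{sample}_S1_L{i:03d}_R2_001.fastq.gz"
    return renames


if __name__ == "__main__":
    pairs = [
        ("folder1/run_1.fastq.gz", "folder1/run_2.fastq.gz"),
        ("folder2/run_1.fastq.gz", "folder2/run_2.fastq.gz"),
    ]
    for source, target in plan_renames(pairs).items():
        print(source, "->", target)
```

The function only computes target names; the actual move or symlink step is left to the calling script.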

Discussion

Renaming seems the most general solution. To be sure that renaming makes no difference, this should be tried out with a test sample or alternatively confirmed by someone who knows cellranger's inner workings.

Implementation

It is impossible to stage all files into a single directory because of potential name clashes.

~~[ ] does stageAs take a callback?~~
Can filenames be manipulated in groovy via channel operations?
- I don't think they can, except by renaming the original file, which we do not want to do.
Worst case: launch one process per file that renames it and emits it. This will lead to unnecessary network traffic when using s3 buckets as storage. There could be a flag to skip this process if the samples already follow the naming convention.

Possible solution:
stageAs: "???/*" allows putting each individual file into a separate folder. We can then move and rename them as appropriate using a script.
(1) either rely on the input order of the files (sample1_R1, sample1_R2, sample2_R1, sample2_R2, ...) or
(2) match files based on their name (files that don't differ except in R1/R2 are a pair).

If going for (1), (2) should be included as an additional check and raise a warning/error if it's not fulfilled.
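Combining (1) with (2) as a check might look like the sketch below. It assumes each file was staged into its own numbered folder and that the read marker is written as "_R1"/"_R2"; other conventions such as "_1"/"_2" would need a different check. The function names are hypothetical.

```python
import os


def is_read_pair(name1, name2):
    """True if the names differ in exactly one character: the 1/2 after '_R'."""
    if len(name1) != len(name2):
        return False
    diffs = [i for i, (a, b) in enumerate(zip(name1, name2)) if a != b]
    if len(diffs) != 1:
        return False
    pos = diffs[0]
    return (
        name1[pos] == "1"
        and name2[pos] == "2"
        and pos >= 2
        and name1[pos - 2:pos] == "_R"
    )


def pair_in_order(staged_paths):
    """Pair files as (R1, R2) by input order, then verify each pair by name."""
    if len(staged_paths) % 2 != 0:
        raise ValueError("expected an even number of fastq files")
    pairs = []
    for r1, r2 in zip(staged_paths[0::2], staged_paths[1::2]):
        # Compare base names only, since every file sits in its own folder.
        if not is_read_pair(os.path.basename(r1), os.path.basename(r2)):
            raise ValueError(f"not an R1/R2 pair: {r1!r}, {r2!r}")
        pairs.append((r1, r2))
    return pairs


if __name__ == "__main__":
    staged = [
        "001/s_R1.fastq.gz", "002/s_R2.fastq.gz",
        "003/s_R1.fastq.gz", "004/s_R2.fastq.gz",
    ]
    print(pair_in_order(staged))
```

Matching by name alone would be ambiguous in this example because both flow cells use identical file names, which is why the input order does the pairing and the name comparison only serves as a safeguard.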

grst · 2023-06-19T14:32:07Z

I tried it out: Specifying the same fastqs distributed across different flow cells, in a single folder, or concatenated using cat gives exactly the same result down to identical md5sums of the {raw,filtered}_feature_bc_matrix.h5.
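A comparison like this could be scripted roughly as follows. This is a sketch, not the exact procedure used here: the helper names are made up, and the paths in the usage note are placeholders for wherever the respective runs wrote their output.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def all_identical(paths):
    """True if every file in `paths` has the same md5 digest."""
    return len({md5sum(path) for path in paths}) == 1
```

For example, `all_identical(["run_a/filtered_feature_bc_matrix.h5", "run_b/filtered_feature_bc_matrix.h5"])` would compare the filtered matrices of two runs, with `run_a` and `run_b` standing in for the actual output directories.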

grst · 2023-07-07T11:16:14Z

Done via #246.

grst added the enhancement (New feature or request) label Jun 14, 2023

This was referenced Jun 14, 2023
- Switch to spaceranger module nf-core/spatialvi#45 (merged)
- Auto rename fastq files for cellranger input #214 (closed)
- Fix #214 #215 (closed)

grst mentioned this issue Jun 20, 2023
- Automatically rename input files in cellranger module nf-core/modules#3537 (merged, 14 tasks)

grst mentioned this issue Jul 3, 2023
- Update cellranger module #246 (merged, 10 tasks)

grst closed this as completed Jul 7, 2023
