Cellranger (and also spaceranger, and probably other 10x pipelines) relies on the following inputs:
--fastqs, to point to a directory with fastq files that are named according to {sample_name}_S{i}_L00{i}_{R1,R2}_001.fastq.gz.
--sample, with a sample name (that is the prefix of all associated fastq files)
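As an illustration of the naming convention above, here is a minimal sketch of a checker that parses a file name and selects the files belonging to one sample. The names `FASTQ_NAME`, `parse_fastq_name` and `files_for_sample` are hypothetical, and the assumption that the lane field is always `L` plus three digits is mine. It mimics the selection by sample prefix described above and is not cellranger's actual implementation.

```python
import re

# Pattern for the naming scheme {sample_name}_S{i}_L00{i}_{R1,R2}_001.fastq.gz.
# Assumption: the lane field is "L" followed by exactly three digits.
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<index>\d+)_L(?P<lane>\d{3})_(?P<read>R[12])_001\.fastq\.gz$"
)


def parse_fastq_name(filename):
    """Return the name components as a dict, or None if the name does not match."""
    match = FASTQ_NAME.match(filename)
    return match.groupdict() if match else None


def files_for_sample(filenames, sample):
    """Select the files whose sample prefix equals `sample`, keeping input order."""
    selected = []
    for name in filenames:
        parts = parse_fastq_name(name)
        if parts is not None and parts["sample"] == sample:
            selected.append(name)
    return selected
```

A check like this could also back a flag that skips any renaming step when all files of a sample already match the convention.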
In the easiest case, --fastqs points to a directory that contains all fastqs of a single sample that already follow the naming conventions. However, it's not always easy:
the directory may contain fastqs for multiple samples
This is not a problem for cellranger itself: it automatically selects the correct fastq files via the --sample flag, as long as they follow the naming convention.
However, if we stage the whole folder (or all files in that folder) using nextflow, caching breaks: as soon as an additional sample (or any other file) is added to the folder, the cache is invalidated for all samples.
the sample has been sequenced across multiple flow cells
In this case, we need to specify multiple input folders. Cellranger allows passing a comma-separated list of directories, e.g. --fastqs=/path/dir1,/path/dir2
Staging all files in these folders into a single directory using nextflow doesn't do the job, as there may be duplicate file names across the different folders.
the raw sequencing data may have been downloaded from a sequence database and doesn't follow the naming convention anymore.
In that case we need to rename the files to follow the bcl2fastq naming convention.
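The name-clash problem for samples sequenced across multiple flow cells can be made concrete with a small check. This is a sketch only; `find_name_clashes` is a hypothetical helper and the paths in the usage note are made up for illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_name_clashes(paths):
    """Map each file name that occurs more than once to the full paths that carry it."""
    by_name = defaultdict(list)
    for path in paths:
        by_name[PurePosixPath(path).name].append(path)
    return {name: found for name, found in by_name.items() if len(found) > 1}
```

For example, `find_name_clashes(["/path/dir1/s_S1_L001_R1_001.fastq.gz", "/path/dir2/s_S1_L001_R1_001.fastq.gz"])` reports one clashing name, which is exactly the situation in which staging both files into a single directory would fail.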
Possible solutions
Instead of staging an entire directory, stage only the files that are required for this sample. This is already done in #scrnaseq, because files (instead of folders) are specified in the input sample sheet. In #spatialtranscriptomics, this is still under discussion, with the current preference leaning towards specifying folders and filtering for the required files automatically.
possible solutions
a) stage files into subdirectories
b) rename files. It shouldn't matter if we have
Rename the files to follow the naming conventions, where {i} refers to an incrementing integer number for each pair of fastq files.
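The renaming idea can be sketched as a pure function that maps each input pair to new names with an incrementing counter. The convention above does not pin down which field carries the counter, so this sketch assumes the lane field is incremented and the `S` field is fixed to `S1`. Both choices, along with the names `target_names` and `rename_plan`, are assumptions made for illustration.

```python
def target_names(sample, n_pairs):
    """Return one (R1 name, R2 name) tuple per pair, numbered from 1.

    Assumption: the counter goes into the lane field and the S field stays S1.
    """
    if n_pairs > 999:
        raise ValueError("the three-digit lane field cannot hold more than 999 pairs")
    names = []
    for i in range(1, n_pairs + 1):
        stem = f"{sample}_S1_L{i:03d}"
        names.append((f"{stem}_R1_001.fastq.gz", f"{stem}_R2_001.fastq.gz"))
    return names


def rename_plan(sample, pairs):
    """Map every original path in `pairs` (a list of (r1, r2) tuples) to its new file name."""
    plan = {}
    for (r1, r2), (new_r1, new_r2) in zip(pairs, target_names(sample, len(pairs))):
        plan[r1] = new_r1
        plan[r2] = new_r2
    return plan
```

Because the plan only contains names, it can be applied by symlinking or moving staged copies, leaving the original files untouched.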
Discussion
Renaming seems to be the most general solution. To be sure that renaming makes no difference to the results, this should be tried out with a test sample or, alternatively, confirmed by someone who knows cellranger's inner workings.
Implementation
it is impossible to stage all files into a single directory because of potential name clashes.
[ ] does stageAs take a callback?
Can filenames be manipulated in groovy via channel operations?
I don't think they can, except by renaming the original file, which we do not want to do.
Worst case: launch one process per file that renames it and emits it. This will lead to unnecessary network traffic when using s3 buckets as storage. There could be a flag to skip this process if the samples already follow the naming convention.
Possible solution: stageAs: "???/*" allows putting each individual file into a separate folder. We can then move and rename them as appropriate using a script. To pair the files, we can
(1) either rely on the input order of the files (sample1_R1, sample1_R2, sample2_R1, sample2_R2, ...) or
(2) match files based on their name (files that don't differ except in R1/R2 are a pair).
If going for (1), (2) should be included as an additional check and raise a warning/error if it's not fulfilled.
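Options (1) and (2) can be combined as follows: pair by input order, then verify each pair by name. This is a sketch under the assumption that the read is marked by exactly one `R1` or `R2` token delimited by `_` or `.` in the file name. The helper names are hypothetical, and only the file name (not the staging folder) is compared.

```python
import re
from pathlib import PurePosixPath

# Assumption: exactly one "R1"/"R2" token, delimited by "_" or "." on both sides.
READ_TOKEN = re.compile(r"(?<=[_.])R([12])(?=[_.])")


def pair_key(path):
    """Return (file name with the read token masked, read number) for one fastq path."""
    name = PurePosixPath(path).name
    tokens = READ_TOKEN.findall(name)
    if len(tokens) != 1:
        raise ValueError(f"expected exactly one R1/R2 token in {name!r}")
    return READ_TOKEN.sub("R?", name), tokens[0]


def pair_by_order(files):
    """Option (1): treat consecutive files as (R1, R2) pairs."""
    if len(files) % 2 != 0:
        raise ValueError("odd number of fastq files, cannot form pairs")
    return [(files[i], files[i + 1]) for i in range(0, len(files), 2)]


def check_pairs(pairs):
    """Option (2) as a check: names in a pair may differ only in R1 vs R2."""
    for r1, r2 in pairs:
        key1, read1 = pair_key(r1)
        key2, read2 = pair_key(r2)
        if key1 != key2 or (read1, read2) != ("1", "2"):
            raise ValueError(f"not an R1/R2 pair: {r1}, {r2}")
    return pairs
```

With files staged into separate folders, `check_pairs(pair_by_order(["001/sample1_R1.fastq.gz", "002/sample1_R2.fastq.gz"]))` passes, whereas a swapped or mismatched order raises an error instead of silently producing wrong pairs.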
I tried it out: Specifying the same fastqs distributed across different flow cells, in a single folder, or concatenated using cat gives exactly the same result down to identical md5sums of the {raw,filtered}_feature_bc_matrix.h5.
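A comparison of this kind can be scripted. The following is a minimal sketch, with hypothetical helper names, that hashes output files in chunks and reports whether all of them are byte-identical; the paths to compare are left to the caller.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def all_identical(paths):
    """True if every file in `paths` has the same MD5 digest."""
    return len({md5sum(path) for path in paths}) == 1
```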