Cellranger / spaceranger input file handling #241

Closed
2 tasks
grst opened this issue Jun 14, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

@grst
Member

grst commented Jun 14, 2023

Description of feature

Cellranger (and also spaceranger, and probably other 10x pipelines) rely on the following inputs:

  • --fastqs, to point to a directory with fastq files that are named according to {sample_name}_S{i}_L00{i}_{R1,R2}_001.fastq.gz.
  • --sample, with a sample name (that is the prefix of all associated fastq files)
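
To make the naming convention concrete, here is a minimal sketch of how such file names could be parsed and filtered by sample. The regular expression and the helper names (`parse_fastq_name`, `files_for_sample`) are assumptions for illustration, not part of cellranger or the pipeline, and the pattern only covers the R1/R2 form mentioned above.

```python
import re

# Assumed pattern for the names described above:
# {sample_name}_S{i}_L00{i}_{R1,R2}_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<sample_index>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read>R[12])_001\.fastq\.gz$"
)


def parse_fastq_name(filename):
    """Return the name components as a dict, or None if the name does not match."""
    match = FASTQ_NAME.match(filename)
    return match.groupdict() if match else None


def files_for_sample(filenames, sample):
    """Select the files whose sample-name prefix equals `sample`."""
    selected = []
    for name in filenames:
        parts = parse_fastq_name(name)
        if parts is not None and parts["sample"] == sample:
            selected.append(name)
    return selected
```

A name that returns `None` here is one that would need renaming before being passed on.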

In the easiest case, --fastqs points to a directory that contains all fastqs of a single sample that already follow the naming conventions. However, it's not always easy:

  1. the directory may contain fastqs for multiple samples
    • this is not a problem for cellranger, it will automatically choose the correct fastq files via the --sample flag as long as they follow the naming convention, but
    • if we stage the whole folder (or all files in that folder) using nextflow, caching breaks: as soon as an additional sample (or any other file) is added to the folder, the cache is invalidated for all samples.
  2. the sample has been sequenced across multiple flow cells
    • In this case, we need to specify multiple input folders. Cellranger allows passing a comma-separated list of directories, e.g. --fastqs=/path/dir1,/path/dir2
    • Staging all files in these folders into a single directory using nextflow doesn't do the job, as there may be duplicate file names across the different folders.
  3. the raw sequencing data may have been downloaded from a sequence database and doesn't follow the naming convention anymore.
    • In that case we need to rename the files to follow the bcl2fastq naming convention.

The symptoms are seen in the following issues

Possible solutions

  1. instead of staging an entire directory, stage only the files that are required for this sample. This is already done in #scrnaseq, because files (instead of folders) are specified in the input sample sheet. In #spatialtranscriptomics, there's still a discussion about that with the current preference leaning towards specifying folders and filtering for the required files automatically.

  2. for a sample sequenced across multiple flow cells, either
    a) stage files into subdirectories, or
    b) rename files. It shouldn't matter whether we have

    folder1/test_sample_S1_L001_R1_001.fastq.gz
    folder2/test_sample_S1_L001_R1_001.fastq.gz
    

    or this

    folder1/test_sample_S1_L001_R1_001.fastq.gz
    folder1/test_sample_S1_L002_R2_001.fastq.gz
    
  3. rename the files to follow the naming conventions

    sample_S1_L00{i}_R{1,2}_001.fastq.gz
    

    where {i} refers to an incrementing integer number for each pair of fastq files.
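
A minimal sketch of that renaming scheme, assuming the FASTQ files have already been grouped into (R1, R2) pairs. The function name `rename_plan` is hypothetical, and formatting the counter as a zero-padded three-digit lane number is an assumption that extends `L00{i}` beyond nine pairs.

```python
def rename_plan(sample, pairs):
    """Map each original path to a new name of the form
    {sample}_S1_L00{i}_R{1,2}_001.fastq.gz, with i counting pairs from 1.

    `pairs` is a list of (r1_path, r2_path) tuples. Nothing is renamed on
    disk; the function only returns the old-path -> new-name mapping.
    """
    plan = {}
    for i, (r1, r2) in enumerate(pairs, start=1):
        lane = f"L{i:03d}"  # assumption: zero-padded to three digits
        plan[r1] = f"{sample}_S1_{lane}_R1_001.fastq.gz"
        plan[r2] = f"{sample}_S1_{lane}_R2_001.fastq.gz"
    return plan


# Two flow cells with clashing file names end up with distinct target names.
plan = rename_plan(
    "test_sample",
    [
        ("folder1/test_sample_S1_L001_R1_001.fastq.gz",
         "folder1/test_sample_S1_L001_R2_001.fastq.gz"),
        ("folder2/test_sample_S1_L001_R1_001.fastq.gz",
         "folder2/test_sample_S1_L001_R2_001.fastq.gz"),
    ],
)
```

Because the mapping is keyed by the full original path, identical base names in different folders do not collide.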

Discussion

Renaming seems the most general solution. To be sure that renaming makes no difference, this should be tried out with a test sample or alternatively confirmed by someone who knows cellranger's inner wirings.

Implementation

  • it is impossible to stage all files into a single directory because of potential name clashes.
  • [ ] does stageAs take a callback?
  • can filenames be manipulated in groovy via channel operations?
    • I don't think they can, except by renaming the original file, which we do not want to do.
  • worst case: launch one process per file that renames it and emits it. This will lead to unnecessary network traffic when using s3 buckets as storage. There could be a flag to skip this process if the samples already follow the naming convention.

Possible solution:
stageAs: "???/*" allows putting each individual file into a separate folder. We can then move and rename them as appropriate using a script. To pair the files, we can
(1) either rely on the input order of the files (sample1_R1, sample1_R2, sample2_R1, sample2_R2, ...) or
(2) match files based on their name (files that don't differ except in R1/R2 are a pair).

If going for (1), (2) should be included as an additional check and raise a warning/error if it's not fulfilled.
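
A minimal sketch of combining both strategies, assuming the staged paths arrive in the order R1, R2, R1, R2, ... and that the read tag appears as `R1`/`R2` in the file name. The helper names `is_read_pair` and `pairs_from_order` are hypothetical; this version raises an error rather than a warning.

```python
from pathlib import PurePosixPath


def is_read_pair(name1, name2):
    """True if the two file names are identical except for R1 vs. R2."""
    if len(name1) != len(name2):
        return False
    diffs = [i for i, (a, b) in enumerate(zip(name1, name2)) if a != b]
    if len(diffs) != 1 or diffs[0] == 0:
        return False
    i = diffs[0]
    return name1[i - 1:i + 1] == "R1" and name2[i - 1:i + 1] == "R2"


def pairs_from_order(paths):
    """Pair files by input order (strategy 1) and verify each pair by name
    (strategy 2). Only base names are compared, since each file may be
    staged into its own folder."""
    if len(paths) % 2 != 0:
        raise ValueError("expected an even number of FASTQ files")
    pairs = []
    for r1, r2 in zip(paths[0::2], paths[1::2]):
        if not is_read_pair(PurePosixPath(r1).name, PurePosixPath(r2).name):
            raise ValueError(f"files do not form an R1/R2 pair: {r1}, {r2}")
        pairs.append((r1, r2))
    return pairs
```

For example, `pairs_from_order(["001/sample1_R1.fastq.gz", "002/sample1_R2.fastq.gz"])` yields a single pair, while swapping the two entries raises a `ValueError`.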

@grst
Member Author

grst commented Jun 19, 2023

I tried it out: Specifying the same fastqs distributed across different flow cells, in a single folder, or concatenated using cat gives exactly the same result down to identical md5sums of the {raw,filtered}_feature_bc_matrix.h5.
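
For anyone wanting to repeat this kind of check, a comparison of output files by md5sum could look like the sketch below. It only uses Python's standard `hashlib`; the helper names and any file paths passed in are placeholders, not the ones used in the test above.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(paths):
    """True if all given files have the same md5sum (False for an empty list)."""
    return len({md5sum(p) for p in paths}) == 1
```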

@grst
Member Author

grst commented Jul 7, 2023

Done via #246.

@grst grst closed this as completed Jul 7, 2023