
Integrate cellranger workflow #90

Merged
merged 34 commits into from
Jun 7, 2022
Conversation


@grst grst commented Jan 26, 2022

Integrate the cellranger workflow into the main pipeline.

(it was already implemented to some degree, but not exposed via the parameters)

Closes #31

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
    • If you've added a new tool - add to the software_versions process and a regex to scrape_software_versions.py
    • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
    • If necessary, also make a PR on the nf-core/scrnaseq branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint .).
  • Ensure the test suite passes (nextflow run . -profile test,docker).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@grst grst linked an issue Mar 7, 2022 that may be closed by this pull request
@alexblaessle

Hi @grst, do you have an estimate of how long this will take? Are there ways I can help out with this?

@grst grst changed the base branch from dev to master March 28, 2022 09:48
@grst grst changed the base branch from master to dev March 28, 2022 09:48

grst commented Mar 28, 2022

Hi @alexblaessle,

I had to switch attention to another project. The earliest time I could possibly continue working on this is May, but I don't want to make any promises.

In principle, this works already (I applied it to one of my datasets). The part that's missing (apart from writing tests) is to support a universal samplesheet (#92), handling cases where filenames don't match cellranger defaults or multiple samples are grouped into a "gem".

If you have time to move this PR forward, that would be much appreciated!

I'm wondering if we should just merge this (@apeltzer ?), and continue implementing the samplesheet standard in a follow-up PR. I was just hesitant so far because I thought it might speed up the first DSL2 release if we didn't have to deal with cellranger already.

@alexblaessle

@grst @apeltzer As far as I understand, making cellranger work from a universal sample sheet would require a file-structure check and, if that fails, at least temporarily re-organising the files, right?

As soon as I will have developer support from my side, would it make sense to have a quick call to completely understand what needs to be done?


grst commented Mar 29, 2022

As far as I understand, to make cellranger work from a universal sample sheet this would require a file structure check and if it fails at least temporarily re-organising files, right?

I was wondering if it would also work to just concatenate the fastq files manually before feeding them into cellranger. My gut feeling is that this is exactly what cellranger does anyway, but it would be good to either try it out or have it confirmed by someone who knows cellranger's inner workings.
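As a side note, the manual-concatenation idea can be sketched in a few lines (hypothetical helper, not part of this PR): gzip files concatenate byte-for-byte into one valid multi-member stream, so merging technical-replicate fastq.gz files is a plain byte-level append.

```python
import shutil

def concat_fastq_gz(parts, out_path):
    """Concatenate technical-replicate fastq.gz files by byte-level append.

    Valid because a gzip file may contain multiple members: readers
    decompress the concatenation as one continuous stream.
    """
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as f:
                shutil.copyfileobj(f, out)
```

This mirrors what `cat a.fastq.gz b.fastq.gz > merged.fastq.gz` would do on the command line.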

As soon as I will have developer support from my side, would it make sense to have a quick call to completely understand what needs to be done?

Sure! Maybe also involving @ggabernet. She has some thoughts on cellranger multi.


apeltzer commented Apr 2, 2022

Depending on when I'm back @alexblaessle (April 25th), I can also potentially join in a small call before that. Just to be in the loop.


grst commented May 19, 2022

Following up on today's call, here are the commands I used to apply the pipeline version from this PR to my dataset:

nextflow run ../../nf-core-scrnaseq/main.nf \
  -resume \
  -w /data/scratch/sturm/projects/2021/grabherr-scrnaseq-mouse \
  -profile icbi_long,singularity \
  -c scrnaseq.config

where this is the scrnaseq.config file

params {
  input = "../tables/samplesheet_scrnaseq.csv"
  aligner = "cellranger"
  outdir = "/data/projects/2021/Grabherr-scRNAseq-mouse/10_nfcore_scrnaseq/cellranger"
  cellranger_index = "/data/databases/CellRanger/refdata-gex-mm10-2020-A"
  protocol = "10XV3"
}

process {
  withName:CELLRANGER_COUNT {
    ext.args = "--expect-cells 8000"
    scratch = "/local/scratch/sturm"
    cpus = 22
    memory = 400.GB
  }
}

and this is the samplesheet:

sample,fastq_1,fastq_2,group,internal_id,sequencing_sample,mouse_id,sex,mouse_type,batch
FG-1,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-1/FG-1_S1_R1_001.fastq.gz,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-1/FG-1_S1_R2_001.fastq.gz,KO,FG-1,S1,TADO_830,male,B6,1
FG-2,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-2/FG-2_S2_R1_001.fastq.gz,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-2/FG-2_S2_R2_001.fastq.gz,WT,FG-2,S2,TADO_831,male,B6,1
FG-3,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-3/FG-3_S3_R1_001.fastq.gz,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-3/FG-3_S3_R2_001.fastq.gz,KO,FG-3,S3,TADO_837,male,B6,2
FG-4,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-4/FG-4_S4_R1_001.fastq.gz,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-4/FG-4_S4_R2_001.fastq.gz,WT,FG-4,S4,TADO_838,male,B6,2
FG-5,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-5/FG-5_S5_R1_001.fastq.gz,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-5/FG-5_S5_R2_001.fastq.gz,KO,FG-5,S5,TADO_846,female,B6,3
FG-6,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-6/FG-6_S6_R1_001.fastq.gz,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-6/FG-6_S6_R2_001.fastq.gz,WT,FG-6,S6,TADO_847,female,B6,3
FG-7,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-7/FG-7_S7_R1_001.fastq.gz,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-7/FG-7_S7_R2_001.fastq.gz,KO,FG-7,S7,TADO_839,female,B6,4
FG-8,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-8/FG-8_S8_R1_001.fastq.gz,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-8/FG-8_S8_R2_001.fastq.gz,WT,FG-8,S8,TADO_840,female,B6,4


grst commented May 19, 2022

The test profile can be run with

nextflow run main.nf -profile test,singularity --aligner cellranger

LMK if anything is unclear.

However, it currently fails. This is because we updated the samplesheet in the test data, but didn't adapt the pipeline accordingly.


fmalmeida commented May 20, 2022

The example you used gives a gist of what is happening, but since I don't have access to the data I will have to stick with the test dataset to try it ...

So I tested and saw the error. It currently fails because it says there are duplicate rows. How do you believe this should be solved? Does this samplesheet also raise a problem in the main branch?

When I remove the duplicated line and give the input manually, it passes the check and enters the workflow.

Maybe update the module to concatenate it and avoid flagging the issue? Or don't use the default script for cellranger? Or create a different samplesheet? The error is actually raised by check_samplesheet.py.

Thoughts?

@apeltzer

It's fine to rely on the testing data for testing this now - for now there's no need for a larger test dataset, and we can certainly provide one as soon as the small-scale test runs through.

I would not develop this in a way to be compatible with the samplesheet in the main branch, but rather already rely on the updated version of the samplesheet that you were trying to work on. The module itself should be fine I believe, as we want to rely on nf-core/modules wherever possible.

Updating the check_samplesheet.py would be the way to go I think, to make sure it does the checking correctly and accepts a Samplesheet like the one here in testing. That was never done, which is why this likely fails at the moment.


grst commented May 20, 2022

The problem with the samplesheet is that it indeed contains duplicate lines.

The test case we wanted to generate is

  • multiple samples (actually, we had a bug in a previous version of the pipeline where it wouldn't run with multiple samples) and
  • technical replicates per sample

So for a quick test, you can just remove those lines and it should run through, but eventually it would be really good to have those two test cases. Either by disabling that samplesheet check or by finding more suitable test files / duplicating and renaming them / or whatever works.


fmalmeida commented May 20, 2022

Hi guys,

Update

So I was able to successfully run the pipeline using the test dataset, but with some considerations that I'd like to ask you how to handle.

nextflow run ~/Documents/GitHub/scrnaseq/main.nf -profile docker -c test.config --aligner cellranger

Considerations/adaptations

For it to run I actually had to:

  1. Remove the duplicate row in a local samplesheet
  2. Rename the fastqs to match what is expected by cellranger
    • From S10_L001_R{1,2}_001.fastq.gz to Sample_{X,Y}_S1_L001_R{1,2}_001.fastq.gz

Meaning the example samplesheet would look like:

sample,fastq_1,fastq_2,protocol,expected_cells
Sample_X,Sample_X_S1_L001_R1_001.fastq.gz,Sample_X_S1_L001_R2_001.fastq.gz,"10xV2","5000" 
Sample_Y,Sample_Y_S1_L001_R1_001.fastq.gz,Sample_Y_S1_L001_R2_001.fastq.gz,"10xV2","5000"

Questions

  • Should we change check_samplesheet.py or the module itself so that, when using cellranger, it checks whether the user's fastqs match what the tool expects, so users know exactly what is expected and can change it if necessary?
  • About the test dataset: maybe we should provide new "copies" of the samplesheet and fastq files so they have the proper naming convention and can be used with -profile test.
  • First make a release running without replicates and then think about the best strategy for them? Or already address them in this release?

I will try to edit check_samplesheet.py so it does not complain about it, and then run it to see what the output looks like.

@apeltzer

About the test dataset: maybe we should provide new "copies" of the samplesheet and fastq files so they have the proper naming convention and can be used with -profile test.

I tend to prefer a custom way of making any input compatible with CellRanger - every tool accepts this differently, and I'd rather see us use the very same approach for all tools (independent of the actual tool used) than create special cases for each of the available tools ;-)

First make a release running without replicates and then thinking about the best strategy for them? Or already think of it to this release?

I would say make a release without replicates for now (resolving the other things that need to be done) and later come back to replicates and resolve this ;-)


grst commented May 20, 2022

Agree it would be nice to have the pipeline rename the files such that they just work with cellranger.

But it's fair to start with just checking to get a working version quickly and then iterate.

handling existence of replicates
@fmalmeida

So let's go little by little to make sure we agree on everything.

Changing check_samplesheet.py to make sure it handles the replicates

## Create sample mapping dictionary = { sample: [ single_end, fastq_1, fastq_2 ] }
if sample not in sample_mapping_dict:
    replicate = 1  ## added this
    sample_mapping_dict[sample] = [sample_info]
else:
    if sample_info in sample_mapping_dict[sample]:
        replicate += 1  ## added this
        sample_mapping_dict[f"{sample}_rep{replicate}"] = [sample_info]   ## added this
    else:
        replicate = 1  ## added this
        sample_mapping_dict[sample] = [sample_info]

So now it does not complain about duplicates, since they are treated as replicates and gain an additional suffix; users without replicates are unaffected and their sample names stay unchanged.

Thoughts?

Once this is solved, I'll move on to the file-name problem.
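A standalone, runnable sketch of the proposed logic above (hypothetical rewrite, with a per-sample counter made explicit) shows how a duplicated row becomes a "_repN" entry instead of an error:

```python
def map_samples(rows):
    """Mimic the proposed check: rows are (sample, *info) tuples.

    A row identical to an existing one for the same sample becomes a
    'sample_repN' entry instead of raising a duplicate-row error.
    """
    sample_mapping_dict = {}
    replicate = {}  # per-sample replicate counter
    for sample, *sample_info in rows:
        if sample not in sample_mapping_dict:
            replicate[sample] = 1
            sample_mapping_dict[sample] = [sample_info]
        elif sample_info in sample_mapping_dict[sample]:
            replicate[sample] += 1
            sample_mapping_dict[f"{sample}_rep{replicate[sample]}"] = [sample_info]
        else:
            replicate[sample] = 1
            sample_mapping_dict[sample] = [sample_info]
    return sample_mapping_dict
```

Two identical Sample_X rows would thus yield the keys Sample_X and Sample_X_rep2 - which, as discussed below, makes each replicate a separate entry rather than merging them.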


grst commented May 23, 2022

I'm not sure, wouldn't that lead to every technical replicate being processed individually, rather than merging them?
I think in principle, the code was alright (putting technical replicates in a list), it just didn't like two identical entries (i.e. the same files, which shouldn't happen in practice, only in our test case).

replicates are appended together

fmalmeida commented May 23, 2022

I'm not sure, wouldn't that lead to every technical replicate being processed individually, rather than merging them? I think in principle, the code was alright (putting technical replicates in a list), it just didn't like two identical entries (i.e. the same files, which shouldn't happen in practice, only in our test case).

Yes, it causes them to be separate entries ... So actually, we would have to have this?

[Sample_X, [read1, read2], [read1, read2]] ?

Changing to this now works because they are appended together as represented above:

## Create sample mapping dictionary = { sample: [ single_end, fastq_1, fastq_2 ] }
if sample not in sample_mapping_dict:
    sample_mapping_dict[sample] = [sample_info]
else:
    sample_mapping_dict[sample].append(sample_info)
[[id:Sample_X, single_end:false, gem:Sample_X, samples:[Sample_X]], [/nf-core/test-datasets/scrnaseq/testdata/S10_L001_R1_001.fastq.gz, /nf-core/test-datasets/scrnaseq/testdata/S10_L001_R2_001.fastq.gz]]
[[id:Sample_Y, single_end:false, gem:Sample_Y, samples:[Sample_Y]], [/nf-core/test-datasets/scrnaseq/testdata/S10_L001_R1_001.fastq.gz, /nf-core/test-datasets/scrnaseq/testdata/S10_L001_R2_001.fastq.gz, /nf-core/test-datasets/scrnaseq/testdata/S10_L001_R1_001.fastq.gz, /nf-core/test-datasets/scrnaseq/testdata/S10_L001_R2_001.fastq.gz]]

But then, when entering the COUNT module, it fails with the test data because they have the same name ...

Caused by:
  Process `NFCORE_SCRNASEQ:SCRNASEQ:CELLRANGER_ALIGN:CELLRANGER_COUNT` input file name collision -- There are multiple input files for each of the following file names: Sample_Y/S10_L001_R1_001.fastq.gz, Sample_Y/S10_L001_R2_001.fastq.gz

So maybe we will have to work on the renaming now. The identification of replicates in the cellranger syntax is the value on the lane, right?

test_sample1_S1_L001_R1_001.fastq.gz

Would it make sense if we were renaming the first to L001, second to L002, and so on? Any thoughts?

Will the renaming handle duplicate input filenames? Or will it only rename to cellranger syntax? Should we add another column to the samplesheet to give the lane?

The way it is now, it will probably work with your dataset because it will not flag duplicates, but it is not working with -profile test.
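The lane-based renaming idea above can be sketched as a small hypothetical helper (names assumed, not pipeline code): replicate pairs of one sample get lanes L001, L002, ... in samplesheet order, following the [Sample]_S1_L00[Lane]_R[Read]_001.fastq.gz convention.

```python
def cellranger_names(sample, fastq_pairs):
    """Map each (fastq_1, fastq_2) replicate pair of one sample to the
    cellranger naming convention, assigning lanes in samplesheet order."""
    renamed = []
    for lane, (r1, r2) in enumerate(fastq_pairs, start=1):
        # original path -> cellranger-style filename for this lane
        renamed.append((r1, f"{sample}_S1_L{lane:03d}_R1_001.fastq.gz"))
        renamed.append((r2, f"{sample}_S1_L{lane:03d}_R2_001.fastq.gz"))
    return renamed
```

This sidesteps filename collisions from identically named replicate files, but (as discussed below) does not address the multiple-flowcell case where cellranger expects separate input folders.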


grst commented May 23, 2022

The more I think about it: maybe we should just duplicate & rename the test data accordingly. In practice, duplicate rows in the samplesheet would be a mistake anyway, right?

The test data is in the scrnaseq branch of the nf-core/test-datasets repository and you could just adapt it there.

However, in cases where the input data is not named according to the cellranger filename pattern, I agree it would be best to rename them as you described. Here's an example of such a dataset on ArrayExpress.

The only thing with renaming I'm unsure about is the case where one GEM-well has been sequenced across multiple flowcells. In that case, cellranger expects multiple input folders that can be specified as comma-separated list:

If the files are in multiple folders, for instance because one library was sequenced across multiple flow cells, supply a comma-separated list of paths.
(https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/fastq-input)

I'm not sure if it would make a difference if all these files were put in the same folder and just renamed to have more L0XX files. It might, because two identical barcodes from different GEMs would identify different droplets (?).


fmalmeida commented May 23, 2022

The more I think about it: maybe we should just duplicate & rename the test data accordingly. In practice, duplicate rows in the samplesheet would be a mistake anyway, right?

I agree on that. The duplicate rows are indeed an error. We would just be "fixing" the test dataset problem... And actually may be masking future problems.

The test data is in the scrnaseq branch of the nf-core/test-datasets repository and you could just adapt it there.

I'll take a look at it to adapt the testing case.

About the renaming

As I can see, we still have many things to think about and discuss here.

So for now, I will stick to adapting a test dataset that can be used for cellranger in nf-core/test-datasets and then enabling -profile test, so we know it is working when the files are properly named.


grst commented May 23, 2022

👍 sounds good!


fmalmeida commented May 23, 2022

I saw that the changes in check_samplesheet.py broke the tests with the other aligners, so I will try to revert the changes and make the "merging" of reads only happen when using --aligner cellranger.

@apeltzer

Hi @fmalmeida !

I had some time to look into things and noticed a few things that we should probably get rid of / do a bit differently. Other pipelines in nf-core, e.g. atacseq, can also handle technical replicates and do so without needing a groovy for loop - see e.g. https://github.com/nf-core/atacseq/blob/f327c86324427c64716be09c98634ae0bc8165f6/main.nf#L369 for some more details. If possible, I'd rather try to stay very similar to this instead of looping through files, as it's a bit cleaner / easier to read (and thus understand, especially later when we need to come back to this at some point maybe).

Not so much worried about the join usage in the respective modules - that should be fine 👍🏻

Apart from this, docs would be great indeed + fixes for the last CI issues.


grst commented Jun 1, 2022

Thanks for the extensive update!

But when I played with it I saw that it is actually required for the workflow as the "meta.id" from the input check comes with some
"_T[1..]" tags attached to it.

I don't quite get where this would be coming from? As far as I understand the INPUT_CHECK, meta.id would be just the sample column from the samplesheet, which should serve well as a unique identifier for each sample?

def create_fastq_channels(LinkedHashMap row) {
    def meta = [:]
    meta.id = row.sample
    meta.single_end = row.single_end.toBoolean()

For these tools I had to create a small loop inside the modules to separate reverse and forward pairs, as they should be passed as separate strings to the tools. This is the loop:

I wonder if this could be simplified. The INPUT_CHECK returns tuples of reads for paired end samples, so we should already know which read is forward and which reverse.
So instead of flattening the list here

.map { it -> [ it[0], it[1].flatten() ] }

and then again separating it into forward and reverse for starsolo/alevin, I think it should be possible to just keep a list of tuples in the first place.

@fmalmeida

Hi guys,
thanks for the comments.

I don't quite get where this would be coming from? As far as I understand the INPUT_CHECK, meta.id would be just the sample column from the samplesheet, which should serve well as a unique identifier for each sample?

@grst these "_T[1..]" suffixes come from check_samplesheet.py (see the last lines of the code snippet). Maybe I could remove them so we don't need this procedure in the scrnaseq.nf file?

## Write validated samplesheet with appropriate columns
if len(sample_mapping_dict) > 0:
    out_dir = os.path.dirname(file_out)
    make_dir(out_dir)
    with open(file_out, "w") as fout:
        fout.write(",".join(["sample", "single_end", "fastq_1", "fastq_2"]) + "\n")
        for sample in sorted(sample_mapping_dict.keys()):
            ## Check that multiple runs of the same sample are of the same datatype
            if not all(x[0] == sample_mapping_dict[sample][0][0] for x in sample_mapping_dict[sample]):
                print_error("Multiple runs of a sample must be of the same datatype!", "Sample: {}".format(sample))
            for idx, val in enumerate(sample_mapping_dict[sample]):
                fout.write(",".join(["{}_T{}".format(sample, idx + 1)] + val) + "\n")
else:
    print_error("No entries to process!", "Samplesheet: {}".format(file_in))
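Run on a toy mapping (simplified, hypothetical data), the writing loop from that snippet produces exactly those suffixes - every run of a sample is written as sample_T1, sample_T2, and so on:

```python
# Toy illustration of the "_T{idx}" suffixing discussed above
# (hypothetical sample names, standalone code).
sample_mapping_dict = {
    "Sample_X": [
        ["0", "run1_R1.fq", "run1_R2.fq"],
        ["0", "run2_R1.fq", "run2_R2.fq"],
    ],
}
rows = []
for sample in sorted(sample_mapping_dict):
    for idx, val in enumerate(sample_mapping_dict[sample]):
        rows.append(",".join(["{}_T{}".format(sample, idx + 1)] + val))
# rows[0] == "Sample_X_T1,0,run1_R1.fq,run1_R2.fq"
# rows[1] == "Sample_X_T2,0,run2_R1.fq,run2_R2.fq"
```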

So instead of flattening the list here and then again separating it into forward and reverse for starsolo/alevin, I think it should be possible to just keep a list of tuples in the first place.

Here you said that we could keep a channel as:

ch_fastq = [ meta, [ [ array of forward reads ], [ array of reverse reads ] ] ]

If so, I think it would probably make it easier for the two modules where we need the arrays separately, but it would pose a difficulty for the kallistobustools_count.nf module, where we just need the reads in a single string as rep1_read1 rep1_read2 rep2_read1 rep2_read2, no? Is there a quick way to "join" both arrays while taking one from each?

I had some time to look into things and noticed a few things that we should probably get rid of / do a bit differently. Other pipelines in nf-core, e.g. atacseq can also handle technical replicates and do things in a way that they don't need a groovy for loop - see e.g. https://github.com/nf-core/atacseq/blob/f327c86324427c64716be09c98634ae0bc8165f6/main.nf#L369 for some more details. If possible, I'd rather try to stay very similar to this instead of looping through files, as its a bit cleaner / easier to read (and thus understand, especially later when we need to come back to this at some point maybe).

Here @apeltzer I totally agree with both of you that the groovy for loop is not the best way, because it adds complexity. But I don't actually understand how we could do the splitting using the example given. On this one I will still need a little more help brainstorming.

You know of a way where we can, only using maps, convert this:

[ meta, [ all reads ] ]

into this?

[ meta, [ [ forward reads ], [ reverse reads ] ] ]

I believe this is what is still to be decided, right? Whether we keep this "flattened" array and then separate it into forward and reverse to produce the strings inside the modules, or we try to do it the other way around?


grst commented Jun 1, 2022

these "_T[1..]" suffixes com from check_samplesheet.py

I see, thanks! To me, this still seems unnecessary, but I might be missing something. I asked on Slack if someone can explain the motivation:

https://nfcore.slack.com/archives/CE5LG7WMB/p1654096778326729

Here you said that we could keep a channel as
ch_fastq = [ meta, [ [ array of forward reads ], [ array of reverse reads ] ] ]

I think it would rather be

[ meta, [ [ rep1_R1, rep1_R2],  [rep2_R1, rep2_R2], ... ] ]

Either way, using .flatten() or .transpose() (not an expert in groovy list magic myself) should do the job of getting the files in the right order.

@fmalmeida

[ meta, [ [ rep1_R1, rep1_R2],  [rep2_R1, rep2_R2], ... ] ]

I see. But I cannot see how this could be used.

It seems that, structured like this, we would add the need to flatten these tuples inside kallistobustools_count.nf (which has so far been easily solved with reads.join( " " ) because they are flattened), while not solving the problem for the others, where we need to separate forward from reverse as:

-1 x_1.fq,y_1.fq,... -2 x_2.fq,y_2.fq,...


grst commented Jun 2, 2022

[ [ rep1_R1, rep1_R2],  [rep2_R1, rep2_R2], ... ].transpose()

should yield

[[ rep1_R1, rep2_R1, ..., repN_R1 ], [ rep1_R2, rep2_R2, ..., repN_R2]]

which is pretty much what you need, isn't it?

It should then be possible to do

(forward, reverse) = [[ rep1_R1, rep2_R1, ..., repN_R1 ], [ rep1_R2, rep2_R2, ..., repN_R2]]

and

-1 ${ forward.join(',') } -2 ${ reverse.join(',') }
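In Python terms (just to illustrate the list manipulation, not pipeline code), the transpose step above corresponds to `zip(*pairs)`:

```python
# pairs of [repN_R1, repN_R2], as produced per sample (hypothetical names)
pairs = [["rep1_R1.fq", "rep1_R2.fq"], ["rep2_R1.fq", "rep2_R2.fq"]]

# transpose: one list of all forward reads, one of all reverse reads
forward, reverse = map(list, zip(*pairs))

# build the tool's argument string from the transposed lists
args = f"-1 {','.join(forward)} -2 {','.join(reverse)}"
# args == "-1 rep1_R1.fq,rep2_R1.fq -2 rep1_R2.fq,rep2_R2.fq"
```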

@fmalmeida

I get it now. Thank you. I will have a try at it.
😄


grst commented Jun 2, 2022

btw, there was not a lot of follow-up on my question on slack, except for one comment that suggests appending the _T is a leftover from the fetchngs pipeline.

Unless @apeltzer disagrees, I'd therefore get rid of the respective code in the samplesheet_check, as this would simplify both the samplesheet_check.py and the procedure for generating the input channels.


apeltzer commented Jun 2, 2022

I agree with what Gregor proposes above and looking at the above discussion this looks like the best way forward to resolve this efficiently :-) Thanks for taking care of it @fmalmeida 💯

@fmalmeida

Made our changes 😄

  1. check_samplesheet.py no longer emits those "_T" tags. I've literally done nothing more to the script than removing the variable from there. Everything else stays the same.
  2. I've removed that snippet that was parsing the input channel inside the workflow and transferred the appropriate parsing required to inside the INPUT_CHECK subworkflow so things stay together.
    • .groupTuple is required to have replicates of a sample in a single channel, otherwise every replicate would have an isolated channel
    • Due to groupTuple, the channel changes to [ [reads_rep1], [reads_repN] ], which is not what the cellranger and kallisto modules expect. Thus, I've added a .flatten() after groupTuple to keep the expected structure

main:
SAMPLESHEET_CHECK ( samplesheet )
    .csv
    .splitCsv ( header:true, sep:',' )
    .map { create_fastq_channels(it) }
    .groupTuple(by: [0]) // group replicate files together, modifies channel to [ val(meta), [ [reads_rep1], [reads_repN] ] ]
    .map { meta, reads -> [ meta, reads.flatten() ] } // needs to flatten due to last "groupTuple", so we now have reads as a single array as expected by nf-core modules: [ val(meta), [ reads ] ]
    .set { reads }

  3. Finally, after getting it working properly and now knowing more about .collate() and .transpose(), I changed the unreadable for loop to only two operators inside the alevin and star modules:

// separate forward from reverse pairs
def (forward, reverse) = reads.collate(2).transpose()
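As a Python illustration of what collate(2).transpose() does to the flattened reads array (illustrative only, hypothetical filenames):

```python
# flattened reads array, as emitted by INPUT_CHECK after groupTuple + flatten
reads = ["rep1_R1.fq", "rep1_R2.fq", "rep2_R1.fq", "rep2_R2.fq"]

# collate(2): regroup the flat list into [R1, R2] pairs
pairs = [reads[i:i + 2] for i in range(0, len(reads), 2)]

# transpose(): separate all forward reads from all reverse reads
forward, reverse = map(list, zip(*pairs))
# forward == ["rep1_R1.fq", "rep2_R1.fq"]
# reverse == ["rep1_R2.fq", "rep2_R2.fq"]
```

Note this relies on the reads array keeping pairs adjacent and in R1/R2 order, which is why the modules later document the expected ordering.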

@grst grst left a comment

Thanks @fmalmeida!

I added one tiny question in the code, other than that I think we got the pipeline logic right now 🚀

CELLRANGER_COUNT (
// TODO what is `gem` and why is it needed?
// needed to add .flatten() on reads as cellranger count module expects a flattened array
ch_fastq.map{ meta, reads -> [meta + ["gem": meta.id, "samples": [meta.id]], reads.flatten()] },
@grst

Isn't this .flatten() here redundant with the .flatten() in input_check.nf?

If so, I would indeed prefer the .flatten() here (and in the kallisto call) over having it in input_check, to keep the pairing information intact as long as possible.

@fmalmeida fmalmeida Jun 2, 2022

Yes it is redundant! Nicely spotted. I did this one first before changing on the subworkflow, and forgot about it!

😄

@fmalmeida

You prefer having them in the modules? I mean, this flatten is just to revert what happened when applying groupTuple, so it returns again to the normal conventions of nf-core modules.

That's why I thought to put it in input_check, to make sure what comes from there follows the adopted conventions. And also, to have only one .flatten() call, instead of multiple across modules :)

@grst

As long as it works I guess it doesn't matter too much 🤷

My line of thinking was that I prefer having

[ [ rep1_R1, rep1_R2], [ rep2_R1, rep2_R2], [ ... ] ]

over

[ rep1_R1, rep1_R2, rep2_R1, rep2_R2, ... ]

because in the former case, the pairing information is explicit.

But I get your point with having only a single reads argument for the modules. Let's keep it that way!

@grst

Maybe add a comment in the module itself or meta.yml in what order multiple replicates are expected for the reads argument?

@grst grst Jun 2, 2022

so users always put their replicates properly ordered when creating the samplesheet, right?

you mean, always putting the corresponding pair of fastqs in one row? I would hope so!
If the files follow some naming convention, we could technically try to validate it, but if not there's not a lot we can do about it anyway...

For now, I just removed the redundancy spotted. If we actually decide on going the other way, and having it on the modules, it does not seem like a complex change to make.

👌 perfect

And about the comment, would we have to make these comments directly on our branch?

I only meant the local modules for now. The linter shouldn't care about them.

@fmalmeida fmalmeida Jun 7, 2022

I was thinking of adding a comment like this in the modules:

//
// Input reads are expected to come as: [ meta, [ pair1_read1, pair1_read2, pair2_read1, pair2_read2 ] ]
// The input array for a sample is created in the same order reads appear in the samplesheet, as pairs from replicates are appended to the array.
//

If so I can already add it to the relevant modules.
Then we can later discuss what's still missing for the PR 😄

@grst

LGTM!

@fmalmeida

Added to our relevant local files:

  • star_align.nf
  • kallistobustools_count.nf
  • salmon_alevin.nf

@apeltzer

Yeah, I think we can still think about making these official, e.g. moving them where necessary to nf-core/modules and/or adding their functionality to nf-core/modules, and then rely on those in the future 👍🏻


apeltzer commented Jun 7, 2022

As discussed with @grst @fmalmeida and @alexblaessle we will merge this to dev and then work on the template PR #101 before making a new release.


Successfully merging this pull request may close these issues.

Add CellRanger to nf-core/scrnaseq