
Integrate cellranger workflow #90

Merged
merged 34 commits into from
Jun 7, 2022
Conversation


@grst grst commented Jan 26, 2022

Integrate the cellranger workflow into the main pipeline.

(it was already implemented to some degree, but not exposed via the parameters)

Closes #31

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
    • If you've added a new tool - add to the software_versions process and a regex to scrape_software_versions.py
    • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
    • If necessary, also make a PR on the nf-core/scrnaseq branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint .).
  • Ensure the test suite passes (nextflow run . -profile test,docker).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@grst grst linked an issue Mar 7, 2022 that may be closed by this pull request
@alexblaessle

Hi @grst, do you have an estimate of how long this will take? Are there ways I can help out with this?

@grst grst changed the base branch from dev to master March 28, 2022 09:48
@grst grst changed the base branch from master to dev March 28, 2022 09:48

grst commented Mar 28, 2022

Hi @alexblaessle,

I had to switch attention to another project. The earliest time I could possibly continue working on this is May, but I don't want to make any promises.

In principle, this works already (I applied it to one of my datasets). The part that's missing (apart from writing tests) is to support a universal samplesheet (#92), handling cases where filenames don't match cellranger defaults or multiple samples are grouped into a "gem".

If you have time to move this PR forward, that would be much appreciated!

I'm wondering if we should just merge this (@apeltzer ?), and continue implementing the samplesheet standard in a follow-up PR. I was just hesitant so far because I thought it might speed up the first DSL2 release if we didn't have to deal with cellranger already.

@alexblaessle

@grst @apeltzer As far as I understand, making cellranger work from a universal sample sheet would require a file-structure check and, if that fails, at least temporarily re-organising the files, right?

As soon as I will have developer support from my side, would it make sense to have a quick call to completely understand what needs to be done?


grst commented Mar 29, 2022

As far as I understand, to make cellranger work from a universal sample sheet this would require a file structure check and if it fails at least temporarily re-organising files, right?

I was wondering if it would also work to just concatenate the fastq files manually before feeding them into cellranger. My gut feeling is that this is exactly what cellranger does anyway, but it would be good to either try it out or have it confirmed by someone who knows cellranger's inner workings.
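As a side note, the manual-concatenation idea can be sketched in a few lines (hypothetical helper, not part of this PR): gzip files concatenate byte-for-byte into one valid multi-member stream, so merging technical-replicate fastq.gz files is a plain byte-level append.

```python
import shutil

def concat_fastq_gz(parts, out_path):
    """Concatenate technical-replicate fastq.gz files by byte-level append.

    Valid because a gzip file may contain multiple members: readers
    decompress the concatenation as one continuous stream.
    """
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as f:
                shutil.copyfileobj(f, out)
```

This mirrors what `cat a.fastq.gz b.fastq.gz > merged.fastq.gz` would do on the command line.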

As soon as I will have developer support from my side, would it make sense to have a quick call to completely understand what needs to be done?

Sure! Maybe also involving @ggabernet. She has some thoughts on cellranger multi.


apeltzer commented Apr 2, 2022

Depending on when I'm back @alexblaessle (April 25th), I can also potentially join in a small call before that. Just to be in the loop.


grst commented May 19, 2022

Following up on today's call, here are the commands I used to apply the pipeline version from this PR to my dataset:

nextflow run ../../nf-core-scrnaseq/main.nf \
  -resume \
  -w /data/scratch/sturm/projects/2021/grabherr-scrnaseq-mouse \
  -profile icbi_long,singularity \
  -c scrnaseq.config

where this is the scrnaseq.config file

params {
  input = "../tables/samplesheet_scrnaseq.csv"
  aligner = "cellranger"
  outdir = "/data/projects/2021/Grabherr-scRNAseq-mouse/10_nfcore_scrnaseq/cellranger"
  cellranger_index = "/data/databases/CellRanger/refdata-gex-mm10-2020-A"
  protocol = "10XV3"
}

process {
  withName:CELLRANGER_COUNT {
    ext.args = "--expect-cells 8000"
    scratch = "/local/scratch/sturm"
    cpus = 22
    memory = 400.GB
  }
}

and this is the samplesheet:

sample,fastq_1,fastq_2,group,internal_id,sequencing_sample,mouse_id,sex,mouse_type,batch
FG-1,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-1/FG-1_S1_R1_001.fastq.gz,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-1/FG-1_S1_R2_001.fastq.gz,KO,FG-1,S1,TADO_830,male,B6,1
FG-2,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-2/FG-2_S2_R1_001.fastq.gz,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-2/FG-2_S2_R2_001.fastq.gz,WT,FG-2,S2,TADO_831,male,B6,1
FG-3,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-3/FG-3_S3_R1_001.fastq.gz,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-3/FG-3_S3_R2_001.fastq.gz,KO,FG-3,S3,TADO_837,male,B6,2
FG-4,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-4/FG-4_S4_R1_001.fastq.gz,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-4/FG-4_S4_R2_001.fastq.gz,WT,FG-4,S4,TADO_838,male,B6,2
FG-5,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-5/FG-5_S5_R1_001.fastq.gz,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-5/FG-5_S5_R2_001.fastq.gz,KO,FG-5,S5,TADO_846,female,B6,3
FG-6,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-6/FG-6_S6_R1_001.fastq.gz,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-6/FG-6_S6_R2_001.fastq.gz,WT,FG-6,S6,TADO_847,female,B6,3
FG-7,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-7/FG-7_S7_R1_001.fastq.gz,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-7/FG-7_S7_R2_001.fastq.gz,KO,FG-7,S7,TADO_839,female,B6,4
FG-8,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-8/FG-8_S8_R1_001.fastq.gz,/data/projects/2021/Grabherr-scRNAseq-mouse/00_fastq/AK16_FG-8/FG-8_S8_R2_001.fastq.gz,WT,FG-8,S8,TADO_840,female,B6,4


grst commented May 19, 2022

The test profile can be run with

nextflow run main.nf -profile test,singularity --aligner cellranger

LMK if anything is unclear.

However, it currently fails. This is because we updated the samplesheet in the test data, but didn't adapt the pipeline accordingly.


fmalmeida commented May 20, 2022

The example you used gives a gist of what is happening, but since I don't have access to the data I will have to stick with the test dataset to try it ...

So I tested and saw the error. It currently fails because it says there are duplicate rows. How do you believe this should be solved? Does this samplesheet also raise a problem in the main branch?

When I remove the duplicated line and give the input manually, it passes the check and enters the workflow.

Maybe update the module to concatenate it and avoid flagging the issue? Or don't use the default script for cellranger? Or create a different samplesheet? The error is actually raised by check_samplesheet.py.

Thoughts?

@apeltzer

It's fine to rely on the testing data for testing this now - for now there's no need for a larger test dataset, and we can certainly provide one as soon as the small-scale test runs through.

I would not develop this in a way to be compatible with the samplesheet in the main branch, but rather already rely on the updated version of the samplesheet that you were trying to work on. The module itself should be fine I believe, as we want to rely on nf-core/modules wherever possible.

Updating the check_samplesheet.py would be the way to go I think, to make sure it does the checking correctly and accepts a Samplesheet like the one here in testing. That was never done, which is why this likely fails at the moment.


grst commented May 20, 2022

The problem with the samplesheet is that it indeed contains duplicate lines.

The test case we wanted to generate is

  • multiple samples (actually, we had a bug in a previous version of the pipeline where it wouldn't run with multiple samples) and
  • technical replicates per sample

So for a quick test, you can just remove those lines and it should run through, but eventually it would be really good to have those two test cases. Either by disabling that samplesheet check or by finding more suitable test files / duplicating and renaming them / or whatever works.


fmalmeida commented May 20, 2022

Hi guys,

Update

So I was able to successfully run the pipeline using the test dataset, but with some considerations that I'd like to ask you how to handle.

nextflow run ~/Documents/GitHub/scrnaseq/main.nf -profile docker -c test.config --aligner cellranger

Considerations/adaptations

For it to run I actually had to:

  1. Remove the duplicate row in a local samplesheet
  2. Rename the fastqs to match what is expected by cellranger
    • From S10_L001_R{1,2}_001.fastq.gz to Sample_{X,Y}_S1_L001_R{1,2}_001.fastq.gz

Meaning the example samplesheet would look like:

sample,fastq_1,fastq_2,protocol,expected_cells
Sample_X,Sample_X_S1_L001_R1_001.fastq.gz,Sample_X_S1_L001_R2_001.fastq.gz,"10xV2","5000" 
Sample_Y,Sample_Y_S1_L001_R1_001.fastq.gz,Sample_Y_S1_L001_R2_001.fastq.gz,"10xV2","5000"

Questions

  • Should we change check_samplesheet.py or the module itself so that, when using cellranger, it checks whether the user's fastqs match what the tool expects, so users know exactly what is expected and can change it if necessary?
  • About the test dataset: maybe we should provide new "copies" of the samplesheet and fastq files so they have the proper naming convention and can be used with -profile test.
  • First make a release running without replicates and then think about the best strategy for them? Or already address them in this release?

I will try to edit check_samplesheet.py so it does not complain about it, and then run it to see what the output looks like.

@apeltzer

About the test dataset: maybe we should provide new "copies" of the samplesheet and fastq files so they have the proper naming convention and can be used with -profile test.

I tend to prefer a custom way of making any input compatible with CellRanger - every tool accepts this differently, and I'd rather see us use the very same approach for all tools (independent of the actual tool used) than create special cases for each of the available tools ;-)

First make a release running without replicates and then thinking about the best strategy for them? Or already think of it to this release?

I would say make a release without replicates for now (resolving the other things that need to be done) and later come back to replicates and resolve this ;-)


grst commented May 20, 2022

Agree it would be nice to have the pipeline rename the files such that they just work with cellranger.

But it's fair to start with just checking to get a working version quickly and then iterate.

handling existence of replicates
@fmalmeida

So let's go little by little to make sure we agree on everything.

Changing check_samplesheet.py to make sure it handles the replicates

## Create sample mapping dictionary = { sample: [ single_end, fastq_1, fastq_2 ] }
if sample not in sample_mapping_dict:
    replicate = 1  ## added this
    sample_mapping_dict[sample] = [sample_info]
else:
    if sample_info in sample_mapping_dict[sample]:
        replicate += 1  ## added this
        sample_mapping_dict[f"{sample}_rep{replicate}"] = [sample_info]   ## added this
    else:
        replicate = 1  ## added this
        sample_mapping_dict[sample] = [sample_info]

So now it does not complain about duplicates, since they are treated as replicates and gain an additional suffix; users without replicates are unaffected and their sample names stay unchanged.

Thoughts?

Once this is solved, I'll move on to the file-name problem.
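A standalone, runnable sketch of the proposed logic above (hypothetical rewrite, with a per-sample counter made explicit) shows how a duplicated row becomes a "_repN" entry instead of an error:

```python
def map_samples(rows):
    """Mimic the proposed check: rows are (sample, *info) tuples.

    A row identical to an existing one for the same sample becomes a
    'sample_repN' entry instead of raising a duplicate-row error.
    """
    sample_mapping_dict = {}
    replicate = {}  # per-sample replicate counter
    for sample, *sample_info in rows:
        if sample not in sample_mapping_dict:
            replicate[sample] = 1
            sample_mapping_dict[sample] = [sample_info]
        elif sample_info in sample_mapping_dict[sample]:
            replicate[sample] += 1
            sample_mapping_dict[f"{sample}_rep{replicate[sample]}"] = [sample_info]
        else:
            replicate[sample] = 1
            sample_mapping_dict[sample] = [sample_info]
    return sample_mapping_dict
```

Two identical Sample_X rows would thus yield the keys Sample_X and Sample_X_rep2 - which, as discussed below, makes each replicate a separate entry rather than merging them.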


grst commented May 23, 2022

I'm not sure, wouldn't that lead to every technical replicate being processed individually, rather than merging them?
I think in principle, the code was alright (putting technical replicates in a list), it just didn't like two identical entries (i.e. the same files, which shouldn't happen in practice, only in our test case).

replicates are appended together

fmalmeida commented May 23, 2022

I'm not sure, wouldn't that lead to every technical replicate being processed individually, rather than merging them? I think in principle, the code was alright (putting technical replicates in a list), it just didn't like two identical entries (i.e. the same files, which shouldn't happen in practice, only in our test case).

Yes, it causes them to be separate entries ... So actually, we would have to have this?

[Sample_X, [read1, read2], [read1, read2]] ?

Changing to this now works because they are appended together as represented above:

## Create sample mapping dictionary = { sample: [ single_end, fastq_1, fastq_2 ] }
if sample not in sample_mapping_dict:
    sample_mapping_dict[sample] = [sample_info]
else:
    sample_mapping_dict[sample].append(sample_info)
[[id:Sample_X, single_end:false, gem:Sample_X, samples:[Sample_X]], [/nf-core/test-datasets/scrnaseq/testdata/S10_L001_R1_001.fastq.gz, /nf-core/test-datasets/scrnaseq/testdata/S10_L001_R2_001.fastq.gz]]
[[id:Sample_Y, single_end:false, gem:Sample_Y, samples:[Sample_Y]], [/nf-core/test-datasets/scrnaseq/testdata/S10_L001_R1_001.fastq.gz, /nf-core/test-datasets/scrnaseq/testdata/S10_L001_R2_001.fastq.gz, /nf-core/test-datasets/scrnaseq/testdata/S10_L001_R1_001.fastq.gz, /nf-core/test-datasets/scrnaseq/testdata/S10_L001_R2_001.fastq.gz]]

But then, when entering the COUNT module, it fails with the test data because they have the same name ...

Caused by:
  Process `NFCORE_SCRNASEQ:SCRNASEQ:CELLRANGER_ALIGN:CELLRANGER_COUNT` input file name collision -- There are multiple input files for each of the following file names: Sample_Y/S10_L001_R1_001.fastq.gz, Sample_Y/S10_L001_R2_001.fastq.gz

So maybe we will have to work on the renaming now. The identification of replicates in the cellranger syntax is the value on the lane, right?

test_sample1_S1_L001_R1_001.fastq.gz

Would it make sense if we were renaming the first to L001, second to L002, and so on? Any thoughts?

Will the renaming handle duplicate input filenames? Or will it only rename to cellranger syntax? Should we add another column to the samplesheet to give the lane?

The way it is now, it will probably work with your dataset because it will not flag duplicates, but it is not working with -profile test.
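The lane-based renaming idea above can be sketched as a small hypothetical helper (names assumed, not pipeline code): replicate pairs of one sample get lanes L001, L002, ... in samplesheet order, following the [Sample]_S1_L00[Lane]_R[Read]_001.fastq.gz convention.

```python
def cellranger_names(sample, fastq_pairs):
    """Map each (fastq_1, fastq_2) replicate pair of one sample to the
    cellranger naming convention, assigning lanes in samplesheet order."""
    renamed = []
    for lane, (r1, r2) in enumerate(fastq_pairs, start=1):
        # original path -> cellranger-style filename for this lane
        renamed.append((r1, f"{sample}_S1_L{lane:03d}_R1_001.fastq.gz"))
        renamed.append((r2, f"{sample}_S1_L{lane:03d}_R2_001.fastq.gz"))
    return renamed
```

This sidesteps filename collisions from identically named replicate files, but (as discussed below) does not address the multiple-flowcell case where cellranger expects separate input folders.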


grst commented May 23, 2022

The more I think about it: maybe we should just duplicate & rename the test data accordingly. In practice, duplicate rows in the samplesheet would be a mistake anyway, right?

The test data is in the scrnaseq branch of the nf-core/test-datasets repository and you could just adapt it there.

However, in cases where the input data is not named according to the cellranger filename pattern, I agree it would be best to rename them as you described. Here's an example of such a dataset on ArrayExpress.

The only thing with renaming I'm unsure about is the case where one GEM-well has been sequenced across multiple flowcells. In that case, cellranger expects multiple input folders that can be specified as comma-separated list:

If the files are in multiple folders, for instance because one library was sequenced across multiple flow cells, supply a comma-separated list of paths.
(https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/fastq-input)

I'm not sure if it would make a difference if all these files were put in the same folder and just renamed to have more L0XX files. It might, because two identical barcodes from different GEMs would identify different droplets (?).


fmalmeida commented May 23, 2022

The more I think about it: maybe we should just duplicate & rename the test data accordingly. In practice, duplicate rows in the samplesheet would be a mistake anyway, right?

I agree on that. The duplicate rows are indeed an error. We would just be "fixing" the test dataset problem... And actually may be masking future problems.

The test data is in the scrnaseq branch of the nf-core/test-datasets repository and you could just adapt it there.

I'll take a look at it to adapt the testing case.

About the renaming

As I can see, we still have many things to think about and discuss here.

So for now, I will stick to adapting a test dataset that can be used for cellranger in nf-core/test-datasets and then enabling -profile test, so we know it is working when the files are properly named.


grst commented May 23, 2022

👍 sounds good!


fmalmeida commented May 23, 2022

I saw that the changes in check_samplesheet.py broke the tests with the other aligners, so I will try to revert the changes and make the "merging" of reads only happen when using --aligner cellranger.

@apeltzer

Hi @fmalmeida !

I had some time to look into things and noticed a few things that we should probably get rid of / do a bit differently. Other pipelines in nf-core, e.g. atacseq, can also handle technical replicates and do so without needing a groovy for loop - see e.g. https://github.com/nf-core/atacseq/blob/f327c86324427c64716be09c98634ae0bc8165f6/main.nf#L369 for some more details. If possible, I'd rather try to stay very similar to this instead of looping through files, as it's a bit cleaner / easier to read (and thus understand, especially later when we need to come back to this at some point maybe).

Not so much worried about the join usage in the respective modules - that should be fine 👍🏻

Apart from this, docs would be great indeed + fixes for the last CI issues.


grst commented Jun 1, 2022

Thanks for the extensive update!

But when I played with it I saw that it is actually required for the workflow as the "meta.id" from the input check comes with some
"_T[1..]" tags attached to it.

I don't quite get where this would be coming from? As far as I understand the INPUT_CHECK, meta.id would be just the sample column from the samplesheet, which should serve well as a unique identifier for each sample?

def create_fastq_channels(LinkedHashMap row) {
    def meta = [:]
    meta.id = row.sample
    meta.single_end = row.single_end.toBoolean()

For these tools I had to create a small loop inside the modules to separate reverse and forward pairs, as they should be passed as separate strings to the tools. This is the loop:

I wonder if this could be simplified. The INPUT_CHECK returns tuples of reads for paired end samples, so we should already know which read is forward and which reverse.
So instead of flattening the list here

.map { it -> [ it[0], it[1].flatten() ] }

and then again separating it into forward and reverse for starsolo/alevin, I think it should be possible to just keep a list of tuples in the first place.

@fmalmeida

Hi guys,
thanks for the comments.

I don't quite get where this would be coming from? As far as I understand the INPUT_CHECK, meta.id would be just the sample column from the samplesheet, which should serve well as a unique identifier for each sample?

@grst these "_T[1..]" suffixes come from check_samplesheet.py (see the last lines of the code snippet). Maybe I could remove them so we don't need this procedure in the scrnaseq.nf file?

## Write validated samplesheet with appropriate columns
if len(sample_mapping_dict) > 0:
    out_dir = os.path.dirname(file_out)
    make_dir(out_dir)
    with open(file_out, "w") as fout:
        fout.write(",".join(["sample", "single_end", "fastq_1", "fastq_2"]) + "\n")
        for sample in sorted(sample_mapping_dict.keys()):
            ## Check that multiple runs of the same sample are of the same datatype
            if not all(x[0] == sample_mapping_dict[sample][0][0] for x in sample_mapping_dict[sample]):
                print_error("Multiple runs of a sample must be of the same datatype!", "Sample: {}".format(sample))
            for idx, val in enumerate(sample_mapping_dict[sample]):
                fout.write(",".join(["{}_T{}".format(sample, idx + 1)] + val) + "\n")
else:
    print_error("No entries to process!", "Samplesheet: {}".format(file_in))
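Run on a toy mapping (simplified, hypothetical data), the writing loop from that snippet produces exactly those suffixes - every run of a sample is written as sample_T1, sample_T2, and so on:

```python
# Toy illustration of the "_T{idx}" suffixing discussed above
# (hypothetical sample names, standalone code).
sample_mapping_dict = {
    "Sample_X": [
        ["0", "run1_R1.fq", "run1_R2.fq"],
        ["0", "run2_R1.fq", "run2_R2.fq"],
    ],
}
rows = []
for sample in sorted(sample_mapping_dict):
    for idx, val in enumerate(sample_mapping_dict[sample]):
        rows.append(",".join(["{}_T{}".format(sample, idx + 1)] + val))
# rows[0] == "Sample_X_T1,0,run1_R1.fq,run1_R2.fq"
# rows[1] == "Sample_X_T2,0,run2_R1.fq,run2_R2.fq"
```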

So instead of flattening the list here and then again separating it into forward and reverse for starsolo/alevin, I think it should be possible to just keep a list of tuples in the first place.

Here you said that we could keep a channel as:

ch_fastq = [ meta, [ [ array of forward reads ], [ array of reverse reads ] ] ]

If so, I think it would probably make it easier for the two modules where we need the arrays separately, but it would pose a difficulty for the kallistobustools_count.nf module, where we just need the reads in a single string as rep1_read1 rep1_read2 rep2_read1 rep2_read2, no? Is there a quick way to "join" both arrays while taking one from each?

I had some time to look into things and noticed a few things that we should probably get rid of / do a bit differently. Other pipelines in nf-core, e.g. atacseq can also handle technical replicates and do things in a way that they don't need a groovy for loop - see e.g. https://github.com/nf-core/atacseq/blob/f327c86324427c64716be09c98634ae0bc8165f6/main.nf#L369 for some more details. If possible, I'd rather try to stay very similar to this instead of looping through files, as its a bit cleaner / easier to read (and thus understand, especially later when we need to come back to this at some point maybe).

Here @apeltzer I totally agree with both of you that the groovy for loop is not the best way, because it adds complexity. But I don't actually understand how we could do the splitting using the example given. On this one I will still need a little more help brainstorming.

You know of a way where we can, only using maps, convert this:

[ meta, [ all reads ] ]

into this?

[ meta, [ [ forward reads ], [ reverse reads ] ] ]

I believe this is what is still to be decided, right? Whether we keep this "flattened" array and then separate it into forward and reverse to produce the strings inside the modules, or we try to do it the other way around?


grst commented Jun 1, 2022

these "_T[1..]" suffixes com from check_samplesheet.py

I see, thanks! To me, this still seems unnecessary, but I might be missing something. I asked on Slack if someone can explain the motivation:

https://nfcore.slack.com/archives/CE5LG7WMB/p1654096778326729

Here you said that we could keep a channel as
ch_fastq = [ meta, [ [ array of forward reads ], [ array of reverse reads ] ] ]

I think it would rather be

[ meta, [ [ rep1_R1, rep1_R2],  [rep2_R1, rep2_R2], ... ] ]

Either way, using .flatten() or .transpose() (not an expert in groovy list magic myself) should do the job of getting the files in the right order.

@fmalmeida

[ meta, [ [ rep1_R1, rep1_R2],  [rep2_R1, rep2_R2], ... ] ]

I see. But I cannot see how this could be used.

It seems that, structured like this, we would add the need to flatten these tuples inside kallistobustools_count.nf (which has so far been easily solved with reads.join( " " ) because they are flattened), while not solving the problem for the others, where we need to separate forward from reverse as:

-1 x_1.fq,y_1.fq,... -2 x_2.fq,y_2.fq,...


grst commented Jun 2, 2022

[ [ rep1_R1, rep1_R2],  [rep2_R1, rep2_R2], ... ].transpose()

should yield

[[ rep1_R1, rep2_R1, ..., repN_R1 ], [ rep1_R2, rep2_R2, ..., repN_R2]]

which is pretty much what you need, isn't it?

It should then be possible to do

(forward, reverse) = [[ rep1_R1, rep2_R1, ..., repN_R1 ], [ rep1_R2, rep2_R2, ..., repN_R2]]

and

-1 ${ forward.join(',') } -2 ${ reverse.join(',') }
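In Python terms (just to illustrate the list manipulation, not pipeline code), the transpose step above corresponds to `zip(*pairs)`:

```python
# pairs of [repN_R1, repN_R2], as produced per sample (hypothetical names)
pairs = [["rep1_R1.fq", "rep1_R2.fq"], ["rep2_R1.fq", "rep2_R2.fq"]]

# transpose: one list of all forward reads, one of all reverse reads
forward, reverse = map(list, zip(*pairs))

# build the tool's argument string from the transposed lists
args = f"-1 {','.join(forward)} -2 {','.join(reverse)}"
# args == "-1 rep1_R1.fq,rep2_R1.fq -2 rep1_R2.fq,rep2_R2.fq"
```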

@fmalmeida

I get it now. Thank you. I will have a try at it.
😄


grst commented Jun 2, 2022

btw, there was not a lot of follow-up on my question on slack, except for one comment that suggests appending the _T is a leftover from the fetchngs pipeline.

Unless @apeltzer disagrees, I'd therefore get rid of the respective code in the samplesheet_check, as this would simplify both the samplesheet_check.py and the procedure for generating the input channels.


apeltzer commented Jun 2, 2022

I agree with what Gregor proposes above and looking at the above discussion this looks like the best way forward to resolve this efficiently :-) Thanks for taking care of it @fmalmeida 💯

@fmalmeida

Made our changes 😄

  1. check_samplesheet.py no longer emits those "_T" tags. I've literally done nothing more to the script than removing the variable from there. Everything else stays the same.
  2. I've removed that snippet that was parsing the input channel inside the workflow and transferred the appropriate parsing required to inside the INPUT_CHECK subworkflow so things stay together.
    • .groupTuple is required to have replicates of a sample in a single channel, otherwise every replicate would have an isolated channel
    • Due to groupTuple, the channel changes to [ [reads_rep1], [reads_repN] ], which is not what the cellranger and kallisto modules expect. Thus, I've added a .flatten() after groupTuple to keep the expected structure

main:
SAMPLESHEET_CHECK ( samplesheet )
    .csv
    .splitCsv ( header:true, sep:',' )
    .map { create_fastq_channels(it) }
    .groupTuple(by: [0]) // group replicate files together, modifies channel to [ val(meta), [ [reads_rep1], [reads_repN] ] ]
    .map { meta, reads -> [ meta, reads.flatten() ] } // needs to flatten due to last "groupTuple", so we now have reads as a single array as expected by nf-core modules: [ val(meta), [ reads ] ]
    .set { reads }

  3. Finally, after getting it working properly and now knowing more about .collate() and .transpose(), I changed the unreadable for loop to only two operators inside the alevin and star modules:

// separate forward from reverse pairs
def (forward, reverse) = reads.collate(2).transpose()
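As a Python illustration of what collate(2).transpose() does to the flattened reads array (illustrative only, hypothetical filenames):

```python
# flattened reads array, as emitted by INPUT_CHECK after groupTuple + flatten
reads = ["rep1_R1.fq", "rep1_R2.fq", "rep2_R1.fq", "rep2_R2.fq"]

# collate(2): regroup the flat list into [R1, R2] pairs
pairs = [reads[i:i + 2] for i in range(0, len(reads), 2)]

# transpose(): separate all forward reads from all reverse reads
forward, reverse = map(list, zip(*pairs))
# forward == ["rep1_R1.fq", "rep2_R1.fq"]
# reverse == ["rep1_R2.fq", "rep2_R2.fq"]
```

Note this relies on the reads array keeping pairs adjacent and in R1/R2 order, which is why the modules later document the expected ordering.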

@grst grst left a comment

Thanks @fmalmeida!

I added one tiny question in the code, other than that I think we got the pipeline logic right now 🚀

CELLRANGER_COUNT (
// TODO what is `gem` and why is it needed?
// needed to add .flatten() on reads as cellranger count module expects a flattened array
ch_fastq.map{ meta, reads -> [meta + ["gem": meta.id, "samples": [meta.id]], reads.flatten()] },
@grst

Isn't this .flatten() here redundant with the .flatten() in input_check.nf?

If so, I would indeed prefer the .flatten() here (and in the kallisto call) over having it in input_check, to keep the pairing information intact as long as possible.

@fmalmeida fmalmeida Jun 2, 2022

Yes it is redundant! Nicely spotted. I did this one first before changing on the subworkflow, and forgot about it!

😄

@fmalmeida

You prefer having them in the modules? I mean, this flatten is just to revert what happened when applying groupTuple, so it returns again to the normal conventions of nf-core modules.

That's why I thought to put it in input_check, to make sure what comes from there follows the adopted conventions. And also, to have only one .flatten() call, instead of multiple across modules :)

@grst

As long as it works I guess it doesn't matter too much 🤷

My line of thinking was that I prefer having

[ [ rep1_R1, rep1_R2], [ rep2_R1, rep2_R2], [ ... ] ]

over

[ rep1_R1, rep1_R2, rep2_R1, rep2_R2, ... ]

because in the former case, the pairing information is explicit.

But I get your point with having only a single reads argument for the modules. Let's keep it that way!

@grst

Maybe add a comment in the module itself or meta.yml in what order multiple replicates are expected for the reads argument?

@grst grst Jun 2, 2022

so users always put their replicates properly ordered when creating the samplesheet, right?

you mean, always putting the corresponding pair of fastqs in one row? I would hope so!
If the files follow some naming convention, we could technically try to validate it, but if not there's not a lot we can do about it anyway...

For now, I just removed the redundancy spotted. If we actually decide on going the other way, and having it on the modules, it does not seem like a complex change to make.

👌 perfect

And about the comment, would we have to make these comments directly on our branch?

I only meant the local modules for now. The linter shouldn't care about them.

@fmalmeida fmalmeida Jun 7, 2022

I was thinking of adding a comment like this in the modules:

//
// Input reads are expected to come as: [ meta, [ pair1_read1, pair1_read2, pair2_read1, pair2_read2 ] ]
// The input array for a sample is created in the same order reads appear in the samplesheet, as pairs from replicates are appended to the array.
//

If so I can already add it to the relevant modules.
Then we can later discuss what's still missing for the PR 😄

@grst

LGTM!

@fmalmeida

Added to our relevant local files:

  • star_align.nf
  • kallistobustools_count.nf
  • salmon_alevin.nf

@apeltzer

Yeah, I think we can still think about making these official, e.g. moving them where necessary to nf-core/modules and/or adding their functionality to nf-core/modules, and then rely on those in the future 👍🏻


apeltzer commented Jun 7, 2022

As discussed with @grst @fmalmeida and @alexblaessle we will merge this to dev and then work on the template PR #101 before making a new release.


Successfully merging this pull request may close these issues.

Add CellRanger to nf-core/scrnaseq