Integrate cellranger workflow #90
Conversation
Hi @grst, do you have an estimate of how long this will take? Are there ways I can help out with this? |
Hi @alexblaessle, I had to switch attention to another project. The earliest time I could possibly continue working on this is May, but I don't want to make any promises. In principle, this works already (I applied it to one of my datasets). The part that's missing (apart from writing tests) is to support a universal samplesheet (#92), handling cases where filenames don't match cellranger defaults or multiple samples are grouped into a "gem". If you have time to move this PR forward, that would be much appreciated! I'm wondering if we should just merge this (@apeltzer ?), and continue implementing the samplesheet standard in a follow-up PR. I was just hesitant so far because I thought it might speed up the first DSL2 release if we didn't have to deal with cellranger already. |
@grst @apeltzer As far as I understand, to make cellranger work from a universal sample sheet this would require a file structure check and if it fails at least temporarily re-organising files, right? As soon as I will have developer support from my side, would it make sense to have a quick call to completely understand what needs to be done? |
I was wondering if it would also work to just concatenate fastq files manually before feeding them into cellranger. My gut feeling is that this is exactly what cellranger does anyway, but would be good to either try it out or have it confirmed by someone who knows cellranger's inner gearings.
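The concatenation idea can be checked directly: the gzip format allows multiple members in one file, so byte-level concatenation (what `cat a.fastq.gz b.fastq.gz > merged.fastq.gz` does) yields a valid gzipped FASTQ. A minimal Python sketch of this, with made-up record and file names (it doesn't confirm cellranger's internals, only that the merged file decompresses correctly):

```python
import gzip
import os
import tempfile

# Write two small gzipped "FASTQ" fragments (toy records, hypothetical names)
tmp = tempfile.mkdtemp()
paths = [os.path.join(tmp, n) for n in ("lane1.fastq.gz", "lane2.fastq.gz")]
records = [b"@read1\nACGT\n+\nIIII\n", b"@read2\nTTTT\n+\nIIII\n"]
for path, rec in zip(paths, records):
    with gzip.open(path, "wb") as f:
        f.write(rec)

# Byte-level concatenation, equivalent to `cat lane1.fastq.gz lane2.fastq.gz`
merged = os.path.join(tmp, "merged.fastq.gz")
with open(merged, "wb") as out:
    for path in paths:
        with open(path, "rb") as f:
            out.write(f.read())

# gzip transparently reads all members back-to-back
with gzip.open(merged, "rb") as f:
    merged_data = f.read()
assert merged_data == b"".join(records)
```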
sure! Maybe also involving @ggabernet. She has some thoughts on |
Depending on when I'm back @alexblaessle (April 25th), I can also potentially join in a small call before that. Just to be in the loop. |
Following up on today's call, here are the commands I used to apply the pipeline version from this PR on my dataset:

```bash
nextflow run ../../nf-core-scrnaseq/main.nf \
    -resume \
    -w /data/scratch/sturm/projects/2021/grabherr-scrnaseq-mouse \
    -profile icbi_long,singularity \
    -c scrnaseq.config
```

where this is the `scrnaseq.config`:

```groovy
params {
    input = "../tables/samplesheet_scrnaseq.csv"
    aligner = "cellranger"
    outdir = "/data/projects/2021/Grabherr-scRNAseq-mouse/10_nfcore_scrnaseq/cellranger"
    cellranger_index = "/data/databases/CellRanger/refdata-gex-mm10-2020-A"
    protocol = "10XV3"
}

process {
    withName:CELLRANGER_COUNT {
        ext.args = "--expect-cells 8000"
        scratch = "/local/scratch/sturm"
        cpus = 22
        memory = 400.GB
    }
}
```

and this is the samplesheet:
|
The test profile can be run with `nextflow run main.nf -profile test,singularity --aligner cellranger`. Let me know if anything is unclear. However, it currently fails. This is because we updated the samplesheet in the test data but didn't adapt the pipeline accordingly. |
The example you used gives a gist of what is happening, but since I don't have access to the data I will have to stick with the test dataset to try it ... So I tested it and saw the error. It currently fails because it says there are duplicate rows. How do you believe this should be solved? This samplesheet, in the
Maybe update the module to concatenate them and avoid flagging the issue? Or not use the default script for cellranger? Or create a different samplesheet? The error is actually raised by
Thoughts? |
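For context, the duplicate-row error described above can be illustrated with a small sketch (hedged: this is not the actual `check_samplesheet.py` logic, just an example of why repeated samplesheet lines get flagged):

```python
import csv
import io

# Toy samplesheet with a deliberately repeated row (illustrative only)
samplesheet = """sample,fastq_1,fastq_2
Sample_Y,S10_L001_R1_001.fastq.gz,S10_L001_R2_001.fastq.gz
Sample_Y,S10_L001_R1_001.fastq.gz,S10_L001_R2_001.fastq.gz
"""

reader = csv.reader(io.StringIO(samplesheet))
header = next(reader)
seen, duplicates = set(), []
for row in reader:
    key = tuple(row)
    if key in seen:
        duplicates.append(key)  # a strict check would raise an error here
    seen.add(key)

assert len(duplicates) == 1
```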
It's fine to rely on the testing data for testing this now - no need for a larger test dataset for now, and we can certainly provide one as soon as the small-scale test runs through. I would not develop this to be compatible with the samplesheet in the main branch, but rather already rely on the updated version of the samplesheet that you were trying to work on. The module itself should be fine, I believe, as we want to rely on nf-core/modules wherever possible. Updating the |
The problem with the samplesheet is that it indeed contains duplicate lines. The test case we wanted to generate is
So for a quick test, you can just remove those lines and it should run through, but eventually it would be really good to have those two test cases. Either by disabling that samplesheet check or by finding more suitable test files / duplicating and renaming them / or whatever works. |
Hi guys,

Update

So I was able to successfully run the pipeline using the test dataset, but with some considerations that I'd like to ask you how to handle.

Considerations/adaptations

For it to run I actually had to:
Meaning the example samplesheet would look like:

```csv
sample,fastq_1,fastq_2,protocol,expected_cells
Sample_X,Sample_X_S1_L001_R1_001.fastq.gz,Sample_X_S1_L001_R2_001.fastq.gz,"10xV2","5000"
Sample_Y,Sample_Y_S1_L001_R1_001.fastq.gz,Sample_Y_S1_L001_R2_001.fastq.gz,"10xV2","5000"
```

Questions
|
I have a tendency to prefer a custom way of making any input compatible with CellRanger - every other tool accepts this differently, and I'd rather see us use the very same approach for all tools (independent of the actual tool used) than create special cases for each of the available tools ;-)
I would say make a release without replicates for now (resolving the other things that need to be done) and later come back to replicates and resolve this ;-) |
Agree it would be nice to have the pipeline rename the files such that they just work with cellranger. But it's fair to start with just checking to get a working version quickly and then iterate. |
handling existence of replicates
So let's go little by little to make sure we have agreement about everything. Changing it to:

```python
## Create sample mapping dictionary = { sample: [ single_end, fastq_1, fastq_2 ] }
if sample not in sample_mapping_dict:
    replicate = 1  ## added this
    sample_mapping_dict[sample] = [sample_info]
else:
    if sample_info in sample_mapping_dict[sample]:
        replicate += 1  ## added this
        sample_mapping_dict[f"{sample}_rep{replicate}"] = [sample_info]  ## added this
    else:
        replicate = 1  ## added this
        sample_mapping_dict[sample] = [sample_info]
```

So now it does not complain about duplicates, since they will be treated as replicates and gain an additional suffix, while also allowing users who do not have replicates to keep their sample names unaffected. Thoughts?
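A self-contained sketch of that idea, with the replicate counter tracked per sample rather than as a single variable (simplified, not the actual `check_samplesheet.py` code; function and sample names are illustrative):

```python
sample_mapping_dict = {}
replicate_count = {}  # per-sample counter instead of a lone `replicate` variable

def add_row(sample, sample_info):
    if sample not in sample_mapping_dict:
        replicate_count[sample] = 1
        sample_mapping_dict[sample] = [sample_info]
    elif sample_info in sample_mapping_dict[sample]:
        # duplicate row: treat it as a technical replicate and suffix the key
        replicate_count[sample] += 1
        sample_mapping_dict[f"{sample}_rep{replicate_count[sample]}"] = [sample_info]
    else:
        sample_mapping_dict[sample].append(sample_info)

# Two identical rows -> second one becomes Sample_Y_rep2
add_row("Sample_Y", ["false", "R1.fastq.gz", "R2.fastq.gz"])
add_row("Sample_Y", ["false", "R1.fastq.gz", "R2.fastq.gz"])

assert sorted(sample_mapping_dict) == ["Sample_Y", "Sample_Y_rep2"]
```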
|
I'm not sure, wouldn't that lead to every technical replicate being processed individually, rather than merging them? |
replicates are appended together
Yes, it causes them to be separate entries ... So actually, we would have to have this?
Changing it to this now works because they are appended together, as represented above:

```python
## Create sample mapping dictionary = { sample: [ single_end, fastq_1, fastq_2 ] }
if sample not in sample_mapping_dict:
    sample_mapping_dict[sample] = [sample_info]
else:
    sample_mapping_dict[sample].append(sample_info)
```

```
[[id:Sample_X, single_end:false, gem:Sample_X, samples:[Sample_X]], [/nf-core/test-datasets/scrnaseq/testdata/S10_L001_R1_001.fastq.gz, /nf-core/test-datasets/scrnaseq/testdata/S10_L001_R2_001.fastq.gz]]
[[id:Sample_Y, single_end:false, gem:Sample_Y, samples:[Sample_Y]], [/nf-core/test-datasets/scrnaseq/testdata/S10_L001_R1_001.fastq.gz, /nf-core/test-datasets/scrnaseq/testdata/S10_L001_R2_001.fastq.gz, /nf-core/test-datasets/scrnaseq/testdata/S10_L001_R1_001.fastq.gz, /nf-core/test-datasets/scrnaseq/testdata/S10_L001_R2_001.fastq.gz]]
```

But then, when entering the `CELLRANGER_COUNT` process, it fails:

```
Caused by:
  Process `NFCORE_SCRNASEQ:SCRNASEQ:CELLRANGER_ALIGN:CELLRANGER_COUNT` input file name collision -- There are multiple input files for each of the following file names: Sample_Y/S10_L001_R1_001.fastq.gz, Sample_Y/S10_L001_R2_001.fastq.gz
```

So maybe we will have to work on the renaming now. The identification of replicates in the cellranger syntax is the value of the lane, right?
Would it make sense if we renamed the first to L001, the second to L002, and so on? Any thoughts?
The way it is now, it will probably work with your dataset because it will not flag duplicates, but it is not working with the |
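The renaming idea sketched above (first replicate pair to L001, second to L002, and so on) could look like this. Hedged sketch: the `{sample}_S1_L00{n}_R{read}_001.fastq.gz` pattern follows the bcl2fastq-style naming that cellranger expects, and the input file names below are made up:

```python
def cellranger_name(sample, lane, read):
    # bcl2fastq-style name, e.g. Sample_X_S1_L001_R1_001.fastq.gz
    return f"{sample}_S1_L{lane:03d}_R{read}_001.fastq.gz"

# Two replicate pairs of the same sample (hypothetical input names)
pairs = [("a_R1.fq.gz", "a_R2.fq.gz"), ("b_R1.fq.gz", "b_R2.fq.gz")]

# Map the n-th replicate pair to lane L00n
renamed = {}
for lane, (r1, r2) in enumerate(pairs, start=1):
    renamed[r1] = cellranger_name("Sample_X", lane, 1)
    renamed[r2] = cellranger_name("Sample_X", lane, 2)

assert renamed["b_R1.fq.gz"] == "Sample_X_S1_L002_R1_001.fastq.gz"
```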
The more I think about it: maybe we should just duplicate & rename the test data accordingly. In practice, duplicate rows in the samplesheet would be a mistake anyway, right? The test data is in the
However, in cases where the input data is not named according to the cellranger filename pattern, I agree it would be best to rename it as you described. Here's an example of such a dataset on ArrayExpress. The only thing with renaming I'm unsure about is the case where one GEM-well has been sequenced across multiple flowcells. In that case, cellranger expects multiple input folders that can be specified as a comma-separated list:
I'm not sure if it would make a difference if all these files were put in the same folder and just renamed to have more |
I agree with that. The duplicate rows are indeed an error. We would just be "fixing" the test dataset problem... and may actually be masking future problems.
I'll take a look at adapting the test case.
As I see it, we still have many things to think about and discuss. So for now I will stick to adapting a test dataset that can be used for cellranger in nf-core/test-datasets, and then enabling a |
👍 sounds good! |
saw that the changes in |
nf-core lint wants it deleted
fixed with nf-core lint
fixed with nf-core lint
Hi @fmalmeida ! I had some time to look into things and noticed a few things that we should probably get rid of / do a bit differently. Other pipelines in nf-core, e.g. atacseq can also handle technical replicates and do things in a way that they don't need a groovy Not so much worried about the Apart from this, docs would be great indeed + fixes for the last CI issues. |
Thanks for the extensive update!
I don't quite get where this would be coming from? As far as I understand `scrnaseq/subworkflows/local/input_check.nf` (lines 25 to 28 in c8d04dc)
I wonder if this could be simplified. Given `scrnaseq/workflows/scrnaseq.nf` (line 80 in c8d04dc)
and then again separating it into forward and reverse for starsolo/alevin, I think it should be possible to just keep a list of tuples in the first place. |
Hi guys,
@grst these "_T[1..]" suffixes come from `scrnaseq/bin/check_samplesheet.py` (lines 120 to 135 in 6d4aada)
Here you said that we could keep a channel as `ch_fastq = [ meta, [ [ array of forward reads ], [ array of reverse reads ] ] ]`? If so, I think it would probably make it easier for the two modules where we need the arrays separately, but would pose difficulty to
Here @apeltzer
I totally agree with both of you that the
Do you know of a way where we can, only using maps, convert this:
into this?
I believe this is what is still left to be defined, right? To decide whether we try to keep this "flattened" array and then separate it into forward and reverse to produce strings inside the modules, or whether we try it the other way around, right? |
I see, thanks! To me, this still seems unnecessary, but I might be missing something. I asked on Slack if someone can explain the motivation: https://nfcore.slack.com/archives/CE5LG7WMB/p1654096778326729
I think it would rather be
Either way using |
I see. But I cannot see how this could be used. It seems that making it like this, we would add the need to flatten these tuples inside
|
`[ [ rep1_R1, rep1_R2 ], [ rep2_R1, rep2_R2 ], ... ].transpose()` should yield `[ [ rep1_R1, rep2_R1, ..., repN_R1 ], [ rep1_R2, rep2_R2, ..., repN_R2 ] ]`,
which is pretty much what you need, isn't it? It should then be possible to do `(forward, reverse) = [ [ rep1_R1, rep2_R1, ..., repN_R1 ], [ rep1_R2, rep2_R2, ..., repN_R2 ] ]` and
|
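For reference, Groovy's `List.transpose()` on a list of pairs behaves like Python's `zip(*...)`; a quick sketch of the same forward/reverse separation (read names are placeholders):

```python
# A list of replicate pairs, as produced per sample
pairs = [["rep1_R1", "rep1_R2"], ["rep2_R1", "rep2_R2"]]

# Transpose into (all forward reads, all reverse reads)
forward, reverse = (list(t) for t in zip(*pairs))

assert forward == ["rep1_R1", "rep2_R1"]
assert reverse == ["rep1_R2", "rep2_R2"]
```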
I get it now. Thank you. I will give it a try.
btw, there was not a lot of follow-up on my question on slack, except for one comment that suggests appending the Unless @apeltzer disagrees, I'd therefore get rid of the respective code in the |
I agree with what Gregor proposes above and looking at the above discussion this looks like the best way forward to resolve this efficiently :-) Thanks for taking care of it @fmalmeida 💯 |
flattens reads array as it is expected by the module
Did our changes 😄
`scrnaseq/subworkflows/local/input_check.nf` (lines 11 to 18 in 6d9caad)
`scrnaseq/modules/local/salmon_alevin.nf` (lines 25 to 26 in 5955ac5)
|
Thanks @fmalmeida!
I added one tiny question in the code, other than that I think we got the pipeline logic right now 🚀
```groovy
CELLRANGER_COUNT (
    // TODO what is `gem` and why is it needed?
    // needed to add .flatten() on reads as cellranger count module expects a flattened array
    ch_fastq.map{ meta, reads -> [meta + ["gem": meta.id, "samples": [meta.id]], reads.flatten()] },
```
Isn't this `.flatten()` here redundant with the `.flatten()` in `input_check.nf`?
If so, I would indeed prefer the `.flatten()` here (and in the kallisto call) over having it in `input_check`, to keep the pairing information intact as long as possible.
Yes, it is redundant! Nicely spotted. I did this one first, before changing the subworkflow, and forgot about it!
😄
You prefer having them in the modules? I mean, this flatten is just to revert what happened when applying `groupTuple`, so it returns to the normal conventions of nf-core modules.
That's why I thought to put it in `input_check`: to make sure what comes from there follows the adopted conventions. And also, to have only one `.flatten()` call instead of multiple across modules :)
As long as it works I guess it doesn't matter too much 🤷
My line of thinking was that I prefer having `[ [ rep1_R1, rep1_R2 ], [ rep2_R1, rep2_R2 ], [ ... ] ]` over `[ rep1_R1, rep1_R2, rep2_R1, rep2_R2, ... ]`, because in the former case the pairing information is explicit.
But I get your point with having only a single `reads` argument for the modules. Let's keep it that way!
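The trade-off discussed here, nested pairs versus one flat `reads` list, mirrors what Groovy's `.flatten()` does; as a Python sketch (placeholder names), the pairing becomes implicit in the ordering:

```python
# Explicit pairing: a list of replicate pairs
nested = [["rep1_R1", "rep1_R2"], ["rep2_R1", "rep2_R2"]]

# Equivalent of Groovy's reads.flatten(): one flat list,
# pairing now implicit in the R1/R2 alternation
flat = [f for pair in nested for f in pair]

assert flat == ["rep1_R1", "rep1_R2", "rep2_R1", "rep2_R2"]
```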
Maybe add a comment in the module itself or `meta.yml` about the order in which multiple replicates are expected for the `reads` argument?
so users always put their replicates properly ordered when creating the samplesheet, right?
you mean, always putting the corresponding pair of fastqs in one row? I would hope so!
If the files follow some naming convention, we could technically try to validate it, but if not there's not a lot we can do about it anyway...
For now, I just removed the redundancy spotted. If we actually decide to go the other way and have it in the modules, it does not seem like a complex change to make.
👌 perfect
And about the comment, would we have to make these comments directly on our branch?
I only meant the `local` modules for now. The linter shouldn't care about them.
I was thinking of adding a comment like this in the modules:

```groovy
//
// Input reads are expected to come as: [ meta, [ pair1_read1, pair1_read2, pair2_read1, pair2_read2 ] ]
// Input array for a sample is created in the same order reads appear in samplesheet as pairs from replicates are appended to array.
//
```

If so, I can already add it to the relevant modules.
Then we can later discuss what's still missing for the PR 😄
LGTM!
Added on our relevant local files:
- `star_align.nf`
- `kallistobustools_count.nf`
- `salmon_alevin.nf`
Yeah, I think we can still think about making these official, e.g. moving them where necessary to nf-core/modules and/or adding their functionality there, and then relying on these in the future 👍🏻
removing redundancy spotted
This reverts commit 5c2a21a.
As discussed with @grst @fmalmeida and @alexblaessle we will merge this to |
Integrate the cellranger workflow into the main pipeline.
(it was already implemented to some degree, but not exposed via the parameters)
Closes #31
PR checklist
- `scrape_software_versions.py`
- Make sure your code lints (`nf-core lint .`).
- Ensure the test suite passes (`nextflow run . -profile test,docker`).
- `docs/usage.md` is updated.
- `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).