allow starting the workflow with existing read alignments (sorted BAM/CRAM files) #92

Open
gpertea opened this issue Mar 3, 2023 · 3 comments
Labels
enhancement New feature or request
Milestone

Comments

@gpertea
Member

gpertea commented Mar 3, 2023

There are situations where users want to change e.g. featureCounts options, or bring in already prepared read alignments (sorted BAMs). In such cases it would be useful to have the option to skip the HISAT2/STAR alignment step and use the given alignments as "input" for the other steps in the pipeline.

I am aware this would involve skipping any steps that depend on the FASTQ files (which means not (re)generating the rse_tx object, not including any FastQC metrics in colData, etc.). However, there are ways to generate the rse_tx object from the BAM files (I can help with implementing that option).

It seems that the BiocMAP workflow was split into 2 Nextflow scripts in part for a similar reason, if I am not mistaken. A similar interim (simpler) solution might be the way to address this request initially: create an alternate workflow besides main.nf that would work on BAM (or CRAM) files and run only the steps related to the read alignment data (featureCounts etc.), with an option to build rse_tx from the provided alignment data.
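
Since such a workflow would rely on the provided BAMs already being sorted, a pre-flight check of the header could be useful. The following is only a minimal sketch under stated assumptions, not existing SPEAQeasy code: `bam_sort_order` and `is_coordinate_sorted` are hypothetical helper names, only BAM (not CRAM) is handled, and the check trusts the `SO:` tag declared on the `@HD` header line (the same information `samtools view -H` prints) rather than verifying the actual record order.

```python
import gzip
import struct


def bam_sort_order(bam_path):
    """Return the SO: value declared on the @HD header line, or None if absent.

    Hypothetical helper for illustration; it reads only the BAM header.
    """
    # BGZF (the BAM container) is a series of gzip members, so the
    # standard gzip module can decompress it without extra libraries.
    with gzip.open(bam_path, "rb") as handle:
        if handle.read(4) != b"BAM\x01":
            raise ValueError(f"{bam_path} does not look like a BAM file")
        # l_text: length of the plain-text SAM header, little-endian.
        (text_length,) = struct.unpack("<I", handle.read(4))
        raw_text = handle.read(text_length)
    header_text = raw_text.rstrip(b"\x00").decode("utf-8", "replace")
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None


def is_coordinate_sorted(bam_path):
    """True only when the header explicitly declares coordinate sorting."""
    return bam_sort_order(bam_path) == "coordinate"
```

A file whose header has no `@HD` line or no `SO:` tag is reported as not sorted here, which is the conservative choice for an input check.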

@gpertea gpertea added the enhancement New feature or request label Mar 3, 2023
@gpertea
Member Author

gpertea commented Mar 3, 2023

I can help with most of the shell and R code necessary to implement this alternate workflow (as I already have some non-user-friendly scripts doing that), but I would need some help with the Nextflow code/implementation.

@Nick-Eagles
Member

BiocMAP was split at "the same point" mostly on the thought that GPUs (used for alignment) might not be available on the same machines where massive CPU/memory resources (used for post-alignment steps) were available. I'm a bit concerned that there are too many ways a user might want to partially run SPEAQeasy (e.g. run transcript quantification again but not alignment, only call variants, etc.), and this would be only one specific solution. Unfortunately, Nextflow doesn't support this type of partial-running functionality without modifying/adding a lot of code. That said, if starting from aligned files is a repeated use case you're seeing, I can help out.

@gpertea
Member Author

gpertea commented Mar 3, 2023

Thank you Nick - perhaps the easiest approach at this point would be to help me put together a cut-down version of main.nf that can take the BAM files as input (a different samples.manifest? or just a pointer to a directory with the sorted BAM files?) and then run only the branches of the workflow that depend on those alignments. We could even add another input for the colData needed to (re)build rse_gene and rse_exon, I suppose.

I can take care of the R scripts there (like create_count_objects.R) to make them ignore the transcript assays if they are not available, etc., but the Nextflow part itself was the problem for me: my limited experience with Nextflow (and time constraints) prevented me from attempting this by myself.
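
To make the "directory of sorted BAM files" input option above more concrete, here is a rough sketch of how such a directory could be turned into a manifest-like list of samples. Everything in it is an assumption for illustration rather than existing SPEAQeasy behavior: the helper names, the accepted suffixes, and the convention that the sample ID is the file name with the alignment suffix removed.

```python
from pathlib import Path

# Assumed suffixes, longest first so "s1.sorted.bam" yields "s1"
# rather than "s1.sorted". Not an actual SPEAQeasy convention.
ALIGNMENT_SUFFIXES = (".sorted.bam", ".bam", ".cram")


def sample_id_from_path(path):
    """Strip the first matching alignment suffix; None if nothing matches."""
    name = path.name
    for suffix in ALIGNMENT_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return None


def build_alignment_manifest(bam_dir):
    """Return (sample_id, path) pairs, sorted by sample ID."""
    manifest = {}
    for path in sorted(Path(bam_dir).iterdir()):
        if not path.is_file():
            continue
        sample_id = sample_id_from_path(path)
        if sample_id is None:
            continue  # index files (.bai, .crai) and unrelated files are skipped
        if sample_id in manifest:
            raise ValueError(f"duplicate sample ID {sample_id!r} in {bam_dir}")
        manifest[sample_id] = path
    return sorted(manifest.items())
```

The resulting pairs could be written out as a simple two-column table, and the same sample IDs could serve as the key for joining a user-supplied colData table. Rejecting duplicate IDs up front avoids silently counting two files as one sample.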

@lcolladotor lcolladotor added this to the bioc v3.21 milestone Nov 30, 2023
@lcolladotor lcolladotor moved this to Todo in SPEAQeasy plans Nov 30, 2023
Projects
Status: Todo
Development

No branches or pull requests

3 participants