
qmode collect step for large cohorts #180

Closed
kate-stankiewicz opened this issue Dec 2, 2022 · 4 comments

Comments

@kate-stankiewicz

  • spladder version: 3.0.3
  • Python version: 3.9.12
  • Operating System: Ubuntu 20

Description

I have followed the steps for running SplAdder on large cohorts as outlined here https://spladder.readthedocs.io/en/latest/spladder_cohort.html

However, I am not sure what the second phase of the quantification step is doing. The documentation page says: "As a second step to this phase, we need to collect the individual quantifications and aggregate them in a joint database:". However, when I run this step I get no output and no log information (even in -v mode). It also only runs for about 30 seconds, while every other step in the workflow runs for at least 30 minutes, if not days. From what I can tell, no new files are created and no existing files are modified, yet the subsequent steps (event calling and testing) still run. Is this step doing something that might not be visible to me? Or is it possible to skip it?

Many thanks!

What I Did

# Here I use a job array to run each of the 83 samples in parallel
spladder build -o ${workdir}/array_spladder_out -a ${workdir}/annotation/genome.annotation.gff -b ${1} --merge-strat single --no-extract-ase --parallel 2 -v

# next merge the splice graphs
spladder build -o ${workdir}/array_spladder_out -a ${workdir}/annotation/genome.annotation.gff -b ${workdir}/bams.txt --merge-strat merge_graphs --no-extract-ase --parallel 40 -v

# next run quantification for each sample separately using a job array again
spladder build -o ${workdir}/array_spladder_out -a ${workdir}/annotation/genome.annotation.gff -b ${1} --merge-strat merge_graphs --no-extract-ase --quantify-graph --qmode single --parallel 2 -v

# aggregate them into a joint database (this seems to do nothing and produce no output/logs)
spladder build -o ${workdir}/array_spladder_out -a ${workdir}/annotation/genome.annotation.gff -b ${workdir}/bams.txt --merge-strat merge_graphs --no-extract-ase --quantify-graph --qmode collect --parallel 40 -v

# call events
spladder build -o ${workdir}/array_spladder_out -a ${workdir}/annotation/genome.annotation.gff -b ${workdir}/bams.txt --parallel 40 -v

# run 30 different tests comparing different groups of samples (again using a job array to automate running each comparison)
spladder test -o ${workdir}/array_spladder_out --out-tag Rem_Prob --conditionA ${workdir}/testing_contrasts/contrast_files/sym_${1}_${3}.txt --conditionB ${workdir}/testing_contrasts/contrast_files/sym_${2}_${3}.txt --labelA ${1}${3} --labelB ${2}${3} --diagnose-plots -v --parallel 5

@akahles
Member

akahles commented Feb 6, 2023

Dear @kate-stankiewicz ,

sorry for the late reply. I sometimes fall behind on answering issues due to other higher priority items.

I will refer to your individual steps above by number, from 1 to 6.

In step 1, you generate a single splicing graph per input bam file. In the same step you also quantify each single graph in that sample, as the --quantify-graph option is set by default. What you would need in this case is to set the --no-quantify-graph option, which simply generates the individual sample graphs. Without --no-quantify-graph, the single graphs are still generated, but so are the per-bam quantifications for each respective graph (which you might not want).
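A sketch of step 1 with the suggested flag added, reusing the variables and options from the commands above (this is the same command as in the original step 1, only with --no-quantify-graph appended):

```shell
# Step 1: build one splicing graph per sample (job array, one bam per task),
# deferring all quantification to the later parallel step
spladder build -o ${workdir}/array_spladder_out \
               -a ${workdir}/annotation/genome.annotation.gff \
               -b ${1} \
               --merge-strat single --no-extract-ase --no-quantify-graph \
               --parallel 2 -v
```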

In step 2, you merge the single sample graphs into a joint graph. That is, all events detected across all samples are merged per gene. Here too you would want to add the --no-quantify-graph option, as otherwise the quantifications for the merged graph are done per sample, but sequentially. (This is what you would like to parallelise in step 3.)
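Analogously, step 2 with the sequential quantification disabled (again the original command with --no-quantify-graph appended):

```shell
# Step 2: merge the per-sample graphs into a joint graph,
# without the slow sequential per-sample quantification
spladder build -o ${workdir}/array_spladder_out \
               -a ${workdir}/annotation/genome.annotation.gff \
               -b ${workdir}/bams.txt \
               --merge-strat merge_graphs --no-extract-ase --no-quantify-graph \
               --parallel 40 -v
```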

In step 3, you quantify the merged graph with each single bam file. That is, events detected in only one sample will also be quantified in all samples. This is sometimes useful, as the criteria for adding a new event to the splicing graph are quite stringent: a sample might not have enough reads to add a new event, but enough to quantify it once it was detected in another sample. Running this in parallel over samples speeds up the process, depending on how many samples you run concurrently.

In step 4, you collect the quantifications from step 3. This should generate genes_graph_conf3.merge_graphs.count.hdf5 and genes_graph_conf3.merge_graphs.gene_exp.hdf5 (assuming you run with the default confidence level 3). However, in your run above, these files were already generated by step 2, so there was nothing left to do. I agree that the verbose mode could indeed be a bit more (or at all) verbose about this ...

I hope this answers your question. Please let me know if you have additional ones.

Best,
Andre

@kate-stankiewicz
Author

Hi Andre,

Thank you for this informative and in-depth explanation--it is really helpful! So if I am understanding correctly, the results of running these steps the way I did and running them as you outlined here should be the same, correct? The only thing I missed out on is better parallelization by delaying the graph quantification until a later step?

Also, to possibly save you having to address a similar issue from another user, the --no-quantify-graph option is not listed in the steps for use on large cohorts: https://spladder.readthedocs.io/en/latest/spladder_cohort.html

Only the --no-extract-ase option is listed in the earlier steps. Just wanted to let you know since it seems to be important to include and makes sense as you explained it above.

@akahles
Member

akahles commented Feb 7, 2023

Hi @kate-stankiewicz ,

Thanks for the hint about the documentation. I will fix this right away. Regarding your statement about the results - yes, this is true. It only took longer than necessary, but the outputs are the same.

Best,

Andre

@kate-stankiewicz
Author

Hi @akahles ,

Thanks for the clarification! I will close this issue now.

akahles added a commit that referenced this issue Feb 7, 2023