Daylily Whitepaper
Intended for publication in F1000Research; analysis results are still being processed.
- Reproducibility requires repeatable compute performance.
- Cost, driven by compute performance (as well as storage and network), drives decisions that influence accuracy.
- Cost should be reliably predictable (Daylily includes a comprehensive cost-prediction UI and other tools covering not just compute, but also data transfer, storage, and long-term data management).
- Daylily offers a mechanism to share a common, reasonably accessible hardware environment, enabling better benchmarking.
- WGS analysis should cost only ~$3-4 per sample; everything Daylily does was possible four years ago.
- Daylily was initially designed for tool-interaction benchmarking (all aligners × all dedupers × all variant callers × all SV callers), and it still serves this purpose admirably (the combination arithmetic is sketched after the hg38 results below).
- Compute hardware reproducibility is a much larger factor when running pipelines than the choice of workflow orchestrator, yet the orchestrators are where most of the attention goes. Why? Distraction, etc.
- Workflows are largely implemented in Snakemake. However, the compute framework is Slurm-based, and any orchestrator that can submit to Slurm can run just as easily (a minimal submission sketch appears below). Cromwell is working in a very rudimentary fashion and was quick to get going; Nextflow and others can be plugged in with minimal fuss.
- When presenting information about analysis pipelines, there should be full transparency on (a sketch of such a per-task record follows this list):
- user run time
- accuracy of final results
- expected vCPU time per task
- expected memory required per task
- cost of compute and data transfer
- data storage
- cost and availability of the reference and other resources needed to run the compute
- long-term sustainability/reproducibility of the tools involved
- full details on tool versions, etc.
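As a rough sketch of what this per-task transparency could look like in practice, the record below captures the fields listed above; the field names, types, and example values are illustrative assumptions, not Daylily's actual reporting schema.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class TaskReport:
    """Illustrative per-task transparency record (field names are assumptions,
    not Daylily's actual schema)."""
    task_name: str                       # e.g. "bwa_mem2_align"
    tool_versions: dict                  # {"bwa-mem2": "2.2.1", ...}
    user_runtime_sec: float              # wall-clock time observed by the user
    vcpu_seconds: float                  # total vCPU time consumed
    peak_mem_gb: float                   # peak memory required
    compute_cost_usd: float              # compute cost attributed to this task
    data_transfer_cost_usd: float = 0.0
    storage_cost_usd: float = 0.0
    reference_resources: list = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


# Example: a hypothetical alignment task record with placeholder values.
report = TaskReport(
    task_name="bwa_mem2_align",
    tool_versions={"bwa-mem2": "2.2.1", "samtools": "1.19"},
    user_runtime_sec=1800.0,
    vcpu_seconds=230_000.0,
    peak_mem_gb=64.0,
    compute_cost_usd=0.85,
)
print(report.to_json())
```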
This is going to end up as two repositories: one for the framework, another for the analysis workflows.
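Because the compute layer is plain Slurm, any orchestrator that can submit batch jobs can drive it. The snippet below is a minimal, hypothetical sketch of such a submission; the partition name, resource values, and example command are placeholders, not Daylily defaults.

```python
import subprocess


def submit_to_slurm(command: str, cpus: int, mem_gb: int,
                    partition: str = "compute",
                    job_name: str = "daylily-task") -> str:
    """Submit a single shell command to Slurm via sbatch and return the job ID.

    This is the same mechanism Snakemake, Cromwell, or Nextflow rely on when
    configured with a Slurm executor; all values here are placeholders.
    """
    sbatch_cmd = [
        "sbatch",
        "--parsable",                      # print only the job ID
        f"--job-name={job_name}",
        f"--partition={partition}",
        f"--cpus-per-task={cpus}",
        f"--mem={mem_gb}G",
        f"--wrap={command}",
    ]
    result = subprocess.run(sbatch_cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Example: a hypothetical alignment step submitted outside any orchestrator.
# job_id = submit_to_slurm("bwa-mem2 mem -t 16 ref.fa r1.fq r2.fq > out.sam",
#                          cpus=16, mem_gb=64)
```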
- All seven GIAB samples: Google Brain 30x, 2x150 Illumina, amplification-free WGS FASTQs.
b37 human reference:
- 105 SV VCFs produced for analysis (7 GIAB × 3 aligners × 1 deduper × 5 SV callers).
- 63 SNV VCFs produced for analysis (7 GIAB × 3 aligners × 1 deduper × 3 SNV callers).
- A large number of QC reports, rolled into one MultiQC report.
hg38 human reference:
- 105 SV VCFs produced for analysis (7 GIAB × 3 aligners × 1 deduper × 5 SV callers).
- 63 SNV VCFs produced for analysis (7 GIAB × 3 aligners × 1 deduper × 3 SNV callers).
- A large number of QC reports, rolled into one MultiQC report.
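The VCF counts above are simply the cross product of samples and tool choices. The sketch below reproduces that arithmetic; which three SNV callers and five SV callers make up the counted runs is not spelled out here, so those tool subsets are placeholders.

```python
from itertools import product

# The seven GIAB reference samples.
giab_samples = ["HG001", "HG002", "HG003", "HG004", "HG005", "HG006", "HG007"]
aligners = ["bwa-mem2", "strobealign", "sentieon-bwa-mem"]
dedupers = ["doppelmark"]
# Placeholder subsets: the exact 3 SNV callers and 5 SV callers behind the
# counted runs are not specified in this draft.
snv_callers = ["deepvariant", "clair3", "octopus"]
sv_callers = ["manta", "tiddit", "dysgu", "svaba", "sv_caller_5"]

snv_vcfs = list(product(giab_samples, aligners, dedupers, snv_callers))
sv_vcfs = list(product(giab_samples, aligners, dedupers, sv_callers))

print(len(snv_vcfs))  # 7 * 3 * 1 * 3 = 63 SNV VCFs per reference build
print(len(sv_vcfs))   # 7 * 3 * 1 * 5 = 105 SV VCFs per reference build
```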
- *~$3-4* per 30x genome, in ~90 min walltime (an illustrative cost sketch follows the next item). NOTE: there are still opportunities to reduce this by ~20%, given time/resources. NOTE 2: raw compute cost can be reduced by ~$1 using Sentieon; however, additional software license fees would apply.
- Achievable F-scores are as good as or better than expectations from the published literature (0.999+), depending on tool combination.
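To make the cost claim concrete, here is a back-of-the-envelope sketch of how a per-sample figure is assembled from vCPU-hours, storage, and transfer; every number is an illustrative placeholder, not a measured Daylily value.

```python
# Back-of-the-envelope per-sample cost model. Every number below is an
# illustrative placeholder, NOT a measured Daylily figure.
spot_price_per_vcpu_hour = 0.015    # assumed effective spot $/vCPU-hour
vcpu_hours_per_sample = 200.0       # assumed total vCPU-hours across all tasks
storage_gb_months = 20.0            # assumed intermediate + result footprint
storage_price_per_gb_month = 0.023  # assumed object-storage $/GB-month
egress_gb = 5.0                     # assumed results copied out of the region
egress_price_per_gb = 0.09          # assumed egress $/GB

compute = spot_price_per_vcpu_hour * vcpu_hours_per_sample
storage = storage_price_per_gb_month * storage_gb_months
transfer = egress_price_per_gb * egress_gb

print(f"compute ${compute:.2f} + storage ${storage:.2f} + transfer ${transfer:.2f} "
      f"= ${compute + storage + transfer:.2f} per 30x genome")
```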
All of the following tools have been optimized for the Daylily compute framework instances.
Aligners:
- bwa-mem2
- strobealign
- sentieon bwa mem
Deduplication:
- doppelmark
SNV callers:
- deepvariant
- clair3
- lofreq2
- octopus
- sentieon DNAscope
- For all SNV VCFs, a detailed concordance analysis is produced (the underlying F-score arithmetic is sketched after the SV caller list).
SV callers:
- manta
- tiddit
- dysgu
- svaba (still being tested)
- SV concordance analysis is still to be done... assistance welcome!
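The F-scores quoted earlier come out of this kind of concordance comparison against the GIAB truth sets. The sketch below simply restates the precision/recall/F1 arithmetic from TP/FP/FN counts (the counts shown are placeholders, of the kind reported by a comparison tool such as hap.py).

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Standard precision/recall/F1 arithmetic for comparing a call set
    against a truth set (TP/FP/FN counts as a concordance tool reports them)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Placeholder counts for illustration only -- not measured Daylily results.
print(round(f1_from_counts(tp=3_350_000, fp=1_500, fn=2_000), 4))  # ~0.9995
```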
A large number of QC tools are enabled by default and aggregated into a final MultiQC report (a minimal invocation sketch follows this list):
- peddy
- qualimap
- fastqc
- coverage evenness
- mosdepth
- goleft
- verifybamid
- alignstats
- vcfstats
- bamstats
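As a small illustration of the aggregation step, the snippet below shows one way the per-tool QC outputs could be rolled into a single report by invoking MultiQC over a results directory; the directory layout is an assumption, not Daylily's actual output structure.

```python
import subprocess

# Roll all per-tool QC outputs found under an (assumed) results/qc directory
# into a single MultiQC report written to results/multiqc.
subprocess.run(
    ["multiqc", "results/qc", "--outdir", "results/multiqc", "--force"],
    check=True,
)
```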