Skip to content

Latest commit

 

History

History
107 lines (78 loc) · 3.93 KB

README.md

File metadata and controls

107 lines (78 loc) · 3.93 KB

Daylily Whitepaper

Intended to be published in f1000-research.

analysis results being processed

Sections

Abstract

Core Topics

  • Reproducibility requires repeatable compute performance.
  • Cost, as driven by compute performance (and storage and network) drive decisions which influence accuracy.
  • Cost should be reliably predictable (and daylily includes a comprehensive cost prediction UI and other tools: for not just compute, but data transfer, storage, and long term data management)
  • Daylily offers a mechanism to share a common, and reasonably accassible to all, hardware environment & be able to better benchmark.
  • WGS analysis should only cost ~ $3-4 per sample, everything daylily is doing was possible 4 years ago.
  • Daylily was initially designed for tool interaction benchmarking (so, all aligners by all dedupers by all variant callers by all sv callers), and it still serves this purpose admirably.
  • Compute hardware reporducibility is MUCH more of a factor when choosing to run pipelines vs. the workflow orchestrators. Yet, the orchestrators is where most attention goes. Why? Distraction, etc.
  • Workflows are largely working via snakemake. HOWEVER the compute framework us slurm based, and any orchestrator which can use slurm can run just as easily. Cromwell is working in a very rudimentary fashion, and was quick to get working. Nextflow, etc can be plugged in with minimal fuss.
  • When presenting information about analysis pipelines, there should be full transparency on:
    • user run time
    • accuracy of final results
    • expectations on vcpu time per task
    • expectations on mem required per task
    • COST of compute, data transfer
    • data storage
    • cost and availability of reference and other resources needed to run compute
    • long term sustainability/reproducibility of tools involved
    • full details on tool versions, etc,...

Daylily Design

This is going to end up 2 repos, one for framework stuff, another for analysis workflows.

Infrasttructure Framework

Analysis Suite

Integrated Cost / Budget Prediction, Monitoring, Safeguards, Process Interactions

Daylily Performance

Dataset

  • All 7 GIAB samples, google brain 30x 2x150 ILMN no amp WGS fastqs.

  • b37 human reference

    • 105 SV vcfs produced for analysis (7GIAB * 3 aligners * 1 deduper * 5 SV callers).
    • 63 SNV vcfs produced for analysis (7GIAB * 3 aligners * 1 deduper * 3 SV callers)..
    • A LOT of qc reports, rolled into 1 multiqc report.
  • hg38 human reference

    • 105 SV vcfs produced for analysis (7GIAB * 3 aligners * 1 deduper * 5 SV callers).
    • 63 SNV vcfs produced for analysis (7GIAB * 3 aligners * 1 deduper * 3 SV callers)..
    • A LOT of qc reports, rolled into 1 multiqc report.

Cost & Accuracy Performance

  • *~ $3-4* per 30x genome, in ~90min walltime. NOTE: there are still opportunities to reduce this by ~20%, given time/resources. NOTE2: raw compute cost can be reduced by ~$1 using sentieon, however, there will be additional s/w lisc fees to add to this.
  • Fscores acheivable are as good or better than expectations from publications in the literature (so 0.999+ ), depending on tool combination.

Toolsets

All of these have been optimized for the daylily compute framework instances.

Aligners

  • bwa mem2
  • strobealigner
  • sentieon bwa mem

Deduplication

  • doppelmark

SNV Callers

  • deepvariant
  • clair3
  • lofreq2
  • octopus
  • sentieon DNAscope
Automated SNV Concordance Reports
  • for all SNV vcfs, there is a detailed analysis of concordance.

SV Callers

  • manta
  • tiddit
  • dysgu
  • svaba being tested still
Automated SV Concordance Repors

to be done... assistance welcome~

QC Reporting

A large number of QC tools are enabled by default, and aggregated into a final multiqc report

  • peddy
  • qualimap
  • fastqc
  • cov eveness
  • mosdepth
  • goleft
  • verifybam
  • alignstats
  • vcfstats
  • bamstats