JASAPAGE: Phased genome assemblies and pangenome graphs of human populations of Japan and Saudi Arabia
This repository contains workflows and data for constructing pangenomes from Saudi and Japanese population genomic data using the minigraph-cactus pipeline. The workflows are implemented using Common Workflow Language (CWL), enabling reproducible and portable execution across different computing environments.
Minigraph-cactus is a pangenome construction pipeline that combines the efficiency of minigraph with the accuracy of Progressive Cactus. This hybrid approach offers several advantages:
- Scalable construction of genome graphs from multiple assemblies
- Preservation of structural variants and complex genomic regions
- Efficient handling of large-scale genomic data
- Integration of both reference-based and reference-free approaches
The pipeline first uses minigraph to build an initial graph from the input assemblies, then applies Progressive Cactus multiple genome alignment to refine it into the final pangenome graph.
CWL is a specification for describing analysis workflows and tools. Its key features include:
- Platform independence
- Explicit declaration of dependencies
- Built-in support for Docker/container technologies
- Scalability from single workstations to HPC environments
- Clear separation of tools from the workflow logic
You need to install Docker or Singularity.
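# Install required software (the cwlref-runner package provides the cwl-runner command)
pip install cwltool cwlref-runner
# Fetch the workflows
git clone https://github.com/JaSaPaGe/pangenome-cwl
# Run the variant-calling workflow with your inputs file
cwl-runner variant-calling/main-vg.cwl inputs.yml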
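Because either Docker or Singularity can serve as the container runtime, the launch command may need to differ between machines. The following is a minimal sketch, not part of the repository, that picks the command line based on which runtime is on PATH. It assumes that cwl-runner resolves to cwltool, whose --singularity flag switches execution from Docker to Singularity; the function name build_command and its which parameter are hypothetical.

```python
import shutil
from typing import Callable, List, Optional


def build_command(
    workflow: str,
    job: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return a cwl-runner command line for the container runtime found on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    Assumption: cwl-runner is backed by cwltool.
    """
    if which("docker"):
        # Docker is cwltool's default runtime, so no extra flag is needed.
        return ["cwl-runner", workflow, job]
    if which("singularity"):
        # cwltool's --singularity flag runs the containers with Singularity.
        return ["cwl-runner", "--singularity", workflow, job]
    raise RuntimeError("Neither docker nor singularity was found on PATH")
```

The returned list can be passed to subprocess.run, for example subprocess.run(build_command("variant-calling/main-vg.cwl", "inputs.yml"), check=True).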
Create a YAML file with your inputs:
graph:
  class: File
  location: reference/ksa_jpn_hg38.gbz
reads1:
  class: File
  location: input/sample_R1.fastq.gz
reads2:
  class: File
  location: input/sample_R2.fastq.gz
hapl:
  class: File
  location: reference/ksa_jpn_hg38.hapl
ref_prefix: GRCh38
model_type: WGS
ref:
  class: File
  location: reference/hg38.fa
  secondaryFiles:
    - class: File
      location: reference/hg38.fa.fai
output_cram: aligned_reads.cram
output_vcf: sample.vcf
output_gvcf: sample.gvcf
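When many samples share the same reference files, writing the inputs file by hand becomes repetitive. The sketch below is one possible way to generate the same structure per sample and to list input files that are missing before launching a run. It is not part of the repository: the helper names make_job and missing_files, the input_dir and ref_dir parameters, and the per-sample output names are all assumptions for illustration. It emits JSON, which is a subset of YAML and is accepted by cwltool as a job file.

```python
from __future__ import annotations

import json
from pathlib import Path


def make_job(sample: str, input_dir: str = "input", ref_dir: str = "reference") -> dict:
    """Build the job inputs for one paired-end sample, mirroring the example above."""

    def file(path: str, **extra) -> dict:
        # A CWL File object: class plus location, with optional extra fields.
        return {"class": "File", "location": path, **extra}

    return {
        "graph": file(f"{ref_dir}/ksa_jpn_hg38.gbz"),
        "reads1": file(f"{input_dir}/{sample}_R1.fastq.gz"),
        "reads2": file(f"{input_dir}/{sample}_R2.fastq.gz"),
        "hapl": file(f"{ref_dir}/ksa_jpn_hg38.hapl"),
        "ref_prefix": "GRCh38",
        "model_type": "WGS",
        "ref": file(
            f"{ref_dir}/hg38.fa",
            secondaryFiles=[file(f"{ref_dir}/hg38.fa.fai")],
        ),
        # Assumption: outputs are named after the sample.
        "output_cram": f"{sample}.cram",
        "output_vcf": f"{sample}.vcf",
        "output_gvcf": f"{sample}.gvcf",
    }


def missing_files(job: dict) -> list[str]:
    """Return every File location in the job that does not exist on disk.

    Assumption: locations are checked relative to the current directory.
    """
    missing: list[str] = []

    def visit(node) -> None:
        if isinstance(node, dict):
            if node.get("class") == "File" and not Path(node["location"]).exists():
                missing.append(node["location"])
            for value in node.values():
                visit(value)
        elif isinstance(node, list):
            for item in node:
                visit(item)

    visit(job)
    return missing


if __name__ == "__main__":
    job = make_job("sample")
    print(json.dumps(job, indent=2))  # redirect to a file to use it as the job file
    for path in missing_files(job):
        print("missing:", path)
```

Running the script from the directory that holds input/ and reference/ prints the job document followed by any paths that still need to be provided.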
- aligned_reads.cram: Alignment against GRCh38
- sample.gvcf: Genomic VCF
- sample.vcf: Variant calls
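As a quick sanity check on the variant calls, the records in the uncompressed VCF can be tallied by their FILTER value. The following is a minimal sketch using only the standard library; the function name summarize_vcf is hypothetical, and the FILTER values that actually appear depend on the caller used by the workflow. It relies only on the VCF convention that header lines start with # and that FILTER is the seventh tab-separated column.

```python
from collections import Counter
from typing import Iterable


def summarize_vcf(lines: Iterable[str]) -> Counter:
    """Count VCF records per FILTER value, skipping header and blank lines."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
        counts[fields[6]] += 1
    return counts


# Usage (assumes sample.vcf is in the current directory):
#   with open("sample.vcf") as handle:
#       print(summarize_vcf(handle))
```

Passing an open file handle works because the function only iterates over lines, so large files are not read into memory at once.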
If you use this pipeline in your research, please cite:
The data are available under CC0.
Please create GitHub issues if you encounter any problems.