Skip to content
Tanger Zhang edited this page Mar 8, 2021 · 17 revisions

ALLHiC

Phasing and scaffolding polyploid genomes based on Hi-C data

Introduction

The major problem of scaffolding polyploid genome is that Hi-C signals are frequently detected between allelic haplotypes and any existing stat of art Hi-C scaffolding program links the allelic haplotypes together. To solve the problem, we developed a new Hi-C scaffolding pipeline, called ALLHIC, specifically tailored to the polyploid genomes. ALLHIC pipeline contains a total of 5 steps: prune, partition, rescue, optimize and build.

Overview of ALLHiC

image
Figure 1. Overview of major steps in ALLHiC algorithm. The newly released ALLHiC pipeline contains a total of 5 functions: prune, partition, rescue, optimize and build. Briefly, the prune step removes the inter-allelic links so that the homologous chromosomes are more easily separated individually. The partition function takes pruned bam file as input and clusters the linked contigs based on the linkage suggested by Hi-C, presumably along the same homologous chromosome in a preset number of partitions. The rescue function searches for contigs that are not involved in partition step from original un-pruned bam files and assigned them to specific clusters according Hi-C signal density. The optimize step takes each partition, and optimize the ordering and orientations for all the contigs. Finally, the build step reconstructs each chromosome by concatenating the contigs, adding gaps between the contigs and generating the final genome release in FASTA format.

Explanation of Prune

Prune function will firstly allow us to detect allelic contigs, which can be achieved by identifying syntenic genes based on a well-assembled close related species or an assembled monoploid genome. Signals (normalized Hi-C reads) between allelic contigs are removed from the input BAM files. In polyploid genome assembly, haplotypes that share high similarity are likely to be collapsed. Signals between the collapsed regions and nearby phased haplotypes result in chimeric scaffolds. In the prune step, only the best linkage between collapsed coting and phased contig is retained.

image Figure 2. Description of Hi-C scaffolding problem in polyploid genome and application of prune approach for haplotype phasing. (a) a schematic diagram of auto-tetraploid genome. Four homologous chromosomes are indicated as different colors (blue, orange, green and purple, respectively). Red regions in the chromosomes indicate sequences with high similarity. (b) Detection of Hi-C signals in the auto-tetraploid genome. Black dash lines indicate Hi-C signals between collapsed regions and un-collpased contigs. Pink dash lines indicate inter-haplotype Hi-C links and grey dash lines indicate intra-haplotype Hi-C links. During assembly, red regions will be collapsed due to high sequence similarity; while, other regions will be separated into different contigs if they have abundant variations. Since the collapsed regions are physically related with contigs from different haplotypes, Hi-C signals will be detected between collapsed regions with all other un-collapsed contigs. (c) Traditional Hi-C scaffolding methods will detect signals among contigs from different haplotypes as well as collapsed regions and cluster all the sequences together. (d) Prune Hi-C signals: 1- remove signals between allelic regions; 2- only retain the strongest signals between collapsed regions and un-collapsed contigs. (e) Partition based on pruned Hi-C information. Contigs are ideally phased into different groups based on prune results.

Citations

Zhang, X. Zhang, S. Zhao, Q. Ming, R. Tang, H. Assembly of allele-aware, chromosomal scale autopolyploid genomes based on Hi-C data. Nature Plants, doi:10.1038/s41477-019-0487-8 (2019).
Zhang, J. Zhang, X. Tang, H. Zhang, Q. et al. Allele-defined genome of the autopolyploid sugarcane Saccharum spontaneum L. Nature Genetics, doi:10.1038/s41588-018-0237-2 (2018).

Installation

$ git clone https://github.com/tangerzhang/ALLHiC
$ cd ALLHiC
$ chmod +x bin/*
$ chmod +x scripts/*  
$ export PATH=/your/path/to/ALLHiC/scripts/:/your/path/to/ALLHiC/bin/:$PATH

Dependencies

Following is a list of thirty-party programs that will be used in ALLHIC pipeline.

Running ALLHiC

  • Data Preparation

    • Closely related 'diploid' species (chromosomal level) with gene annotation
    • Hi-C reads (with some coverage)
    • Contig level assembly of target genome with decent N50 and gene annotation
  • Map Hi-C reads to draft assembly

use bwa index and samtools faidx to index your draft genme assembly

bwa index -a bwtsw draft.asm.fasta  
samtools faidx draft.asm.fasta  

Aligning Hi-C reads to the draft assembly

bwa aln -t 24 draft.asm.fasta reads_R1.fastq.gz > sample_R1.sai  
bwa aln -t 24 draft.asm.fasta reads_R2.fastq.gz > sample_R2.sai  
bwa sampe draft.asm.fasta sample_R1.sai sample_R2.sai reads_R1.fastq.gz reads_R2.fastq.gz > sample.bwa_aln.sam  

Filtering SAM file

PreprocessSAMs.pl sample.bwa_aln.sam draft.asm.fasta MBOI
(*)filterBAM_forHiC.pl sample.bwa_aln.REduced.paired_only.bam sample.clean.sam
samtools view -bt draft.asm.fasta.fai sample.clean.sam > sample.clean.bam

*Tip: skip this step if you are using bwa mem for alignment
  • Prune

Next, we will used Allele.ctg.table to prune 1) signals that link alleles and 2) weak signals from BAM files

Usage:  

ALLHiC_prune -i Allele.ctg.table -b sample.clean.bam -r draft.asm.fasta  

Parameters:   
          -h : help and usage  
          -i : Allele.ctg.table  
          -b : input bam file  
          -r : draft.asm.fasta  

Tips: Details on how to identify allelic contigs can be found in the following link: 
https://github.com/tangerzhang/ALLHiC/wiki/ALLHiC:-identify-allelic-contigs
  • Partition

Assign contigs into a pre-defined number of groups (16 groups in this case).

Usage:  

ALLHiC_partition -b prunning.bam -r draft.asm.fasta -e AAGCTT -k 16  

Parameters: 
      -h : help and usage.
      -b : prunned bam (optional, default prunning.bam)
      -r : draft.sam.fasta
      -e : enzyme_sites (HindIII: AAGCTT; MboI: GATC)
      -k : number of groups (user defined K value)
      -m : minimum number of restriction sites (default, 25)
  • Rescue

Assign unplaced contigs into partitioned clusters

Usage:  

ALLHiC_rescue -b sample.clean.bam -r draft.asm.fasta -c clusters.txt -i counts_RE.txt  

Parameters:  
      -h : help and usage.
      -b : sample.clean.bam (unpruned bam)
      -r : draft.sam.fasta
      -c : clusters.txt
      -i : counts_RE.txt
      -m : minimum single density for rescuing contigs (optional, default 0.01)
Tips: 
    clusters.txt (-c) and counts_RE.txt (-i) can be generated by allhic extract.
    The format of clusters.txt is like below:
    #Group  nContigs    Contigs
    48g1    506         tig0000150 tig0000151 tig0000152 
    48g2    692         tig0000097 tig0000114 tig0000231 
    48g3    683         tig0000015 tig0000035 tig0000235
    The first column is group name. The second is number of
    contigs anchored in the group and the third lists contigs 
    in the group.
  • Optimize

ordering and orientation for each group.

Usage:  

allhic extract sample.clean.bam draft.asm.fasta --RE AAGCTT  
allhic optimize group1.txt sample.clean.clm
allhic optimize group2.txt sample.clean.clm
...
allhic optimize group16.txt sample.clean.clm

Parameters:  

allhic extract  
    Input files:  
        bam file and contig-level assembly
    Options:
        --RE value  Restriction site pattern (default: "GATC")

allhic optimize  
    Input files: 
      **counts_RE.txt - counts of restriction sites for each
        contig in a clusterd group, which can be generated 
        by allhic extract. The format is like below:
             #Contig     RECounts    Length
             tig0000001      572     157863
             tig0000002      143     33000
             tig0000003      231     60910
             tig0000004      3789    1044000
             tig0000005      646     166098
             tig0000006      67      15000
             tig0000007      1094    319000
      **clmfile -  The file records basic information of Hi-C
      links between two contigs, including potential ordering
      and orientation, number of supported reads and distance of
      paired-end reads. This file can be accessed by allhic
      extract.
    Options:
           --skipGA       Skip GA step
           --resume       Resume from existing tour file
           --seed value   Random seed (default: 42)
           --npop value   Population size (default: 100)
           --ngen value   Number of generations for convergence (default: 5000)
           --mutpb value  Mutation prob in GA (default: 0.2)

  • Build

convert tour format to fasta sequences and agp location file

Usage:  

ALLHiC_build draft.asm.fasta  

Input file:
    draft.asm.fasta - contig-level assmbly

This step will output a list of superscaffolds and un-achored contigs. Check groups.asm.fasta file.

  • plot

We will generate the chromatin contact matrix to evaluate genome scaffolding

Usage:  

ALLHiC_plot sample.clean.bam groups.agp chrn.list 500k pdf

Input files:
   **bam file   - mapping bam file
   **agp file   - Placement of contigs in each Hi-C groups,
    which can be generated by ALLHiC_build
   **chrn.list  - a list of group name and length. The 
    format of this file is like below:
        group1  125000
        group2  159000
    **bin size - heatmap bin size
    **ext      - extension of plot figures, e.g. pdf 
    

Tested data can be found in the following link

https://pan.baidu.com/s/1caryNh_7tC9GLRdHCWbvDQ
password: hgfl
or from Google drive: https://drive.google.com/file/d/1T58wSCdO4zI5Q4yUs7ggNG7veJTaEBYn/view?usp=sharing