-
Notifications
You must be signed in to change notification settings - Fork 39
Home
Phasing and scaffolding polyploid genomes based on Hi-C data
The major problem of scaffolding polyploid genome is that Hi-C signals are frequently detected between allelic haplotypes and any existing stat of art Hi-C scaffolding program links the allelic haplotypes together. To solve the problem, we developed a new Hi-C scaffolding pipeline, called ALLHIC, specifically tailored to the polyploid genomes. ALLHIC pipeline contains a total of 5 steps: prune, partition, rescue, optimize and build.
Figure 1. Overview of major steps in ALLHiC algorithm. The newly released ALLHiC pipeline contains a total of 5 functions: prune, partition, rescue, optimize and build. Briefly, the prune step removes the inter-allelic links so that the homologous chromosomes are more easily separated individually. The partition function takes pruned bam file as input and clusters the linked contigs based on the linkage suggested by Hi-C, presumably along the same homologous chromosome in a preset number of partitions. The rescue function searches for contigs that are not involved in partition step from original un-pruned bam files and assigned them to specific clusters according Hi-C signal density. The optimize step takes each partition, and optimize the ordering and orientations for all the contigs. Finally, the build step reconstructs each chromosome by concatenating the contigs, adding gaps between the contigs and generating the final genome release in FASTA format.
Prune function will firstly allow us to detect allelic contigs, which can be achieved by identifying syntenic genes based on a well-assembled close related species or an assembled monoploid genome. Signals (normalized Hi-C reads) between allelic contigs are removed from the input BAM files. In polyploid genome assembly, haplotypes that share high similarity are likely to be collapsed. Signals between the collapsed regions and nearby phased haplotypes result in chimeric scaffolds. In the prune step, only the best linkage between collapsed coting and phased contig is retained.
Figure 2. Description of Hi-C scaffolding problem in polyploid genome and application of prune approach for haplotype phasing. (a) a schematic diagram of auto-tetraploid genome. Four homologous chromosomes are indicated as different colors (blue, orange, green and purple, respectively). Red regions in the chromosomes indicate sequences with high similarity. (b) Detection of Hi-C signals in the auto-tetraploid genome. Black dash lines indicate Hi-C signals between collapsed regions and un-collpased contigs. Pink dash lines indicate inter-haplotype Hi-C links and grey dash lines indicate intra-haplotype Hi-C links. During assembly, red regions will be collapsed due to high sequence similarity; while, other regions will be separated into different contigs if they have abundant variations. Since the collapsed regions are physically related with contigs from different haplotypes, Hi-C signals will be detected between collapsed regions with all other un-collapsed contigs. (c) Traditional Hi-C scaffolding methods will detect signals among contigs from different haplotypes as well as collapsed regions and cluster all the sequences together. (d) Prune Hi-C signals: 1- remove signals between allelic regions; 2- only retain the strongest signals between collapsed regions and un-collapsed contigs. (e) Partition based on pruned Hi-C information. Contigs are ideally phased into different groups based on prune results.
Zhang, X. Zhang, S. Zhao, Q. Ming, R. Tang, H. Assembly of allele-aware, chromosomal scale autopolyploid genomes based on Hi-C data. Nature Plants, doi:10.1038/s41477-019-0487-8 (2019).
Zhang, J. Zhang, X. Tang, H. Zhang, Q. et al. Allele-defined genome of the autopolyploid sugarcane Saccharum spontaneum L. Nature Genetics, doi:10.1038/s41588-018-0237-2 (2018).
$ git clone https://github.com/tangerzhang/ALLHiC
$ cd ALLHiC
$ chmod +x bin/*
$ chmod +x scripts/*
$ export PATH=/your/path/to/ALLHiC/scripts/:/your/path/to/ALLHiC/bin/:$PATH
Following is a list of thirty-party programs that will be used in ALLHIC pipeline.
-
Data Preparation
- Closely related 'diploid' species (chromosomal level) with gene annotation
- Hi-C reads (with some coverage)
- Contig level assembly of target genome with decent N50 and gene annotation
-
Map Hi-C reads to draft assembly
use bwa index and samtools faidx to index your draft genme assembly
bwa index -a bwtsw draft.asm.fasta
samtools faidx draft.asm.fasta
Aligning Hi-C reads to the draft assembly
bwa aln -t 24 draft.asm.fasta reads_R1.fastq.gz > sample_R1.sai
bwa aln -t 24 draft.asm.fasta reads_R2.fastq.gz > sample_R2.sai
bwa sampe draft.asm.fasta sample_R1.sai sample_R2.sai reads_R1.fastq.gz reads_R2.fastq.gz > sample.bwa_aln.sam
Filtering SAM file
PreprocessSAMs.pl sample.bwa_aln.sam draft.asm.fasta MBOI
(*)filterBAM_forHiC.pl sample.bwa_aln.REduced.paired_only.bam sample.clean.sam
samtools view -bt draft.asm.fasta.fai sample.clean.sam > sample.clean.bam
*Tip: skip this step if you are using bwa mem for alignment
- Prune
Next, we will used Allele.ctg.table to prune 1) signals that link alleles and 2) weak signals from BAM files
Usage:
ALLHiC_prune -i Allele.ctg.table -b sample.clean.bam -r draft.asm.fasta
Parameters:
-h : help and usage
-i : Allele.ctg.table
-b : input bam file
-r : draft.asm.fasta
Tips: Details on how to identify allelic contigs can be found in the following link:
https://github.com/tangerzhang/ALLHiC/wiki/ALLHiC:-identify-allelic-contigs
- Partition
Assign contigs into a pre-defined number of groups (16 groups in this case).
Usage:
ALLHiC_partition -b prunning.bam -r draft.asm.fasta -e AAGCTT -k 16
Parameters:
-h : help and usage.
-b : prunned bam (optional, default prunning.bam)
-r : draft.sam.fasta
-e : enzyme_sites (HindIII: AAGCTT; MboI: GATC)
-k : number of groups (user defined K value)
-m : minimum number of restriction sites (default, 25)
- Rescue
Assign unplaced contigs into partitioned clusters
Usage:
ALLHiC_rescue -b sample.clean.bam -r draft.asm.fasta -c clusters.txt -i counts_RE.txt
Parameters:
-h : help and usage.
-b : sample.clean.bam (unpruned bam)
-r : draft.sam.fasta
-c : clusters.txt
-i : counts_RE.txt
-m : minimum single density for rescuing contigs (optional, default 0.01)
Tips:
clusters.txt (-c) and counts_RE.txt (-i) can be generated by allhic extract.
The format of clusters.txt is like below:
#Group nContigs Contigs
48g1 506 tig0000150 tig0000151 tig0000152
48g2 692 tig0000097 tig0000114 tig0000231
48g3 683 tig0000015 tig0000035 tig0000235
The first column is group name. The second is number of
contigs anchored in the group and the third lists contigs
in the group.
- Optimize
ordering and orientation for each group.
Usage:
allhic extract sample.clean.bam draft.asm.fasta --RE AAGCTT
allhic optimize group1.txt sample.clean.clm
allhic optimize group2.txt sample.clean.clm
...
allhic optimize group16.txt sample.clean.clm
Parameters:
allhic extract
Input files:
bam file and contig-level assembly
Options:
--RE value Restriction site pattern (default: "GATC")
allhic optimize
Input files:
**counts_RE.txt - counts of restriction sites for each
contig in a clusterd group, which can be generated
by allhic extract. The format is like below:
#Contig RECounts Length
tig0000001 572 157863
tig0000002 143 33000
tig0000003 231 60910
tig0000004 3789 1044000
tig0000005 646 166098
tig0000006 67 15000
tig0000007 1094 319000
**clmfile - The file records basic information of Hi-C
links between two contigs, including potential ordering
and orientation, number of supported reads and distance of
paired-end reads. This file can be accessed by allhic
extract.
Options:
--skipGA Skip GA step
--resume Resume from existing tour file
--seed value Random seed (default: 42)
--npop value Population size (default: 100)
--ngen value Number of generations for convergence (default: 5000)
--mutpb value Mutation prob in GA (default: 0.2)
- Build
convert tour format to fasta sequences and agp location file
Usage:
ALLHiC_build draft.asm.fasta
Input file:
draft.asm.fasta - contig-level assmbly
This step will output a list of superscaffolds and un-achored contigs. Check groups.asm.fasta file.
- plot
We will generate the chromatin contact matrix to evaluate genome scaffolding
Usage:
ALLHiC_plot sample.clean.bam groups.agp chrn.list 500k pdf
Input files:
**bam file - mapping bam file
**agp file - Placement of contigs in each Hi-C groups,
which can be generated by ALLHiC_build
**chrn.list - a list of group name and length. The
format of this file is like below:
group1 125000
group2 159000
**bin size - heatmap bin size
**ext - extension of plot figures, e.g. pdf
https://pan.baidu.com/s/1caryNh_7tC9GLRdHCWbvDQ
password: hgfl
or from Google drive:
https://drive.google.com/file/d/1T58wSCdO4zI5Q4yUs7ggNG7veJTaEBYn/view?usp=sharing
© 2017 - present, ALLHiC authors