Earlier_assembly_releases_and_associated_data.md

Previous assembly releases of T2T-CHM13

v1.1

Complete T2T reconstruction of a human genome. Changes from v1.0 include filled rDNA gaps and improved polishing within telomeres. At chr9:134589924, one rare heterozygous variant causing a premature stop codon was changed to the more common allele. Also available at NCBI GCA_009914755.3. Changes made from v1.0 to v1.1 are available as a VCF.

v1.0

Complete T2T reconstruction of a human genome, with the exception of 5 known gaps within the rDNA arrays. Polished assembly based on v0.9. Introduces 4 structural corrections and 993 small variant corrections, including a 4 kb telomere extension on chr18. Polishing was performed using a conservative custom pipeline based on DeepVariant calls, and structural corrections were manually curated. Consensus quality exceeds Q60. Until a preprint is drafted, a brief summary can be found at this blog post. Also available at NCBI GCA_009914755.2. Changes made from v0.9 to v1.0 are available as a VCF.
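To put the quality values in these notes in perspective, the sketch below converts between a Phred-scaled QV and a per-base error probability using the standard definition QV = -10 · log10(p). The function names are illustrative and not part of any T2T tooling, and this is not the method used to estimate the assembly's consensus quality.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality value."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in the interval (0, 1]")
    return -10 * math.log10(error_rate)


if __name__ == "__main__":
    # Quality values quoted in these release notes (Q37 and Q42 for v0.7, Q60 for v0.9 and later).
    for qv in (37, 42, 60):
        p = qv_to_error_rate(qv)
        print(f"Q{qv}: roughly one error per {1 / p:,.0f} bases")
```

Under this definition, Q60 corresponds to about one error per million bases.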

v0.9

T2T reconstruction of all 23 chromosomes of CHM13 based on a custom assembly pipeline, briefly featuring:

  1. Homopolymer compression and self-correction of PacBio HiFi reads
  2. Rescoring of overlaps to account for recurrent PacBio HiFi errors
  3. Construction and custom pruning of a string graph built over 100% identical overlaps
  4. Manual reconstruction of chromosomal paths through the graph, if necessary aided by ultra-long Nanopore reads
  5. Layout/consensus of original HiFi reads, corresponding to the resulting paths
  6. Patching of regions absent from HiFi data with v0.7 draft sequences
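As an illustration of the first step, homopolymer compression collapses every run of identical bases into a single base, so that reads differing only in homopolymer length become identical. The following is a minimal sketch of that idea only; the function names are hypothetical, and it does not reproduce the custom pipeline's actual implementation or its self-correction step.

```python
from itertools import groupby
from typing import List, Tuple


def homopolymer_compress(seq: str) -> str:
    """Collapse each run of identical characters into a single character."""
    return "".join(base for base, _ in groupby(seq))


def homopolymer_runs(seq: str) -> List[Tuple[str, int]]:
    """Return (base, run_length) pairs, enough to restore the original sequence."""
    return [(base, len(list(run))) for base, run in groupby(seq)]


if __name__ == "__main__":
    # Two toy reads that differ only in the length of their homopolymer runs.
    read_a = "GATTTACCA"
    read_b = "GATTACCCA"
    print(homopolymer_compress(read_a))
    print(homopolymer_compress(read_b))
    print(homopolymer_compress(read_a) == homopolymer_compress(read_b))
```

Keeping the run lengths alongside the compressed sequence, as `homopolymer_runs` does, is one way to map back to uncompressed coordinates later, for example when computing consensus from the original reads.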

Consensus quality exceeds Q60. The mitochondrial DNA sequence is included. Centers of the 5 rDNA arrays are represented by N-gaps.

v0.7

Assembly draft v0.7 was generated with Canu v1.7.1, including rel1 data up to 2018/11/15 and incorporating the previously released PacBio data. Two gaps on ChrX, plus the centromere, were manually resolved. Contigs with low coverage support were split, and the assembly was scaffolded with BioNano. The assembly was polished with two rounds of nanopolish and two rounds of arrow. ChrX was polished using unique markers matched between the assembly and the raw read data; the rest of the genome used traditional polishing. Finally, the assembly was polished with 10X Genomics data.

We validated the assembly using independent BACs. The overall QV is estimated to be Q37 (Q42 in unique regions), and the assembly resolves over 80% of available CHM13 BACs (280/341). The assembly is 2.94 Gbp in size, with 359 scaffolds (448 contigs) and an NG50 of 83 Mbp (70 Mbp for contigs). Outside of Chr8 and ChrX, this should be considered a draft and likely has mis-assemblies. Older unpolished assemblies are available for benchmarking purposes, but are of lower quality and should not be used for analyses. Also available at NCBI GCA_009914755.1.
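For readers unfamiliar with the NG50 statistic: it is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of an assumed genome size. The sketch below shows that calculation on toy values. The function name is illustrative, and the genome size used for the figures reported here is not stated in these notes, so none is assumed.

```python
from typing import Iterable, Optional


def ng50(lengths: Iterable[int], genome_size: int) -> Optional[int]:
    """Return the NG50 of the given sequence lengths, or None if undefined.

    Sequences are visited from longest to shortest; NG50 is the length of the
    sequence at which the running total first reaches half the genome size.
    """
    half = genome_size / 2
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if total >= half:
            return length
    # The sequences cover less than half of the genome size.
    return None


if __name__ == "__main__":
    # Toy values only; these are not the v0.7 scaffold or contig lengths.
    toy_lengths = [80, 70, 50, 30, 20]
    print(ng50(toy_lengths, genome_size=300))
```

Unlike N50, which uses the total assembly length as the denominator, NG50 uses a fixed genome size, which makes values comparable across assemblies of different total lengths.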

Downloads