human_customized_genome_lib

Brian Haas edited this page Jul 23, 2023 · 1 revision

Customization of the human CTAT reference genome lib

Several customizations are applied to both the human reference genome and its annotations to facilitate detection of certain types of fusion transcripts. These customizations are present in the default provided human CTAT genome libs, and they are also applied when prep_genome_lib.pl is run to build a human lib with the '--human_gencode_filter' option.

These modifications include the following:

GTF annotation updates

  • Readthrough (readthru) transcripts with long introns (at least 100 kb) are discarded.

  • IGH and IGL gene annotations are augmented with IG-superloci that span each entire locus on both strands. These appear in the annotation as:

    • IGH.g@-ext and IGH-.g@-ext
    • IGL.g@-ext and IGL-.g@-ext
  • The boundaries of the following genes are extended at each end by the number of bases indicated:

    • CRLF2, 50kb
    • MALT1, 40kb
    • DUX4, 10kb
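
The intron-length filter and the boundary extensions above can be illustrated with a minimal Python sketch. This is not the code used by prep_genome_lib.pl. The function names, the `is_readthru` flag, the 1-based inclusive coordinates, and the treatment of the 100 kb threshold as inclusive are all assumptions made for illustration; only the threshold and the per-gene extension sizes come from the list above.

```python
# Illustrative sketch only; names and data layout are hypothetical.

MIN_LONG_INTRON = 100_000  # 100 kb threshold from the list above (assumed inclusive)

# Per-gene extension sizes taken from the list above.
GENE_EXTENSIONS = {"CRLF2": 50_000, "MALT1": 40_000, "DUX4": 10_000}


def max_intron_length(exons):
    """Return the longest gap between consecutive exons.

    exons: list of (start, end) tuples, 1-based inclusive coordinates.
    """
    exons = sorted(exons)
    longest = 0
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        longest = max(longest, next_start - prev_end - 1)
    return longest


def drop_long_intron_readthrus(transcripts):
    """Discard transcripts flagged as readthrough that contain a long intron.

    transcripts: dict of transcript id -> {"is_readthru": bool, "exons": [...]}.
    How a transcript is recognized as readthrough is left to the caller.
    """
    return {
        tid: t
        for tid, t in transcripts.items()
        if not (t["is_readthru"]
                and max_intron_length(t["exons"]) >= MIN_LONG_INTRON)
    }


def extend_gene(name, start, end):
    """Extend a gene's boundaries at each end; genes not listed are unchanged."""
    pad = GENE_EXTENSIONS.get(name, 0)
    return max(1, start - pad), end + pad
```

For example, `extend_gene("CRLF2", 1_000_000, 1_020_000)` returns `(950000, 1070000)`, while a gene that is not in the table is returned unchanged. The coordinates here are arbitrary and do not correspond to the real gene locations.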

Genome fasta file masking

  • Regions of the genome that are homologous to DUX4 and SEPTIN14, corresponding to paralogs or pseudogenes, are masked out. These regions are identified with blastn by searching the reference transcript sequences against the reference genome sequence, which is done at CTAT genome lib build time.

  • Pseudoautosomal regions (PAR) on chrY are masked, including 50 kb on either side of the PAR features.
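
As a rough illustration of interval masking with flanking padding, the following Python sketch pads each interval, merges overlaps, and overwrites the covered bases. It is not the CTAT implementation. The function names are hypothetical, and the use of 'N' as the masking character is an assumption, since this page does not say how masked bases are represented. Coordinates are 1-based and inclusive.

```python
# Illustrative sketch only; names and the 'N' mask character are assumptions.

def pad_interval(start, end, pad, seq_len):
    """Widen an interval by pad bases on each side, clipped to the sequence."""
    return max(1, start - pad), min(seq_len, end + pad)


def merge_intervals(intervals):
    """Merge overlapping or directly adjacent intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def mask_sequence(seq, intervals, pad=0, mask_char="N"):
    """Return seq with every padded interval replaced by mask_char."""
    padded = [pad_interval(s, e, pad, len(seq)) for s, e in intervals]
    out = bytearray(seq, "ascii")
    for start, end in merge_intervals(padded):
        out[start - 1:end] = mask_char.encode("ascii") * (end - start + 1)
    return out.decode("ascii")
```

With a toy sequence, `mask_sequence("ACGTACGTAC", [(4, 5)], pad=1)` gives `"ACNNNNGTAC"`. For the PAR case described above, the padding would be 50,000 and the intervals would be the PAR feature coordinates on chrY, which are not listed on this page.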