mdozmorov/Bioinformatics_notes

Bioinformatics notes

License: MIT PR's Welcome

Bioinformatics learning and data analysis tips and tricks. Please contribute and get in touch!

Table of contents

Genomics notes

See MDmisc notes for other programming and genomics-related notes.

Awesome

  • Historical perspective on genome sequencing technologies. From the landmark 1953 Watson and Crick publication through sequencing of nucleic acids, shotgun sequencing (Messing, Sanger), Human Genome Project (HGP) & Celera Genomics, milestones in genome assembly, Next and Third generation sequencing, single molecule sequencing (SMRT, ONT, PacBio), long-read genome assemblers (FALCON, Canu). Box 1 - companies developing sequencing technologies, Box 2 - more details on HGP, Box 3 - bioinformatics tools for genome assembly.
    Paper Giani, Alice Maria, Guido Roberto Gallo, Luca Gianfranceschi, and Giulio Formenti. “Long Walk to Genomics: History and Current Approaches to Genome Sequencing and Assembly.” Computational and Structural Biotechnology Journal, November 2019, S2001037019303277. https://doi.org/10.1016/j.csbj.2019.11.002.
  • Milestones in Genomic Sequencing by Nature, 2000-2021 period, interactive infographics

  • Genome Coordinate Cheat Sheet

  • Awesome-Bioinformatics - A curated list of awesome Bioinformatics libraries and software, by Daniel Cook and community-contributed

  • awosome-bioinformatics - A curated list of resources for learning bioinformatics

  • awesome-bioinformatics-education - A curated list of resources specific to learning bioinformatics. Courses, hands-on training, tools.

  • awesome-cancer-variant-databases - A community-maintained repository of cancer clinical knowledge bases and databases focused on cancer variants, by Sean Davis and community-maintained

  • awesome-single-cell - List of software packages for single-cell data analysis, including RNA-seq, ATAC-seq, etc. By Sean Davis and community-maintained

  • awesome-alternative-splicing - Alternative splicing resources

  • awesome-10x-genomics - List of tools and resources related to the 10x genomics GEMCode/Chromium system, by Johan Dahlberg and community-contributed

  • awesome-multi-omics - List of software packages for multi-omics analysis, by Mike Love and community-maintained

  • awesome-bioinformatics-benchmarks - A curated list of bioinformatics benchmarking papers and resources

  • awesome_genome_browsers - genome browsers and genomic visualization tools. By David McGaughey

  • awesome-expression-browser - A curated list of software and resources for exploring and visualizing (browsing) expression data, and more. By Federico Marini and community-maintained

  • awesome-genome-visualization - A list of interesting genome browser or genome-browser-like implementations. By Colin Diesh

  • awesome-microbes - List of software packages (and the people developing these methods) for microbiome (16S), metagenomics (WGS, Shot-gun sequencing), and pathogen identification/detection/characterization. By Steve Tsang and community-contributed

  • aws-for-bioinformatics - Amazon Web Services for Bioinformatics Researchers

  • algorithmsInBioinformatics - Bioinformatics algorithms: Needleman-Wunsch, Feng-Doolittle, Gotoh and Nussinov implemented in Python. By Joachim Wolff, with lecture notes. Also, rklib - Rabin-Karp implementation of sequence substring search for DNA/RNA, and lv89 - C implementation of the Landau-Vishkin algorithm to compute the edit distance between two strings.

  • BAM-tricks - Tips and tricks for BAM files

  • Bedtools Cheatsheet

  • bioicons - A library of free open source icons for science illustrations in biology and chemistry. SVG downloads. GitHub

  • bioconvert - converter between various data formats. Python, command line.

  • biotools - A massive collection of references on the topics of bioinformatics, sequencing technologies, programming, machine learning, and more. By John Didion

  • bio.tools - a community-driven comprehensive and consistent registry of bioinformatics resources. Supported by ELIXIR - the European infrastructure for biological information.

    Paper Ison, Jon, Kristoffer Rapacki, Hervé Ménager, Matúš Kalaš, Emil Rydza, Piotr Chmura, Christian Anthon, et al. “Tools and Data Services Registry: A Community Effort to Document Bioinformatics Resources.” Nucleic Acids Research 44, no. D1 (January 4, 2016): D38–47. https://doi.org/10.1093/nar/gkv1116.
  • BioNumPy - Numpy-like analysis of biological data (FASTQ, BED, BAM, etc.). Indexing, vectorized functions, high efficiency on BED intersection, kmer counting, and VCF operations. Documentation with examples of filtering FASTQ reads, working with BAM files, computing GC content, general handling of sequences, intervals, PWMs.
    Paper Rand, Knut Dagestad, Ivar Grytten, Milena Pavlovic, Chakravarthi Kanduri, and Geir Kjetil Sandve. “BioNumPy: Fast and Easy Analysis of Biological Data with Python.” Preprint. Bioinformatics, December 22, 2022. https://doi.org/10.1101/2022.12.21.521373.
  • Circa - online tool for Circos plot creation

  • JBrowse 2 - JavaScript-based genome browser that includes linear and advanced visualization for synteny, dotplots, breakpoints, gene fusions, whole genome overview, Circos, Spreadsheet view, SV inspector. Many file formats, including Hi-C (hic), BAM/CRAM (various sort pileup options), plain text (VCF, BED, etc.), some specific to particular views. Bookmarks (can be imported from BED), sessions (shareable). Web, desktop versions, can be run from Jupyter notebooks, R (JBrowseR), command line, extended with plugins. Documentation, Demos/tutorials.

    Paper Diesh, Colin, Garrett J Stevens, Peter Xie, Teresa De Jesus Martinez, Elliot A. Hershberg, Angel Leung, Emma Guo, et al. “JBrowse 2: A Modular Genome Browser with Views of Synteny and Structural Variation.” Preprint. Bioinformatics, July 31, 2022. https://doi.org/10.1101/2022.07.28.501447.

Notes by Ming Tang

T2T genome

  • T2T-mhaESC complete mouse genome. Mouse haploid androgenic embryonic stem cells (mhaESC). HiFi (PacBio), ONT (Oxford Nanopore), 60x Illumina short-read sequencing, 112x Hi-C (Arima), BioNano optical maps. hifiasm for genome assembly, with manual correction. 2.77Gbp total size including 251Mbp previously unassembled regions (mostly satellites and structural duplications) but also 1.3Mb deletion on the long arm of chr2. 639 additional protein-coding genes, some duplicated, some completely new. Assembly and characterization of ribosomal DNA arrays, centromeres. GitHub.
    Paper Liu, Junli, Qilin Li, Yixuan Hu, Yi Yu, Kai Zheng, Dengfeng Li, Lexin Qin, and Xiaochun Yu. “The Complete Telomere-to-Telomere Sequence of a Mouse Genome.” Science 386, no. 6726 (December 6, 2024): 1141–46. https://doi.org/10.1126/science.adq8191.
  • Diploid reference genome assembly. Combination of different technologies (Oxford Nanopore ultra-long, PacBio HiFi, BioNano, Hi-C), tools (12 assembly algorithms), combination strategies. Compared with the T2T CHM13 genome. All raw sequence data used in this study are available at the HPRC GitHub. The final HPRC-HG002 curated assemblies are available in the NCBI under the BioProject IDs PRJNA794175 and PRJNA794172, with the accession numbers GCA_021951015.1 and GCA_021950905.1, for the maternal and paternal haplotypes, respectively.
    Paper Jarvis, Erich D., Giulio Formenti, Arang Rhie, Andrea Guarracino, Chentao Yang, Jonathan Wood, Alan Tracey, et al. “Semi-Automated Assembly of High-Quality Diploid Human Reference Genomes.” Nature 611, no. 7936 (November 17, 2022): 519–31. https://doi.org/10.1038/s41586-022-05325-5.
  • T2T-Y - Complete sequencing of human Y chromosome, as a part of the T2T project. Lots of information about corrected errors, complete structures of sex-determining genes, additional protein-coding genes, alternating satellite patterns, centromere structure, repeats. Illumina 33X, PacBio HiFi 42X, ONT 125X sequencing. Table 1 - stats comparison with GRCh38-Y. Improves alignment, variant calling, helps remove human contamination from microbial studies. Code used in the paper.
    Paper Rhie, Arang, Sergey Nurk, Monika Cechova, Savannah J. Hoyt, Dylan J. Taylor, Nicolas Altemose, Paul W. Hook, et al. “The Complete Sequence of a Human Y Chromosome.” Preprint. Genomics, December 1, 2022. https://doi.org/10.1101/2022.12.01.518724.
  • Investigation of non-syntenic regions between hg38 and T2T genome assemblies. Large-scale SVs detected by LASTZ, minigraph, PBSV, PAV. SynPlotter, SafFire, Mummer alignment and visualization tools. Investigation of regions and genes associated with human diseases. Data in supplementary tables.
    Paper Yang, Xiangyu, Xuankai Wang, Yawen Zou, Shilong Zhang, Manying Xia, Lianting Fu, Mitchell R. Vollger, et al. “Characterization of Large-Scale Genomic Differences in the First Complete Human Genome.” Genome Biology 24, no. 1 (July 4, 2023): 157. https://doi.org/10.1186/s13059-023-02995-w.
  • Human pangenome from 47 phased, diploid assemblies from a cohort of genetically diverse individuals. Better captures known variants and haplotypes, and reveals new alleles at structurally complex loci. Adds 119Mbp of euchromatin and 1,115 gene duplications relative to GRCh38. Alignment to the pangenome reference reduces errors and increases the number of detected structural variants. Minigraph, Minigraph-Cactus, PGGB (PanGenome Graph Builder) graph assemblers (GFA format). Links to data and resources at GitHub.
    Paper Liao, WW., Asri, M., Ebler, J. et al. A draft human pangenome reference. Nature 617, 312–324 (2023). https://doi.org/10.1038/s41586-023-05896-x

Pipelines

liftOver

  • Benchmarking of six liftover tools (UCSC liftOver, rtracklayer::liftOver, CrossMap, NCBI Remap, flo and segment liftover) on converting CpG sites and WGBS samples. Considers integrity-preserving and non-integrity-preserving conversion. Explanation of chain file format (Figure 1). segment liftover generates more reliable results than UCSC liftOver.
    Paper Luu, Phuc-Loi, Phuc-Thinh Ong, Thanh-Phuoc Dinh, and Susan J Clark. “Benchmark Study Comparing Liftover Tools for Genome Conversion of Epigenome Sequencing Data.” NAR Genomics and Bioinformatics 2, no. 3 (September 1, 2020): lqaa054. https://doi.org/10.1093/nargab/lqaa054.
  • AirLift - command-line tool (C) for remapping reads between genome assemblies. Very fast, highly accurate in identifying ground truth SNP/InDels in lifted data. Describes limitations of current tools. Methods, lookup tables, eight steps, each read remapped using one of four independent cases. Supports multiple file formats, from BAM to BED. Compared with liftOver, CrossMap. Supplementary - description of other tools (liftOver, CrossMap, NCBI Genome Remapping Service, Segment_liftover, PyLiftover, nf-LO, LevioSAM, Liftoff).
    Paper Kim, Jeremie S., Can Firtina, Meryem Banu Cavlak, Damla Senol Cali, Mohammed Alser, Nastaran Hajinazar, Can Alkan, and Onur Mutlu. “AirLift: A Fast and Comprehensive Technique for Remapping Alignments between Reference Genomes.” arXiv, November 21, 2022. http://arxiv.org/abs/1912.08735.
  • BCFtools/liftover - best performing tool for SNP/InDel liftover. Improved support for multiallelic SNPs and indels. Compared with Picard/LiftoverVcf, CrossMap/VCF, GenomeWarp, Transanno/liftvcf, Genozip/DVCF. Algorithm for maximally extending/normalizing VCF records. Details about chains. Lowest dropout rate, memory usage, fastest.
    Paper Genovese, Giulio, Nicole B Rockweiler, Bryan R Gorman, Tim B Bigdeli, Michelle T Pato, Carlos N Pato, Kiku Ichihara, and Steven A McCarroll. “BCFtools/Liftover: An Accurate and Comprehensive Tool to Convert Genetic Variants across Genome Assemblies,” n.d.
  • CrossMap - genome coordinates conversion between different assemblies (such as hg18 (NCBI36) <=> hg19 (GRCh37)). It supports commonly used file formats including BAM, CRAM, SAM, Wiggle, BigWig, BED, GFF, GTF, MAF, VCF, and gVCF. Documentation

  • HiCLift - coordinate liftover for region pairs. Converts chromatin contacts in pairs format, allValidPairs outputted by HiC-Pro, cool format and hic (highest resolution) format. Uses UCSC chain files, represented as IntervalTrees, and the T2T chain files. Benchmarked using HiCRep on the original and converted matrices.

    Paper Wang, Xiaotao, and Feng Yue. “HiCLift: A Fast and Efficient Tool for Converting Chromatin Interaction Data between Genome Assemblies.” Edited by Tobias Marschall. Bioinformatics 39, no. 6 (June 1, 2023): btad389. https://doi.org/10.1093/bioinformatics/btad389.
  • Liftoff - An accurate GFF3/GTF lift over pipeline

  • liftover - liftover for python, made fast with cython

  • liftOverBedpe - A liftOver wrapper to accommodate BEDPE files, by Doug Phanstiel, requires Python 2.7

  • pairLiftOver - Python package that converts the two-dimensional genomic coordinates of chromatin contact pairs between assemblies, by Xiaotao Wang. Input - 4D-DCIC pairs format or allValidPairs defined by HiC-Pro. Based on pyliftover
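
Most of the tools above rely on UCSC chain files, which describe ungapped aligned blocks between an old (target) and a new (query) assembly. A minimal sketch of how a single position can be lifted with such a file is shown below. It is not the implementation of liftOver, CrossMap, or any other tool listed here: it assumes 0-based coordinates, handles only chains where both strands are `+`, and scans blocks linearly instead of using an interval tree. `TOY_CHAIN`, `parse_chain`, and `lift` are hypothetical names made up for this illustration.

```python
# Toy chain: header line, then "size dt dq" lines, then a final "size" line.
# Header fields: chain score tName tSize tStrand tStart tEnd
#                qName qSize qStrand qStart qEnd id
TOY_CHAIN = """\
chain 1000 chrA 100 + 0 100 chrA 105 + 0 105 1
40 5 10
55
"""


def parse_chain(text):
    """Return ungapped blocks as (t_name, t_start, t_end, q_name, q_start)."""
    blocks = []
    t_name = q_name = None
    t_pos = q_pos = 0
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # blank line separates chains
        if fields[0] == "chain":
            t_name, t_strand, t_pos = fields[2], fields[4], int(fields[5])
            q_name, q_strand, q_pos = fields[7], fields[9], int(fields[10])
            if t_strand != "+" or q_strand != "+":
                # Assumption of this sketch: reverse-strand chains are not handled.
                raise ValueError("this sketch handles only +/+ chains")
        else:
            size = int(fields[0])
            blocks.append((t_name, t_pos, t_pos + size, q_name, q_pos))
            t_pos += size
            q_pos += size
            if len(fields) == 3:
                # dt / dq: gap lengths in target and query before the next block
                t_pos += int(fields[1])
                q_pos += int(fields[2])
    return blocks


def lift(blocks, chrom, pos):
    """Map a 0-based position from the old assembly to the new one, or None."""
    for t_name, t_start, t_end, q_name, q_start in blocks:
        if t_name == chrom and t_start <= pos < t_end:
            return q_name, q_start + (pos - t_start)
    return None  # position falls in a gap or on an unknown chromosome


blocks = parse_chain(TOY_CHAIN)
print(lift(blocks, "chrA", 45))  # inside the second block
print(lift(blocks, "chrA", 42))  # inside the 5-bp target gap, cannot be lifted
```

Positions that fall into a gap between blocks return `None`, which corresponds to the unmapped or dropped records that the benchmarks above compare across tools. For real data, an interval tree or a sorted array with binary search would replace the linear scan.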

k-mers

  • Merqury - reference-free k-mer analysis for genome assemblies. Also assesses completeness, phasing. Works with haplotype-specific assemblies. Visualization, k-mer, hap-mer spectrum plots, can be used to infer CNVs. Inspired by KAT. Benchmarked on model organism and human genomes.
    Paper Rhie, Arang, Brian P. Walenz, Sergey Koren, and Adam M. Phillippy. “Merqury: Reference-Free Quality, Completeness, and Phasing Assessment for Genome Assemblies.” Genome Biology 21, no. 1 (December 2020): 245. https://doi.org/10.1186/s13059-020-02134-9.
  • KAT - The K-mer Analysis Toolkit, compares k-mer spectra. Analyzes jellyfish hashes or FASTQ/FASTA files.

  • kcmbt_mt - A very fast k-mer counter

  • kmertools - kmer based feature extraction tool for bioinformatics, metagenomics, AI/ML and more. Rust, Bioconda installable

  • kmdiff - Differential k-mer analysis

  • Dot - dotplot visualization
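
Several of the tools above count k-mers and compare k-mer spectra. The sketch below illustrates the basic idea only: count canonical k-mers (a k-mer and its reverse complement are treated as the same key), then summarize how many distinct k-mers occur at each multiplicity. It is a plain-Python illustration, not how KAT, Merqury, or the other listed tools are implemented, and the function names `canonical`, `count_kmers`, and `spectrum` are hypothetical.

```python
from collections import Counter

# Translation table for complementing DNA bases
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(seq, k):
    """Count canonical k-mers in a sequence, skipping k-mers with non-ACGT bases."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    return counts


def spectrum(counts):
    """k-mer spectrum: multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


counts = count_kmers("ACGTACGT", 3)
print(dict(counts))            # canonical 3-mers and their counts
print(dict(spectrum(counts)))  # multiplicity histogram
```

Dedicated counters use compact encodings and hash tables or disk-based sorting to scale to whole genomes; a `Counter` over Python strings is only practical for small sequences.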

Courses

Videos

Bioinformatics core organization
