minSH is a short working implementation that aligns sequences astar.py
implements A* with seed heuristic
def build_seed_heuristic(A, B, k):
"""Builds the admissible seed heuristic for A and B with k-mers."""
kmers = { B[j:j+k] for j in range(len(B)-k+1) } # O(nk): Index all kmers from B. O(n) with rolling hash
seeds = [ A[i:i+k] for i in range(0, len(A)-k+1, k) ] # O(n): Split A into non-overlapping seeds of length k.
is_seed_missing = [ s not in kmers for s in seeds ] # O(n): Is seed unseen in B, so >=1 edit must be made to align it.
suffix_sum = np.cumsum(is_seed_missing[::-1])[::-1] # O(n): How many of the remaining seeds have to be edited
return lambda ij, k=k: suffix_sum[ ceildiv(ij[0], k) ] # O(1): How many seeds starting after the current position i have to be edited?
Next, we just use the seed heuristic for a starndard A* on the alignment graph A x B
:
h_seed = build_seed_heuristic(A, B, k=log(len(A)))
astar(A, B, h_seed)
We can look at aligning sequences A and B as a process of sequentially aligning longer and longer prefixes of
The simplest algorithm we can use is Dijkstra's algorithm which finds a shortest path of length
The A* algirthm is a generalization of Dijkstra's algorithm that explores the nodes
- admissible (i.e. to never exceed the remaining distance $d(u,t)$), or otherwise the found path may not be shortest.
- accurate in estimating
$dist(s,u)$ , or otherwise the search will not be directly going to$t$ - fast to be computed for each explored node, or otherwise, the A* algorithm will be slow in practice
To use in your projects, simply add this repository as a dependency:
pip install git+https://github.com/pesho-ivanov/minSH.git
Then call the align
function with two strings A
and B
:
import math
from minsh.astar import h_dijkstra, align, build_seedh, print_stats
A = "ACGT" * 100
B = "AGCT" * 100
alphabet_size = 4
k = math.ceil(math.log(len(A), alphabet_size))
h_seed = build_seedh(A, B, k)
g_seed = align(A, B, h_seed)
print_stats(A, B, k, g_seed)
The library can also be used as a standalone command-line script:
$ astar data/small_A.fa data/small_B.fa
> Aligning sequences with len(A)=18, k=3:
> Edit distance: 3
> Error rate: 16.67%
> Explored band: 3.39
To explore additional tooling - clone the repository and install the dependencies:
git clone https://github.com/pesho-ivanov/minSH && cd minSH
pip install -r requirements.txt
python scripts/test.py # tests
python scripts/generate.py # synthetic FASTA files
Optimizations:
- rolling hash: for linear time precomputation
- greedy matching (aka sliding)
- pruning, using index trees
Presentation:
- visualization of the alignment (png, gif)
- interactivity for adding patches
- report stats
- benchmark
minSH is inspired by minGPT to be small, clean, interpretable and educational re-implementation of the recent aligning approach based on the A* shortest path algorithm.
Detailed Publications on A* for alignment
AStarix semi-global seq-to-graph aligner:
- Ivanov et al., (RECOMB 2020) — Introduces A* with a trie for semi-global.
- Ivanov et al., (RECOMB 2022) — Introduces SH for read alignment on general graph references using trie.
A*PA global seq-to-seq aligner:
- Groot Koerkamp and Ivanov (preprint 2023) — Applies SH to global alignment (edit distance). Generalizes SH with chaining, inexact matches, gap costs (for higher error rates). Optimizes SH with pruning (for near-linear scaling with length), and A* with diagonal transition (for faster quadratic mode). Licensed under the Mozilla Public License, Version 2.0. In short, you are free to use and abuse, but give it back to the community.