minSH is a short working implementation that aligns sequences astar.py
(~50 loc) implements A* with seed heuristic
def build_seed_heuristic(A, B, k):
"""Builds the admissible seed heuristic for A and B with k-mers."""
kmers = { B[j:j+k] for j in range(len(B)-k+1) } # O(nk): Index all kmers from B. O(n) with rolling hash
seeds = [ A[i:i+k] for i in range(0, len(A)-k+1, k) ] # O(n): Split A into non-overlapping seeds of length k.
is_seed_missing = [ s not in kmers for s in seeds ] # O(n): Is seed unseen in B, so >=1 edit must be made to align it.
suffix_sum = np.cumsum(is_seed_missing[::-1])[::-1] # O(n): How many of the remaining seeds have to be edited
return lambda ij, k=k: suffix_sum[ ceildiv(ij[0], k) ] # O(1): How many seeds starting after the current position i have to be edited?
Next, we just use the seed heuristic for a starndard A* on the alignment graph A x B
:
h_seed = build_seed_heuristic(A, B, k=log(len(A)))
astar(A, B, h_seed)
We can look at aligning sequences A and B as a process of sequentially aligning longer and longer prefixes of
The simplest algorithm we can use is Dijkstra's algorithm which finds a shortest path of length
The A* algirthm is a generalization of Dijkstra's algorithm that explores the nodes
- admissible (i.e. to never exceed the remaining distance $d(u,t)$), or otherwise the found path may not be shortest.
- accurate in estimating
$dist(s,u)$ , or otherwise the search will not be directly going to$t$ - fast to be computed for each explored node, or otherwise, the A* algorithm will be slow in practice
Prerequisites:
pip install rolling
pip install numpy
pip install heapq
pip install fenwick
Run tests first:`
python test.py
astar.py
takes k
and a file with two strings (A
and B
), and returns the exact edit distance ed(A,B)
between them:
python minsh/astar.py data/small_A.fa data/small_B.fa
Optimizations:
- rolling hash: for linear time precomputation
- greedy matching (aka sliding)
- pruning, using index trees
Presentation:
- visualization of the alignment (png, gif)
- interactivity for adding patches
- report stats
- benchmark
minSH is inspired by minGPT to be small, clean, interpretable and educational re-implementation of the recent aligning approach based on the A* shortest path algorithm.
Detailed Publications on A* for alignment
AStarix semi-global seq-to-graph aligner:
- Ivanov et al., (RECOMB 2020) — Introduces A* with a trie for semi-global.
- Ivanov et al., (RECOMB 2022) — Introduces SH for read alignment on general graph references using trie.
A*PA global seq-to-seq aligner:
- Groot Koerkamp and Ivanov (preprint 2023) — Applies SH to global alignment (edit distance). Generalizes SH with chaining, inexact matches, gap costs (for higher error rates). Optimizes SH with pruning (for near-linear scaling with length), and A* with diagonal transition (for faster quadratic mode). Licensed under the Mozilla Public License, Version 2.0. In short, you are free to use and abuse, but give it back to the community.