AStarix
is a sequence-to-graph aligner based on the A* shortest path
algorithm. It finds alignments that are optimal according to edit-distance with
non-negative weights. The motivation of AStarix
is to speed up optimal
alignment in the average case. A trie index allows to scale with reference size
[1], while seed heuristic allows to scale with query length [2].
[1]
Ivanov, Bichsel, Mustafa, Kahles, Rätsch, Vechev (RECOMB 2020)
AStarix: Fast and Optimal Sequence-to-Graph Alignment (2020) -- Introduces AStarix
tool, the trie index which allows scaling to bigger references, and Prefix heuristic which outperforms Dijkstra. Evaluations here.
[2] Ivanov, Bichsel, Vechev (RECOMB 2022) Fast and Optimal Sequence-to-Graph Alignment Guided by Seeds -- Introduces Seed heuristic which outperforms other aproaches on Illumina reads and allows to scale to long HiFi reads.
Prerequisites of AStarix
are:
- argp – argument parsing library
- pandas (optional) – dataframe library used for testing and evalutaions
To compile AStarix
and run a test:
make
make test
Third-party libraries are located in the /ext
directory and their own licenses
apply. Tested on Ubuntu 20.04.
This is an example run of AStarix
on a toy data: aligning 100 simulated Illumina reads to a linear graph from the first 10000bp of Escherichia coli. You can expect a similar summary printed to standard output:
$ make test
release/astarix align-optimal -a astar-seeds -t 1 -v 0 -g data/ecoli_head10000_linear/graph.gfa -q data/ecoli_head10000_linear/illumina.fq -o tmp/ecoli_head10000_linear/astar-seeds --fixed_trie_depth 1 --seeds_len 25
----
Assert() checks: OFF
verbose: 0
----
Loading reference graph... Added reverse complement... done in 0.001078s.
Loading queries... done in 0.000216s.
Contructing trie... done in 0.007017s.
Initializing A* heuristic... done in 1.6e-05s.
== General parameters and optimizations ==
Alignment algo: astar-seeds
Edit costs: 0, 1, 5, 5 (match, subst, ins, del)
Greedy match?: true
Threads: 1
== A* parameters ==
seed length: 25 bp
skip near crumbs: 1
== Data ==
Original reference: 9826 nodes, 9824 edges
Trie: 5186 nodes, 24822 edges, depth=7
Reference+ReverseRef+Trie: 24838 nodes, 44470 edges, density: 0
Trie: 5186 nodes, 24822 edges, depth=7
Reference+ReverseRef+Trie: 24838 nodes, 44470 edges, density: 0
Reads: 100 x 99bp, coverage: 1.00774x
Avg phred value: 0.13640%
== Aligning statistics ==
Explored rate (avg): 1.70707 states/read_bp
States with crumbs: 0.02992%
Explored states: 0.01737%
Skipped states: 99.95271%
Pushed rate (avg, max): 0.77130, 0.01400 [states/bp] (states normalized by query length)
Popped rate (avg, max): 0.09310, 0.00140
Average popped: 7.16000 from trie (76.90655%) vs 2.15000 from ref (per read)
Total cost of aligned reads: 15, 0.15000 per read, 0.15152% per letter
Alignments: 100 unique (100.00000%), 0 ambiguous (0.00000%) and 0 overcost (0.00000%)
== Heuristic stats (astar-seeds) ==
For all reads:
Seeds: 400 (4.00000 per read)
Seed matches: 385 (3.85000 per read, 0.96250 per seed)
States with crumbs: 29105 [+0.00000% repeated], (291.05000 per read)
Heuristic (avg): 0.10000 of potential 4.00000
== Performance ==
Memory: measured | estimated
total: 0.00510gb, 100% | -
reference: 0.00267gb, 52.39521% | 5.74505%
reads: 0.00000gb, 0.00000% | 0.18091%
trie: 0.00100gb, 19.53593% | 5.82224%
equiv. classes opt.: 0.00000gb, 0.00000%
A*-memoization: 0.00000gb, 0.00000
Total wall runtime: 0.04086s
reference loading: 0.00108s
queries loading: 0.00022s
construct trie: 0.00702s
precompute: 0.00002s
align (wall time): 0.03040s = 3289.04470 reads/s = 325.61543 Kbp/s
Total align cpu time: 0.02998s = 3335.11206 reads/s = 330.17609 Kbp/s
| Preprocessing: 31.61686%
| A* query: 18.47652%
| greedy_match: 11.42609%
| other: 38.48052%
DONE
The directory tmp/ecoli_head10000_linear/astar-seeds/
will contain a file with execution logs and a file with alignment statistics.
Short aggregated statistics are print to standard output (to redirect, you can add >summary.txt
).
AStarix
only finds optimal alignments (specified by argument align-optimal
). Currently supported formats are .gfa
without overlapping nodes (for a graph reference) and .fa
/.fasta
(for a linear reference). The queries should be in .fq
/.fastq
format (the phred values are ignored).
$ astarix --help
----
Usage: astarix [OPTION...] align-optimal -g GRAPH.gfa -q READS.fq -o OUT_DIR/
Optimal sequence-to-graph aligner based on A* shortest path.
-a, --algorithm={dijkstra, astar-prefix, astar-seeds}
Shortest path algorithm
-c, --prefix_cost_cap=A*_COST_CAP
The maximum prefix cost for the A* heuristic
-d, --prefix_len_cap=A*_PREFIX_CAP
The upcoming sequence length cap for the A*
heuristic
-D, --tree_depth=TREE_DEPTH Suffix tree depth
-e, --prefix_equivalence_classes=A*_EQ_CLASSES
Whether to partition all nodes to equivalence
classes in order not to reuse the heuristic
--fixed_trie_depth=FIXED_TRIE_DEPTH
Some leafs depth can be less than tree_depth
(variable=0, fixed=1)
-f, --greedy_match=GREEDY_MATCH
Proceed greedily forward if there is a unique
matching outgoing edge
-g, --graph=GRAPH Input graph (.gfa)
-G, --gap=GAP_COST Gap (Insertion or Deletion) penalty [5]
-k, --k_best_alignments=TOP_K Output at most k optimal alignments per read
[1]
-M, --match=MATCH_COST Match penalty [0]
-o, --outdir=OUTDIR Output directory
-q, --query=QUERY Input queries/reads (.fq, .fastq)
--seeds_len=A*_SEED_LEN The length of the A* seeds.
--seeds_skip_near_crumbs={0,1}
-S, --subst=SUBST_COST Substitution penalty [1]
-t, --threads=THREADS Number of threads [1]
-v, --verbose=THREADS Verbosity (silent=0, info=1, debug=2), [0]
-?, --help Give this help list
--usage Give a short usage message
-V, --version Print program version