Question: Suggestions for handling . wildcards within query patterns #123
-
Hi Andrew,

First and foremost, thanks for creating and maintaining this package - it truly is awesome and incredibly powerful! I'm currently using the crate to build large automatons containing 100k - 1M DNA patterns, where each pattern is < 25 characters long. These enable me to search large genomic databases/graphs for many query patterns simultaneously. The text I'm querying is comprised solely of the ACGT alphabet, and my patterns can contain ACGT or N, where N is a special character that matches any character. Do you have a recommendation for how best to handle this type of wildcard (N = `.` in regexes) within the automaton?

My current naive approach is just to enumerate all unambiguous versions of each pattern (e.g. ANG would generate AAG, ACG, AGG and ATG) and store them all in the automaton. This works fairly well when patterns contain only a few wildcards, but the combinatorics become memory prohibitive in some of the applications I'm exploring. Any suggestions for how to better handle these wildcards would be much appreciated!

Thomas
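The naive enumeration described above can be sketched as follows (a minimal Rust illustration; `expand_wildcards` is a hypothetical helper, not part of the crate):

```rust
// Expand a pattern over {A, C, G, T, N} into all unambiguous patterns:
// every `N` is replaced by each of the four bases, multiplying the
// number of patterns by 4 per wildcard (hence the memory blow-up).
fn expand_wildcards(pattern: &str) -> Vec<String> {
    let mut results = vec![String::new()];
    for ch in pattern.chars() {
        // `N` expands to every base; any other character stands for itself.
        let choices: Vec<char> = if ch == 'N' { vec!['A', 'C', 'G', 'T'] } else { vec![ch] };
        let mut next = Vec::with_capacity(results.len() * choices.len());
        for prefix in &results {
            for &c in &choices {
                let mut s = prefix.clone();
                s.push(c);
                next.push(s);
            }
        }
        results = next;
    }
    results
}

fn main() {
    // ANG expands to the four unambiguous patterns from the post.
    assert_eq!(expand_wildcards("ANG"), vec!["AAG", "ACG", "AGG", "ATG"]);
    println!("{:?}", expand_wildcards("ANG"));
}
```

A pattern with k wildcards produces 4^k unambiguous patterns, which is exactly the combinatorial growth the question is asking how to avoid.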
Replies: 3 comments 1 reply
-
Enumerating all cases is what I would suggest if the total number of patterns doesn't get too crazy. If enumerating all of them is infeasible (i.e., would result in more than low millions), then the next suggestion I have would be to search for a common prefix of the set of all enumerated patterns. So for example, if you have `ATCNG`, then you'd search for `ATC` and then run another search to confirm whether a match actually exists at that location. This strategy only works if your prefix leads to a low false positive rate of candidates. `ATC`, for example, is probably short enough that if you're searching DNA, you'll probably have a very high false positive rate. A long prefix doesn't guarantee…

Otherwise you might just instead build a regex. The problem is that 100k - 1M patterns might wind up being too big for a crate like `regex` to handle.

If none of those work, then you might want to build something bespoke. For example, copy the non-contiguous NFA implementation from this crate and adapt it to support basic wildcards.
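The prefix-candidate strategy might look roughly like this (a self-contained Rust sketch: a naive scan stands in for the `AhoCorasick` automaton over all literal prefixes, and `wildcard_match`/`find_with_prefix` are made-up names, not crate APIs):

```rust
// Does `window` match `pattern`, treating `N` as a wildcard?
fn wildcard_match(pattern: &str, window: &str) -> bool {
    pattern.len() == window.len()
        && pattern.bytes().zip(window.bytes()).all(|(p, w)| p == b'N' || p == w)
}

// Stage 1: find candidate positions via the literal prefix (in practice,
// an aho-corasick automaton built over all prefixes would do this step).
// Stage 2: verify the full wildcard pattern at each candidate.
fn find_with_prefix(pattern: &str, prefix_len: usize, haystack: &str) -> Vec<usize> {
    let prefix = &pattern[..prefix_len];
    let mut hits = Vec::new();
    if haystack.len() < pattern.len() {
        return hits;
    }
    for start in 0..=haystack.len() - pattern.len() {
        if haystack[start..].starts_with(prefix)
            && wildcard_match(pattern, &haystack[start..start + pattern.len()])
        {
            hits.push(start);
        }
    }
    hits
}

fn main() {
    // `ATC` occurs twice as a candidate, but only the occurrence at
    // offset 2 survives verification against the full `ATCNG` pattern;
    // the one at offset 7 is a false positive of the prefix stage.
    let hay = "GGATCTGATCAA";
    assert_eq!(find_with_prefix("ATCNG", 3, hay), vec![2]);
}
```

The second occurrence of `ATC` in the example is exactly the false-positive candidate the reply warns about: with a 4-letter alphabet, short prefixes occur everywhere, so the verification stage can dominate the runtime.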
-
Another variant: you could split each ur-pattern with wildcards into multiple patterns. So e.g. "ATCNATG" becomes "ATC" and "ATG". And you keep a separate list, in some relevant data structure, of these special wildcard patterns. Then, to match, you'd confirm that the split pieces occur at the correct relative offsets around each candidate hit.
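A rough sketch of this splitting idea (plain Rust; `segments` and `confirm` are hypothetical helpers, and the candidate position would come from an automaton hit on one of the literal segments in practice):

```rust
// Break a wildcard pattern into its literal segments, remembering each
// segment's offset within the pattern. "ATCNATG" -> [(0, "ATC"), (4, "ATG")].
fn segments(pattern: &str) -> Vec<(usize, String)> {
    let mut segs = Vec::new();
    let mut start = None;
    for (i, ch) in pattern.char_indices() {
        match (ch, start) {
            ('N', Some(s)) => {
                segs.push((s, pattern[s..i].to_string()));
                start = None;
            }
            ('N', None) => {}
            (_, None) => start = Some(i),
            _ => {}
        }
    }
    if let Some(s) = start {
        segs.push((s, pattern[s..].to_string()));
    }
    segs
}

// Given a candidate start offset in the text (e.g. from an automaton hit
// on one segment), confirm that every literal segment occurs at its
// correct relative position; the `N` positions are left unchecked.
fn confirm(text: &str, start: usize, pattern_len: usize, segs: &[(usize, String)]) -> bool {
    start + pattern_len <= text.len()
        && segs.iter().all(|(off, s)| text[start + off..].starts_with(s.as_str()))
}

fn main() {
    let segs = segments("ATCNATG");
    assert_eq!(segs, vec![(0, "ATC".to_string()), (4, "ATG".to_string())]);
    // "ATCGATG" appears at offset 2; the `G` under the `N` is ignored.
    assert!(confirm("GGATCGATGCC", 2, 7, &segs));
}
```

A design note: it usually pays to put only the longest segment of each pattern into the automaton (fewer, rarer candidates) and verify the remaining segments with the offset check.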
-
If the main issue is memory usage, then the simplest approach (building on the naive enumeration of all unambiguous patterns) would be to do multiple searches with fewer patterns in each search, would it not? Btw, aho-corasick v1.0.4 should also be using significantly less memory than earlier versions. There is also an alternative crate implementing a faster variant of Aho-Corasick (daachorse), which is perhaps less well documented and less feature-rich than this repo; it is reported to be 3x to 5x faster while consuming more than 50% less memory (compared against aho-corasick before the recent memory optimizations).
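The chunked-search idea amounts to trading time for memory: build one automaton per batch of patterns so only one batch is resident at a time. A minimal sketch (the naive inner loop here is a stand-in for `AhoCorasick::new(chunk)` + `find_iter`, so the example runs without the crate; `search_in_chunks` is a made-up name):

```rust
// Search `haystack` for all patterns, processing them in fixed-size
// chunks. With the real crate, each chunk would get its own automaton,
// built, searched, and dropped before the next chunk is processed.
fn search_in_chunks<'a>(
    patterns: &'a [&'a str],
    chunk_size: usize,
    haystack: &str,
) -> Vec<(&'a str, usize)> {
    let mut hits = Vec::new();
    for chunk in patterns.chunks(chunk_size) {
        // In practice: let ac = AhoCorasick::new(chunk)?; ac.find_iter(haystack)...
        for &p in chunk {
            let mut from = 0;
            while let Some(pos) = haystack[from..].find(p) {
                hits.push((p, from + pos));
                from += pos + 1;
            }
        }
    }
    hits
}

fn main() {
    // The four expansions of ANG, searched two at a time.
    let pats = ["AAG", "ACG", "AGG", "ATG"];
    let hits = search_in_chunks(&pats, 2, "TTACGATGA");
    assert_eq!(hits, vec![("ACG", 2), ("ATG", 5)]);
}
```

The trade-off: each chunk requires a full pass over the text, so k chunks cost roughly k times the scan time in exchange for 1/k of the peak automaton memory.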