Question: Suggestions for handling . wildcards within query patterns #123
-
Hi Andrew,

First and foremost, thanks for creating and maintaining this package - it truly is awesome and incredibly powerful! I'm currently using the crate to build large automatons containing 100k - 1M DNA patterns, where each pattern is < 25 characters long. These enable me to search large genomic databases/graphs for many query patterns simultaneously. The text I'm querying is comprised solely of the ACGT alphabet, and my patterns can contain ACGT or N, where N is a special character that matches any character. Do you have a recommendation for how best to handle this type of wildcard (N = `.` in regexes) within the automaton?

My current naive approach is just to enumerate all unambiguous versions of each pattern (e.g. ANG would generate AAG, ACG, AGG and ATG) and store them all in the automaton. This works fairly well when patterns contain only a few wildcards, but the combinatorics become memory prohibitive in some of the applications I'm exploring. Any suggestions for how to better handle these wildcards would be much appreciated!

Thomas
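The naive enumeration described above can be sketched as follows (a minimal Rust illustration; `expand_wildcards` is a hypothetical helper, not part of the crate):

```rust
// Expand a pattern over {A, C, G, T, N} into all unambiguous patterns:
// every `N` is replaced by each of the four bases, multiplying the
// number of patterns by 4 per wildcard (hence the memory blow-up).
fn expand_wildcards(pattern: &str) -> Vec<String> {
    let mut results = vec![String::new()];
    for ch in pattern.chars() {
        // `N` expands to every base; any other character stands for itself.
        let choices: Vec<char> = if ch == 'N' { vec!['A', 'C', 'G', 'T'] } else { vec![ch] };
        let mut next = Vec::with_capacity(results.len() * choices.len());
        for prefix in &results {
            for &c in &choices {
                let mut s = prefix.clone();
                s.push(c);
                next.push(s);
            }
        }
        results = next;
    }
    results
}

fn main() {
    // ANG expands to the four unambiguous patterns from the post.
    assert_eq!(expand_wildcards("ANG"), vec!["AAG", "ACG", "AGG", "ATG"]);
    println!("{:?}", expand_wildcards("ANG"));
}
```

A pattern with k wildcards produces 4^k unambiguous patterns, which is exactly the combinatorial growth the question is asking how to avoid.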
Replies: 3 comments 1 reply
-
Enumerating all cases is what I would suggest if the total number of patterns doesn't get too crazy. If enumerating all of them is infeasible (i.e., would result in more than low millions), then the next suggestion I have would be to search for a common prefix of the set of all enumerated patterns. So for example, if you have `ATCNG`, then you'd search for `ATC` and then run another search to confirm whether a match actually exists at that location. This strategy only works if your prefix leads to a low false positive rate of candidates. `ATC`, for example, is probably short enough that if you're searching DNA, you'll probably have a very high false positive rate. A long prefix doesn't guarantee…

Otherwise you might just instead build a regex. The problem is that 100k - 1M patterns might wind up being too big for a crate like `regex` to handle.

If none of those work, then you might want to build something bespoke. For example, copy the non-contiguous NFA implementation from this crate and adapt it to support basic wildcards.
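The prefix-candidate strategy might look roughly like this (a self-contained Rust sketch: a naive scan stands in for the `AhoCorasick` automaton over all literal prefixes, and `wildcard_match`/`find_with_prefix` are made-up names, not crate APIs):

```rust
// Does `window` match `pattern`, treating `N` as a wildcard?
fn wildcard_match(pattern: &str, window: &str) -> bool {
    pattern.len() == window.len()
        && pattern.bytes().zip(window.bytes()).all(|(p, w)| p == b'N' || p == w)
}

// Stage 1: find candidate positions via the literal prefix (in practice,
// an aho-corasick automaton built over all prefixes would do this step).
// Stage 2: verify the full wildcard pattern at each candidate.
fn find_with_prefix(pattern: &str, prefix_len: usize, haystack: &str) -> Vec<usize> {
    let prefix = &pattern[..prefix_len];
    let mut hits = Vec::new();
    if haystack.len() < pattern.len() {
        return hits;
    }
    for start in 0..=haystack.len() - pattern.len() {
        if haystack[start..].starts_with(prefix)
            && wildcard_match(pattern, &haystack[start..start + pattern.len()])
        {
            hits.push(start);
        }
    }
    hits
}

fn main() {
    // `ATC` occurs twice as a candidate, but only the occurrence at
    // offset 2 survives verification against the full `ATCNG` pattern;
    // the one at offset 7 is a false positive of the prefix stage.
    let hay = "GGATCTGATCAA";
    assert_eq!(find_with_prefix("ATCNG", 3, hay), vec![2]);
}
```

The second occurrence of `ATC` in the example is exactly the false-positive candidate the reply warns about: with a 4-letter alphabet, short prefixes occur everywhere, so the verification stage can dominate the runtime.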
-
Another variant: you could split each ur-pattern with wildcards into multiple patterns. So e.g. "ATCNATG" becomes "ATC" and "ATG". And you keep a separate list, in some relevant data structure, of these special wildcard patterns. Then, to match, you'd confirm that the split pieces occur at the correct relative offsets around each candidate hit.
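A rough sketch of this splitting idea (plain Rust; `segments` and `confirm` are hypothetical helpers, and the candidate position would come from an automaton hit on one of the literal segments in practice):

```rust
// Break a wildcard pattern into its literal segments, remembering each
// segment's offset within the pattern. "ATCNATG" -> [(0, "ATC"), (4, "ATG")].
fn segments(pattern: &str) -> Vec<(usize, String)> {
    let mut segs = Vec::new();
    let mut start = None;
    for (i, ch) in pattern.char_indices() {
        match (ch, start) {
            ('N', Some(s)) => {
                segs.push((s, pattern[s..i].to_string()));
                start = None;
            }
            ('N', None) => {}
            (_, None) => start = Some(i),
            _ => {}
        }
    }
    if let Some(s) = start {
        segs.push((s, pattern[s..].to_string()));
    }
    segs
}

// Given a candidate start offset in the text (e.g. from an automaton hit
// on one segment), confirm that every literal segment occurs at its
// correct relative position; the `N` positions are left unchecked.
fn confirm(text: &str, start: usize, pattern_len: usize, segs: &[(usize, String)]) -> bool {
    start + pattern_len <= text.len()
        && segs.iter().all(|(off, s)| text[start + off..].starts_with(s.as_str()))
}

fn main() {
    let segs = segments("ATCNATG");
    assert_eq!(segs, vec![(0, "ATC".to_string()), (4, "ATG".to_string())]);
    // "ATCGATG" appears at offset 2; the `G` under the `N` is ignored.
    assert!(confirm("GGATCGATGCC", 2, 7, &segs));
}
```

A design note: it usually pays to put only the longest segment of each pattern into the automaton (fewer, rarer candidates) and verify the remaining segments with the offset check.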
-
If the main issue is memory usage, then the simplest approach (building on the naive enumeration of all unambiguous patterns) would be to do multiple searches with fewer patterns in each search, would it not? Btw, aho-corasick v1.0.4 should also be using significantly less memory than earlier versions. There is also an alternative crate implementing a faster variant of Aho-Corasick (daachorse), which is perhaps less well documented and less feature-rich than this repo; it is reported to be 3x to 5x faster while consuming more than 50% less memory (compared against aho-corasick before the recent memory optimizations).
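The chunked-search idea amounts to trading time for memory: build one automaton per batch of patterns so only one batch is resident at a time. A minimal sketch (the naive inner loop here is a stand-in for `AhoCorasick::new(chunk)` + `find_iter`, so the example runs without the crate; `search_in_chunks` is a made-up name):

```rust
// Search `haystack` for all patterns, processing them in fixed-size
// chunks. With the real crate, each chunk would get its own automaton,
// built, searched, and dropped before the next chunk is processed.
fn search_in_chunks<'a>(
    patterns: &'a [&'a str],
    chunk_size: usize,
    haystack: &str,
) -> Vec<(&'a str, usize)> {
    let mut hits = Vec::new();
    for chunk in patterns.chunks(chunk_size) {
        // In practice: let ac = AhoCorasick::new(chunk)?; ac.find_iter(haystack)...
        for &p in chunk {
            let mut from = 0;
            while let Some(pos) = haystack[from..].find(p) {
                hits.push((p, from + pos));
                from += pos + 1;
            }
        }
    }
    hits
}

fn main() {
    // The four expansions of ANG, searched two at a time.
    let pats = ["AAG", "ACG", "AGG", "ATG"];
    let hits = search_in_chunks(&pats, 2, "TTACGATGA");
    assert_eq!(hits, vec![("ACG", 2), ("ATG", 5)]);
}
```

The trade-off: each chunk requires a full pass over the text, so k chunks cost roughly k times the scan time in exchange for 1/k of the peak automaton memory.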