Explore using Aho-Corasick in Regex's FindFirstChar #62447
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
Issue Details
Given an alternation like "abc|def|ghi", we will currently do an IndexOfAny to find the next position for a possible match. At that point, we try to avoid false positives by doing a few set lookups, e.g. we IndexOfAny('a', 'd', 'g'), and then we test whether the next character is in the set [beh]. But specifically for expressions that are small alternations, or that begin with small alternations, or that begin with constructs we can convert to small alternations, we could instead use Aho-Corasick to actually validate whether any of the strings in the alternation are a prefix of the text at the current location. Rust uses a vectorized implementation named Teddy.
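To make the difference concrete, here is a rough sketch of the two strategies for "abc|def|ghi". The class and method names are hypothetical and this is not the actual Regex code: `FindCandidate` approximates the IndexOfAny plus set-lookup filter described above, and `FindValidated` uses a naive prefix check (not Aho-Corasick) only to show what full validation buys.

```csharp
using System;

static class AlternationScanSketch
{
    // Approximates the candidate search for "abc|def|ghi": find a possible
    // first character, then check the following character against [beh].
    public static int FindCandidate(ReadOnlySpan<char> text)
    {
        int pos = 0;
        while (pos < text.Length)
        {
            int i = text.Slice(pos).IndexOfAny('a', 'd', 'g');
            if (i < 0) return -1;
            pos += i;
            if (pos + 1 < text.Length && "beh".IndexOf(text[pos + 1]) >= 0)
                return pos; // may still be a false positive, e.g. "ae"
            pos++;
        }
        return -1;
    }

    // Validates each position by checking whether a whole alternative
    // is a prefix of the remaining text (naive stand-in for Aho-Corasick).
    public static int FindValidated(ReadOnlySpan<char> text, string[] alternatives)
    {
        for (int pos = 0; pos < text.Length; pos++)
        {
            foreach (string alt in alternatives)
            {
                if (text.Slice(pos).StartsWith(alt.AsSpan(), StringComparison.Ordinal))
                    return pos;
            }
        }
        return -1;
    }
}
```

On the input "xaexdef", the filter stops at the false positive "ae" while the validated search skips ahead to "def", so the matcher is only invoked at a position where one of the alternatives really starts.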
|
@stephentoub I remember we discussed whether a |
Yes.
I'd rather focus on getting something working and validated inside of Regex. Regex can do all the precomputation and optimize the provided expression with whatever mechanism it can throw at it. Creating a custom API for a specific kind of pattern would necessitate data showing it's so common on such hot paths that any additional overhead associated with Regex is prohibitive. |
@teo-tsirpanis if you're looking for a meaty project that could give us significant regex perf improvements in 7.0 -- perhaps you're interested in creating a vectorized implementation of Aho-Corasick. Teddy is apparently permissively licensed, so it could be a starting point. We would like to do this, but it doesn't fit on the schedule right now. |
Thanks for letting me know @danmoseley. It is indeed a very interesting project, but I am pretty busy these days with pressing obligations, including my university classes that started this week and the work on my thesis. Speaking of this, I had the thought to make this my thesis' subject. I talked with some professors (that was the reason for my delay) and they were willing to accept it, but I was warned that "this is much more than an undergraduate thesis, maybe even a postgraduate thesis" and that "it's not something that will finish in 1-2 months" (I linked your message but forgot to explicitly tell them about Teddy though). To better understand the magnitude of this undertaking, a more detailed list of what needs to be done, as well as a more informed guess of the required time by adding a |
Yes to be clear, no pressure whatsoever, just flagging as maybe interesting to you or someone. Hmm, it's a significant project but I do not think this should be huge work. For the most part it would be isolated from Regex: it would be implementing an (internal) API essentially like @stephentoub thoughts? |
Yes...ish. Source generation is a wrinkle. For that, we'd either need to emit the code for the algorithm, which I don't think we want to do, or we would need to expose some public API the generated code could use. And while we could leave such a thing to just be for the non-source generated regexes, that would complicate the story for when to pick source generation over not. I think the actual story would need to be some sort of API for it, which we'd then use from all the regex engines. But, just getting to the point where we could test and demonstrate the potential would be valuable. Figuring out the API from there shouldn't be a big deal. It would likely be something where you pass to a factory all the target strings, and then the resulting object would provide a search method that took a span, returned an index, and potentially also provided the target string (or its id) that was found. |
That makes sense -- thinking of this as an isolated experiment to implement a high performance |
(obviously not that signature) |
Making this concrete in discussion with @stephentoub, we would propose this shape:
internal class MultiStringSearcher
{
    public MultiStringSearcher(string[] strings);
    public (int Index, int StringNumber) Find(ReadOnlySpan<char> text);
}
Our first implementation would be to use Aho-Corasick with the whole DFA precomputed (looks like #45697 was a start). There would be no grow-up NFA mode: we would presumably fall back to naive search if there are "too many" search strings. We assume that this would be beneficial enough to Regex to stand alone. Teddy is a natural next step, cutting over from Aho-Corasick when the hardware supports it and there are relatively few search strings (Rust regex seems to cap at 32). As a detail, the Rust Teddy crate itself uses Rabin-Karp for very short strings or (I guess) leftovers. Note, great credit is due to @BurntSushi and others for their work on the impressive Rust crates which help us chart our basic path here. |
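To pin down the contract of that shape, here is a minimal naive baseline, roughly what the "fall back to naive search" path could look like. It is only a sketch: the class name, the `(-1, -1)` no-match result, and the tie-break (first string in constructor order wins at a given position) are assumptions, not decisions made in this thread.

```csharp
using System;

// Hypothetical naive baseline with the proposed shape; not Aho-Corasick.
internal sealed class NaiveMultiStringSearcher
{
    private readonly string[] _strings;

    public NaiveMultiStringSearcher(string[] strings)
    {
        _strings = strings ?? throw new ArgumentNullException(nameof(strings));
    }

    // Returns the leftmost position where any string matches, together with
    // that string's index in the constructor array. Returns (-1, -1) when
    // nothing matches (an assumed convention).
    public (int Index, int StringNumber) Find(ReadOnlySpan<char> text)
    {
        for (int pos = 0; pos < text.Length; pos++)
        {
            ReadOnlySpan<char> rest = text.Slice(pos);
            for (int n = 0; n < _strings.Length; n++)
            {
                // Empty strings are skipped so they cannot match everywhere.
                if (_strings[n].Length > 0 &&
                    rest.StartsWith(_strings[n].AsSpan(), StringComparison.Ordinal))
                {
                    return (pos, n);
                }
            }
        }
        return (-1, -1);
    }
}
```

A baseline like this is also handy as an oracle when testing a faster implementation: both should agree on `Index` for every input.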
Since I was pinged, I'll share some very high level thoughts. Please take these with a heap of salt because I have approximately zero familiarity with .NET's regex engine or constraints. :-)
That's all I've got for now. Happy to answer questions! |
I've started working on this. I will first create a separate project that implements the Teddy algorithm using the cross-platform vector intrinsics and falls back to Aho-Corasick. Once benchmarks show an improvement over |
@teo-tsirpanis sounds great. For the sake of maximizing the chance of success and making digestible PRs I wonder whether it would make sense to put up a change with just A-C first, which I expect would be fast enough to stand alone. A general purpose API is proposed above. If it’s to satisfy the needs of regex it will presumably need to return the leftmost match (as BurntSushi alludes to above; you’ll need to figure out from the Rust crate how he achieves this leftmost match behavior since it’s not the textbook A-C behavior.) Presumably leftmost match is a reasonable behavior for a general purpose API. Also, regex (at least as it currently exists) only needs the start index of the leftmost match. It doesn’t need the matched string, and it certainly doesn’t need any other strings (overlapping) matched at that position. Given that, a general purpose API might best only return the index (the sketch API above suggested both index and match). If I understand A-C correctly, this means building it can be quicker/simpler - the automaton will not need “suffix links” (aka “dictionary links”) only the “error links”. So the ordering I’m imagining would be something like
Getting the public API in first is a nice sub step but it also avoids the need to emit the implementation from the regex source generator, which can only call public API. |
(Of course feel free to disagree with suggestion above if you think there’s a better ordering that could still avoid one big PR) |
Yes, starting with AC and leaving Teddy for later is a good idea.
My implementation remembers the match's location upon encountering one and keeps looping over the characters, accepting subsequent matches only if they occur before the last we accepted. If we return to the root node and have seen a match, we stop the loop and return it. The Rust implementation is doing this by modifying the trie construction; I will investigate if it's worth it.
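As an illustration of the loop described above, here is a rough sketch. It is not the actual implementation from this work: the type and member names are made up, transitions use dictionaries rather than a precomputed DFA, and only the start index is returned. Each state stores the length of the longest pattern ending there (inherited along failure links), which gives the earliest start among matches ending at the current character.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of leftmost search with explicit failure links.
internal sealed class LeftmostAhoCorasickSketch
{
    private readonly List<Dictionary<char, int>> _next = new List<Dictionary<char, int>>();
    private readonly List<int> _fail = new List<int>();
    // Length of the longest pattern that ends at this state (0 if none).
    private readonly List<int> _matchLength = new List<int>();

    public LeftmostAhoCorasickSketch(string[] patterns)
    {
        AddState(); // state 0 is the root
        foreach (string pattern in patterns)
        {
            int state = 0;
            foreach (char c in pattern)
            {
                if (!_next[state].TryGetValue(c, out int child))
                {
                    child = AddState();
                    _next[state][c] = child;
                }
                state = child;
            }
            if (pattern.Length > 0) _matchLength[state] = pattern.Length;
        }

        // Breadth-first pass: compute failure links and inherit matches.
        var queue = new Queue<int>();
        foreach (int first in _next[0].Values) queue.Enqueue(first); // fail stays 0
        while (queue.Count > 0)
        {
            int node = queue.Dequeue();
            foreach (KeyValuePair<char, int> edge in _next[node])
            {
                int f = _fail[node];
                while (f != 0 && !_next[f].ContainsKey(edge.Key)) f = _fail[f];
                if (_next[f].TryGetValue(edge.Key, out int target)) f = target;
                _fail[edge.Value] = f;
                if (_matchLength[edge.Value] == 0) _matchLength[edge.Value] = _matchLength[f];
                queue.Enqueue(edge.Value);
            }
        }
    }

    // Returns the start index of the leftmost match, or -1 if there is none.
    public int Find(ReadOnlySpan<char> text)
    {
        int state = 0;
        int best = -1;
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            while (state != 0 && !_next[state].ContainsKey(c)) state = _fail[state];
            if (_next[state].TryGetValue(c, out int target)) state = target;

            // Back at the root with a match recorded: any later match starts after it.
            if (state == 0 && best >= 0) break;

            int length = _matchLength[state];
            if (length > 0)
            {
                int start = i - length + 1;
                if (best < 0 || start < best) best = start; // accept only earlier starts
            }
        }
        return best;
    }

    private int AddState()
    {
        _next.Add(new Dictionary<char, int>());
        _fail.Add(0);
        _matchLength.Add(0);
        return _next.Count - 1;
    }
}
```

With the patterns "abcd" and "bc", the text "xabcd" first records "bc" at index 2 and then replaces it with "abcd" at index 1, while "xabcx" returns 2 as soon as the automaton falls back to the root.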
I think that in a general purpose API returning the matched string is useful; it will be included in my proposal which will be filed soon. I also ran some benchmarks. I took a random Project Gutenberg book and counted the first
|
How does it handle the case when you have the patterns Another thing to consider is whether you plan on building the DFA variant of AC. In that case, you can't really reason about root nodes and failure transitions at search time. |
Yes, exactly, it returns if it reaches the root or input ends. We can reconsider this once we get to implementing the DFA variant. |
Even though this is being worked on now, it is not tracked as a must-have for 7.0, so changing the milestone accordingly. |
Moving to 9.0 |
Superseded by #85693 |