regexp: port RE2's DFA matcher to the regexp package #11646
CC @rsc |
IIRC, @rsc and @robpike were discussing this a few weeks ago, and @rsc said he'd hit some design stumbling block where the C++ implementation didn't have an obvious Go translation. Also, you were mentioning that the UTF-8-edness of Go's regexp package and its runes makes the 256-sized tables (which are just bytes in the C++ version) potentially tricky. Not sure whether restricting the DFA optimization to ASCII-only regexps helps or not. |
One problem @rsc mentioned was the allocation of state data, but he gave me a solution, so I don't think that will be a big problem. After looking some more at the rune vs. byte issue, I don't think we would need to skip the DFA for non-ASCII regexps. The DFA makes a set of character ranges and indexes the output transitions by range. We could do the same range lookup for input runes < 256 (or some other arbitrary table size) and do a slower lookup for larger runes. Of course, there are a number of other things that don't map easily. I think most of those will have replacements, and for any that don't we'll be able to do something slightly slower than RE2 but still much faster than the NFA matcher. |
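To make the rune-vs-byte idea concrete, here is a rough, hypothetical sketch (not code from the proposal or from the regexp package; all names are invented): a DFA state keeps a dense next-state table for runes below 256 and falls back to a binary search over sorted rune ranges for everything larger.

```go
package main

import (
	"fmt"
	"sort"
)

// rangeTransition is an outgoing DFA edge covering an inclusive rune range.
type rangeTransition struct {
	lo, hi rune
	next   int // index of the destination state
}

// dfaState keeps a dense table for runes < 256 and sorted ranges for the rest.
type dfaState struct {
	small [256]int          // next state for runes < 256; -1 means no transition
	large []rangeTransition // sorted, non-overlapping ranges for runes >= 256
}

// next returns the destination state for rune r, or -1 if there is none.
func (s *dfaState) next(r rune) int {
	if r < 256 {
		return s.small[r] // fast path: direct table lookup
	}
	// Slower path for larger runes: binary search the sorted range list.
	i := sort.Search(len(s.large), func(i int) bool { return s.large[i].hi >= r })
	if i < len(s.large) && s.large[i].lo <= r {
		return s.large[i].next
	}
	return -1
}

func main() {
	var s dfaState
	for i := range s.small {
		s.small[i] = -1
	}
	s.small['a'] = 1                                 // ASCII transition via the fast table
	s.large = []rangeTransition{{0x4E00, 0x9FFF, 2}} // a CJK range via the slow path
	fmt.Println(s.next('a'), s.next('世'), s.next('!')) // prints: 1 2 -1
}
```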
I would like to understand why the NFA is not performant first. I am afraid that adding a DFA (I presume for performance reasons) may hit the same problems. |
@michaelmatloob and I talked about this at Gophercon. I'm okay with adding this provided:
|
I'm working on this and hope to have it ready for 1.7. (How do I add myself as an assignee?) |
CL https://golang.org/cl/22189 mentions this issue. |
CL https://golang.org/cl/22246 mentions this issue. |
This isn't going to get in for Go 1.7. |
For those interested in trying it out, I've got a standalone copy of my work thus far in matloob.io/regexp. It passes all the regexp tests and is 2-4x faster than regexp for hard regexps and large inputs. It's 5-10x faster for the new BenchmarkMatchHard1 benchmark. |
@matloob plan on sending new CLs this cycle? |
Ooh I really want to... but... rsc had some ideas that I want to look at before moving forward with this. I think things will need to be on hold till he gets back. In the meantime I'll sync the cl to tip and re-check-in an even harder benchmark. |
Russ would like to implement the ideas in the following paper: https://doi.org/10.1002/spe.2436. I haven't had a chance to read the paper yet (and it's behind a paywall). |
If it helps, there is a video of the seminar that Chivers held about one month before the paper was published. The "preview" idea seems elegant and more appealing (to me, at least) than dealing with SIMD instructions for various architectures. The first page of the paper mentions the use of loop counting to implement counted repetitions. If that idea is of interest, there are at least two papers about "extended" finite automata that predate Chivers' work. I must also point out that Smith, Estan and Jha patented their work. |
The Teddy algorithm used in Hyperscan is much better than even RE2's DFA. Perhaps somebody can translate it from the Rust implementation at https://github.com/jneem/teddy. |
@adsouza Note that the Teddy algorithm is a specific form of literal optimization that only really applies to a small number of small literals. |
Has anyone evaluated the DFA implementation in https://github.com/google/codesearch/blob/master/regexp/match.go yet? |
@junyer The implementation in codesearch keeps a state cache of size 256 for each DFA state (although the alphabet spans the size of a rune). A preview approach might be to use Unicode character ranges to create a preview DFA of depth 3 to filter out invalid byte sequences. This presents a tradeoff between the number of states (from the depth value) and the size of the DFA cache, which we could tune to our liking. |
If you can read Rust, then this is exactly what the utf8-ranges crate does: https://github.com/BurntSushi/utf8-ranges There isn't much code, so it should be pretty easy to port. The idea is itself inspired by what RE2 does. This approach is how you get the alphabet size down to 256, even while supporting all of Unicode. You can further reduce the alphabet size by merging the symbols in your alphabet that are indistinguishable from each other for a specific regex. @rsc talks about this in his third regexp article IIRC. |
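For illustration, here is a small, hypothetical Go sketch of the "merge indistinguishable symbols" idea mentioned above (the names and structure are my own, not the regexp package's or utf8-ranges'): given the byte ranges a pattern actually distinguishes, every byte value is mapped to a class ID, and per-state transition tables can then be indexed by class instead of by raw byte.

```go
package main

import "fmt"

// byteClasses maps every byte value to a class ID such that two bytes get the
// same ID exactly when none of the supplied inclusive ranges separates them.
// It returns the lookup table and the number of distinct classes.
func byteClasses(ranges [][2]byte) ([256]byte, int) {
	// Mark the byte values at which a new equivalence class must begin.
	var boundary [256]bool
	boundary[0] = true
	for _, r := range ranges {
		boundary[r[0]] = true
		if r[1] < 255 {
			boundary[r[1]+1] = true // guard against byte overflow at 255
		}
	}
	var classes [256]byte
	id := byte(0)
	for b := 0; b < 256; b++ {
		if b > 0 && boundary[b] {
			id++
		}
		classes[b] = id
	}
	return classes, int(id) + 1
}

func main() {
	// A pattern that only distinguishes [a-z] and [0-9] needs 5 classes,
	// not 256 table entries per DFA state.
	classes, n := byteClasses([][2]byte{{'a', 'z'}, {'0', '9'}})
	fmt.Println(n, classes['a'], classes['m'], classes['5'], classes['!']) // 5 3 3 1 0
}
```

With only a handful of distinct ranges in a pattern, the effective alphabet often drops from 256 symbols to a dozen or fewer, which shrinks every per-state table accordingly.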
Regexps are still slow as hell #26943 |
@kamilgregorczyk, you can file constructive bug reports without saying things are "slow as hell" or "unacceptable". People are more likely to deprioritize bug reports from people who seem rude and more likely to help people being polite. |
1. Performance of regexp in Go is unacceptable, that's the truth; there's nothing rude about it.
2. Such a design flaw (which has been known for quite a long time) affects my and most likely other people's work, and apparently some built-ins aren't optimized in any way AND there are no plans to fix it.
That's really bad for a language, as I thought I could trust Go; in fact I wanted to start using it for my commercial projects, but for now it goes back to the toy bucket again (the first time was when I tried iris and its owner deleted all commits).
|
What do you even mean by 'unacceptable performance'? Do you care to define/explain? Do you mean the happy case, the average, or the worst one? For each of those you get different answers. Do you know that Go is actually orders of magnitude faster in some worst cases (https://swtch.com/%7Ersc/regexp/regexp1.html) compared to PCRE? Have you heard about https://en.wikipedia.org/wiki/ReDoS, for example? The best mix of performance/safety characteristics in the different cases is a design choice, not a simple or universal "truth" as falsely claimed. |
Of course it differs; in some cases it will be fine and in some it won't, and it can break any benchmarks/tests/whatever. It's still slower than Java/Python/Ruby in my case and in some other cases which were reported. What scares me away is that there are no plans to even start fixing it.
|
You could say something like "I find its performance unacceptable" or "It's unacceptable for my application", but saying that it's flat-out "unacceptable" is what's coming across as rude in your tone. It's been almost 9 years of people largely accepting its performance, even if there are always things that could be better. The bug is one such thing. |
Since
|
Let's keep this open. This is a specific task with good discussion and history. |
If there are a lot of routing prefixes to be validated, regexp compilation can become a performance bottleneck. Aside: This could be because Go's regexp implementation might be relatively slow compared to other languages: golang/go#11646
* Cache regexp compilations during string validation. If there are a lot of routing prefixes to be validated, regexp compilation can become a performance bottleneck. Aside: This could be because Go's regexp implementation might be relatively slow compared to other languages: golang/go#11646
* Add RWLocks for global caches and add benchmark test
Any updates? |
Store blobs inline in the index. This way we can mmap them in, avoiding file open costs and using the page cache. Sadly, this rules out compression except at the file system level.

Also restore the original codesearch Regexp. Unlike the Go regexp, this one has a DFA implementation that seems to perform much faster on case-insensitive searches (while having spikier memory usage per regex). See [Issue 11646] for details.

# Future

See the comment about [regex previews], which may be a good route for even faster parsing of large files. Notably, we spend a lot of time eliminating non-matching files, so this would help.

Also, consider some basic query*file caching noting lack of matches. Perhaps a bloom filter?

[Issue 11646]: golang/go#11646 (comment)
[regex previews]: golang/go#11646 (comment)
The regexp package currently chooses between the standard NFA matcher, onepass, or the backtracker. This proposal is for porting over RE2's DFA matcher to be used as an option by exec.
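As a rough illustration only (the type, field, and function names below are hypothetical, not the regexp package's internals), the proposal amounts to adding one more branch to the existing engine-selection logic, with the DFA used when it applies and the NFA matcher kept as the general fallback:

```go
package main

import "fmt"

// prog stands in for a compiled pattern plus the caller's request; the fields
// are invented for illustration.
type prog struct {
	onePassOK   bool // deterministic enough for the one-pass engine
	smallEnough bool // small program and input: backtracking is safe
	needsCaps   bool // caller wants submatch positions (a plain DFA run can't report them)
}

// chooseEngine shows one plausible way a DFA could slot in as another option:
// prefer the cheap special-case engines, try the DFA when captures aren't
// needed, and fall back to the NFA matcher otherwise.
func chooseEngine(p prog) string {
	switch {
	case p.onePassOK:
		return "onepass"
	case p.smallEnough:
		return "backtrack"
	case !p.needsCaps:
		return "dfa" // the proposed addition
	default:
		return "nfa"
	}
}

func main() {
	fmt.Println(chooseEngine(prog{onePassOK: true})) // onepass
	fmt.Println(chooseEngine(prog{}))                // dfa
	fmt.Println(chooseEngine(prog{needsCaps: true})) // nfa
}
```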