Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regexp: port RE2's DFA matcher to the regexp package #11646

Open
michaelmatloob opened this issue Jul 9, 2015 · 36 comments
Open

regexp: port RE2's DFA matcher to the regexp package #11646

michaelmatloob opened this issue Jul 9, 2015 · 36 comments
Labels
NeedsFix The path to resolution is known, but the work has not been done. Performance Proposal Proposal-Accepted
Milestone

Comments

@michaelmatloob
Copy link
Contributor

The regexp package currently chooses between the standard NFA matcher, onepass, or the backtracker. This proposal is for porting over RE2's DFA matcher to be used as an option by exec.

@michaelmatloob michaelmatloob changed the title proposal: port RE2's DFA matcher to the regexp package proposal: regexp: port RE2's DFA matcher to the regexp package Jul 9, 2015
@ianlancetaylor ianlancetaylor added this to the Unplanned milestone Jul 10, 2015
@ianlancetaylor
Copy link
Member

CC @rsc

@bradfitz
Copy link
Contributor

IIRC, @rsc and @robpike were discussing this a few weeks ago and @rsc said he'd hit some design stumbling block where the C++ implementation didn't have an obvious Go translation.

Also, you were mentioning that the UTF-8-edness of the Go's regexp package and its runes make the 256-sized tables (which are just bytes in the C++ version) potentially tricky. Not sure whether restricting the DFA optimization to ASCII-only regexps helps or not.

@michaelmatloob
Copy link
Contributor Author

One problem @rsc mentioned was the allocation of state data, but he gave me a solution so I don't think that will be a big problem.

After looking some more at the rune vs byte issue, I don't think we would need to skip the DFA for non-ascii regexps. The DFA makes a set of character ranges, and indexes the output transitions by range. We could do the same range lookup for input runes < 256 (or some other arbitrary table size) and do a slower lookup for larger runes.

Of course, there are a number of other things that don't map easily. I think most of those will have replacements and for any that don't we'll be able to do something slightly slower than RE2 but still be much faster than the nfa matcher.

@robpike
Copy link
Contributor

robpike commented Jul 27, 2015

I would like to understand why the NFA is not performant first. I am afraid that adding a DFA (I presume for performance reasons) may hit the same problems.

@rsc
Copy link
Contributor

rsc commented Oct 24, 2015

@michaelmatloob and I talked about this at Gophercon. I'm okay with adding this provided:

  1. A bound on the space used by the DFA can be kept.
  2. When the DFA cache runs out of space and must be flushed, the allocated memory can be reused directly. (That is, we don't allocate a new cache and rely on a GC cycle to reclaim the old cache memory.)

@rsc rsc changed the title proposal: regexp: port RE2's DFA matcher to the regexp package regexp: port RE2's DFA matcher to the regexp package Oct 24, 2015
@matloob
Copy link
Contributor

matloob commented Mar 4, 2016

I'm working on this and hope to have it ready for 1.7.

(How do I add myself as an assignee?)

@matloob matloob self-assigned this Mar 4, 2016
@gopherbot
Copy link
Contributor

CL https://golang.org/cl/22189 mentions this issue.

@matloob matloob modified the milestones: Go1.7Early, Unplanned Apr 18, 2016
@gopherbot
Copy link
Contributor

CL https://golang.org/cl/22246 mentions this issue.

@matloob matloob removed this from the Go1.7Early milestone Apr 27, 2016
@matloob
Copy link
Contributor

matloob commented Apr 27, 2016

This isn't going to get in for Go 1.7.

@bradfitz bradfitz added this to the Go1.8 milestone Apr 27, 2016
@matloob
Copy link
Contributor

matloob commented Apr 27, 2016

For those interested in trying it out, I've got a standalone copy of my work thus far in matloob.io/regexp. It passes all the regexp tests and is 2-4x faster than regexp for hard regexps and large inputs. It's 5-10x faster for the new BenchmarkMatchHard1 benchmark.

@cespare
Copy link
Contributor

cespare commented Aug 17, 2016

@matloob plan on sending new CLs this cycle?

@matloob
Copy link
Contributor

matloob commented Aug 17, 2016

Ooh I really want to... but...

rsc had some ideas that I want to look at before moving forward with this. I think things will need to be on hold till he gets back.

In the meantime I'll sync the cl to tip and re-check-in an even harder benchmark.

@rsc rsc modified the milestones: Go1.9, Go1.8 Oct 18, 2016
@olekukonko
Copy link

@matloob can you share some of @rsc ideas ?

@matloob
Copy link
Contributor

matloob commented Jan 13, 2017

Russ would like to implement the ideas in the following paper: https://doi.org/10.1002/spe.2436. I haven't had a chance to read the paper yet (and it's behind a paywall).

@junyer
Copy link
Contributor

junyer commented Jan 17, 2017

If it helps, there is a video of the seminar that Chivers held about one month before the paper was published. The "preview" idea seems elegant and more appealing (to me, at least) than dealing with SIMD instructions for various architectures.

The first page of the paper mentions the use of loop counting to implement counted repetitions. If that idea is of interest, there are at least two papers about "extended" finite automata that predate Chivers' work. I must also point out that Smith, Estan and Jha patented their work.

@olekukonko
Copy link

@matloob @jubalh thanks for the links ...

@adsouza
Copy link

adsouza commented Apr 18, 2018

The Teddy algorithm used in Hyperscan is much better than even RE2's DFA:
https://01.org/hyperscan/blogs/jpviiret/2017/regex-set-scanning-hyperscan-and-re2set

Perhaps somebody can translate it from the Rust implementation at https://github.com/jneem/teddy

@BurntSushi
Copy link

BurntSushi commented Apr 18, 2018

@adsouza Note that the Teddy algorithm is a specific form of literal optimization that only really applies to a small number of small literals.

@junyer
Copy link
Contributor

junyer commented May 6, 2018

Has anyone evaluated the DFA implementation in https://github.com/google/codesearch/blob/master/regexp/match.go yet?

@bradfitz bradfitz modified the milestones: Go1.11, Unplanned May 30, 2018
@bradfitz bradfitz added the NeedsFix The path to resolution is known, but the work has not been done. label May 30, 2018
@smasher164
Copy link
Member

@junyer The implementation in codesearch keeps a state cache of size 256 for each DFA state (although the alphabet spans the size of a rune). A cache like this could be placed into a sync.Pool, to play nicely with the GC. We would still be swapping a lot given a single DFA cache is 2K bytes, so it would be nice to reduce the size of the bitmap.

I'm trying to understand toByteProg, which I believe modifies a syntax.Prog to break up a >1-byte-range rune instruction into multiple 1-byte-range rune instructions (correct me if I'm wrong). I think @rsc mentions this idea of constructing smaller FSMs to handle unicode in https://swtch.com/~rsc/regexp/regexp1.html under Character sets.

A preview approach might be to use unicode character ranges to create a preview DFA of depth 3 to filter out invalid byte sequences. This presents a tradeoff between the number of states (from the depth value) and size of the DFA cache, which we could tune to our liking.

@BurntSushi
Copy link

I'm trying to understand toByteProg, which I believe modifies a syntax.Prog to break up a >1-byte-range rune instruction into multiple 1-byte-range rune instructions (correct me if I'm wrong).

If you can read Rust, then this is exactly what the utf8-ranges crate does: https://github.com/BurntSushi/utf8-ranges There isn't much code, so it should be pretty easy to port. The idea is itself inspired by what RE2 does.

This approach is how you get the alphabet size down to 256, even while supporting all of Unicode. You can further reduce the alphabet size by merging the symbols in your alphabet that are indistinguishable from each other for a specific regex. @rsc talks about this in his third regexp article IIRC.

@kamilgregorczyk
Copy link

Regexps are still slow as hell #26943

@bradfitz
Copy link
Contributor

@kamilgregorczyk, you can file constructive bug reports without saying things are "slow as hell" or "unacceptable". People are more likely to deprioritize bug reports from people who seem rude and more likely to help people being polite.

@kamilgregorczyk
Copy link

kamilgregorczyk commented Aug 12, 2018 via email

@cznic
Copy link
Contributor

cznic commented Aug 12, 2018

What do you even mean by 'unacceptable performance'? Do you care to define/explain? Do you mean the happy case, the average or the worst one? For every of that you get different answers. Do you know, that Go is actually in orders of magnitude faster in some worst cases compared to PCRE? Have you heard about https://en.wikipedia.org/wiki/ReDoS, for example?

What's the best mix of the performance/safety characteristics in the different cases is a design choice, not a simple nor universal "truth" as falsely claimed.

@kamilgregorczyk
Copy link

kamilgregorczyk commented Aug 12, 2018 via email

@bradfitz
Copy link
Contributor

You could say something like "I find its performance unacceptable" or "It's unacceptable for my application", but saying that it's flat-out "unacceptable" is what's coming across as rude in your tone. It's been almost 9 years of people largely accepting its performance, even if there are always things that could be better. The bug is one such thing.

@mohanraj-r
Copy link

Since

should this issue be closed in favor of #26623 @bradfitz ?

@bradfitz
Copy link
Contributor

Let's keep this open. This is a specific task with good discussion and history.

wenovus added a commit to openconfig/ygot that referenced this issue Apr 7, 2021
If there are a lot of routing prefixes to be validated, regexp
compilation can become a performance bottleneck.

Aside: This could be because Go's regexp implementation might be
relatively slow compared to other languages:
golang/go#11646
wenovus added a commit to openconfig/ygot that referenced this issue Apr 10, 2021
* Cache regexp compilations during string validation.

If there are a lot of routing prefixes to be validated, regexp
compilation can become a performance bottleneck.

Aside: This could be because Go's regexp implementation might be
relatively slow compared to other languages:
golang/go#11646

* Add RWLocks for global caches and add benchmark test
@blacktop
Copy link

Any updates?

@matloob matloob removed their assignment Oct 13, 2022
sgrankin added a commit to sgrankin/cs that referenced this issue Jan 26, 2024
Store blobs inline in the index.
This way we can mmap them in, avoiding file open costs and using the page cache.
Sadly, this rules out compression except at the file system level.

Also restore the original codesearch Regexp.
Unlike the Go regexp, this one has an DFA implementation that seems to perform much faster on case-insensitive searches (while having spikier memory usage per regex).
See [Issue 11646] for details.

# Future

See the comment about [regex previews], which may be a good route for even faster parsing of large files.
Notably, we spend a lot of time eliminating non-matching files, so this would help.

Also, consider some basic query*file caching noting lack of matches.
Perhaps a bloom filter?

[Issue 11646]: golang/go#11646 (comment)
[regex previews]: golang/go#11646 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsFix The path to resolution is known, but the work has not been done. Performance Proposal Proposal-Accepted
Projects
None yet
Development

No branches or pull requests