regexp: port RE2's DFA matcher to the regexp package #11646

michaelmatloob · 2015-07-09T22:02:08Z

The regexp package currently chooses between the standard NFA matcher, onepass, or the backtracker. This proposal is for porting over RE2's DFA matcher to be used as an option by exec.

ianlancetaylor · 2015-07-10T03:37:58Z

CC @rsc

bradfitz · 2015-07-13T16:41:01Z

IIRC, @rsc and @robpike were discussing this a few weeks ago and @rsc said he'd hit some design stumbling block where the C++ implementation didn't have an obvious Go translation.

Also, you were mentioning that the UTF-8-edness of Go's regexp package and its runes make the 256-sized tables (which are just bytes in the C++ version) potentially tricky. Not sure whether restricting the DFA optimization to ASCII-only regexps helps or not.

michaelmatloob · 2015-07-26T16:50:06Z

One problem @rsc mentioned was the allocation of state data, but he gave me a solution so I don't think that will be a big problem.

After looking some more at the rune vs byte issue, I don't think we would need to skip the DFA for non-ASCII regexps. The DFA makes a set of character ranges, and indexes the output transitions by range. We could do the same range lookup through a table for input runes < 256 (or some other arbitrary table size) and do a slower lookup for larger runes.
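A minimal sketch of that two-tier lookup, for illustration only. The `classifier` type, its fields, and the range boundaries are hypothetical and are not taken from RE2 or from any CL; the sketch only assumes that the DFA's character ranges can be represented as a sorted list of range start points.

```go
package main

import (
	"fmt"
	"sort"
)

// classifier maps a rune to the index of the character range it falls in.
// The boundaries are made up for this example; a real DFA would derive
// them from the rune instructions of the compiled program.
type classifier struct {
	bounds []rune      // sorted start point of each range; bounds[0] must be 0
	small  [256]uint16 // precomputed range index for runes < 256
}

func newClassifier(bounds []rune) *classifier {
	c := &classifier{bounds: bounds}
	for r := rune(0); r < 256; r++ {
		c.small[r] = uint16(c.slow(r))
	}
	return c
}

// slow finds the range by binary search: the last bound that is <= r.
func (c *classifier) slow(r rune) int {
	return sort.Search(len(c.bounds), func(i int) bool { return c.bounds[i] > r }) - 1
}

// class uses the table for small runes and falls back to the search
// for everything else.
func (c *classifier) class(r rune) int {
	if r >= 0 && r < 256 {
		return int(c.small[r])
	}
	return c.slow(r)
}

func main() {
	// Ranges: [0,'a'), ['a','{'), ['{',0x400), [0x400,0x500), [0x500,...)
	c := newClassifier([]rune{0, 'a', '{', 0x400, 0x500})
	fmt.Println(c.class('A'), c.class('q'), c.class('я'), c.class('世'))
	// prints: 0 1 3 4
}
```

The transition table of each DFA state would then be indexed by the returned range index rather than by the rune itself.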

Of course, there are a number of other things that don't map easily. I think most of those will have replacements, and for any that don't we'll be able to do something slightly slower than RE2 but still much faster than the NFA matcher.

robpike · 2015-07-27T01:21:37Z

I would like to understand why the NFA is not performant first. I am afraid that adding a DFA (I presume for performance reasons) may hit the same problems.

rsc · 2015-10-24T02:53:33Z

@michaelmatloob and I talked about this at Gophercon. I'm okay with adding this provided:

A bound on the space used by the DFA can be kept.
When the DFA cache runs out of space and must be flushed, the allocated memory can be reused directly. (That is, we don't allocate a new cache and rely on a GC cycle to reclaim the old cache memory.)
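The two conditions above can be sketched as a fixed-size arena of states that is reset in place when it fills up. This is only an illustration under stated assumptions: the `state` and `cache` types and the string key are hypothetical, not the design from RE2 or from the CLs, and the `fmt.Sprint` key is a simplification that itself allocates.

```go
package main

import "fmt"

// state is a hypothetical DFA state: a set of NFA instruction indices
// plus a transition table indexed by input class.
type state struct {
	insts []uint32
	next  []*state
}

// cache holds a fixed budget of states that is allocated once.
type cache struct {
	states  []state           // backing store, never reallocated
	used    int               // states handed out since the last flush
	index   map[string]*state // lookup by instruction set
	flushes int
}

func newCache(maxStates, classes int) *cache {
	c := &cache{
		states: make([]state, maxStates),
		index:  make(map[string]*state, maxStates),
	}
	for i := range c.states {
		c.states[i].next = make([]*state, classes)
	}
	return c
}

// get returns the cached state for insts, creating it if needed.
// When the budget is exhausted the cache is flushed and slots are reused,
// so pointers obtained before a flush must not be used afterwards.
func (c *cache) get(insts []uint32) *state {
	key := fmt.Sprint(insts) // simplification: a real cache would avoid this allocation
	if s, ok := c.index[key]; ok {
		return s
	}
	if c.used == len(c.states) {
		c.flush()
	}
	s := &c.states[c.used]
	c.used++
	s.insts = append(s.insts[:0], insts...) // reuse the old slice's capacity
	for i := range s.next {
		s.next[i] = nil
	}
	c.index[key] = s
	return s
}

// flush forgets every state but keeps all of the memory.
func (c *cache) flush() {
	for k := range c.index {
		delete(c.index, k)
	}
	c.used = 0
	c.flushes++
}

func main() {
	c := newCache(2, 4)
	a := c.get([]uint32{1})
	b := c.get([]uint32{1, 2})
	d := c.get([]uint32{3}) // budget exhausted: flush, then reuse slot 0
	fmt.Println(a == d, b == &c.states[1], c.flushes)
	// prints: true true 1
}
```

Space stays bounded by `maxStates`, and a flush hands the same slots back out instead of leaving an old cache for the garbage collector.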

matloob · 2016-03-04T23:41:46Z

I'm working on this and hope to have it ready for 1.7.

(How do I add myself as an assignee?)

gopherbot · 2016-04-18T17:01:01Z

CL https://golang.org/cl/22189 mentions this issue.

gopherbot · 2016-04-19T20:00:59Z

CL https://golang.org/cl/22246 mentions this issue.

matloob · 2016-04-27T16:41:45Z

This isn't going to get in for Go 1.7.

matloob · 2016-04-27T16:46:02Z

For those interested in trying it out, I've got a standalone copy of my work thus far in matloob.io/regexp. It passes all the regexp tests and is 2-4x faster than regexp for hard regexps and large inputs. It's 5-10x faster for the new BenchmarkMatchHard1 benchmark.

cespare · 2016-08-17T02:29:43Z

@matloob plan on sending new CLs this cycle?

matloob · 2016-08-17T03:16:30Z

Ooh I really want to... but...

rsc had some ideas that I want to look at before moving forward with this. I think things will need to be on hold till he gets back.

In the meantime I'll sync the CL to tip and re-check-in an even harder benchmark.

olekukonko · 2017-01-13T09:12:06Z

@matloob can you share some of @rsc's ideas?

matloob · 2017-01-13T19:13:52Z

Russ would like to implement the ideas in the following paper: https://doi.org/10.1002/spe.2436. I haven't had a chance to read the paper yet (and it's behind a paywall).

junyer · 2017-01-17T15:36:27Z

If it helps, there is a video of the seminar that Chivers held about one month before the paper was published. The "preview" idea seems elegant and more appealing (to me, at least) than dealing with SIMD instructions for various architectures.

The first page of the paper mentions the use of loop counting to implement counted repetitions. If that idea is of interest, there are at least two papers about "extended" finite automata that predate Chivers' work. I must also point out that Smith, Estan and Jha patented their work.

olekukonko · 2017-01-18T12:55:31Z

@matloob @jubalh thanks for the links ...

adsouza · 2018-04-18T11:16:59Z

The Teddy algorithm used in Hyperscan is much better than even RE2's DFA:
https://01.org/hyperscan/blogs/jpviiret/2017/regex-set-scanning-hyperscan-and-re2set

Perhaps somebody can translate it from the Rust implementation at https://github.com/jneem/teddy

BurntSushi · 2018-04-18T11:31:33Z

@adsouza Note that the Teddy algorithm is a specific form of literal optimization that only really applies to a small number of small literals.

junyer · 2018-05-06T11:54:14Z

Has anyone evaluated the DFA implementation in https://github.com/google/codesearch/blob/master/regexp/match.go yet?

smasher164 · 2018-06-05T01:36:27Z

@junyer The implementation in codesearch keeps a state cache of size 256 for each DFA state (although the alphabet spans the size of a rune). A cache like this could be placed into a sync.Pool, to play nicely with the GC. We would still be swapping a lot given a single DFA cache is 2K bytes, so it would be nice to reduce the size of the bitmap.

I'm trying to understand toByteProg, which I believe modifies a syntax.Prog to break up a >1-byte-range rune instruction into multiple 1-byte-range rune instructions (correct me if I'm wrong). I think @rsc mentions this idea of constructing smaller FSMs to handle unicode in https://swtch.com/~rsc/regexp/regexp1.html under Character sets.

A preview approach might be to use unicode character ranges to create a preview DFA of depth 3 to filter out invalid byte sequences. This presents a tradeoff between the number of states (from the depth value) and size of the DFA cache, which we could tune to our liking.

BurntSushi · 2018-06-05T01:48:51Z

> I'm trying to understand toByteProg, which I believe modifies a syntax.Prog to break up a >1-byte-range rune instruction into multiple 1-byte-range rune instructions (correct me if I'm wrong).

If you can read Rust, then this is exactly what the utf8-ranges crate does (https://github.com/BurntSushi/utf8-ranges). There isn't much code, so it should be pretty easy to port. The idea is itself inspired by what RE2 does.

This approach is how you get the alphabet size down to 256, even while supporting all of Unicode. You can further reduce the alphabet size by merging the symbols in your alphabet that are indistinguishable from each other for a specific regex. @rsc talks about this in his third regexp article IIRC.
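One way to do the merging described above, sketched under the assumption that the byte-level program is available as a list of inclusive byte ranges. The `byteRange` type and `byteClasses` function are hypothetical names for this example, not an API from RE2, the Rust crates, or Go's regexp package.

```go
package main

import "fmt"

// byteRange is an inclusive range of byte values matched by one instruction.
type byteRange struct{ lo, hi byte }

// byteClasses assigns each of the 256 byte values a class number such that
// two bytes share a class only if no range in rs tells them apart.
// It returns the mapping and the number of classes.
func byteClasses(rs []byteRange) (classes [256]uint8, n int) {
	var end [256]bool // end[b] reports whether a class ends at byte b
	for _, r := range rs {
		if r.lo > 0 {
			end[r.lo-1] = true
		}
		end[r.hi] = true
	}
	class := 0
	for b := 0; b < 256; b++ {
		classes[b] = uint8(class)
		if end[b] {
			class++
		}
	}
	return classes, int(classes[255]) + 1
}

func main() {
	// Byte ranges as they might appear for a regexp such as [a-z]+[0-9].
	classes, n := byteClasses([]byteRange{{'a', 'z'}, {'0', '9'}})
	fmt.Println(n, classes['5'], classes['m'], classes['!'])
	// prints: 5 1 3 0
}
```

With this mapping a DFA state needs only `n` transition entries (5 here) instead of 256, at the cost of one extra table lookup per input byte.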

kamilgregorczyk · 2018-08-12T15:08:06Z

Regexps are still slow as hell #26943

bradfitz · 2018-08-12T19:36:15Z

@kamilgregorczyk, you can file constructive bug reports without saying things are "slow as hell" or "unacceptable". People are more likely to deprioritize bug reports from people who seem rude and more likely to help people being polite.

kamilgregorczyk · 2018-08-12T20:04:29Z

1. Performance of regexp in Go is unacceptable; that's the truth, and there's nothing rude about it.
2. Such a design flaw (that's been known for quite a long time) affects my work and most likely other people's, and apparently some built-ins aren't optimized in any way AND there are no plans to fix it. That's really bad for a language, as I thought I could trust Go. In fact I wanted to start using it for my commercial projects, but for now it goes back to the toy bucket again (the first time was when I tried iris and its owner deleted all the commits).

cznic · 2018-08-12T20:18:31Z

What do you even mean by 'unacceptable performance'? Do you care to define/explain? Do you mean the happy case, the average or the worst one? For each of those you get different answers. Do you know that Go is actually orders of magnitude faster in some worst cases (https://swtch.com/~rsc/regexp/regexp1.html) compared to PCRE? Have you heard about https://en.wikipedia.org/wiki/ReDoS, for example?

The best mix of performance/safety characteristics in the different cases is a design choice, neither a simple nor a universal "truth" as falsely claimed.

kamilgregorczyk · 2018-08-12T20:43:25Z

Of course it differs: in some cases it will be fine and in some it won't. It can break any benchmarks/tests/whatever; it's still slower than Java/Python/Ruby in my case and in some other cases which were reported. What scares me away is that there are no plans to even start fixing it.

bradfitz · 2018-08-13T04:27:00Z

You could say something like "I find its performance unacceptable" or "It's unacceptable for my application", but saying that it's flat-out "unacceptable" is what's coming across as rude in your tone. It's been almost 9 years of people largely accepting its performance, even if there are always things that could be better. The bug is one such thing.

mohanraj-r · 2019-10-17T19:51:14Z

Since

* @matloob's PR seems to have stalled as per conversations in this issue, and
* there don't seem to be any regexp optimizations in Go 1.12 (or Go 1.13) as @rsc mentioned in "regexp: Go regexp lib is unacceptably slow" #26943 (comment) (referenced from this issue),

should this issue be closed in favor of #26623, @bradfitz?

bradfitz · 2019-10-17T23:34:16Z

Let's keep this open. This is a specific task with good discussion and history.

* Cache regexp compilations during string validation. If there are a lot of routing prefixes to be validated, regexp compilation can become a performance bottleneck. Aside: This could be because Go's regexp implementation might be relatively slow compared to other languages: golang/go#11646
* Add RWLocks for global caches and add benchmark test
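The caching described in that commit message could look roughly like the following. This is a hedged sketch, not the code from the referenced change: the `compileCached` function and the package-level `reCache` map are hypothetical names, and only `regexp.Compile` and `sync.RWMutex` are real standard library APIs.

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

var (
	reMu    sync.RWMutex
	reCache = map[string]*regexp.Regexp{}
)

// compileCached compiles pattern at most once and shares the result.
// A *regexp.Regexp is safe for concurrent use, so sharing it is fine.
func compileCached(pattern string) (*regexp.Regexp, error) {
	// Fast path: many readers can look up concurrently.
	reMu.RLock()
	re, ok := reCache[pattern]
	reMu.RUnlock()
	if ok {
		return re, nil
	}

	// Compile outside the lock so slow compilations do not block readers.
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}

	reMu.Lock()
	defer reMu.Unlock()
	// Another goroutine may have stored the same pattern in the meantime.
	if existing, ok := reCache[pattern]; ok {
		return existing, nil
	}
	reCache[pattern] = re
	return re, nil
}

func main() {
	a, _ := compileCached(`^[a-z]+$`)
	b, _ := compileCached(`^[a-z]+$`)
	fmt.Println(a == b, a.MatchString("prefix")) // prints: true true
}
```

The map grows without bound in this sketch, which is only reasonable when the set of distinct patterns is small and fixed.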

blacktop · 2022-05-28T03:09:29Z

Any updates?

Store blobs inline in the index. This way we can mmap them in, avoiding file open costs and using the page cache. Sadly, this rules out compression except at the file system level.

Also restore the original codesearch Regexp. Unlike the Go regexp, this one has a DFA implementation that seems to perform much faster on case-insensitive searches (while having spikier memory usage per regex). See [Issue 11646] for details.

# Future

See the comment about [regex previews], which may be a good route for even faster parsing of large files. Notably, we spend a lot of time eliminating non-matching files, so this would help.

Also, consider some basic query*file caching noting lack of matches. Perhaps a bloom filter?

[Issue 11646]: golang/go#11646 (comment)
[regex previews]: golang/go#11646 (comment)

michaelmatloob changed the title ~~proposal: port RE2's DFA matcher to the regexp package~~ proposal: regexp: port RE2's DFA matcher to the regexp package Jul 9, 2015

ianlancetaylor added this to the Unplanned milestone Jul 10, 2015

mikioh added the Proposal label Aug 13, 2015

rsc added the Proposal-Accepted label Oct 24, 2015

rsc changed the title ~~proposal: regexp: port RE2's DFA matcher to the regexp package~~ regexp: port RE2's DFA matcher to the regexp package Oct 24, 2015

bradfitz mentioned this issue Feb 1, 2016

Dramatically speed down regexp while grow difficulty it #14167

Closed

matloob self-assigned this Mar 4, 2016

matloob modified the milestones: Go1.7Early, Unplanned Apr 18, 2016

matloob removed this from the Go1.7Early milestone Apr 27, 2016

bradfitz added this to the Go1.8 milestone Apr 27, 2016

bradfitz mentioned this issue Aug 17, 2016

regexp: regex is much slower than java #16758

Closed

rsc modified the milestones: Go1.9, Go1.8 Oct 18, 2016

bradfitz modified the milestones: Go1.11, Unplanned May 30, 2018

bradfitz added the NeedsFix label (the path to resolution is known, but the work has not been done) May 30, 2018

mvdan mentioned this issue Oct 9, 2018

go/ast: add func IsGenerated(*File) bool #28089

Closed

voiprodrigo mentioned this issue Jan 9, 2019

telegraf logparser too slow influxdata/telegraf#3539

Closed

zmanian mentioned this issue Jul 28, 2019

Add memo checker cosmos/cosmos-sdk#4523

Open

4 tasks

Mygod mentioned this issue Feb 20, 2020

Considering alternative coroutine sslocal impls shadowsocks/shadowsocks-android#2452

Closed

wenovus mentioned this issue Apr 7, 2021

Cache regexp compilations during string validation. openconfig/ygot#515

Merged

benhoyt mentioned this issue Jan 23, 2022

Optimize interpreter speed (umbrella issue) benhoyt/goawk#91

Closed

5 tasks

junyer mentioned this issue Aug 24, 2022

regexp: investigate further performance improvements #26623

Open

matloob removed their assignment Oct 13, 2022

kachick mentioned this issue Apr 19, 2024

2023-07-23 - re2 を deno とかで使いたい。 re2-wasm が使えれば万事解消するのでは・・・？ kachick/times_kachick#234

Closed

RadhiFadlillah mentioned this issue Sep 11, 2024

Real world cases where Go codes that generated by re2go is magnitude slower than Go's standard regexp skvadrik/re2c#487

Closed
