regexp: case-insensitive MatchString performance #13288
Comments
Introducing a lookup table (LUT) for runes < 0x80 helped the Easy0i case.
Results:
|
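The lookup-table idea mentioned above can be sketched roughly as follows. This is a minimal illustration, not necessarily how the eventual change implements it: the names `asciiFold`, `simpleFoldFast` and `countMismatches` are hypothetical, and the table is filled from `unicode.SimpleFold` itself at start-up so that it cannot disagree with the Unicode tables.

```go
package main

import (
	"fmt"
	"unicode"
)

// asciiFold caches unicode.SimpleFold for every rune below 0x80.
// Hypothetical name, used only for this sketch.
var asciiFold [0x80]rune

func init() {
	// Fill the table from the real function so the two always agree.
	for r := rune(0); r < 0x80; r++ {
		asciiFold[r] = unicode.SimpleFold(r)
	}
}

// simpleFoldFast (hypothetical) answers ASCII runes from the table and
// falls back to unicode.SimpleFold for everything else.
func simpleFoldFast(r rune) rune {
	// The uint32 conversion also sends negative runes to the slow path.
	if uint32(r) < 0x80 {
		return asciiFold[r]
	}
	return unicode.SimpleFold(r)
}

// countMismatches reports how many runes in [0, unicode.MaxRune] get a
// different answer from simpleFoldFast than from unicode.SimpleFold.
func countMismatches() int {
	n := 0
	for r := rune(0); r <= unicode.MaxRune; r++ {
		if simpleFoldFast(r) != unicode.SimpleFold(r) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Printf("%q %q\n", simpleFoldFast('A'), simpleFoldFast('k'))
	fmt.Println("mismatches:", countMismatches())
}
```

Note that even an ASCII rune can fold to a non-ASCII one: the entry for 'k' holds the Kelvin sign U+212A, which is why the table stores full runes rather than bytes in this sketch.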
Nice. Use https://godoc.org/golang.org/x/tools/cmd/benchcmp to show the before & after numbers. Send in a change? https://golang.org/doc/contribute.html It's probably too late for Go 1.6, though, unless you send it soon and it addresses an existing bug or regression. |
Sure, I can make CL for the unicode.SimpleFold optimization. I'm not sure whether |
I don't think you can change SimpleFold to ToLower in regexp. I suspect that would break case-insensitive regexps over characters like "kKK" (little k, big k, Kelvin symbol)? But no, ToLower of kelvin symbol is lowercase 'k'.... http://play.golang.org/p/pqhR4ksOk_ But maybe there are other characters which SimpleFold to each other but don't map to the same ToLower character? You could brute force all the characters (or characters in the defined ranges) and see? |
CL https://golang.org/cl/16943 mentions this issue. |
I did the brute-force check, and it seems that there are multiple such cases: https://play.golang.org/p/D1_JuZYP0v. So yes, |
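A brute-force check of this kind might look like the sketch below. It is not the code behind the playground link; the helper names `foldDisagreesWithLower` and `disagreeing` are made up for illustration. For each rune it walks the `unicode.SimpleFold` orbit and reports whether any member lowercases to something other than the starting rune's `unicode.ToLower`.

```go
package main

import (
	"fmt"
	"unicode"
)

// foldDisagreesWithLower walks the SimpleFold orbit of r and reports
// whether any member has a ToLower result different from ToLower(r).
func foldDisagreesWithLower(r rune) bool {
	want := unicode.ToLower(r)
	// SimpleFold cycles through the orbit and eventually returns to r.
	for f := unicode.SimpleFold(r); f != r; f = unicode.SimpleFold(f) {
		if unicode.ToLower(f) != want {
			return true
		}
	}
	return false
}

// disagreeing collects every rune whose fold orbit is not collapsed
// to a single rune by ToLower.
func disagreeing() []rune {
	var out []rune
	for r := rune(0); r <= unicode.MaxRune; r++ {
		if foldDisagreesWithLower(r) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	for _, r := range disagreeing() {
		fmt.Printf("%U %q\n", r, r)
	}
}
```

One such case is 's': its orbit contains the long s (U+017F), which is already lowercase and so does not map to 's' under `unicode.ToLower`. The orbit of 'k', by contrast, collapses to 'k', as noted earlier in the thread.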
This change significantly speeds up case-insensitive regexp matching.

benchmark                      old ns/op      new ns/op      delta
BenchmarkMatchEasy0i_32-8      2690           1473           -45.24%
BenchmarkMatchEasy0i_1K-8      80404          42269          -47.43%
BenchmarkMatchEasy0i_32K-8     3272187        2076118        -36.55%
BenchmarkMatchEasy0i_1M-8      104805990      66503805       -36.55%
BenchmarkMatchEasy0i_32M-8     3360192200     2126121600     -36.73%

benchmark                      old MB/s     new MB/s     speedup
BenchmarkMatchEasy0i_32-8      11.90        21.72        1.83x
BenchmarkMatchEasy0i_1K-8      12.74        24.23        1.90x
BenchmarkMatchEasy0i_32K-8     10.01        15.78        1.58x
BenchmarkMatchEasy0i_1M-8      10.00        15.77        1.58x
BenchmarkMatchEasy0i_32M-8     9.99         15.78        1.58x

Issue golang#13288
Change-Id: I94af7bb29e75d60b4f6ee760124867ab271b9642
Reviewed-on: https://go-review.googlesource.com/16943
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Probably enough for Go 1.7. Leaving this open, though, if you were planning on doing more later. |
Go 1.7 is looming and I don't see this mentioned anywhere in the release notes and no update on this ticket - is this being rolled into the 1.7 release? Regex is an unfortunate pain point for Go performance compared to other languages. |
@aprice @egonelbre's improvements (https://go-review.googlesource.com/#/c/16943/) were merged and will be part of Go 1.7. (Localized performance improvements are typically not called out in the release notes.) |
A change was submitted to speed up |
I have moved my question to the forum, as requested. |
Let's move general code/style questions to a mailing list and not clutter this bug. I'm happy to answer on a mailing list. |
When making the easy fix, the better solution would have been to avoid calling |
You can't simply use strings.ToLower on "knicks" because that will fail to match the "k" against the Unicode Kelvin symbol U+212A (K). The extra overhead here compared to strings.ToLower is exactly to add that symbol. |
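The point about the Kelvin symbol can be illustrated with a small sketch. The pattern and inputs are only examples, and the helper names `foldMatch`, `lowerPatternOnly` and `lowerBoth` are hypothetical. It compares case-insensitive regexp matching with two lowercase-based shortcuts: lowercasing only the pattern, and lowercasing both pattern and input.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// foldRE uses the regexp package's case folding via the (?i) flag.
var foldRE = regexp.MustCompile(`(?i)knicks`)

// foldMatch reports whether input matches under regexp case folding.
func foldMatch(input string) bool {
	return foldRE.MatchString(input)
}

// lowerPatternOnly lowercases just the pattern and searches the raw input.
func lowerPatternOnly(input string) bool {
	return strings.Contains(input, strings.ToLower("knicks"))
}

// lowerBoth lowercases the pattern and the input before searching.
func lowerBoth(input string) bool {
	return strings.Contains(strings.ToLower(input), strings.ToLower("knicks"))
}

func main() {
	inputs := []string{
		"\u212Anicks", // starts with the Kelvin sign
		"knick\u017F", // ends with the long s
	}
	for _, in := range inputs {
		fmt.Printf("%q fold=%v patternOnly=%v both=%v\n",
			in, foldMatch(in), lowerPatternOnly(in), lowerBoth(in))
	}
}
```

Lowercasing only the pattern misses the Kelvin sign. Lowercasing both sides handles it, but still misses the long s (U+017F), which folds with 's' yet is unchanged by `strings.ToLower`.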
I was investigating why https://github.com/dimroc/etl-language-comparison/blob/517507b033dbb938dd1e83401914cbd4dd9a79bc/golang/search.go#L160 ends up significantly slower with a regular expression.
Profiling the etl example led me to:
Which led me to: https://github.com/golang/go/blob/master/src/regexp/syntax/prog.go#L213
I assume that there is a good reason for using unicode.SimpleFold, i.e. some special unicode symbols. When I changed it:
It didn't break any tests. I also created easy0i = "(?i)ABCDEFGHIJKLMNOPQRSTUVWXYZ$" in https://github.com/golang/go/blob/master/src/regexp/exec_test.go#L674 to measure the performance change:
So, is there a reason why unicode.SimpleFold is used instead of something else, such as unicode.ToLower? Either way, there might be some significant performance gains here for case-insensitive matches.