Add regex matching for &[u8]. #183
This commit enables support for compiling regular expressions that can match on arbitrary byte slices. In particular, we add a new sub-module called `bytes` that duplicates the API of the top-level module, except `&str` for subjects is replaced by `&[u8]`. Additionally, Unicode support in the regular expression is disabled by default but can be selectively re-enabled with the `u` flag. (Unicode support cannot be selectively disabled in the standard top-level API.)

Most of the interesting changes occurred in the `regex-syntax` crate, where the AST now explicitly distinguishes between "ASCII compatible" expressions and Unicode aware expressions.

This PR makes a few other changes out of convenience:

1. The DFA now knows how to "give up" if it's flushing its cache too often. When the DFA gives up, either backtracking or the NFA algorithm take over, which provides better performance.
2. Benchmarks were added for Oniguruma.
3. The benchmarks in general were overhauled to be defined in one place by using conditional compilation.
4. The tests have been completely reorganized to make it easier to split up the tests depending on which regex engine we're using. For example, we occasionally need to be able to write tests specifically for `regex::Regex` or specifically for `regex::bytes::Regex`.
5. Fixes a bug where NUL bytes weren't represented correctly in the byte class optimization for the DFA.

Closes #85.
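For readers wondering why a separate `&[u8]` API is useful, here is a minimal std-only sketch. It does not use the `regex` crate or reflect its implementation; the helper names `as_text`, `find_bytes`, and `unit_counts` are hypothetical and exist only for illustration. It shows that arbitrary bytes cannot always be borrowed as `&str`, that a byte-oriented search still works on them, and that bytes and Unicode scalar values are different units to match on.

```rust
use std::str;

/// Hypothetical helper: try to view raw bytes as text.
/// Returns `None` when the bytes are not valid UTF-8, which is exactly
/// the case a `&str`-only matching API cannot handle.
fn as_text(data: &[u8]) -> Option<&str> {
    str::from_utf8(data).ok()
}

/// Hypothetical helper: naive literal search over raw bytes.
/// This is a stand-in for byte-oriented matching, not a regex engine.
fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        // `windows(0)` would panic, so handle the empty needle up front.
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Hypothetical helper: returns (number of bytes, number of Unicode scalar
/// values) for a string. A byte-oriented matcher steps over the first unit,
/// a Unicode-aware matcher over the second.
fn unit_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

With input such as `b"key=\xFFvalue"`, `as_text` yields `None` because `0xFF` never appears in valid UTF-8, while `find_bytes` can still locate `b"value"`. The gap between the two counts from `unit_counts` is, roughly, the difference between "ASCII compatible" and Unicode aware matching that the description refers to.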
Wow, amazing!
Looks very nice.
Holy cow, amazing work as always @BurntSushi! Out of curiosity, does this have much impact on compile times of this crate? It'd be kinda unfortunate if most users don't use byte-based regexes but end up with doubled compile times (which might indicate that another crate should exist?), but if the impact is small then it probably doesn't matter too much. Other than that the concept seems good to me. Having a separate API for bytes-based regexes makes sense, as it seems like the two will only ever be used disjointly.
If compile time is a problem, the feature could be put behind a Cargo feature gate.
For debug compile times:
For
So it looks like the difference is negligible. I'm actually not too surprised by this since the matching engines themselves always match on With that said, I have toyed with the idea of introducing a new crate, maybe called
Those were my thoughts as well. :-)
Awesome, 👍 from me!
[rust-lang/regex#183](rust-lang/regex#183) has made the following change that broke the lint: src/re.rs → src/re_unicode.rs