Add regex matching for &[u8]. #183

BurntSushi · 2016-03-10T01:57:53Z

This commit enables support for compiling regular expressions that can
match on arbitrary byte slices. In particular, we add a new sub-module
called `bytes` that duplicates the API of the top-level module, except
`&str` for subjects is replaced by `&[u8]`. Additionally, Unicode
support in the regular expression is disabled by default but can be
selectively re-enabled with the `u` flag. (Unicode support cannot be
selectively disabled in the standard top-level API.)
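
To illustrate why a `&[u8]` subject is useful at all, here is a minimal standard-library sketch. It does not use the `regex` crate; `find_bytes` is a hypothetical helper standing in for a byte-oriented matcher. It only shows that a buffer containing invalid UTF-8 cannot be viewed as a `&str`, while a search over the raw bytes still works.

```rust
/// Naive substring search over raw bytes (a hypothetical helper, not part
/// of the regex crate). Returns the start offset of the first occurrence
/// of `needle`, if any.
fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    // `windows(0)` panics, so handle the empty needle explicitly.
    if needle.is_empty() {
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

fn main() {
    // The byte 0xFF never appears in valid UTF-8, so this buffer cannot be
    // borrowed as a `&str`.
    let data: &[u8] = b"\xFFfoo\x00bar";
    assert!(std::str::from_utf8(data).is_err());

    // Searching the raw bytes is unaffected by the invalid UTF-8.
    assert_eq!(find_bytes(data, b"foo"), Some(1));
    assert_eq!(find_bytes(data, b"\x00bar"), Some(4));
}
```

A `&str`-only API would force callers to reject or lossily convert such input before matching.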

Most of the interesting changes occurred in the `regex-syntax` crate,
where the AST now explicitly distinguishes between "ASCII compatible"
expressions and Unicode aware expressions.

This PR makes a few other changes out of convenience:

1. The DFA now knows how to "give up" if it's flushing its cache too often. When the DFA gives up, either backtracking or the NFA algorithm takes over, which provides better performance.
2. Benchmarks were added for Oniguruma.
3. The benchmarks in general were overhauled to be defined in one place by using conditional compilation.
4. The tests have been completely reorganized to make it easier to split up the tests depending on which regex engine we're using. For example, we occasionally need to be able to write tests specifically for `regex::Regex` or specifically for `regex::bytes::Regex`.
5. Fixes a bug where NUL bytes weren't represented correctly in the byte class optimization for the DFA.
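
The "give up" behavior described for the DFA can be pictured roughly as follows. This is a toy sketch under stated assumptions, not the crate's implementation: `StateCache`, `Engine`, `run`, and the flush threshold are all hypothetical names and values chosen for illustration. The idea is only that a bounded cache counts how often it has been flushed, and the caller switches to a fallback engine once that count exceeds a limit.

```rust
use std::collections::HashMap;

/// Toy stand-in for a lazy DFA's state cache. All names and limits here
/// are hypothetical and not taken from the regex crate.
struct StateCache {
    states: HashMap<u64, usize>,
    capacity: usize,
    flushes: usize,
    max_flushes: usize,
}

impl StateCache {
    fn new(capacity: usize, max_flushes: usize) -> Self {
        StateCache { states: HashMap::new(), capacity, flushes: 0, max_flushes }
    }

    /// Record a state, flushing the whole cache first if it is full.
    /// Returns `false` once the cache has been flushed too often.
    fn insert(&mut self, key: u64) -> bool {
        if !self.states.contains_key(&key) && self.states.len() >= self.capacity {
            self.states.clear();
            self.flushes += 1;
            if self.flushes > self.max_flushes {
                return false;
            }
        }
        let id = self.states.len();
        self.states.entry(key).or_insert(id);
        true
    }
}

/// Which engine ended up finishing the (simulated) search.
#[derive(Debug, PartialEq)]
enum Engine {
    Dfa,
    Fallback,
}

/// Feed a sequence of state keys through the cache and report whether the
/// DFA could keep going or had to hand over to a fallback engine.
fn run(keys: &[u64], capacity: usize, max_flushes: usize) -> Engine {
    let mut cache = StateCache::new(capacity, max_flushes);
    for &key in keys {
        if !cache.insert(key) {
            return Engine::Fallback;
        }
    }
    Engine::Dfa
}

fn main() {
    // Two distinct states fit in a cache of size 2: no flush, DFA finishes.
    assert_eq!(run(&[1, 2, 1, 2, 1], 2, 1), Engine::Dfa);
    // Six distinct states force a second flush, exceeding the limit of 1.
    assert_eq!(run(&[1, 2, 3, 4, 5, 6], 2, 1), Engine::Fallback);
}
```

A real heuristic would presumably weigh flushes against the amount of input processed; the fixed limit here just keeps the sketch short.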

Closes #85.

BurntSushi · 2016-03-10T01:58:50Z

cc @alexcrichton @vi @flying-sheep @birkenfeld @nikomatsakis @jneem

BurntSushi · 2016-03-10T02:01:42Z

Important notes:

- This doubles the size of the public API since the new `bytes` sub-module is essentially a replica of the top-level crate API, except with `s/&str/&[u8]` for the search strings. (e.g., `is_match` takes a `&[u8]` instead of a `&str`.)
- The API isn't quite identical, but the key changes made were to address #168 (add second "capture" lifetime to `SubCaptures` and `SubCapturesNamed`) and #151 (fixed `Replacer` trait to work more like `BufRead::read_line`). (I'd like to make both of these changes to the top-level API for 1.0.) Neither is a significant change. All of the key methods like `is_match` or `captures_iter` are the same.
- There should be no breaking changes in `regex` (there are breaking changes in `regex-syntax`).

flying-sheep · 2016-03-10T19:23:19Z

wow, amazing!

birkenfeld · 2016-03-10T19:49:52Z

Looks very nice.

alexcrichton · 2016-03-11T22:26:19Z

Holy cow, amazing work as always @BurntSushi!

Out of curiosity, does this have much impact on compile times of this crate? It'd be kinda unfortunate if most users don't use byte-based regexes but end up doubling compile times (which may indicate that perhaps another crate should exist?), but if the impact is small it probably doesn't matter too much.

Other than that the concept seems good to me, having a separate API for bytes-based regexes as it does seems like they'll only ever be disjointly used.

vi · 2016-03-12T03:49:15Z

If compile time is a problem, the feature could be put behind a Cargo feature gate.

BurntSushi · 2016-03-12T03:49:41Z

For debug compile times:

```
$ touch src/lib.rs
$ time cargo build
   Compiling regex v0.1.55 (file:///home/andrew/data/projects/rust/regex)

real    0m5.770s
user    0m5.487s
sys     0m0.173s

$ $EDITOR src/*.rs # remove `bytes` module
$ touch src/lib.rs
$ time cargo build
   Compiling regex v0.1.55 (file:///home/andrew/data/projects/rust/regex)

real    0m5.506s
user    0m5.283s
sys     0m0.130s
```

For --release times:

```
$ touch src/lib.rs
$ time cargo build --release
   Compiling regex v0.1.55 (file:///home/andrew/data/projects/rust/regex)

real    0m16.714s
user    0m16.517s
sys     0m0.090s

$ $EDITOR src/*.rs # remove `bytes` module
$ touch src/lib.rs
$ time cargo build --release
   Compiling regex v0.1.55 (file:///home/andrew/data/projects/rust/regex)

real    0m16.346s
user    0m16.103s
sys     0m0.140s
```

So it looks like the difference is negligible. I'm actually not too surprised by this since the matching engines themselves always match on &[u8], even when using Unicode-only regexes. (It wasn't always this way, but it became more convenient for a variety of reasons with the arrival of the DFA, which really really demands &[u8] matching.) As a result, the only "additional" code this introduces is the API itself, which all told isn't that much. :-) The public API is really just a wrapper around a more flexible internal API.
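
The "thin wrapper" arrangement can be sketched like this. It is only an illustration under assumptions: `Engine`, `StrRegex`, and `BytesRegex` are hypothetical names, and the "engine" is a plain literal search rather than a real regex matcher. The point is that both public types forward to the same byte-oriented core, so the second API adds very little code.

```rust
/// Hypothetical internal engine. It only ever sees bytes, whichever public
/// API was used to reach it. Here it just searches for a literal.
struct Engine {
    literal: Vec<u8>,
}

impl Engine {
    fn is_match(&self, text: &[u8]) -> bool {
        // `windows(0)` panics, so treat the empty literal as always matching.
        self.literal.is_empty()
            || text.windows(self.literal.len()).any(|w| w == &self.literal[..])
    }
}

/// Wrapper whose subjects are `&str` (stand-in for the top-level API).
struct StrRegex(Engine);

/// Wrapper whose subjects are `&[u8]` (stand-in for the `bytes` API).
struct BytesRegex(Engine);

impl StrRegex {
    fn new(literal: &str) -> Self {
        StrRegex(Engine { literal: literal.as_bytes().to_vec() })
    }

    fn is_match(&self, text: &str) -> bool {
        // A `&str` is already valid UTF-8 bytes, so just forward them.
        self.0.is_match(text.as_bytes())
    }
}

impl BytesRegex {
    fn new(literal: &[u8]) -> Self {
        BytesRegex(Engine { literal: literal.to_vec() })
    }

    fn is_match(&self, text: &[u8]) -> bool {
        self.0.is_match(text)
    }
}

fn main() {
    assert!(StrRegex::new("foo").is_match("a foo b"));
    // The bytes wrapper accepts input that is not valid UTF-8.
    assert!(BytesRegex::new(b"foo").is_match(b"\xFFfoo"));
    assert!(!BytesRegex::new(b"foo").is_match(b"\xFFbar"));
}
```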

With that said, I have toyed with the idea of introducing a new crate, maybe called regex-internal or something, but I'm really wishy washy on it (and extremely biased, personally).

> Other than that the concept seems good to me, having a separate API for bytes-based regexes as it does seems like they'll only ever be disjointly used.

Those were my thoughts as well. :-)

alexcrichton · 2016-03-12T18:14:02Z

Awesome, 👍 from me!

Add regex matching for &[u8].

BurntSushi force-pushed the raw-bytes branch 2 times, most recently from 1927236 to eac9c6f Compare March 10, 2016 02:20

BurntSushi force-pushed the raw-bytes branch from eac9c6f to d98ec1b Compare March 10, 2016 02:32

BurntSushi added a commit that referenced this pull request Mar 13, 2016

Merge pull request #183 from rust-lang-nursery/raw-bytes

8fbec9a

Add regex matching for &[u8].

BurntSushi merged commit 8fbec9a into master Mar 13, 2016

BurntSushi deleted the raw-bytes branch March 13, 2016 15:05

mcarton added a commit to rust-lang/rust-clippy that referenced this pull request Apr 14, 2016

Fix the REGEX_MACRO lint

578cc3d

[rust-lang/regex#183](rust-lang/regex#183) has made the following change that broke the lint: src/re.rs → src/re_unicode.rs
