Add regex matching for &[u8]. #183

Merged
merged 1 commit into from
Mar 13, 2016

BurntSushi
Member

This commit enables support for compiling regular expressions that can
match on arbitrary byte slices. In particular, we add a new sub-module
called bytes that duplicates the API of the top-level module, except
&str for subjects is replaced by &[u8]. Additionally, Unicode
support in the regular expression is disabled by default but can be
selectively re-enabled with the u flag. (Unicode support cannot be
selectively disabled in the standard top-level API.)
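
As a rough illustration of why a `&[u8]` API is useful, here is a minimal sketch that uses only the standard library, not the `regex` crate. The helper `find_subslice` and the sample data are hypothetical. The point is that arbitrary bytes are not always valid UTF-8, so they cannot always be viewed as `&str`, yet they can still be searched as bytes.

```rust
/// Hypothetical helper (not part of the `regex` crate): naive search for
/// `needle` inside `haystack`, returning the offset of the first match.
fn find_subslice(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    // `windows(0)` panics, so treat the empty needle as matching at offset 0.
    if needle.is_empty() {
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Sample input: starts with 0xFF, a byte that never occurs in valid UTF-8,
/// and ends with a NUL byte.
fn sample_input() -> &'static [u8] {
    b"\xFFkey=value\x00"
}
```

Here `std::str::from_utf8(sample_input())` returns an error, so a `&str`-only API could not search this input without a lossy conversion, while `find_subslice(sample_input(), b"value")` works directly on the bytes.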

Most of the interesting changes occurred in the regex-syntax crate,
where the AST now explicitly distinguishes between "ASCII compatible"
expressions and Unicode aware expressions.

This PR makes a few other changes out of convenience:

  1. The DFA now knows how to "give up" if it's flushing its cache too
    often. When the DFA gives up, either backtracking or the NFA algorithm
    takes over, which provides better performance.
  2. Benchmarks were added for Oniguruma.
  3. The benchmarks in general were overhauled to be defined in one place
    by using conditional compilation.
  4. The tests have been completely reorganized to make it easier to split
    up the tests depending on which regex engine we're using. For example,
    we occasionally need to be able to write tests specifically for
    regex::Regex or specifically for regex::bytes::Regex.
  5. Fixes a bug where NUL bytes weren't represented correctly in the byte
    class optimization for the DFA.
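
The "give up" behavior from item 1 can be sketched roughly as follows. This is a toy model under stated assumptions, not the heuristic this PR implements: the types `StateCache` and `Engine`, the rule that a new state is needed whenever the byte changes, and the thresholds (3 flushes, 10 bytes per flush) are all invented for illustration.

```rust
/// Which engine ends up handling the search (hypothetical names).
#[derive(Debug, PartialEq)]
enum Engine {
    Dfa,
    Fallback, // stands in for backtracking or the NFA algorithm
}

/// Toy stand-in for a lazy DFA's bounded state cache.
struct StateCache {
    capacity: usize,
    states: usize,
    flushes: usize,
}

impl StateCache {
    fn new(capacity: usize) -> Self {
        StateCache { capacity, states: 0, flushes: 0 }
    }

    /// Record a newly computed state, flushing the whole cache when it is full.
    fn add_state(&mut self) {
        if self.states == self.capacity {
            self.states = 0;
            self.flushes += 1;
        }
        self.states += 1;
    }
}

/// Scan `input` and report which engine would finish the search.
/// Assumption for this sketch: a new DFA state is needed whenever the
/// current byte differs from the previous one.
fn pick_engine(input: &[u8]) -> Engine {
    const MIN_FLUSHES: usize = 3; // invented threshold
    const MIN_BYTES_PER_FLUSH: usize = 10; // invented threshold
    let mut cache = StateCache::new(4);
    let mut prev: Option<u8> = None;
    for (i, &b) in input.iter().enumerate() {
        if prev != Some(b) {
            cache.add_state();
            let scanned = i + 1;
            // Give up when the cache is flushed often relative to progress made.
            if cache.flushes >= MIN_FLUSHES && scanned / cache.flushes < MIN_BYTES_PER_FLUSH {
                return Engine::Fallback;
            }
        }
        prev = Some(b);
    }
    Engine::Dfa
}
```

With these made-up numbers, a long run of identical bytes stays on the DFA, while an alternating input such as `b"abababababababababab"` thrashes the four-state cache and falls back.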

Closes #85.

@BurntSushi
Member Author

Important notes:

  1. This doubles the size of the public API since the new bytes sub-module is essentially a replica of the top-level crate API, except with s/&str/&[u8] for the search strings. (e.g., is_match takes a &[u8] instead of a &str.)
  2. The API isn't quite identical, but the key changes made were to address #168 (add second "capture" lifetime to SubCaptures and SubCapturesNamed) and #151 (fixed Replacer trait to work more like BufRead::read_line). (Both of which I'd like to make to the top-level API for 1.0.) Neither is a significant change. All of the key methods like is_match or captures_iter are the same.
  3. There should be no breaking changes in regex (there are breaking changes in regex-syntax).

@BurntSushi BurntSushi force-pushed the raw-bytes branch 2 times, most recently from 1927236 to eac9c6f Compare March 10, 2016 02:20
@flying-sheep
Contributor

wow, amazing!

@birkenfeld

Looks very nice.

@alexcrichton
Member

Holy cow, amazing work as always @BurntSushi!

Out of curiosity, does this have much impact on compile times of this crate? It'd be kinda unfortunate if most users don't use byte-based regexes but end up with doubled compile times (which might indicate that another crate should exist?), but if it's already small it probably doesn't matter too much.

Other than that the concept seems good to me, having a separate API for bytes-based regexes as it does seems like they'll only ever be disjointly used.

@vi

vi commented Mar 12, 2016

If compile time is a problem, the feature could be put behind a feature gate in Cargo.

@BurntSushi
Member Author

For debug compile times:

$ touch src/lib.rs 
$ time cargo build
   Compiling regex v0.1.55 (file:///home/andrew/data/projects/rust/regex)

real    0m5.770s
user    0m5.487s
sys     0m0.173s

$ $EDITOR src/*.rs # remove `bytes` module
$ touch src/lib.rs 
$ time cargo build
   Compiling regex v0.1.55 (file:///home/andrew/data/projects/rust/regex)

real    0m5.506s
user    0m5.283s
sys     0m0.130s

For --release times:

$ touch src/lib.rs 
$ time cargo build --release
   Compiling regex v0.1.55 (file:///home/andrew/data/projects/rust/regex)

real    0m16.714s
user    0m16.517s
sys     0m0.090s
$ $EDITOR src/*.rs # remove `bytes` module
$ touch src/lib.rs 
$ time cargo build --release
   Compiling regex v0.1.55 (file:///home/andrew/data/projects/rust/regex)

real    0m16.346s
user    0m16.103s
sys     0m0.140s
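
Taking the `real` figures above at face value, the relative overhead can be worked out with a small helper. This is only a sketch: `overhead_percent` is a hypothetical name, and each figure comes from a single run, so some of the difference may be noise.

```rust
/// Relative overhead of `with_bytes` over `without_bytes`, in percent.
/// Both arguments are wall-clock build times in seconds.
fn overhead_percent(with_bytes: f64, without_bytes: f64) -> f64 {
    (with_bytes - without_bytes) / without_bytes * 100.0
}
```

For the debug build, `overhead_percent(5.770, 5.506)` comes to roughly 4.8%, and for the release build, `overhead_percent(16.714, 16.346)` comes to roughly 2.3%.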

So it looks like the difference is negligible. I'm actually not too surprised by this since the matching engines themselves always match on &[u8], even when using Unicode-only regexes. (It wasn't always this way, but it became more convenient for a variety of reasons with the arrival of the DFA, which really really demands &[u8] matching.) As a result, the only "additional" code this introduces is the API itself, which all told isn't that much. :-) The public API is really just a wrapper around a more flexible internal API.
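
The "thin wrapper" structure described above might look roughly like this. It is a sketch under stated assumptions, not the crate's actual internals: `Engine`, `TextMatcher`, and `BytesMatcher` are hypothetical names, and the engine does a plain literal search rather than real regex matching.

```rust
/// Hypothetical internal engine: it only ever sees raw bytes.
struct Engine {
    literal: Vec<u8>,
}

impl Engine {
    /// Naive literal search standing in for the real matching engines.
    fn is_match(&self, haystack: &[u8]) -> bool {
        // `windows(0)` panics, so an empty literal matches trivially.
        self.literal.is_empty()
            || haystack
                .windows(self.literal.len())
                .any(|w| w == self.literal.as_slice())
    }
}

/// Wrapper for the `&str` API: converts text to bytes and delegates.
struct TextMatcher(Engine);

impl TextMatcher {
    fn new(pattern: &str) -> Self {
        TextMatcher(Engine { literal: pattern.as_bytes().to_vec() })
    }

    fn is_match(&self, text: &str) -> bool {
        self.0.is_match(text.as_bytes())
    }
}

/// Wrapper for the `&[u8]` API: passes the bytes straight through.
struct BytesMatcher(Engine);

impl BytesMatcher {
    fn new(pattern: &[u8]) -> Self {
        BytesMatcher(Engine { literal: pattern.to_vec() })
    }

    fn is_match(&self, bytes: &[u8]) -> bool {
        self.0.is_match(bytes)
    }
}
```

Because both wrappers delegate to the same engine, the second API adds only a small amount of code, which is consistent with the small compile-time difference measured above.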

With that said, I have toyed with the idea of introducing a new crate, maybe called regex-internal or something, but I'm really wishy washy on it (and extremely biased, personally).

> Other than that the concept seems good to me, having a separate API for bytes-based regexes as it does seems like they'll only ever be disjointly used.

Those were my thoughts as well. :-)

@alexcrichton
Member

Awesome, 👍 from me!

BurntSushi added a commit that referenced this pull request Mar 13, 2016
@BurntSushi BurntSushi merged commit 8fbec9a into master Mar 13, 2016
@BurntSushi BurntSushi deleted the raw-bytes branch March 13, 2016 15:05
mcarton added a commit to rust-lang/rust-clippy that referenced this pull request Apr 14, 2016
[rust-lang/regex#183](rust-lang/regex#183) has made the following change that broke the lint:

src/re.rs → src/re_unicode.rs