diff --git a/active/0000-regexps.md b/active/0000-regexps.md
new file mode 100644
index 00000000000..a5a3138b458
--- /dev/null
+++ b/active/0000-regexps.md
@@ -0,0 +1,279 @@
+- Start Date: 2014-04-12
+- RFC PR #: (leave this empty)
+- Rust Issue #: (leave this empty)
+
+# Summary
+
+Add a `regexp` crate to the Rust distribution in addition to a small
+`regexp_macros` crate that provides a syntax extension for compiling regular
+expressions during the compilation of a Rust program.
+
+The implementation that supports this RFC is ready to receive
+feedback: https://github.com/BurntSushi/regexp
+
+Documentation for the crate can be seen here:
+http://burntsushi.net/rustdoc/regexp/index.html
+
+regex-dna benchmark (vs. Go, Python):
+https://github.com/BurntSushi/regexp/tree/master/benchmark/regex-dna
+
+Other benchmarks (vs. Go):
+https://github.com/BurntSushi/regexp/tree/master/benchmark
+
+(Perhaps the links should be removed if the RFC is accepted, since I can't
+guarantee they will always exist.)
+
+# Motivation
+
+Regular expressions provide a succinct method of matching patterns against
+search text and are frequently used. For example, many programming languages
+include some kind of support for regular expressions in their standard library.
+
+The outcome of this RFC is to include a regular expression library in the Rust
+distribution and resolve issue
+[#3591](https://github.com/mozilla/rust/issues/3591).
+
+# Detailed design
+
+(Note: This is describing an existing design that has been implemented. I have
+no idea how much of this is appropriate for an RFC.)
+
+The first choice that most regular expression libraries make is whether or not
+to include backreferences in the supported syntax, as this heavily influences
+the implementation and the performance characteristics of matching text.
+
+In this RFC, I am proposing a library that closely models Russ Cox's RE2
+(either its C++ or Go variants). This means that features like backreferences
+or generalized zero-width assertions are not supported. In return, we get
+`O(mn)` worst-case performance (with `m` being the size of the search text and
+`n` being the number of instructions in the compiled expression).
+
+My implementation currently simulates an NFA using something resembling the
+Pike VM. Future work could possibly include adding a DFA. (N.B. RE2/C++
+includes both an NFA and a DFA, but RE2/Go only implements an NFA.)
+
+The primary reason why I chose RE2 was that it seemed to be a popular choice in
+issue [#3591](https://github.com/mozilla/rust/issues/3591), and its worst-case
+performance characteristics seemed appealing. I was also drawn to the limited
+set of syntax supported by RE2 in comparison to other regexp flavors.
+
+With that out of the way, there are other things that inform the design of a
+regexp library.
+
+## Unicode
+
+Given the already existing support for Unicode in Rust, this is a no-brainer.
+Unicode literals should be allowed in expressions and Unicode character classes
+should be included (e.g., general categories and scripts).
+
+Case folding is also important for case-insensitive matching. Currently, this
+is implemented by converting characters to their uppercase forms and then
+comparing them. Future work includes applying at least a simple fold, since a
+full fold of one Unicode character can produce multiple characters.
+
+Normalization is another thing to consider, but like most other regexp
+libraries, the one I'm proposing here does not do any normalization. (It seems
+the recommended practice is to do normalization before matching if it's
+needed.)
+
+A nice implementation strategy to support Unicode is to implement a VM that
+matches characters instead of bytes. Indeed, my implementation does this.
+However, the public API of a regular expression library should expose *byte
+indices* corresponding to match locations (which ought to be guaranteed to be
+UTF-8 codepoint boundaries by construction of the VM). My reason for this is
+that byte indices result in a lower cost abstraction. If character indices are
+desired, then a mapping can be maintained by the client at their discretion.
+
+Additionally, this makes it consistent with the `std::str` API, which also
+exposes byte indices.
+
+## Word boundaries, word characters and Unicode
+
+At least Python and D define word characters, word boundaries and space
+characters with Unicode character classes. My implementation does the same
+by augmenting the standard Perl character classes `\d`, `\s` and `\w` with
+corresponding Unicode categories.
+
+## Leftmost-first
+
+As of now, my implementation finds the leftmost-first match. This is consistent
+with PCRE-style regular expressions.
+
+I've pretty much ignored POSIX, but I think it's very possible to add
+leftmost-longest semantics to the existing VM. (RE2 supports this as a
+parameter, but I believe still does not fully comply with POSIX with respect to
+picking the correct submatches.)
+
+## Public API
+
+There are three main questions that can be asked when searching text:
+
+1. Does the string match this expression?
+2. If so, where?
+3. Where are its submatches?
+
+In principle, an API could provide a function to only answer (3). The answers
+to (1) and (2) would immediately follow. However, keeping track of submatches
+is expensive, so it is useful to implement an optimization that doesn't keep
+track of them if it doesn't have to. For example, submatches do not need to be
+tracked to answer questions (1) and (2).
+
+The rabbit hole continues: answering (1) can be more efficient than answering
+(2) because you don't have to keep track of *any* capture groups ((2) requires
+tracking the position of the full match). More importantly, (1) enables early
+exit from the VM. As soon as a match is found, the VM can quit instead of
+continuing to search for greedy expressions.
+
+Therefore, it's worth it to segregate these operations. The performance
+difference can get even bigger if a DFA were implemented (which can answer (1)
+and (2) quickly and even help with (3)). Moreover, most other regular
+expression libraries provide separate facilities for answering each of these
+questions.
+
+Some libraries (like Python's `re` and RE2/C++) distinguish between matching an
+expression against an entire string and matching an expression against part of
+the string. My implementation favors simplicity: matching the entirety of a
+string requires using the `^` and/or `$` anchors. In all cases, an implicit
+`.*?` is added to the beginning and end of each expression evaluated. (This is
+optimized out in the presence of anchors.)
+
+Finally, most regexp libraries provide facilities for splitting and replacing
+text, usually making capture group names available with some sort of `$var`
+syntax. My implementation provides this too. (These are a perfect fit for
+Rust's iterators.)
+
+This basically makes up the entirety of the public API, in addition to perhaps
+a `quote` function that escapes a string so that it may be used as a literal in
+an expression.
+
+## The `regexp!` macro
+
+With syntax extensions, it's possible to write a `regexp!` macro that compiles
+an expression when a Rust program is compiled. This includes translating the
+matching algorithm to Rust code specific to the expression given. This "ahead
+of time" compiling results in a performance increase. Namely, it elides all
+heap allocation.
+
+I've called these "native" regexps, whereas expressions compiled at runtime are
+"dynamic" regexps. The public API need not impose this distinction on users,
+other than requiring the use of a syntax extension to construct a native
+regexp. For example:
+
+    let re = regexp!("a*");
+
+After construction, `re` is indistinguishable from an expression created
+dynamically:
+
+    let re = Regexp::new("a*").unwrap();
+
+In particular, both have the same type. This is accomplished with a
+representation resembling:
+
+    enum MaybeNative {
+        Dynamic(~[Inst]),
+        Native(fn(MatchKind, &str, uint, uint) -> ~[Option<uint>]),
+    }
+
+This syntax extension requires a second crate, `regexp_macros`, where the
+`regexp!` macro is defined. Technically, this could be provided in the `regexp`
+crate, but this would introduce a runtime dependency on `libsyntax` for any use
+of the `regexp` crate.
+
+[@alexcrichton
+remarks](https://github.com/rust-lang/rfcs/pull/42#issuecomment-40320112)
+that this state of affairs is a wart that will be corrected in the future.
+
+## Untrusted input
+
+Given worst-case `O(mn)` time complexity, I don't think it's worth worrying
+about unsafe search text.
+
+Untrusted regular expressions are another matter. It's very easy to exhaust a
+system's resources with nested counted repetitions. For example,
+`((a{100}){100}){100}` tries to create `100^3` instructions. My current
+implementation does nothing to mitigate this, but I think a simple hard
+limit on the number of instructions allowed would work fine. (Should it be
+configurable?)
+
+## Name
+
+The name of the crate being proposed is `regexp` and the type describing a
+compiled regular expression is `Regexp`. I think an equally good name would be
+`regex` (and `Regex`). Either name seems to be frequently used, e.g., "regexes"
+or "regexps" in colloquial use. I chose `regexp` over `regex` because it
+matches the name used for the corresponding package in Go's standard library.
+
+Other possible names are `regexpr` (and `Regexpr`) or something with
+underscores: `reg_exp` (and `RegExp`). However, I perceive these to be more
+ugly and less commonly used than either `regexp` or `regex`.
+
+Finally, we could use `re` (like Python), but I think the name could be
+ambiguous since it's so short. `regexp` (or `regex`) unequivocally identifies
+the crate as providing regular expressions.
+
+For consistency's sake, I propose that the syntax extension provided be named
+the same as the crate. So in this case, `regexp!`.
+
+## Summary
+
+My implementation is pretty much a port of most of RE2. The syntax should be
+identical or almost identical. I think matching an existing (and popular)
+library has benefits, since it will make it easier for people to pick it up and
+start using it. There will also be (hopefully) fewer surprises. There is also
+plenty of room for performance improvement by implementing a DFA.
+
+# Alternatives
+
+I think the single biggest alternative is to provide a backtracking
+implementation that supports backreferences and generalized zero-width
+assertions. I don't think my implementation precludes this possibility. For
+example, a backtracking approach could be implemented and used only when
+features like backreferences are invoked in the expression. However, this gives
+up the blanket guarantee of worst-case `O(mn)` time. I don't think I have the
+wisdom required to voice a strong opinion on whether this is a worthwhile
+endeavor.
+
+Another alternative is using a binding to an existing regexp library. I think
+this was discussed in issue
+[#3591](https://github.com/mozilla/rust/issues/3591) and it seems like people
+favor a native Rust implementation if it's to be included in the Rust
+distribution. (Does the `regexp!` macro require it? If so, that's a huge
+advantage.) Also, a native implementation makes it maximally portable.
+
+Finally, it is always possible to persist without a regexp library.
+
+# Unresolved questions
+
+The public API design is fairly simple and straightforward with no
+surprises. I think most of the unresolved stuff is how the backend is
+implemented, which should be changeable without changing the public API (sans
+adding features to the syntax).
+
+I can't remember where I read it, but someone had mentioned defining a *trait*
+that declared the API of a regexp engine. That way, anyone could write their
+own backend and use the `regexp` interface. My initial thoughts are
+YAGNI---since requiring different backends seems like a super specialized
+case---but I'm just hazarding a guess here. (If we go this route, then we
+might want to expose the regexp parser and AST and possibly the
+compiler and instruction set to make writing your own backend easier. That
+sounds restrictive with respect to making performance improvements in the
+future.)
+
+I personally think there's great value in keeping the standard regexp
+implementation small, simple and fast. People who have more specialized needs
+can always pick one of the existing C or C++ libraries.
+
+For now, we could mark the API as `#[unstable]` or `#[experimental]`.
+
+# Future work
+
+I think most of the future work for this crate is to increase the performance,
+either by implementing different matching algorithms (e.g., a DFA) or by
+improving the code generator that produces native regexps with `regexp!`.
+
+If and when a DFA is implemented, care must be taken when creating a code
+generator, as the size of the code required can grow rapidly.
+
+Other future work (that is probably more important) includes more Unicode
+support, specifically for simple case folding.
+
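The RFC's Unicode section says case-insensitive matching currently compares uppercase forms, and its future work asks for simple case folding. The following is a minimal sketch of why the two can differ. It is written in present-day Rust rather than the 2014 dialect the RFC uses, it is not taken from the proposed implementation, and the helper names `eq_ignore_case_upper` and `uppercase_len` are hypothetical.

```rust
/// Hypothetical helper: case-insensitive comparison done the way the RFC
/// describes the current behavior, by uppercasing both characters.
fn eq_ignore_case_upper(a: char, b: char) -> bool {
    // `char::to_uppercase` returns an iterator because one character can
    // uppercase to several characters, so the two iterators are compared
    // element by element.
    a.to_uppercase().eq(b.to_uppercase())
}

/// Hypothetical helper: how many characters a single character expands to
/// when it is uppercased.
fn uppercase_len(c: char) -> usize {
    c.to_uppercase().count()
}
```

With these definitions, `eq_ignore_case_upper('k', 'K')` holds, but `eq_ignore_case_upper('k', '\u{212A}')` (KELVIN SIGN) does not: the Kelvin sign is already an uppercase letter and maps to itself, whereas Unicode simple case folding maps both characters to `k`. `uppercase_len('ß')` is 2, which is the kind of multi-character expansion that a one-to-one simple fold avoids.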