From c250f8bb0b249eedc0cbb24fbf1238320770fba1 Mon Sep 17 00:00:00 2001
From: Andrew Gallant
Date: Fri, 18 Apr 2014 18:22:37 -0400
Subject: [PATCH] Include information about native regexps.

---
 active/0000-regexps.md | 76 ++++++++++++++++++++----------------------
 1 file changed, 37 insertions(+), 39 deletions(-)

diff --git a/active/0000-regexps.md b/active/0000-regexps.md
index bbbdefb763c..a5a3138b458 100644
--- a/active/0000-regexps.md
+++ b/active/0000-regexps.md
@@ -90,7 +90,9 @@ exposes byte indices.
 ## Word boundaries, word characters and Unicode
 
 At least Python and D define word characters, word boundaries and space
-characters with Unicode character classes. I propose we do the same.
+characters with Unicode character classes. My implementation does the same
+by augmenting the standard Perl character classes `\d`, `\s` and `\w` with
+corresponding Unicode categories.
 
 ## Leftmost-first
 
@@ -147,31 +149,39 @@ an expression.
 ## The `regexp!` macro
 
 With syntax extensions, it's possible to write an `regexp!` macro that compiles
-an expression when a Rust program is compiled. In my case, it seemed simplest
-to compile it to *static* data. For example:
+an expression when a Rust program is compiled. This includes translating the
+matching algorithm to Rust code specific to the expression given. This "ahead
+of time" compiling results in a performance increase. Namely, it elides all
+heap allocation.
 
-    static re: Regexp = regexp!("a*");
+I've called these "native" regexps, whereas expressions compiled at runtime are
+"dynamic" regexps. The public API need not impose this distinction on users,
+other than requiring the use of a syntax extension to construct a native
+regexp. For example:
 
-At first this seemed difficult to accommodate, but it turned out to be
-relatively easy with a type like this:
+    let re = regexp!("a*");
 
-    pub enum MaybeStatic {
-        Dynamic(Vec),
-        Static(&'static [T]),
-    }
+After construction, `re` is indistinguishable from an expression created
+dynamically:
+
+    let re = Regexp::new("a*").unwrap();
+
+In particular, both have the same type. This is accomplished with a
+representation resembling:
 
-Another option is for the `regexp!` macro to produce a non-static value, but I
-found this difficult to do with zero-runtime cost. Either way, the ability to
-statically declare a regexp is pretty cool I think.
+    enum MaybeNative {
+        Dynamic(~[Inst]),
+        Native(fn(MatchKind, &str, uint, uint) -> ~[Option]),
+    }
 
-Note that the syntax extension is the reason for the `regexp_macros` crate. It's
-very small and contains the macro registration function. I'm not sure how this
-fits into the Rust distribution, but my vote is to document the `regexp!` macro
-in the `regexp` crate and hide the `regexp_macros` crate from public
-documentation. (Or link it to the `regexp` crate.)
+This syntax extension requires a second crate, `regexp_macros`, where the
+`regexp!` macro is defined. Technically, this could be provided in the `regexp`
+crate, but this would introduce a runtime dependency on `libsyntax` for any use
+of the `regexp` crate.
 
-It seems like the `regexp!` macro will become a bit nicer to use once
-[#11640](https://github.com/mozilla/rust/issues/11640) is fixed.
+[@alexcrichton
+remarks](https://github.com/rust-lang/rfcs/pull/42#issuecomment-40320112)
+that this state of affairs is a wart that will be corrected in the future.
 
 ## Untrusted input
 
@@ -234,11 +244,7 @@ Finally, it is always possible to persist without a regexp library.
 
 # Unresolved questions
 
-Firstly, I'm not entirely clear on how the `regexp_macros` crate will be handled.
-I gave a suggestion above, but I'm not sure if it's a good one. Is there any
-precedent?
-
-Secondly, the public API design is fairly simple and straight-forward with no
+The public API design is fairly simple and straight-forward with no
 surprises. I think most of the unresolved stuff is how the backend is
 implemented, which should be changeable without changing the public API (sans
 adding features to the syntax).
@@ -247,8 +253,8 @@ I can't remember where I read it, but someone had mentioned defining a *trait*
 that declared the API of a regexp engine. That way, anyone could write their
 own backend and use the `regexp` interface. My initial thoughts are
 YAGNI---since requiring different backends seems like a super specialized
-case---but I'm just hazarding a guess here. (If we go this route, then we'd
-probably also have to expose the regexp parser and AST and possibly the
+case---but I'm just hazarding a guess here. (If we go this route, then we
+might want to expose the regexp parser and AST and possibly the
 compiler and instruction set to make writing your own backend easier. That
 sounds restrictive with respect to making performance improvements in the
 future.)
@@ -263,19 +269,11 @@ For now, we could mark the API as `#[unstable]` or `#[experimental]`.
 
 I think most of the future work for this crate is to increase the performance,
 either by implementing different matching algorithms (e.g., a DFA) or by
-compiling a regular expression to native Rust code.
-
-With regard to native compilation, there are a few notes:
+improving the code generator that produces native regexps with `regexp!`.
 
-* If and when a DFA is implemented, care must be taken, as the size of the code
-  required can grow rapidly.
-* Adding native compilation will very likely change the interface of the crate
-  in a meaningful way, particularly if we want the interface to be consistent
-  between natively compiled and dynamically compiled regexps. (i.e., Make
-  `Regexp` a trait.)
+If and when a DFA is implemented, care must be taken when creating a code
+generator, as the size of the code required can grow rapidly.
 
 Other future work (that is probably more important) includes more Unicode
-support, specifically for simple case folding. Also, words and word boundaries
-should also be Unicode friendly, but I plan to have this done before I submit a
-PR.
+support, specifically for simple case folding.
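
The `MaybeNative` representation added by this patch uses pre-1.0 syntax (`~[T]`, `uint`). As a reading aid, here is a minimal sketch of the same idea in present-day Rust. It is an illustration under stated assumptions, not the code from the patch: `Inst`, `MatchKind`, `run_dynamic`, `native_ab`, and the two constructors are hypothetical stand-ins, `~[T]` is rendered as `Vec<T>`, `uint` as `usize`, and the element type `Option<usize>` is a guess because the patch text shows only `~[Option]`. The "regexps" here only match a literal prefix, which is enough to show that a dynamic and a native program share one type.

```rust
/// Hypothetical instruction set for the "dynamic" (interpreted) program.
#[derive(Debug)]
enum Inst {
    Char(char),
    Match,
}

/// Hypothetical stand-in for the patch's `MatchKind`; ignored by this sketch.
#[derive(Debug, Clone, Copy)]
enum MatchKind {
    Exists,
    Location,
}

/// Assumed modern spelling of the patch's native function pointer type.
type NativeFn = fn(MatchKind, &str, usize, usize) -> Vec<Option<usize>>;

/// One enum holds either instructions to interpret or generated code to call.
enum MaybeNative {
    Dynamic(Vec<Inst>),
    Native(NativeFn),
}

/// Both construction paths produce this same type.
struct Regexp {
    prog: MaybeNative,
}

impl Regexp {
    /// Builds a dynamic program that matches `lit` as a literal prefix.
    fn dynamic_literal(lit: &str) -> Regexp {
        let mut insts: Vec<Inst> = lit.chars().map(Inst::Char).collect();
        insts.push(Inst::Match);
        Regexp { prog: MaybeNative::Dynamic(insts) }
    }

    /// Wraps a function standing in for code a macro would generate.
    fn native(f: NativeFn) -> Regexp {
        Regexp { prog: MaybeNative::Native(f) }
    }

    /// Returns `[Some(start), Some(end)]` byte offsets, or `[None, None]`.
    fn exec(&self, kind: MatchKind, input: &str, start: usize, end: usize) -> Vec<Option<usize>> {
        match &self.prog {
            MaybeNative::Dynamic(insts) => run_dynamic(insts, kind, input, start, end),
            MaybeNative::Native(f) => (*f)(kind, input, start, end),
        }
    }
}

/// Interprets the instruction list against `input[start..end]`.
fn run_dynamic(insts: &[Inst], _kind: MatchKind, input: &str, start: usize, end: usize) -> Vec<Option<usize>> {
    let mut pos = start;
    let mut chars = input[start..end].chars();
    for inst in insts {
        match inst {
            Inst::Char(c) => match chars.next() {
                Some(got) if got == *c => pos += got.len_utf8(),
                _ => return vec![None, None],
            },
            Inst::Match => return vec![Some(start), Some(pos)],
        }
    }
    vec![None, None]
}

/// Hand-written stand-in for expression-specific code matching the prefix "ab".
fn native_ab(_kind: MatchKind, input: &str, start: usize, end: usize) -> Vec<Option<usize>> {
    if input[start..end].starts_with("ab") {
        vec![Some(start), Some(start + 2)]
    } else {
        vec![None, None]
    }
}

fn main() {
    let dynamic = Regexp::dynamic_literal("ab");
    let native = Regexp::native(native_ab);
    // Callers cannot tell the two apart: same type, same method, same result shape.
    for re in [&dynamic, &native] {
        println!("{:?}", re.exec(MatchKind::Location, "abc", 0, 3));
        println!("{:?}", re.exec(MatchKind::Exists, "xbc", 0, 3));
    }
}
```

The design point this illustrates is the one the patch makes: because the variant is hidden inside `Regexp`, the public API does not need a trait or a second type to distinguish native from dynamic expressions.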