add a regexp crate to the Rust distribution #42

Merged
merged 12 commits into from
Apr 22, 2014

BurntSushi
Member

Links to an existing implementation, documentation and benchmarks are in the RFC. This RFC is meant to resolve issue #3591.

I apologize in advance if I've made any amateur mistakes. I'm still fairly new to the Rust world (~1 month), so I'm sure I still have some misunderstandings about the language lurking somewhere.

A nice implementation strategy to support Unicode is to implement a VM that
matches characters instead of bytes. Indeed, my implementation does this.
However, the public API of a regular expression library should expose *byte
indices* corresponding to match locations (which ought to be guaranteed to be
Member

(The APIs in std::str expose byte indices too, so this is well supported in Rust-land.)
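The relationship between character-wise matching and byte-indexed results can be illustrated with a small sketch. It is written in present-day Rust rather than the 2014 dialect used in this thread, and `find_run` is a hypothetical helper, not part of the proposed crate. It walks the text one character at a time but reports byte offsets, which are always valid slice boundaries.

```rust
/// Finds the first run of characters satisfying `pred` and returns its
/// byte range `(start, end)` within `text`.
/// Hypothetical helper for illustration; not part of the proposed crate.
fn find_run(text: &str, pred: impl Fn(char) -> bool) -> Option<(usize, usize)> {
    let mut start = None;
    // `char_indices` walks the text one Unicode scalar value at a time
    // while reporting the byte offset at which each one starts.
    for (byte_pos, ch) in text.char_indices() {
        match (start, pred(ch)) {
            // First matching character: remember where the run begins.
            (None, true) => start = Some(byte_pos),
            // First non-matching character after a run: the run ends here.
            (Some(s), false) => return Some((s, byte_pos)),
            _ => {}
        }
    }
    // A run that extends to the end of the text ends at `text.len()`.
    start.map(|s| (s, text.len()))
}
```

Because the returned offsets always fall on character boundaries, `&text[start..end]` never panics, even when multi-byte characters precede the match.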

Member Author

Nice catch! Fixed.

found this difficult to do with zero-runtime cost. Either way, the ability to
statically declare a regexp is pretty cool I think.

Note that the syntax extension is the reason for the `regexp_re` crate. It's
Member

We probably should have a convention for crates and their syntax extension pairs, e.g. for a crate foo, have foo_macros or foo_synext or something. (I'd personally be ok with foo_macros, e.g. regexp_macros in this case.)

Member Author

I like foo_macros too. (I'll change this once there's a consensus?)

Member

I've used foo_mac but foo_macros seems fine.

Member

In the future all that will be necessary is #[phase(syntax, link)] extern crate regexp;. The compiler will automatically dynamically load the appropriate syntax extension crate, and then it will link to the target crate. Note that this is all far off, and is another reason why phase is feature gated.

Essentially, I wouldn't worry too much about the name.

Member Author

OK. I changed the name for now to regexp_macros. Even if it isn't necessary, I think it's probably a better name on its own than regexp_re. Happy to comply with anything though.

[#3591](https://github.com/mozilla/rust/issues/3591) and it seems like people
favor a native Rust implementation if it's to be included in the Rust
distribution. (Does the `re!` macro require it? If so, that's a huge
advantage.)
Member

Another small downside of binding to an existing library is that it's not necessarily as portable as rust code. Libraries written in rust are maximally portable because they'll go wherever rust goes.

Member Author

Ah, right. Fixed.

@alexcrichton
Member

This looks amazing, fantastic work!

@chris-morgan
Member

@alexcrichton:

In the future all that will be necessary is #[phase(syntax, link)] extern crate regexp;. The compiler will automatically dynamically load the appropriate syntax extension crate, and then it will link to the target crate. Note that this is all far off, and is another reason why phase is feature gated.

But doesn't #[phase(syntax, link)] extern crate regexp; work? That's the recommended invocation for log, and how it is being used.

I certainly don't want any public _macros convention—I want it to Just Work™. (Ideally without needing to specify #[phase] at all, but I'll live with it for the moment.)

@huonw
Member

huonw commented Apr 13, 2014

If a procedural* macro is defined in a crate, then it will want to pull in syntax as a (dynamic) dependency. Hence having the procedural macro defined in the regexp crate will mean everything that uses regexp will need syntax at runtime, which is entirely unacceptable. Having it in a separate crate allows you to depend on that crate only at compile time, with no runtime effect (there are possibly bugs with this ATM).

*It doesn't affect liblog because all its macros are macro_rules, which don't need libsyntax to be linked in.
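For contrast with a procedural macro, here is a minimal sketch of a declarative `macro_rules` macro in present-day Rust. The `tagged!` macro is hypothetical and only loosely modelled on a logging macro. It is expanded purely by pattern substitution, so the crate that defines it needs no compiler library at runtime.

```rust
/// A declarative macro: the compiler expands it by substituting the
/// matched tokens into the template, with no user code run at compile time.
/// `tagged!` is a hypothetical example, not a macro from liblog.
macro_rules! tagged {
    ($tag:expr, $($arg:tt)*) => {
        // Format the message first, then prefix it with the tag.
        format!("[{}] {}", $tag, format!($($arg)*))
    };
}
```

With this definition, `tagged!("warn", "{} retries", 3)` evaluates to the string `[warn] 3 retries`.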

@chris-morgan
Member

@huonw That should be able to be fixed in dead code removal for link time optimisation (would it in practice be?) but I get the point now. Thanks for the explanation.

@chris-morgan
Member

A couple of other ideas that I have had with regards to regular expressions are:

  • Truly compiled regular expressions: as in, no dependency on a regexp library at runtime at all, but rather expanding it to approximately what a person might have written by hand without a regular expressions library.
  • Create anonymous structs for matches, with direct field access (or indexed access) for groups.

I would expect that these would lead to somewhat larger compiled code, but to code that should run more efficiently. I'm not sure if it's a good trade-off or not.

Anyway, these lead to something like this:

re!(FancyIdentifier, r"^(?P<letters>[a-z]+)(?P<numbers>[0-9]+)?$")

expanding to something approximating this, plus quite a bit more (I recognise that it isn't a valid expansion in a static value and has various other issues, but it gives the general idea of what I think would be really nice):

struct FancyIdentifier<'a> {
    all: &'a str,
    letters: &'a str,
    numbers: Option<&'a str>,
}

impl<'a> Index<uint, Option<&'a str>> for FancyIdentifier<'a> {
    fn index(&'a self, index: &uint) -> Option<&'a str> {
        if *index == 0u {
            Some(self.all)
        } else if *index == 1u {
            Some(self.letters)
        } else if *index == 2u {
            self.numbers
        } else {
            fail!("no such group {}", *index);
        }
    }
}

impl<'a> FancyIdentifier<'a> {
    pub fn captures<'t>(text: &'t str) -> Option<FancyIdentifier<'t>> {
        let mut chars = text.chars();
        loop {
            // go through, byte/char by byte/char, keeping track of position
            if b < 'a' || b > 'z' {
                return None;
            }
        }
        loop {
            // … get numbers in much the same way …
        }
        Some(FancyIdentifier {
            all: text,
            letters: letters,
            numbers: numbers,
        })
    }
}

This allows nicer usage:

let foo12 = FancyIdentifier::captures("foo12").unwrap();
assert_eq!(foo12.letters, "foo");
assert_eq!(foo12.numbers, Some("12"));
assert_eq!(foo12[0], Some("foo12"));
assert_eq!(foo12[2], Some("12"));

I expect this would be rather difficult to implement, too. Still, just thought I'd toss the idea into the ring as I haven't seen it suggested, but it's been sitting in my mind the whole time the discussion has gone on.
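To make the idea concrete, here is a hand-written matcher for the same pattern, roughly what a "truly compiled" expansion might look like. This is a minimal sketch in present-day Rust, not the output of any actual macro. The `group` method is a hypothetical replacement for the `Index` impl and returns `None` for unknown groups instead of failing.

```rust
/// Captures for `^(?P<letters>[a-z]+)(?P<numbers>[0-9]+)?$`.
#[derive(Debug, PartialEq)]
struct FancyIdentifier<'t> {
    all: &'t str,
    letters: &'t str,
    numbers: Option<&'t str>,
}

impl<'t> FancyIdentifier<'t> {
    fn captures(text: &'t str) -> Option<FancyIdentifier<'t>> {
        let bytes = text.as_bytes();
        // `[a-z]+`: at least one lowercase ASCII letter.
        let letters_end = bytes.iter().take_while(|b| b.is_ascii_lowercase()).count();
        if letters_end == 0 {
            return None;
        }
        // `([0-9]+)?`: an optional run of ASCII digits.
        let digits = bytes[letters_end..]
            .iter()
            .take_while(|b| b.is_ascii_digit())
            .count();
        let numbers_end = letters_end + digits;
        // `$`: nothing may follow the digits.
        if numbers_end != bytes.len() {
            return None;
        }
        // Only ASCII bytes were consumed, so these offsets are char boundaries.
        Some(FancyIdentifier {
            all: text,
            letters: &text[..letters_end],
            numbers: if digits > 0 {
                Some(&text[letters_end..numbers_end])
            } else {
                None
            },
        })
    }

    /// Indexed access to groups; unknown indices yield `None`.
    fn group(&self, index: usize) -> Option<&'t str> {
        match index {
            0 => Some(self.all),
            1 => Some(self.letters),
            2 => self.numbers,
            _ => None,
        }
    }
}
```

No regexp library is involved at runtime: the pattern has been turned into two linear scans and a length check.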

case---but I'm just hazarding a guess here. (If we go this route, then we'd
probably also have to expose the regexp parser and AST and possibly the
compiler and instruction set to make writing your own backend easier. That
sounds restrictive with respect to making performance improvements in the
Member

We could expose it as an #[unstable] or even #[experimental] interface: i.e. subject to change, but it's possible to use if you really need it.

@BurntSushi
Member Author

@chris-morgan That's a really interesting idea. I hadn't thought of it. (I spoke with @eddyb about it on IRC.)

I think it's something worth trying and could potentially increase performance dramatically, but I also think it's complex enough that it should be thrown in the bin of future work. @eddyb and I both agree that it would require a specialization of the Pike VM, which I think is doable (without allocation even). A more naive implementation is difficult because (I think) it would rely on recursion for handling non-determinism, which would pretty easily result in stack overflows. (In fact, most of the complexity of the Pike VM is a direct result of manually managing a queue of states. Any recursion in the VM is strictly bounded by the number of instructions in the regexp.)

If this sounds OK to you, I'll add it to the RFC as possible future work. (Since it may require an API change, I think the #[unstable] and #[experimental] would make that OK.)
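For readers unfamiliar with the Pike VM, the following is a minimal sketch of the thread-list simulation, in present-day Rust. It is not the implementation from this RFC: the instruction set is a hypothetical, pared-down one without capture groups, and it only answers whether the whole text matches. It does show the two properties mentioned: states are kept in explicit lists instead of on the call stack, and the only recursion (`add_thread`) is bounded by the program length because each instruction is visited at most once per step.

```rust
/// A tiny instruction set in the style of a Pike VM (illustrative only).
#[derive(Clone, Copy)]
enum Inst {
    Char(char),          // consume one specific character
    Split(usize, usize), // continue at both targets (non-determinism)
    Jmp(usize),          // continue at the target
    Match,               // the pattern has matched
}

/// Follows `Jmp` and `Split` from `pc`, pushing every reachable `Char` or
/// `Match` instruction onto `list`. `on_list` marks visited instructions,
/// so the recursion depth never exceeds `prog.len()`.
fn add_thread(prog: &[Inst], list: &mut Vec<usize>, on_list: &mut [bool], pc: usize) {
    if on_list[pc] {
        return;
    }
    on_list[pc] = true;
    match prog[pc] {
        Inst::Jmp(target) => add_thread(prog, list, on_list, target),
        Inst::Split(a, b) => {
            add_thread(prog, list, on_list, a);
            add_thread(prog, list, on_list, b);
        }
        Inst::Char(_) | Inst::Match => list.push(pc),
    }
}

/// Returns true if `prog` matches all of `text` (anchored at both ends).
fn is_match(prog: &[Inst], text: &str) -> bool {
    // Both lists are allocated once, up front, and reused for every character.
    let mut clist = Vec::with_capacity(prog.len());
    let mut nlist = Vec::with_capacity(prog.len());
    let mut on_list = vec![false; prog.len()];
    add_thread(prog, &mut clist, &mut on_list, 0);
    for ch in text.chars() {
        on_list.fill(false);
        // Advance every live thread over `ch` in lock step.
        for &pc in &clist {
            if let Inst::Char(expected) = prog[pc] {
                if expected == ch {
                    add_thread(prog, &mut nlist, &mut on_list, pc + 1);
                }
            }
        }
        std::mem::swap(&mut clist, &mut nlist);
        nlist.clear();
        if clist.is_empty() {
            return false; // every thread died
        }
    }
    clist.iter().any(|&pc| matches!(prog[pc], Inst::Match))
}
```

As an example, the pattern `a+b` can be encoded as `[Char('a'), Split(0, 2), Char('b'), Match]`. Each input character is processed once, and the work per character is bounded by the number of instructions.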

@alexcrichton
Member

@chris-morgan, sadly #[phase(syntax, link)] extern crate regexp; will not work because this is using procedural macros rather than macro_rules macros (two separate systems).

@sfackler
Member

@alexcrichton well, it'll work, but introduce a runtime dependency on libsyntax :P

@brendanzab
Member

@alexcrichton Is that a current wart that should be fixed, or will that remain the same?

@alexcrichton
Member

Yes, that is the wart that will be fixed. In the future world, there will be no need to manually compile two crates, and there will be no runtime dependency on libsyntax, and the syntax will be #[phase(syntax, link)] extern crate my_crate_with_syntax_extensions;

@brendanzab
Member

@BurntSushi Awesome work. Have you by any chance seen D's compile time regex with templates? (Scroll down to "Regular Expression Compiler"). I'm guessing you are probably using a very different method to statically compile things though, but it's still interesting.

Also, could you explain why you chose the specific identifier for your library and types? Here are some choices you could have made:

  • Re, libre: too ambiguous? but consistent with re!
  • Regex, libregex: the shortening most people use in conversation
  • Regexp, libregexp: current proposal
  • Regexpr, libregexpr: rust uses expr in the macro_rules thing - more consistent maybe?
  • RegExp, libreg_exp: might be more consistent with the accepted identifier style
  • RegExpr, libreg_expr: see above

We could bikeshed this forever, but I do think it deserves at least some passing consideration before we pull the trigger.

@BurntSushi
Member Author

@bjz Thanks! I did look at D's regexes this morning. From what I understand D provides something similar to my current re! macro, which compiles a regexp at compile time but still relies on a general implementation to do matching. The example I found here (toward the bottom) is: static r = regex("Boo-hoo");. D also supports compiling a regexp to native code with ctRegex, which is what @chris-morgan suggested above. I'm not exactly sure why they are separate in the public API though.

Also, for the name, I didn't put much thought into it. If you pressed me, I'd say I used it simply because that's the name of the package in Go's standard library. (Which isn't that good of a reason.)

re is what Python uses, but I agree with you that it might be too ambiguous.

I'd also be happy with Regex and libregex. I'm less a fan of the other suggestions, just because they look more ugly to me. Also, I think people tend to refer to them as either "regexes" or "regexps" rather than "regexprs", so maybe that's another reason to stick with regex/regexp.

@BurntSushi
Member Author

Meta: should the RFC include a discussion/justification of the name?

@brendanzab
Member

Regarding D, cool to hear your impressions! A while back I heard Andrei make some very bold claims about D's regex performance compared to other libs, and I wonder how yours would compare. I realise however that this is an RFC regarding the public API, and the internals could be improved later.

I would make a mention of the naming in the RFC – I think it is important to show you have considered alternatives and precedents rather than jumping on the first one that came to mind. The bike shedding is inevitable (and sometimes necessary), but at least it helps to focus the debate.

@BurntSushi
Member Author

@bjz RE D: Yeah, I think it would be very exciting to see what @chris-morgan's suggestion would do to performance. That along with implementing a DFA are two major optimizations for future work. (Along with a few other minor ones, like a one-pass NFA.)

I've added some stuff about the name to the RFC.

@BurntSushi
Member Author

@bjz I agree. I would actually prefer that it be called regex! or regexp! (whatever the crate name is I guess). I called it re! because that's what people had been writing (when talking about a hypothetical macro).

@chris-morgan
Member

I personally prefer crate re and macro re!. But then, I come from a Python background, so don't trust me.

@chris-morgan
Member

Google Trends for regexpr, regexp and regex:

  • "regexpr" is basically never used;
  • "regexp" is steadily declining in usage;
  • "regex" has been the preferred form for at least ten years.

The `\w` character class and the zero-width word boundary assertion `\b` are
defined in terms of the ASCII character set. I'm not aware of any
implementation that defines these in terms of proper Unicode character classes.
Do we want to be the first?
Contributor

\w and \d and \s all default to Unicode under Python 3. So there's a little bit of precedent.

Member Author

Ah! I actually think D also does it. I'd say that's probably enough precedent to go with Unicode. (For word boundaries too, I think.)
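The practical difference between the ASCII and Unicode definitions can be seen in a small sketch, written in present-day Rust. The helper names are hypothetical, and `char::is_alphanumeric` is only an approximation of a Unicode-aware `\w`, which covers further categories such as combining marks. A word boundary is any position where the "word-ness" of the neighbouring characters differs.

```rust
/// ASCII-only `\w`: `[0-9A-Za-z_]`.
fn is_word_ascii(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Rough approximation of a Unicode-aware `\w` (an assumption for this
/// sketch; a full definition includes more character categories).
fn is_word_unicode(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Is there a `\b` at byte offset `at`? True when exactly one of the
/// characters before and after the offset is a word character.
/// `at` must lie on a character boundary, otherwise slicing panics.
fn is_word_boundary(text: &str, at: usize, is_word: fn(char) -> bool) -> bool {
    let before = text[..at].chars().next_back().map_or(false, is_word);
    let after = text[at..].chars().next().map_or(false, is_word);
    before != after
}
```

With the ASCII definition, the text `né` has a word boundary between `n` and `é`, in the middle of what a reader would call one word. With the Unicode-aware definition, that boundary disappears and one appears at the end of the text instead.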

@lambda-fairy
Contributor

DFA compilation would be great, though probably as an option. The main advantage is performance: it matches in O(n) time and O(1) memory (zero allocations!). It has no runtime dependencies. Plus, since it's effectively a finite state machine, it's straightforward to translate to LLVM.

It's not a free lunch though -- a DFA matcher has worst case exponential code size, which can make it impractical for complex expressions.

If a DFA compiler is implemented, we can either tuck it under a flag, or enable it by default but fall back if the code becomes too large. Either way, I think finishing Unicode support (especially case folding) is a higher priority.
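The claim about O(n) time and O(1) memory is easiest to see in a hand-built example. The following is a minimal sketch in present-day Rust of a DFA for the pattern `^[0-9]+(\.[0-9]+)?$`. The pattern and all names are chosen purely for illustration and are not produced by any compiler discussed here.

```rust
/// States of a hand-built DFA for `^[0-9]+(\.[0-9]+)?$`.
#[derive(Clone, Copy, PartialEq)]
enum State {
    Start, // nothing consumed yet
    Int,   // inside the integer part (accepting)
    Dot,   // just consumed the '.'
    Frac,  // inside the fractional part (accepting)
    Dead,  // no match is possible any more
}

/// The transition function: exactly one next state per (state, byte) pair.
fn step(state: State, byte: u8) -> State {
    match (state, byte) {
        (State::Start, b'0'..=b'9') | (State::Int, b'0'..=b'9') => State::Int,
        (State::Int, b'.') => State::Dot,
        (State::Dot, b'0'..=b'9') | (State::Frac, b'0'..=b'9') => State::Frac,
        _ => State::Dead,
    }
}

/// Runs the DFA over `text`: one transition per byte, a single `State`
/// value of memory, and no heap allocation.
fn dfa_is_match(text: &str) -> bool {
    let mut state = State::Start;
    for &byte in text.as_bytes() {
        state = step(state, byte);
        if state == State::Dead {
            return false;
        }
    }
    matches!(state, State::Int | State::Frac)
}
```

The cost shows up in `step`: this pattern needs five states, but in the worst case the number of states grows exponentially with the size of the pattern.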

(As for the name, I vote for regex. The 'p' in regexp doesn't add anything semantically; we might as well take it out.)

@BurntSushi
Member Author

@lfairy Note that a compiled NFA should also have zero (heap) allocations. (The generalized NFA simulator has O(m) (heap) space complexity, where m is the number of instructions in the regexp.)

I believe RE2/C++ tries to use a DFA and falls back to an NFA if the state cache gets flushed too frequently (I think).

@huonw
Member

huonw commented Apr 14, 2014

Maybe the DFA approach could be performed by hooking up Ragel to generate Rust AST (this may be tricky). (cc https://github.com/erickt/ragel, which is generating Rust code as text.)

@seanmonstar
Contributor

I find it slightly odd that Regexp::new() returns a Result. I've come
to assume that new() will always return that object, and that it's safe to
do so.

Would Regexp::compile(str) -> Result feel nicer?

@lambda-fairy
Contributor

@BurntSushi The VM does allocate, to create a list of running threads -- but given the allocation only happens once, I see where you're coming from.

I believe RE2/C++ tries to use a DFA and falls back to an NFA if the state cache gets flushed too frequently (I think).

That sounds right.

@huonw Thanks for the link. I suspect people who need the performance/expressiveness of Ragel would use it directly though.

This leaves the DFA approach in an awkward spot, methinks -- simple cases work well with re2-style state caching, and advanced cases can use a lexer generator or something magical like Ragel. Looks like Russ Cox had it right all along ;)

@huonw
Member

huonw commented Apr 14, 2014

I find it slightly odd that Regexp::new() returns a Result. I've come to assume that new() will always return that object, and that it's safe to do so. Would Regexp::compile(str) -> Result feel nicer?

The types mean that you can never accidentally use the return of ::new() incorrectly (I'm personally fine with using new for this reason: strong types).
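A minimal sketch of the "strong types" argument, in present-day Rust: `Pattern` and `ParseError` are hypothetical stand-ins, and the only validation done here is a parenthesis balance check. The point is that the caller cannot reach a `Pattern` without first handling the `Result`.

```rust
/// Hypothetical stand-in for a compiled regular expression.
#[derive(Debug)]
struct Pattern {
    source: String,
}

/// Hypothetical error type carrying the byte offset of the problem.
#[derive(Debug, PartialEq)]
struct ParseError {
    pos: usize,
    msg: &'static str,
}

impl Pattern {
    /// Fallible constructor. Only checks that parentheses are balanced;
    /// a real parser would do far more.
    fn new(source: &str) -> Result<Pattern, ParseError> {
        let mut depth = 0usize;
        for (pos, ch) in source.char_indices() {
            match ch {
                '(' => depth += 1,
                ')' if depth == 0 => {
                    return Err(ParseError { pos, msg: "unopened group" });
                }
                ')' => depth -= 1,
                _ => {}
            }
        }
        if depth > 0 {
            return Err(ParseError { pos: source.len(), msg: "unclosed group" });
        }
        Ok(Pattern { source: source.to_string() })
    }

    /// The original pattern text.
    fn as_str(&self) -> &str {
        &self.source
    }
}
```

Calling `Pattern::new("(a")` yields an `Err`, and any attempt to call `as_str` on the result without unwrapping or matching it is rejected at compile time.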

I suspect people who need the performance/expressiveness of Ragel would use it directly though.

That doesn't preclude using Ragel just as a step in an efficient regex -> native code translator. (That is, writing a regex syntax -> ragel syntax translator and an output-to-Rust-AST mode for ragel may be easier than writing a direct regex syntax -> Rust-AST translator that results in equally good code. Of course, adding a ragel dependency to the core distribution would be a no-go.)

@BurntSushi
Member Author

I've been working on what @chris-morgan suggested: real compilation to native Rust with the re! macro. I'm sure there are more performance gains to be had, but I'm at a reasonable place right now. One bummer is that it can no longer be declared statically. Instead, it can be used anywhere an expression can be used. But, it is also indistinguishable from a regexp compiled at runtime. Internally, the representation looks like this:

pub enum MaybeNative {
    Dynamic(~[Inst]),
    Native(fn(MatchKind, &str, uint, uint) -> ~[Option<uint>]),
}

This makes runtime and compiled regexps have a completely identical API. It means it might not be as nice as the API that @chris-morgan suggested, but I think a consistent API between both is probably more valuable.
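The same idea can be sketched in present-day Rust. This is a heavily simplified illustration, not the actual implementation: the instruction set has a single hypothetical variant, and the native function signature is reduced to `fn(&str) -> bool` instead of the capture-returning signature shown in the enum. What it preserves is the shape of the design: callers see one `Regexp` type and one `is_match` method regardless of how the pattern was built.

```rust
/// Hypothetical, minimal instruction: match one literal character.
enum Inst {
    Char(char),
}

/// Either a program to interpret or a function generated at compile time.
enum MaybeNative {
    Dynamic(Vec<Inst>),
    Native(fn(&str) -> bool),
}

/// The single public type; callers cannot tell which variant is inside.
struct Regexp {
    prog: MaybeNative,
}

impl Regexp {
    /// Identical API for both kinds of regexp: dispatch happens here.
    fn is_match(&self, text: &str) -> bool {
        match &self.prog {
            MaybeNative::Dynamic(insts) => run_dynamic(insts, text),
            MaybeNative::Native(f) => f(text),
        }
    }
}

/// Interpreter for the dynamic case: an anchored literal prefix match.
fn run_dynamic(insts: &[Inst], text: &str) -> bool {
    let mut chars = text.chars();
    insts.iter().all(|inst| match inst {
        Inst::Char(c) => chars.next() == Some(*c),
    })
}

/// What a macro might emit for the literal `ab`: a plain function.
fn native_ab(text: &str) -> bool {
    text.starts_with("ab")
}
```

A `Regexp` holding `MaybeNative::Native(native_ab)` and one holding `MaybeNative::Dynamic(vec![Inst::Char('a'), Inst::Char('b')])` give the same answers through the same method.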

The doco for the updated code is at a different URL. There are some examples using the new macro: http://burntsushi.net/rustdoc/exp/regexp/index.html (But other portions of the doco are rightfully unchanged.)

The benchmarks are also very encouraging. Left column is for dynamic regexps and the right column is for natively compiled regexps.

literal                                 422 ns/iter (+/- 2)                     120 ns/iter (+/- 17)              
not_literal                            1904 ns/iter (+/- 7)                     931 ns/iter (+/- 642)
match_class                            2452 ns/iter (+/- 6)                    1276 ns/iter (+/- 336)
match_class_in_range                   2559 ns/iter (+/- 91)                   1298 ns/iter (+/- 433)
replace_all                            5221 ns/iter (+/- 529)                  1216 ns/iter (+/- 648)
anchored_literal_short_non_match        939 ns/iter (+/- 10)                    420 ns/iter (+/- 168)
anchored_literal_long_non_match        8979 ns/iter (+/- 64)                   5407 ns/iter (+/- 1982)
anchored_literal_short_match            576 ns/iter (+/- 5)                     126 ns/iter (+/- 92)
anchored_literal_long_match             553 ns/iter (+/- 8)                     150 ns/iter (+/- 103)
one_pass_short_a                       2039 ns/iter (+/- 20)                   1036 ns/iter (+/- 397)
one_pass_short_a_not                   2698 ns/iter (+/- 8)                    1365 ns/iter (+/- 623)
one_pass_short_b                       1457 ns/iter (+/- 14)                    710 ns/iter (+/- 495)
one_pass_short_b_not                   2037 ns/iter (+/- 13)                    974 ns/iter (+/- 552)
one_pass_long_prefix                   1188 ns/iter (+/- 7)                     383 ns/iter (+/- 117)
one_pass_long_prefix_not               1217 ns/iter (+/- 7)                     344 ns/iter (+/- 196)
easy0_32                                564 ns/iter (+/- 14) = 56 MB/s           44 ns/iter (+/- 12) = 727 MB/s
easy0_1K                               2389 ns/iter (+/- 167) = 428 MB/s       1903 ns/iter (+/- 390) = 538 MB/s
easy0_32K                             59404 ns/iter (+/- 882) = 551 MB/s      59889 ns/iter (+/- 34128) = 547 MB/s
easy1_32                                543 ns/iter (+/- 145) = 58 MB/s          55 ns/iter (+/- 58) = 581 MB/s
easy1_1K                               3495 ns/iter (+/- 829) = 292 MB/s       1629 ns/iter (+/- 601) = 628 MB/s
easy1_32K                             92901 ns/iter (+/- 5203) = 352 MB/s     48938 ns/iter (+/- 8302) = 669 MB/s
medium_32                              1611 ns/iter (+/- 61) = 19 MB/s          526 ns/iter (+/- 60) = 60 MB/s
medium_1K                             33457 ns/iter (+/- 621) = 30 MB/s       14541 ns/iter (+/- 5849) = 70 MB/s
medium_32K                          1044635 ns/iter (+/- 19853) = 31 MB/s    472571 ns/iter (+/- 177623) = 69 MB/s
hard_32                                2447 ns/iter (+/- 129) = 13 MB/s        1025 ns/iter (+/- 516) = 31 MB/s
hard_1K                               54844 ns/iter (+/- 297) = 18 MB/s       30248 ns/iter (+/- 11665) = 33 MB/s
hard_32K                            1744529 ns/iter (+/- 40267) = 18 MB/s    993564 ns/iter (+/- 455100) = 32 MB/s
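The MB/s figures in the table appear to follow from the ns/iter figures, assuming the `_32`, `_1K` and `_32K` suffixes denote input sizes of 32, 1024 and 32768 bytes. That is an inference from the numbers, not something stated in the thread. A minimal sketch of the arithmetic, in present-day Rust with hypothetical function names:

```rust
/// Throughput in MB/s from an input size and a time per iteration:
/// bytes / ns = GB/s, so multiply by 1000 and truncate to get MB/s.
fn mb_per_sec(bytes: u64, ns_per_iter: u64) -> u64 {
    bytes * 1000 / ns_per_iter
}

/// How many times faster the native version is than the dynamic one.
fn speedup(dynamic_ns: u64, native_ns: u64) -> f64 {
    dynamic_ns as f64 / native_ns as f64
}
```

For example, `mb_per_sec(32, 564)` gives 56 and `mb_per_sec(32, 44)` gives 727, matching the `easy0_32` row, while `speedup(422, 120)` for the `literal` row is a little over 3.5.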

There's also a similarly big jump in performance on the regex-dna benchmark. Old. New.

@BurntSushi
Member Author

I've updated the RFC to use natively compiled regexps and simplified some sections based on discussion here. And the implementation is now Unicode friendly for Perl character classes and word boundaries.

Aside from the name of the crate (how is that decided?), I think I've incorporated all feedback given.

@sfackler
Member

How does the codegen size compare between the old and new syntax extension implementations? Will a binary with a lot of regexes need to avoid native compilation because it would bloat the binary too much?

@seanmonstar
Contributor

Indeed, I was thinking similarly: if I use regexp! several times, won't it generate a lot of redundant code? I imagine the repeated part could be put into a regexp::native module, and the macro could just expand to calls to those functions with the expanded values.

@BurntSushi
Member Author

@sfackler I don't (yet) have a comparison with the old regexp! macro, but with native compilation, my test binary with roughly 434 regexps (many of which are pretty big) is 17MB compiled without optimization. Compiled with -O, the binary shrinks to 6.7MB. Compiled with --opt-level=3 -Z lto, the binary is 5.3MB.

As a baseline, if I compile test using dynamic regexps, then the binary sizes are 6MB, 4.3MB and 2.7MB, respectively.

These sizes seem pretty reasonable to me, since I think 400+ regexps is a pretty extreme case.

@seanmonstar That is indeed possible and I'm already doing it for some pieces. There is more that could be done though. Any piece that has knowledge of types like [T, ..N] has to be specialized though.

@BurntSushi
Member Author

Here's another perspective. Given a minimal binary with a single small regexp (that prints all capture groups), compiling without optimization increases the binary size by 54KB (comparing dynamic vs. native regexp). Compiling with -O increases size by 9KB. Compiling with --opt-level=3 -Z lto decreases size by 184KB. (That seems wicked. Maybe the optimizer knows to leave out Regexp::new and all of its requisite machinery, e.g., the VM, parser and compiler?)

@alexcrichton alexcrichton merged commit c250f8b into rust-lang:master Apr 22, 2014
@alexcrichton
Member

We discussed this in today's meeting and decided to merge it.

The only caveat we'd like to attach is that the entire crate is #[experimental] for now (so we can get some traction first). Other than that though, we're all looking forward to being able to use regular expressions!

@bearophile

@BurntSushi: >I'm not exactly sure why they are separate in the public API though.<

I think it's because, until someone patches D's Compile Time Function Execution interpreter to make it more memory-efficient, you sometimes want to avoid ctRegex in order to use less memory during compilation.

withoutboats pushed a commit to withoutboats/rfcs that referenced this pull request Jan 15, 2017
Fix links to promise function vs Promise type
@Centril Centril added A-regex Proposals relating to regular expressions. A-nursery Proposals relating to the rust-lang-nursery. labels Nov 23, 2018
10 participants