Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

plan: moving regex engines to regex-automata #656

Closed
BurntSushi opened this issue Mar 24, 2020 · 8 comments
Closed

plan: moving regex engines to regex-automata #656

BurntSushi opened this issue Mar 24, 2020 · 8 comments
Labels

Comments

@BurntSushi
Copy link
Member

BurntSushi commented Mar 24, 2020

In discussions in #524, it became clearer to me that it might be good to say more about what my short/long term plans are for the regex crate.

TL;DR - Push all regex engines down into the regex-automata crate, and bring that in as an internal-but-published dependency of regex as an official part of this repository. It will become a second required dependency (with the first being regex-syntax).

The main motivation for doing this is to clean up the regex internals, specifically with the intent to make it easier to add more optimizations. Currently, the infrastructure is a bit of a hodge podge and it can be quite difficult to know where and when a particular regex engine is being used. On top of that, since every regex engine is strictly internal, testing each engine independently is quite annoying, as it requires exporting undocumented APIs. As a result, bugs have cropped up where certain engines are inconsistent with other engines. Similarly, benchmarking each engine independent of the other is made difficult, which in turn makes it difficult to focus on optimizing certain scenarios that come up in the course of regex execution.

On top of all of this, putting the regex engines into their own crate with a proper API means that folks can use those engines directly. It also provides an outlet of sorts to provide lower level APIs that would I otherwise wouldn't want cluttering up the main regex crate too much. The main thing on my mind here are streaming APIs (see #425), but there may be other things of use.

I've actually been working towards this goal for a long time now. Arguably, i started around 3 years ago when I began rewriting the regex-syntax crate. I then moved to rewriting memchr and aho-corasick as well. In the process, I pushed complexity down as much as I could. For example, the HIR in regex-syntax no longer exposes any regex flags (such as case insensitivity), which makes regex compilers simpler. As another example, aho-corasick now supports leftmost-first matching, which matches the regex crate's match semantics precisely, and now makes it easier to defer to aho-corasick for an alternation of literals. Yet another example is the Teddy algorithm that uses SIMD for finding matches of multiple literals very quickly. That has been pushed down into aho-corasick as well, which means more people get to benefit from it and it no longer needs to be maintained inside of regex proper.

A little over a year ago, I started work on regex-automata. The main use case for regex-automata was to generate fully compiled DFAs that could be cheaply embedded into a program and deserialized cheaply for easy matching. (For example, they power the grapheme, word and sentence iterators in the bstr crate.)

In the course of building regex-automata, I had to write another HIR -> NFA compiler, with the intent that this would one day become the compiler inside the regex crate. It has many improvements over the existing one. Firstly, it is much simpler by doing away with the "instruction hole" mechanism used in the current compiler. Secondly, it is also simpler in that it does away with dichotomy between a Unicode NFA and a byte-based NFA, which has been the cause of many bugs and increases the complexity of the matching engines. Thirdly, by getting rid of the Unicode/bytes dichotomy in favor of just using bytes, we will be able to compile one fewer NFAs than what the regex crate currently does (one Unicode NFA for the Pike VM and the backtracker, 1 forwards byte NFA and 1 backwards byte NFA for the lazy DFA). Fourthly, it is more aggressive about epsilon removal and does away with the binary "alt" instructions in favor of flattening the byte code, which should decrease overhead substantially when matching. Finally, it use's Daciuk's algorithm (the same one used in the fst crate) to compiler nearly minimal UTF-8 automata in linear time, which substantially reduces the size of the corresponding NFAs and DFAs, while also making them faster to execute by virtue of having fewer epsilon transitions.

regex-automata also contains new test infrastructure, and most of the tests in the regex crate have already been ported to it. This test infrastructure should make it much easier to add new tests and also much faster to run while testing more engines. Today's infrastructure is a terrible macro soup, and just building it takes forever because every test is compiled for every regex engine that we test. The new test infrastructure instead loads tests from TOML files and runs them using its own harness. The downside is that this pushes more things outside of Rust's standard unit test harness, but it's well worth it.

As of a couple months ago, I've begun preparing regex-automata to absorb the matching engines from regex proper. This will include the lazy DFA, the Pike VM and the backtracker. regex-automata also provides its own ahead-of-time-compiled DFAs, which I hope to use as yet another engine for cases where the regex is small and full DFA compilation is cheap. This organization should drastically decrease the complexity of adding new regex engines. I hope this will allow me to finally bring in things like one-pass DFA (see #467), as well as some other ideas I've had for years, such as a bit-parallel NFA and even a JIT powered by Cranelift, basically with the goal of fixing one of the weakest parts of the regex crate right now: the performance of matching with capturing groups. (N.B. If a JIT were added, it would be an optional opt-in feature.)

Once regex-automata has absorbed the regex engines inside of regex, my plan at that point is to move regex-automata into this repository and start integrating it into regex proper. At that point, the main responsibility of regex will be to glue everything together, plan the appropriate optimizations, including literal oriented optimizations and provide the higher level API that it does today. At this point in time, I do not plan on making any breaking changes to regex during all of this.

As a stretch goal, one of the things I'd like to do is to bring the cheap deserialization of regex-automata's DFAs to the regex crate proper. This would permit users to generate pre-compiled Regex objects that can be embedded into any program. This would allow one to completely avoid paying for compilation time. This does require a lot of work though, so I don't know if this will happen initially. (Doing this basically implies that every single representation of a finite state machine, including all of the literal prefilters, need to be able to work directly from a more primitive representation such as a &[u8].)

Once this Great Regex Engine Refactor is complete, the result will have been a complete rewrite of the regex crate.

@ethanpailes
Copy link
Contributor

You probably already plan to do something like this, but just in case you have not already planned on it, I think it would be useful for the regex-automata crate to have a set of features that people can use to pull in just a single engine. I'm thinking in particular about situations where people might care more about code size than performance and are happy with just pulling in the PikeVM. It might even be useful to have a separate regex-engine-lite crate that just has the PikeVM so that people who don't know the difference between the different matchers can pull it in.

@BurntSushi
Copy link
Member Author

@ethanpailes Yeah I may consider that. Each engine on its own doesn't usually take up too much code and usually doesn't require any extra dependencies. But sure, chopping it up with features like I did for regex is definitely on the table. I'll probably avoid adding new *-lite crates though.

@Voultapher
Copy link

Still amazed to witness someone as clever and dedicated as you. The programming community would be a sadder place without you. Thank you for you wonderful hard work.

@giovanniberti
Copy link

TL;DR - Push all regex engines down into the regex-automata crate, and bring that in as an internal-but-published dependency of regex as an official part of this repository. It will become a second required dependency (with the first being regex-syntax).
[...]
Once this Great Regex Engine Refactor is complete, the result will have been a complete rewrite of the regex crate.

Given that you are pushing regex engines out of the regex crate, shouldn't it be called the Great Regex Engine Pushout Refactor, or GREP-R for short? 😉

Btw, love your work and the dedication you put into maintaining one of the most important crates in the rust ecosystem! 💪

@chrisduerr
Copy link

Once regex-automata has absorbed the regex engines inside of regex, my plan at that point is to move regex-automata into this repository

At this point in time, I do not plan on making any breaking changes to regex during all of this.

While this sounds like a great effort for regex and the unification of the two different crates, I'd be curious what this means for users of regex-automata. I'm not particularly concerned about breaking changes itself, but is there any plan to remove functionality in favor of unification?

It seems like the basics of regex-automata are fundamentally required for both regex and regex-automata, with a DFA and transition between states, but I'd like to make sure that usecases like streaming regex parsers are still captured in that new ecosystem. Especially since even regex-automata has that as a secondary usecase since it's rarely desired of course.

Fundamentally it sounds to me like regex-automata users will just benefit from more regex engine choices without any drawbacks, which would obviously be great.

@BurntSushi
Copy link
Member Author

Fundamentally it sounds to me like regex-automata users will just benefit from more regex engine choices without any drawbacks, which would obviously be great.

That's the intent yes. I have no plans to remove functionality from regex-automata. :-)

@BurntSushi BurntSushi added the plan label May 9, 2020
BurntSushi added a commit that referenced this issue Mar 15, 2023
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag,
'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also
causes '.' to *not* match \r in addition to \n (unless the 's' flag is
enabled of course).

The intended semantics are that CRLF mode makes \r\n, \r and \n line
terminators but with one key property: \r\n is treated as a single line
terminator. That is, ^/$ do not match between \r and \n.

This partially addresses #244 by adding syntax support. Currently, if
you try to use this new flag, the regex compiler will report an error.
We intend to finish support for this once #656 is complete. (Indeed, at
time of writing, CRLF matching works in regex-automata.)
BurntSushi added a commit that referenced this issue Mar 20, 2023
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag,
'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also
causes '.' to *not* match \r in addition to \n (unless the 's' flag is
enabled of course).

The intended semantics are that CRLF mode makes \r\n, \r and \n line
terminators but with one key property: \r\n is treated as a single line
terminator. That is, ^/$ do not match between \r and \n.

This partially addresses #244 by adding syntax support. Currently, if
you try to use this new flag, the regex compiler will report an error.
We intend to finish support for this once #656 is complete. (Indeed, at
time of writing, CRLF matching works in regex-automata.)
BurntSushi added a commit that referenced this issue Mar 21, 2023
An empty character class is effectively a way to write something that
can never match anything. The regex crate has pretty much always
returned an error for such things because it was never taught how to
handle "always fail" states. Partly because I just didn't think about it
when initially writing the regex engines and partly because it isn't
often useful.

With that said, it should be supported for completeness and because
there is no real reason to not support it. Moreover, it can be useful in
certain contexts where regexes are generated and you want to insert an
expression that can never match. It's somewhat contrived, but it
happens when the interface is a regex pattern.

Previously, the ban on empty character classes was implemented in the
regex-syntax crate. But with the rewrite in #656 getting closer and
closer to landing, it's now time to relax this restriction. However, we
do keep the overall restriction in the 'regex' API by returning an error
in the NFA compiler. Once #656 is done, the new regex engines will
permit this case.
BurntSushi added a commit that referenced this issue Mar 21, 2023
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag,
'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also
causes '.' to *not* match \r in addition to \n (unless the 's' flag is
enabled of course).

The intended semantics are that CRLF mode makes \r\n, \r and \n line
terminators but with one key property: \r\n is treated as a single line
terminator. That is, ^/$ do not match between \r and \n.

This partially addresses #244 by adding syntax support. Currently, if
you try to use this new flag, the regex compiler will report an error.
We intend to finish support for this once #656 is complete. (Indeed, at
time of writing, CRLF matching works in regex-automata.)
BurntSushi added a commit that referenced this issue Apr 15, 2023
An empty character class is effectively a way to write something that
can never match anything. The regex crate has pretty much always
returned an error for such things because it was never taught how to
handle "always fail" states. Partly because I just didn't think about it
when initially writing the regex engines and partly because it isn't
often useful.

With that said, it should be supported for completeness and because
there is no real reason to not support it. Moreover, it can be useful in
certain contexts where regexes are generated and you want to insert an
expression that can never match. It's somewhat contrived, but it
happens when the interface is a regex pattern.

Previously, the ban on empty character classes was implemented in the
regex-syntax crate. But with the rewrite in #656 getting closer and
closer to landing, it's now time to relax this restriction. However, we
do keep the overall restriction in the 'regex' API by returning an error
in the NFA compiler. Once #656 is done, the new regex engines will
permit this case.
BurntSushi added a commit that referenced this issue Apr 15, 2023
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag,
'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also
causes '.' to *not* match \r in addition to \n (unless the 's' flag is
enabled of course).

The intended semantics are that CRLF mode makes \r\n, \r and \n line
terminators but with one key property: \r\n is treated as a single line
terminator. That is, ^/$ do not match between \r and \n.

This partially addresses #244 by adding syntax support. Currently, if
you try to use this new flag, the regex compiler will report an error.
We intend to finish support for this once #656 is complete. (Indeed, at
time of writing, CRLF matching works in regex-automata.)
BurntSushi added a commit that referenced this issue Apr 15, 2023
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag,
'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also
causes '.' to *not* match \r in addition to \n (unless the 's' flag is
enabled of course).

The intended semantics are that CRLF mode makes \r\n, \r and \n line
terminators but with one key property: \r\n is treated as a single line
terminator. That is, ^/$ do not match between \r and \n.

This partially addresses #244 by adding syntax support. Currently, if
you try to use this new flag, the regex compiler will report an error.
We intend to finish support for this once #656 is complete. (Indeed, at
time of writing, CRLF matching works in regex-automata.)
BurntSushi added a commit that referenced this issue Apr 17, 2023
An empty character class is effectively a way to write something that
can never match anything. The regex crate has pretty much always
returned an error for such things because it was never taught how to
handle "always fail" states. Partly because I just didn't think about it
when initially writing the regex engines and partly because it isn't
often useful.

With that said, it should be supported for completeness and because
there is no real reason to not support it. Moreover, it can be useful in
certain contexts where regexes are generated and you want to insert an
expression that can never match. It's somewhat contrived, but it
happens when the interface is a regex pattern.

Previously, the ban on empty character classes was implemented in the
regex-syntax crate. But with the rewrite in #656 getting closer and
closer to landing, it's now time to relax this restriction. However, we
do keep the overall restriction in the 'regex' API by returning an error
in the NFA compiler. Once #656 is done, the new regex engines will
permit this case.
BurntSushi added a commit that referenced this issue Apr 17, 2023
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag,
'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also
causes '.' to *not* match \r in addition to \n (unless the 's' flag is
enabled of course).

The intended semantics are that CRLF mode makes \r\n, \r and \n line
terminators but with one key property: \r\n is treated as a single line
terminator. That is, ^/$ do not match between \r and \n.

This partially addresses #244 by adding syntax support. Currently, if
you try to use this new flag, the regex compiler will report an error.
We intend to finish support for this once #656 is complete. (Indeed, at
time of writing, CRLF matching works in regex-automata.)
BurntSushi added a commit that referenced this issue Apr 17, 2023
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag,
'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also
causes '.' to *not* match \r in addition to \n (unless the 's' flag is
enabled of course).

The intended semantics are that CRLF mode makes \r\n, \r and \n line
terminators but with one key property: \r\n is treated as a single line
terminator. That is, ^/$ do not match between \r and \n.

This partially addresses #244 by adding syntax support. Currently, if
you try to use this new flag, the regex compiler will report an error.
We intend to finish support for this once #656 is complete. (Indeed, at
time of writing, CRLF matching works in regex-automata.)
@BurntSushi
Copy link
Member Author

BurntSushi commented Apr 17, 2023

I have a PR up for phase 1 of 2 of this transition: #977

Basically, this PR is preparing for the regex crate to cut over to regex-automata for its internals. That will include bringing regex-automata into this repository. I don't have a specific timeline for when that will happen, but I do want to let this first phase bake for a little bit. Hopefully within a month. regex-automata itself is ready.

BurntSushi added a commit that referenced this issue Apr 17, 2023
An empty character class is effectively a way to write something that
can never match anything. The regex crate has pretty much always
returned an error for such things because it was never taught how to
handle "always fail" states. Partly because I just didn't think about it
when initially writing the regex engines and partly because it isn't
often useful.

With that said, it should be supported for completeness and because
there is no real reason to not support it. Moreover, it can be useful in
certain contexts where regexes are generated and you want to insert an
expression that can never match. It's somewhat contrived, but it
happens when the interface is a regex pattern.

Previously, the ban on empty character classes was implemented in the
regex-syntax crate. But with the rewrite in #656 getting closer and
closer to landing, it's now time to relax this restriction. However, we
do keep the overall restriction in the 'regex' API by returning an error
in the NFA compiler. Once #656 is done, the new regex engines will
permit this case.
BurntSushi added a commit that referenced this issue Apr 17, 2023
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag,
'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also
causes '.' to *not* match \r in addition to \n (unless the 's' flag is
enabled of course).

The intended semantics are that CRLF mode makes \r\n, \r and \n line
terminators but with one key property: \r\n is treated as a single line
terminator. That is, ^/$ do not match between \r and \n.

This partially addresses #244 by adding syntax support. Currently, if
you try to use this new flag, the regex compiler will report an error.
We intend to finish support for this once #656 is complete. (Indeed, at
time of writing, CRLF matching works in regex-automata.)
BurntSushi added a commit that referenced this issue Apr 17, 2023
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag,
'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also
causes '.' to *not* match \r in addition to \n (unless the 's' flag is
enabled of course).

The intended semantics are that CRLF mode makes \r\n, \r and \n line
terminators but with one key property: \r\n is treated as a single line
terminator. That is, ^/$ do not match between \r and \n.

This partially addresses #244 by adding syntax support. Currently, if
you try to use this new flag, the regex compiler will report an error.
We intend to finish support for this once #656 is complete. (Indeed, at
time of writing, CRLF matching works in regex-automata.)
BurntSushi added a commit that referenced this issue Apr 17, 2023
An empty character class is effectively a way to write something that
can never match anything. The regex crate has pretty much always
returned an error for such things because it was never taught how to
handle "always fail" states. Partly because I just didn't think about it
when initially writing the regex engines and partly because it isn't
often useful.

With that said, it should be supported for completeness and because
there is no real reason to not support it. Moreover, it can be useful in
certain contexts where regexes are generated and you want to insert an
expression that can never match. It's somewhat contrived, but it
happens when the interface is a regex pattern.

Previously, the ban on empty character classes was implemented in the
regex-syntax crate. But with the rewrite in #656 getting closer and
closer to landing, it's now time to relax this restriction. However, we
do keep the overall restriction in the 'regex' API by returning an error
in the NFA compiler. Once #656 is done, the new regex engines will
permit this case.
BurntSushi added a commit that referenced this issue Apr 17, 2023
This adds Look::StartCRLF and Look::EndCRLF. And also adds a new flag,
'R', for making ^/$ be CRLF aware in multi-line mode. The 'R' flag also
causes '.' to *not* match \r in addition to \n (unless the 's' flag is
enabled of course).

The intended semantics are that CRLF mode makes \r\n, \r and \n line
terminators but with one key property: \r\n is treated as a single line
terminator. That is, ^/$ do not match between \r and \n.

This partially addresses #244 by adding syntax support. Currently, if
you try to use this new flag, the regex compiler will report an error.
We intend to finish support for this once #656 is complete. (Indeed, at
time of writing, CRLF matching works in regex-automata.)
BurntSushi added a commit that referenced this issue Apr 20, 2023
1.8.0 (2023-04-20)
==================
This is a sizeable release that will be soon followed by another sizeable
release. Both of them will combined close over 40 existing issues and PRs.

This first release, despite its size, essentially represent preparatory work
for the second release, which will be even bigger. Namely, this release:

* Increases the MSRV to Rust 1.60.0, which was released about 1 year ago.
* Upgrades its dependency on `aho-corasick` to the recently release 1.0
version.
* Upgrades its dependency on `regex-syntax` to the simultaneously released
`0.7` version. The changes to `regex-syntax` principally revolve around a
rewrite of its literal extraction code and a number of simplifications and
optimizations to its high-level intermediate representation (HIR).

The second release, which will follow ~shortly after the release above, will
contain a soup-to-nuts rewrite of every regex engine. This will be done by
bringing [`regex-automata`](https://github.com/BurntSushi/regex-automata) into
this repository, and then changing the `regex` crate to be nothing but an API
shim layer on top of `regex-automata`'s API.

These tandem releases are the culmination of about 3
years of on-and-off work that [began in earnest in March
2020](#656).

Because of the scale of changes involved in these releases, I would love to
hear about your experience. Especially if you notice undocumented changes in
behavior or performance changes (positive *or* negative).

Most changes in the first release are listed below. For more details, please
see the commit log, which reflects a linear and decently documented history
of all changes.

New features:

* [FEATURE #501](#501):
Permit many more characters to be escaped, even if they have no significance.
More specifically, any ASCII character except for `[0-9A-Za-z<>]` can now be
escaped. Also, a new routine, `is_escapeable_character`, has been added to
`regex-syntax` to query whether a character is escapeable or not.
* [FEATURE #547](#547):
Add `Regex::captures_at`. This filles a hole in the API, but doesn't otherwise
introduce any new expressive power.
* [FEATURE #595](#595):
Capture group names are now Unicode-aware. They can now begin with either a `_`
or any "alphabetic" codepoint. After the first codepoint, subsequent codepoints
can be any sequence of alpha-numeric codepoints, along with `_`, `.`, `[` and
`]`. Note that replacement syntax has not changed.
* [FEATURE #810](#810):
Add `Match::is_empty` and `Match::len` APIs.
* [FEATURE #905](#905):
Add an `impl Default for RegexSet`, with the default being the empty set.
* [FEATURE #908](#908):
A new method, `Regex::static_captures_len`, has been added which returns the
number of capture groups in the pattern if and only if every possible match
always contains the same number of matching groups.
* [FEATURE #955](#955):
Named captures can now be written as `(?<name>re)` in addition to
`(?P<name>re)`.
* FEATURE: `regex-syntax` now supports empty character classes.
* FEATURE: `regex-syntax` now has an optional `std` feature. (This will come
to `regex` in the second release.)
* FEATURE: The `Hir` type in `regex-syntax` has had a number of simplifications
made to it.
* FEATURE: `regex-syntax` has support for a new `R` flag for enabling CRLF
mode. This will be supported in `regex` proper in the second release.
* FEATURE: `regex-syntax` now has proper support for "regex that never
matches" via `Hir::fail()`.
* FEATURE: The `hir::literal` module of `regex-syntax` has been completely
re-worked. It now has more documentation, examples and advice.
* FEATURE: The `allow_invalid_utf8` option in `regex-syntax` has been renamed
to `utf8`, and the meaning of the boolean has been flipped.

Performance improvements:

* PERF: The upgrade to `aho-corasick 1.0` may improve performance in some
cases. It's difficult to characterize exactly which patterns this might impact,
but if there are a small number of longish (>= 4 bytes) prefix literals, then
it might be faster than before.

Bug fixes:

* [BUG #514](#514):
Improve `Debug` impl for `Match` so that it doesn't show the entire haystack.
* BUGS [#516](#516),
[#731](#731):
Fix a number of issues with printing `Hir` values as regex patterns.
* [BUG #610](#610):
Add explicit example of `foo|bar` in the regex syntax docs.
* [BUG #625](#625):
Clarify that `SetMatches::len` does not (regretably) refer to the number of
matches in the set.
* [BUG #660](#660):
Clarify "verbose mode" in regex syntax documentation.
* BUG [#738](#738),
[#950](#950):
Fix `CaptureLocations::get` so that it never panics.
* [BUG #747](#747):
Clarify documentation for `Regex::shortest_match`.
* [BUG #835](#835):
Fix `\p{Sc}` so that it is equivalent to `\p{Currency_Symbol}`.
* [BUG #846](#846):
Add more clarifying documentation to the `CompiledTooBig` error variant.
* [BUG #854](#854):
Clarify that `regex::Regex` searches as if the haystack is a sequence of
Unicode scalar values.
* [BUG #884](#884):
Replace `__Nonexhaustive` variants with `#[non_exhaustive]` attribute.
* [BUG #893](#893):
Optimize case folding since it can get quite slow in some pathological cases.
* [BUG #895](#895):
Reject `(?-u:\W)` in `regex::Regex` APIs.
* [BUG #942](#942):
Add a missing `void` keyword to indicate "no parameters" in C API.
* [BUG #965](#965):
Fix `\p{Lc}` so that it is equivalent to `\p{Cased_Letter}`.
* [BUG #975](#975):
Clarify documentation for `\pX` syntax.
BurntSushi added a commit that referenced this issue Apr 20, 2023
1.8.0 (2023-04-20)
==================
This is a sizeable release that will be soon followed by another sizeable
release. Both of them will combined close over 40 existing issues and PRs.

This first release, despite its size, essentially represent preparatory work
for the second release, which will be even bigger. Namely, this release:

* Increases the MSRV to Rust 1.60.0, which was released about 1 year ago.
* Upgrades its dependency on `aho-corasick` to the recently release 1.0
version.
* Upgrades its dependency on `regex-syntax` to the simultaneously released
`0.7` version. The changes to `regex-syntax` principally revolve around a
rewrite of its literal extraction code and a number of simplifications and
optimizations to its high-level intermediate representation (HIR).

The second release, which will follow ~shortly after the release above, will
contain a soup-to-nuts rewrite of every regex engine. This will be done by
bringing [`regex-automata`](https://github.com/BurntSushi/regex-automata) into
this repository, and then changing the `regex` crate to be nothing but an API
shim layer on top of `regex-automata`'s API.

These tandem releases are the culmination of about 3
years of on-and-off work that [began in earnest in March
2020](#656).

Because of the scale of changes involved in these releases, I would love to
hear about your experience. Especially if you notice undocumented changes in
behavior or performance changes (positive *or* negative).

Most changes in the first release are listed below. For more details, please
see the commit log, which reflects a linear and decently documented history
of all changes.

New features:

* [FEATURE #501](#501):
Permit many more characters to be escaped, even if they have no significance.
More specifically, any ASCII character except for `[0-9A-Za-z<>]` can now be
escaped. Also, a new routine, `is_escapeable_character`, has been added to
`regex-syntax` to query whether a character is escapeable or not.
* [FEATURE #547](#547):
Add `Regex::captures_at`. This filles a hole in the API, but doesn't otherwise
introduce any new expressive power.
* [FEATURE #595](#595):
Capture group names are now Unicode-aware. They can now begin with either a `_`
or any "alphabetic" codepoint. After the first codepoint, subsequent codepoints
can be any sequence of alpha-numeric codepoints, along with `_`, `.`, `[` and
`]`. Note that replacement syntax has not changed.
* [FEATURE #810](#810):
Add `Match::is_empty` and `Match::len` APIs.
* [FEATURE #905](#905):
Add an `impl Default for RegexSet`, with the default being the empty set.
* [FEATURE #908](#908):
A new method, `Regex::static_captures_len`, has been added which returns the
number of capture groups in the pattern if and only if every possible match
always contains the same number of matching groups.
* [FEATURE #955](#955):
Named captures can now be written as `(?<name>re)` in addition to
`(?P<name>re)`.
* FEATURE: `regex-syntax` now supports empty character classes.
* FEATURE: `regex-syntax` now has an optional `std` feature. (This will come
to `regex` in the second release.)
* FEATURE: The `Hir` type in `regex-syntax` has had a number of simplifications
made to it.
* FEATURE: `regex-syntax` has support for a new `R` flag for enabling CRLF
mode. This will be supported in `regex` proper in the second release.
* FEATURE: `regex-syntax` now has proper support for "regex that never
matches" via `Hir::fail()`.
* FEATURE: The `hir::literal` module of `regex-syntax` has been completely
re-worked. It now has more documentation, examples and advice.
* FEATURE: The `allow_invalid_utf8` option in `regex-syntax` has been renamed
to `utf8`, and the meaning of the boolean has been flipped.

Performance improvements:

* PERF: The upgrade to `aho-corasick 1.0` may improve performance in some
cases. It's difficult to characterize exactly which patterns this might impact,
but if there are a small number of longish (>= 4 bytes) prefix literals, then
it might be faster than before.

Bug fixes:

* [BUG #514](#514):
Improve `Debug` impl for `Match` so that it doesn't show the entire haystack.
* BUGS [#516](#516),
[#731](#731):
Fix a number of issues with printing `Hir` values as regex patterns.
* [BUG #610](#610):
Add explicit example of `foo|bar` in the regex syntax docs.
* [BUG #625](#625):
Clarify that `SetMatches::len` does not (regretably) refer to the number of
matches in the set.
* [BUG #660](#660):
Clarify "verbose mode" in regex syntax documentation.
* BUG [#738](#738),
[#950](#950):
Fix `CaptureLocations::get` so that it never panics.
* [BUG #747](#747):
Clarify documentation for `Regex::shortest_match`.
* [BUG #835](#835):
Fix `\p{Sc}` so that it is equivalent to `\p{Currency_Symbol}`.
* [BUG #846](#846):
Add more clarifying documentation to the `CompiledTooBig` error variant.
* [BUG #854](#854):
Clarify that `regex::Regex` searches as if the haystack is a sequence of
Unicode scalar values.
* [BUG #884](#884):
Replace `__Nonexhaustive` variants with `#[non_exhaustive]` attribute.
* [BUG #893](#893):
Optimize case folding since it can get quite slow in some pathological cases.
* [BUG #895](#895):
Reject `(?-u:\W)` in `regex::Regex` APIs.
* [BUG #942](#942):
Add a missing `void` keyword to indicate "no parameters" in C API.
* [BUG #965](#965):
Fix `\p{Lc}` so that it is equivalent to `\p{Cased_Letter}`.
* [BUG #975](#975):
Clarify documentation for `\pX` syntax.
@BurntSushi
Copy link
Member Author

For anyone following this thread, regex 1.9 (the next release) is going to close this issue. Finally. I plan to release it after I've finished documentation and polishing work. (There is a fair bit to do here.)

I am also currently planning to release a new crate, regex-lite, simultaneously with regex 1.9. You can see more about that and why specifically I chose to do it with regex 1.9 here: #961 (comment)

crapStone added a commit to Calciumdibromid/CaBr2 that referenced this issue May 2, 2023
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [regex](https://github.com/rust-lang/regex) | dependencies | minor | `1.7.3` -> `1.8.1` |

---

### Release Notes

<details>
<summary>rust-lang/regex</summary>

### [`v1.8.1`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#&#8203;181-2023-04-21)

\==================
This is a patch release that fixes a bug where a regex match could be reported
where none was found. Specifically, the bug occurs when a pattern contains some
literal prefixes that could be extracted *and* an optional word boundary in the
prefix.

Bug fixes:

-   [BUG #&#8203;981](rust-lang/regex#981):
    Fix a bug where a word boundary could interact with prefix literal
    optimizations and lead to a false positive match.

### [`v1.8.0`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#&#8203;180-2023-04-20)

\==================
This is a sizeable release that will be soon followed by another sizeable
release. Both of them will combined close over 40 existing issues and PRs.

This first release, despite its size, essentially represents preparatory work
for the second release, which will be even bigger. Namely, this release:

-   Increases the MSRV to Rust 1.60.0, which was released about 1 year ago.
-   Upgrades its dependency on `aho-corasick` to the recently released 1.0
    version.
-   Upgrades its dependency on `regex-syntax` to the simultaneously released
    `0.7` version. The changes to `regex-syntax` principally revolve around a
    rewrite of its literal extraction code and a number of simplifications and
    optimizations to its high-level intermediate representation (HIR).

The second release, which will follow ~shortly after the release above, will
contain a soup-to-nuts rewrite of every regex engine. This will be done by
bringing [`regex-automata`](https://github.com/BurntSushi/regex-automata) into
this repository, and then changing the `regex` crate to be nothing but an API
shim layer on top of `regex-automata`'s API.

These tandem releases are the culmination of about 3
years of on-and-off work that [began in earnest in March
2020](rust-lang/regex#656).

Because of the scale of changes involved in these releases, I would love to
hear about your experience. Especially if you notice undocumented changes in
behavior or performance changes (positive *or* negative).

Most changes in the first release are listed below. For more details, please
see the commit log, which reflects a linear and decently documented history
of all changes.

New features:

-   [FEATURE #&#8203;501](rust-lang/regex#501):
    Permit many more characters to be escaped, even if they have no significance.
    More specifically, any ASCII character except for `[0-9A-Za-z<>]` can now be
    escaped. Also, a new routine, `is_escapeable_character`, has been added to
    `regex-syntax` to query whether a character is escapeable or not.
-   [FEATURE #&#8203;547](rust-lang/regex#547):
    Add `Regex::captures_at`. This filles a hole in the API, but doesn't otherwise
    introduce any new expressive power.
-   [FEATURE #&#8203;595](rust-lang/regex#595):
    Capture group names are now Unicode-aware. They can now begin with either a `_`
    or any "alphabetic" codepoint. After the first codepoint, subsequent codepoints
    can be any sequence of alpha-numeric codepoints, along with `_`, `.`, `[` and
    `]`. Note that replacement syntax has not changed.
-   [FEATURE #&#8203;810](rust-lang/regex#810):
    Add `Match::is_empty` and `Match::len` APIs.
-   [FEATURE #&#8203;905](rust-lang/regex#905):
    Add an `impl Default for RegexSet`, with the default being the empty set.
-   [FEATURE #&#8203;908](rust-lang/regex#908):
    A new method, `Regex::static_captures_len`, has been added which returns the
    number of capture groups in the pattern if and only if every possible match
    always contains the same number of matching groups.
-   [FEATURE #&#8203;955](rust-lang/regex#955):
    Named captures can now be written as `(?<name>re)` in addition to
    `(?P<name>re)`.
-   FEATURE: `regex-syntax` now supports empty character classes.
-   FEATURE: `regex-syntax` now has an optional `std` feature. (This will come
    to `regex` in the second release.)
-   FEATURE: The `Hir` type in `regex-syntax` has had a number of simplifications
    made to it.
-   FEATURE: `regex-syntax` has support for a new `R` flag for enabling CRLF
    mode. This will be supported in `regex` proper in the second release.
-   FEATURE: `regex-syntax` now has proper support for "regex that never
    matches" via `Hir::fail()`.
-   FEATURE: The `hir::literal` module of `regex-syntax` has been completely
    re-worked. It now has more documentation, examples and advice.
-   FEATURE: The `allow_invalid_utf8` option in `regex-syntax` has been renamed
    to `utf8`, and the meaning of the boolean has been flipped.

Performance improvements:

-   PERF: The upgrade to `aho-corasick 1.0` may improve performance in some
    cases. It's difficult to characterize exactly which patterns this might impact,
    but if there are a small number of longish (>= 4 bytes) prefix literals, then
    it might be faster than before.

Bug fixes:

-   [BUG #&#8203;514](rust-lang/regex#514):
    Improve `Debug` impl for `Match` so that it doesn't show the entire haystack.
-   BUGS [#&#8203;516](rust-lang/regex#516),
    [#&#8203;731](rust-lang/regex#731):
    Fix a number of issues with printing `Hir` values as regex patterns.
-   [BUG #&#8203;610](rust-lang/regex#610):
    Add explicit example of `foo|bar` in the regex syntax docs.
-   [BUG #&#8203;625](rust-lang/regex#625):
    Clarify that `SetMatches::len` does not (regretably) refer to the number of
    matches in the set.
-   [BUG #&#8203;660](rust-lang/regex#660):
    Clarify "verbose mode" in regex syntax documentation.
-   BUG [#&#8203;738](rust-lang/regex#738),
    [#&#8203;950](rust-lang/regex#950):
    Fix `CaptureLocations::get` so that it never panics.
-   [BUG #&#8203;747](rust-lang/regex#747):
    Clarify documentation for `Regex::shortest_match`.
-   [BUG #&#8203;835](rust-lang/regex#835):
    Fix `\p{Sc}` so that it is equivalent to `\p{Currency_Symbol}`.
-   [BUG #&#8203;846](rust-lang/regex#846):
    Add more clarifying documentation to the `CompiledTooBig` error variant.
-   [BUG #&#8203;854](rust-lang/regex#854):
    Clarify that `regex::Regex` searches as if the haystack is a sequence of
    Unicode scalar values.
-   [BUG #&#8203;884](rust-lang/regex#884):
    Replace `__Nonexhaustive` variants with `#[non_exhaustive]` attribute.
-   [BUG #&#8203;893](rust-lang/regex#893):
    Optimize case folding since it can get quite slow in some pathological cases.
-   [BUG #&#8203;895](rust-lang/regex#895):
    Reject `(?-u:\W)` in `regex::Regex` APIs.
-   [BUG #&#8203;942](rust-lang/regex#942):
    Add a missing `void` keyword to indicate "no parameters" in C API.
-   [BUG #&#8203;965](rust-lang/regex#965):
    Fix `\p{Lc}` so that it is equivalent to `\p{Cased_Letter}`.
-   [BUG #&#8203;975](rust-lang/regex#975):
    Clarify documentation for `\pX` syntax.

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNS42MS4wIiwidXBkYXRlZEluVmVyIjoiMzUuNjYuMyIsInRhcmdldEJyYW5jaCI6ImRldmVsb3AifQ==-->

Co-authored-by: cabr2-bot <cabr2.help@gmail.com>
Co-authored-by: crapStone <crapstone01@gmail.com>
Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1874
Reviewed-by: crapStone <crapstone@noreply.codeberg.org>
Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
BurntSushi added a commit that referenced this issue Jul 5, 2023
I usually close tickets on a commit-by-commit basis, but this refactor
was so big that it wasn't feasible to do that. So ticket closures are
marked here.

Closes #244
Closes #259
Closes #476
Closes #644
Closes #675
Closes #824
Closes #961

Closes #68
Closes #510
Closes #787
Closes #891

Closes #429
Closes #517
Closes #579
Closes #779
Closes #850
Closes #921
Closes #976
Closes #1002

Closes #656
crapStone added a commit to Calciumdibromid/CaBr2 that referenced this issue Jul 18, 2023
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [regex](https://github.com/rust-lang/regex) | dependencies | minor | `1.8.4` -> `1.9.1` |

---

### Release Notes

<details>
<summary>rust-lang/regex (regex)</summary>

### [`v1.9.1`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#191-2023-07-07)

[Compare Source](rust-lang/regex@1.9.0...1.9.1)

\==================
This is a patch release which fixes a memory usage regression. In the regex
1.9 release, one of the internal engines used a more aggressive allocation
strategy than what was done previously. This patch release reverts to the
prior on-demand strategy.

Bug fixes:

-   [BUG #&#8203;1027](rust-lang/regex#1027):
    Change the allocation strategy for the backtracker to be less aggressive.

### [`v1.9.0`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#190-2023-07-05)

[Compare Source](rust-lang/regex@1.8.4...1.9.0)

\==================
This release marks the end of a [years long rewrite of the regex crate
internals](rust-lang/regex#656). Since this is
such a big release, please report any issues or regressions you find. We would
also love to hear about improvements as well.

In addition to many internal improvements that should hopefully result in
"my regex searches are faster," there have also been a few API additions:

-   A new `Captures::extract` method for quickly accessing the substrings
    that match each capture group in a regex.
-   A new inline flag, `R`, which enables CRLF mode. This makes `.` match any
    Unicode scalar value except for `\r` and `\n`, and also makes `(?m:^)` and
    `(?m:$)` match after and before both `\r` and `\n`, respectively, but never
    between a `\r` and `\n`.
-   `RegexBuilder::line_terminator` was added to further customize the line
    terminator used by `(?m:^)` and `(?m:$)` to be any arbitrary byte.
-   The `std` Cargo feature is now actually optional. That is, the `regex` crate
    can be used without the standard library.
-   Because `regex 1.9` may make binary size and compile times even worse, a
    new experimental crate called `regex-lite` has been published. It prioritizes
    binary size and compile times over functionality (like Unicode) and
    performance. It shares no code with the `regex` crate.

New features:

-   [FEATURE #&#8203;244](rust-lang/regex#244):
    One can opt into CRLF mode via the `R` flag.
    e.g., `(?mR:$)` matches just before `\r\n`.
-   [FEATURE #&#8203;259](rust-lang/regex#259):
    Multi-pattern searches with offsets can be done with `regex-automata 0.3`.
-   [FEATURE #&#8203;476](rust-lang/regex#476):
    `std` is now an optional feature. `regex` may be used with only `alloc`.
-   [FEATURE #&#8203;644](rust-lang/regex#644):
    `RegexBuilder::line_terminator` configures how `(?m:^)` and `(?m:$)` behave.
-   [FEATURE #&#8203;675](rust-lang/regex#675):
    Anchored search APIs are now available in `regex-automata 0.3`.
-   [FEATURE #&#8203;824](rust-lang/regex#824):
    Add new `Captures::extract` method for easier capture group access.
-   [FEATURE #&#8203;961](rust-lang/regex#961):
    Add `regex-lite` crate with smaller binary sizes and faster compile times.
-   [FEATURE #&#8203;1022](rust-lang/regex#1022):
    Add `TryFrom` implementations for the `Regex` type.

Performance improvements:

-   [PERF #&#8203;68](rust-lang/regex#68):
    Added a one-pass DFA engine for faster capture group matching.
-   [PERF #&#8203;510](rust-lang/regex#510):
    Inner literals are now used to accelerate searches, e.g., `\w+@&#8203;\w+` will scan
    for `@`.
-   [PERF #&#8203;787](rust-lang/regex#787),
    [PERF #&#8203;891](rust-lang/regex#891):
    Makes literal optimizations apply to regexes of the form `\b(foo|bar|quux)\b`.

(There are many more performance improvements as well, but not all of them have
specific issues devoted to them.)

Bug fixes:

-   [BUG #&#8203;429](rust-lang/regex#429):
    Fix matching bugs related to `\B` and inconsistencies across internal engines.
-   [BUG #&#8203;517](rust-lang/regex#517):
    Fix matching bug with capture groups.
-   [BUG #&#8203;579](rust-lang/regex#579):
    Fix matching bug with word boundaries.
-   [BUG #&#8203;779](rust-lang/regex#779):
    Fix bug where some regexes like `(re)+` were not equivalent to `(re)(re)*`.
-   [BUG #&#8203;850](rust-lang/regex#850):
    Fix matching bug inconsistency between NFA and DFA engines.
-   [BUG #&#8203;921](rust-lang/regex#921):
    Fix matching bug where literal extraction got confused by `$`.
-   [BUG #&#8203;976](rust-lang/regex#976):
    Add documentation to replacement routines about dealing with fallibility.
-   [BUG #&#8203;1002](rust-lang/regex#1002):
    Use corpus rejection in fuzz testing.

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNi4wLjAiLCJ1cGRhdGVkSW5WZXIiOiIzNi44LjExIiwidGFyZ2V0QnJhbmNoIjoiZGV2ZWxvcCJ9-->

Co-authored-by: cabr2-bot <cabr2.help@gmail.com>
Co-authored-by: crapStone <crapstone01@gmail.com>
Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1957
Reviewed-by: crapStone <crapstone01@gmail.com>
Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants