
rollup: bump MSRV to Rust 1.65, bug fixes, memory usage reductions, API improvements, more word boundary assertions and more #1098

Merged
merged 33 commits into master from ag/work
Oct 9, 2023

Conversation

BurntSushi
Member

No description provided.

The name was quite vague, so add a little specificity.
Breaking lines in the middle of backticks appears to be bad juju for
some Markdown renderers.
This fixes a bug that can occur when:

1. The regex has a Unicode word boundary.
2. The haystack contains some non-ASCII Unicode scalar value.
3. An inner or suffix literal optimization is in play.

Specifically, this provokes a case where a match is detected in one of
the meta engine's ad hoc DFA search routines, but before the match
reaches its correct endpoint, a quit state is entered. (Because DFAs
can't deal with Unicode word boundaries on non-ASCII haystacks.) The
correct thing to do is to return a quit error and let the higher level
logic divert to a different engine, but it was returning the match that
it had found up until that point instead. The match returned is not
technically incorrect in the sense that a match does indeed exist, but
the offsets it reports may be shorter than those of the true match.

So... if a quit state is entered, return an error regardless of whether
a match has been found.

Fixes #1046
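For illustration, a hedged sketch of the kind of pattern and haystack involved (the inputs below are hypothetical, not the exact ones from #1046):

```rust
use regex::Regex;

fn main() {
    // The three ingredients from the bug description: a Unicode word
    // boundary, non-ASCII text in the haystack and a literal ("ferris")
    // that the meta engine may use as an inner/suffix prefilter.
    // Whether the prefilter actually fires depends on the engine's
    // heuristics; with the fix, the reported offsets are those of the
    // true match either way.
    let re = Regex::new(r"\b\w+ferris\b").unwrap();
    let hay = "say αferris and βferris";
    let m = re.find(hay).unwrap();
    assert_eq!(m.as_str(), "αferris");
}
```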
This was previously using the raw representation of a `LookSet`, which
is fine, but would have errantly overwritten bits unrelated to
look-around assertions if they were set in a `LookSet`.

This can't happen today because we don't have more than 10 assertions.
And the one-pass DFA constructor specifically errors if more assertions
exist and are in the pattern. But still, it seems like good form to mask
out only the bits we care about.
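Schematically, the masking idea looks like this (hypothetical names and layout, not the actual `LookSet` internals):

```rust
// There are 10 look-around assertions at the time of this commit, so
// only bits 0..=9 of the raw value are allowed to flow into other state.
const LOOK_MASK: u32 = (1 << 10) - 1;

fn merge_looks(state_bits: u32, raw_look_bits: u32) -> u32 {
    state_bits | (raw_look_bits & LOOK_MASK)
}

fn main() {
    // A stray high bit in the raw value no longer leaks into the state.
    let raw = (1u32 << 31) | 0b10_0000_0001;
    assert_eq!(merge_looks(0, raw), 0b10_0000_0001);
}
```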
This puts every Ast value behind a box to conserve space. It makes
things like Vec<Ast> quite a bit smaller than what they would be
otherwise, which is especially beneficial for the representation of
concatenations and alternations.

This doesn't quite solve the memory usage problems though, since an
AstKind is still quite big (over 200 bytes). The next step will be
boxing each of the variants of an AstKind which should hopefully resolve
the issue.

Ref #1090
This does reduce memory, but not as much as it is reduced if we don't
box the Ast.
@BurntSushi BurntSushi force-pushed the ag/work branch 4 times, most recently from 87aa51c to eab70f5 Compare October 8, 2023 14:22
@BurntSushi BurntSushi force-pushed the ag/work branch 2 times, most recently from 296b1b6 to 52140fa Compare October 8, 2023 19:26
@BurntSushi BurntSushi changed the title rollup: bug fixes, memory usage reductions, API improvements, more word boundary assertions and more rollup: bump MSRV to Rust 1.65, bug fixes, memory usage reductions, API improvements, more word boundary assertions and more Oct 8, 2023
BurntSushi and others added 14 commits October 9, 2023 15:10
The AstKind experiment proved unfruitful. I think the issue here is that
the savings on Vec<Ast> didn't prove to be enough to offset the extra
heap allocation that resulted from the indirection.

This seems to be a sweet spot. It would be nice to get Ast down below 16
bytes, but it's not clear how to do that (without much larger changes
that I don't feel inclined to pursue).

Fixes #1090
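A quick way to observe the result (the number is for 64-bit targets and assumes a regex-syntax release with this layout):

```rust
use regex_syntax::ast::Ast;
use std::mem::size_of;

fn main() {
    // Each Ast variant boxes its payload, so the enum itself is roughly
    // a tag plus a pointer, and Vec<Ast> (used for concatenations and
    // alternations) stays compact.
    println!("size_of::<Ast>() = {} bytes", size_of::<Ast>()); // 16 on 64-bit
}
```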
Basically, we never should have guaranteed that a particular HIR would
(or wouldn't) be used if the 'u' flag was present (or absent). Such a
guarantee generally results in too little flexibility, particularly when
it comes to HIR's smart constructors.

We could probably uphold that guarantee, but it's somewhat gnarly to do
and would require rejiggering some of the HIR types. For example, we
would probably need a literal that is an enum of `&str` or `&[u8]` that
correctly preserves the Unicode flag. This in turn comes with a bigger
complexity cost in various rewriting rules.

In general, it's much simpler to require the caller to be prepared for
any kind of HIR regardless of what the flags are. I feel somewhat
justified in this position due to the fact that part of the point of the
HIR is to erase all of the regex flags so that callers no longer need to
worry about them. That is, the erasure is the point that provides a
simplification for everyone downstream.

Closes #1088
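The caller-side takeaway, as a small sketch against regex-syntax's public API (the specific pattern is just an example):

```rust
use regex_syntax::hir::HirKind;

fn main() {
    // Inspect the HIR structurally instead of assuming a particular
    // shape based on whether the pattern had the 'u' flag.
    let hir = regex_syntax::parse(r"(?-u)abc").unwrap();
    match hir.kind() {
        HirKind::Literal(lit) => println!("literal bytes: {:?}", lit.0),
        HirKind::Class(_) => println!("a character class"),
        _ => println!("some other HIR shape"),
    }
}
```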
It turns out that requiring callers to provide an `Input` (and thus a
`&[u8]` haystack) is a bit onerous for some use cases. Namely, part of the
point of `regex-automata` was to expose enough guts to make it tractable
to write a streaming regex engine. A streaming regex engine, especially
one that does a byte-at-a-time loop, is somewhat antithetical to having
a haystack in a single `&[u8]` slice. This made computing start states
possible but very awkward and quite unclear in terms of what the
implementation would actually do with the haystack.

This commit fixes that by exposing a lower level `start_state` method on
both of the DFAs that can be called without materializing an `Input`.
Instead, callers must create a new `start::Config` value which provides
all of the information necessary for the DFA to compute the correct
start state. This in turn also exposes the `crate::util::start` module.

This is ultimately a breaking change because it adds a new required
method to the `Automaton` trait. It also makes `start_state_forward` and
`start_state_reverse` optional. It isn't really expected for callers to
implement the `Automaton` trait themselves (and perhaps I will seal it
so we can do such changes in the future without it being breaking), but
still, this is technically breaking.

Callers using `start_state_forward` or `start_state_reverse` with either
DFA remain unchanged and unaffected.

Closes #1031
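A sketch of the lower level API described above (names follow the description here; exact signatures may differ slightly across regex-automata releases):

```rust
use regex_automata::{
    dfa::{dense, Automaton},
    util::start,
    Anchored,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let dfa = dense::DFA::new(r"[a-z]+")?;
    // No Input (and no materialized haystack) is required: the start
    // configuration carries the anchored mode and the byte, if any,
    // immediately preceding where the search begins.
    let config = start::Config::new()
        .anchored(Anchored::No)
        .look_behind(Some(b' '));
    let sid = dfa.start_state(&config)?;
    println!("start state: {:?}", sid);
    Ok(())
}
```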
That should cover all of them.

Closes #1053
This reduces or eliminates allocation when combining Unicode classes and
should make some things faster. It's unlikely for these optimizations to
matter much in practice, but they are likely to help in niche or
pathological cases where there are a lot of ops in a class.

Closes #1051
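An illustrative pattern that leans on class set operations, the niche case mentioned above (the pattern itself is arbitrary):

```rust
use regex::Regex;

fn main() {
    // Intersection (&&) with a nested negated class: Greek letters
    // excluding the lowercase α-ω range.
    let re = Regex::new(r"[\p{Greek}&&[^α-ω]]+").unwrap();
    assert!(re.is_match("ΔΩ"));
    assert!(!re.is_match("αβγ"));
}
```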
This is in preparation for adding 8 new word boundary look-around
assertions: \b{start}, \b{end}, \b{start-half} and \b{end-half}, along
with Unicode and ASCII-only variants of each.

Ref #469
This adds AST support for the following new assertions:
\b{start}, \b{end}, \b{start-half}, \b{end-half}, \< and \>. The last
two, \< and \>, are aliases for \b{start} and \b{end}.

The parsing for this is a little suspect since there's a little
ambiguity between, e.g., \b{5} and \b{start}, but we handle it by
allowing the parser to look for one of the new special assertions, and
then back-up if it fails to find one so that it can try to parse a
counted repetition.

Ref #469
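A small sketch of how the ambiguity is resolved at the AST level (assuming regex-syntax's `ast::parse::Parser`):

```rust
use regex_syntax::ast::parse::Parser;

fn main() {
    // \b{start} parses as one of the new special assertions...
    let assertion = Parser::new().parse(r"\b{start}").unwrap();
    // ...while \b{2} backs up and parses as a counted repetition of \b.
    let repetition = Parser::new().parse(r"\b{2}").unwrap();
    println!("{assertion:?}");
    println!("{repetition:?}");
}
```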
This builds on the previous commit to bring word boundary support to the
HIR, and updates AST->HIR translation to produce them from the
corresponding AST elements.

Ref #469
In this commit, all of the regex engines now support the new special
word boundary assertions: \b{start}, \b{end}, \b{start-half} and
\b{end-half}. Of course, when they are Unicode-aware, the DFAs will quit
upon seeing a non-ASCII character, just like for the \b and \B
assertions.

For now, we don't add support to the one-pass DFA, since it would either
make it use more memory or reduce the number of capture groups it
supports. I think these assertions will be rare enough that it isn't
worth adding support yet.

This is a breaking change because it adds new variants to the `Look`
enum.
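Example usage of the new assertions, assuming a regex release that includes them (1.10+):

```rust
use regex::Regex;

fn main() {
    // \b{start} (alias \<) requires a word character to its right and a
    // non-word character (or the start of the haystack) to its left;
    // \b{end} (alias \>) is the mirror image.
    let re = Regex::new(r"\b{start}cat\b{end}").unwrap();
    assert!(re.is_match("a cat sat"));
    assert!(!re.is_match("concatenate"));

    // The -half variants only require the "outer" half of the boundary,
    // which helps with grep -w style matching of needles that don't
    // begin or end with a word character.
    let half = Regex::new(r"\b{start-half}-cat-\b{end-half}").unwrap();
    assert!(half.is_match("a -cat- sat"));
    assert!(!half.is_match("some-cat-thing"));
}
```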
This was substantially easier. Coupling, private abstractions and slow
code are so much easier to deal with.

Ref #469
It is almost completely wrong now. Instead of rewriting it---which would
be a huge endeavor---we just point folks toward my blog on regex
internals.

Closes #1058
BurntSushi and others added 12 commits October 9, 2023 15:48
Some doc tests make 64-bit assumptions and fail on 32-bit. I'd be open
to perhaps refactoring the tests somehow to make them work on both, but
I literally have no easy way to run doc tests in a 32-bit environment.
Without being able to actually run them myself, I don't feel comfortable
doing anything other than squashing the tests in that case.

Closes #1041
I don't believe these panics can occur from an actual pattern, since the
parser will either never produce such Ast values or will return an
error. But still, the Ast->Hir translator shouldn't panic in such cases.

As it turns out, the non-sensical Ast values are somewhat sensible, and
they don't map to invalid regexes. These panics were likely the result
of the regex crate not supporting empty patterns or "fail" patterns
particularly well in the past. But now that we do, we can just let the
Asts through and generate the Hir you'd expect.

Fixes #1047
This commit fixes a subtle *performance* bug in the start state
computation. The issue here is rather tricky, but it boils down to the
fact that the way the look-behind assertions are computed in the start
state is not quite precisely equivalent to how they're computed during
normal state generation. Namely, in normal state generation, we only
compute look-behind assertions if the NFA actually has one (or one
similar to it) in its graph somewhere. If it doesn't, then there's no
point in saving whether the assertion is satisfied or not.

Logically speaking, this doesn't matter too much, because if the
look-around assertions don't match up with how they're computed in the
start state, a new state will simply be created. Not a huge deal, but
wasteful. The real problem is that the new state will no longer be
considered a start state. It will just be like any other normal state.
We rely on being able to detect start states at search time to know when
to trigger the prefilter. So if we re-generate start states as non-start
states, then we may end up not triggering the prefilter. That's bad.

rebar actually caught this bug via the
`imported/sherlock/line-boundary-sherlock-holmes` benchmark, which
recorded a 20x slowdown due to the prefilter not running. Owch!

This specifically was caused by the start states unconditionally
attaching half-starting word boundary assertions whenever they were
satisfied, whereas normal state generation only does this when there is
actually a half-starting word boundary assertion in the NFA. So this led
to re-generating start states needlessly.

Interestingly, the start state computation was unconditionally attaching
all different types of look-behind assertions, and thus in theory, this
problem already existed under different circumstances. My hypothesis is
that it wasn't "as bad" because it was mostly limited to line
terminators. But the half-starting word boundary assertion is much more
broadly applicable.

We remedy this not only for the half-starting word boundary assertion,
but for all others as well. I also did manual mutation testing in this
start state computation and found a few branches not covered by tests.
We add those tests here.

Thanks rebar!
This MSRV bump is mostly motivated by "good sense," and in particular,
Rust 1.65 means we can use 'let ... else'. We don't actually start
peppering the code with 'let ... else' just yet, but we fix a few
outstanding small issues and update our Rust version everywhere.

Also, Rust 1.65 is about a year old at time of writing. Let's keep the
trains moving.
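For reference, the kind of thing `let ... else` enables (purely illustrative; not code from the crate):

```rust
// Early return on the "no value" case without a nested match.
fn first_ascii_byte(haystack: &[u8]) -> u8 {
    let Some(&byte) = haystack.first() else {
        return 0;
    };
    if byte.is_ascii() { byte } else { 0 }
}

fn main() {
    assert_eq!(first_ascii_byte(b"abc"), b'a');
    assert_eq!(first_ascii_byte(b""), 0);
}
```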
It's not feasible for us to check such things when deserializing a DFA,
so we just have no real choice but to remove the assert and let the
search proceed with incorrect results. I had previously written these as
real asserts and then swapped them to debug_asserts, but of course, the
fuzzer still trips over them.

Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=60652
It's possible for DFA deserialization to result in an otherwise valid
DFA, but one that records accelerated DFA states without any actual
accelerator. We remedy that by checking for it at deserialization time.

Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=60739
Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=61255

fixup
This rejiggers some code so that we can more reliably check whether
start state IDs are valid or not.

Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=62726
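The contract these deserialization fixes enforce, as a minimal sketch against the dense DFA's `from_bytes` API:

```rust
use regex_automata::dfa::dense::DFA;

fn main() {
    // Garbage input must be rejected with an error: no panics and no
    // "valid looking" DFA that would misbehave at search time (e.g., by
    // claiming accelerated states without accelerators).
    let garbage = vec![0u8; 1024];
    assert!(DFA::from_bytes(&garbage).is_err());
}
```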
I couldn't get this to reproduce. Maybe some of my recent changes to
regex-syntax fixed this? Not sure.

I'm not a huge fan of this fuzzer in general because it isn't really
testing a rock solid guarantee that we provide. And the positions are
tough to deal with.

Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=62382
@BurntSushi BurntSushi merged commit 2c44e2a into master Oct 9, 2023
16 checks passed
@BurntSushi BurntSushi deleted the ag/work branch October 9, 2023 20:51