Feature request: unbuffered text matching

(Probably blocked on #656 restructuring, and potentially better exposed as part of regex-automata.)

TL;DR:

In the advanced/"lower level" impl section, or perhaps on applicable `Regex` types provided by regex-automata:

```rust
fn Regex::is_match_streaming<O: Display>(&self, text: O) -> bool;
```

> `pub fn is_match_streaming<O: Display>(&self, text: O) -> bool`
> 
> Returns true if and only if there is a match for the regex in the display representation of the object given.
> 
> If you have or can get `&str` for the text to match, you should use [`is_match`](https://docs.rs/regex/latest/regex/struct.Regex.html#method.is_match) instead. This method is potentially useful to reduce allocations, as you don't have to first write the object's display representation into a buffer to match it.
>
> However, this can lead to less performant regex execution. Normal regex execution does multiple optimizations that rely on having the full text available to scan, which can't be done in a streaming fashion. Instead, this method has to always run the most general regex VM byte-by-byte over the input.
>
> To test against streaming text which is not solely a single object's `Display` impl, use `format_args!` to manufacture a `Display` impl with the text you want to match.
>
> ### Examples
>
> ```rust
> // Fictional people from a name generator.
> // Any similarity to actual persons, living or dead, is purely coincidental.
> let tiffany = "Ms Tiffany Northman";
> let raymond = "Dr Raymond Hyde";
> let regex = Regex::new("(?i)tiffany .* loves .* raymond").unwrap();
> assert!(regex.is_match_streaming(format_args!("{tiffany} loves {raymond}")));
> ```

## Motivation

TL;DR: the [matchers](https://lib.rs/crates/matchers) crate, as used in [tracing-subscriber](https://lib.rs/crates/tracing-subscriber).

> The [`regex`] crate implements regular expression matching on strings and byte arrays. However, in order to match the output of implementations of `fmt::Debug` and `fmt::Display`, or by any code which writes to an instance of `fmt::Write` or `io::Write`, it is necessary to first allocate a buffer, write to that buffer, and then match the buffer against a regex.
> 
> In cases where it is not necessary to extract substrings, but only to test whether or not output matches a regex, it is not strictly necessary to allocate and write this output to a buffer. This crate provides a simple interface on top of the lower-level [`regex-automata`] library that implements `fmt::Write` and `io::Write` for regex patterns. This may be used to test whether streaming output matches a pattern without buffering that output.
> 
> Users who need to extract substrings based on a pattern or who already have buffered data should probably use the [`regex`] crate instead.
> 
> [`regex`]: https://crates.io/crates/regex
> [`regex-automata`]: https://crates.io/crates/regex-automata

`tracing` spans/events structured logging facilities capture non-primitive types [by their `Debug` formatting](https://docs.rs/tracing/latest/tracing/field/trait.Visit.html#tymethod.record_debug). It's expected for such formatting to be relatively cheap — cheaper to do multiple times than to allocate a buffer and do so a singular time, anyway. In most cases, the captured values will only be formatted once, into a large buffer used for recording the event, with all recorded fields printing into the same buffer.

Where the regex comes in is filtering. Filters can observe the recorded field, and filter spans/events in/out depending on the value of the field, and this is primarily done via regex matching. When doing this filtering, tracing-subscriber wants to avoid allocating a buffer for the fields to test, just to throw it away afterwards. This is done by writing the formatting machinery directly into driving a regex-automata DFA byte-by-byte.

tracing-subscriber filters are not supposed to be resilient to untrusted construction, but using the regex crate instead of regex-automata would reduce the failure case from regex compilation taking near-unbounded CPU/RAM to filter application taking O(n) time on the length of the filter string.

## Implementation sketch

For simplicity's sake, I'm providing an implementation sketch using [`regex_automata@0.2.0::hybrid::dfa::DFA`](https://docs.rs/regex-automata/0.2.0/regex_automata/hybrid/dfa/struct.DFA.html). This implementation is derived from `matchers`', which is MIT licensed[^1].

[^1]: I believe the code as presented here is simple/axiomatic enough that it is uncopyrightable. As much as is permissible by law, I release copyright on my modifications to the code, to the detriment of my heirs. (My modifications can be considered licensed under [the Unlicense](https://unlicense.org/).)

<details><summary>Implementation sketch</summary>

```rust
impl DFA {
    fn is_match_streaming(
        &self,
        cache: &mut Cache,
        bytes: impl Display,
    ) -> Result<bool, CacheError> {
        let mut matcher = StreamingMatcher {
            dfa: self,
            // NB: we need a start_state method that doesn't take the
            //     text as input, like the one in regex-automata@0.1
            state: self.start_state_anchored(cache, None)?,
            cache: cache,
        };
        use io::Write;
        write!(&mut matcher, "{}", bytes).expect("infallible Write impl");
        matcher.finish()
    }
}

struct StreamingMatcher<'a> {
    dfa: &'a DFA,
    state: Result<LazyStateID, CacheError>,
    cache: &'a mut Cache,
}

impl<'a> io::Write for StreamingMatcher<'a> {
    fn write(&mut self, bytes: &[u8]) -> io::Result<usize> {
        let mut i = 0;
        for &byte in bytes {
            self.advance(byte);
            i += 1;
            if self.is_dead() {
                break;
            }
        }
        Ok(i)
    }
}

impl StreamingMatcher<'_> {
    fn advance(&mut self, input: u8) {
        let Ok(state) = self.state else { return; }
        // TODO: use next_state_untagged/next_state_untagged_unchecked
        // TODO: replace CacheError with MatchError::GaveUp
        self.state = self.automaton.next_state(self.cache, state, input);
    }

    fn is_dead(&self) -> bool {
        // TODO: handle quit state
        match self.state {
            Ok(state) => state.is_dead(),
            Err(_) => true,
        }
    }

    fn finish(self) -> Result<bool, MatchError> {
        let state = self.state?;
        let state = self.dfa.next_eoi_state(self.cache, state)?;
        Ok(state.is_match())
    }
}
```
</details>

(I do not know how such an implementation would handle unicode word boundaries, as AIUI those are handled by annoying special cases and not the bytewise (hybrid) DFA.)

By my quick scanning through the regex implementation, the pikeVM is always driven byte-by-byte, and could be driven in a streaming fashion this same way.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Feature request: unbuffered text matching #852

Examples

Motivation

Implementation sketch

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Feature request: unbuffered text matching #852

Description

Examples

Motivation

Implementation sketch

Footnotes

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions