Skip to content

Feature request: unbuffered text matching #852

Closed as not planned
Closed as not planned
@CAD97

Description

@CAD97

(Probably blocked on #656 restructuring, and potentially better exposed as part of regex-automata.)

TL;DR:

In the advanced/"lower level" impl section, or perhaps on applicable Regex types provided by regex-automata:

fn Regex::is_match_streaming<O: Display>(&self, text: O) -> bool;

pub fn is_match_streaming<O: Display>(&self, text: O) -> bool

Returns true if and only if there is a match for the regex in the display representation of the object given.

If you have or can get &str for the text to match, you should use is_match instead. This method is potentially useful to reduce allocations, as you don't have to first write the object's display representation into a buffer to match it.

However, this can lead to less performant regex execution. Normal regex execution does multiple optimizations that rely on having the full text available to scan, which can't be done in a streaming fashion. Instead, this method has to always run the most general regex VM byte-by-byte over the input.

To test against streaming text which is not solely a single object's Display impl, use format_args! to manufacture a Display impl with the text you want to match.

Examples

// Fictional people from a name generator.
// Any similarity to actual persons, living or dead, is purely coincidental.
let tiffany = "Ms Tiffany Northman";
let raymond = "Dr Raymond Hyde";
let regex = Regex::new("(?i)tiffany .* loves .* raymond").unwrap();
assert!(regex.is_match_streaming(format_args!("{tiffany} loves {raymond}")));

Motivation

TL;DR: the matchers crate, as used in tracing-subscriber.

The regex crate implements regular expression matching on strings and byte arrays. However, in order to match the output of implementations of fmt::Debug and fmt::Display, or by any code which writes to an instance of fmt::Write or io::Write, it is necessary to first allocate a buffer, write to that buffer, and then match the buffer against a regex.

In cases where it is not necessary to extract substrings, but only to test whether or not output matches a regex, it is not strictly necessary to allocate and write this output to a buffer. This crate provides a simple interface on top of the lower-level regex-automata library that implements fmt::Write and io::Write for regex patterns. This may be used to test whether streaming output matches a pattern without buffering that output.

Users who need to extract substrings based on a pattern or who already have buffered data should probably use the regex crate instead.

tracing spans/events structured logging facilities capture non-primitive types by their Debug formatting. It's expected for such formatting to be relatively cheap — cheaper to do multiple times than to allocate a buffer and do so a singular time, anyway. In most cases, the captured values will only be formatted once, into a large buffer used for recording the event, with all recorded fields printing into the same buffer.

Where the regex comes in is filtering. Filters can observe the recorded field, and filter spans/events in/out depending on the value of the field, and this is primarily done via regex matching. When doing this filtering, tracing-subscriber wants to avoid allocating a buffer for the fields to test, just to throw it away afterwards. This is done by writing the formatting machinery directly into driving a regex-automata DFA byte-by-byte.

tracing-subscriber filters are not supposed to be resilient to untrusted construction, but using the regex crate instead of regex-automata would reduce the failure case from regex compilation taking near-unbounded CPU/RAM to filter application taking O(n) time on the length of the filter string.

Implementation sketch

For simplicity's sake, I'm providing an implementation sketch using regex_automata@0.2.0::hybrid::dfa::DFA. This implementation is derived from matchers', which is MIT licensed1.

Implementation sketch
impl DFA {
    fn is_match_streaming(
        &self,
        cache: &mut Cache,
        bytes: impl Display,
    ) -> Result<bool, CacheError> {
        let mut matcher = StreamingMatcher {
            dfa: self,
            // NB: we need a start_state method that doesn't take the
            //     text as input, like the one in regex-automata@0.1
            state: self.start_state_anchored(cache, None)?,
            cache: cache,
        };
        use io::Write;
        write!(&mut matcher, "{}", bytes).expect("infallible Write impl");
        matcher.finish()
    }
}

struct StreamingMatcher<'a> {
    dfa: &'a DFA,
    state: Result<LazyStateID, CacheError>,
    cache: &'a mut Cache,
}

impl<'a> io::Write for StreamingMatcher<'a> {
    fn write(&mut self, bytes: &[u8]) -> io::Result<usize> {
        let mut i = 0;
        for &byte in bytes {
            self.advance(byte);
            i += 1;
            if self.is_dead() {
                break;
            }
        }
        Ok(i)
    }
}

impl StreamingMatcher<'_> {
    fn advance(&mut self, input: u8) {
        let Ok(state) = self.state else { return; }
        // TODO: use next_state_untagged/next_state_untagged_unchecked
        // TODO: replace CacheError with MatchError::GaveUp
        self.state = self.automaton.next_state(self.cache, state, input);
    }

    fn is_dead(&self) -> bool {
        // TODO: handle quit state
        match self.state {
            Ok(state) => state.is_dead(),
            Err(_) => true,
        }
    }

    fn finish(self) -> Result<bool, MatchError> {
        let state = self.state?;
        let state = self.dfa.next_eoi_state(self.cache, state)?;
        Ok(state.is_match())
    }
}

(I do not know how such an implementation would handle unicode word boundaries, as AIUI those are handled by annoying special cases and not the bytewise (hybrid) DFA.)

By my quick scanning through the regex implementation, the pikeVM is always driven byte-by-byte, and could be driven in a streaming fashion this same way.

Footnotes

  1. I believe the code as presented here is simple/axiomatic enough that it is uncopyrightable. As much as is permissible by law, I release copyright on my modifications to the code, to the detriment of my heirs. (My modifications can be considered licensed under the Unlicense.)

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions