Description
(Probably blocked on #656 restructuring, and potentially better exposed as part of regex-automata.)
TL;DR:
In the advanced/"lower level" impl section, or perhaps on applicable Regex
types provided by regex-automata:
fn Regex::is_match_streaming<O: Display>(&self, text: O) -> bool;
pub fn is_match_streaming<O: Display>(&self, text: O) -> bool
Returns true if and only if there is a match for the regex in the display representation of the object given.
If you have or can get
&str
for the text to match, you should useis_match
instead. This method is potentially useful to reduce allocations, as you don't have to first write the object's display representation into a buffer to match it.However, this can lead to less performant regex execution. Normal regex execution does multiple optimizations that rely on having the full text available to scan, which can't be done in a streaming fashion. Instead, this method has to always run the most general regex VM byte-by-byte over the input.
To test against streaming text which is not solely a single object's
Display
impl, useformat_args!
to manufacture aDisplay
impl with the text you want to match.Examples
// Fictional people from a name generator. // Any similarity to actual persons, living or dead, is purely coincidental. let tiffany = "Ms Tiffany Northman"; let raymond = "Dr Raymond Hyde"; let regex = Regex::new("(?i)tiffany .* loves .* raymond").unwrap(); assert!(regex.is_match_streaming(format_args!("{tiffany} loves {raymond}")));
Motivation
TL;DR: the matchers crate, as used in tracing-subscriber.
The
regex
crate implements regular expression matching on strings and byte arrays. However, in order to match the output of implementations offmt::Debug
andfmt::Display
, or by any code which writes to an instance offmt::Write
orio::Write
, it is necessary to first allocate a buffer, write to that buffer, and then match the buffer against a regex.In cases where it is not necessary to extract substrings, but only to test whether or not output matches a regex, it is not strictly necessary to allocate and write this output to a buffer. This crate provides a simple interface on top of the lower-level
regex-automata
library that implementsfmt::Write
andio::Write
for regex patterns. This may be used to test whether streaming output matches a pattern without buffering that output.Users who need to extract substrings based on a pattern or who already have buffered data should probably use the
regex
crate instead.
tracing
spans/events structured logging facilities capture non-primitive types by their Debug
formatting. It's expected for such formatting to be relatively cheap — cheaper to do multiple times than to allocate a buffer and do so a singular time, anyway. In most cases, the captured values will only be formatted once, into a large buffer used for recording the event, with all recorded fields printing into the same buffer.
Where the regex comes in is filtering. Filters can observe the recorded field, and filter spans/events in/out depending on the value of the field, and this is primarily done via regex matching. When doing this filtering, tracing-subscriber wants to avoid allocating a buffer for the fields to test, just to throw it away afterwards. This is done by writing the formatting machinery directly into driving a regex-automata DFA byte-by-byte.
tracing-subscriber filters are not supposed to be resilient to untrusted construction, but using the regex crate instead of regex-automata would reduce the failure case from regex compilation taking near-unbounded CPU/RAM to filter application taking O(n) time on the length of the filter string.
Implementation sketch
For simplicity's sake, I'm providing an implementation sketch using regex_automata@0.2.0::hybrid::dfa::DFA
. This implementation is derived from matchers
', which is MIT licensed1.
Implementation sketch
impl DFA {
fn is_match_streaming(
&self,
cache: &mut Cache,
bytes: impl Display,
) -> Result<bool, CacheError> {
let mut matcher = StreamingMatcher {
dfa: self,
// NB: we need a start_state method that doesn't take the
// text as input, like the one in regex-automata@0.1
state: self.start_state_anchored(cache, None)?,
cache: cache,
};
use io::Write;
write!(&mut matcher, "{}", bytes).expect("infallible Write impl");
matcher.finish()
}
}
struct StreamingMatcher<'a> {
dfa: &'a DFA,
state: Result<LazyStateID, CacheError>,
cache: &'a mut Cache,
}
impl<'a> io::Write for StreamingMatcher<'a> {
fn write(&mut self, bytes: &[u8]) -> io::Result<usize> {
let mut i = 0;
for &byte in bytes {
self.advance(byte);
i += 1;
if self.is_dead() {
break;
}
}
Ok(i)
}
}
impl StreamingMatcher<'_> {
fn advance(&mut self, input: u8) {
let Ok(state) = self.state else { return; }
// TODO: use next_state_untagged/next_state_untagged_unchecked
// TODO: replace CacheError with MatchError::GaveUp
self.state = self.automaton.next_state(self.cache, state, input);
}
fn is_dead(&self) -> bool {
// TODO: handle quit state
match self.state {
Ok(state) => state.is_dead(),
Err(_) => true,
}
}
fn finish(self) -> Result<bool, MatchError> {
let state = self.state?;
let state = self.dfa.next_eoi_state(self.cache, state)?;
Ok(state.is_match())
}
}
(I do not know how such an implementation would handle unicode word boundaries, as AIUI those are handled by annoying special cases and not the bytewise (hybrid) DFA.)
By my quick scanning through the regex implementation, the pikeVM is always driven byte-by-byte, and could be driven in a streaming fashion this same way.
Footnotes
-
I believe the code as presented here is simple/axiomatic enough that it is uncopyrightable. As much as is permissible by law, I release copyright on my modifications to the code, to the detriment of my heirs. (My modifications can be considered licensed under the Unlicense.) ↩