Consider this program:
    extern crate regex;

    use regex::bytes::Regex;

    fn main() {
        let re = Regex::new("").unwrap();
        for m in re.find_iter("☃".as_bytes()) {
            println!("{:?}", (m.start(), m.end()));
        }
    }

Its output is:

    (0, 0)
    (1, 1)
    (2, 2)
    (3, 3)
Also, consider this program, which is a different manifestation of the same underlying bug:
    extern crate regex;

    use regex::bytes::Regex;

    fn main() {
        let re = Regex::new("").unwrap();
        for m in re.find_iter(b"b\xFFr") {
            println!("{:?}", (m.start(), m.end()));
        }
    }

Its output is:

    (0, 0)
    (1, 1)
    (2, 2)
    (3, 3)
In particular, the empty pattern matches at every byte offset, including offsets that fall between the UTF-8 code units of a single codepoint (first example) and offsets adjacent to otherwise invalid UTF-8 (second example).
A related note here is that find_iter is implemented slightly differently in bytes::Regex when compared with Regex. Namely, upon observing an empty match, the iterator forcefully advances its current position by a single character. For Unicode regexes, a character is a Unicode codepoint. For byte-oriented regexes, a character is any single byte. The problem here is that the bytes::Regex iterator always assumes the byte-oriented definition, even when Unicode mode is enabled for the entire regex (which is the default).
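To make the two definitions of "character" concrete, here is a rough std-only sketch of the stepping rule. It is not the crate's actual iterator code; the function names are hypothetical, and it only models where an empty pattern would be reported under each rule.

```rust
/// Byte-oriented stepping: a "character" is any single byte.
/// (Hypothetical helper, not part of the regex crate.)
fn advance_one_byte(pos: usize) -> usize {
    pos + 1
}

/// Unicode stepping: a "character" is one codepoint of a valid UTF-8 string.
/// `pos` must lie on a char boundary of `haystack`.
/// (Hypothetical helper, not part of the regex crate.)
fn advance_one_codepoint(haystack: &str, pos: usize) -> usize {
    match haystack[pos..].chars().next() {
        Some(c) => pos + c.len_utf8(),
        // At the end of the input: step past it so that iteration stops.
        None => pos + 1,
    }
}

/// Offsets at which an empty pattern is reported when stepping by bytes.
fn empty_offsets_by_byte(haystack: &[u8]) -> Vec<usize> {
    let mut offsets = Vec::new();
    let mut pos = 0;
    while pos <= haystack.len() {
        offsets.push(pos);
        pos = advance_one_byte(pos);
    }
    offsets
}

/// Offsets at which an empty pattern is reported when stepping by codepoints.
fn empty_offsets_by_codepoint(haystack: &str) -> Vec<usize> {
    let mut offsets = Vec::new();
    let mut pos = 0;
    while pos <= haystack.len() {
        offsets.push(pos);
        pos = advance_one_codepoint(haystack, pos);
    }
    offsets
}
```

For the snowman, which is three bytes long in UTF-8, byte stepping visits offsets 0 through 3, while codepoint stepping visits only 0 and 3.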
We could fix part of this issue by making the bytes::Regex iterator respect the value of the unicode flag when set via bytes::RegexBuilder. Namely, we could make the iterator advance one Unicode codepoint in the case of an empty match when Unicode mode is enabled for the entire regex. The problem here is the behavior in the second example, when Unicode mode is enabled, but we match at invalid UTF-8 boundaries. In that case, "skipping ahead one Unicode codepoint" doesn't really make sense, because it assumes valid UTF-8. This is why the bytes::Regex iterator works the way it does. The intention was to rely on the matching semantics themselves to preserve the UTF-8 guarantee.
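A minimal sketch of that partial fix, using only the standard library, might look like the following. The helper names and the `unicode` parameter are hypothetical stand-ins for the iterator's internal state, and the one-byte fallback on invalid UTF-8 is an assumption made purely for illustration, not something the crate is known to do.

```rust
/// Hypothetical helper: the position to resume searching from after an
/// empty match at `pos`. With `unicode` set, it tries to skip one whole
/// codepoint; if the bytes at `pos` do not decode as UTF-8, it falls back
/// to a single byte (an assumption for this sketch).
fn next_pos_after_empty_match(haystack: &[u8], pos: usize, unicode: bool) -> usize {
    if unicode {
        // A UTF-8 encoded codepoint is between 1 and 4 bytes long. UTF-8 is
        // prefix-free, so the first length that validates is the codepoint.
        for len in 1..=4 {
            let end = pos + len;
            if end > haystack.len() {
                break;
            }
            if std::str::from_utf8(&haystack[pos..end]).is_ok() {
                return end;
            }
        }
    }
    pos + 1
}

/// Start offsets of the empty matches reported under this stepping rule.
fn empty_match_starts(haystack: &[u8], unicode: bool) -> Vec<usize> {
    let mut starts = Vec::new();
    let mut pos = 0;
    while pos <= haystack.len() {
        starts.push(pos);
        pos = next_pos_after_empty_match(haystack, pos, unicode);
    }
    starts
}
```

Under this sketch, `empty_match_starts("☃".as_bytes(), true)` yields offsets 0 and 3, so the first example improves. `empty_match_starts(b"b\xFFr", true)` still yields 0, 1, 2 and 3, which is the second example's problem: there is no codepoint to skip at the `\xFF` byte.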
I guess ideally, the empty regex shouldn't match at locations that aren't valid UTF-8 boundaries when Unicode mode is enabled. This would completely fix the entire issue. I'm not entirely sure what the best way to implement this would be though.
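One way to picture the desired result, as opposed to how it would be implemented inside the matching engine, is a filter over candidate offsets. The sketch below is std-only and its names are hypothetical. It treats an offset as a boundary when it is the start or end of the input or when the byte at that offset is not a UTF-8 continuation byte, which mirrors the rule used by `str::is_char_boundary`.

```rust
/// Hypothetical check: is `pos` a plausible UTF-8 boundary in `haystack`?
/// The start and end of the input always count. Otherwise the byte at `pos`
/// must not be a continuation byte (bit pattern 10xxxxxx).
fn is_utf8_boundary(haystack: &[u8], pos: usize) -> bool {
    if pos == 0 || pos == haystack.len() {
        return true;
    }
    match haystack.get(pos) {
        Some(&b) => (b & 0b1100_0000) != 0b1000_0000,
        // `pos` is past the end of the input.
        None => false,
    }
}

/// Offsets where an empty pattern would match if it were restricted to
/// boundaries as defined by `is_utf8_boundary`.
fn boundary_aligned_empty_matches(haystack: &[u8]) -> Vec<usize> {
    (0..=haystack.len())
        .filter(|&pos| is_utf8_boundary(haystack, pos))
        .collect()
}
```

For the snowman this leaves only offsets 0 and 3. Note that this particular definition still accepts every offset in `b"b\xFFr"`, since `\xFF` is not a continuation byte, so it says nothing about what should happen around invalid UTF-8.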
This bug was initially reported as a bug in ripgrep in BurntSushi/ripgrep#937.