byte regex can produce empty matches between UTF-8 code units #484

@BurntSushi

Consider this program:

extern crate regex;

use regex::bytes::Regex;

fn main() {
    let re = Regex::new("").unwrap();
    for m in re.find_iter("☃".as_bytes()) {
        println!("{:?}", (m.start(), m.end()));
    }
}

Its output is:

(0, 0)
(1, 1)
(2, 2)
(3, 3)

Also, consider this program, which is a different manifestation of the same underlying bug:

extern crate regex;

use regex::bytes::Regex;

fn main() {
    let re = Regex::new("").unwrap();
    for m in re.find_iter(b"b\xFFr") {
        println!("{:?}", (m.start(), m.end()));
    }
}

Its output is:

(0, 0)
(1, 1)
(2, 2)
(3, 3)

In particular, the empty pattern matches at every byte offset, including offsets that fall between the UTF-8 code units of a single codepoint and offsets adjacent to otherwise invalid UTF-8.

A related note here is that find_iter is implemented slightly differently in bytes::Regex than in Regex. In both, upon observing an empty match, the iterator forcefully advances its current position by a single character. For Unicode regexes, a character is a Unicode codepoint. For byte-oriented regexes, a character is any single byte. The problem here is that the bytes::Regex iterator always assumes the byte-oriented definition, even when Unicode mode is enabled for the entire regex (which is the default).
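To make the two advancement rules concrete, here is a minimal standalone sketch that does not use the regex crate at all. It simulates an always-empty match and records the offsets an iterator would visit under each rule. The names `Step`, `utf8_len` and `empty_match_offsets` are hypothetical and exist only for this illustration, and the codepoint rule trusts the leading byte, so it assumes the haystack is valid UTF-8.

```rust
/// Which definition of "one character" the iterator uses when it
/// steps past an empty match. (Hypothetical type for illustration.)
#[derive(Clone, Copy)]
enum Step {
    Byte,
    Codepoint,
}

/// Byte length of the UTF-8 sequence introduced by `lead`.
/// Bytes that cannot start a sequence are treated as length 1.
fn utf8_len(lead: u8) -> usize {
    match lead {
        0x00..=0x7F => 1,
        0xC2..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF4 => 4,
        _ => 1,
    }
}

/// Offsets at which a pattern that always matches the empty string
/// would be reported, given how far the iterator steps each time.
fn empty_match_offsets(haystack: &[u8], step: Step) -> Vec<usize> {
    let mut offsets = Vec::new();
    let mut at = 0;
    loop {
        offsets.push(at);
        if at == haystack.len() {
            break;
        }
        at += match step {
            Step::Byte => 1,
            // Clamp so a truncated sequence cannot step past the end.
            Step::Codepoint => utf8_len(haystack[at]).min(haystack.len() - at),
        };
    }
    offsets
}
```

For `"☃".as_bytes()`, `Step::Byte` visits offsets 0, 1, 2 and 3, which is the output shown in the first example, while `Step::Codepoint` visits only 0 and 3.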

We could fix part of this issue by making the bytes::Regex iterator respect the value of the unicode flag when set via bytes::RegexBuilder. Namely, we could make the iterator advance one Unicode codepoint in the case of an empty match when Unicode mode is enabled for the entire regex. The problem here is the behavior in the second example, when Unicode mode is enabled, but we match at invalid UTF-8 boundaries. In that case, "skipping ahead one Unicode codepoint" doesn't really make sense, because it kind of assumes valid UTF-8. This is why the bytes::Regex iterator works the way it does. The intention was to rely on the matching semantics themselves to preserve the UTF-8 guarantee.
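The difficulty with "skip ahead one Unicode codepoint" can be seen in a small sketch using only the standard library. The helper below, `codepoint_len_at`, is a hypothetical name for this illustration and is not part of the regex crate. It returns the length of the codepoint starting at a given offset, and returns `None` when no valid codepoint starts there, which is exactly the case where the iterator has no obvious distance to skip.

```rust
/// Byte length of the validly encoded codepoint that starts at `at`,
/// or `None` if `at` is at or past the end, or if the bytes at `at`
/// do not begin a valid UTF-8 sequence.
fn codepoint_len_at(haystack: &[u8], at: usize) -> Option<usize> {
    let rest = haystack.get(at..)?;
    // A single codepoint occupies at most 4 bytes in UTF-8.
    let window = &rest[..rest.len().min(4)];
    let valid = match std::str::from_utf8(window) {
        Ok(s) => s,
        // Keep only the prefix that decoded successfully.
        Err(e) => std::str::from_utf8(&window[..e.valid_up_to()]).ok()?,
    };
    valid.chars().next().map(char::len_utf8)
}
```

On `"☃".as_bytes()` this gives `Some(3)` at offset 0 and `None` at offset 1. On `b"b\xFFr"` it gives `Some(1)` at offset 0, `None` at offset 1 (the `\xFF` byte), and `Some(1)` at offset 2. What the iterator should do in the `None` case is the open question; this sketch deliberately leaves that choice to the caller.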

I guess ideally, the empty regex shouldn't match at locations that aren't valid UTF-8 boundaries when Unicode mode is enabled. This would completely fix the entire issue. I'm not entirely sure what the best way to implement this would be though.

This bug was initially reported as a bug in ripgrep in BurntSushi/ripgrep#937.
