(Probably needs some fixes in CppGuess for some platforms I tried though)
This turns out to be harder than I was thinking. The first step is to compile two versions of the regexp, one for matching UTF-8 and one for matching Latin1 (maybe on demand).
RE2 won’t accept \x{…} escapes that are greater than the current character set. I was hoping it would be possible to give a string containing these to RE2 then let RE2 realise part of it won’t match (e.g. (?:foo|\x{1234}) will still match foo, even if the input string isn’t UTF-8).
(I’m only talking about \x{…}; this is the only case I have to care about, \p{…} are accepted by RE2 regardless. Due to Perl’s behaviour we can’t have raw UTF-8 in the string if the UTF-8 flag isn’t on.)
The approach for now will probably be to replace \x{nnn} in strings (where nnn>0xFF) with something that won’t match (maybe [^\x00-\xff]), but allows the other branches to match.
At least at the top level, implementing within RE2 would be silly.
RE2 doesn’t have all the behaviours perl does (i.e. /a is implied for \d, etc.). Might just be a case of documenting what RE2 does, once UTF-8 is working to some extent. An alternative could be to make things explicit (e.g. you need to say “no feature ‘unicode_strings’” if you happen to have enabled them to use RE2).
i.e. a Regexp::RE2::Set class that can have RE2 regexps added into it then a match method.
See maybe https://github.com/axiak/pyre2/blob/master/tests/performance.py