Regex Escaping #6124

Closed
absherwin opened this issue Mar 12, 2014 · 19 comments
Labels
strings "Strings!"

Comments

@absherwin

I can't find a function to escape a string for use in regular expression matching. That is: f such that match(Regex(f(s1)),s2)!=nothing iff s1==s2.

Python has such a function (re.escape), as do Ruby (Regexp.escape) and MATLAB (regexptranslate). R, on the other hand, does not.

Assuming the community decides to add it, escape_regex seems like a clear name, but it bothers me slightly because it acts on a string rather than a Regex. escape_regexstring seems more accurate but needlessly verbose.

I'm happy to contribute the patch if the community agrees this is worthwhile.

Thoughts?

@StefanKarpinski
Member

You can use Regex("^\\Q$s1\\E\$"). You can also search for literal strings, which is likely to be faster.
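
A minimal sketch of the two suggestions above, assuming the goal is to treat `s1` as literal text. The sample strings are made up for illustration.

```julia
s1 = "a.b"   # '.' would act as a wildcard if it were left unquoted

# \Q ... \E asks PCRE to treat everything in between as literal text.
r = Regex("^\\Q$s1\\E\$")

println(occursin(r, "a.b"))   # the exact text
println(occursin(r, "axb"))   # '.' no longer acts as a wildcard here

# When only a literal comparison or substring search is needed,
# no regex has to be built at all.
println(s1 == "a.b")
println(occursin(s1, "xx a.b yy"))
```

The literal search avoids compiling a pattern, which is presumably why it is likely to be faster.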

@JeffBezanson
Member

I guess we will go with that solution; we can reopen if we decide to add this function.

@bkamins
Member

bkamins commented Sep 19, 2017

@StefanKarpinski The following code

s1 = "\\E"
r = Regex("^\\Q$s1\\E\$")
match(r, "")

finds a match; however, it should not, as we wanted to look for \E. Maybe it is worth considering reopening this issue and adding a function that escapes a string for use in a regex?
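
One possible way around this, sketched under the assumption that we keep the \Q...\E approach: close the quoted region before every literal \E, emit an escaped backslash followed by E, and reopen the quoting. `quote_literal` is a hypothetical name used only for this example, not an existing Base function.

```julia
# Hypothetical helper (not part of Base): wrap `s` in \Q...\E,
# splitting the quoted region around any literal \E inside `s`.
quote_literal(s::AbstractString) =
    "\\Q" * replace(s, "\\E" => "\\E\\\\E\\Q") * "\\E"

s1 = "\\E"

# The naive pattern: the \E inside s1 ends the quoting early.
naive = Regex("^\\Q$s1\\E\$")

# The split-and-reopen pattern keeps \E literal.
safe = Regex("^" * quote_literal(s1) * "\$")

println(match(naive, "") !== nothing)    # the problem reported above
println(match(safe, "") !== nothing)
println(match(safe, "\\E") !== nothing)
```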

@StefanKarpinski
Member

Yeah, this could use a more complete solution. It seems to be strictly a feature addition, however.

@mlhetland
Contributor

Didn't want to go full PR on this (yet), but it seems there are semi-standard ways of doing this. My implementations of three canonical-ish versions follow. First, the Perl version, which escapes everything except word characters:

# Perl version:
quotemeta(s::AbstractString) = replace(s, r"(\W)" => s"\\\1")

As the docs say, this is the internal version behind the \Q escapes; given that those are available in PCRE as well, I assume there might be a function there we could expose directly (though a cursory grepping through the code didn't uncover it).

This is simple enough that it might simply be a recipe in the docs, I guess. ¯\_(ツ)_/¯

The C++ wrapper of PCRE has an implementation of this (which suggests that it's not available directly from PCRE), and it does the same thing, except it also replaces "\0" with "0" (in addition to escaping it), as PCRE doesn't handle "\0", escaped or not:

# pcrecpp version:
function quotemeta(s::AbstractString)
    res = replace(s, r"(\W)" => s"\\\1")
    replace(res, "\0" => "0")
end

Finally, there's the more conservative PHP version, which only escapes the special characters .\+*?[^]$(){}=!<>|:-, and possibly one optional delimiter (which I guess ought to be a single character, but the PHP docs don't explicitly state that; then again, escaping the first character of a character sequence might be enough).

# PHP version:
function quotemeta(s::AbstractString; delim=nothing)
    res = replace(s, r"([.\\+*?[^\]$(){}=!<>|:-])" => s"\\\1")
    delim !== nothing ? replace(res, delim => "\\$delim") : res
end

I do think escaping "\0" seems sensible, as in pcrecpp – and I also think escaping «everything» is perhaps going a bit far (though it does ensure that no special character is left out, of course). On the other hand, the extra delimiter seems a bit out of place (for us). Maybe a combination of the pcrecpp and PHP versions?

# Julia version?
function quotemeta(s::AbstractString)
    res = replace(s, r"([.\\+*?[^\]$(){}=!<>|:-])" => s"\\\1")
    replace(res, "\0" => "\\0")
end

@mlhetland
Contributor

mlhetland commented Jul 28, 2018

The Python version escapes ()[]{}?*+-|^$\.&~# \t\n\r\v\f; from what I can tell, the Ruby version omits & and ~. The MATLAB version seems to be customizable, but also follows the strategy of selectively escaping special characters (rather than simply \W).

@mlhetland
Contributor

Escaping whitespace does seem to make sense (though we might as well use \s) – as it'll be ignored if we're using the x modifier. Same with #, for comments.
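
A small illustration of why that matters, assuming Julia's `x` regex flag (PCRE extended mode). The patterns and inputs are invented for the example.

```julia
# In x-mode, unescaped whitespace inside the pattern is skipped ...
println(occursin(r"^a b$"x, "ab"))
println(occursin(r"^a b$"x, "a b"))

# ... while an escaped space is matched literally.
println(occursin(r"^a\ b$"x, "a b"))

# An unescaped '#' starts a comment that runs to the end of the line,
# so the rest of this pattern (including the '$' anchor) is dropped.
println(occursin(r"^a#b$"x, "axyz"))

# Escaping '#' keeps it as a literal character.
println(occursin(r"^a\#b$"x, "a#b"))
```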

@mlhetland
Contributor

So, including these extra precautionary characters from the Python version (but escaping all whitespace, not just the characters listed):

function quotemeta(s::AbstractString)
    res = replace(s, r"([()[\]{}?*+\-|^\$\\.&~#\s=!<>|:])" => s"\\\1")
    replace(res, "\0" => "\\0")
end
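
A quick sanity check of this variant against the property asked for at the top of the issue, as a sketch only. `exact` is a hypothetical helper introduced here, and the sample strings are arbitrary.

```julia
# Variant from the comment above, repeated so the snippet is self-contained.
function quotemeta(s::AbstractString)
    res = replace(s, r"([()[\]{}?*+\-|^\$\\.&~#\s=!<>|:])" => s"\\\1")
    replace(res, "\0" => "\\0")
end

# Hypothetical helper: anchor the escaped text so the whole string must match.
exact(s) = Regex("^" * quotemeta(s) * "\$")

samples = ["a.b", "1+1=2", "[x]", "a b#c", "\$5", "x|y"]
for s in samples
    @assert occursin(exact(s), s)      # each string matches its own escaped form
end

# Metacharacters lose their special meaning after escaping.
@assert !occursin(exact("a.b"), "axb")

# Escaped whitespace and '#' survive the x flag as well.
@assert occursin(Regex("^" * quotemeta("a b#c") * "\$", "x"), "a b#c")
```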

@mlhetland
Contributor

… though I guess if PCRE doesn't ignore other whitespace than the above in x-mode, then the use of \s is just an unnecessary performance hit, and we might as well list the specific offenders.

@vtjnash vtjnash added the strings "Strings!" label Oct 10, 2018
@amellnik
Contributor

@mlhetland Did you want to make a PR for this? If not, I can. Cheers!

@mlhetland
Contributor

Feel free to!-) I’d suggest looking into that last issue – i.e. what PCRE ignores in x-mode vs what is matched by \s. If the latter is a strict superset, we might just go with the former (listed explicitly).

@erezsh

erezsh commented Aug 10, 2019

Ping! Why is the PR still on hold?

@StefanKarpinski
Member

See also: #29643

@vtjnash
Member

vtjnash commented Feb 3, 2021

Closing as we have a version of this (though perhaps not the final version, per that PR)

@vtjnash vtjnash closed this as completed Feb 3, 2021
@erezsh

erezsh commented Feb 3, 2021

I don't understand the practice of closing an issue before the PR has been merged.

@mlhetland
Contributor

I think the point is that there is another function (in Base) that solves the problem, even though it's up in the air whether the one from the PR should be added in addition.

@mlhetland
Contributor

See, specifically, this comment, about Base.wrap_string, which provides the functionality that is the topic of this issue.

@erezsh

erezsh commented Feb 3, 2021

That comment was left two days ago, for a PR that's been open since 2018. wrap_string isn't even documented yet.

Does wrap_string escape things like [ or .? If not, then it's nothing like regex escaping.

It's another thing if you said "we don't care about this feature." Then it would make more sense to me.

@mlhetland
Contributor

I have nothing to do with the decisions, here. I just implemented the function in the PR, because I also needed it (two years ago). Just tried to explain my impression of why the issue was closed – but I agree that until wrap_string is at least documented, the issue could well be left open.

As for what wrap_string does, I think it uses the Perl approach of regexp escape, which essentially escapes everything; but it does seem mostly for internal use (with a mysterious second integer argument…).
