Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a function to escape strings for use in regular expressions #29643

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

amellnik
Copy link
Contributor

This uses the implementation from @mlhetland that was mentioned in #6124.

Copy link
Member

@StefanKarpinski StefanKarpinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure that you can't do correct escaping with regex replacement. I could be wrong, but I suspect you need a state machine to do this fully correctly.

@mlhetland
Copy link
Contributor

Regexen and state machines are equivalent, of course, but I guess your objection goes beyond that. However, look at (e.g.) the Python implementation – it just uses straight-up character substitution:
https://github.com/python/cpython/blob/3.7/Lib/re.py

@amellnik
Copy link
Contributor Author

I think the travis failure is unrelated -- maybe some temporary network issue since the failing tests are related to Sockets?

We've been using a slightly simpler version (almost exactly what Python does) in an internal project without issues for a while. Are there any other inputs we should test this with? I threw a few strings containing metacharacters at it as well as some strange stuff from BLoNS and it seems to be working.

@erezsh
Copy link

erezsh commented Aug 10, 2019

Hey guys, what's holding this up? It seems to have had popular demand, and I found myself re-implementing this myself lately.

Also, wouldn't a better name be escape_regex, to conform to escape_string?

@StefanKarpinski
Copy link
Member

Needs to be rebased (it's quite conflicted at this point) and reviewed. Although note that a different escaping approach was used in #23422 and we should probably use the same approach for this kind of functionality.

@erezsh
Copy link

erezsh commented Aug 12, 2019

It's about 20 lines of code, include documentation, it should be fairly easy to merge or just rewrite.

I would do it myself, but I don't know much about Julia's conventions yet (but I'm willing to do it with guidance).

@mlhetland
Copy link
Contributor

It seems it might need to be replaced, if it is to conform to #23422?

@mlhetland
Copy link
Contributor

There’s also the issue I mentioned with my original code: We might not want to use \s, at least if it does a Unicode lookup; it would be a performance hit with no real benefit. Instead, we might want to use specifically the characters skipped by PCRE in x mode. (Not that this is a dealbreaker; I just thought going with \s should be a conscious decision, as I wasn’t entirely happy with it in my implementation :-))


## escaping ##
"""
regex_escape(s::AbstractString)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better call this escape_regex, for consistency with escape_string?

@nalimilan nalimilan added the strings "Strings!" label Aug 20, 2019
The `regex_escape` function allows you to escape a string for use in constructing a regular
expression. All whitespace and PCRE metacharacters are escaped.

```julia-repl
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this be a doctest?

@rfourquet
Copy link
Member

I hadn't noticed this PR before, but #31989 now implements escaping with the same logic as in #23422.

@amellnik
Copy link
Contributor Author

I think #31989 is further along now, I'll close this PR in favor of that

@amellnik amellnik closed this Aug 21, 2019
@mlhetland
Copy link
Contributor

mlhetland commented Aug 22, 2019

One disadvantage of dropping this for #31989 is that we no longer cover the original functionality, of producing a string where regex stuff is escaped. It only covers the cases where you want a fully formed regex, which isn’t what I needed, and which isn’t what other languages do.

The new functionality of composing regexen etc. helps, of course, but there may still be cases where you’ll want to simply escape the various character sequences without actually producing a regex object, in which case you’d have to reimplement this.

@amellnik
Copy link
Contributor Author

@mlhetland Good point -- I should have looked more closely at that PR.

Copy link
Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still want this, given that we also now have (undocumented) Base.wrap_string?

```
"""
function regex_escape(s::AbstractString)
res = replace(s, r"([()[\]{}?*+\-|^\$\\.&~#\s=!<>|:])" => s"\\\1")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why =!<>|: (which aren't in the python list)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe my original was based on PHP, but adding some precautionary characters from Python, and escaping whitespace.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(But I haven’t looked at wrap_string. Documenting that might be a reasonable alternative. 🤷‍♂️)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(My comments end with this version, BTW.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like wrap_string uses the Perl strategy (with \Q)? I just thought that was overkill, but given that it’s already used, that might be the way to go, dropping this version. (Just my two cents.)

@aplavin
Copy link
Contributor

aplavin commented Sep 23, 2024

Any updates on this? Very short function, but nontrivial to get it right.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants