Add a function to escape strings for use in regular expressions #29643

Open
wants to merge 2 commits into
base: master
1 change: 1 addition & 0 deletions base/exports.jl
@@ -443,6 +443,7 @@ export
    findprev,
    match,
    occursin,
    regex_escape,
    searchsorted,
    searchsortedfirst,
    searchsortedlast,
24 changes: 24 additions & 0 deletions base/regex.jl
@@ -449,3 +449,27 @@ function hash(r::Regex, h::UInt)
    h = hash(r.compile_options, h)
    h = hash(r.match_options, h)
end

## escaping ##
"""
    regex_escape(s::AbstractString)
Member

Probably better to call this escape_regex, for consistency with escape_string?


Sanitize a string to make it safe for use in regular expression pattern construction. Any
regular expression metacharacters are escaped along with whitespace.

# Examples
```jldoctest
julia> regex_escape("Bang!")
"Bang\\!"

julia> regex_escape("  ( [ { . ? *")
"\\ \\ \\(\\ \\[\\ \\{\\ \\.\\ \\?\\ \\*"

julia> regex_escape("/^[a-z0-9_-]{3,16}\$/")
"/\\^\\[a\\-z0\\-9_\\-\\]\\{3,16\\}\\\$/"
```
"""
function regex_escape(s::AbstractString)
    res = replace(s, r"([()[\]{}?*+\-|^\$\\.&~#\s=!<>|:])" => s"\\\1")
Member

Why =!<>|: (which aren't in the Python list)?

Contributor

I believe my original was based on PHP, with some precautionary characters added from Python, plus whitespace escaping.

Contributor

(But I haven’t looked at wrap_string. Documenting that might be a reasonable alternative. 🤷‍♂️)

Contributor

(My comments end with this version, BTW.)

Contributor

Seems like wrap_string uses the Perl strategy (with \Q)? I just thought that was overkill, but given that it’s already used, that might be the way to go, dropping this version. (Just my two cents.)

    replace(res, "\0" => "\\0")
end
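
For comparison, the Perl-style `\Q...\E` quoting raised in the comments above can be sketched as follows. This is only an illustration: `quote_literal` is a hypothetical name, it is not part of this PR, and it is not claimed to be how `wrap_string` is implemented in Base. It relies on PCRE treating everything between `\Q` and `\E` as literal text.

```julia
# Hypothetical helper (not in Base, not in this PR): quote `s` with \Q...\E
# so that PCRE treats every character in it literally.
function quote_literal(s::AbstractString)
    # A literal "\E" inside `s` would end the quoted span early, so close the
    # span, emit an escaped backslash followed by "E", and reopen the span.
    body = replace(s, raw"\E" => raw"\E\\E\Q")
    return string(raw"\Q", body, raw"\E")
end

# Each input should match itself when the quoted pattern is anchored.
for s in [".**", raw"/^[a-z0-9_-]{3,16}$/", raw"a\Eb"]
    r = Regex(string("^", quote_literal(s), "\$"))
    println(repr(s), " => ", occursin(r, s))
end
```

The trade-off is that the quoted pattern is less readable than a backslash-escaped one, but no list of metacharacters has to be maintained.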
1 change: 1 addition & 0 deletions doc/src/base/strings.md
@@ -31,6 +31,7 @@ Base.isvalid(::Any, ::Any)
Base.isvalid(::AbstractString, ::Integer)
Base.match
Base.eachmatch
Base.regex_escape
Base.isless(::AbstractString, ::AbstractString)
Base.:(==)(::AbstractString, ::AbstractString)
Base.cmp(::AbstractString, ::AbstractString)
11 changes: 11 additions & 0 deletions doc/src/manual/strings.md
@@ -936,6 +936,17 @@ ERROR: syntax: invalid escape sequence
Triple-quoted regex strings, of the form `r"""..."""`, are also supported (and may be convenient
for regular expressions containing quotation marks or newlines).

The `regex_escape` function allows you to escape a string for use in constructing a regular
expression. All whitespace and PCRE metacharacters are escaped.

```julia-repl
Member

Could this be a doctest?

julia> regex_escape("Bang!")
"Bang\\!"

julia> regex_escape("  ( [ { . ? *")
"\\ \\ \\(\\ \\[\\ \\{\\ \\.\\ \\?\\ \\*"
```
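
A minimal sketch of why such escaping matters, using only functions that already exist in Base. The hand-written pattern `"1\\+1"` stands in for what an escaping function would be expected to produce for the input `"1+1"`; the variable names are illustrative only.

```julia
needle = "1+1"  # text that is meant to be matched literally

# Unescaped, `+` acts as a quantifier: "one or more 1s, followed by a 1".
unescaped = Regex(needle)

# Hand-escaped stand-in for the output of an escaping function.
escaped = Regex("1\\+1")

println(occursin(unescaped, "1+1"))  # false: the text contains no run "11"
println(occursin(unescaped, "111"))  # true: an unintended match
println(occursin(escaped, "1+1"))    # true: `+` is now a literal character
```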

## [Byte Array Literals](@id man-byte-array-literals)

Another useful non-standard string literal is the byte-array string literal: `b"..."`. This
13 changes: 13 additions & 0 deletions test/regex.jl
@@ -61,3 +61,16 @@ end
# 'a' flag to disable UCP
@test match(r"\w+", "Düsseldorf").match == "Düsseldorf"
@test match(r"\w+"a, "Düsseldorf").match == "D"

# Test escaping strings for use in regular expressions. We take some regular expressions and
# make sure we can construct a regular expression from them that matches the original string.
test_strings = [
    ".**",
    raw"/^[a-z0-9_-]{3,16}$/",
    raw"/^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/"
]
for s in test_strings
    r = Regex(string("(", regex_escape(s), ")"))
    m = match(r, s)
    @test length(m.captures) == 1
end
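
A stricter round-trip check is possible, sketched here as a suggestion rather than as part of the diff. Counting captures only shows that the pattern compiled and matched somewhere; anchoring the pattern and comparing the capture with the input also shows that the whole string matched literally. The block carries a local copy of `regex_escape` from this diff so that it runs on its own.

```julia
using Test

# Local copy of the function proposed in this diff, so the sketch is self-contained.
function regex_escape(s::AbstractString)
    res = replace(s, r"([()[\]{}?*+\-|^\$\\.&~#\s=!<>|:])" => s"\\\1")
    replace(res, "\0" => "\\0")
end

test_strings = [
    ".**",
    raw"/^[a-z0-9_-]{3,16}$/",
    raw"/^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/"
]
for s in test_strings
    # Anchor both ends so that a partial match cannot pass.
    m = match(Regex(string("^(", regex_escape(s), ")\$")), s)
    @test m !== nothing
    @test m.captures[1] == s
end
```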