-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implement replace on String for multiple patterns #40484
Conversation
The multi-replace method is being added in JuliaLang/julia#40484
The multi-replace method is being added in JuliaLang/julia#40484
The multi-replace method is being added in JuliaLang/julia#40484
Thanks, this is great! I just tried another benchmark and confirmed that this is faster than a regex approach: s = "elephant giraffe antelope lion cheetah, "^1000;
const reg = r"elephant|giraffe|antelope|lion|cheetah"
const rep = Dict("elephant"=>"grey", "giraffe"=>"yellow", "antelope"=>"brown", "lion"=>"gold", "cheetah"=>"yellow")
@btime replace($s, $(reg=>(s->rep[s])));
@btime replace($s, "elephant"=>"grey", "giraffe"=>"yellow", "antelope"=>"brown", "lion"=>"gold", "cheetah"=>"yellow"); gives 696µs (regex) and 440µs (PR). However, for single-character replacements: using Random
sr = randstring(1000);
const upper = Dict(c=>uppercase(c) for c in 'a':'z');
@btime uppercase($sr);
@btime map($(c -> 'a' ≤ c ≤ 'z' ? c-32 : c), $sr);
@btime replace($sr, $(r"[a-z]"=>uppercase));
@btime replace($sr, $(Set('a':'z')=>uppercase));
@btime replace($sr, $(r"[a-z]"=>(c->upper[c[1]])));
@btime replace($sr, $(r"a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z"=>(c->upper[c[1]])));
@btime replace($sr, $(Set('a':'z')=>(c->upper[c])));
@btime replace($sr, 'a' => 'A', 'b' => 'B', 'c' => 'C', 'd' => 'D', 'e' => 'E', 'f' => 'F', 'g' => 'G', 'h' => 'H', 'i' => 'I', 'j' => 'J', 'k' => 'K', 'l' => 'L', 'm' => 'M', 'n' => 'N', 'o' => 'O', 'p' => 'P', 'q' => 'Q', 'r' => 'R', 's' => 'S', 't' => 'T', 'u' => 'U', 'v' => 'V', 'w' => 'W', 'x' => 'X', 'y' => 'Y', 'z' => 'Z'); gives 6.2µs, 5.7µs, 71µs, 39µs, 51µs, 60µs, 44µs, and 1.6ms, respectively. This is a bit more concerning — in particular, for the multi-arg
Is it because the argument tuple is too long? |
Yeah, at some point tuple operations switch automatically to using Arrays, as that's a policy decision made elsewhere. Though, as you showed, generally |
It would be nice if the implementation could switch to I can easily see people splatting a large number of replacements here. |
Of course, if you have a really large number of replacements then you would probably want to use a different data structure for |
@@ -572,6 +590,13 @@ If `pat` is a regular expression and `r` is a [`SubstitutionString`](@ref), then | |||
references in `r` are replaced with the corresponding matched text. | |||
To remove instances of `pat` from `string`, set `r` to the empty `String` (`""`). | |||
|
|||
Multiple patterns can be specified, and they will be applied left-to-right | |||
simultaneously, so only one pattern will be applied to any character, and the | |||
patterns will only be applied to the input text, not the replacements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this section is misleading - the text seems to sugest only patterns of the form 'x' => "replacement"
or 'x' => 'a'
are allowed, but the tests further down seem to suggest "original" => "replacement"
and "original" => 'a'
should also work.
Additionally, what should the result of replace("atest", "atest" => 'a', "at" => 'b')
be? Simply "a"
, since the patterns are applied left to right, even though "simultaneously" applying them leads to inconsistency (and imo both are reasonable choices)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The patterns are "not applied to the replacements". Each character is only matched by at most one pattern, though a pattern may match multiple characters (or even none).
I like the performance, but I'm not convinced the behaviour is best in all cases. I've said it before, but I don't think Consistency with |
triage thinks this is good |
This has been attempted before, sometimes fairly similar to this, but the attempts seemed to be either too simple or too complicated. This aims to be simple, and even beats one of the "handwritten" benchmark cases. Past issues (e.g. #25396) have proposed that using Regex may be faster, but in my tests, this handily bests even simplified regexes. There can be slow Regexes patterns that can cause this to exhibit O(n^2) behavior, but only if the one of the earlier patterns is a partial match for a later pattern Regex and that Regex always matches O(n) of the input stream. This is a case that is hopefully usually avoidable in practice. fixes #35327 fixes #39061 fixes #35414 fixes #29849 fixes #30457 fixes #25396
fixnext API is awkward
6467f2f
to
3575dfe
Compare
Python doesn't seem to have a function for this. The top link discussing the issue seems to be here on stackoverflow, and the top suggestions are either to make multiple Ruby also lacks such a feature, and the main suggestion seems to be the regex solution (also here or here). Similarly in Perl, people seem to suggest the regex solution (e.g. here or here). In PostgreSQL, I found a suggestion that people use Perl via an extension and employ the regex solution. In Rust, I again see people proposing the regex solution. In Swift, I've also seen the regex solution proposed. In short, questions of how to do multiple replacements in a "single pass" seem to pop up in many languages, but I've yet to find a language that provides a "built-in" solution. In every other language, the "efficient" way to do this is commonly believed to be a regex combined with a match-mapping function. Most people seem to be happy with the semantics of the regex solution, and are generally unconcerned with the question of overlapping matches. |
Good references. So do you think we should hold off on merging this? |
This currently implements a generalization of the behavior of converting all of the arguments to a single regex, if a regex could tell you which part of the regex was matched. So it seems like positive encouragement for that being the generally accepted workaround for other languages. We already knew Julia was better than them :) |
Yeah it seems good to me, was just wondering what Steven meant since I expect long posts to be complaints 😛 |
And just for amusement, here's a random use that just came to mind: julia> titlecaseish(s) = replace(s, r"(?<=\s|^)." => uppercase, r"." => lowercase);
julia> titlecaseish("ab c bbb de Fajk JEJ's.")
"Ab C Bbb De Fajk Jej's." |
This has been attempted before, sometimes fairly similar to this, but the attempts seemed to be either too simple or too complicated. This aims to be simple, and even beats one of the "handwritten" benchmark cases. Past issues (e.g. JuliaLang#25396) have proposed that using Regex may be faster, but in my tests, this handily bests even simplified regexes. There can be slow Regexes patterns that can cause this to exhibit O(n^2) behavior, but only if the one of the earlier patterns is a partial match for a later pattern Regex and that Regex always matches O(n) of the input stream. This is a case that is hopefully usually avoidable in practice. fixes JuliaLang#35327 fixes JuliaLang#39061 fixes JuliaLang#35414 fixes JuliaLang#29849 fixes JuliaLang#30457 fixes JuliaLang#25396
This has been attempted before, sometimes fairly similar to this, but the attempts seemed to be either too simple or too complicated. This aims to be simple, and even beats one of the "handwritten" benchmark cases. Past issues (e.g. JuliaLang#25396) have proposed that using Regex may be faster, but in my tests, this handily bests even simplified regexes. There can be slow Regexes patterns that can cause this to exhibit O(n^2) behavior, but only if the one of the earlier patterns is a partial match for a later pattern Regex and that Regex always matches O(n) of the input stream. This is a case that is hopefully usually avoidable in practice. fixes JuliaLang#35327 fixes JuliaLang#39061 fixes JuliaLang#35414 fixes JuliaLang#29849 fixes JuliaLang#30457 fixes JuliaLang#25396
This has been attempted before, sometimes fairly similar to this, but the attempts seemed to be either too simple or much too complicated. This aims to be simple, and still seems to beat most of the "handwritten" benchmark cases.
The core of this works by partially unrolling the pattern tests using map on tuple. This seems to outperform nearly all of the performance benchmarks I could find (particularly those of @stevengj that he'd written of past attempts at this), including most of the hand-written ones and especially any that involved using a regex (even a fairly optimally-designed regex). There was one that this was a factor-of-two away, but it is mostly due to the unreliable
findnext
API design and apparently optimizer deficiencies with unravelling the resulting morass of Unions (it leaves behind some big phi nodes tagged with pi nodes that should have killed them), and not for any particular reason that this should have been discernibly slower.