implement replace on String for multiple patterns #40484

vtjnash · 2021-04-14T20:21:27Z

This has been attempted before, sometimes fairly similar to this, but the attempts seemed to be either too simple or much too complicated. This aims to be simple, and still seems to beat most of the "handwritten" benchmark cases.

The core of this works by partially unrolling the pattern tests using map on tuple. This seems to outperform nearly all of the performance benchmarks I could find (particularly those of @stevengj that he'd written of past attempts at this), including most of the hand-written ones and especially any that involved using a regex (even a fairly optimally-designed regex). There was one that this was a factor-of-two away, but it is mostly due to the unreliable findnext API design and apparently optimizer deficiencies with unravelling the resulting morass of Unions (it leaves behind some big phi nodes tagged with pi nodes that should have killed them), and not for any particular reason that this should have been discernibly slower.

The multi-replace method is being added in JuliaLang/julia#40484

base/strings/util.jl

stevengj · 2021-04-15T12:40:58Z

Thanks, this is great!

I just tried another benchmark and confirmed that this is faster than a regex approach:

s = "elephant giraffe antelope lion cheetah, "^1000;
const reg = r"elephant|giraffe|antelope|lion|cheetah"
const rep = Dict("elephant"=>"grey", "giraffe"=>"yellow", "antelope"=>"brown", "lion"=>"gold", "cheetah"=>"yellow")

@btime replace($s, $(reg=>(s->rep[s])));
@btime replace($s, "elephant"=>"grey", "giraffe"=>"yellow", "antelope"=>"brown", "lion"=>"gold", "cheetah"=>"yellow");

gives 696µs (regex) and 440µs (PR).

However, for single-character replacements:

using Random
sr = randstring(1000);
const upper = Dict(c=>uppercase(c) for c in 'a':'z');

@btime uppercase($sr);
@btime map($(c -> 'a' ≤ c ≤ 'z' ? c-32 : c), $sr);
@btime replace($sr, $(r"[a-z]"=>uppercase));
@btime replace($sr, $(Set('a':'z')=>uppercase));
@btime replace($sr, $(r"[a-z]"=>(c->upper[c[1]])));
@btime replace($sr, $(r"a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z"=>(c->upper[c[1]])));
@btime replace($sr, $(Set('a':'z')=>(c->upper[c])));

@btime replace($sr, 'a' => 'A', 'b' => 'B', 'c' => 'C', 'd' => 'D', 'e' => 'E', 'f' => 'F', 'g' => 'G', 'h' => 'H', 'i' => 'I', 'j' => 'J', 'k' => 'K', 'l' => 'L', 'm' => 'M', 'n' => 'N', 'o' => 'O', 'p' => 'P', 'q' => 'Q', 'r' => 'R', 's' => 'S', 't' => 'T', 'u' => 'U', 'v' => 'V', 'w' => 'W', 'x' => 'X', 'y' => 'Y', 'z' => 'Z');

gives 6.2µs, 5.7µs, 71µs, 39µs, 51µs, 60µs, 44µs, and 1.6ms, respectively.

This is a bit more concerning — in particular, for the multi-arg replace I'm getting an excessive number of allocations for the single-char benchmark:

  1.566 ms (20826 allocations: 987.97 KiB)

Is it because the argument tuple is too long?

vtjnash · 2021-04-15T19:42:10Z

Yeah, at some point tuple operations switch automatically to using Arrays, as that's a policy decision made elsewhere. Though, as you showed, generally map may be more appropriate for such a situations anyways, as it is a similar penalty for map -> set (10x) as it is for set -> many arguments (30x).

stevengj · 2021-04-15T19:49:57Z

Yeah, at some point tuple operations switch automatically to using Arrays, as that's a policy decision made elsewhere.

It would be nice if the implementation could switch to map! in that case so that it can update the array in-place rather than reallocating it each time…

I can easily see people splatting a large number of replacements here.

stevengj · 2021-04-16T01:24:12Z

Of course, if you have a really large number of replacements then you would probably want to use a different data structure for rs, e.g. a heap, and in that case you would probably want to pass the replacement pairs as an iterable collection rather than splatting.

Seelengrab · 2021-04-21T07:56:41Z

base/strings/util.jl

@@ -572,6 +590,13 @@ If `pat` is a regular expression and `r` is a [`SubstitutionString`](@ref), then
 references in `r` are replaced with the corresponding matched text.
 To remove instances of `pat` from `string`, set `r` to the empty `String` (`""`).

+Multiple patterns can be specified, and they will be applied left-to-right
+simultaneously, so only one pattern will be applied to any character, and the
+patterns will only be applied to the input text, not the replacements.


I think this section is misleading - the text seems to sugest only patterns of the form 'x' => "replacement" or 'x' => 'a' are allowed, but the tests further down seem to suggest "original" => "replacement" and "original" => 'a' should also work.

Additionally, what should the result of replace("atest", "atest" => 'a', "at" => 'b') be? Simply "a", since the patterns are applied left to right, even though "simultaneously" applying them leads to inconsistency (and imo both are reasonable choices)?

The patterns are "not applied to the replacements". Each character is only matched by at most one pattern, though a pattern may match multiple characters (or even none).

Seelengrab · 2021-04-21T08:05:22Z

I like the performance, but I'm not convinced the behaviour is best in all cases. I've said it before, but I don't think replace(::String, ::String => x, ::String => y, ...) should exist. Just looking through discourse and reading posts from people looking for this method, you get a bunch of conflicting desired behaviours. Because of that, I think almost all interpretations of how this API should behave are reasonable and making a definitive choice here feels weird to me (and would probably lead to confusion from people expecting it to behave a certain way...). How do other languages handle this and how have they solved those problems?

Consistency with replace(::AbstractArray, pat => rep, pat => rep) doesn't feel like a good argument, since in those cases replacement only ever concerns one index, whereas replace(::String, ::String => x, ::String => y) covers potentially multiple indices, leading to the problems described in the linked post.

vtjnash · 2021-05-27T20:36:38Z

triage thinks this is good

This has been attempted before, sometimes fairly similar to this, but the attempts seemed to be either too simple or too complicated. This aims to be simple, and even beats one of the "handwritten" benchmark cases. Past issues (e.g. #25396) have proposed that using Regex may be faster, but in my tests, this handily bests even simplified regexes. There can be slow Regexes patterns that can cause this to exhibit O(n^2) behavior, but only if the one of the earlier patterns is a partial match for a later pattern Regex and that Regex always matches O(n) of the input stream. This is a case that is hopefully usually avoidable in practice. fixes #35327 fixes #39061 fixes #35414 fixes #29849 fixes #30457 fixes #25396

fixnext API is awkward

stevengj · 2021-06-07T18:17:11Z

How do other languages handle this and how have they solved those problems?

Python doesn't seem to have a function for this. The top link discussing the issue seems to be here on stackoverflow, and the top suggestions are either to make multiple replace passes or to do a single pass with a regex. All note that that the problem is somewhat ambiguous for overlapping matches.

Ruby also lacks such a feature, and the main suggestion seems to be the regex solution (also here or here).

Similarly in Perl, people seem to suggest the regex solution (e.g. here or here). In PostgreSQL, I found a suggestion that people use Perl via an extension and employ the regex solution.

In Rust, I again see people proposing the regex solution. In Swift, I've also seen the regex solution proposed.

In short, questions of how to do multiple replacements in a "single pass" seem to pop up in many languages, but I've yet to find a language that provides a "built-in" solution. In every other language, the "efficient" way to do this is commonly believed to be a regex combined with a match-mapping function. Most people seem to be happy with the semantics of the regex solution, and are generally unconcerned with the question of overlapping matches.

JeffBezanson · 2021-06-07T19:26:55Z

Good references. So do you think we should hold off on merging this?

vtjnash · 2021-06-07T21:14:22Z

This currently implements a generalization of the behavior of converting all of the arguments to a single regex, if a regex could tell you which part of the regex was matched. So it seems like positive encouragement for that being the generally accepted workaround for other languages. We already knew Julia was better than them :)

JeffBezanson · 2021-06-07T21:17:08Z

Yeah it seems good to me, was just wondering what Steven meant since I expect long posts to be complaints 😛

vtjnash · 2021-06-07T21:31:29Z

And just for amusement, here's a random use that just came to mind:

julia> titlecaseish(s) = replace(s, r"(?<=\s|^)." => uppercase, r"." => lowercase);

julia> titlecaseish("ab c bbb  de   Fajk JEJ's.")
"Ab C Bbb  De   Fajk Jej's."

This has been attempted before, sometimes fairly similar to this, but the attempts seemed to be either too simple or too complicated. This aims to be simple, and even beats one of the "handwritten" benchmark cases. Past issues (e.g. JuliaLang#25396) have proposed that using Regex may be faster, but in my tests, this handily bests even simplified regexes. There can be slow Regexes patterns that can cause this to exhibit O(n^2) behavior, but only if the one of the earlier patterns is a partial match for a later pattern Regex and that Regex always matches O(n) of the input stream. This is a case that is hopefully usually avoidable in practice. fixes JuliaLang#35327 fixes JuliaLang#39061 fixes JuliaLang#35414 fixes JuliaLang#29849 fixes JuliaLang#30457 fixes JuliaLang#25396

vtjnash added the strings "Strings!" label Apr 14, 2021

vtjnash requested a review from stevengj April 14, 2021 20:21

vtjnash added a commit to JuliaCI/BaseBenchmarks.jl that referenced this pull request Apr 14, 2021

use appropriate optimizations in regex_dna

8fb1626

The multi-replace method is being added in JuliaLang/julia#40484

vtjnash added a commit to JuliaCI/BaseBenchmarks.jl that referenced this pull request Apr 14, 2021

use appropriate optimizations in regex_dna

c5b94b3

The multi-replace method is being added in JuliaLang/julia#40484

vtjnash added a commit to JuliaCI/BaseBenchmarks.jl that referenced this pull request Apr 14, 2021

use appropriate optimizations in regex_dna

94dba74

The multi-replace method is being added in JuliaLang/julia#40484

stevengj reviewed Apr 15, 2021

View reviewed changes

base/strings/util.jl Show resolved Hide resolved

Seelengrab suggested changes Apr 21, 2021

View reviewed changes

vtjnash added this to the 1.7 milestone Jun 1, 2021

This was referenced Jun 4, 2021

add method to replace that takes a function RegexMatch -> String #41051

Open

replace: allow transform function as first argument? #24598

Open

vtjnash added 3 commits June 7, 2021 13:15

fixup: bugs in original tests

a309437

fix performance bug

3575dfe

fixnext API is awkward

JeffBezanson force-pushed the jn/multireplacer branch from 6467f2f to 3575dfe Compare June 7, 2021 17:15

JeffBezanson merged commit 70771b2 into master Jun 7, 2021

JeffBezanson deleted the jn/multireplacer branch June 7, 2021 23:02

laborg mentioned this pull request Feb 4, 2022

Inconsistent behaviour for replace #28967

Closed

szcf-weiya mentioned this pull request Jan 4, 2023

MethodError of replace in Julia 1.6.3 szcf-weiya/LaTeXTables.jl#1

Closed

stevengj mentioned this pull request Feb 10, 2023

slowdown in regex replacement #48630

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

implement replace on String for multiple patterns #40484

implement replace on String for multiple patterns #40484

vtjnash commented Apr 14, 2021

stevengj commented Apr 15, 2021 •

edited

Loading

vtjnash commented Apr 15, 2021

stevengj commented Apr 15, 2021 •

edited

Loading

stevengj commented Apr 16, 2021 •

edited

Loading

Seelengrab Apr 21, 2021

vtjnash Apr 21, 2021

Seelengrab commented Apr 21, 2021 •

edited

Loading

vtjnash commented May 27, 2021

stevengj commented Jun 7, 2021 •

edited

Loading

JeffBezanson commented Jun 7, 2021

vtjnash commented Jun 7, 2021

JeffBezanson commented Jun 7, 2021

vtjnash commented Jun 7, 2021

implement replace on String for multiple patterns #40484

implement replace on String for multiple patterns #40484

Conversation

vtjnash commented Apr 14, 2021

stevengj commented Apr 15, 2021 • edited Loading

vtjnash commented Apr 15, 2021

stevengj commented Apr 15, 2021 • edited Loading

stevengj commented Apr 16, 2021 • edited Loading

Seelengrab Apr 21, 2021

Choose a reason for hiding this comment

vtjnash Apr 21, 2021

Choose a reason for hiding this comment

Seelengrab commented Apr 21, 2021 • edited Loading

vtjnash commented May 27, 2021

stevengj commented Jun 7, 2021 • edited Loading

JeffBezanson commented Jun 7, 2021

vtjnash commented Jun 7, 2021

JeffBezanson commented Jun 7, 2021

vtjnash commented Jun 7, 2021

stevengj commented Apr 15, 2021 •

edited

Loading

stevengj commented Apr 15, 2021 •

edited

Loading

stevengj commented Apr 16, 2021 •

edited

Loading

Seelengrab commented Apr 21, 2021 •

edited

Loading

stevengj commented Jun 7, 2021 •

edited

Loading