replace with multiple pattern-repl pairs #25396

KlausC · 2018-01-04T20:38:15Z

KlausC · 2018-01-04T22:38:28Z

A conflicting change, which introduced findnext replacing searchnext caused me some extra work. As for me its all done. Waiting for reviews.

KlausC · 2018-01-05T20:05:44Z

@StefanKarpinski is there a chance to get that reviewed.

StefanKarpinski · 2018-01-05T20:48:21Z

Not by me, I'm afraid. Maybe @rfourquet or @nalimilan?

rfourquet · 2018-01-06T08:11:08Z

Many thanks for your contribution! I'm not a string guy, and this PR looks rather technically involved, so I'm afraid I won't be able to review it any time soon. I have now very limited free time, which I need to dedicate on things to be done for 0.7. But if no-one can review this in time for 0.7, I'm confident that it will get incorporated for 1.1.

nalimilan

Thanks! Looks mostly good, the main point to discuss is the behavior when several pairs match.

nalimilan · 2018-01-07T11:29:39Z

base/strings/util.jl

+        return str
+    end
+    count < 0 && throw(DomainError(count, "`count` must be non-negative."))
+    pat_repls = pairmap.(pat_repls)


Is this type-stable?

Yes, I rely on the broadcasting dot.
Looks so, for me. Input is a tuple of Pair, as output is:

julia> pp(p::Pair{Char}) = equalto(p.first)=>p.second pp (generic function with 3 methods) julia> pp(p::Pair) = p.first=>p.second pp (generic function with 3 methods) julia> pp.( ('a'=>'b', "xyz"=>"") ) (Base.EqualTo{Char}('a') => 'b', "xyz" => "") julia> typeof(pp.( ('a'=>'b', "xyz"=>"") )) Tuple{Pair{Base.EqualTo{Char},Char},Pair{String,String}} julia> pp.( ('a'=>'b', "xyz"=>"") ) isa NTuple{N,Pair} where N true

That should work, but run @code_warntype on the replace method to make sure that's the case.

@nalimilan , you wanted to see the complete output.

julia> using Unicode julia> replace("abcde", 'a'=>'Ä', ['b', 'c']=>uppercase) "ÄBCde" julia> @code_warntype replace("abcde", 'a'=>'Ä', ['b', 'c']=>uppercase) Variables: str::String pat_repls@_3::Tuple{Pair{Char,Char},Pair{Array{Char,1},typeof(uppercase)}} count<optimized out> n::Int64 e<optimized out> a<optimized out> i::Any r::Any pattern::Any repl::Any j::Any k::Any out<optimized out> #temp#@_23::Any px<optimized out> py<optimized out> R<optimized out> #temp#@_28::Bool pat_repls@_31::Tuple{Pair{Char,Char},Pair{Array{Char,1},typeof(uppercase)}} pat_repls@_32::Tuple{Pair{Base.EqualTo{Char},Char},Pair{Base.OccursIn{Array{Char,1}},typeof(uppercase)}} Body: begin pat_repls@_31::Tuple{Pair{Char,Char},Pair{Array{Char,1},typeof(uppercase)}} = pat_repls@_3::Tuple{Pair{Char,Char},Pair{Array{Char,1},typeof(uppercase)}} #= line 429 =# goto 4 4: 5: goto 8 #= line 430 =# 8: #= line 432 =# goto 11 11: #= line 433 =# # meta: location broadcast.jl broadcast 241 # meta: location tuple.jl map 151 # meta: location tuple.jl getindex 21 SSAValue(62) = (Base.getfield)(pat_repls@_31::Tuple{Pair{Char,Char},Pair{Array{Char,1},typeof(uppercase)}}, 1, true)::Pair{Char,Char} # meta: pop location # meta: location strings/util.jl predicatepair 392 # meta: location sysimg.jl getproperty 8 SSAValue(67) = (Base.getfield)(SSAValue(62), :first)::Char # meta: pop location # meta: location operators.jl Type 831 SSAValue(70) = $(Expr(:new, Base.EqualTo{Char}, SSAValue(67))) # meta: pop location # meta: location sysimg.jl getproperty 8 SSAValue(71) = (Base.getfield)(SSAValue(62), :second)::Char # meta: pop location # meta: location pair.jl Type 32 # meta: location pair.jl Type 32 SSAValue(77) = $(Expr(:new, Pair{Base.EqualTo{Char},Char}, SSAValue(70), SSAValue(71))) # meta: pop locations (3) # meta: location tuple.jl getindex 21 SSAValue(78) = (Base.getfield)(pat_repls@_31::Tuple{Pair{Char,Char},Pair{Array{Char,1},typeof(uppercase)}}, 2, true)::Pair{Array{Char,1},typeof(uppercase)} # meta: pop location # meta: location strings/util.jl predicatepair 396 # meta: location sysimg.jl getproperty 8 SSAValue(83) = (Base.getfield)(SSAValue(78), :first)::Array{Char,1} # meta: pop location # meta: location operators.jl Type 850 SSAValue(86) = $(Expr(:new, Base.OccursIn{Array{Char,1}}, SSAValue(83))) # meta: pop location # meta: location pair.jl Type 32 # meta: location pair.jl Type 32 SSAValue(92) = $(Expr(:new, Pair{Base.OccursIn{Array{Char,1}},typeof(uppercase)}, SSAValue(86), Base.Unicode.uppercase)) # meta: pop locations (3) SSAValue(61) = (Core.tuple)(SSAValue(77), SSAValue(92))::Tuple{Pair{Base.EqualTo{Char},Char},Pair{Base.OccursIn{Array{Char,1}},typeof(uppercase)}} # meta: pop locations (2) pat_repls@_32::Tuple{Pair{Base.EqualTo{Char},Char},Pair{Base.OccursIn{Array{Char,1}},typeof(uppercase)}} = SSAValue(61) #= line 434 =# n::Int64 = 1 #= line 435 =# # meta: location strings/basic.jl endof 145 # meta: location strings/string.jl ncodeunits 74 SSAValue(97) = (Core.sizeof)(str::String)::Int64 # meta: pop location SSAValue(95) = $(Expr(:invoke, MethodInstance for thisind(::String, ::Int64), :(Base.thisind), :(str), SSAValue(97)))::Int64 # meta: pop location #= line 436 =# i::Any = 1 #= line 437 =# SSAValue(5) = $(Expr(:invoke, MethodInstance for findnextall(::Tuple{Pair{Base.EqualTo{Char},Char},Pair{Base.OccursIn{Array{Char,1}},typeof(uppercase)}}, ::String, ::Int64), :(Base.findnextall), :(pat_repls@_32), :(str), :(i::Int64)))::Tuple{Any,Pair} # meta: location tuple.jl indexed_next 58 # meta: location tuple.jl getindex 21 SSAValue(101) = (Base.getfield)(SSAValue(5), 1, true)::Any # meta: pop locations (2) r::Any = SSAValue(101) # meta: location tuple.jl indexed_next 58 # meta: location tuple.jl getindex 21 SSAValue(105) = (Base.getfield)(SSAValue(5), 2, true)::Pair # meta: pop locations (2) # meta: location pair.jl indexed_next 42 SSAValue(106) = (Base.getfield)(SSAValue(105), 1)::Any # meta: pop location pattern::Any = SSAValue(106) # meta: location pair.jl indexed_next 42 SSAValue(109) = (Base.getfield)(SSAValue(105), 2)::Any # meta: pop location repl::Any = SSAValue(109) #= line 438 =# SSAValue(11) = (Base.first)(r::Any)::Any SSAValue(12) = (Base.last)(r::Any)::Any j::Any = SSAValue(11) k::Any = SSAValue(12) #= line 439 =# # meta: location strings/basic.jl sizeof 144 # meta: location strings/string.jl ncodeunits 74 SSAValue(117) = (Core.sizeof)(str::String)::Int64 # meta: pop location # meta: location int.jl * 54 SSAValue(118) = (Base.mul_int)(SSAValue(117), 1)::Int64 # meta: pop locations (2) # meta: location promotion.jl * 291 # meta: location promotion.jl promote 261 # meta: location promotion.jl _promote 219 # meta: location number.jl convert 7 # meta: location float.jl Type 57 SSAValue(131) = (Base.sitofp)(Float64, SSAValue(118))::Float64 # meta: pop locations (4) # meta: location float.jl * 397 SSAValue(142) = (Base.mul_float)(1.2, SSAValue(131))::Float64 # meta: pop locations (2) # meta: location float.jl floor 351 # meta: location float.jl floor 362 SSAValue(145) = (Base.floor_llvm)(SSAValue(142))::Float64 # meta: pop location # meta: location float.jl trunc 679 # meta: location float.jl <= 452 SSAValue(150) = (Base.le_float)(-9.223372036854776e18, SSAValue(145))::Bool # meta: pop location unless SSAValue(150) goto 116 # meta: location float.jl < 450 SSAValue(151) = (Base.lt_float)(SSAValue(145), 9.223372036854776e18)::Bool # meta: pop location #temp#@_28::Bool = SSAValue(151) goto 118 116: #temp#@_28::Bool = false 118: unless #temp#@_28::Bool goto 125 #= line 680 =# # meta: location float.jl unsafe_trunc 298 SSAValue(152) = (Base.fptosi)(Int64, SSAValue(145))::Int64 # meta: pop location goto 129 125: #= line 682 =# SSAValue(148) = $(Expr(:invoke, MethodInstance for InexactError(::Symbol, ::Any, ::Any), :(Base.InexactError), :(:trunc), Int64, SSAValue(145)))::InexactError (Base.throw)(SSAValue(148))::Union{} 129: # meta: pop locations (2) # meta: location iobuffer.jl StringVector 30 # meta: location strings/string.jl _string_n 52 # meta: location essentials.jl cconvert 327 # meta: location number.jl convert 7 # meta: location boot.jl Type 681 # meta: location boot.jl toUInt64 651 # meta: location boot.jl check_top_bit 540 # meta: location boot.jl is_top_bit_set 530 SSAValue(170) = (Core.lshr_int)(SSAValue(152), 63)::Int64 SSAValue(171) = (Core.trunc_int)(Core.Int8, SSAValue(170))::Int8 SSAValue(173) = (Core.eq_int)(SSAValue(171), 1)::Bool # meta: pop location unless SSAValue(173) goto 145 $(Expr(:invoke, MethodInstance for throw_inexacterror(::Symbol, ::Type, ::Int64), :(Core.throw_inexacterror), :(:check_top_bit), Int64, SSAValue(152)))::Union{} 145: # meta: pop location SSAValue(164) = (Core.bitcast)(Core.UInt64, SSAValue(152))::UInt64 # meta: pop locations (4) SSAValue(158) = $(Expr(:foreigncall, :(:jl_alloc_string), Ref{String}, svec(UInt64), :(:ccall), 1, SSAValue(164), SSAValue(164))) # meta: pop location # meta: location strings/string.jl unsafe_wrap 63 SSAValue(174) = $(Expr(:foreigncall, :(:jl_string_to_array), Ref{Array{UInt8,1}}, svec(Any), :(:ccall), 1, SSAValue(158))) # meta: pop locations (2) # meta: location iobuffer.jl Type 63 # meta: location iobuffer.jl Type 63 # meta: location iobuffer.jl Type 26 # meta: location iobuffer.jl Type 19 # meta: location array.jl length 138 SSAValue(192) = (Base.arraylen)(SSAValue(174))::Int64 # meta: pop location SSAValue(191) = $(Expr(:new, Base.GenericIOBuffer{Array{UInt8,1}}, SSAValue(174), true, true, true, false, SSAValue(192), 9223372036854775807, 1, -1)) # meta: pop locations (4) #= line 440 =# goto 165 165: 166: # meta: location sysimg.jl setproperty! 9 (Base.setfield!)(SSAValue(191), :size, 0)::Int64 # meta: pop location #= line 441 =# goto 172 172: 173: # meta: location sysimg.jl setproperty! 9 (Base.setfield!)(SSAValue(191), :ptr, 1)::Int64 # meta: pop location #= line 442 =# 178: SSAValue(27) = (j::Any != 0)::Any unless SSAValue(27) goto 264 #= line 443 =# SSAValue(28) = (i::Any == 1)::Any unless SSAValue(28) goto 186 #temp#@_23::Any = SSAValue(28) goto 188 186: #temp#@_23::Any = (i::Any <= k::Any)::Any 188: unless #temp#@_23::Any goto 197 #= line 444 =# SSAValue(30) = (Base.pointer)(str::String, i::Any)::Ptr{UInt8} SSAValue(31) = (j::Any - i::Any)::Any SSAValue(32) = (Base.UInt)(SSAValue(31))::Any (Base.unsafe_write)(SSAValue(191), SSAValue(30), SSAValue(32))::Int64 #= line 445 =# (Base._replace)(SSAValue(191), repl::Any, str::String, r::Any, pattern::Any)::Any 197: #= line 447 =# SSAValue(33) = (k::Any < j::Any)::Any unless SSAValue(33) goto 211 #= line 448 =# i::Any = j::Any #= line 449 =# SSAValue(34) = (j::Any > SSAValue(95))::Any unless SSAValue(34) goto 207 goto 264 207: #= line 450 =# k::Any = (Base.nextind)(str::String, j::Any)::Int64 goto 216 211: #= line 452 =# SSAValue(35) = (Base.nextind)(str::String, k::Any)::Int64 k::Any = SSAValue(35) i::Any = SSAValue(35) 216: #= line 454 =# SSAValue(36) = $(Expr(:invoke, MethodInstance for findnextall(::Tuple{Pair{Base.EqualTo{Char},Char},Pair{Base.OccursIn{Array{Char,1}},typeof(uppercase)}}, ::String, ::Int64), :(Base.findnextall), :(pat_repls@_32), :(str), :(k::Int64)))::Tuple{Any,Pair} # meta: location tuple.jl indexed_next 58 # meta: location tuple.jl getindex 21 SSAValue(204) = (Base.getfield)(SSAValue(36), 1, true)::Any # meta: pop locations (2) r::Any = SSAValue(204) # meta: location tuple.jl indexed_next 58 # meta: location tuple.jl getindex 21 SSAValue(208) = (Base.getfield)(SSAValue(36), 2, true)::Pair # meta: pop locations (2) # meta: location pair.jl indexed_next 42 SSAValue(209) = (Base.getfield)(SSAValue(208), 1)::Any # meta: pop location pattern::Any = SSAValue(209) # meta: location pair.jl indexed_next 42 SSAValue(212) = (Base.getfield)(SSAValue(208), 2)::Any # meta: pop location repl::Any = SSAValue(212) #= line 455 =# SSAValue(42) = r::Any # meta: location range.jl colon 5 # meta: location range.jl Type 153 SSAValue(221) = $(Expr(:new, UnitRange{Int64}, 0, -1)) # meta: pop locations (2) SSAValue(44) = (SSAValue(42) == SSAValue(221))::Any unless SSAValue(44) goto 245 goto 251 245: # meta: location promotion.jl == 386 SSAValue(222) = (n::Int64 === 9223372036854775807)::Bool # meta: pop location unless SSAValue(222) goto 251 goto 264 251: #= line 456 =# SSAValue(46) = (Base.first)(r::Any)::Any SSAValue(47) = (Base.last)(r::Any)::Any j::Any = SSAValue(46) k::Any = SSAValue(47) #= line 457 =# # meta: location int.jl + 53 SSAValue(223) = (Base.add_int)(n::Int64, 1)::Int64 # meta: pop location n::Int64 = SSAValue(223) 262: goto 178 264: #= line 459 =# SSAValue(49) = (Base.SubString)(str::String, i::Any)::SubString{String} # meta: location strings/io.jl write 139 # meta: location sysimg.jl getproperty 8 SSAValue(233) = (Base.getfield)(SSAValue(49), :ncodeunits)::Int64 # meta: pop location # meta: location int.jl <= 405 SSAValue(234) = (Base.sle_int)(SSAValue(233), 0)::Bool # meta: pop location unless SSAValue(234) goto 276 goto 342 276: # meta: location sysimg.jl getproperty 8 SSAValue(235) = (Base.getfield)(SSAValue(49), :string)::String # meta: pop location # meta: location sysimg.jl getproperty 8 SSAValue(236) = (Base.getfield)(SSAValue(49), :offset)::Int64 # meta: pop location # meta: location int.jl + 53 SSAValue(237) = (Base.add_int)(SSAValue(236), 1)::Int64 # meta: pop location # meta: location strings/string.jl pointer 72 # meta: location strings/string.jl pointer 71 # meta: location pointer.jl unsafe_convert 59 # meta: location pointer.jl pointer_from_objref 142 goto 291 291: #= line 143 =# SSAValue(250) = $(Expr(:foreigncall, :(:jl_value_ptr), Ptr{Nothing}, svec(Any), :(:ccall), 1, SSAValue(235))) # meta: pop location # meta: location pointer.jl + 154 # meta: location boot.jl Type 691 SSAValue(257) = (Core.bitcast)(Core.UInt, SSAValue(250))::UInt64 # meta: pop location SSAValue(255) = (add_ptr)(SSAValue(257), 0x0000000000000008)::UInt64 # meta: location essentials.jl oftype 301 # meta: location pointer.jl convert 26 # meta: location boot.jl Type 696 SSAValue(263) = (Core.bitcast)(Ptr{Nothing}, SSAValue(255))::Ptr{Nothing} # meta: pop locations (4) # meta: location pointer.jl convert 30 SSAValue(265) = (Base.bitcast)(Ptr{UInt8}, SSAValue(263)) # meta: pop locations (3) # meta: location int.jl - 52 SSAValue(266) = (Base.sub_int)(SSAValue(237), 1)::Int64 # meta: pop location # meta: location pointer.jl + 154 # meta: location boot.jl Type 691 SSAValue(273) = (Core.bitcast)(Core.UInt, SSAValue(265))::UInt64 # meta: pop location # meta: location int.jl rem 445 SSAValue(274) = (Base.bitcast)(UInt64, SSAValue(266)) # meta: pop location SSAValue(271) = (add_ptr)(SSAValue(273), SSAValue(274))::UInt64 # meta: location essentials.jl oftype 301 # meta: location pointer.jl convert 26 # meta: location boot.jl Type 696 SSAValue(280) = (Core.bitcast)(Ptr{UInt8}, SSAValue(271))::Ptr{UInt8} # meta: pop locations (5) # meta: location sysimg.jl getproperty 8 SSAValue(281) = (Base.getfield)(SSAValue(49), :ncodeunits)::Int64 # meta: pop location # meta: location boot.jl Type 681 # meta: location boot.jl toUInt64 651 # meta: location boot.jl check_top_bit 540 # meta: location boot.jl is_top_bit_set 530 SSAValue(291) = (Core.lshr_int)(SSAValue(281), 63)::Int64 SSAValue(292) = (Core.trunc_int)(Core.Int8, SSAValue(291))::Int8 SSAValue(294) = (Core.eq_int)(SSAValue(292), 1)::Bool # meta: pop location unless SSAValue(294) goto 337 $(Expr(:invoke, MethodInstance for throw_inexacterror(::Symbol, ::Type, ::Int64), :(Core.throw_inexacterror), :(:check_top_bit), Int64, SSAValue(281)))::Union{} 337: # meta: pop location SSAValue(285) = (Core.bitcast)(Core.UInt64, SSAValue(281))::UInt64 # meta: pop locations (2) $(Expr(:invoke, MethodInstance for unsafe_write(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Ptr{UInt8}, ::UInt64), :(Base.unsafe_write), SSAValue(191), SSAValue(280), SSAValue(285)))::Int64 342: # meta: pop location #= line 460 =# SSAValue(50) = $(Expr(:invoke, MethodInstance for take!(::Base.GenericIOBuffer{Array{UInt8,1}}), :(Base.take!), SSAValue(191)))::Array{UInt8,1} # meta: location strings/string.jl Type 30 SSAValue(295) = $(Expr(:foreigncall, :(:jl_array_to_string), Ref{String}, svec(Any), :(:ccall), 1, SSAValue(50))) # meta: pop location return SSAValue(295) end::String

Ah, damn keyword arguments, they hide the body of the function. I guess the simplest approach is to remove the count argument for the check.

I updated the output of @code_warntype after removing the keyword argument.

There are test cases missing, which use this argument :_)

OK, looks good.

Wait, the output looks incomplete. Some variables do not appear to be inferred as concrete types, but they don't appear in the body.

nalimilan · 2018-01-07T11:30:59Z

base/strings/util.jl

+    i = a = start(str)
+    r, (pattern, repl) = findnextall(pat_repls, str, i)
+    j, k = first(r), last(r)
+    out = IOBuffer(StringVector(floor(Int, 1.2sizeof(str))), true, true)


Maybe add a comment to explain the choice of this constant? Why not start with the size of the input string?

Ok, this code is inherited. I can only speculate. Maybe the author made extensive probability studies?

Ah, OK. Let's leave it as-is then, unless somebody can explain this choice.

nalimilan · 2018-01-07T11:31:58Z

base/strings/util.jl

@@ -427,16 +497,19 @@ function replace(str::String, pat_repl::Pair; count::Integer=typemax(Int))
 end

 """
-    replace(s::AbstractString, pat=>r; [count::Integer])
+    replace(s::AbstractString, pat=>r[, pat=>r ...]; [count::Integer])


Just add ... after the first pat => r, no need to repeat it.

nalimilan · 2018-01-07T11:35:54Z

base/strings/util.jl

+# find the first of a list of patterns.
+# if several match at the same start position prefer the longer one
+# if also the length is equal, prefer the first in list
+# return value of the successful search and the pair


"value" -> "index range"

Also, better not add a blank line after the comment so that it's obvious it's attached to the function. Same below.

nalimilan · 2018-01-07T11:37:23Z

base/strings/util.jl

+# return value of the successful search and the pair
+
+function findnextall(pairs::NTuple{N,Pair}, s::AbstractString, st::Integer=1) where N
+    ps = start(pairs)


For tuples start is always 1. You don't need to call next either, you can just do for i in 1:N.

The first I will include the as an optimization.
I want to continue with while/done/next because it looks more elegant and performant to me than

N = length(pairs) for i = 1:N p = pairs[i] .... if fr > 0 ... for j = i+1:N p = pairs[j]

More performant: definitely not. More elegant: can always be discussed, but it's not obvious. Note that N is already defined as a type parameter.

ok, you convinced me. I copied my pattern from somewhere in base or stdlib (don't remember where exactly), and thought performance was the driver.
Thanks for the hint regarding N.
Is it better to introduce a second variable j for the inner loop or can I re-use the i ?

I'd say it's clearer to use j.

Using next/done is needed to support generic iterators, so that might be the reason why it was used.

nalimilan · 2018-01-07T11:40:59Z

base/strings/util.jl

+pairmap(p::Pair) = p
+
+# find the first of a list of patterns.
+# if several match at the same start position prefer the longer one


Wouldn't it be better to just retain the first matching pair? At least that sounds simpler to explain. Or is that to match the behavior of other implementations?

I think, this is special for strings. In case of ties, I do not simply take the first match, but give precedence to the longer match, because it is more specific. Maybe the example demonstrates its usefulness:

replace("one house, two houses", "house"=>"Haus", "houses"=>"Häuser", "one"=>"ein", "two"=>"zwei")

But can't the user just choose the appropriate order for the pairs? That won't work for regular expressions (since the length of the match may not be known in advance), but people would better use a single complex regular expression instead I think.

Another potential issue with not simply following the order in which pairs are provided is performance. One of the motivations for allowing for multiple pairs was to be more efficient than calling replace multiple times with different pairs. Doesn't having to match all pairs and compare them make the function slower? Or is that cost only paid when patterns overlap?

The cost may increase only, if shorter patterns, which match are followed by longer patterns, which match at the same position. In other cases, I can't know in advance, it a later pattern does not match at an earlier position.
BTW I would prefer to use a modified version of findnext, which has an extra parameter does_not_start_after, which would allow not to search the full string for all patterns, but stop earlier for the later patterns, if there was a match before.

The docu text for replace has been updated. That should make the rule clear to the user of the new feature:

510 In the case of multiple patterns, the precedence is by low start position of matched pattern, 511 then longer pattern, then order of the input list.

What do you think about the additional optional argument for findnext(,::AbstractString,...) ?
I think the does_not_start_after could reduce search-times in a few cases. It would have another effect, than simply replacing the input string by a shortened substring.
The contract of this modified findnext(pattern, input, start, not_after would sound like:
"Give me the position of the next pattern match after start, but don't try input beyond not_after "

nalimilan · 2018-01-07T11:42:39Z

base/strings/util.jl

-        count::Integer=typemax(Int)) =
-    replace(str, occursin(first(pat_repl)) => last(pat_repl), count)
+# replace Char with function that compares with this Char
+pairmap(p::Pair{Char}) = equalto(p.first) => p.second


pairmap sounds too general. Maybe predicatepair or any other term which doesn't risk being used elsewhere?

changed ot predicatepair.

stevengj · 2018-01-07T17:08:12Z

base/strings/util.jl

+            minstart, minend, minp = fr, last(r), p
+            while !done(pairs, ps)
+                p, ps = next(pairs, ps)
+                r = findnext(p.first, s, st)


I have to say that I'm skeptical of the whole approach in this PR. First converting the user inputs to a tuple of Pair{Function} and then calling findnext on the function seems almost guaranteed to be inefficient, in several ways.

For one thing, storing a bunch of functions in a data structure and then calling findnext on them probably makes it impossible for the compiler to specialize findnext by inlining the functions.

For another thing, this means that the search procedure is completely re-done for each pair, when I think that a lot of computations could probably be shared.

I think, I should disagree :-).
Most of your objections came also to my mind, but I could handle it.

the code was copied from the original implementation of replace, which has not been removed, so there is no worse performance than before for the case with one pair.

the use of findnext had been inserted to the code base shortly before this PR. It replaced searchnext, which could handle Char and collections of Char as patterns, which findmax does not do. So for those cases, the equalto and occursin need used. I think, the authors of findmax proved, that performance is not worse than for searchnext. I could measure a relevant improvement.

Replacing Char by equalsto(::Char) is only done once for each call to replace, while in the old approach with searchnext the specialization had to be done for each call of searchnext, which has to be done up to length(s) times for each call to replace.

my function findnextall is a slight generalization of findmax, which is thought to make simultaneous pattern searches. Given the complexity of findnext it was thought better to re-use that than to re-implement the multiple find next from scratch. The rest of the code of replace was re-used, replacing findnextwith findnextall.

I see an improvement possibility, by extending the findnext function by a new optional argument, which can prevent to search the full input string, when already a bound for the starting point of the match is known. That could decrease search times by up to 50%.

I do believe, the effort used by this implementation is appropriate for the complexity of the problem, which is solved. I am waiting for a proof of the contrary (ask Prof. Knuth).

stevengj · 2018-01-07T17:30:06Z

For the case where the replacement patterns are a bunch of strings or sets of characters, can't we just create a single regular expression and a dictionary of replacements and then call the regex replace function? For example, for the special case of replacing a bunch of strings with other strings,

function myreplace(s::AbstractString, p::Pair{<:AbstractString,<:AbstractString}...)
    d = Dict(p...)
    buf = IOBuffer()
    frst = true
    for pair in p
        frst || print(buf, '|')
        frst = false
        escape_string(buf, first(pair), "[\\^\$.|?*+()")
    end
    r = Regex(String(take!(buf)))
    replace(s, r => (s -> d[s]))
end

This allows us to exploit the efficiency of regex search for multiple patterns. This code could easily be extended to support characters and sets of characters as well as replacement functions.

Extending it to sets of regexes would be trickier (e.g. you can't just do a dictionary lookup to get the replacement), but honestly I'm not sure that case is worth implementing. If the user is constructing a bunch of regexes, it seems reasonable to ask them to combine it into a single regex themselves. Still, it's not too bad if you are willing to do a linear search on the replacements instead of (or in addition to) a dictionary lookup. I think it should be sufficient to put parens around each regex (when they are concatenated with |) to keep different regexes from interfering. (Correction: parens aren't sufficient for regexes with backreferences to captured expressions, unless we also "hygenicize" the regexes by renaming their captures with unique names; ugh.)

stevengj · 2018-01-07T17:40:10Z

Copying my comment from above, since it is hidden (attached to an outdated commit):

I have to say that I'm skeptical of the whole approach in this PR. First converting the user inputs to a tuple of Pair{Function} and then calling findnext on each function seems almost guaranteed to be inefficient, in several ways. For one thing, storing a bunch of functions in a data structure and then calling findnext on them probably makes it impossible for the compiler to specialize findnext by inlining the functions. For another thing, this means that the search procedure is completely re-done for each pair, when I think that a lot of computations could probably be shared.

KlausC · 2018-01-07T18:05:50Z

@stevengj , I am just exhausted from defending my own approach - please read that.
For your myreplace I consider that a special solution for in the case all patterns are strings, probably extensible to Char, collections of Char, and maybe regexes. I must admit, I did not think about the use of regular expressions for the general case.
I think I will implement along your approach and see what is the performance impact.

nalimilan · 2018-01-07T18:14:00Z

I guess only benchmarking can tell, but it's not obvious to me that creating a regexp will always be more efficient, as they are significantly more complex than simple replacements of chars or strings. Maybe we could use it as a fallback for complex cases, but use simpler method for special cases.

For one thing, storing a bunch of functions in a data structure and then calling findnext on them probably makes it impossible for the compiler to specialize findnext by inlining the functions.

Except if the data structure is a tuple. In some cases at least, the compiler is able to unroll a loop over a tuple and the functions could be inlined. Of course this has to be checked.

For another thing, this means that the search procedure is completely re-done for each pair, when I think that a lot of computations could probably be shared.

For the simple case of matching a Char, the operation can be made very efficient by checking for equality for all pairs at the same time (just like the replace method on arrays does). For other cases it's harder, but maybe doable. We can always fall back to using regexes, and add optimized methods later.

KlausC · 2018-01-07T18:37:45Z

For this case I do not see a replacement using Regex.

julia> replace("aRcD", very_complex_selection_algorithm=>"BOOM", isupper=>"")
"aBOOMc"

stevengj · 2018-01-07T19:07:06Z

@KlausC, fair enough, your approach is more general than trying to construct a single regex, which obviously cannot handle arbitrary user-defined character predicates. But I'm still concerned that it is potentially very inefficient. For example, it looks like:

n = 10^6
s = "a"^n * "b"
replace(s, "a"=>"A", "b"=>"c")

will have Θ(n²) complexity in your current approach. The reason is that your implementation, which calls findnextall repeatedly after each match is found (which will happen on every character), will search through the entire remaining string to find the "b" match after every "a" character.

To mitigate this, you would need to implement a more complicated logic: if one of the predicate searches finds a match beyond the end of the current match (or finds no match at all), then this search result can in principle be re-used ~~(with an index offset given by the difference in sizeof between the replacement string and the current match)~~.

stevengj · 2018-01-07T19:14:29Z

Note that (in a quick test in Julia 0.6) the performance of my myreplace version is considerably improved by adding a _replace(io, repl::Dict, str, r, pattern) = print(io, repl[SubString(str, first(r), last(r))]) method, which allows you to directly pass a dictionary d instead of s -> d[s] for the replacement string. I'm not sure why, though; maybe it makes no difference in Julia 0.7 if there is more aggressive inlining? (It could possibly be further improved if there were a specialized method to look up a substring key in a string-based dictionary, without constructing the SubString object first, but since improvements to stack-allocation of temporary immutable objects are in the works, maybe the latter improvement isn't worth it.)

KlausC · 2018-01-07T20:26:08Z

I see, that your approach is better by factor 128 in time, 9 in memory allocations, 7 in memory size.

julia> n = 10^6
1000000
julia> s = "a"^n * "b";
.....

julia> @time replace(s, "a"=>"A", "b"=>"B");
 19.315202 seconds (9.00 M allocations: 214.755 MiB, 0.17% gc time)
julia> function myreplace(s::AbstractString, p::Pair{<:AbstractString,<:AbstractString}...)
           d = Dict(p...)
           buf = IOBuffer()
           frst = true
           for pair in p
               frst || print(buf, '|')
               frst = false
               escape_string(buf, first(pair), "[\\^\$.|?*+()")
           end
           r = Regex(String(take!(buf)))
           replace(s, r => (s -> d[s]))
       end
myreplace (generic function with 1 method)
julia> @time myreplace(s, "a"=>"A", "b"=>"B");
  0.253329 seconds (1.08 M allocations: 36.384 MiB, 5.24% gc time)
julia> @time myreplace(s, "a"=>"A", "b"=>"B");
  0.154606 seconds (1.00 M allocations: 31.664 MiB, 3.65% gc time)

But, I did not use the improved version of findnext, which will also avoid, that the "b" is searched for in n²/2 cases in vain. That number would reduce to n and make the difference. I will try that as well.

KlausC · 2018-01-07T21:25:27Z

@stevengj But this modification will not safe me. Just changing "a" and "b" at one place would give same O(n²).
I have an O(n) solution in my mind, but not this evening any more.

stevengj · 2018-01-07T21:33:43Z

@KlausC, I'm not sure why changing "a" to "b" at one place in the string would make my suggested mitigation strategy O(n²). My suggestion would only re-do a "b" search once the replace loop passes the location of the previous "b" search in the string, and similarly for "a", which I think is O(n) regardless of the pattern of "a" and "b" in the string.

stevengj · 2018-01-08T01:39:03Z

For example, isn't this modification of your code O(n)?

import Base: _replace, replace, StringVector

# replace Char with function that compares with this Char
predicatepair(p::Pair{Char}) = equalto(p.first) => p.second

# replace collection of Char with function that checks occurrence in this collection
function predicatepair(p::Pair{<:Union{Tuple{Vararg{Char}},AbstractVector{Char},Set{Char}}})
    occursin(p.first) => p.second
end
predicatepair(p::Pair) = p

# find the first of a list of patterns.
# if several match at the same start position prefer the longer one
# if also the length is equal, prefer the first in list
# return index-range of the successful search and the pair
function findnextall!(pairs::NTuple{N,Pair}, s::AbstractString, st::Integer, rangecache) where N
    minstart=minend=endof(s)+1
    minp=""=>""
    for n = 1:N
        p = pairs[n]
        fr = first(rangecache[n])
        if fr ≥ st
            r = rangecache[n] # re-use previous search result
            last(r) == 0 && continue # not found
        else
            r = rangecache[n] = findnext(p.first, s, st)
            fr = first(r)
        end
        if fr > 0
            e = last(r)
            if fr < minstart || (fr == minstart && e > minend)
                minstart, minend, minp = fr, e, p
            end
        else # not found
            rangecache[n] = endof(s)+1:0
        end
    end
    return minstart:minend, minp
end

# cover the multiple pairs case
function replace(str::String, pat_repls::Pair...; count::Integer=typemax(Int))
    if count == 0 || length(pat_repls) == 0
        return str
    end
    count < 0 && throw(DomainError(count, "`count` must be non-negative."))
    pat_repls = predicatepair.(pat_repls)
    n = 1
    e = endof(str)
    i = a = start(str)
    rangecache = [0:0 for i=1:length(pat_repls)]
    r, (pattern, repl) = findnextall!(pat_repls, str, i, rangecache)
    j, k = first(r), last(r)
    out = IOBuffer(StringVector(floor(Int, 1.2sizeof(str))), true, true)
    out.size = 0
    out.ptr = 1
    while j ≤ endof(str)
        if i == a || i <= k
            unsafe_write(out, pointer(str, i), UInt(j-i))
            _replace(out, repl, str, r, pattern)
        end
        if k < j
            i = j
            j > e && break
            k = nextind(str, j)
        else
            i = k = nextind(str, k)
        end
        r, (pattern, repl) = findnextall!(pat_repls, str, k, rangecache)
        r == 0:-1 || n == count && break
        j, k = first(r), last(r)
        n += 1
    end
    write(out, SubString(str, i))
    String(take!(out))
end

It gets the string benchmark above down to 0.92 seconds (still worse than regex, though).

stevengj · 2018-01-08T01:43:32Z

Another benchmark (using the modified O(N) version of your replace function above):

julia> s = "alpha beta gamma"^100000;

julia> @time replace(s, "alpha"=>"α", "beta"=>"β", "gamma"=>"γ");
  0.250454 seconds (3.90 M allocations: 116.269 MiB, 4.51% gc time)

julia> @time myreplace(s, "alpha"=>"α", "beta"=>"β", "gamma"=>"γ");
  0.054157 seconds (300.03 k allocations: 10.988 MiB, 4.05% gc time)

The huge number of allocations in replace doesn't help — one contributor to this is the fact that findnextall (and my findnextall!) is type-unstable.

(The allocations for myreplace are nearly all creations of SubString objects, which will hopefully be eliminated [converted to stack allocations] with future compiler improvements as mentioned above.)

stevengj · 2018-01-08T02:07:22Z

This version eliminates a couple of type instabilities, cutting down the number of allocations by 25% or so in the above benchmarks, but it is only slightly faster:

import Base: _replace, replace, StringVector

# replace Char with function that compares with this Char
predicatepair(p::Pair{Char}) = equalto(p.first) => p.second

# replace collection of Char with function that checks occurrence in this collection
function predicatepair(p::Pair{<:Union{Tuple{Vararg{Char}},AbstractVector{Char},Set{Char}}})
    occursin(p.first) => p.second
end
predicatepair(p::Pair) = p

# find the first of a list of patterns.
# if several match at the same start position prefer the longer one
# if also the length is equal, prefer the first in list
# return index-range of the successful search and the pair
function findnextall!(pairs::NTuple{N,Pair}, s::AbstractString, st::Integer, rangecache) where N
    minstart=minend=endof(s)+1
    nmin=0
    for n = 1:N
        p = pairs[n]
        fr = first(rangecache[n])
        if fr ≥ st
            r = rangecache[n] # re-use previous search result
            last(r) == 0 && continue # not found
        else
            r = rangecache[n] = findnext(p.first, s, st)
            fr = first(r)
        end
        if fr > 0
            e = last(r)
            if fr < minstart || (fr == minstart && e > minend)
                minstart, minend, nmin = fr, e, n
            end
        else # not found
            rangecache[n] = endof(s)+1:0
        end
    end
    return minstart:minend, nmin
end

# cover the multiple pairs case
function replace(str::String, pat_repls_::Pair...; count::Integer=typemax(Int))
    if count == 0 || isempty(pat_repls_)
        return str
    end
    count < 0 && throw(DomainError(count, "`count` must be non-negative."))
    pat_repls = predicatepair.(pat_repls_)
    n = 1
    e = endof(str)
    i = a = start(str)
    rangecache = [0:0 for i=1:length(pat_repls)]
    r, npat = findnextall!(pat_repls, str, i, rangecache)
    j, k = first(r), last(r)
    out = IOBuffer(StringVector(floor(Int, 1.2sizeof(str))), true, true)
    out.size = 0
    out.ptr = 1
    while npat > 0 && n ≤ count
        if i == a || i <= k
            unsafe_write(out, pointer(str, i), UInt(j-i))
            pattern, repl = pat_repls[npat]
            _replace(out, repl, str, r, pattern)
        end
        if k < j
            i = j
            j > e && break
            k = nextind(str, j)
        else
            i = k = nextind(str, k)
        end
        r, npat = findnextall!(pat_repls, str, k, rangecache)
        j, k = first(r), last(r)
        n += 1
    end
    write(out, SubString(str, i))
    String(take!(out))
end

As I said in my first comment above, I think it will be quite tricky to make this kind of "list of functions" approach fast, and will be especially hard to beat regex search (except for special cases like Char-only replacement). Yes, @nalimilan is correct that, in principle, iteration over heterogeneous tuples can sometimes be done completely at compile-time and hence be type-specialized, but that usually requires much trickier code than is being employed here.

And yes, the regex approach doesn't support user-defined character predicates, but we have map for that… I'm not sure it's necessary to support such functions here. Of course, one option would be to keep the slower general version and add specialized versions (e.g. regex-based) for particular cases, but that has the drawback of much greater code complexity.

StefanKarpinski · 2018-01-08T04:25:03Z

I would just like to interject that the fact that this API affords various clever specializations that can be so much faster than naive repeated replacement indicates (to me at least), that this is a good API which allows the user to easily express their intention in a simple way which the implementation can nevertheless make fast in various clever ways.

stevengj · 2018-01-08T13:22:14Z

@StefanKarpinski, I was actually coming to the opposite conclusion. Take the "alpha"=>"α", "beta"=>"β", "gamma"=>"γ" replacement, for example. If you have to apply this replacement to many small strings, you should definitely not use replace(s, "alpha"=>"α", "beta"=>"β", "gamma"=>"γ"). Instead, you should do

d = Dict("alpha"=>"α", "beta"=>"β", "gamma"=>"γ")
r = r"alpha|beta|gamma"  # or Regex(join(keys(d), '|'))
replace(s, r => (s -> d[s]))

or we could even support replace(s, r=>d) as I mentioned above. This is not that much longer than a multi-argument replace, and is much more efficient for small strings because d and r only need to be constructed once (6x faster for a length-80 string in a quick test). We could easily add this example to the documentation for replace.

As another example, take @KlausC's example, replace(s, 'c'=>"BOOM", isupper=>""). If, instead, you do

buf = IOBuffer()
for c in s
    if c == 'c'
        print(buf, "BOOM")
    elseif !isupper(c)
        print(buf, c)
    end
end
String(take!(buf))

it is 30x faster on my machine for randstring(10^5).

Basically, if you are doing multiple replacements, there is almost always going to be a vastly faster way to do it than any generic multi-arg replace function we are likely to write. Character-predicate loops and the regex-alternative method, which probably cover the two most common uses of this function, are not even that hard to write.

By providing a multi-arg string replace function, we arguably discourage people from writing more efficient code for this problem, because they will assume that the standard-library implementation is probably better than anything mere mortals could write. And if they don't care about performance, nested replace calls are fine too.

nalimilan · 2018-01-08T14:45:30Z

@stevengj I don't understand your position. Nothing prevents us from writing efficient methods for cases where the pair's LHS are only chars or predicates, of for when they are only strings. If we are really concerned about trapping users with slow methods, we could even avoid supporting complex cases for which we have no efficient solution at the moment (notably, those mixing all three types of LHS).

StefanKarpinski · 2018-01-08T15:24:02Z

@stevengj I don't understand your position.

Agree. We can always turn a replacement of multiple string keys into a regex with a dictionary lookup in the implementation. We could even have a replacer function which generates a compiled version of the above and can then be called when needed for optimal performance.

stevengj · 2018-01-08T15:43:00Z

@nalimilan, the current API makes it almost impossible to write efficient code for e.g. the "alpha"=>"α", "beta"=>"β", "gamma"=>"γ" case applied to many short strings, because there is no way to cache the regex or the dict between searches. The same applies to any case where precomputation is helpful.

In contrast, giving the user an example of how to use | in a regex replacement teaches the user how to easily write efficient code for the most common cases of multiple replacements.

It's true that we could have a replacer API for precomputed cases. I'm just not sure it is worth it — how hard is it to get people to use |?

Also, a replacer precomputed-replacement API is precisely the sort of thing that could go into a package (because it would create its own Replacer type that could be passed e.g. to new Base.replace methods without type piracy).

stevengj · 2018-01-08T15:54:21Z

My position is that it is counterproductive to put an API into base with the rationale, "Use this API if you care about performance, instead of calling replace multiple times … but don't use this API if you actually care about performance, because it will often be 10x slower than writing your own regex | or character loop."

nalimilan · 2018-01-08T16:01:42Z

OK, I see your point about the inefficiency of rebuilding a regex and a dictionary on every call. But that doesn't apply to situations where the LHS is a Char or a predicate. We could at least add support for these, for consistency with the generic API for collections (which operates on one entry at a time). The methods where the LHS is a string or a regex could be added later if we find an efficient way to implement them (I see no reason why we shouldn't be able to match the performance of PCRE).

stevengj · 2018-01-08T16:11:10Z

@nalimilan, I'm not so opposed to a multi-arg replace function restricted to multiple single-character predicates. I think it should probably be a separate PR, though, since an efficient implementation for that special case will share no code with the current PR.

It might be confusing to support multi-arg replace for only a small subset of the supported string-replacement patterns, of course. Also, I'm not sure how useful it is: how often do people actually do multi-single-char string replacements where they wouldn't just use map or their own loop?

I see no reason why we shouldn't be able to match the performance of PCRE

To do that, you might have to precompute a search tree (to efficiently handle multiple strings with common prefixes), similar to PCRE, so you are still hurt by the impossibility of separate precomputation with the current API.

stevengj · 2018-01-08T16:15:06Z

Even for single-character replacement, it's pretty hard to write an efficient implementation using the current API, because the current API requires that a substring be passed to the replacement function, rather than a Char. Honestly, I think that most people doing complicated character mappings would be better off just writing a loop.

nalimilan · 2018-01-08T16:25:50Z

It might be confusing to support multi-arg replace for only a small subset of the supported string-replacement patterns, of course. Also, I'm not sure how useful it is: how often do people actually do multi-single-char string replacements where they wouldn't just use map or their own loop?

No idea, but supporting generic APIs sounds like a good idea in general. We've had at least one complaint about this already: https://discourse.julialang.org/t/inconsistency-between-replace-on-an-abstractstring-vs-vector-char/8206 (though note the example should be replace("foobar", 'o' => 'O', 'b' => 'B') for it to be a real inconsistency with Vector{Char}).

Even for single-character replacement, it's pretty hard to write an efficient implementation using the current API, because the current API requires that a substring be passed to the replacement function, rather than a Char.

Mmm, indeed. Looks like we should change this API before 0.7 as it's indeed inefficient and goes against the other methods for general iterables. It's a bit annoying, but we should probably pass a Char when the LHS is a Char or a predicate, and a SubString when the LHS is a string or a regex.

KlausC · 2018-01-09T20:52:24Z

I added the following changes due to the discussion between @stevengj and @nalimilan:

the algorithm is now O(n)
if all patterns are strings or chars, they are aggregated to a single regex
if all patterns are chars or predicate functions, they are composed to a single special function
patterns like 2./3. can arbitrarily be composed with regex, predicates, chars, and strings. If possible they are partially aggregated.
if the consolidation leads to a single regex, the standard function with one pat-repl-pair is called.
the preparation effort, including regex compilation, can be stored in a re-usable cache object (kwa 'cache' and struct ReplaceCache were added).
test cases for 2. - 4.

nalimilan · 2018-01-29T22:05:29Z

Even for single-character replacement, it's pretty hard to write an efficient implementation using the current API, because the current API requires that a substring be passed to the replacement function, rather than a Char.

@stevengj I wrote above that this was indeed a problem, but now I can't find the replace method we were talking about. Does it actually exist? AFAICT we do not currently support a replacement function with strings, and if we added this function for consistency with arrays it should take a Char.

KlausC · 2018-01-30T10:44:58Z

In v0.6 we had:

  | | |_| | | | (_| |  |  Version 0.6.1 (2017-10-24 22:15 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> replace("abc", "bc", uppercase)
"aBC"

in v0.7 we have:

  | | |_| | | | (_| |  |  Version 0.7.0-DEV.3647 (2018-01-30 07:37 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 6faa23bab0 (0 days old master)
|__/                   |  x86_64-redhat-linux

julia> replace("abc", "bc" => uppercase)
"aBC"

Implementation is here:

strings/utils.jl:391   _replace(io, repl::Function, str, r, pattern) = 
                                print(io, repl(SubString(str, first(r), last(r))))

So if repl is a Function, we call it on SubString(...).
Maybe str[r] would be better in general.
If length(r) == 1 calling repl(str[first[r]) would call the Char-function, which is much more efficient for the standard character conversion functions (Unicode.uppercase ...).

nalimilan · 2018-01-30T11:08:46Z

Ah, of course, I was looking for something like replace(f, s).

Maybe str[r] would be better in general.
If length(r) == 1 calling repl(str[first[r]) would call the Char-function, which is much more efficient for the standard character conversion functions (Unicode.uppercase ...).

Yes, though to ensure type stability the type of the argument should not depend on the length of the match, but just on the type of the LHS: if it's a Char or a set thereof, pass a Char to the function, else pass a SubString. Would you make a PR?

KlausC · 2018-01-30T11:36:39Z

I willl do. I try to add method:

_replace(io, repl::Function, str, r, pattern::Function) = 
                                print(io, repl(getindex(str, first(r))))

After the transformation of input patterns, that will cover all single-character types, leaving String and Regex as has been.

This has been attempted before, sometimes fairly similar to this, but the attempts seemed to be either too simple or too complicated. This aims to be simple, and even beats one of the "handwritten" benchmark cases. Past issues (e.g. JuliaLang#25396) have proposed that using Regex may be faster, but in my tests, this handily bests even simplified regexes. There can be slow Regexes patterns that can cause this to exhibit O(n^2) behavior, but only if the one of the earlier patterns is a partial match for a later pattern Regex and that Regex always matches O(n) of the input stream. This is a case that is hopefully usually avoidable in practice. fixes JuliaLang#35327 fixes JuliaLang#39061 fixes JuliaLang#35414 fixes JuliaLang#29849 fixes JuliaLang#30457 fixes JuliaLang#25396

This has been attempted before, sometimes fairly similar to this, but the attempts seemed to be either too simple or too complicated. This aims to be simple, and even beats one of the "handwritten" benchmark cases. Past issues (e.g. #25396) have proposed that using Regex may be faster, but in my tests, this handily bests even simplified regexes. There can be slow Regexes patterns that can cause this to exhibit O(n^2) behavior, but only if the one of the earlier patterns is a partial match for a later pattern Regex and that Regex always matches O(n) of the input stream. This is a case that is hopefully usually avoidable in practice. fixes #35327 fixes #39061 fixes #35414 fixes #29849 fixes #30457 fixes #25396

This has been attempted before, sometimes fairly similar to this, but the attempts seemed to be either too simple or too complicated. This aims to be simple, and even beats one of the "handwritten" benchmark cases. Past issues (e.g. JuliaLang#25396) have proposed that using Regex may be faster, but in my tests, this handily bests even simplified regexes. There can be slow Regexes patterns that can cause this to exhibit O(n^2) behavior, but only if the one of the earlier patterns is a partial match for a later pattern Regex and that Regex always matches O(n) of the input stream. This is a case that is hopefully usually avoidable in practice. fixes JuliaLang#35327 fixes JuliaLang#39061 fixes JuliaLang#35414 fixes JuliaLang#29849 fixes JuliaLang#30457 fixes JuliaLang#25396

KlausC and others added 5 commits January 4, 2018 21:23

replace with multiple pattern-repl pairs

c7e2235

Merge branch 'master' into replace-multiple-pairs

e19d243

resolved deprecation conflicts

c3640e4

resolved merge conflicts

b58ab28

consistently use findnext

93e2bc3

nalimilan reviewed Jan 7, 2018

View reviewed changes

review results for replace multiple pairs

fe86734

stevengj reviewed Jan 7, 2018

View reviewed changes

added test cases with count

f3df10a

KlausC added 2 commits January 8, 2018 23:04

implemented O(n) algorithm and regex optimizations

58d61d2

optimizations and test cases added

61d9b81

KlausC mentioned this pull request Jan 24, 2018

replace for strings with multiple patterns #25732

Closed

KlausC closed this Jan 24, 2018

KlausC mentioned this pull request Jan 30, 2018

use Char replacement function for Charpattern #25815

Merged

nalimilan mentioned this pull request Dec 21, 2018

Allow multiple pair arguments in replace on strings. #30457

Closed

tkf mentioned this pull request Apr 1, 2020

replace does not handle multiple patterns for String #35327

Closed

stevengj mentioned this pull request Apr 12, 2020

[RFC] support multiple s=>r Pairs in strings replace method #35414

Closed

replace with multiple pattern-repl pairs #25396

replace with multiple pattern-repl pairs #25396

Conversation

KlausC commented Jan 4, 2018

KlausC commented Jan 4, 2018

KlausC commented Jan 5, 2018

StefanKarpinski commented Jan 5, 2018

rfourquet commented Jan 6, 2018

nalimilan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

KlausC Jan 7, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

KlausC Jan 7, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

KlausC Jan 7, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

KlausC Jan 7, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

KlausC Jan 7, 2018 • edited Loading

Choose a reason for hiding this comment

stevengj commented Jan 7, 2018 • edited Loading

stevengj commented Jan 7, 2018 • edited Loading

KlausC commented Jan 7, 2018 • edited by nalimilan Loading

nalimilan commented Jan 7, 2018

KlausC commented Jan 7, 2018 • edited Loading

stevengj commented Jan 7, 2018 • edited Loading

stevengj commented Jan 7, 2018 • edited Loading

KlausC commented Jan 7, 2018

KlausC commented Jan 7, 2018

stevengj commented Jan 7, 2018

stevengj commented Jan 8, 2018

stevengj commented Jan 8, 2018 • edited Loading

stevengj commented Jan 8, 2018 • edited Loading

StefanKarpinski commented Jan 8, 2018

stevengj commented Jan 8, 2018 • edited Loading

nalimilan commented Jan 8, 2018

StefanKarpinski commented Jan 8, 2018

stevengj commented Jan 8, 2018 • edited Loading

stevengj commented Jan 8, 2018

nalimilan commented Jan 8, 2018

stevengj commented Jan 8, 2018 • edited Loading

stevengj commented Jan 8, 2018 • edited Loading

nalimilan commented Jan 8, 2018

KlausC commented Jan 9, 2018

nalimilan commented Jan 29, 2018

KlausC commented Jan 30, 2018

nalimilan commented Jan 30, 2018

KlausC commented Jan 30, 2018 • edited Loading

KlausC Jan 7, 2018 •

edited

Loading

KlausC Jan 7, 2018 •

edited

Loading

KlausC Jan 7, 2018 •

edited

Loading

KlausC Jan 7, 2018 •

edited

Loading

KlausC Jan 7, 2018 •

edited

Loading

stevengj commented Jan 7, 2018 •

edited

Loading

stevengj commented Jan 7, 2018 •

edited

Loading

KlausC commented Jan 7, 2018 •

edited by nalimilan

Loading

KlausC commented Jan 7, 2018 •

edited

Loading

stevengj commented Jan 7, 2018 •

edited

Loading

stevengj commented Jan 7, 2018 •

edited

Loading

stevengj commented Jan 8, 2018 •

edited

Loading

stevengj commented Jan 8, 2018 •

edited

Loading

stevengj commented Jan 8, 2018 •

edited

Loading

stevengj commented Jan 8, 2018 •

edited

Loading

stevengj commented Jan 8, 2018 •

edited

Loading

stevengj commented Jan 8, 2018 •

edited

Loading

KlausC commented Jan 30, 2018 •

edited

Loading