-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
replace with multiple pattern-repl pairs #25396
Conversation
A conflicting change, which introduced |
@StefanKarpinski is there a chance to get that reviewed. |
Not by me, I'm afraid. Maybe @rfourquet or @nalimilan? |
Many thanks for your contribution! I'm not a string guy, and this PR looks rather technically involved, so I'm afraid I won't be able to review it any time soon. I have now very limited free time, which I need to dedicate on things to be done for 0.7. But if no-one can review this in time for 0.7, I'm confident that it will get incorporated for 1.1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Looks mostly good, the main point to discuss is the behavior when several pairs match.
base/strings/util.jl
Outdated
return str | ||
end | ||
count < 0 && throw(DomainError(count, "`count` must be non-negative.")) | ||
pat_repls = pairmap.(pat_repls) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this type-stable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I rely on the broadcasting dot.
Looks so, for me. Input is a tuple of Pair
, as output is:
julia> pp(p::Pair{Char}) = equalto(p.first)=>p.second
pp (generic function with 3 methods)
julia> pp(p::Pair) = p.first=>p.second
pp (generic function with 3 methods)
julia> pp.( ('a'=>'b', "xyz"=>"") )
(Base.EqualTo{Char}('a') => 'b', "xyz" => "")
julia> typeof(pp.( ('a'=>'b', "xyz"=>"") ))
Tuple{Pair{Base.EqualTo{Char},Char},Pair{String,String}}
julia> pp.( ('a'=>'b', "xyz"=>"") ) isa NTuple{N,Pair} where N
true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should work, but run @code_warntype
on the replace
method to make sure that's the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nalimilan , you wanted to see the complete output.
julia> using Unicode
julia> replace("abcde", 'a'=>'Ä', ['b', 'c']=>uppercase)
"ÄBCde"
julia> @code_warntype replace("abcde", 'a'=>'Ä', ['b', 'c']=>uppercase)
Variables:
str::String
pat_repls@_3::Tuple{Pair{Char,Char},Pair{Array{Char,1},typeof(uppercase)}}
count<optimized out>
n::Int64
e<optimized out>
a<optimized out>
i::Any
r::Any
pattern::Any
repl::Any
j::Any
k::Any
out<optimized out>
#temp#@_23::Any
px<optimized out>
py<optimized out>
R<optimized out>
#temp#@_28::Bool
pat_repls@_31::Tuple{Pair{Char,Char},Pair{Array{Char,1},typeof(uppercase)}}
pat_repls@_32::Tuple{Pair{Base.EqualTo{Char},Char},Pair{Base.OccursIn{Array{Char,1}},typeof(uppercase)}}
Body:
begin
pat_repls@_31::Tuple{Pair{Char,Char},Pair{Array{Char,1},typeof(uppercase)}} = pat_repls@_3::Tuple{Pair{Char,Char},Pair{Array{Char,1},typeof(uppercase)}}
#= line 429 =#
goto 4
4:
5:
goto 8
#= line 430 =#
8:
#= line 432 =#
goto 11
11:
#= line 433 =#
# meta: location broadcast.jl broadcast 241
# meta: location tuple.jl map 151
# meta: location tuple.jl getindex 21
SSAValue(62) = (Base.getfield)(pat_repls@_31::Tuple{Pair{Char,Char},Pair{Array{Char,1},typeof(uppercase)}}, 1, true)::Pair{Char,Char}
# meta: pop location
# meta: location strings/util.jl predicatepair 392
# meta: location sysimg.jl getproperty 8
SSAValue(67) = (Base.getfield)(SSAValue(62), :first)::Char
# meta: pop location
# meta: location operators.jl Type 831
SSAValue(70) = $(Expr(:new, Base.EqualTo{Char}, SSAValue(67)))
# meta: pop location
# meta: location sysimg.jl getproperty 8
SSAValue(71) = (Base.getfield)(SSAValue(62), :second)::Char
# meta: pop location
# meta: location pair.jl Type 32
# meta: location pair.jl Type 32
SSAValue(77) = $(Expr(:new, Pair{Base.EqualTo{Char},Char}, SSAValue(70), SSAValue(71)))
# meta: pop locations (3)
# meta: location tuple.jl getindex 21
SSAValue(78) = (Base.getfield)(pat_repls@_31::Tuple{Pair{Char,Char},Pair{Array{Char,1},typeof(uppercase)}}, 2, true)::Pair{Array{Char,1},typeof(uppercase)}
# meta: pop location
# meta: location strings/util.jl predicatepair 396
# meta: location sysimg.jl getproperty 8
SSAValue(83) = (Base.getfield)(SSAValue(78), :first)::Array{Char,1}
# meta: pop location
# meta: location operators.jl Type 850
SSAValue(86) = $(Expr(:new, Base.OccursIn{Array{Char,1}}, SSAValue(83)))
# meta: pop location
# meta: location pair.jl Type 32
# meta: location pair.jl Type 32
SSAValue(92) = $(Expr(:new, Pair{Base.OccursIn{Array{Char,1}},typeof(uppercase)}, SSAValue(86), Base.Unicode.uppercase))
# meta: pop locations (3)
SSAValue(61) = (Core.tuple)(SSAValue(77), SSAValue(92))::Tuple{Pair{Base.EqualTo{Char},Char},Pair{Base.OccursIn{Array{Char,1}},typeof(uppercase)}}
# meta: pop locations (2)
pat_repls@_32::Tuple{Pair{Base.EqualTo{Char},Char},Pair{Base.OccursIn{Array{Char,1}},typeof(uppercase)}} = SSAValue(61)
#= line 434 =#
n::Int64 = 1
#= line 435 =#
# meta: location strings/basic.jl endof 145
# meta: location strings/string.jl ncodeunits 74
SSAValue(97) = (Core.sizeof)(str::String)::Int64
# meta: pop location
SSAValue(95) = $(Expr(:invoke, MethodInstance for thisind(::String, ::Int64), :(Base.thisind), :(str), SSAValue(97)))::Int64
# meta: pop location
#= line 436 =#
i::Any = 1
#= line 437 =#
SSAValue(5) = $(Expr(:invoke, MethodInstance for findnextall(::Tuple{Pair{Base.EqualTo{Char},Char},Pair{Base.OccursIn{Array{Char,1}},typeof(uppercase)}}, ::String, ::Int64), :(Base.findnextall), :(pat_repls@_32), :(str), :(i::Int64)))::Tuple{Any,Pair}
# meta: location tuple.jl indexed_next 58
# meta: location tuple.jl getindex 21
SSAValue(101) = (Base.getfield)(SSAValue(5), 1, true)::Any
# meta: pop locations (2)
r::Any = SSAValue(101)
# meta: location tuple.jl indexed_next 58
# meta: location tuple.jl getindex 21
SSAValue(105) = (Base.getfield)(SSAValue(5), 2, true)::Pair
# meta: pop locations (2)
# meta: location pair.jl indexed_next 42
SSAValue(106) = (Base.getfield)(SSAValue(105), 1)::Any
# meta: pop location
pattern::Any = SSAValue(106)
# meta: location pair.jl indexed_next 42
SSAValue(109) = (Base.getfield)(SSAValue(105), 2)::Any
# meta: pop location
repl::Any = SSAValue(109)
#= line 438 =#
SSAValue(11) = (Base.first)(r::Any)::Any
SSAValue(12) = (Base.last)(r::Any)::Any
j::Any = SSAValue(11)
k::Any = SSAValue(12)
#= line 439 =#
# meta: location strings/basic.jl sizeof 144
# meta: location strings/string.jl ncodeunits 74
SSAValue(117) = (Core.sizeof)(str::String)::Int64
# meta: pop location
# meta: location int.jl * 54
SSAValue(118) = (Base.mul_int)(SSAValue(117), 1)::Int64
# meta: pop locations (2)
# meta: location promotion.jl * 291
# meta: location promotion.jl promote 261
# meta: location promotion.jl _promote 219
# meta: location number.jl convert 7
# meta: location float.jl Type 57
SSAValue(131) = (Base.sitofp)(Float64, SSAValue(118))::Float64
# meta: pop locations (4)
# meta: location float.jl * 397
SSAValue(142) = (Base.mul_float)(1.2, SSAValue(131))::Float64
# meta: pop locations (2)
# meta: location float.jl floor 351
# meta: location float.jl floor 362
SSAValue(145) = (Base.floor_llvm)(SSAValue(142))::Float64
# meta: pop location
# meta: location float.jl trunc 679
# meta: location float.jl <= 452
SSAValue(150) = (Base.le_float)(-9.223372036854776e18, SSAValue(145))::Bool
# meta: pop location
unless SSAValue(150) goto 116
# meta: location float.jl < 450
SSAValue(151) = (Base.lt_float)(SSAValue(145), 9.223372036854776e18)::Bool
# meta: pop location
#temp#@_28::Bool = SSAValue(151)
goto 118
116:
#temp#@_28::Bool = false
118:
unless #temp#@_28::Bool goto 125
#= line 680 =#
# meta: location float.jl unsafe_trunc 298
SSAValue(152) = (Base.fptosi)(Int64, SSAValue(145))::Int64
# meta: pop location
goto 129
125:
#= line 682 =#
SSAValue(148) = $(Expr(:invoke, MethodInstance for InexactError(::Symbol, ::Any, ::Any), :(Base.InexactError), :(:trunc), Int64, SSAValue(145)))::InexactError
(Base.throw)(SSAValue(148))::Union{}
129:
# meta: pop locations (2)
# meta: location iobuffer.jl StringVector 30
# meta: location strings/string.jl _string_n 52
# meta: location essentials.jl cconvert 327
# meta: location number.jl convert 7
# meta: location boot.jl Type 681
# meta: location boot.jl toUInt64 651
# meta: location boot.jl check_top_bit 540
# meta: location boot.jl is_top_bit_set 530
SSAValue(170) = (Core.lshr_int)(SSAValue(152), 63)::Int64
SSAValue(171) = (Core.trunc_int)(Core.Int8, SSAValue(170))::Int8
SSAValue(173) = (Core.eq_int)(SSAValue(171), 1)::Bool
# meta: pop location
unless SSAValue(173) goto 145
$(Expr(:invoke, MethodInstance for throw_inexacterror(::Symbol, ::Type, ::Int64), :(Core.throw_inexacterror), :(:check_top_bit), Int64, SSAValue(152)))::Union{}
145:
# meta: pop location
SSAValue(164) = (Core.bitcast)(Core.UInt64, SSAValue(152))::UInt64
# meta: pop locations (4)
SSAValue(158) = $(Expr(:foreigncall, :(:jl_alloc_string), Ref{String}, svec(UInt64), :(:ccall), 1, SSAValue(164), SSAValue(164)))
# meta: pop location
# meta: location strings/string.jl unsafe_wrap 63
SSAValue(174) = $(Expr(:foreigncall, :(:jl_string_to_array), Ref{Array{UInt8,1}}, svec(Any), :(:ccall), 1, SSAValue(158)))
# meta: pop locations (2)
# meta: location iobuffer.jl Type 63
# meta: location iobuffer.jl Type 63
# meta: location iobuffer.jl Type 26
# meta: location iobuffer.jl Type 19
# meta: location array.jl length 138
SSAValue(192) = (Base.arraylen)(SSAValue(174))::Int64
# meta: pop location
SSAValue(191) = $(Expr(:new, Base.GenericIOBuffer{Array{UInt8,1}}, SSAValue(174), true, true, true, false, SSAValue(192), 9223372036854775807, 1, -1))
# meta: pop locations (4)
#= line 440 =#
goto 165
165:
166:
# meta: location sysimg.jl setproperty! 9
(Base.setfield!)(SSAValue(191), :size, 0)::Int64
# meta: pop location
#= line 441 =#
goto 172
172:
173:
# meta: location sysimg.jl setproperty! 9
(Base.setfield!)(SSAValue(191), :ptr, 1)::Int64
# meta: pop location
#= line 442 =#
178:
SSAValue(27) = (j::Any != 0)::Any
unless SSAValue(27) goto 264
#= line 443 =#
SSAValue(28) = (i::Any == 1)::Any
unless SSAValue(28) goto 186
#temp#@_23::Any = SSAValue(28)
goto 188
186:
#temp#@_23::Any = (i::Any <= k::Any)::Any
188:
unless #temp#@_23::Any goto 197
#= line 444 =#
SSAValue(30) = (Base.pointer)(str::String, i::Any)::Ptr{UInt8}
SSAValue(31) = (j::Any - i::Any)::Any
SSAValue(32) = (Base.UInt)(SSAValue(31))::Any
(Base.unsafe_write)(SSAValue(191), SSAValue(30), SSAValue(32))::Int64
#= line 445 =#
(Base._replace)(SSAValue(191), repl::Any, str::String, r::Any, pattern::Any)::Any
197:
#= line 447 =#
SSAValue(33) = (k::Any < j::Any)::Any
unless SSAValue(33) goto 211
#= line 448 =#
i::Any = j::Any
#= line 449 =#
SSAValue(34) = (j::Any > SSAValue(95))::Any
unless SSAValue(34) goto 207
goto 264
207:
#= line 450 =#
k::Any = (Base.nextind)(str::String, j::Any)::Int64
goto 216
211:
#= line 452 =#
SSAValue(35) = (Base.nextind)(str::String, k::Any)::Int64
k::Any = SSAValue(35)
i::Any = SSAValue(35)
216:
#= line 454 =#
SSAValue(36) = $(Expr(:invoke, MethodInstance for findnextall(::Tuple{Pair{Base.EqualTo{Char},Char},Pair{Base.OccursIn{Array{Char,1}},typeof(uppercase)}}, ::String, ::Int64), :(Base.findnextall), :(pat_repls@_32), :(str), :(k::Int64)))::Tuple{Any,Pair}
# meta: location tuple.jl indexed_next 58
# meta: location tuple.jl getindex 21
SSAValue(204) = (Base.getfield)(SSAValue(36), 1, true)::Any
# meta: pop locations (2)
r::Any = SSAValue(204)
# meta: location tuple.jl indexed_next 58
# meta: location tuple.jl getindex 21
SSAValue(208) = (Base.getfield)(SSAValue(36), 2, true)::Pair
# meta: pop locations (2)
# meta: location pair.jl indexed_next 42
SSAValue(209) = (Base.getfield)(SSAValue(208), 1)::Any
# meta: pop location
pattern::Any = SSAValue(209)
# meta: location pair.jl indexed_next 42
SSAValue(212) = (Base.getfield)(SSAValue(208), 2)::Any
# meta: pop location
repl::Any = SSAValue(212)
#= line 455 =#
SSAValue(42) = r::Any
# meta: location range.jl colon 5
# meta: location range.jl Type 153
SSAValue(221) = $(Expr(:new, UnitRange{Int64}, 0, -1))
# meta: pop locations (2)
SSAValue(44) = (SSAValue(42) == SSAValue(221))::Any
unless SSAValue(44) goto 245
goto 251
245:
# meta: location promotion.jl == 386
SSAValue(222) = (n::Int64 === 9223372036854775807)::Bool
# meta: pop location
unless SSAValue(222) goto 251
goto 264
251:
#= line 456 =#
SSAValue(46) = (Base.first)(r::Any)::Any
SSAValue(47) = (Base.last)(r::Any)::Any
j::Any = SSAValue(46)
k::Any = SSAValue(47)
#= line 457 =#
# meta: location int.jl + 53
SSAValue(223) = (Base.add_int)(n::Int64, 1)::Int64
# meta: pop location
n::Int64 = SSAValue(223)
262:
goto 178
264:
#= line 459 =#
SSAValue(49) = (Base.SubString)(str::String, i::Any)::SubString{String}
# meta: location strings/io.jl write 139
# meta: location sysimg.jl getproperty 8
SSAValue(233) = (Base.getfield)(SSAValue(49), :ncodeunits)::Int64
# meta: pop location
# meta: location int.jl <= 405
SSAValue(234) = (Base.sle_int)(SSAValue(233), 0)::Bool
# meta: pop location
unless SSAValue(234) goto 276
goto 342
276:
# meta: location sysimg.jl getproperty 8
SSAValue(235) = (Base.getfield)(SSAValue(49), :string)::String
# meta: pop location
# meta: location sysimg.jl getproperty 8
SSAValue(236) = (Base.getfield)(SSAValue(49), :offset)::Int64
# meta: pop location
# meta: location int.jl + 53
SSAValue(237) = (Base.add_int)(SSAValue(236), 1)::Int64
# meta: pop location
# meta: location strings/string.jl pointer 72
# meta: location strings/string.jl pointer 71
# meta: location pointer.jl unsafe_convert 59
# meta: location pointer.jl pointer_from_objref 142
goto 291
291:
#= line 143 =#
SSAValue(250) = $(Expr(:foreigncall, :(:jl_value_ptr), Ptr{Nothing}, svec(Any), :(:ccall), 1, SSAValue(235)))
# meta: pop location
# meta: location pointer.jl + 154
# meta: location boot.jl Type 691
SSAValue(257) = (Core.bitcast)(Core.UInt, SSAValue(250))::UInt64
# meta: pop location
SSAValue(255) = (add_ptr)(SSAValue(257), 0x0000000000000008)::UInt64
# meta: location essentials.jl oftype 301
# meta: location pointer.jl convert 26
# meta: location boot.jl Type 696
SSAValue(263) = (Core.bitcast)(Ptr{Nothing}, SSAValue(255))::Ptr{Nothing}
# meta: pop locations (4)
# meta: location pointer.jl convert 30
SSAValue(265) = (Base.bitcast)(Ptr{UInt8}, SSAValue(263))
# meta: pop locations (3)
# meta: location int.jl - 52
SSAValue(266) = (Base.sub_int)(SSAValue(237), 1)::Int64
# meta: pop location
# meta: location pointer.jl + 154
# meta: location boot.jl Type 691
SSAValue(273) = (Core.bitcast)(Core.UInt, SSAValue(265))::UInt64
# meta: pop location
# meta: location int.jl rem 445
SSAValue(274) = (Base.bitcast)(UInt64, SSAValue(266))
# meta: pop location
SSAValue(271) = (add_ptr)(SSAValue(273), SSAValue(274))::UInt64
# meta: location essentials.jl oftype 301
# meta: location pointer.jl convert 26
# meta: location boot.jl Type 696
SSAValue(280) = (Core.bitcast)(Ptr{UInt8}, SSAValue(271))::Ptr{UInt8}
# meta: pop locations (5)
# meta: location sysimg.jl getproperty 8
SSAValue(281) = (Base.getfield)(SSAValue(49), :ncodeunits)::Int64
# meta: pop location
# meta: location boot.jl Type 681
# meta: location boot.jl toUInt64 651
# meta: location boot.jl check_top_bit 540
# meta: location boot.jl is_top_bit_set 530
SSAValue(291) = (Core.lshr_int)(SSAValue(281), 63)::Int64
SSAValue(292) = (Core.trunc_int)(Core.Int8, SSAValue(291))::Int8
SSAValue(294) = (Core.eq_int)(SSAValue(292), 1)::Bool
# meta: pop location
unless SSAValue(294) goto 337
$(Expr(:invoke, MethodInstance for throw_inexacterror(::Symbol, ::Type, ::Int64), :(Core.throw_inexacterror), :(:check_top_bit), Int64, SSAValue(281)))::Union{}
337:
# meta: pop location
SSAValue(285) = (Core.bitcast)(Core.UInt64, SSAValue(281))::UInt64
# meta: pop locations (2)
$(Expr(:invoke, MethodInstance for unsafe_write(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Ptr{UInt8}, ::UInt64), :(Base.unsafe_write), SSAValue(191), SSAValue(280), SSAValue(285)))::Int64
342:
# meta: pop location
#= line 460 =#
SSAValue(50) = $(Expr(:invoke, MethodInstance for take!(::Base.GenericIOBuffer{Array{UInt8,1}}), :(Base.take!), SSAValue(191)))::Array{UInt8,1}
# meta: location strings/string.jl Type 30
SSAValue(295) = $(Expr(:foreigncall, :(:jl_array_to_string), Ref{String}, svec(Any), :(:ccall), 1, SSAValue(50)))
# meta: pop location
return SSAValue(295)
end::String
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, damn keyword arguments, they hide the body of the function. I guess the simplest approach is to remove the count
argument for the check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the output of @code_warntype
after removing the keyword argument.
There are test cases missing, which use this argument :_)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, looks good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, the output looks incomplete. Some variables do not appear to be inferred as concrete types, but they don't appear in the body.
base/strings/util.jl
Outdated
i = a = start(str) | ||
r, (pattern, repl) = findnextall(pat_repls, str, i) | ||
j, k = first(r), last(r) | ||
out = IOBuffer(StringVector(floor(Int, 1.2sizeof(str))), true, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a comment to explain the choice of this constant? Why not start with the size of the input string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this code is inherited. I can only speculate. Maybe the author made extensive probability studies?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, OK. Let's leave it as-is then, unless somebody can explain this choice.
base/strings/util.jl
Outdated
@@ -427,16 +497,19 @@ function replace(str::String, pat_repl::Pair; count::Integer=typemax(Int)) | |||
end | |||
|
|||
""" | |||
replace(s::AbstractString, pat=>r; [count::Integer]) | |||
replace(s::AbstractString, pat=>r[, pat=>r ...]; [count::Integer]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just add ...
after the first pat => r
, no need to repeat it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
base/strings/util.jl
Outdated
# find the first of a list of patterns. | ||
# if several match at the same start position prefer the longer one | ||
# if also the length is equal, prefer the first in list | ||
# return value of the successful search and the pair |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"value" -> "index range"
Also, better not add a blank line after the comment so that it's obvious it's attached to the function. Same below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
base/strings/util.jl
Outdated
# return value of the successful search and the pair | ||
|
||
function findnextall(pairs::NTuple{N,Pair}, s::AbstractString, st::Integer=1) where N | ||
ps = start(pairs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For tuples start is always 1
. You don't need to call next
either, you can just do for i in 1:N
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first I will include the as an optimization.
I want to continue with while/done/next
because it looks more elegant and performant to me than
N = length(pairs)
for i = 1:N
p = pairs[i]
....
if fr > 0
...
for j = i+1:N
p = pairs[j]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More performant: definitely not. More elegant: can always be discussed, but it's not obvious. Note that N
is already defined as a type parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, you convinced me. I copied my pattern from somewhere in base or stdlib (don't remember where exactly), and thought performance was the driver.
Thanks for the hint regarding N
.
Is it better to introduce a second variable j
for the inner loop or can I re-use the i
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say it's clearer to use j
.
Using next
/done
is needed to support generic iterators, so that might be the reason why it was used.
base/strings/util.jl
Outdated
pairmap(p::Pair) = p | ||
|
||
# find the first of a list of patterns. | ||
# if several match at the same start position prefer the longer one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be better to just retain the first matching pair? At least that sounds simpler to explain. Or is that to match the behavior of other implementations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, this is special for strings. In case of ties, I do not simply take the first match, but give precedence to the longer match, because it is more specific. Maybe the example demonstrates its usefulness:
replace("one house, two houses", "house"=>"Haus", "houses"=>"Häuser", "one"=>"ein", "two"=>"zwei")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But can't the user just choose the appropriate order for the pairs? That won't work for regular expressions (since the length of the match may not be known in advance), but people would better use a single complex regular expression instead I think.
Another potential issue with not simply following the order in which pairs are provided is performance. One of the motivations for allowing for multiple pairs was to be more efficient than calling replace
multiple times with different pairs. Doesn't having to match all pairs and compare them make the function slower? Or is that cost only paid when patterns overlap?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cost may increase only, if shorter patterns, which match are followed by longer patterns, which match at the same position. In other cases, I can't know in advance, it a later pattern does not match at an earlier position.
BTW I would prefer to use a modified version of findnext
, which has an extra parameter does_not_start_after
, which would allow not to search the full string for all patterns, but stop earlier for the later patterns, if there was a match before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docu text for replace
has been updated. That should make the rule clear to the user of the new feature:
510 In the case of multiple patterns, the precedence is by low start position of matched pattern,
511 then longer pattern, then order of the input list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about the additional optional argument for findnext(,::AbstractString,...)
?
I think the does_not_start_after
could reduce search-times in a few cases. It would have another effect, than simply replacing the input string by a shortened substring.
The contract of this modified findnext(pattern, input, start, not_after
would sound like:
"Give me the position of the next pattern match after start
, but don't try input beyond not_after
"
base/strings/util.jl
Outdated
count::Integer=typemax(Int)) = | ||
replace(str, occursin(first(pat_repl)) => last(pat_repl), count) | ||
# replace Char with function that compares with this Char | ||
pairmap(p::Pair{Char}) = equalto(p.first) => p.second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pairmap
sounds too general. Maybe predicatepair
or any other term which doesn't risk being used elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed ot predicatepair
.
base/strings/util.jl
Outdated
minstart, minend, minp = fr, last(r), p | ||
while !done(pairs, ps) | ||
p, ps = next(pairs, ps) | ||
r = findnext(p.first, s, st) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to say that I'm skeptical of the whole approach in this PR. First converting the user inputs to a tuple of Pair{Function}
and then calling findnext
on the function seems almost guaranteed to be inefficient, in several ways.
For one thing, storing a bunch of functions in a data structure and then calling findnext
on them probably makes it impossible for the compiler to specialize findnext
by inlining the functions.
For another thing, this means that the search procedure is completely re-done for each pair, when I think that a lot of computations could probably be shared.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, I should disagree :-).
Most of your objections came also to my mind, but I could handle it.
- the code was copied from the original implementation of
replace
, which has not been removed, so there is no worse performance than before for the case with one pair. - the use of
findnext
had been inserted to the code base shortly before this PR. It replacedsearchnext
, which could handleChar
and collections ofChar
as patterns, whichfindmax
does not do. So for those cases, theequalto
andoccursin
need used. I think, the authors offindmax
proved, that performance is not worse than forsearchnext
. I could measure a relevant improvement. - Replacing
Char
byequalsto(::Char)
is only done once for each call toreplace
, while in the old approach withsearchnext
the specialization had to be done for each call ofsearchnext
, which has to be done up tolength(s)
times for each call toreplace
. - my function
findnextall
is a slight generalization offindmax
, which is thought to make simultaneous pattern searches. Given the complexity offindnext
it was thought better to re-use that than to re-implement the multiple find next from scratch. The rest of the code ofreplace
was re-used, replacingfindnext
withfindnextall
. - I see an improvement possibility, by extending the
findnext
function by a new optional argument, which can prevent to search the full input string, when already a bound for the starting point of the match is known. That could decrease search times by up to 50%. - I do believe, the effort used by this implementation is appropriate for the complexity of the problem, which is solved. I am waiting for a proof of the contrary (ask Prof. Knuth).
For the case where the replacement patterns are a bunch of strings or sets of characters, can't we just create a single regular expression and a dictionary of replacements and then call the regex replace function? For example, for the special case of replacing a bunch of strings with other strings, function myreplace(s::AbstractString, p::Pair{<:AbstractString,<:AbstractString}...)
d = Dict(p...)
buf = IOBuffer()
frst = true
for pair in p
frst || print(buf, '|')
frst = false
escape_string(buf, first(pair), "[\\^\$.|?*+()")
end
r = Regex(String(take!(buf)))
replace(s, r => (s -> d[s]))
end This allows us to exploit the efficiency of regex search for multiple patterns. This code could easily be extended to support characters and sets of characters as well as replacement functions. Extending it to sets of regexes would be trickier (e.g. you can't just do a dictionary lookup to get the replacement), but honestly I'm not sure that case is worth implementing. If the user is constructing a bunch of regexes, it seems reasonable to ask them to combine it into a single regex themselves. Still, it's not too bad if you are willing to do a linear search on the replacements instead of (or in addition to) a dictionary lookup. I think it should be sufficient to put parens around each regex (when they are concatenated with |
Copying my comment from above, since it is hidden (attached to an outdated commit):
|
@stevengj , I am just exhausted from defending my own approach - please read that. |
I guess only benchmarking can tell, but it's not obvious to me that creating a regexp will always be more efficient, as they are significantly more complex than simple replacements of chars or strings. Maybe we could use it as a fallback for complex cases, but use simpler method for special cases.
Except if the data structure is a tuple. In some cases at least, the compiler is able to unroll a loop over a tuple and the functions could be inlined. Of course this has to be checked.
For the simple case of matching a |
For this case I do not see a replacement using
|
@KlausC, fair enough, your approach is more general than trying to construct a single regex, which obviously cannot handle arbitrary user-defined character predicates. But I'm still concerned that it is potentially very inefficient. For example, it looks like: n = 10^6
s = "a"^n * "b"
replace(s, "a"=>"A", "b"=>"c") will have Θ(n²) complexity in your current approach. The reason is that your implementation, which calls To mitigate this, you would need to implement a more complicated logic: if one of the predicate searches finds a match beyond the end of the current match (or finds no match at all), then this search result can in principle be re-used |
Note that (in a quick test in Julia 0.6) the performance of my |
I see, that your approach is better by factor 128 in time, 9 in memory allocations, 7 in memory size.
But, I did not use the improved version of |
@stevengj But this modification will not safe me. Just changing "a" and "b" at one place would give same O(n²). |
@KlausC, I'm not sure why changing "a" to "b" at one place in the string would make my suggested mitigation strategy O(n²). My suggestion would only re-do a "b" search once the |
For example, isn't this modification of your code O(n)? import Base: _replace, replace, StringVector
# replace Char with function that compares with this Char
predicatepair(p::Pair{Char}) = equalto(p.first) => p.second
# replace collection of Char with function that checks occurrence in this collection
function predicatepair(p::Pair{<:Union{Tuple{Vararg{Char}},AbstractVector{Char},Set{Char}}})
occursin(p.first) => p.second
end
predicatepair(p::Pair) = p
# find the first of a list of patterns.
# if several match at the same start position prefer the longer one
# if also the length is equal, prefer the first in list
# return index-range of the successful search and the pair
function findnextall!(pairs::NTuple{N,Pair}, s::AbstractString, st::Integer, rangecache) where N
minstart=minend=endof(s)+1
minp=""=>""
for n = 1:N
p = pairs[n]
fr = first(rangecache[n])
if fr ≥ st
r = rangecache[n] # re-use previous search result
last(r) == 0 && continue # not found
else
r = rangecache[n] = findnext(p.first, s, st)
fr = first(r)
end
if fr > 0
e = last(r)
if fr < minstart || (fr == minstart && e > minend)
minstart, minend, minp = fr, e, p
end
else # not found
rangecache[n] = endof(s)+1:0
end
end
return minstart:minend, minp
end
# cover the multiple pairs case
function replace(str::String, pat_repls::Pair...; count::Integer=typemax(Int))
if count == 0 || length(pat_repls) == 0
return str
end
count < 0 && throw(DomainError(count, "`count` must be non-negative."))
pat_repls = predicatepair.(pat_repls)
n = 1
e = endof(str)
i = a = start(str)
rangecache = [0:0 for i=1:length(pat_repls)]
r, (pattern, repl) = findnextall!(pat_repls, str, i, rangecache)
j, k = first(r), last(r)
out = IOBuffer(StringVector(floor(Int, 1.2sizeof(str))), true, true)
out.size = 0
out.ptr = 1
while j ≤ endof(str)
if i == a || i <= k
unsafe_write(out, pointer(str, i), UInt(j-i))
_replace(out, repl, str, r, pattern)
end
if k < j
i = j
j > e && break
k = nextind(str, j)
else
i = k = nextind(str, k)
end
r, (pattern, repl) = findnextall!(pat_repls, str, k, rangecache)
r == 0:-1 || n == count && break
j, k = first(r), last(r)
n += 1
end
write(out, SubString(str, i))
String(take!(out))
end It gets the string benchmark above down to 0.92 seconds (still worse than regex, though). |
Another benchmark (using the modified O(N) version of your
The huge number of allocations in (The allocations for |
This version eliminates a couple of type instabilities, cutting down the number of allocations by 25% or so in the above benchmarks, but it is only slightly faster: import Base: _replace, replace, StringVector
# replace Char with function that compares with this Char
predicatepair(p::Pair{Char}) = equalto(p.first) => p.second
# replace collection of Char with function that checks occurrence in this collection
function predicatepair(p::Pair{<:Union{Tuple{Vararg{Char}},AbstractVector{Char},Set{Char}}})
occursin(p.first) => p.second
end
predicatepair(p::Pair) = p
# find the first of a list of patterns.
# if several match at the same start position prefer the longer one
# if also the length is equal, prefer the first in list
# return index-range of the successful search and the pair
function findnextall!(pairs::NTuple{N,Pair}, s::AbstractString, st::Integer, rangecache) where N
minstart=minend=endof(s)+1
nmin=0
for n = 1:N
p = pairs[n]
fr = first(rangecache[n])
if fr ≥ st
r = rangecache[n] # re-use previous search result
last(r) == 0 && continue # not found
else
r = rangecache[n] = findnext(p.first, s, st)
fr = first(r)
end
if fr > 0
e = last(r)
if fr < minstart || (fr == minstart && e > minend)
minstart, minend, nmin = fr, e, n
end
else # not found
rangecache[n] = endof(s)+1:0
end
end
return minstart:minend, nmin
end
# cover the multiple pairs case
function replace(str::String, pat_repls_::Pair...; count::Integer=typemax(Int))
if count == 0 || isempty(pat_repls_)
return str
end
count < 0 && throw(DomainError(count, "`count` must be non-negative."))
pat_repls = predicatepair.(pat_repls_)
n = 1
e = endof(str)
i = a = start(str)
rangecache = [0:0 for i=1:length(pat_repls)]
r, npat = findnextall!(pat_repls, str, i, rangecache)
j, k = first(r), last(r)
out = IOBuffer(StringVector(floor(Int, 1.2sizeof(str))), true, true)
out.size = 0
out.ptr = 1
while npat > 0 && n ≤ count
if i == a || i <= k
unsafe_write(out, pointer(str, i), UInt(j-i))
pattern, repl = pat_repls[npat]
_replace(out, repl, str, r, pattern)
end
if k < j
i = j
j > e && break
k = nextind(str, j)
else
i = k = nextind(str, k)
end
r, npat = findnextall!(pat_repls, str, k, rangecache)
j, k = first(r), last(r)
n += 1
end
write(out, SubString(str, i))
String(take!(out))
end As I said in my first comment above, I think it will be quite tricky to make this kind of "list of functions" approach fast, and will be especially hard to beat regex search (except for special cases like And yes, the regex approach doesn't support user-defined character predicates, but we have |
I would just like to interject that the fact that this API affords various clever specializations that can be so much faster than naive repeated replacement indicates (to me at least), that this is a good API which allows the user to easily express their intention in a simple way which the implementation can nevertheless make fast in various clever ways. |
@StefanKarpinski, I was actually coming to the opposite conclusion. Take the d = Dict("alpha"=>"α", "beta"=>"β", "gamma"=>"γ")
r = r"alpha|beta|gamma" # or Regex(join(keys(d), '|'))
replace(s, r => (s -> d[s])) or we could even support As another example, take @KlausC's example, buf = IOBuffer()
for c in s
if c == 'c'
print(buf, "BOOM")
elseif !isupper(c)
print(buf, c)
end
end
String(take!(buf)) it is 30x faster on my machine for Basically, if you are doing multiple replacements, there is almost always going to be a vastly faster way to do it than any generic multi-arg By providing a multi-arg string |
@stevengj I don't understand your position. Nothing prevents us from writing efficient methods for cases where the pair's LHS are only chars or predicates, of for when they are only strings. If we are really concerned about trapping users with slow methods, we could even avoid supporting complex cases for which we have no efficient solution at the moment (notably, those mixing all three types of LHS). |
Agree. We can always turn a replacement of multiple string keys into a regex with a dictionary lookup in the implementation. We could even have a |
@nalimilan, the current API makes it almost impossible to write efficient code for e.g. the In contrast, giving the user an example of how to use It's true that we could have a Also, a |
My position is that it is counterproductive to put an API into base with the rationale, "Use this API if you care about performance, instead of calling |
OK, I see your point about the inefficiency of rebuilding a regex and a dictionary on every call. But that doesn't apply to situations where the LHS is a |
@nalimilan, I'm not so opposed to a multi-arg It might be confusing to support multi-arg
To do that, you might have to precompute a search tree (to efficiently handle multiple strings with common prefixes), similar to PCRE, so you are still hurt by the impossibility of separate precomputation with the current API. |
Even for single-character replacement, it's pretty hard to write an efficient implementation using the current API, because the current API requires that a substring be passed to the replacement function, rather than a |
No idea, but supporting generic APIs sounds like a good idea in general. We've had at least one complaint about this already: https://discourse.julialang.org/t/inconsistency-between-replace-on-an-abstractstring-vs-vector-char/8206 (though note the example should be
Mmm, indeed. Looks like we should change this API before 0.7 as it's indeed inefficient and goes against the other methods for general iterables. It's a bit annoying, but we should probably pass a |
I added the following changes due to the discussion between @stevengj and @nalimilan:
|
@stevengj I wrote above that this was indeed a problem, but now I can't find the |
In v0.6 we had:
in v0.7 we have:
Implementation is here:
So if repl is a |
Ah, of course, I was looking for something like
Yes, though to ensure type stability the type of the argument should not depend on the length of the match, but just on the type of the LHS: if it's a |
I willl do. I try to add method:
After the transformation of input patterns, that will cover all single-character types, leaving |
This has been attempted before, sometimes fairly similar to this, but the attempts seemed to be either too simple or too complicated. This aims to be simple, and even beats one of the "handwritten" benchmark cases. Past issues (e.g. JuliaLang#25396) have proposed that using Regex may be faster, but in my tests, this handily bests even simplified regexes. There can be slow Regexes patterns that can cause this to exhibit O(n^2) behavior, but only if the one of the earlier patterns is a partial match for a later pattern Regex and that Regex always matches O(n) of the input stream. This is a case that is hopefully usually avoidable in practice. fixes JuliaLang#35327 fixes JuliaLang#39061 fixes JuliaLang#35414 fixes JuliaLang#29849 fixes JuliaLang#30457 fixes JuliaLang#25396
This has been attempted before, sometimes fairly similar to this, but the attempts seemed to be either too simple or too complicated. This aims to be simple, and even beats one of the "handwritten" benchmark cases. Past issues (e.g. #25396) have proposed that using Regex may be faster, but in my tests, this handily bests even simplified regexes. There can be slow Regexes patterns that can cause this to exhibit O(n^2) behavior, but only if the one of the earlier patterns is a partial match for a later pattern Regex and that Regex always matches O(n) of the input stream. This is a case that is hopefully usually avoidable in practice. fixes #35327 fixes #39061 fixes #35414 fixes #29849 fixes #30457 fixes #25396
This has been attempted before, sometimes fairly similar to this, but the attempts seemed to be either too simple or too complicated. This aims to be simple, and even beats one of the "handwritten" benchmark cases. Past issues (e.g. #25396) have proposed that using Regex may be faster, but in my tests, this handily bests even simplified regexes. There can be slow Regexes patterns that can cause this to exhibit O(n^2) behavior, but only if the one of the earlier patterns is a partial match for a later pattern Regex and that Regex always matches O(n) of the input stream. This is a case that is hopefully usually avoidable in practice. fixes #35327 fixes #39061 fixes #35414 fixes #29849 fixes #30457 fixes #25396
This has been attempted before, sometimes fairly similar to this, but the attempts seemed to be either too simple or too complicated. This aims to be simple, and even beats one of the "handwritten" benchmark cases. Past issues (e.g. JuliaLang#25396) have proposed that using Regex may be faster, but in my tests, this handily bests even simplified regexes. There can be slow Regexes patterns that can cause this to exhibit O(n^2) behavior, but only if the one of the earlier patterns is a partial match for a later pattern Regex and that Regex always matches O(n) of the input stream. This is a case that is hopefully usually avoidable in practice. fixes JuliaLang#35327 fixes JuliaLang#39061 fixes JuliaLang#35414 fixes JuliaLang#29849 fixes JuliaLang#30457 fixes JuliaLang#25396
This has been attempted before, sometimes fairly similar to this, but the attempts seemed to be either too simple or too complicated. This aims to be simple, and even beats one of the "handwritten" benchmark cases. Past issues (e.g. JuliaLang#25396) have proposed that using Regex may be faster, but in my tests, this handily bests even simplified regexes. There can be slow Regexes patterns that can cause this to exhibit O(n^2) behavior, but only if the one of the earlier patterns is a partial match for a later pattern Regex and that Regex always matches O(n) of the input stream. This is a case that is hopefully usually avoidable in practice. fixes JuliaLang#35327 fixes JuliaLang#39061 fixes JuliaLang#35414 fixes JuliaLang#29849 fixes JuliaLang#30457 fixes JuliaLang#25396
#25383