Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace with multiple pattern-repl pairs #25396

Closed
wants to merge 9 commits into from
Closed

replace with multiple pattern-repl pairs #25396

wants to merge 9 commits into from

Conversation

KlausC
Copy link
Contributor

@KlausC KlausC commented Jan 4, 2018

@KlausC
Copy link
Contributor Author

KlausC commented Jan 4, 2018

A conflicting change, which introduced findnext replacing searchnext caused me some extra work. As for me its all done. Waiting for reviews.

@KlausC
Copy link
Contributor Author

KlausC commented Jan 5, 2018

@StefanKarpinski is there a chance to get that reviewed.

@StefanKarpinski
Copy link
Member

Not by me, I'm afraid. Maybe @rfourquet or @nalimilan?

@rfourquet
Copy link
Member

Many thanks for your contribution! I'm not a string guy, and this PR looks rather technically involved, so I'm afraid I won't be able to review it any time soon. I have now very limited free time, which I need to dedicate on things to be done for 0.7. But if no-one can review this in time for 0.7, I'm confident that it will get incorporated for 1.1.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Looks mostly good, the main point to discuss is the behavior when several pairs match.

return str
end
count < 0 && throw(DomainError(count, "`count` must be non-negative."))
pat_repls = pairmap.(pat_repls)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this type-stable?

Copy link
Contributor Author

@KlausC KlausC Jan 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I rely on the broadcasting dot.
Looks so, for me. Input is a tuple of Pair, as output is:

julia> pp(p::Pair{Char}) = equalto(p.first)=>p.second
pp (generic function with 3 methods)

julia> pp(p::Pair) = p.first=>p.second
pp (generic function with 3 methods)

julia> pp.( ('a'=>'b', "xyz"=>"") )
(Base.EqualTo{Char}('a') => 'b', "xyz" => "")

julia> typeof(pp.( ('a'=>'b', "xyz"=>"") ))
Tuple{Pair{Base.EqualTo{Char},Char},Pair{String,String}}

julia> pp.( ('a'=>'b', "xyz"=>"") ) isa NTuple{N,Pair} where N
true

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should work, but run @code_warntype on the replace method to make sure that's the case.

Copy link
Contributor Author

@KlausC KlausC Jan 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nalimilan , you wanted to see the complete output.

julia> using Unicode

julia> replace("abcde", 'a'=>'Ä', ['b', 'c']=>uppercase)
"ÄBCde"

julia> @code_warntype replace("abcde", 'a'=>'Ä', ['b', 'c']=>uppercase)
Variables:
  str::String
  pat_repls@_3::Tuple{Pair{Char,Char},Pair{Array{Char,1},typeof(uppercase)}}
  count<optimized out>
  n::Int64
  e<optimized out>
  a<optimized out>
  i::Any
  r::Any
  pattern::Any
  repl::Any
  j::Any
  k::Any
  out<optimized out>
  #temp#@_23::Any
  px<optimized out>
  py<optimized out>
  R<optimized out>
  #temp#@_28::Bool
  pat_repls@_31::Tuple{Pair{Char,Char},Pair{Array{Char,1},typeof(uppercase)}}
  pat_repls@_32::Tuple{Pair{Base.EqualTo{Char},Char},Pair{Base.OccursIn{Array{Char,1}},typeof(uppercase)}}

Body:
  begin
      pat_repls@_31::Tuple{Pair{Char,Char},Pair{Array{Char,1},typeof(uppercase)}} = pat_repls@_3::Tuple{Pair{Char,Char},Pair{Array{Char,1},typeof(uppercase)}}
      #= line 429 =#
      goto 4
      4: 
      5: 
      goto 8
      #= line 430 =#
      8: 
      #= line 432 =#
      goto 11
      11: 
      #= line 433 =#
      # meta: location broadcast.jl broadcast 241
      # meta: location tuple.jl map 151
      # meta: location tuple.jl getindex 21
      SSAValue(62) = (Base.getfield)(pat_repls@_31::Tuple{Pair{Char,Char},Pair{Array{Char,1},typeof(uppercase)}}, 1, true)::Pair{Char,Char}
      # meta: pop location
      # meta: location strings/util.jl predicatepair 392
      # meta: location sysimg.jl getproperty 8
      SSAValue(67) = (Base.getfield)(SSAValue(62), :first)::Char
      # meta: pop location
      # meta: location operators.jl Type 831
      SSAValue(70) = $(Expr(:new, Base.EqualTo{Char}, SSAValue(67)))
      # meta: pop location
      # meta: location sysimg.jl getproperty 8
      SSAValue(71) = (Base.getfield)(SSAValue(62), :second)::Char
      # meta: pop location
      # meta: location pair.jl Type 32
      # meta: location pair.jl Type 32
      SSAValue(77) = $(Expr(:new, Pair{Base.EqualTo{Char},Char}, SSAValue(70), SSAValue(71)))
      # meta: pop locations (3)
      # meta: location tuple.jl getindex 21
      SSAValue(78) = (Base.getfield)(pat_repls@_31::Tuple{Pair{Char,Char},Pair{Array{Char,1},typeof(uppercase)}}, 2, true)::Pair{Array{Char,1},typeof(uppercase)}
      # meta: pop location
      # meta: location strings/util.jl predicatepair 396
      # meta: location sysimg.jl getproperty 8
      SSAValue(83) = (Base.getfield)(SSAValue(78), :first)::Array{Char,1}
      # meta: pop location
      # meta: location operators.jl Type 850
      SSAValue(86) = $(Expr(:new, Base.OccursIn{Array{Char,1}}, SSAValue(83)))
      # meta: pop location
      # meta: location pair.jl Type 32
      # meta: location pair.jl Type 32
      SSAValue(92) = $(Expr(:new, Pair{Base.OccursIn{Array{Char,1}},typeof(uppercase)}, SSAValue(86), Base.Unicode.uppercase))
      # meta: pop locations (3)
      SSAValue(61) = (Core.tuple)(SSAValue(77), SSAValue(92))::Tuple{Pair{Base.EqualTo{Char},Char},Pair{Base.OccursIn{Array{Char,1}},typeof(uppercase)}}
      # meta: pop locations (2)
      pat_repls@_32::Tuple{Pair{Base.EqualTo{Char},Char},Pair{Base.OccursIn{Array{Char,1}},typeof(uppercase)}} = SSAValue(61)
      #= line 434 =#
      n::Int64 = 1
      #= line 435 =#
      # meta: location strings/basic.jl endof 145
      # meta: location strings/string.jl ncodeunits 74
      SSAValue(97) = (Core.sizeof)(str::String)::Int64
      # meta: pop location
      SSAValue(95) = $(Expr(:invoke, MethodInstance for thisind(::String, ::Int64), :(Base.thisind), :(str), SSAValue(97)))::Int64
      # meta: pop location
      #= line 436 =#
      i::Any = 1
      #= line 437 =#
      SSAValue(5) = $(Expr(:invoke, MethodInstance for findnextall(::Tuple{Pair{Base.EqualTo{Char},Char},Pair{Base.OccursIn{Array{Char,1}},typeof(uppercase)}}, ::String, ::Int64), :(Base.findnextall), :(pat_repls@_32), :(str), :(i::Int64)))::Tuple{Any,Pair}
      # meta: location tuple.jl indexed_next 58
      # meta: location tuple.jl getindex 21
      SSAValue(101) = (Base.getfield)(SSAValue(5), 1, true)::Any
      # meta: pop locations (2)
      r::Any = SSAValue(101)
      # meta: location tuple.jl indexed_next 58
      # meta: location tuple.jl getindex 21
      SSAValue(105) = (Base.getfield)(SSAValue(5), 2, true)::Pair
      # meta: pop locations (2)
      # meta: location pair.jl indexed_next 42
      SSAValue(106) = (Base.getfield)(SSAValue(105), 1)::Any
      # meta: pop location
      pattern::Any = SSAValue(106)
      # meta: location pair.jl indexed_next 42
      SSAValue(109) = (Base.getfield)(SSAValue(105), 2)::Any
      # meta: pop location
      repl::Any = SSAValue(109)
      #= line 438 =#
      SSAValue(11) = (Base.first)(r::Any)::Any
      SSAValue(12) = (Base.last)(r::Any)::Any
      j::Any = SSAValue(11)
      k::Any = SSAValue(12)
      #= line 439 =#
      # meta: location strings/basic.jl sizeof 144
      # meta: location strings/string.jl ncodeunits 74
      SSAValue(117) = (Core.sizeof)(str::String)::Int64
      # meta: pop location
      # meta: location int.jl * 54
      SSAValue(118) = (Base.mul_int)(SSAValue(117), 1)::Int64
      # meta: pop locations (2)
      # meta: location promotion.jl * 291
      # meta: location promotion.jl promote 261
      # meta: location promotion.jl _promote 219
      # meta: location number.jl convert 7
      # meta: location float.jl Type 57
      SSAValue(131) = (Base.sitofp)(Float64, SSAValue(118))::Float64
      # meta: pop locations (4)
      # meta: location float.jl * 397
      SSAValue(142) = (Base.mul_float)(1.2, SSAValue(131))::Float64
      # meta: pop locations (2)
      # meta: location float.jl floor 351
      # meta: location float.jl floor 362
      SSAValue(145) = (Base.floor_llvm)(SSAValue(142))::Float64
      # meta: pop location
      # meta: location float.jl trunc 679
      # meta: location float.jl <= 452
      SSAValue(150) = (Base.le_float)(-9.223372036854776e18, SSAValue(145))::Bool
      # meta: pop location
      unless SSAValue(150) goto 116
      # meta: location float.jl < 450
      SSAValue(151) = (Base.lt_float)(SSAValue(145), 9.223372036854776e18)::Bool
      # meta: pop location
      #temp#@_28::Bool = SSAValue(151)
      goto 118
      116: 
      #temp#@_28::Bool = false
      118: 
      unless #temp#@_28::Bool goto 125
      #= line 680 =#
      # meta: location float.jl unsafe_trunc 298
      SSAValue(152) = (Base.fptosi)(Int64, SSAValue(145))::Int64
      # meta: pop location
      goto 129
      125: 
      #= line 682 =#
      SSAValue(148) = $(Expr(:invoke, MethodInstance for InexactError(::Symbol, ::Any, ::Any), :(Base.InexactError), :(:trunc), Int64, SSAValue(145)))::InexactError
      (Base.throw)(SSAValue(148))::Union{}
      129: 
      # meta: pop locations (2)
      # meta: location iobuffer.jl StringVector 30
      # meta: location strings/string.jl _string_n 52
      # meta: location essentials.jl cconvert 327
      # meta: location number.jl convert 7
      # meta: location boot.jl Type 681
      # meta: location boot.jl toUInt64 651
      # meta: location boot.jl check_top_bit 540
      # meta: location boot.jl is_top_bit_set 530
      SSAValue(170) = (Core.lshr_int)(SSAValue(152), 63)::Int64
      SSAValue(171) = (Core.trunc_int)(Core.Int8, SSAValue(170))::Int8
      SSAValue(173) = (Core.eq_int)(SSAValue(171), 1)::Bool
      # meta: pop location
      unless SSAValue(173) goto 145
      $(Expr(:invoke, MethodInstance for throw_inexacterror(::Symbol, ::Type, ::Int64), :(Core.throw_inexacterror), :(:check_top_bit), Int64, SSAValue(152)))::Union{}
      145: 
      # meta: pop location
      SSAValue(164) = (Core.bitcast)(Core.UInt64, SSAValue(152))::UInt64
      # meta: pop locations (4)
      SSAValue(158) = $(Expr(:foreigncall, :(:jl_alloc_string), Ref{String}, svec(UInt64), :(:ccall), 1, SSAValue(164), SSAValue(164)))
      # meta: pop location
      # meta: location strings/string.jl unsafe_wrap 63
      SSAValue(174) = $(Expr(:foreigncall, :(:jl_string_to_array), Ref{Array{UInt8,1}}, svec(Any), :(:ccall), 1, SSAValue(158)))
      # meta: pop locations (2)
      # meta: location iobuffer.jl Type 63
      # meta: location iobuffer.jl Type 63
      # meta: location iobuffer.jl Type 26
      # meta: location iobuffer.jl Type 19
      # meta: location array.jl length 138
      SSAValue(192) = (Base.arraylen)(SSAValue(174))::Int64
      # meta: pop location
      SSAValue(191) = $(Expr(:new, Base.GenericIOBuffer{Array{UInt8,1}}, SSAValue(174), true, true, true, false, SSAValue(192), 9223372036854775807, 1, -1))
      # meta: pop locations (4)
      #= line 440 =#
      goto 165
      165: 
      166: 
      # meta: location sysimg.jl setproperty! 9
      (Base.setfield!)(SSAValue(191), :size, 0)::Int64
      # meta: pop location
      #= line 441 =#
      goto 172
      172: 
      173: 
      # meta: location sysimg.jl setproperty! 9
      (Base.setfield!)(SSAValue(191), :ptr, 1)::Int64
      # meta: pop location
      #= line 442 =#
      178: 
      SSAValue(27) = (j::Any != 0)::Any
      unless SSAValue(27) goto 264
      #= line 443 =#
      SSAValue(28) = (i::Any == 1)::Any
      unless SSAValue(28) goto 186
      #temp#@_23::Any = SSAValue(28)
      goto 188
      186: 
      #temp#@_23::Any = (i::Any <= k::Any)::Any
      188: 
      unless #temp#@_23::Any goto 197
      #= line 444 =#
      SSAValue(30) = (Base.pointer)(str::String, i::Any)::Ptr{UInt8}
      SSAValue(31) = (j::Any - i::Any)::Any
      SSAValue(32) = (Base.UInt)(SSAValue(31))::Any
      (Base.unsafe_write)(SSAValue(191), SSAValue(30), SSAValue(32))::Int64
      #= line 445 =#
      (Base._replace)(SSAValue(191), repl::Any, str::String, r::Any, pattern::Any)::Any
      197: 
      #= line 447 =#
      SSAValue(33) = (k::Any < j::Any)::Any
      unless SSAValue(33) goto 211
      #= line 448 =#
      i::Any = j::Any
      #= line 449 =#
      SSAValue(34) = (j::Any > SSAValue(95))::Any
      unless SSAValue(34) goto 207
      goto 264
      207: 
      #= line 450 =#
      k::Any = (Base.nextind)(str::String, j::Any)::Int64
      goto 216
      211: 
      #= line 452 =#
      SSAValue(35) = (Base.nextind)(str::String, k::Any)::Int64
      k::Any = SSAValue(35)
      i::Any = SSAValue(35)
      216: 
      #= line 454 =#
      SSAValue(36) = $(Expr(:invoke, MethodInstance for findnextall(::Tuple{Pair{Base.EqualTo{Char},Char},Pair{Base.OccursIn{Array{Char,1}},typeof(uppercase)}}, ::String, ::Int64), :(Base.findnextall), :(pat_repls@_32), :(str), :(k::Int64)))::Tuple{Any,Pair}
      # meta: location tuple.jl indexed_next 58
      # meta: location tuple.jl getindex 21
      SSAValue(204) = (Base.getfield)(SSAValue(36), 1, true)::Any
      # meta: pop locations (2)
      r::Any = SSAValue(204)
      # meta: location tuple.jl indexed_next 58
      # meta: location tuple.jl getindex 21
      SSAValue(208) = (Base.getfield)(SSAValue(36), 2, true)::Pair
      # meta: pop locations (2)
      # meta: location pair.jl indexed_next 42
      SSAValue(209) = (Base.getfield)(SSAValue(208), 1)::Any
      # meta: pop location
      pattern::Any = SSAValue(209)
      # meta: location pair.jl indexed_next 42
      SSAValue(212) = (Base.getfield)(SSAValue(208), 2)::Any
      # meta: pop location
      repl::Any = SSAValue(212)
      #= line 455 =#
      SSAValue(42) = r::Any
      # meta: location range.jl colon 5
      # meta: location range.jl Type 153
      SSAValue(221) = $(Expr(:new, UnitRange{Int64}, 0, -1))
      # meta: pop locations (2)
      SSAValue(44) = (SSAValue(42) == SSAValue(221))::Any
      unless SSAValue(44) goto 245
      goto 251
      245: 
      # meta: location promotion.jl == 386
      SSAValue(222) = (n::Int64 === 9223372036854775807)::Bool
      # meta: pop location
      unless SSAValue(222) goto 251
      goto 264
      251: 
      #= line 456 =#
      SSAValue(46) = (Base.first)(r::Any)::Any
      SSAValue(47) = (Base.last)(r::Any)::Any
      j::Any = SSAValue(46)
      k::Any = SSAValue(47)
      #= line 457 =#
      # meta: location int.jl + 53
      SSAValue(223) = (Base.add_int)(n::Int64, 1)::Int64
      # meta: pop location
      n::Int64 = SSAValue(223)
      262: 
      goto 178
      264: 
      #= line 459 =#
      SSAValue(49) = (Base.SubString)(str::String, i::Any)::SubString{String}
      # meta: location strings/io.jl write 139
      # meta: location sysimg.jl getproperty 8
      SSAValue(233) = (Base.getfield)(SSAValue(49), :ncodeunits)::Int64
      # meta: pop location
      # meta: location int.jl <= 405
      SSAValue(234) = (Base.sle_int)(SSAValue(233), 0)::Bool
      # meta: pop location
      unless SSAValue(234) goto 276
      goto 342
      276: 
      # meta: location sysimg.jl getproperty 8
      SSAValue(235) = (Base.getfield)(SSAValue(49), :string)::String
      # meta: pop location
      # meta: location sysimg.jl getproperty 8
      SSAValue(236) = (Base.getfield)(SSAValue(49), :offset)::Int64
      # meta: pop location
      # meta: location int.jl + 53
      SSAValue(237) = (Base.add_int)(SSAValue(236), 1)::Int64
      # meta: pop location
      # meta: location strings/string.jl pointer 72
      # meta: location strings/string.jl pointer 71
      # meta: location pointer.jl unsafe_convert 59
      # meta: location pointer.jl pointer_from_objref 142
      goto 291
      291: 
      #= line 143 =#
      SSAValue(250) = $(Expr(:foreigncall, :(:jl_value_ptr), Ptr{Nothing}, svec(Any), :(:ccall), 1, SSAValue(235)))
      # meta: pop location
      # meta: location pointer.jl + 154
      # meta: location boot.jl Type 691
      SSAValue(257) = (Core.bitcast)(Core.UInt, SSAValue(250))::UInt64
      # meta: pop location
      SSAValue(255) = (add_ptr)(SSAValue(257), 0x0000000000000008)::UInt64
      # meta: location essentials.jl oftype 301
      # meta: location pointer.jl convert 26
      # meta: location boot.jl Type 696
      SSAValue(263) = (Core.bitcast)(Ptr{Nothing}, SSAValue(255))::Ptr{Nothing}
      # meta: pop locations (4)
      # meta: location pointer.jl convert 30
      SSAValue(265) = (Base.bitcast)(Ptr{UInt8}, SSAValue(263))
      # meta: pop locations (3)
      # meta: location int.jl - 52
      SSAValue(266) = (Base.sub_int)(SSAValue(237), 1)::Int64
      # meta: pop location
      # meta: location pointer.jl + 154
      # meta: location boot.jl Type 691
      SSAValue(273) = (Core.bitcast)(Core.UInt, SSAValue(265))::UInt64
      # meta: pop location
      # meta: location int.jl rem 445
      SSAValue(274) = (Base.bitcast)(UInt64, SSAValue(266))
      # meta: pop location
      SSAValue(271) = (add_ptr)(SSAValue(273), SSAValue(274))::UInt64
      # meta: location essentials.jl oftype 301
      # meta: location pointer.jl convert 26
      # meta: location boot.jl Type 696
      SSAValue(280) = (Core.bitcast)(Ptr{UInt8}, SSAValue(271))::Ptr{UInt8}
      # meta: pop locations (5)
      # meta: location sysimg.jl getproperty 8
      SSAValue(281) = (Base.getfield)(SSAValue(49), :ncodeunits)::Int64
      # meta: pop location
      # meta: location boot.jl Type 681
      # meta: location boot.jl toUInt64 651
      # meta: location boot.jl check_top_bit 540
      # meta: location boot.jl is_top_bit_set 530
      SSAValue(291) = (Core.lshr_int)(SSAValue(281), 63)::Int64
      SSAValue(292) = (Core.trunc_int)(Core.Int8, SSAValue(291))::Int8
      SSAValue(294) = (Core.eq_int)(SSAValue(292), 1)::Bool
      # meta: pop location
      unless SSAValue(294) goto 337
      $(Expr(:invoke, MethodInstance for throw_inexacterror(::Symbol, ::Type, ::Int64), :(Core.throw_inexacterror), :(:check_top_bit), Int64, SSAValue(281)))::Union{}
      337: 
      # meta: pop location
      SSAValue(285) = (Core.bitcast)(Core.UInt64, SSAValue(281))::UInt64
      # meta: pop locations (2)
      $(Expr(:invoke, MethodInstance for unsafe_write(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Ptr{UInt8}, ::UInt64), :(Base.unsafe_write), SSAValue(191), SSAValue(280), SSAValue(285)))::Int64
      342: 
      # meta: pop location
      #= line 460 =#
      SSAValue(50) = $(Expr(:invoke, MethodInstance for take!(::Base.GenericIOBuffer{Array{UInt8,1}}), :(Base.take!), SSAValue(191)))::Array{UInt8,1}
      # meta: location strings/string.jl Type 30
      SSAValue(295) = $(Expr(:foreigncall, :(:jl_array_to_string), Ref{String}, svec(Any), :(:ccall), 1, SSAValue(50)))
      # meta: pop location
      return SSAValue(295)
  end::String

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, damn keyword arguments, they hide the body of the function. I guess the simplest approach is to remove the count argument for the check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the output of @code_warntype after removing the keyword argument.

There are test cases missing, which use this argument :_)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, looks good.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, the output looks incomplete. Some variables do not appear to be inferred as concrete types, but they don't appear in the body.

i = a = start(str)
r, (pattern, repl) = findnextall(pat_repls, str, i)
j, k = first(r), last(r)
out = IOBuffer(StringVector(floor(Int, 1.2sizeof(str))), true, true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment to explain the choice of this constant? Why not start with the size of the input string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this code is inherited. I can only speculate. Maybe the author made extensive probability studies?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, OK. Let's leave it as-is then, unless somebody can explain this choice.

@@ -427,16 +497,19 @@ function replace(str::String, pat_repl::Pair; count::Integer=typemax(Int))
end

"""
replace(s::AbstractString, pat=>r; [count::Integer])
replace(s::AbstractString, pat=>r[, pat=>r ...]; [count::Integer])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just add ... after the first pat => r, no need to repeat it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

# find the first of a list of patterns.
# if several match at the same start position prefer the longer one
# if also the length is equal, prefer the first in list
# return value of the successful search and the pair
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"value" -> "index range"

Also, better not add a blank line after the comment so that it's obvious it's attached to the function. Same below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

# return value of the successful search and the pair

function findnextall(pairs::NTuple{N,Pair}, s::AbstractString, st::Integer=1) where N
ps = start(pairs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For tuples start is always 1. You don't need to call next either, you can just do for i in 1:N.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first I will include the as an optimization.
I want to continue with while/done/next because it looks more elegant and performant to me than

N = length(pairs)
for i = 1:N
    p = pairs[i]
    ....
    if fr > 0
        ...
        for j = i+1:N
            p = pairs[j]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More performant: definitely not. More elegant: can always be discussed, but it's not obvious. Note that N is already defined as a type parameter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, you convinced me. I copied my pattern from somewhere in base or stdlib (don't remember where exactly), and thought performance was the driver.
Thanks for the hint regarding N.
Is it better to introduce a second variable j for the inner loop or can I re-use the i ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say it's clearer to use j.

Using next/done is needed to support generic iterators, so that might be the reason why it was used.

pairmap(p::Pair) = p

# find the first of a list of patterns.
# if several match at the same start position prefer the longer one
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to just retain the first matching pair? At least that sounds simpler to explain. Or is that to match the behavior of other implementations?

Copy link
Contributor Author

@KlausC KlausC Jan 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, this is special for strings. In case of ties, I do not simply take the first match, but give precedence to the longer match, because it is more specific. Maybe the example demonstrates its usefulness:

replace("one house, two houses", "house"=>"Haus", "houses"=>"Häuser", "one"=>"ein", "two"=>"zwei")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But can't the user just choose the appropriate order for the pairs? That won't work for regular expressions (since the length of the match may not be known in advance), but people would better use a single complex regular expression instead I think.

Another potential issue with not simply following the order in which pairs are provided is performance. One of the motivations for allowing for multiple pairs was to be more efficient than calling replace multiple times with different pairs. Doesn't having to match all pairs and compare them make the function slower? Or is that cost only paid when patterns overlap?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cost may increase only, if shorter patterns, which match are followed by longer patterns, which match at the same position. In other cases, I can't know in advance, it a later pattern does not match at an earlier position.
BTW I would prefer to use a modified version of findnext, which has an extra parameter does_not_start_after, which would allow not to search the full string for all patterns, but stop earlier for the later patterns, if there was a match before.

Copy link
Contributor Author

@KlausC KlausC Jan 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docu text for replace has been updated. That should make the rule clear to the user of the new feature:

510 In the case of multiple patterns, the precedence is by low start position of matched pattern,
511 then longer pattern, then order of the input list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about the additional optional argument for findnext(,::AbstractString,...) ?
I think the does_not_start_after could reduce search-times in a few cases. It would have another effect, than simply replacing the input string by a shortened substring.
The contract of this modified findnext(pattern, input, start, not_after would sound like:
"Give me the position of the next pattern match after start, but don't try input beyond not_after "

count::Integer=typemax(Int)) =
replace(str, occursin(first(pat_repl)) => last(pat_repl), count)
# replace Char with function that compares with this Char
pairmap(p::Pair{Char}) = equalto(p.first) => p.second
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pairmap sounds too general. Maybe predicatepair or any other term which doesn't risk being used elsewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed ot predicatepair.

minstart, minend, minp = fr, last(r), p
while !done(pairs, ps)
p, ps = next(pairs, ps)
r = findnext(p.first, s, st)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to say that I'm skeptical of the whole approach in this PR. First converting the user inputs to a tuple of Pair{Function} and then calling findnext on the function seems almost guaranteed to be inefficient, in several ways.

For one thing, storing a bunch of functions in a data structure and then calling findnext on them probably makes it impossible for the compiler to specialize findnext by inlining the functions.

For another thing, this means that the search procedure is completely re-done for each pair, when I think that a lot of computations could probably be shared.

Copy link
Contributor Author

@KlausC KlausC Jan 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, I should disagree :-).
Most of your objections came also to my mind, but I could handle it.

  1. the code was copied from the original implementation of replace, which has not been removed, so there is no worse performance than before for the case with one pair.
  2. the use of findnext had been inserted to the code base shortly before this PR. It replaced searchnext, which could handle Char and collections of Char as patterns, which findmax does not do. So for those cases, the equalto and occursin need used. I think, the authors of findmax proved, that performance is not worse than for searchnext. I could measure a relevant improvement.
  3. Replacing Char by equalsto(::Char) is only done once for each call to replace, while in the old approach with searchnext the specialization had to be done for each call of searchnext, which has to be done up to length(s) times for each call to replace.
  4. my function findnextall is a slight generalization of findmax, which is thought to make simultaneous pattern searches. Given the complexity of findnext it was thought better to re-use that than to re-implement the multiple find next from scratch. The rest of the code of replace was re-used, replacing findnextwith findnextall.
  5. I see an improvement possibility, by extending the findnext function by a new optional argument, which can prevent to search the full input string, when already a bound for the starting point of the match is known. That could decrease search times by up to 50%.
  6. I do believe, the effort used by this implementation is appropriate for the complexity of the problem, which is solved. I am waiting for a proof of the contrary (ask Prof. Knuth).

@stevengj
Copy link
Member

stevengj commented Jan 7, 2018

For the case where the replacement patterns are a bunch of strings or sets of characters, can't we just create a single regular expression and a dictionary of replacements and then call the regex replace function? For example, for the special case of replacing a bunch of strings with other strings,

function myreplace(s::AbstractString, p::Pair{<:AbstractString,<:AbstractString}...)
    d = Dict(p...)
    buf = IOBuffer()
    frst = true
    for pair in p
        frst || print(buf, '|')
        frst = false
        escape_string(buf, first(pair), "[\\^\$.|?*+()")
    end
    r = Regex(String(take!(buf)))
    replace(s, r => (s -> d[s]))
end

This allows us to exploit the efficiency of regex search for multiple patterns. This code could easily be extended to support characters and sets of characters as well as replacement functions.

Extending it to sets of regexes would be trickier (e.g. you can't just do a dictionary lookup to get the replacement), but honestly I'm not sure that case is worth implementing. If the user is constructing a bunch of regexes, it seems reasonable to ask them to combine it into a single regex themselves. Still, it's not too bad if you are willing to do a linear search on the replacements instead of (or in addition to) a dictionary lookup. I think it should be sufficient to put parens around each regex (when they are concatenated with |) to keep different regexes from interfering. (Correction: parens aren't sufficient for regexes with backreferences to captured expressions, unless we also "hygenicize" the regexes by renaming their captures with unique names; ugh.)

@stevengj
Copy link
Member

stevengj commented Jan 7, 2018

Copying my comment from above, since it is hidden (attached to an outdated commit):

I have to say that I'm skeptical of the whole approach in this PR. First converting the user inputs to a tuple of Pair{Function} and then calling findnext on each function seems almost guaranteed to be inefficient, in several ways. For one thing, storing a bunch of functions in a data structure and then calling findnext on them probably makes it impossible for the compiler to specialize findnext by inlining the functions. For another thing, this means that the search procedure is completely re-done for each pair, when I think that a lot of computations could probably be shared.

@KlausC
Copy link
Contributor Author

KlausC commented Jan 7, 2018

@stevengj , I am just exhausted from defending my own approach - please read that.
For your myreplace I consider that a special solution for in the case all patterns are strings, probably extensible to Char, collections of Char, and maybe regexes. I must admit, I did not think about the use of regular expressions for the general case.
I think I will implement along your approach and see what is the performance impact.

@nalimilan
Copy link
Member

I guess only benchmarking can tell, but it's not obvious to me that creating a regexp will always be more efficient, as they are significantly more complex than simple replacements of chars or strings. Maybe we could use it as a fallback for complex cases, but use simpler method for special cases.

For one thing, storing a bunch of functions in a data structure and then calling findnext on them probably makes it impossible for the compiler to specialize findnext by inlining the functions.

Except if the data structure is a tuple. In some cases at least, the compiler is able to unroll a loop over a tuple and the functions could be inlined. Of course this has to be checked.

For another thing, this means that the search procedure is completely re-done for each pair, when I think that a lot of computations could probably be shared.

For the simple case of matching a Char, the operation can be made very efficient by checking for equality for all pairs at the same time (just like the replace method on arrays does). For other cases it's harder, but maybe doable. We can always fall back to using regexes, and add optimized methods later.

@KlausC
Copy link
Contributor Author

KlausC commented Jan 7, 2018

For this case I do not see a replacement using Regex.

julia> replace("aRcD", very_complex_selection_algorithm=>"BOOM", isupper=>"")
"aBOOMc"

@stevengj
Copy link
Member

stevengj commented Jan 7, 2018

@KlausC, fair enough, your approach is more general than trying to construct a single regex, which obviously cannot handle arbitrary user-defined character predicates. But I'm still concerned that it is potentially very inefficient. For example, it looks like:

n = 10^6
s = "a"^n * "b"
replace(s, "a"=>"A", "b"=>"c")

will have Θ(n²) complexity in your current approach. The reason is that your implementation, which calls findnextall repeatedly after each match is found (which will happen on every character), will search through the entire remaining string to find the "b" match after every "a" character.

To mitigate this, you would need to implement a more complicated logic: if one of the predicate searches finds a match beyond the end of the current match (or finds no match at all), then this search result can in principle be re-used (with an index offset given by the difference in sizeof between the replacement string and the current match).

@stevengj
Copy link
Member

stevengj commented Jan 7, 2018

Note that (in a quick test in Julia 0.6) the performance of my myreplace version is considerably improved by adding a _replace(io, repl::Dict, str, r, pattern) = print(io, repl[SubString(str, first(r), last(r))]) method, which allows you to directly pass a dictionary d instead of s -> d[s] for the replacement string. I'm not sure why, though; maybe it makes no difference in Julia 0.7 if there is more aggressive inlining? (It could possibly be further improved if there were a specialized method to look up a substring key in a string-based dictionary, without constructing the SubString object first, but since improvements to stack-allocation of temporary immutable objects are in the works, maybe the latter improvement isn't worth it.)

@KlausC
Copy link
Contributor Author

KlausC commented Jan 7, 2018

I see, that your approach is better by factor 128 in time, 9 in memory allocations, 7 in memory size.

julia> n = 10^6
1000000
julia> s = "a"^n * "b";
.....

julia> @time replace(s, "a"=>"A", "b"=>"B");
 19.315202 seconds (9.00 M allocations: 214.755 MiB, 0.17% gc time)
julia> function myreplace(s::AbstractString, p::Pair{<:AbstractString,<:AbstractString}...)
           d = Dict(p...)
           buf = IOBuffer()
           frst = true
           for pair in p
               frst || print(buf, '|')
               frst = false
               escape_string(buf, first(pair), "[\\^\$.|?*+()")
           end
           r = Regex(String(take!(buf)))
           replace(s, r => (s -> d[s]))
       end
myreplace (generic function with 1 method)
julia> @time myreplace(s, "a"=>"A", "b"=>"B");
  0.253329 seconds (1.08 M allocations: 36.384 MiB, 5.24% gc time)
julia> @time myreplace(s, "a"=>"A", "b"=>"B");
  0.154606 seconds (1.00 M allocations: 31.664 MiB, 3.65% gc time)

But, I did not use the improved version of findnext, which will also avoid, that the "b" is searched for in n²/2 cases in vain. That number would reduce to n and make the difference. I will try that as well.

@KlausC
Copy link
Contributor Author

KlausC commented Jan 7, 2018

@stevengj But this modification will not safe me. Just changing "a" and "b" at one place would give same O(n²).
I have an O(n) solution in my mind, but not this evening any more.

@stevengj
Copy link
Member

stevengj commented Jan 7, 2018

@KlausC, I'm not sure why changing "a" to "b" at one place in the string would make my suggested mitigation strategy O(n²). My suggestion would only re-do a "b" search once the replace loop passes the location of the previous "b" search in the string, and similarly for "a", which I think is O(n) regardless of the pattern of "a" and "b" in the string.

@stevengj
Copy link
Member

stevengj commented Jan 8, 2018

For example, isn't this modification of your code O(n)?

import Base: _replace, replace, StringVector

# replace Char with function that compares with this Char
predicatepair(p::Pair{Char}) = equalto(p.first) => p.second

# replace collection of Char with function that checks occurrence in this collection
function predicatepair(p::Pair{<:Union{Tuple{Vararg{Char}},AbstractVector{Char},Set{Char}}})
    occursin(p.first) => p.second
end
predicatepair(p::Pair) = p

# find the first of a list of patterns.
# if several match at the same start position prefer the longer one
# if also the length is equal, prefer the first in list
# return index-range of the successful search and the pair
function findnextall!(pairs::NTuple{N,Pair}, s::AbstractString, st::Integer, rangecache) where N
    minstart=minend=endof(s)+1
    minp=""=>""
    for n = 1:N
        p = pairs[n]
        fr = first(rangecache[n])
        if fr  st
            r = rangecache[n] # re-use previous search result
            last(r) == 0 && continue # not found
        else
            r = rangecache[n] = findnext(p.first, s, st)
            fr = first(r)
        end
        if fr > 0
            e = last(r)
            if fr < minstart || (fr == minstart && e > minend)
                minstart, minend, minp = fr, e, p
            end
        else # not found
            rangecache[n] = endof(s)+1:0
        end
    end
    return minstart:minend, minp
end

# cover the multiple pairs case
function replace(str::String, pat_repls::Pair...; count::Integer=typemax(Int))
    if count == 0 || length(pat_repls) == 0
        return str
    end
    count < 0 && throw(DomainError(count, "`count` must be non-negative."))
    pat_repls = predicatepair.(pat_repls)
    n = 1
    e = endof(str)
    i = a = start(str)
    rangecache = [0:0 for i=1:length(pat_repls)]
    r, (pattern, repl) = findnextall!(pat_repls, str, i, rangecache)
    j, k = first(r), last(r)
    out = IOBuffer(StringVector(floor(Int, 1.2sizeof(str))), true, true)
    out.size = 0
    out.ptr = 1
    while j  endof(str)
        if i == a || i <= k
            unsafe_write(out, pointer(str, i), UInt(j-i))
            _replace(out, repl, str, r, pattern)
        end
        if k < j
            i = j
            j > e && break
            k = nextind(str, j)
        else
            i = k = nextind(str, k)
        end
        r, (pattern, repl) = findnextall!(pat_repls, str, k, rangecache)
        r == 0:-1 || n == count && break
        j, k = first(r), last(r)
        n += 1
    end
    write(out, SubString(str, i))
    String(take!(out))
end

It gets the string benchmark above down to 0.92 seconds (still worse than regex, though).

@stevengj
Copy link
Member

stevengj commented Jan 8, 2018

Another benchmark (using the modified O(N) version of your replace function above):

julia> s = "alpha beta gamma"^100000;

julia> @time replace(s, "alpha"=>"α", "beta"=>"β", "gamma"=>"γ");
  0.250454 seconds (3.90 M allocations: 116.269 MiB, 4.51% gc time)

julia> @time myreplace(s, "alpha"=>"α", "beta"=>"β", "gamma"=>"γ");
  0.054157 seconds (300.03 k allocations: 10.988 MiB, 4.05% gc time)

The huge number of allocations in replace doesn't help — one contributor to this is the fact that findnextall (and my findnextall!) is type-unstable.

(The allocations for myreplace are nearly all creations of SubString objects, which will hopefully be eliminated [converted to stack allocations] with future compiler improvements as mentioned above.)

@stevengj
Copy link
Member

stevengj commented Jan 8, 2018

This version eliminates a couple of type instabilities, cutting down the number of allocations by 25% or so in the above benchmarks, but it is only slightly faster:

import Base: _replace, replace, StringVector

# replace Char with function that compares with this Char
predicatepair(p::Pair{Char}) = equalto(p.first) => p.second

# replace collection of Char with function that checks occurrence in this collection
function predicatepair(p::Pair{<:Union{Tuple{Vararg{Char}},AbstractVector{Char},Set{Char}}})
    occursin(p.first) => p.second
end
predicatepair(p::Pair) = p

# find the first of a list of patterns.
# if several match at the same start position prefer the longer one
# if also the length is equal, prefer the first in list
# return index-range of the successful search and the pair
function findnextall!(pairs::NTuple{N,Pair}, s::AbstractString, st::Integer, rangecache) where N
    minstart=minend=endof(s)+1
    nmin=0
    for n = 1:N
        p = pairs[n]
        fr = first(rangecache[n])
        if fr  st
            r = rangecache[n] # re-use previous search result
            last(r) == 0 && continue # not found
        else
            r = rangecache[n] = findnext(p.first, s, st)
            fr = first(r)
        end
        if fr > 0
            e = last(r)
            if fr < minstart || (fr == minstart && e > minend)
                minstart, minend, nmin = fr, e, n
            end
        else # not found
            rangecache[n] = endof(s)+1:0
        end
    end
    return minstart:minend, nmin
end

# cover the multiple pairs case
function replace(str::String, pat_repls_::Pair...; count::Integer=typemax(Int))
    if count == 0 || isempty(pat_repls_)
        return str
    end
    count < 0 && throw(DomainError(count, "`count` must be non-negative."))
    pat_repls = predicatepair.(pat_repls_)
    n = 1
    e = endof(str)
    i = a = start(str)
    rangecache = [0:0 for i=1:length(pat_repls)]
    r, npat = findnextall!(pat_repls, str, i, rangecache)
    j, k = first(r), last(r)
    out = IOBuffer(StringVector(floor(Int, 1.2sizeof(str))), true, true)
    out.size = 0
    out.ptr = 1
    while npat > 0 && n  count
        if i == a || i <= k
            unsafe_write(out, pointer(str, i), UInt(j-i))
            pattern, repl = pat_repls[npat]
            _replace(out, repl, str, r, pattern)
        end
        if k < j
            i = j
            j > e && break
            k = nextind(str, j)
        else
            i = k = nextind(str, k)
        end
        r, npat = findnextall!(pat_repls, str, k, rangecache)
        j, k = first(r), last(r)
        n += 1
    end
    write(out, SubString(str, i))
    String(take!(out))
end

As I said in my first comment above, I think it will be quite tricky to make this kind of "list of functions" approach fast, and will be especially hard to beat regex search (except for special cases like Char-only replacement). Yes, @nalimilan is correct that, in principle, iteration over heterogeneous tuples can sometimes be done completely at compile-time and hence be type-specialized, but that usually requires much trickier code than is being employed here.

And yes, the regex approach doesn't support user-defined character predicates, but we have map for that… I'm not sure it's necessary to support such functions here. Of course, one option would be to keep the slower general version and add specialized versions (e.g. regex-based) for particular cases, but that has the drawback of much greater code complexity.

@StefanKarpinski
Copy link
Member

I would just like to interject that the fact that this API affords various clever specializations that can be so much faster than naive repeated replacement indicates (to me at least), that this is a good API which allows the user to easily express their intention in a simple way which the implementation can nevertheless make fast in various clever ways.

@stevengj
Copy link
Member

stevengj commented Jan 8, 2018

@StefanKarpinski, I was actually coming to the opposite conclusion. Take the "alpha"=>"α", "beta"=>"β", "gamma"=>"γ" replacement, for example. If you have to apply this replacement to many small strings, you should definitely not use replace(s, "alpha"=>"α", "beta"=>"β", "gamma"=>"γ"). Instead, you should do

d = Dict("alpha"=>"α", "beta"=>"β", "gamma"=>"γ")
r = r"alpha|beta|gamma"  # or Regex(join(keys(d), '|'))
replace(s, r => (s -> d[s]))

or we could even support replace(s, r=>d) as I mentioned above. This is not that much longer than a multi-argument replace, and is much more efficient for small strings because d and r only need to be constructed once (6x faster for a length-80 string in a quick test). We could easily add this example to the documentation for replace.

As another example, take @KlausC's example, replace(s, 'c'=>"BOOM", isupper=>""). If, instead, you do

buf = IOBuffer()
for c in s
    if c == 'c'
        print(buf, "BOOM")
    elseif !isupper(c)
        print(buf, c)
    end
end
String(take!(buf))

it is 30x faster on my machine for randstring(10^5).

Basically, if you are doing multiple replacements, there is almost always going to be a vastly faster way to do it than any generic multi-arg replace function we are likely to write. Character-predicate loops and the regex-alternative method, which probably cover the two most common uses of this function, are not even that hard to write.

By providing a multi-arg string replace function, we arguably discourage people from writing more efficient code for this problem, because they will assume that the standard-library implementation is probably better than anything mere mortals could write. And if they don't care about performance, nested replace calls are fine too.

@nalimilan
Copy link
Member

@stevengj I don't understand your position. Nothing prevents us from writing efficient methods for cases where the pair's LHS are only chars or predicates, of for when they are only strings. If we are really concerned about trapping users with slow methods, we could even avoid supporting complex cases for which we have no efficient solution at the moment (notably, those mixing all three types of LHS).

@StefanKarpinski
Copy link
Member

@stevengj I don't understand your position.

Agree. We can always turn a replacement of multiple string keys into a regex with a dictionary lookup in the implementation. We could even have a replacer function which generates a compiled version of the above and can then be called when needed for optimal performance.

@stevengj
Copy link
Member

stevengj commented Jan 8, 2018

@nalimilan, the current API makes it almost impossible to write efficient code for e.g. the "alpha"=>"α", "beta"=>"β", "gamma"=>"γ" case applied to many short strings, because there is no way to cache the regex or the dict between searches. The same applies to any case where precomputation is helpful.

In contrast, giving the user an example of how to use | in a regex replacement teaches the user how to easily write efficient code for the most common cases of multiple replacements.

It's true that we could have a replacer API for precomputed cases. I'm just not sure it is worth it — how hard is it to get people to use |?

Also, a replacer precomputed-replacement API is precisely the sort of thing that could go into a package (because it would create its own Replacer type that could be passed e.g. to new Base.replace methods without type piracy).

@stevengj
Copy link
Member

stevengj commented Jan 8, 2018

My position is that it is counterproductive to put an API into base with the rationale, "Use this API if you care about performance, instead of calling replace multiple times … but don't use this API if you actually care about performance, because it will often be 10x slower than writing your own regex | or character loop."

@nalimilan
Copy link
Member

OK, I see your point about the inefficiency of rebuilding a regex and a dictionary on every call. But that doesn't apply to situations where the LHS is a Char or a predicate. We could at least add support for these, for consistency with the generic API for collections (which operates on one entry at a time). The methods where the LHS is a string or a regex could be added later if we find an efficient way to implement them (I see no reason why we shouldn't be able to match the performance of PCRE).

@stevengj
Copy link
Member

stevengj commented Jan 8, 2018

@nalimilan, I'm not so opposed to a multi-arg replace function restricted to multiple single-character predicates. I think it should probably be a separate PR, though, since an efficient implementation for that special case will share no code with the current PR.

It might be confusing to support multi-arg replace for only a small subset of the supported string-replacement patterns, of course. Also, I'm not sure how useful it is: how often do people actually do multi-single-char string replacements where they wouldn't just use map or their own loop?

I see no reason why we shouldn't be able to match the performance of PCRE

To do that, you might have to precompute a search tree (to efficiently handle multiple strings with common prefixes), similar to PCRE, so you are still hurt by the impossibility of separate precomputation with the current API.

@stevengj
Copy link
Member

stevengj commented Jan 8, 2018

Even for single-character replacement, it's pretty hard to write an efficient implementation using the current API, because the current API requires that a substring be passed to the replacement function, rather than a Char. Honestly, I think that most people doing complicated character mappings would be better off just writing a loop.

@nalimilan
Copy link
Member

It might be confusing to support multi-arg replace for only a small subset of the supported string-replacement patterns, of course. Also, I'm not sure how useful it is: how often do people actually do multi-single-char string replacements where they wouldn't just use map or their own loop?

No idea, but supporting generic APIs sounds like a good idea in general. We've had at least one complaint about this already: https://discourse.julialang.org/t/inconsistency-between-replace-on-an-abstractstring-vs-vector-char/8206 (though note the example should be replace("foobar", 'o' => 'O', 'b' => 'B') for it to be a real inconsistency with Vector{Char}).

Even for single-character replacement, it's pretty hard to write an efficient implementation using the current API, because the current API requires that a substring be passed to the replacement function, rather than a Char.

Mmm, indeed. Looks like we should change this API before 0.7 as it's indeed inefficient and goes against the other methods for general iterables. It's a bit annoying, but we should probably pass a Char when the LHS is a Char or a predicate, and a SubString when the LHS is a string or a regex.

@KlausC
Copy link
Contributor Author

KlausC commented Jan 9, 2018

I added the following changes due to the discussion between @stevengj and @nalimilan:

  1. the algorithm is now O(n)
  2. if all patterns are strings or chars, they are aggregated to a single regex
  3. if all patterns are chars or predicate functions, they are composed to a single special function
  4. patterns like 2./3. can arbitrarily be composed with regex, predicates, chars, and strings. If possible they are partially aggregated.
  5. if the consolidation leads to a single regex, the standard function with one pat-repl-pair is called.
  6. the preparation effort, including regex compilation, can be stored in a re-usable cache object (kwa 'cache' and struct ReplaceCache were added).
  7. test cases for 2. - 4.

@nalimilan
Copy link
Member

Even for single-character replacement, it's pretty hard to write an efficient implementation using the current API, because the current API requires that a substring be passed to the replacement function, rather than a Char.

@stevengj I wrote above that this was indeed a problem, but now I can't find the replace method we were talking about. Does it actually exist? AFAICT we do not currently support a replacement function with strings, and if we added this function for consistency with arrays it should take a Char.

@KlausC
Copy link
Contributor Author

KlausC commented Jan 30, 2018

In v0.6 we had:

  | | |_| | | | (_| |  |  Version 0.6.1 (2017-10-24 22:15 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> replace("abc", "bc", uppercase)
"aBC"

in v0.7 we have:

  | | |_| | | | (_| |  |  Version 0.7.0-DEV.3647 (2018-01-30 07:37 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 6faa23bab0 (0 days old master)
|__/                   |  x86_64-redhat-linux

julia> replace("abc", "bc" => uppercase)
"aBC"

Implementation is here:

strings/utils.jl:391   _replace(io, repl::Function, str, r, pattern) = 
                                print(io, repl(SubString(str, first(r), last(r))))

So if repl is a Function, we call it on SubString(...).
Maybe str[r] would be better in general.
If length(r) == 1 calling repl(str[first[r]) would call the Char-function, which is much more efficient for the standard character conversion functions (Unicode.uppercase ...).

@nalimilan
Copy link
Member

Ah, of course, I was looking for something like replace(f, s).

Maybe str[r] would be better in general.
If length(r) == 1 calling repl(str[first[r]) would call the Char-function, which is much more efficient for the standard character conversion functions (Unicode.uppercase ...).

Yes, though to ensure type stability the type of the argument should not depend on the length of the match, but just on the type of the LHS: if it's a Char or a set thereof, pass a Char to the function, else pass a SubString. Would you make a PR?

@KlausC
Copy link
Contributor Author

KlausC commented Jan 30, 2018

I willl do. I try to add method:

_replace(io, repl::Function, str, r, pattern::Function) = 
                                print(io, repl(getindex(str, first(r))))

After the transformation of input patterns, that will cover all single-character types, leaving String and Regex as has been.

vtjnash added a commit to vtjnash/julia that referenced this pull request Apr 14, 2021
This has been attempted before, sometimes fairly similar to this, but
the attempts seemed to be either too simple or too complicated. This
aims to be simple, and even beats one of the "handwritten" benchmark
cases.

Past issues (e.g. JuliaLang#25396) have proposed that using Regex may be faster,
but in my tests, this handily bests even simplified regexes. There can
be slow Regexes patterns that can cause this to exhibit O(n^2) behavior,
but only if the one of the earlier patterns is a partial match for a
later pattern Regex and that Regex always matches O(n) of the input
stream. This is a case that is hopefully usually avoidable in practice.

fixes JuliaLang#35327
fixes JuliaLang#39061
fixes JuliaLang#35414
fixes JuliaLang#29849
fixes JuliaLang#30457
fixes JuliaLang#25396
JeffBezanson pushed a commit that referenced this pull request Jun 7, 2021
This has been attempted before, sometimes fairly similar to this, but
the attempts seemed to be either too simple or too complicated. This
aims to be simple, and even beats one of the "handwritten" benchmark
cases.

Past issues (e.g. #25396) have proposed that using Regex may be faster,
but in my tests, this handily bests even simplified regexes. There can
be slow Regexes patterns that can cause this to exhibit O(n^2) behavior,
but only if the one of the earlier patterns is a partial match for a
later pattern Regex and that Regex always matches O(n) of the input
stream. This is a case that is hopefully usually avoidable in practice.

fixes #35327
fixes #39061
fixes #35414
fixes #29849
fixes #30457
fixes #25396
JeffBezanson pushed a commit that referenced this pull request Jun 7, 2021
This has been attempted before, sometimes fairly similar to this, but
the attempts seemed to be either too simple or too complicated. This
aims to be simple, and even beats one of the "handwritten" benchmark
cases.

Past issues (e.g. #25396) have proposed that using Regex may be faster,
but in my tests, this handily bests even simplified regexes. There can
be slow Regexes patterns that can cause this to exhibit O(n^2) behavior,
but only if the one of the earlier patterns is a partial match for a
later pattern Regex and that Regex always matches O(n) of the input
stream. This is a case that is hopefully usually avoidable in practice.

fixes #35327
fixes #39061
fixes #35414
fixes #29849
fixes #30457
fixes #25396
shirodkara pushed a commit to shirodkara/julia that referenced this pull request Jun 9, 2021
This has been attempted before, sometimes fairly similar to this, but
the attempts seemed to be either too simple or too complicated. This
aims to be simple, and even beats one of the "handwritten" benchmark
cases.

Past issues (e.g. JuliaLang#25396) have proposed that using Regex may be faster,
but in my tests, this handily bests even simplified regexes. There can
be slow Regexes patterns that can cause this to exhibit O(n^2) behavior,
but only if the one of the earlier patterns is a partial match for a
later pattern Regex and that Regex always matches O(n) of the input
stream. This is a case that is hopefully usually avoidable in practice.

fixes JuliaLang#35327
fixes JuliaLang#39061
fixes JuliaLang#35414
fixes JuliaLang#29849
fixes JuliaLang#30457
fixes JuliaLang#25396
johanmon pushed a commit to johanmon/julia that referenced this pull request Jul 5, 2021
This has been attempted before, sometimes fairly similar to this, but
the attempts seemed to be either too simple or too complicated. This
aims to be simple, and even beats one of the "handwritten" benchmark
cases.

Past issues (e.g. JuliaLang#25396) have proposed that using Regex may be faster,
but in my tests, this handily bests even simplified regexes. There can
be slow Regexes patterns that can cause this to exhibit O(n^2) behavior,
but only if the one of the earlier patterns is a partial match for a
later pattern Regex and that Regex always matches O(n) of the input
stream. This is a case that is hopefully usually avoidable in practice.

fixes JuliaLang#35327
fixes JuliaLang#39061
fixes JuliaLang#35414
fixes JuliaLang#29849
fixes JuliaLang#30457
fixes JuliaLang#25396
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants