Specialize nextind and prevind for String #16648

Merged 1 commit on Jun 8, 2016

@TotalVerb (Contributor) commented May 29, 2016

Now that there is only one String type in Base, it might be worth optimizing it. The specializations here yield roughly a two-fold performance boost over the generic variants.

In the benchmarks below, Base.{next|prev}ind refers to the old version, and {next|prev}ind to the new version.

julia> @benchmark sum(prevind("Hello World", i) for i in -1:11)
Trial(130.00 ns)

julia> @benchmark sum(Base.prevind("Hello World", i) for i in -1:11)
Trial(238.00 ns)

julia> @benchmark sum(nextind("Hello World", i) for i in -1:11)
Trial(122.00 ns)

julia> @benchmark sum(Base.nextind("Hello World", i) for i in -1:11)
Trial(286.00 ns)

julia> @benchmark sum(prevind("αβγδϵζ😄🍕", i) for i in 0:21)
Trial(224.00 ns)

julia> @benchmark sum(Base.prevind("αβγδϵζ😄🍕", i) for i in 0:21)
Trial(490.00 ns)

julia> @benchmark sum(nextind("αβγδϵζ😄🍕", i) for i in 0:21)
Trial(199.00 ns)

julia> @benchmark sum(Base.nextind("αβγδϵζ😄🍕", i) for i in 0:21)
Trial(556.00 ns)

@TotalVerb (Contributor Author)

Despite tests passing locally, it seems that some behaviour is broken. Closing temporarily.

@TotalVerb TotalVerb closed this May 29, 2016
@TotalVerb (Contributor Author) commented May 29, 2016

I am reopening because I am of the opinion that the changed behaviour is probably inconsequential. There isn't a strong reason to prefer the current behaviour over the new behaviour. In fact, the new behaviour is monotonic, which might even be more elegant (not that it matters in these cases).

julia> const test = "🍕"
"🍕"

julia> test.data
4-element Array{UInt8,1}:
 0xf0
 0x9f
 0x8d
 0x95

julia> Base.nextind(test, 1)
5

julia> Base.nextind(test, 2)
3

julia> Base.nextind(test, 3)
4

julia> Base.nextind(test, 4)
5

julia> nextind(test, 1)
5

julia> nextind(test, 2)
5

julia> nextind(test, 3)
5

julia> nextind(test, 4)
5
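
The "monotonic" behaviour shown above can be reproduced with a small byte-scanning sketch. This is not the code from this PR: `is_continuation` and `sketch_nextind` are hypothetical names, and the sketch works on the bytes returned by `codeunits` (current Julia) rather than the `s.data` field used in the transcript. It only assumes that UTF-8 continuation bytes have the bit pattern `10xxxxxx`.

```julia
# Hypothetical helper (not part of Base): true for UTF-8 continuation bytes (10xxxxxx).
is_continuation(b::UInt8) = (b & 0xc0) == 0x80

# Sketch of a byte-scanning nextind. Returns the smallest index j > i at which
# a character starts. If no character starts after i, returns length(bytes) + 1
# for in-range i, and i + 1 when i is already past the end.
function sketch_nextind(bytes::AbstractVector{UInt8}, i::Integer)
    n = length(bytes)
    j = Int(i)
    j < 1 && return 1
    while j < n
        j += 1
        is_continuation(bytes[j]) || return j
    end
    return j + 1
end

pizza = codeunits("🍕")   # 4 bytes: 0xf0 0x9f 0x8d 0x95
results = [sketch_nextind(pizza, i) for i in 1:4]   # [5, 5, 5, 5]
```

Because the scan never looks at the end of the string unless it has to, the result is non-decreasing in `i`, which is the property referred to as monotonic.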

@TotalVerb TotalVerb reopened this May 29, 2016
@nalimilan (Member)

At least the current behavior is consistent in returning i+1 when i > endof(s). OTOH, the one you suggest will always return the end of the underlying array (which is an implementation detail), except when i is higher than that.

Regarding performance, I would have thought the two definitions would be essentially identical after inlining. It would be interesting to compare the generated code. Also, why are you taking the sum in your benchmarks?

@TotalVerb (Contributor Author) commented May 29, 2016

I can't see what the advantage is in returning end+1. The end of the underlying array is already revealed by nextind on endof.

I think part of the speed bonus comes from avoiding unnecessary bounds checks in isvalid and not doing an expensive endof computation unless necessary. Computing endof is potentially expensive for strings that end in a long character, and has locality issues.
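
To illustrate why computing endof is not free for a UTF-8 string, here is a rough sketch of the backward scan it implies. `is_continuation` and `sketch_endof` are hypothetical names, not Base functions, and the sketch uses `codeunits` from current Julia instead of `s.data`.

```julia
# Hypothetical helper (not part of Base): true for UTF-8 continuation bytes (10xxxxxx).
is_continuation(b::UInt8) = (b & 0xc0) == 0x80

# Sketch of endof: walk backwards from the last byte until a byte that
# starts a character is found. Touches the tail of the buffer, which may
# be far from the index the caller is actually working at.
function sketch_endof(bytes::AbstractVector{UInt8})
    i = length(bytes)
    while i > 1 && is_continuation(bytes[i])
        i -= 1
    end
    return i
end

last_ascii = sketch_endof(codeunits("Hello"))   # 5: the last byte starts a character
last_pizza = sketch_endof(codeunits("🍕"))      # 1: three continuation bytes are skipped
```

A string ending in a 4-byte character needs three extra steps, and every call reads memory at the end of the string regardless of where `i` points.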

As for sum, it's a way to prevent the compiler from optimizing out computations that aren't used. I don't know if this optimization is actually performed in this case.

@nalimilan (Member)

> I can't see what the advantage is in returning end+1. The end of the underlying array is already revealed by nextind on endof.

I'm not saying that it matters a lot, but you said the new behavior was better.

> I think part of the speed bonus comes from avoiding unnecessary bounds checks in isvalid and not doing an expensive endof computation unless necessary. Computing endof is potentially expensive for strings that end in a long character, and has locality issues.

Makes sense.

if i > stop
    return endof(s)
end
i -= oftype(i, 1)
Member

I don't think you should call oftype here. The index type for String is Int at the moment, we only accept Integer as input for convenience. Anyway, it doesn't make sense to be more general here than for AbstractString below.

Contributor Author

At least the functions should be made type stable. Currently if a bigger type than machine Int is passed in, the return type is a Union and that's not good. I'll make both functions (including the generic ones) return Int always.
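
The type-stability concern can be shown with a reduced example. `prev_unstable` and `prev_stable` are hypothetical stand-ins that only mimic the shape of the snippet under review (an early return of an `Int` next to arithmetic in the input's own type); they are not the functions from this PR.

```julia
# Hypothetical stand-in: one branch returns an Int, the other returns typeof(i),
# so for a non-Int input the result type depends on the runtime value.
function prev_unstable(i::Integer, stop::Int)
    i > stop && return stop
    return i - oftype(i, 1)
end

# Hypothetical stand-in: convert once up front so every branch returns Int.
function prev_stable(i::Integer, stop::Int)
    j = Int(i)
    j > stop && return stop
    return j - 1
end

t1 = typeof(prev_unstable(Int128(10), 5))   # Int    (early-return branch)
t2 = typeof(prev_unstable(Int128(3), 5))    # Int128 (arithmetic branch)
t3 = typeof(prev_stable(Int128(10), 5))     # Int
t4 = typeof(prev_stable(Int128(3), 5))      # Int
```

With an `Int128` argument the first version yields two different result types, which is what forces the compiler to infer a Union.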

@nalimilan (Member)

I even wonder whether it's a good idea to provide default nextind and prevind methods for AbstractString, since in many cases they will have the same performance issue due to endof. We could only make this both generic and efficient by introducing a companion to endof which would return the last possible index (since this is generally faster than endof).

@TotalVerb (Contributor Author)

I think the current nextind and prevind are reasonably fast. They should be made faster for String because it's the standard type, but they're good enough for user-defined types.

Variable-length encodings can cause efficiency problems in different ways. Our strlen (length) is as fast as C's, for example, which means for large strings it's very, very slow. I don't think we can avoid the programmer having to keep in mind the complexity of every operation.
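
As a rough illustration of why length is linear in the number of bytes for a variable-length encoding, the sketch below counts characters by visiting every byte. `is_continuation` and `sketch_length` are hypothetical names, and the sketch assumes valid UTF-8 input obtained through `codeunits` in current Julia.

```julia
# Hypothetical helper (not part of Base): true for UTF-8 continuation bytes (10xxxxxx).
is_continuation(b::UInt8) = (b & 0xc0) == 0x80

# Sketch of length: every byte has to be inspected, so the cost grows
# with the byte size of the string, not with the number of characters asked for.
sketch_length(bytes) = count(b -> !is_continuation(b), bytes)

s = "αβγδϵζ😄🍕"
nchars = sketch_length(codeunits(s))   # 8 characters
nbytes = ncodeunits(s)                 # 20 bytes had to be scanned
```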

nextind(s::DirectIndexString, i::Integer) = i+1
nextind(s::AbstractArray , i::Integer) = i+1
prevind(s::DirectIndexString, i::Integer) = convert(Int, i)-1
prevind(s::AbstractArray , i::Integer) = convert(Int, i)-1
Contributor

Int(i-1) would be a little more concise - the alignment also looks funny here, though it was that way before your change

Contributor Author

I was hoping to do ::Int on the function instead but this doesn't seem to work yet. Used Int(x)-1 as a stopgap in the meantime.

@tkelman (Contributor), May 30, 2016

Int(i-1) would have better behavior near overflow corner cases IMO

actually could go either way, the smaller sizes would be better to convert before subtracting, larger sizes would be better to convert after. smaller sizes are probably more likely to be seen near overflow
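
The trade-off can be made concrete with two hypothetical helpers. Note one assumption: `subtract_then_convert` subtracts in the input's own type via `one(i)`. With a literal `1`, Julia would first promote types narrower than `Int` to `Int`, so the wraparound shown for `UInt8` would not occur in that form.

```julia
# Hypothetical helpers for comparison; neither is code from this PR.
convert_then_subtract(i::Integer) = Int(i) - 1
subtract_then_convert(i::Integer) = Int(i - one(i))   # subtraction stays in typeof(i)

# Narrow type: subtracting first wraps around inside UInt8.
small = UInt8(0)
a = convert_then_subtract(small)   # -1
b = subtract_then_convert(small)   # 255, because 0x00 - 0x01 wraps to 0xff

# Wide type: converting first fails for a value just outside the Int range.
wide = Int128(typemax(Int)) + 1
c = subtract_then_convert(wide)    # typemax(Int)
# convert_then_subtract(wide) throws an InexactError
```

Neither order is right for every input type, which matches the conclusion that it could go either way.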

Contributor Author

Reasonable to keep it like this then. If a big integer type would overflow then it's not a good index anyhow.

Member

Could you change convert(Int, ...) to Int so that we can merge the PR?

@TotalVerb (Contributor Author)

Thanks for the review. The issues have been addressed and I have squashed commits. Anything more?

@ViralBShah ViralBShah added the strings "Strings!" label Jun 4, 2016
@JeffBezanson JeffBezanson merged commit 11e4031 into JuliaLang:master Jun 8, 2016