Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of length on UTF8String and UTF16String #11107

Merged
merged 6 commits into from
May 6, 2015

Conversation

ScottPJones
Copy link
Contributor

This adds an optimized version to get the length of a UTF16String, it is about an order of magnitude faster (it is still O(n) like UTF8String, instead of O(1) like ASCIIString or UTF32String, though)

* @param[in] iStr UTF-8 encoded string
* @param[in] iLen Length of the string in bytes
*
* @return number of logical characters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better to say "codepoints" instead of "logical character".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least I documented it! ;-) From where I'm coming from, I thought logical character was actually clearer... a lot of people get codepoints vs. the value of the byte/word/32-bit confused.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"logical character" can mean either code point or grapheme cluster, I don't think it's clearer. A Google search for the former does not give anything useful, while a search for the latter immediately leads to the definition of Wikipedia.

@nalimilan
Copy link
Member

Do you really need to rely on a C function to do that? I'm not sure why this is currently the case for UTF-8, but I think it may have to do with the special status of UTF8String, which shouldn't be an issue for UTF16String.

@ScottPJones
Copy link
Contributor Author

It is for speed, it simply was not nearly fast enough in Julia... about 3x slower... the Julia team will need to look at that, but for now, given that u8_charnum is already there, I don't think it should prevent this from being merged in.

@ScottPJones
Copy link
Contributor Author

@nalimilan Better now? Thanks very much for the input!

@vtjnash
Copy link
Member

vtjnash commented May 2, 2015

It is for speed, it simply was not nearly fast enough in Julia... about 3x slower...

[citation needed]

@carnaval
Copy link
Contributor

carnaval commented May 2, 2015

We should definitely be able to get a simple loop like that down to the same IR as, e.g., clang would. If not I'd consider that a compiler bug.

@ScottPJones
Copy link
Contributor Author

@carnaval I'm all for that, and I've said earlier, that the moment the Julia version is within say 5-10% of the C version, then the C version could be dropped... Unfortunately, that is not the case at present, and the people who would be able to improve it are probably working 90+ hour weeks as it is!
I need decent performance for these things, if I intend to use Julia for my string processing / database tasks... I don't think it is unreasonable to have some C code to get that performance, at least for the time being.

@ScottPJones
Copy link
Contributor Author

@vtjnash I'll try to dig that code up, or recreate it... as I've said before, I am not opposed to having everything all written in Julia... but I need the performance now, not next year, and I don't think it is unreasonable to add a few lines of C to Julia, that are exactly like the code that is already there,
but for UTF-8 [I actually was about to send a message to @JeffBezanson about removing some C code from utf8.c, that is no longer used, and potentially unsafe..., so the net amount of C code in Julia would actually go down, not up!]

@carnaval
Copy link
Contributor

carnaval commented May 2, 2015

(disclaimer : I know nothing about string encoding)
Using a literal translation of your C code I get equivalent speed (sometimes a bit faster, sometimes a bit slower) between the C version and this

@inline is_surrogate_trail(c :: UInt16) = (c & ~UInt16(0x3ff)) == UInt16(0xdc00)
function length2(s::UTF16String)
    d = s.data
    len = length(d) - 1
    len == 0 && return 0
    cnum = 0
    for i = 1:len
        @inbounds cnum += !is_surrogate_trail(d[i])
    end
    cnum
end

Note : my LLVM version (3.6) auto-vectorizes this nicely so it may be that your C compiler is doing it but not the LLVM version you are using with julia.

@ScottPJones
Copy link
Contributor Author

@carnaval This is what I got: for 64K strings, the Julia code was 2-2.2x slower, for 16 byte strings, it was between 41-72% slower... that's a very significant difference... The difference was < 3x now, because my Julia code is better than it was a couple of weeks ago ;-)

@ScottPJones
Copy link
Contributor Author

@carnaval That's good... but I couldn't find the @inline macro in the documentation for v0.3.x... is that new to v0.4?
Also, about the LLVM, I have a feeling that not everybody is up to that version... here is the version I have with OS X 10.10.4:

LLVM (http://llvm.org/):
LLVM version 3.3
Optimized build.
Built Apr 27 2015 (22:46:30).
Default target: x86_64-apple-darwin14.3.0
Host CPU: core-avx2

Here is a gist with my results and (really bad) benchmarking code...
https://gist.github.com/ScottPJones/bb712f7b85d1d8d91a9a

[I really want to know how in Julia to best to get the 2 numbers from @time, save them for a whole set of results, and then load them back in to calculate the differences in timings and memory utilization... I miss having a multidimensional associate array structure, both persistent and local, to do that with a single command..]

@carnaval
Copy link
Contributor

carnaval commented May 2, 2015

As I said, you have to check if your C version is getting vectorized (likely) for example by running gdb src/support/utf8.o -ex 'disassemble /m u16_charnum', and if it is the case for your julia version, for example by running code_native(length, (UTF16String,)) or code_llvm if you prefer reading IR than assembly.

@ScottPJones
Copy link
Contributor Author

@carnaval I would think it's the same LLVM... the C code is in utf8.c and is built as part of Julia, or have I missed something?

@carnaval
Copy link
Contributor

carnaval commented May 2, 2015

And now that I look at your code, I'm pretty sure that the bound checks are not being ellided which prevents vectorization because of the possibility of diverging control flow. See the @inbounds in my example (which again, performs exactly as the C version on my machine).

@carnaval
Copy link
Contributor

carnaval commented May 2, 2015

There is no particular reason for the C source to be compiled by clang using the same LLVM version we use for the codegen. Any C compiler does the job.

@ScottPJones
Copy link
Contributor Author

@carnaval OK, very helpful! As I think I've said, my very first attempt at this was in Julia, but then I saw that it was still slower than the UTF-8 version (and should have been faster), and enough slower that I felt I wouldn't be able to use it... and also, @JeffBezanson had it as a C function, which I had already sped up last week by a good amount, and if it was good enough for somebody like him, one of the Jedi Knights of Julia, I thought it had to be good enough for me!

@ScottPJones
Copy link
Contributor Author

@carnaval Also good to know, that it won't necessarily be using the same LLVM as Julia... I often am running with the pre-release Xcode versions on my machine... Thanks!

@ScottPJones
Copy link
Contributor Author

@carnaval Thanks so much... I just ran with your @inline and @inbound change, with my full set of different types of Unicode strings (i.e. only ASCII characters, some Latin1 or other 2-byte ones (for UTF8), 3-byte and 4 byte [i.e. surrogates for UTF-16). It is 9-19% slower than the C version for 16 character strings... I need to test with more sizes I think... but it is 12-37% faster for 64K strings...

@ScottPJones
Copy link
Contributor Author

My question now to the peanut gallery... since this is a mixed bag, should I change this to the pure Julia version, it is a bit slower than what I'd considered to be my pain threshold, but only for short strings... (but those are more common), and put my faith in the Julia team that they will look into that case... [it's a challenge for them], or see if the C version should be accepted for now?
If you all want to merge in the Julia version, shall I check it in with the C versions of both u8_charnum and u16_charnum removed?

Thanks so much for the help!

@vtjnash
Copy link
Member

vtjnash commented May 3, 2015

any idea why the @inline annotation is needed? or was just @inbounds needed? the inlining heuristics shouldn't have had any issues with such a simple function

yes, the julia-only version can be merged

@ScottPJones
Copy link
Contributor Author

@carnaval About the Julia code, I'd just copied that from somebody else's Julia code in one of the utfXX.jl modules, and just changed the testing in the inner loop [to simply assume that it was valid UTF-8 or UTF-16, and then simply not count the trailing surrogates or continuation bytes]
Now I wonder if a lot of those loops I saw in the utfXX.jl modules should be liberally sprinkled with
@inbounds.

@vtjnash
Copy link
Member

vtjnash commented May 3, 2015

(note, i believe that is_surrogate_trail is already implemented as utf16_is_trail)

@ScottPJones
Copy link
Contributor Author

@vtjnash I haven't had time to test all the permutations... but I'm very excited at the results!
This is more like what I was hoping to see out of Julia...

@ScottPJones
Copy link
Contributor Author

@vtjnash I made a copy so I could put the @inline on it... I don't know if that works outside a module... and also I wanted to avoid the time taken to rebuild Julia... too excited to wait!

@vtjnash
Copy link
Member

vtjnash commented May 3, 2015

Now I wonder if a lot of those loops I saw in the utfXX.jl modules should be liberally sprinkled with
@inbounds.

yes, if it helps performance. otherwise, no, since it can make bugs much harder to find. Automated bounds checking elision is something that is very difficult for a compiler, but not too difficult for a human.

@ScottPJones
Copy link
Contributor Author

Also, I think I'll just replace the UTF16String version of length, because the old Julia code was always much worse than this... I'll leave the optimized UTF8String version for now, so as not to inadvertently slow things down, right after the change I put in last week to speed it up!

@ScottPJones
Copy link
Contributor Author

@vtjnash :-) You don't know me yet... but I am a bit OCD about rigorously benchmarking stuff... of course I'll only put an @inbounds if it noticeably affects performance...

@quinnj
Copy link
Member

quinnj commented May 4, 2015

@ScottPJones, please put backticks ( ) around code, especially macro names (in your last comment: @inline). Without backticks, this is actually referencing/notifying/emailing the GitHub user "inline" that he/she was mentioned in this thread.

@ScottPJones
Copy link
Contributor Author

@quinnj Sorry, someone had already warned me about that, and I've tried to be careful... Julia makes it a bit hard though! ;-)

*
* @return Number of logical characters (or codepoints) in string (or substring)
*/
DLLEXPORT size_t u8_charnum(const char *iStr, size_t iLen);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're using different argument names here. Also, I don't think the docs should be repeated in the headers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK... I'll clean that up, thanks!

@nalimilan
Copy link
Member

Thanks, this looks like a nice improvement. Though I'm not sure anything has been decided about the documentation format for C code.

@carnaval
Copy link
Contributor

carnaval commented May 4, 2015

@JeffBezanson I know, but the C code I've written reloads it every time also. In that case it should be quite easy for it to see that the load is loop invariant. I was trying to understand why can't LLVM (really, our set of passes) figure it out on its own.

@ScottPJones
Copy link
Contributor Author

Thanks, this looks like a nice improvement. Though I'm not sure anything has been decided about the documentation format for C code.

@nalimilan About the documentation format... is there a problem with doing something? (doxygen with Markdown markup)? Julia has been around for many years now, and there's been no decision on documenation standards? Sorry if it seems snarky, it's just very frustrating to deal with code that isn't documented, and from previous experience with doxygen, it does a pretty good job...

@nalimilan
Copy link
Member

@ScottPJones That's not for me to decide. I agree having a standardized documentation format is a good idea, but better think about it beforehand. :-)

@tkelman
Copy link
Contributor

tkelman commented May 4, 2015

is there a problem with doing something? (doxygen with Markdown markup)? Julia has been around for many years now, and there's been no decision on documenation standards? Sorry if it seems snarky, it's just very frustrating to deal with code that isn't documented

No there has not been a decision. There have however been consistent, well-warranted complaints that the C code and some core parts of the standard library are under-documented and difficult to understand by anyone who didn't write them. I would actually be highly in favor of a concerted effort to go through and document what's there, if there are people who are willing to help with that. The only issue, and I think it's a minor one, is that there are parts of the codebase that change rapidly, and keeping the documentation up-to-date with the code means a bit of extra work. I do think Julia's a large enough project now with enough people regularly trying to look at and work with the internals that it's past time for that to be a good enough reason to resist documenting things.

(should move this documentation conversation to a new issue)

that too

@IainNZ
Copy link
Member

IainNZ commented May 4, 2015

(should move this documentation conversation to a new issue)

@ScottPJones
Copy link
Contributor Author

Would the PTB want me to remove the doxygen style documentation that slipped in?
(that happened because I had originally had that function, that I had written to replace Jeff's, as part of a totally separate C module, that had my C version of UTF conversion functions as well... but since now I have pure Julia versions of the UTF conversion functions, as well as these length functions, the C module is no longer needed)
I'd also be happy to add comments to the files I've been working on (char.jl, utf*.jl), but that's a harder issue, which I already brought up on julia-dev... the @doc macro doesn't work for code in Base... it has dependencies on Markdown, which has dependencies all over the place). For now, in those files, I just put #= and =# around the whole @doc section...
Personally, I feel bad putting in code without documentation... it feels half done...

@ScottPJones
Copy link
Contributor Author

@nalimilan Another thing... doxygen style (using Markdown) comments have already been accepted in the utf8proc module (I hadn't noticed it at first, since it is only in utf8proc.h, and not in utf8proc.c), and that was discussed and accepted... (see issues 26 and 29 in JuliaLang/utf8proc) [btw, how do you directly reference issues that are not in JuliaLang/julia? thx!]

@pao
Copy link
Member

pao commented May 4, 2015

@ScottPJones JuliaLang/utf8proc#26

@mbauman mbauman added the unicode Related to unicode characters and encodings label May 5, 2015
@jakebolewski
Copy link
Member

@ScottPJones it would be great if you could organize your performance tests a bit and start contributing some of that code to test/perf. That way we can objectively look at the performance gains over time and make sure there are no regressions as you work on the string code.

jakebolewski added a commit that referenced this pull request May 6, 2015
Improve performance of length on UTF8String and UTF16String
@jakebolewski jakebolewski merged commit 18b7b34 into JuliaLang:master May 6, 2015
@ScottPJones
Copy link
Contributor Author

@jakebolewski Thanks! I wasn't aware of test/perf, will investigate now.

@ScottPJones ScottPJones deleted the spj/u16_charnum branch May 7, 2015 23:55
mbauman pushed a commit to mbauman/julia that referenced this pull request Jun 6, 2015
mbauman pushed a commit to mbauman/julia that referenced this pull request Jun 6, 2015
mbauman pushed a commit to mbauman/julia that referenced this pull request Jun 6, 2015
mbauman pushed a commit to mbauman/julia that referenced this pull request Jun 6, 2015
mbauman pushed a commit to mbauman/julia that referenced this pull request Jun 6, 2015
tkelman pushed a commit to tkelman/julia that referenced this pull request Jun 6, 2015
tkelman pushed a commit to tkelman/julia that referenced this pull request Jun 6, 2015
tkelman pushed a commit to tkelman/julia that referenced this pull request Jun 6, 2015
tkelman pushed a commit to tkelman/julia that referenced this pull request Jun 6, 2015
tkelman pushed a commit to tkelman/julia that referenced this pull request Jun 6, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging this pull request may close these issues.