Improve performance of length on UTF8String and UTF16String #11107

ScottPJones · 2015-05-02T21:36:08Z

This adds an optimized version to get the length of a UTF16String, it is about an order of magnitude faster (it is still O(n) like UTF8String, instead of O(1) like ASCIIString or UTF32String, though)

nalimilan · 2015-05-02T21:45:20Z

src/support/utf8.c

+ * @param[in]  iStr UTF-8 encoded string
+ * @param[in]  iLen Length of the string in bytes
+ *
+ * @return number of logical characters


Probably better to say "codepoints" instead of "logical character".

At least I documented it! ;-) From where I'm coming from, I thought logical character was actually clearer... a lot of people get codepoints vs. the value of the byte/word/32-bit confused.

"logical character" can mean either code point or grapheme cluster, I don't think it's clearer. A Google search for the former does not give anything useful, while a search for the latter immediately leads to the definition of Wikipedia.

nalimilan · 2015-05-02T21:47:45Z

Do you really need to rely on a C function to do that? I'm not sure why this is currently the case for UTF-8, but I think it may have to do with the special status of UTF8String, which shouldn't be an issue for UTF16String.

ScottPJones · 2015-05-02T21:49:26Z

It is for speed, it simply was not nearly fast enough in Julia... about 3x slower... the Julia team will need to look at that, but for now, given that u8_charnum is already there, I don't think it should prevent this from being merged in.

ScottPJones · 2015-05-02T22:08:36Z

@nalimilan Better now? Thanks very much for the input!

vtjnash · 2015-05-02T22:43:01Z

It is for speed, it simply was not nearly fast enough in Julia... about 3x slower...

[citation needed]

carnaval · 2015-05-02T22:48:07Z

We should definitely be able to get a simple loop like that down to the same IR as, e.g., clang would. If not I'd consider that a compiler bug.

ScottPJones · 2015-05-02T22:52:03Z

@carnaval I'm all for that, and I've said earlier, that the moment the Julia version is within say 5-10% of the C version, then the C version could be dropped... Unfortunately, that is not the case at present, and the people who would be able to improve it are probably working 90+ hour weeks as it is!
I need decent performance for these things, if I intend to use Julia for my string processing / database tasks... I don't think it is unreasonable to have some C code to get that performance, at least for the time being.

ScottPJones · 2015-05-02T22:56:17Z

@vtjnash I'll try to dig that code up, or recreate it... as I've said before, I am not opposed to having everything all written in Julia... but I need the performance now, not next year, and I don't think it is unreasonable to add a few lines of C to Julia, that are exactly like the code that is already there,
but for UTF-8 [I actually was about to send a message to @JeffBezanson about removing some C code from utf8.c, that is no longer used, and potentially unsafe..., so the net amount of C code in Julia would actually go down, not up!]

carnaval · 2015-05-02T23:07:46Z

(disclaimer : I know nothing about string encoding)
Using a literal translation of your C code I get equivalent speed (sometimes a bit faster, sometimes a bit slower) between the C version and this

@inline is_surrogate_trail(c :: UInt16) = (c & ~UInt16(0x3ff)) == UInt16(0xdc00)
function length2(s::UTF16String)
    d = s.data
    len = length(d) - 1
    len == 0 && return 0
    cnum = 0
    for i = 1:len
        @inbounds cnum += !is_surrogate_trail(d[i])
    end
    cnum
end

Note : my LLVM version (3.6) auto-vectorizes this nicely so it may be that your C compiler is doing it but not the LLVM version you are using with julia.

ScottPJones · 2015-05-02T23:34:14Z

@carnaval This is what I got: for 64K strings, the Julia code was 2-2.2x slower, for 16 byte strings, it was between 41-72% slower... that's a very significant difference... The difference was < 3x now, because my Julia code is better than it was a couple of weeks ago ;-)

ScottPJones · 2015-05-02T23:45:05Z

@carnaval That's good... but I couldn't find the @inline macro in the documentation for v0.3.x... is that new to v0.4?
Also, about the LLVM, I have a feeling that not everybody is up to that version... here is the version I have with OS X 10.10.4:

LLVM (http://llvm.org/):
LLVM version 3.3
Optimized build.
Built Apr 27 2015 (22:46:30).
Default target: x86_64-apple-darwin14.3.0
Host CPU: core-avx2

Here is a gist with my results and (really bad) benchmarking code...
https://gist.github.com/ScottPJones/bb712f7b85d1d8d91a9a

[I really want to know how in Julia to best to get the 2 numbers from @time, save them for a whole set of results, and then load them back in to calculate the differences in timings and memory utilization... I miss having a multidimensional associate array structure, both persistent and local, to do that with a single command..]

carnaval · 2015-05-02T23:47:47Z

As I said, you have to check if your C version is getting vectorized (likely) for example by running gdb src/support/utf8.o -ex 'disassemble /m u16_charnum', and if it is the case for your julia version, for example by running code_native(length, (UTF16String,)) or code_llvm if you prefer reading IR than assembly.

ScottPJones · 2015-05-02T23:47:54Z

@carnaval I would think it's the same LLVM... the C code is in utf8.c and is built as part of Julia, or have I missed something?

carnaval · 2015-05-02T23:49:15Z

And now that I look at your code, I'm pretty sure that the bound checks are not being ellided which prevents vectorization because of the possibility of diverging control flow. See the @inbounds in my example (which again, performs exactly as the C version on my machine).

carnaval · 2015-05-02T23:51:28Z

There is no particular reason for the C source to be compiled by clang using the same LLVM version we use for the codegen. Any C compiler does the job.

ScottPJones · 2015-05-02T23:52:21Z

@carnaval OK, very helpful! As I think I've said, my very first attempt at this was in Julia, but then I saw that it was still slower than the UTF-8 version (and should have been faster), and enough slower that I felt I wouldn't be able to use it... and also, @JeffBezanson had it as a C function, which I had already sped up last week by a good amount, and if it was good enough for somebody like him, one of the Jedi Knights of Julia, I thought it had to be good enough for me!

ScottPJones · 2015-05-02T23:54:11Z

@carnaval Also good to know, that it won't necessarily be using the same LLVM as Julia... I often am running with the pre-release Xcode versions on my machine... Thanks!

ScottPJones · 2015-05-03T00:16:53Z

@carnaval Thanks so much... I just ran with your @inline and @inbound change, with my full set of different types of Unicode strings (i.e. only ASCII characters, some Latin1 or other 2-byte ones (for UTF8), 3-byte and 4 byte [i.e. surrogates for UTF-16). It is 9-19% slower than the C version for 16 character strings... I need to test with more sizes I think... but it is 12-37% faster for 64K strings...

ScottPJones · 2015-05-03T00:21:51Z

My question now to the peanut gallery... since this is a mixed bag, should I change this to the pure Julia version, it is a bit slower than what I'd considered to be my pain threshold, but only for short strings... (but those are more common), and put my faith in the Julia team that they will look into that case... [it's a challenge for them], or see if the C version should be accepted for now?
If you all want to merge in the Julia version, shall I check it in with the C versions of both u8_charnum and u16_charnum removed?

Thanks so much for the help!

vtjnash · 2015-05-03T00:25:05Z

any idea why the @inline annotation is needed? or was just @inbounds needed? the inlining heuristics shouldn't have had any issues with such a simple function

yes, the julia-only version can be merged

ScottPJones · 2015-05-03T00:25:59Z

@carnaval About the Julia code, I'd just copied that from somebody else's Julia code in one of the utfXX.jl modules, and just changed the testing in the inner loop [to simply assume that it was valid UTF-8 or UTF-16, and then simply not count the trailing surrogates or continuation bytes]
Now I wonder if a lot of those loops I saw in the utfXX.jl modules should be liberally sprinkled with
@inbounds.

vtjnash · 2015-05-03T00:27:07Z

(note, i believe that is_surrogate_trail is already implemented as utf16_is_trail)

ScottPJones · 2015-05-03T00:27:19Z

@vtjnash I haven't had time to test all the permutations... but I'm very excited at the results!
This is more like what I was hoping to see out of Julia...

ScottPJones · 2015-05-03T00:28:29Z

@vtjnash I made a copy so I could put the @inline on it... I don't know if that works outside a module... and also I wanted to avoid the time taken to rebuild Julia... too excited to wait!

vtjnash · 2015-05-03T00:30:29Z

Now I wonder if a lot of those loops I saw in the utfXX.jl modules should be liberally sprinkled with
@inbounds.

yes, if it helps performance. otherwise, no, since it can make bugs much harder to find. Automated bounds checking elision is something that is very difficult for a compiler, but not too difficult for a human.

ScottPJones · 2015-05-03T00:31:34Z

Also, I think I'll just replace the UTF16String version of length, because the old Julia code was always much worse than this... I'll leave the optimized UTF8String version for now, so as not to inadvertently slow things down, right after the change I put in last week to speed it up!

ScottPJones · 2015-05-03T00:32:39Z

@vtjnash :-) You don't know me yet... but I am a bit OCD about rigorously benchmarking stuff... of course I'll only put an @inbounds if it noticeably affects performance...

quinnj · 2015-05-04T13:10:46Z

@ScottPJones, please put backticks ( ) around code, especially macro names (in your last comment: @inline). Without backticks, this is actually referencing/notifying/emailing the GitHub user "inline" that he/she was mentioned in this thread.

ScottPJones · 2015-05-04T13:46:16Z

@quinnj Sorry, someone had already warned me about that, and I've tried to be careful... Julia makes it a bit hard though! ;-)

nalimilan · 2015-05-04T14:17:11Z

src/support/utf8.h

+ *
+ * @return     Number of logical characters (or codepoints) in string (or substring)
+ */
+DLLEXPORT size_t u8_charnum(const char *iStr, size_t iLen);


You're using different argument names here. Also, I don't think the docs should be repeated in the headers.

OK... I'll clean that up, thanks!

nalimilan · 2015-05-04T14:20:23Z

Thanks, this looks like a nice improvement. Though I'm not sure anything has been decided about the documentation format for C code.

carnaval · 2015-05-04T14:45:44Z

@JeffBezanson I know, but the C code I've written reloads it every time also. In that case it should be quite easy for it to see that the load is loop invariant. I was trying to understand why can't LLVM (really, our set of passes) figure it out on its own.

ScottPJones · 2015-05-04T15:07:35Z

Thanks, this looks like a nice improvement. Though I'm not sure anything has been decided about the documentation format for C code.

@nalimilan About the documentation format... is there a problem with doing something? (doxygen with Markdown markup)? Julia has been around for many years now, and there's been no decision on documenation standards? Sorry if it seems snarky, it's just very frustrating to deal with code that isn't documented, and from previous experience with doxygen, it does a pretty good job...

nalimilan · 2015-05-04T15:33:36Z

@ScottPJones That's not for me to decide. I agree having a standardized documentation format is a good idea, but better think about it beforehand. :-)

tkelman · 2015-05-04T17:28:38Z

is there a problem with doing something? (doxygen with Markdown markup)? Julia has been around for many years now, and there's been no decision on documenation standards? Sorry if it seems snarky, it's just very frustrating to deal with code that isn't documented

No there has not been a decision. There have however been consistent, well-warranted complaints that the C code and some core parts of the standard library are under-documented and difficult to understand by anyone who didn't write them. I would actually be highly in favor of a concerted effort to go through and document what's there, if there are people who are willing to help with that. The only issue, and I think it's a minor one, is that there are parts of the codebase that change rapidly, and keeping the documentation up-to-date with the code means a bit of extra work. I do think Julia's a large enough project now with enough people regularly trying to look at and work with the internals that it's past time for that to be a good enough reason to resist documenting things.

(should move this documentation conversation to a new issue)

that too

IainNZ · 2015-05-04T17:41:48Z

(should move this documentation conversation to a new issue)

ScottPJones · 2015-05-04T18:11:48Z

Would the PTB want me to remove the doxygen style documentation that slipped in?
(that happened because I had originally had that function, that I had written to replace Jeff's, as part of a totally separate C module, that had my C version of UTF conversion functions as well... but since now I have pure Julia versions of the UTF conversion functions, as well as these length functions, the C module is no longer needed)
I'd also be happy to add comments to the files I've been working on (char.jl, utf*.jl), but that's a harder issue, which I already brought up on julia-dev... the @doc macro doesn't work for code in Base... it has dependencies on Markdown, which has dependencies all over the place). For now, in those files, I just put #= and =# around the whole @doc section...
Personally, I feel bad putting in code without documentation... it feels half done...

ScottPJones · 2015-05-04T18:24:13Z

@nalimilan Another thing... doxygen style (using Markdown) comments have already been accepted in the utf8proc module (I hadn't noticed it at first, since it is only in utf8proc.h, and not in utf8proc.c), and that was discussed and accepted... (see issues 26 and 29 in JuliaLang/utf8proc) [btw, how do you directly reference issues that are not in JuliaLang/julia? thx!]

pao · 2015-05-04T19:05:46Z

@ScottPJones JuliaLang/utf8proc#26

jakebolewski · 2015-05-06T10:38:35Z

@ScottPJones it would be great if you could organize your performance tests a bit and start contributing some of that code to test/perf. That way we can objectively look at the performance gains over time and make sure there are no regressions as you work on the string code.

Improve performance of length on UTF8String and UTF16String

ScottPJones · 2015-05-06T10:52:19Z

@jakebolewski Thanks! I wasn't aware of test/perf, will investigate now.

…String, using pure Julia implementation

…String

…String, using pure Julia implementation

…String

Improve performance of length on a UTF16String

0dad086

nalimilan reviewed May 2, 2015
View reviewed changes

JuliaLang#11107 Improve performance of length on a UTF16String

63be5dc

nalimilan reviewed May 4, 2015
View reviewed changes

ScottPJones mentioned this pull request May 4, 2015

doxygen + markdown to document C/C++ code #11123

Closed

JuliaLang#11107 Removed doxygen style documentation

31889d3

mbauman added the unicode Related to unicode characters and encodings label May 5, 2015

jakebolewski added a commit that referenced this pull request May 6, 2015

Merge pull request #11107 from ScottPJones/spj/u16_charnum

18b7b34

Improve performance of length on UTF8String and UTF16String

jakebolewski merged commit 18b7b34 into JuliaLang:master May 6, 2015

ScottPJones deleted the spj/u16_charnum branch May 7, 2015 23:55

mbauman pushed a commit to mbauman/julia that referenced this pull request Jun 6, 2015

JuliaLang#11107 Improve performance of length on a UTF16String

750b3e0

mbauman pushed a commit to mbauman/julia that referenced this pull request Jun 6, 2015

JuliaLang#11107 Improve performance of length of UTF8String and UTF16…

5bb4558

…String, using pure Julia implementation

mbauman pushed a commit to mbauman/julia that referenced this pull request Jun 6, 2015

JuliaLang#11107 Improve performance of length of UTF8String and UTF16…

4910341

…String

mbauman pushed a commit to mbauman/julia that referenced this pull request Jun 6, 2015

JuliaLang#11107 Removed unnecessary @inline macros

93933be

mbauman pushed a commit to mbauman/julia that referenced this pull request Jun 6, 2015

JuliaLang#11107 Removed doxygen style documentation

1d1cc3d

tkelman pushed a commit to tkelman/julia that referenced this pull request Jun 6, 2015

JuliaLang#11107 Improve performance of length on a UTF16String

d091933

tkelman pushed a commit to tkelman/julia that referenced this pull request Jun 6, 2015

JuliaLang#11107 Improve performance of length of UTF8String and UTF16…

7334ddd

…String, using pure Julia implementation

tkelman pushed a commit to tkelman/julia that referenced this pull request Jun 6, 2015

JuliaLang#11107 Improve performance of length of UTF8String and UTF16…

6637284

…String

tkelman pushed a commit to tkelman/julia that referenced this pull request Jun 6, 2015

JuliaLang#11107 Removed unnecessary @inline macros

9003916

tkelman pushed a commit to tkelman/julia that referenced this pull request Jun 6, 2015

JuliaLang#11107 Removed doxygen style documentation

b8e0095

Improve performance of length on UTF8String and UTF16String #11107

Improve performance of length on UTF8String and UTF16String #11107

Conversation

ScottPJones commented May 2, 2015

nalimilan May 2, 2015

Choose a reason for hiding this comment

ScottPJones May 2, 2015

Choose a reason for hiding this comment

nalimilan May 3, 2015

Choose a reason for hiding this comment

nalimilan commented May 2, 2015

ScottPJones commented May 2, 2015

ScottPJones commented May 2, 2015

vtjnash commented May 2, 2015

carnaval commented May 2, 2015

ScottPJones commented May 2, 2015

ScottPJones commented May 2, 2015

carnaval commented May 2, 2015

ScottPJones commented May 2, 2015

ScottPJones commented May 2, 2015

carnaval commented May 2, 2015

ScottPJones commented May 2, 2015

carnaval commented May 2, 2015

carnaval commented May 2, 2015

ScottPJones commented May 2, 2015

ScottPJones commented May 2, 2015

ScottPJones commented May 3, 2015

ScottPJones commented May 3, 2015

vtjnash commented May 3, 2015

ScottPJones commented May 3, 2015

vtjnash commented May 3, 2015

ScottPJones commented May 3, 2015

ScottPJones commented May 3, 2015

vtjnash commented May 3, 2015

ScottPJones commented May 3, 2015

ScottPJones commented May 3, 2015

quinnj commented May 4, 2015

ScottPJones commented May 4, 2015

nalimilan May 4, 2015

Choose a reason for hiding this comment

ScottPJones May 4, 2015

Choose a reason for hiding this comment

nalimilan commented May 4, 2015

carnaval commented May 4, 2015

ScottPJones commented May 4, 2015

nalimilan commented May 4, 2015

tkelman commented May 4, 2015

IainNZ commented May 4, 2015

ScottPJones commented May 4, 2015

ScottPJones commented May 4, 2015

pao commented May 4, 2015

jakebolewski commented May 6, 2015

ScottPJones commented May 6, 2015