Improve performance of length on UTF8String and UTF16String #11107
 * @param[in] iStr UTF-8 encoded string
 * @param[in] iLen Length of the string in bytes
 *
 * @return number of logical characters
Probably better to say "codepoints" instead of "logical character".
At least I documented it! ;-) From where I'm coming from, I thought logical character was actually clearer... a lot of people get codepoints vs. the value of the byte/word/32-bit confused.
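To make the distinction between bytes and code points concrete, here is a minimal C sketch. It is not the PR's `u8_charnum`: the name `count_codepoints_utf8` is hypothetical, and the input is assumed to be valid UTF-8. It counts every byte that is not a continuation byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not the PR's u8_charnum): counts code points in a
 * buffer that is ASSUMED to contain valid UTF-8. No validation is done. */
size_t count_codepoints_utf8(const char *s, size_t len)
{
    size_t n = 0;

    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        /* Continuation bytes look like 10xxxxxx; any other byte
         * starts a new code point. */
        n += ((unsigned char)s[i] & 0xC0) != 0x80;
    }
    return n;
}
```

For example, the three bytes of `e` followed by U+0301 (combining acute accent) count as two code points, even though they are displayed as a single grapheme cluster.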
"logical character" can mean either code point or grapheme cluster, so I don't think it's clearer. A Google search for the former does not give anything useful, while a search for the latter immediately leads to the definition on Wikipedia.
Do you really need to rely on a C function to do that? I'm not sure why this is currently the case for UTF-8, but I think it may have to do with the special status of
It is for speed; it simply was not nearly fast enough in Julia... about 3x slower... the Julia team will need to look at that, but for now, given that u8_charnum is already there, I don't think it should prevent this from being merged in.
@nalimilan Better now? Thanks very much for the input!
[citation needed]
We should definitely be able to get a simple loop like that down to the same IR as, e.g., clang would. If not I'd consider that a compiler bug.
@carnaval I'm all for that, and as I've said earlier, the moment the Julia version is within, say, 5-10% of the C version, the C version could be dropped... Unfortunately, that is not the case at present, and the people who would be able to improve it are probably working 90+ hour weeks as it is!
@vtjnash I'll try to dig that code up, or recreate it... as I've said before, I am not opposed to having everything all written in Julia... but I need the performance now, not next year, and I don't think it is unreasonable to add a few lines of C to Julia, that are exactly like the code that is already there,
(disclaimer: I know nothing about string encoding)

    @inline is_surrogate_trail(c::UInt16) = (c & ~UInt16(0x3ff)) == UInt16(0xdc00)
    function length2(s::UTF16String)
        d = s.data
        len = length(d) - 1
        len == 0 && return 0
        cnum = 0
        for i = 1:len
            @inbounds cnum += !is_surrogate_trail(d[i])
        end
        cnum
    end

Note: my LLVM version (3.6) auto-vectorizes this nicely, so it may be that your C compiler is doing it but not the LLVM version you are using with Julia.
@carnaval This is what I got: for 64K strings, the Julia code was 2-2.2x slower, for 16 byte strings, it was between 41-72% slower... that's a very significant difference... The difference was < 3x now, because my Julia code is better than it was a couple of weeks ago ;-)
@carnaval That's good... but I couldn't find the
Here is a gist with my results and (really bad) benchmarking code... [I really want to know how best in Julia to get the 2 numbers from
As I said, you have to check if your C version is getting vectorized (likely) for example by running
@carnaval I would think it's the same LLVM... the C code is in utf8.c and is built as part of Julia, or have I missed something?
And now that I look at your code, I'm pretty sure that the bound checks are not being elided, which prevents vectorization because of the possibility of diverging control flow. See the
There is no particular reason for the C source to be compiled by clang using the same LLVM version we use for the codegen. Any C compiler does the job.
@carnaval OK, very helpful! As I think I've said, my very first attempt at this was in Julia, but then I saw that it was still slower than the UTF-8 version (and should have been faster), and enough slower that I felt I wouldn't be able to use it... and also, @JeffBezanson had it as a C function, which I had already sped up last week by a good amount, and if it was good enough for somebody like him, one of the Jedi Knights of Julia, I thought it had to be good enough for me!
@carnaval Also good to know that it won't necessarily be using the same LLVM as Julia... I often am running with the pre-release Xcode versions on my machine... Thanks!
@carnaval Thanks so much... I just ran with your
My question now to the peanut gallery... since this is a mixed bag, should I change this to the pure Julia version? It is a bit slower than what I'd considered to be my pain threshold, but only for short strings (though those are more common), and I'd be putting my faith in the Julia team to look into that case [it's a challenge for them]. Or should the C version be accepted for now? Thanks so much for the help!
any idea why the
yes, the julia-only version can be merged
@carnaval About the Julia code, I'd just copied that from somebody else's Julia code in one of the utfXX.jl modules, and just changed the testing in the inner loop [to simply assume that it was valid UTF-8 or UTF-16, and then simply not count the trailing surrogates or continuation bytes]
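The approach described here (assume the data is valid, then skip trailing surrogates) can be sketched in C for the UTF-16 case. This is only an illustration, not code from the PR: `count_codepoints_utf16` is a hypothetical name, and the buffer is assumed to hold well-formed UTF-16 with `len` given in 16-bit code units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the PR): counts code points in a buffer
 * that is ASSUMED to contain well-formed UTF-16. No validation is done. */
size_t count_codepoints_utf16(const uint16_t *units, size_t len)
{
    size_t n = 0;

    assert(units != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        /* Trail (low) surrogates lie in 0xDC00..0xDFFF, so clearing the
         * low 10 bits leaves 0xDC00. Every other unit is counted once. */
        n += (units[i] & 0xFC00) != 0xDC00;
    }
    return n;
}
```

The loop body has no branches, which is the kind of shape a compiler can auto-vectorize; whether it actually does depends on the compiler and flags used.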
(note, i believe that
@vtjnash I haven't had time to test all the permutations... but I'm very excited at the results!
@vtjnash I made a copy so I could put the
yes, if it helps performance. otherwise, no, since it can make bugs much harder to find. Automated bounds checking elision is something that is very difficult for a compiler, but not too difficult for a human.
Also, I think I'll just replace the UTF16String version of length, because the old Julia code was always much worse than this... I'll leave the optimized UTF8String version for now, so as not to inadvertently slow things down, right after the change I put in last week to speed it up!
@ScottPJones, please put backticks (
@quinnj Sorry, someone had already warned me about that, and I've tried to be careful... Julia makes it a bit hard though! ;-)
 *
 * @return Number of logical characters (or codepoints) in string (or substring)
 */
DLLEXPORT size_t u8_charnum(const char *iStr, size_t iLen);
You're using different argument names here. Also, I don't think the docs should be repeated in the headers.
OK... I'll clean that up, thanks!
Thanks, this looks like a nice improvement. Though I'm not sure anything has been decided about the documentation format for C code.
@JeffBezanson I know, but the C code I've written reloads it every time also. In that case it should be quite easy for it to see that the load is loop invariant. I was trying to understand why LLVM (really, our set of passes) can't figure it out on its own.
@nalimilan About the documentation format... is there a problem with doing something (doxygen with Markdown markup)? Julia has been around for many years now, and there's been no decision on documentation standards? Sorry if it seems snarky, it's just very frustrating to deal with code that isn't documented, and from previous experience with doxygen, it does a pretty good job...
@ScottPJones That's not for me to decide. I agree having a standardized documentation format is a good idea, but better think about it beforehand. :-)
No there has not been a decision. There have however been consistent, well-warranted complaints that the C code and some core parts of the standard library are under-documented and difficult to understand by anyone who didn't write them. I would actually be highly in favor of a concerted effort to go through and document what's there, if there are people who are willing to help with that. The only issue, and I think it's a minor one, is that there are parts of the codebase that change rapidly, and keeping the documentation up-to-date with the code means a bit of extra work. I do think Julia's a large enough project now with enough people regularly trying to look at and work with the internals that it's past time for that to be a good enough reason to resist documenting things.
that too
(should move this documentation conversation to a new issue)
Would the PTB want me to remove the doxygen style documentation that slipped in?
@nalimilan Another thing... doxygen style (using Markdown) comments have already been accepted in the utf8proc module (I hadn't noticed it at first, since it is only in utf8proc.h, and not in utf8proc.c), and that was discussed and accepted... (see issues 26 and 29 in JuliaLang/utf8proc) [btw, how do you directly reference issues that are not in JuliaLang/julia? thx!]
@ScottPJones
@ScottPJones it would be great if you could organize your performance tests a bit and start contributing some of that code to
Improve performance of length on UTF8String and UTF16String
@jakebolewski Thanks! I wasn't aware of
…String, using pure Julia implementation
This adds an optimized version of length for UTF16String; it is about an order of magnitude faster (though it is still O(n) like UTF8String, instead of O(1) like ASCIIString or UTF32String).