-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix #10959 bugs with UTF-16 conversions #11551
Conversation
FWIW I think this is a big improvement over #11004, I really like the break up into multiple files. Honestly, I'm glad this is a separate PR and away from the 500 comment behemoth; getting directed to that on a git bisect would be terrible. +1 |
Could you summarize what's different here relative to the status of where #11004 left off? The diff appears to be only about 100 lines shorter here, but I'm having a hard time seeing exactly where. |
function convert(::Type{UTF16String}, str::UTF8String) | ||
dat = str.data | ||
# handle zero length string quickly | ||
sizeof(dat) == 0 && return empty_utf16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is best to return an explicit copy here, not an alias to a global. The empty UTF16String is immutable, but its best not to propagate undefined behavior if some user level function decides to muck with string internals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is the technique used by utf8.jl, I don't know who wrote that...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, indeed. @jakebolewski should we just add a copy
on
Line 123 in c85f1be
isempty(r) && return empty_utf8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is potentially creating tons of extra allocations ("" is very common) to protect against someone doing something bad really a good thing? Ban things like using C pointers then...
I incorporated most of the last round of comments from the reviews (I was running unit tests on that when it was closed 😞 ) [all of the things that I said I would fix], I moved (at least for now, I think it is a step backwards) some of the conversion functions from |
@throws never returns, always throws ArgumentError | ||
""" -> | ||
=# | ||
function utf_errfunc(errcode::Integer, charpos, invchar) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of ad-hoc template substitution, I still think it would be more explicit to construct and throw the exception at the point of the error. utf_errfunc
does not really cut down on code redundancy in throwing similar error messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My guess is this function is used to avoid creating GC frame. I still think it is the safest to mark it as @noinline
though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating a gc-frame is not the end of the world, it is created virtually everywhere in the Julia codebase. I would guess that the change would be marginal to insignificant in terms of performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually saw a major difference, in some early tests I did...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this extra indirection for error handling is unusual and should be removed for now. We don't have any other infrastructure anywhere else in base for internationalizing error messages, we don't need it here at this very moment. We can work on a more generic solution for that problem and start using it across the entire codebase later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ScottPJones we are not arguing that better error messages are not hugely beneficial. If writing the code this way (extracting out into a separate function, using global error constants) was motivated by performance issues then it is best to say:
I wrote this code first with inline
throw(Exception())
but that created a GC frame which degraded performance in the validation function. I extracted out the error functions this way and that resulted in a 25% performance increase shown by the following benchmark:
function bench()
# some code
end
@time bench()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This current structure was motived by a number of things:
- Long familiarity [25 years] with a very powerful system for having localizable messages,
where the messages can handle differences in word order between languages, and can handle
plain output messages, warning messages, and error messages, with any number of arguments
included in the output. - Fairly severe performance problems with GC frames, when I first implemented this by adding
a new UnicodeError. If that were totally fixed (see @yuyichao's RFC: Create local gc frame forthrow
#11508), I would prefer to move
back to using UnicodeError, but I really don't think that that should be a reason to hold this up. - Enum's not working in Base, at least not early enough to be useful for this.
When I first started implementing this, I used Enums, which is what all of those const's really
meant to be.
So, yes, these could be better, but not at the moment, because of issues out of my control.
I will definitely change them to be what I'd originally wanted (which I think you all as well would prefer), as soon as those issues are dealt with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- is not relevant here. You're continuing to argue from authority as resistance against changes we are all telling you will help this PR be smaller, less controversial, and more likely to be merged.
- If you had written that down, with numbers, saying "changing just the error code structure from something simple to this resulted in x% speedup on this benchmark gist" (btw please provide an executive summary of the numbers in those gists, there's way too much data there to parse at a glance) as @jakebolewski said, then we'd probably give in on this point if x were large enough.
- is superficial, Enums are new and not yet widely used, and would still result in this looking like C-written-in-Julia instead of idiomatic Julia code.
Every line of code in this PR is under your control. Dozens of people have suggested dozens of ways to break this down and make this substantially better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
I actually wasn't trying to "argue from authority" at all. @jakebolewski had talked about my motivations for making the change this way, and I was trying to answer that as best I could.
Are Julians always so sensitive about talking about one's experience, in the context of a technical discussion? IMO, it can help shed light on a issue... I, for example, am very interested to learn about Julian's experiences in the technical/numeric/scientific computing world. The type of arguing that I've seen here that does bother me is "argument from a position of authority", and people willing to use that position to curtail or stifle discussion. -
I have tried to give the numbers and executive summaries as I went along, but with 500+ comments, I think the early ones where I tested different error handling strategies got lost in the chaff. As soon as I have a minute (I am also busy working here in Belgium), I'll try to recreate my tests that pointed me to the issue of GC frame creation [and I know that @yuyichao is working hard to fix things so it wouldn't even be an issue]. I did give an executive summary just before I went to bed, did you miss it? 21-100% slower without the
@inbounds
macros... (I think more frequently closer to 100%). -
Well, first off, I haven't really seen that much consistency in the Julia code I've been looking at...
(maybe because it is older code, who knows), so I don't totally buy the argument that everything must be written in "idiomatic" Julia code. Different people also have different ideas of what constitutes good structure and practice... Does Julia really need some sort of unwritten style that everybody must follow? (at least, if it were written down, people might have a chance of knowing what was expected, and using that, or debating whether the current practice really is good...)
I'm not sure at all that the suggestions that I haven't incorporated so far would make things "substantially better"...
About the error handling: the reason the very first thing that I implemented was the better error handling, is that I would not have been able to easily find and fix all of the many bugs in the current code without the enhanced messages. It all ties together... this PR adds some extra overhead already to do complete validity checking, which people worried about in the initial comments on #11004... A big goal of mine was to at the very least, not introduce any regressions in performance (the performance was so bad already, that would have been criminal). Because of my approaches, not only is the code not slower, even with the extra checking, it is faster as well.
At least in my early benchmarking, the extra overhead in the inner loops (which is where the error checking happens!) made the performance frequently worse than the current bad performance.
All of the above is why I really do not think it is at all a good think to try to split this up as has been suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About the micro benchmarking... I know that the way it just runs things and uses @time
and everything is really crappy... I really haven't had time to learn how best to collect all of the information that I wish to and save it for later processing... I miss having a high speed database always available just by adding a ^
character to the associative array! (my changes to gc.c that already got merged actually help collect the data, but storing and retrieving)
Any pointers to how better to benchmark things like this, and produce a nice spreadsheet or graph in julia, would be greatly appreciated!
We already have 50ish comments worth of mostly minor review back-and-forth on this new PR. Given this is only about 10% different than #11004 I have a sneaky feeling @StefanKarpinski would object pretty strongly to merging it as is, even after addressing the review comments that have been made up to this point. So, question for you @ScottPJones. How small of a correct bugfix version of this could we distill this down to, strictly using edit3: Remember one of Jeff's mottos, underscores are a sign of missing abstraction. |
@throws ArgumentError | ||
""" -> | ||
=# | ||
function check_string_utf32(dat::Vector{UInt32}, len::Int, options::Integer=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is almost line-for-line identical with check_string_abs
. Can't you merge them? i.e. why can't you use something like ch, pos = next(dat, pos)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd answered this before... but again 1) I had done exactly that, and ran into severe performance degradation, which @JeffBezanson looked into and may have totally addressed, and would need some time to adequately benchmark, look into generated code, etc. before I would feel comfortable trying that again, 2) I had been told previously that I should not mix AbstractString
and Vector{T}
in a method like that, so I am getting conflicting requirements from the reviewers.
I don't think my question from https://github.com/JuliaLang/julia/pull/11004/files#r31562735 about checking for surrogates in |
I think that is a whole separate issue, and definitely should not be addressed here. |
@nalimilan I opened a new PR (I must be a glutton for punishment!) where I think it might be a good place to discuss the issue you raised. (#11558) |
@tkelman Maybe, I'll try that out, and look at the native code generated. Thanks! |
OK, I haven't gotten everything asked for changed (because I want to make sure the performance doesn't suffer, esp. to the point of being a regression compared to the current code), but hopefully this will make people happier, and it still maintains my design goals. |
Not really relavant for this PR but the travis test got killed by OOM killer and the test passes??? WTH? |
Yes, what's up with the strange OOM errors? I saw cases before my change, so I don't think it is related. |
Yeah, the OOM killer has nothing to do with this PR, the I was just a little surprised that the OOM killer firing doesn't make the test fail...... |
Yes, such a nasty ending definitely should make the test fail (whether it was caused by the PR or not!) |
totalchar = num2byte = num3byte = num4byte = 0 | ||
@inbounds while pos < len | ||
if T == AbstractString | ||
ch, pos = next(dat, pos) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason this wouldn't work for the vector cases too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please be more specific... wouldn't what work for the vector cases? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ch, pos = nex(dat, pos)
, you might not need this if T == AbstractString
at all
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! I had just copied that from the older code that walked over an AbstractString
... I had thought, for an AbstractString
, you couldn't count on the next position being +1, so that's why you had to use next
, and start
, and endof
. Given that that is the way most of the rest of the code dealing with AbstractString
s was coded, could we just leave that as is for now, and I'll try to investigate it more thoroughly later? (and then I could revamp the other places as well, all in other small PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... or did you mean to change it do always do ch, pos = next(dat, pos)
?
I'll take a quick look at the code generated in both cases... that would be great if it boiled down to the same code... 😀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant to always use next
, since that's more general and should have the same meaning for simple vectors - and ideally generate nearly the same code
### Returns: | ||
* `UTF8String` | ||
" | ||
function encode_to_utf8{T<:Union{UInt16, UInt32}}(::Type{T}, dat, len) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just encode_to_utf8{T<:Union{UInt16, UInt32}}(dat::AbstractVector{T}, len)
, rather than redundantly passing T
... isn't dat
always going to be some kind of vector of T
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was more for the future, where I want the first argument to really be an Encoding, see @quinnj's nice Strings.jl that he's been working on.
@stevengj Are my responses enough to satisfy you on the 3 notes, or do you want things changed? Thanks! |
@stevengj I just tried to remove those (from socket.jl
error during bootstrap:
LoadError(at "sysimg.jl" line 128: LoadError(at "socket.jl" line 644: StackOverflowError()))
rec_backtrace at /j/julia/src/task.c:644
eval at /j/julia/usr/lib/libjulia.dylib (unknown line)
jl_parse_eval_all at /j/julia/src/toplevel.c:567
jl_load at /j/julia/src/toplevel.c:610
include at boot.jl:254
jl_apply at /j/julia/src/interpreter.c:55
eval at /j/julia/src/interpreter.c:212
jl_toplevel_eval_flex at /j/julia/src/toplevel.c:517
jl_eval_module_expr at /j/julia/src/toplevel.c:156
jl_parse_eval_all at /j/julia/src/toplevel.c:567
jl_load at /j/julia/src/toplevel.c:610
exec_program at /j/julia/usr/bin/julia (unknown line)
true_main at /j/julia/usr/bin/julia (unknown line)
main at /j/julia/usr/bin/julia (unknown line) |
Rewrote a number of the conversions between ASCIIString, UTF8String, and UTF16String. Rewrote length() for UTF16String(). Improved reverse() for UTF16String(). Added over 150 lines of testing code to detect the above conversion problems Added (in a gist) code to show other conversion problems not yet fixed: https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc Added (in a gist) code to benchmark the performance, to ensure that adding the extra validity checking did not adversely affect performance (in fact, performance was greatly improved). https://gist.github.com/ScottPJones/79ed895f05f85f333d84 Updated based on review comments Changes to error handling and check_string Rebased against JuliaLang#11575 Updated comment to go before function, not indented by 4 Updated to use unsafe_checkstring Removed redundant argument documentation
Can this go in so we can move on to ripping apart my next set of bug fixes? 😀 |
I think this looks fine now, doesn't look like there are any obvious candidates for much code reuse in what's left. |
Bump: this code hasn't changed in almost two weeks (just been rebased to keep up with other changes being put into base). Anything preventing it from being merged in now? |
LGTM. |
Fix #10959 bugs with UTF-16 conversions
Thanks, @tkelman! |
Added new `convert` methods that use the `checkstring` function to validate input Added tests for many sorts of valid/invalid data Depends on PR JuliaLang#11551 and JuliaLang#11575
Crap, I think I'm going to have to revert this, I'm now getting an error |
Scratch that, I think it's an unrelated problem caused by a mistake I made, will fix. |
Added new `convert` methods that use the `checkstring` function to validate input Added tests for many sorts of valid/invalid data Depends on PR JuliaLang#11551 and JuliaLang#11575
There is also a fix to JSON.jl in JuliaIO/JSON.jl#111 |
Added new `convert` methods that use the `checkstring` function to validate input Added tests for many sorts of valid/invalid data Depends on PR JuliaLang#11551 and JuliaLang#11575
Added new `convert` methods that use the `checkstring` function to validate input Added tests for many sorts of valid/invalid data Depends on PR JuliaLang#11551 and JuliaLang#11575
Added new `convert` methods that use the `checkstring` function to validate input Added tests for many sorts of valid/invalid data Depends on PR JuliaLang#11551 and JuliaLang#11575 Updated to use unsafe_checkstring, fix comments Remove conversions from Vector{UInt32} Move some code from utf32.jl to utf16.jl and utf8.jl, hopefully more logical
Added new `convert` methods that use the `checkstring` function to validate input Added tests for many sorts of valid/invalid data Depends on PR JuliaLang#11551 and JuliaLang#11575 Updated to use unsafe_checkstring, fix comments Remove conversions from Vector{UInt32} Move some code from utf32.jl to utf16.jl and utf8.jl, hopefully more logical
Added new `convert` methods that use the `checkstring` function to validate input Added tests for many sorts of valid/invalid data Depends on PR JuliaLang#11551 and JuliaLang#11575
Added new `convert` methods that use the `checkstring` function to validate input Added tests for many sorts of valid/invalid data Depends on PR JuliaLang#11551 and JuliaLang#11575 Updated to use unsafe_checkstring, fix comments Remove conversions from Vector{UInt32} Move some code from utf32.jl to utf16.jl and utf8.jl, hopefully more logical
Rewrote a number of the conversions between
ASCIIString
,UTF8String
, andUTF16String
.Rewrote
length()
forUTF16String
.Improved
reverse()
forUTF16String
.Added over 150 lines of testing code to detect the above conversion problems
Added (in a gist) code to show other conversion problems not yet fixed:
https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc
Added (in a gist) code to benchmark the performance, to ensure that adding the extra validity
checking did not adversely affect performance (in fact, performance was greatly improved).
https://gist.github.com/ScottPJones/79ed895f05f85f333d84