Fix #10959 bugs with UTF-16 conversions #11551

ScottPJones · 2015-06-03T01:22:27Z

Rewrote a number of the conversions between ASCIIString, UTF8String, and UTF16String.
Rewrote length() for UTF16String.
Improved reverse() for UTF16String.

Added over 150 lines of testing code to detect the above conversion problems

Added (in a gist) code to show other conversion problems not yet fixed:
https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc

Added (in a gist) code to benchmark the performance, to ensure that adding the extra validity
checking did not adversely affect performance (in fact, performance was greatly improved).
https://gist.github.com/ScottPJones/79ed895f05f85f333d84

hayd · 2015-06-03T01:40:18Z

FWIW I think this is a big improvement over #11004, I really like the break up into multiple files. Honestly, I'm glad this is a separate PR and away from the 500 comment behemoth; getting directed to that on a git bisect would be terrible. +1

ScottPJones · 2015-06-03T01:42:11Z

@hayd #11004 already had the breakup into multiple files... that's what surprised me about closing it off in the middle of active review... but I am glad to get away from the 500 comments, CodeHub crashes when trying to read it!

tkelman · 2015-06-03T01:44:51Z

Could you summarize what's different here relative to the status of where #11004 left off? The diff appears to be only about 100 lines shorter here, but I'm having a hard time seeing exactly where.

jakebolewski · 2015-06-03T01:49:51Z

base/utfconvert.jl

+function convert(::Type{UTF16String}, str::UTF8String)
+    dat = str.data
+    # handle zero length string quickly
+    sizeof(dat) == 0 && return empty_utf16


I think it is best to return an explicit copy here, not an alias to a global. The empty UTF16String is immutable, but it's best not to propagate undefined behavior if some user level function decides to muck with string internals.

That is the technique used by utf8.jl, I don't know who wrote that...

Ah, indeed. @jakebolewski should we just add a copy on

julia/base/utf8.jl

Line 123 in c85f1be

isempty(r) && return empty_utf8

?

Is potentially creating tons of extra allocations ("" is very common) to protect against someone doing something bad really a good thing? Ban things like using C pointers then...
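The trade-off discussed in this thread (a shared global versus a fresh copy) can be sketched in current Julia. This is a minimal illustration only: `Utf16Buf`, `EMPTY_SHARED`, `shared_empty`, and `fresh_empty` are hypothetical names, not code from this PR, and the wrapper type is only a stand-in for a string type backed by a mutable buffer.

```julia
# Hypothetical stand-in for a string type that wraps a mutable code-unit buffer.
struct Utf16Buf
    data::Vector{UInt16}
end

# One shared instance, analogous in spirit to `empty_utf16` in the diff above.
const EMPTY_SHARED = Utf16Buf(UInt16[])

shared_empty() = EMPTY_SHARED          # every caller gets the same buffer
fresh_empty()  = Utf16Buf(UInt16[])    # every caller gets a new buffer (one allocation each)

a = shared_empty()
b = shared_empty()
push!(a.data, 0x0041)                  # user code mutating internals through one handle
shared_corrupted = !isempty(b.data)    # the other handle sees the change

c = fresh_empty()
d = fresh_empty()
push!(c.data, 0x0041)
fresh_isolated = isempty(d.data)       # independent buffers are unaffected
```

The shared version avoids an allocation per empty string, at the price that any code reaching into `data` affects every holder of the shared value.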

ScottPJones · 2015-06-03T01:51:31Z

I incorporated most of the last round of comments from the reviews (I was running unit tests on that when it was closed 😞 ) [all of the things that I said I would fix]. I moved (at least for now, I think it is a step backwards) some of the conversion functions from utfconvert.jl to utf16.jl/utf32.jl to make the diffs cleaner (and so I wouldn't get dinged for stuff I didn't change). There is some potential for further refactoring of a couple of functions, and removal of some functions, but those suggestions were either definitely or likely to be breaking changes, and so I didn't feel they were appropriate for this PR.

jakebolewski · 2015-06-03T01:54:25Z

base/utferror.jl

+@throws never returns, always throws ArgumentError
+""" ->
+=#
+function utf_errfunc(errcode::Integer, charpos, invchar)


Instead of ad-hoc template substitution, I still think it would be more explicit to construct and throw the exception at the point of the error. utf_errfunc does not really cut down on code redundancy in throwing similar error messages.

My guess is this function is used to avoid creating GC frame. I still think it is the safest to mark it as @noinline though.

Creating a gc-frame is not the end of the world, it is created virtually everywhere in the Julia codebase. I would guess that the change would be marginal to insignificant in terms of performance.

I actually saw a major difference, in some early tests I did...

I agree that this extra indirection for error handling is unusual and should be removed for now. We don't have any other infrastructure anywhere else in base for internationalizing error messages, we don't need it here at this very moment. We can work on a more generic solution for that problem and start using it across the entire codebase later.

@ScottPJones we are not arguing that better error messages are not hugely beneficial. If writing the code this way (extracting out into a separate function, using global error constants) was motivated by performance issues then it is best to say:

I wrote this code first with inline throw(Exception()) but that created a GC frame which degraded performance in the validation function. I extracted out the error functions this way and that resulted in a 25% performance increase shown by the following benchmark:

    function bench()
        # some code
    end
    @time bench()
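A benchmark of the kind requested here might look like the following in current Julia. It is a sketch under stated assumptions: `throw_bad_unit`, `is_surrogate`, `count_units_inline`, `count_units_helper`, and the three-argument `bench` are hypothetical names, and the check is deliberately simplified (it rejects every surrogate code unit) rather than being the PR's validation logic. No timings are claimed; the numbers depend on the Julia version and machine.

```julia
# Hypothetical out-of-line helper: the throw lives outside the hot loop.
@noinline throw_bad_unit(pos, unit) =
    throw(ArgumentError("invalid code unit 0x$(string(unit, base = 16)) at index $pos"))

# Deliberately simplified check: any surrogate code unit counts as invalid.
is_surrogate(unit::UInt16) = 0xd800 <= unit <= 0xdfff

# Variant 1: construct and throw the exception at the point of the error.
function count_units_inline(dat::Vector{UInt16})
    n = 0
    for pos in eachindex(dat)
        unit = dat[pos]
        if is_surrogate(unit)
            throw(ArgumentError("invalid code unit 0x$(string(unit, base = 16)) at index $pos"))
        end
        n += 1
    end
    return n
end

# Variant 2: identical loop, but the error path is a call to the helper.
function count_units_helper(dat::Vector{UInt16})
    n = 0
    for pos in eachindex(dat)
        unit = dat[pos]
        is_surrogate(unit) && throw_bad_unit(pos, unit)
        n += 1
    end
    return n
end

# Run `f` over the same data `reps` times so that @time measures a meaningful interval.
function bench(f, dat, reps)
    total = 0
    for _ in 1:reps
        total += f(dat)
    end
    return total
end

data = fill(0x0041, 100_000)            # valid input, so the error path is never taken
bench(count_units_inline, data, 1)      # warm up (compile) before timing
bench(count_units_helper, data, 1)
@time bench(count_units_inline, data, 100)
@time bench(count_units_helper, data, 100)
```

Reporting the two `@time` lines side by side, together with the Julia version used, is the sort of short summary the reviewers ask for in this thread.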

This current structure was motivated by a number of things:

1. Long familiarity [25 years] with a very powerful system for having localizable messages, where the messages can handle differences in word order between languages, and can handle plain output messages, warning messages, and error messages, with any number of arguments included in the output.

2. Fairly severe performance problems with GC frames, when I first implemented this by adding a new UnicodeError. If that were totally fixed (see @yuyichao's RFC: Create local gc frame for throw #11508), I would prefer to move back to using UnicodeError, but I really don't think that that should be a reason to hold this up.

3. Enums not working in Base, at least not early enough to be useful for this. When I first started implementing this, I used Enums, which is what all of those consts are really meant to be.

So, yes, these could be better, but not at the moment, because of issues out of my control.
I will definitely change them to be what I'd originally wanted (which I think you all as well would prefer), as soon as those issues are dealt with.

Point 1 is not relevant here. You're continuing to argue from authority as resistance against changes we are all telling you will help this PR be smaller, less controversial, and more likely to be merged.

On point 2: if you had written that down, with numbers, saying "changing just the error code structure from something simple to this resulted in x% speedup on this benchmark gist" (btw please provide an executive summary of the numbers in those gists, there's way too much data there to parse at a glance) as @jakebolewski said, then we'd probably give in on this point if x were large enough.

Point 3 is superficial: Enums are new and not yet widely used, and would still result in this looking like C-written-in-Julia instead of idiomatic Julia code.

Every line of code in this PR is under your control. Dozens of people have suggested dozens of ways to break this down and make this substantially better.

I actually wasn't trying to "argue from authority" at all. @jakebolewski had talked about my motivations for making the change this way, and I was trying to answer that as best I could.
Are Julians always so sensitive about talking about one's experience, in the context of a technical discussion? IMO, it can help shed light on an issue... I, for example, am very interested to learn about Julians' experiences in the technical/numeric/scientific computing world. The type of arguing that I've seen here that does bother me is "argument from a position of authority", and people willing to use that position to curtail or stifle discussion.

I have tried to give the numbers and executive summaries as I went along, but with 500+ comments, I think the early ones where I tested different error handling strategies got lost in the chaff. As soon as I have a minute (I am also busy working here in Belgium), I'll try to recreate my tests that pointed me to the issue of GC frame creation [and I know that @yuyichao is working hard to fix things so it wouldn't even be an issue]. I did give an executive summary just before I went to bed, did you miss it? 21-100% slower without the @inbounds macros... (I think more frequently closer to 100%).

Well, first off, I haven't really seen that much consistency in the Julia code I've been looking at...
(maybe because it is older code, who knows), so I don't totally buy the argument that everything must be written in "idiomatic" Julia code. Different people also have different ideas of what constitutes good structure and practice... Does Julia really need some sort of unwritten style that everybody must follow? (at least, if it were written down, people might have a chance of knowing what was expected, and using that, or debating whether the current practice really is good...)
I'm not sure at all that the suggestions that I haven't incorporated so far would make things "substantially better"...

About the error handling: the reason the very first thing that I implemented was the better error handling, is that I would not have been able to easily find and fix all of the many bugs in the current code without the enhanced messages. It all ties together... this PR adds some extra overhead already to do complete validity checking, which people worried about in the initial comments on #11004... A big goal of mine was to at the very least, not introduce any regressions in performance (the performance was so bad already, that would have been criminal). Because of my approaches, not only is the code not slower, even with the extra checking, it is faster as well.
At least in my early benchmarking, the extra overhead in the inner loops (which is where the error checking happens!) made the performance frequently worse than the current bad performance.

All of the above is why I really do not think it is at all a good thing to try to split this up as has been suggested.

About the micro benchmarking... I know that the way it just runs things and uses @time and everything is really crappy... I really haven't had time to learn how best to collect all of the information that I wish to and save it for later processing... I miss having a high speed database always available just by adding a ^ character to the associative array! (my changes to gc.c that already got merged actually help collect the data, but storing and retrieving)
Any pointers to how better to benchmark things like this, and produce a nice spreadsheet or graph in julia, would be greatly appreciated!

tkelman · 2015-06-03T03:55:31Z

We already have 50ish comments worth of mostly minor review back-and-forth on this new PR. Given this is only about 10% different than #11004 I have a sneaky feeling @StefanKarpinski would object pretty strongly to merging it as is, even after addressing the review comments that have been made up to this point.

So, question for you @ScottPJones. How small of a correct bugfix version of this could we distill this down to, strictly using check_string_abs? None of the other check_string_* variants for now [edit2: or parameterize them and reuse as much implementation as possible]. And as few as possible, maximally generic convert methods, parameterized on the input and output types to reuse as many lines of code as you can. [edit1: Remember, this PR is a bugfix.]

edit3: Remember one of Jeff's mottos, underscores are a sign of missing abstraction.

stevengj · 2015-06-03T06:22:44Z

base/utfcheck.jl

+@throws     ArgumentError
+""" ->
+=#
+function check_string_utf32(dat::Vector{UInt32}, len::Int, options::Integer=0)


This function is almost line-for-line identical with check_string_abs. Can't you merge them? i.e. why can't you use something like ch, pos = next(dat, pos)?

I'd answered this before... but again 1) I had done exactly that, and ran into severe performance degradation, which @JeffBezanson looked into and may have totally addressed, and would need some time to adequately benchmark, look into generated code, etc. before I would feel comfortable trying that again, 2) I had been told previously that I should not mix AbstractString and Vector{T} in a method like that, so I am getting conflicting requirements from the reviewers.

nalimilan · 2015-06-03T08:53:38Z

I don't think my question from https://github.com/JuliaLang/julia/pull/11004/files#r31562735 about checking for surrogates in Chars obtained by indexing an AbstractString (in check_string_abs) has been addressed. Do we consider part of the AbstractString interface that getindex must return valid codepoints (as I think we should)?

ScottPJones · 2015-06-03T08:57:55Z

I think that is a whole separate issue, and definitely should not be addressed here.
What I implemented correctly handles the way things currently are.
(I am not saying that it should not be addressed, I've already stated many times that I think the string types and Char should only contain 100% valid Unicode data... but that is a separate issue)

ScottPJones · 2015-06-03T11:52:41Z

@nalimilan I opened a new PR (I must be a glutton for punishment!) where I think it might be a good place to discuss the issue you raised. (#11558)

ScottPJones · 2015-06-03T13:23:40Z

@tkelman Maybe, I'll try that out, and look at the native code generated. Thanks!

ScottPJones · 2015-06-03T16:54:23Z

OK, I haven't gotten everything asked for changed (because I want to make sure the performance doesn't suffer, esp. to the point of being a regression compared to the current code), but hopefully this will make people happier, and it still maintains my design goals.
Especially look at the changes to utferror.jl and utfcheck.jl.
Thanks everybody for the time spent reviewing this, I really do appreciate it!

yuyichao · 2015-06-03T17:38:54Z

Not really relevant for this PR but the travis test got killed by OOM killer and the test passes??? WTH?

yuyichao · 2015-06-03T17:39:13Z

https://travis-ci.org/JuliaLang/julia/jobs/65276682#L4036

ScottPJones · 2015-06-03T23:34:59Z

Yes, what's up with the strange OOM errors? I saw cases before my change, so I don't think it is related.
Anyway, I'd appreciate it if people could take a look at the latest version, I was able to make things more generic without performance degrading (it seems @JeffBezanson's changes did fix the bugs I stumbled across before). Thanks!

yuyichao · 2015-06-03T23:43:44Z

Yeah, the OOM killer has nothing to do with this PR, the dmesg output is added as an indicator for OOM errors. This issue is tracked separately in #11553.

I was just a little surprised that the OOM killer firing doesn't make the test fail......

ScottPJones · 2015-06-03T23:57:25Z

Yes, such a nasty ending definitely should make the test fail (whether it was caused by the PR or not!)

tkelman · 2015-06-04T06:59:03Z

base/utfcheck.jl

+    totalchar = num2byte = num3byte = num4byte = 0
+    @inbounds while pos < len
+        if T == AbstractString
+             ch, pos = next(dat, pos)


any reason this wouldn't work for the vector cases too?

Please be more specific... wouldn't what work for the vector cases? Thanks!

ch, pos = next(dat, pos), you might not need this if T == AbstractString at all

Oh! I had just copied that from the older code that walked over an AbstractString... I had thought, for an AbstractString, you couldn't count on the next position being +1, so that's why you had to use next, and start, and endof. Given that that is the way most of the rest of the code dealing with AbstractStrings was coded, could we just leave that as is for now, and I'll try to investigate it more thoroughly later? (and then I could revamp the other places as well, all in another small PR)

Hmm... or did you mean to change it to always do ch, pos = next(dat, pos)?
I'll take a quick look at the code generated in both cases... that would be great if it boiled down to the same code... 😀

I meant to always use next, since that's more general and should have the same meaning for simple vectors - and ideally generate nearly the same code
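The idea of one generic loop for both strings and plain vectors can be sketched in current Julia, where `iterate` replaced the `start`/`next`/`done` protocol used in this 2015 thread. `count_widths` is a hypothetical function written for illustration, not the PR's `check_string`, and it performs no validity checking.

```julia
# Count code points and how many of them need 2, 3, or 4 bytes in UTF-8.
# Works for any iterable whose elements convert to UInt32, e.g. a String
# (elements are Char) or a Vector{UInt32} of code points.
function count_widths(dat)
    total = num2byte = num3byte = num4byte = 0
    it = iterate(dat)
    while it !== nothing
        ch, state = it           # same shape as `ch, pos = next(dat, pos)`
        cp = UInt32(ch)          # Char -> code point; UInt32 -> unchanged
        total += 1
        if cp > 0xffff
            num4byte += 1
        elseif cp > 0x7ff
            num3byte += 1
        elseif cp > 0x7f
            num2byte += 1
        end
        it = iterate(dat, state) # the iterator decides how far to advance
    end
    return (total, num2byte, num3byte, num4byte)
end

from_string = count_widths("a\u00e9\u20ac\U0001f600")
from_vector = count_widths(UInt32[0x61, 0xe9, 0x20ac, 0x1f600])
```

Because the iterator owns the position, the loop never assumes that the next index is the current one plus one, which is the concern raised above for AbstractString.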

stevengj · 2015-06-23T20:21:45Z

base/utf16.jl

+### Returns:
+* `UTF8String`
+"
+function encode_to_utf8{T<:Union{UInt16, UInt32}}(::Type{T}, dat, len)


Why not just encode_to_utf8{T<:Union{UInt16, UInt32}}(dat::AbstractVector{T}, len), rather than redundantly passing T ... isn't dat always going to be some kind of vector of T?

That was more for the future, where I want the first argument to really be an Encoding, see @quinnj's nice Strings.jl that he's been working on.
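The suggestion to recover `T` from the vector's element type, rather than passing it separately, might look like this in current Julia (the `where` clause replaces the 2015 `f{T}(...)` form). `utf8_byte_count` is a hypothetical helper, not the PR's `encode_to_utf8`; it assumes well-formed input and does not validate surrogate pairs.

```julia
# Number of UTF-8 bytes needed to encode the first `len` code units of `dat`.
# T is taken from the argument's element type instead of a separate ::Type{T} argument.
function utf8_byte_count(dat::AbstractVector{T},
                         len::Integer = length(dat)) where {T<:Union{UInt16,UInt32}}
    nbytes = 0
    pos = 1
    while pos <= len
        cu = UInt32(dat[pos])
        pos += 1
        if T === UInt16 && 0xd800 <= cu <= 0xdbff && pos <= len
            # Lead surrogate: assume (without checking) that a trail surrogate follows.
            pos += 1
            nbytes += 4
        elseif cu <= 0x7f
            nbytes += 1
        elseif cu <= 0x7ff
            nbytes += 2
        elseif cu <= 0xffff
            nbytes += 3
        else
            nbytes += 4
        end
    end
    return nbytes
end

n16 = utf8_byte_count(UInt16[0x0061, 0x00e9, 0x20ac, 0xd83d, 0xde00])
n32 = utf8_byte_count(UInt32[0x61, 0xe9, 0x20ac, 0x1f600])
```

The `T === UInt16` test is resolved per method specialization, so the surrogate branch only matters for UTF-16 input. A design that dispatches on an encoding type, as mentioned in the reply above, would need the extra argument back.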

ScottPJones · 2015-06-24T19:08:02Z

@stevengj Are my responses enough to satisfy you on the 3 notes, or do you want things changed? Thanks!

ScottPJones · 2015-06-24T19:19:38Z

@stevengj I just tried to remove those (from ascii.jl, utf8.jl, utf16.jl, and utf32.jl), but ended up with the following fun:

socket.jl
error during bootstrap:
LoadError(at "sysimg.jl" line 128: LoadError(at "socket.jl" line 644: StackOverflowError()))
rec_backtrace at /j/julia/src/task.c:644
eval at /j/julia/usr/lib/libjulia.dylib (unknown line)
jl_parse_eval_all at /j/julia/src/toplevel.c:567
jl_load at /j/julia/src/toplevel.c:610
include at boot.jl:254
jl_apply at /j/julia/src/interpreter.c:55
eval at /j/julia/src/interpreter.c:212
jl_toplevel_eval_flex at /j/julia/src/toplevel.c:517
jl_eval_module_expr at /j/julia/src/toplevel.c:156
jl_parse_eval_all at /j/julia/src/toplevel.c:567
jl_load at /j/julia/src/toplevel.c:610
exec_program at /j/julia/usr/bin/julia (unknown line)
true_main at /j/julia/usr/bin/julia (unknown line)
main at /j/julia/usr/bin/julia (unknown line)

Rewrote a number of the conversions between ASCIIString, UTF8String, and UTF16String.
Rewrote length() for UTF16String().
Improved reverse() for UTF16String().

Added over 150 lines of testing code to detect the above conversion problems

Added (in a gist) code to show other conversion problems not yet fixed:
https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc

Added (in a gist) code to benchmark the performance, to ensure that adding the extra validity
checking did not adversely affect performance (in fact, performance was greatly improved).
https://gist.github.com/ScottPJones/79ed895f05f85f333d84

Updated based on review comments
Changes to error handling and check_string
Rebased against JuliaLang#11575
Updated comment to go before function, not indented by 4
Updated to use unsafe_checkstring
Removed redundant argument documentation

ScottPJones · 2015-06-26T21:43:39Z

Can this go in so we can move on to ripping apart my next set of bug fixes? 😀

tkelman · 2015-06-27T03:32:25Z

I think this looks fine now, doesn't look like there are any obvious candidates for much code reuse in what's left.

ScottPJones · 2015-06-30T11:30:11Z

Bump: this code hasn't changed in almost two weeks (just been rebased to keep up with other changes being put into base). Anything preventing it from being merged in now?

stevengj · 2015-06-30T21:13:11Z

LGTM.

Fix #10959 bugs with UTF-16 conversions

ScottPJones · 2015-07-01T03:27:16Z

Thanks, @tkelman!

Added new `convert` methods that use the `checkstring` function to validate input Added tests for many sorts of valid/invalid data Depends on PR JuliaLang#11551 and JuliaLang#11575

tkelman · 2015-07-01T04:18:53Z

Crap, I think I'm going to have to revert this, I'm now getting an error syntax: invalid character "�". This is one case where it would've been ideal to run an immediately pre-merge integration test. Please restore the branch, make a copy of it and file a new PR.

tkelman · 2015-07-01T04:31:12Z

Scratch that, I think it's an unrelated problem caused by a mistake I made, will fix.

ScottPJones · 2015-07-01T18:07:02Z

There is also a fix to JSON.jl in JuliaIO/JSON.jl#111

Added new `convert` methods that use the `checkstring` function to validate input Added tests for many sorts of valid/invalid data Depends on PR JuliaLang#11551 and JuliaLang#11575 Updated to use unsafe_checkstring, fix comments Remove conversions from Vector{UInt32} Move some code from utf32.jl to utf16.jl and utf8.jl, hopefully more logical

hayd mentioned this pull request Jun 3, 2015

Fix Unicode bugs with UTF-16/UTF-32 conversions (#10959) #11004

Closed

jakebolewski reviewed Jun 3, 2015

stevengj reviewed Jun 3, 2015

ScottPJones mentioned this pull request Jun 3, 2015

RFC: Roadmap for improving string support in Julia #11558

Closed

ScottPJones force-pushed the spj/fixutf branch from 0b6619a to 58cc026 Compare June 3, 2015 11:47

ScottPJones force-pushed the spj/fixutf branch from 58cc026 to 4592c7b Compare June 3, 2015 16:50

tkelman reviewed Jun 4, 2015

stevengj reviewed Jun 23, 2015

ScottPJones added 4 commits June 25, 2015 19:03

Fix AbstractVector{UInt16} conversion

a0c273b

Remove support for converting Vector{UInt16} to UTF8String

00f0200

Update UTF16String map function

2ab9334

ScottPJones force-pushed the spj/fixutf branch from 5aecd65 to 2ab9334 Compare June 25, 2015 23:08

tkelman added a commit that referenced this pull request Jul 1, 2015

Merge pull request #11551 from ScottPJones/spj/fixutf

9071f14

Fix #10959 bugs with UTF-16 conversions

tkelman merged commit 9071f14 into JuliaLang:master Jul 1, 2015

ScottPJones deleted the spj/fixutf branch July 1, 2015 03:32

swt30 mentioned this pull request Jul 1, 2015

import JSON fails with utf16_is_surrogate undefined JuliaIO/JSON.jl#110

Closed

tkelman mentioned this pull request Jul 1, 2015

Deprecate utf16_is_*, utf16_get_supplementary, is_utf8_*, add @unexportedwarning macro, add tests #11976

Closed
