Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #10959 bugs with UTF-16 conversions #11551

Merged
merged 4 commits into from
Jul 1, 2015

Conversation

ScottPJones
Copy link
Contributor

Rewrote a number of the conversions between ASCIIString, UTF8String, and UTF16String.
Rewrote length() for UTF16String.
Improved reverse() for UTF16String.

Added over 150 lines of testing code to detect the above conversion problems

Added (in a gist) code to show other conversion problems not yet fixed:
https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc

Added (in a gist) code to benchmark the performance, to ensure that adding the extra validity
checking did not adversely affect performance (in fact, performance was greatly improved).
https://gist.github.com/ScottPJones/79ed895f05f85f333d84

@hayd
Copy link
Member

hayd commented Jun 3, 2015

FWIW I think this is a big improvement over #11004, I really like the break up into multiple files. Honestly, I'm glad this is a separate PR and away from the 500 comment behemoth; getting directed to that on a git bisect would be terrible. +1

@ScottPJones
Copy link
Contributor Author

@hayd #11004 already had the breakup into multiple files... that's what surprised me about the closing it off in the middle of active review... but I am glad to get away from the 500 comments, CodeHub crashes when trying to read it!

@tkelman
Copy link
Contributor

tkelman commented Jun 3, 2015

Could you summarize what's different here relative to the status of where #11004 left off? The diff appears to be only about 100 lines shorter here, but I'm having a hard time seeing exactly where.

function convert(::Type{UTF16String}, str::UTF8String)
dat = str.data
# handle zero length string quickly
sizeof(dat) == 0 && return empty_utf16
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is best to return an explicit copy here, not an alias to a global. The empty UTF16String is immutable, but its best not to propagate undefined behavior if some user level function decides to muck with string internals.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is the technique used by utf8.jl, I don't know who wrote that...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, indeed. @jakebolewski should we just add a copy on

isempty(r) && return empty_utf8
?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is potentially creating tons of extra allocations ("" is very common) to protect against someone doing something bad really a good thing? Ban things like using C pointers then...

@ScottPJones
Copy link
Contributor Author

I incorporated most of the last round of comments from the reviews (I was running unit tests on that when it was closed 😞 ) [all of the things that I said I would fix], I moved (at least for now, I think it is a step backwards) some of the conversion functions from utfconvert.jl to utf16.jl/utf32.jl to make the diffs cleaner (and so I wouldn't get dinged for stuff I didn't change). There is some potential for further refactoring of a couple functions, and removal of some functions, but those suggestions either were definitely, or likely to be breaking changes, and so I didn't feel they were appropriate for this PR.

@throws never returns, always throws ArgumentError
""" ->
=#
function utf_errfunc(errcode::Integer, charpos, invchar)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of ad-hoc template substitution, I still think it would be more explicit to construct and throw the exception at the point of the error. utf_errfunc does not really cut down on code redundancy in throwing similar error messages.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guess is this function is used to avoid creating GC frame. I still think it is the safest to mark it as @noinline though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating a gc-frame is not the end of the world, it is created virtually everywhere in the Julia codebase. I would guess that the change would be marginal to insignificant in terms of performance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually saw a major difference, in some early tests I did...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this extra indirection for error handling is unusual and should be removed for now. We don't have any other infrastructure anywhere else in base for internationalizing error messages, we don't need it here at this very moment. We can work on a more generic solution for that problem and start using it across the entire codebase later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ScottPJones we are not arguing that better error messages are not hugely beneficial. If writing the code this way (extracting out into a separate function, using global error constants) was motivated by performance issues then it is best to say:

I wrote this code first with inline throw(Exception()) but that created a GC frame which degraded performance in the validation function. I extracted out the error functions this way and that resulted in a 25% performance increase shown by the following benchmark:

function bench()
    # some code
end
@time bench()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This current structure was motived by a number of things:

  1. Long familiarity [25 years] with a very powerful system for having localizable messages,
    where the messages can handle differences in word order between languages, and can handle
    plain output messages, warning messages, and error messages, with any number of arguments
    included in the output.
  2. Fairly severe performance problems with GC frames, when I first implemented this by adding
    a new UnicodeError. If that were totally fixed (see @yuyichao's RFC: Create local gc frame for throw #11508), I would prefer to move
    back to using UnicodeError, but I really don't think that that should be a reason to hold this up.
  3. Enum's not working in Base, at least not early enough to be useful for this.
    When I first started implementing this, I used Enums, which is what all of those const's really
    meant to be.
    So, yes, these could be better, but not at the moment, because of issues out of my control.
    I will definitely change them to be what I'd originally wanted (which I think you all as well would prefer), as soon as those issues are dealt with.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. is not relevant here. You're continuing to argue from authority as resistance against changes we are all telling you will help this PR be smaller, less controversial, and more likely to be merged.
  2. If you had written that down, with numbers, saying "changing just the error code structure from something simple to this resulted in x% speedup on this benchmark gist" (btw please provide an executive summary of the numbers in those gists, there's way too much data there to parse at a glance) as @jakebolewski said, then we'd probably give in on this point if x were large enough.
  3. is superficial, Enums are new and not yet widely used, and would still result in this looking like C-written-in-Julia instead of idiomatic Julia code.

Every line of code in this PR is under your control. Dozens of people have suggested dozens of ways to break this down and make this substantially better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I actually wasn't trying to "argue from authority" at all. @jakebolewski had talked about my motivations for making the change this way, and I was trying to answer that as best I could.
    Are Julians always so sensitive about talking about one's experience, in the context of a technical discussion? IMO, it can help shed light on a issue... I, for example, am very interested to learn about Julian's experiences in the technical/numeric/scientific computing world. The type of arguing that I've seen here that does bother me is "argument from a position of authority", and people willing to use that position to curtail or stifle discussion.

  2. I have tried to give the numbers and executive summaries as I went along, but with 500+ comments, I think the early ones where I tested different error handling strategies got lost in the chaff. As soon as I have a minute (I am also busy working here in Belgium), I'll try to recreate my tests that pointed me to the issue of GC frame creation [and I know that @yuyichao is working hard to fix things so it wouldn't even be an issue]. I did give an executive summary just before I went to bed, did you miss it? 21-100% slower without the @inbounds macros... (I think more frequently closer to 100%).

  3. Well, first off, I haven't really seen that much consistency in the Julia code I've been looking at...
    (maybe because it is older code, who knows), so I don't totally buy the argument that everything must be written in "idiomatic" Julia code. Different people also have different ideas of what constitutes good structure and practice... Does Julia really need some sort of unwritten style that everybody must follow? (at least, if it were written down, people might have a chance of knowing what was expected, and using that, or debating whether the current practice really is good...)
    I'm not sure at all that the suggestions that I haven't incorporated so far would make things "substantially better"...

About the error handling: the reason the very first thing that I implemented was the better error handling, is that I would not have been able to easily find and fix all of the many bugs in the current code without the enhanced messages. It all ties together... this PR adds some extra overhead already to do complete validity checking, which people worried about in the initial comments on #11004... A big goal of mine was to at the very least, not introduce any regressions in performance (the performance was so bad already, that would have been criminal). Because of my approaches, not only is the code not slower, even with the extra checking, it is faster as well.
At least in my early benchmarking, the extra overhead in the inner loops (which is where the error checking happens!) made the performance frequently worse than the current bad performance.

All of the above is why I really do not think it is at all a good think to try to split this up as has been suggested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the micro benchmarking... I know that the way it just runs things and uses @time and everything is really crappy... I really haven't had time to learn how best to collect all of the information that I wish to and save it for later processing... I miss having a high speed database always available just by adding a ^ character to the associative array! (my changes to gc.c that already got merged actually help collect the data, but storing and retrieving)
Any pointers to how better to benchmark things like this, and produce a nice spreadsheet or graph in julia, would be greatly appreciated!

@tkelman
Copy link
Contributor

tkelman commented Jun 3, 2015

We already have 50ish comments worth of mostly minor review back-and-forth on this new PR. Given this is only about 10% different than #11004 I have a sneaky feeling @StefanKarpinski would object pretty strongly to merging it as is, even after addressing the review comments that have been made up to this point.

So, question for you @ScottPJones. How small of a correct bugfix version of this could we distill this down to, strictly using check_string_abs? None of the other check_string_* variants for now [edit2: or parameterize them and reuse as much implementation as possible]. And as few as possible, maximally generic convert methods, parameterized on the input and output types to reuse as many lines of code as you can. [edit1: Remember, this PR is a bugfix.]

edit3: Remember one of Jeff's mottos, underscores are a sign of missing abstraction.

@throws ArgumentError
""" ->
=#
function check_string_utf32(dat::Vector{UInt32}, len::Int, options::Integer=0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is almost line-for-line identical with check_string_abs. Can't you merge them? i.e. why can't you use something like ch, pos = next(dat, pos)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd answered this before... but again 1) I had done exactly that, and ran into severe performance degradation, which @JeffBezanson looked into and may have totally addressed, and would need some time to adequately benchmark, look into generated code, etc. before I would feel comfortable trying that again, 2) I had been told previously that I should not mix AbstractString and Vector{T} in a method like that, so I am getting conflicting requirements from the reviewers.

@nalimilan
Copy link
Member

I don't think my question from https://github.com/JuliaLang/julia/pull/11004/files#r31562735 about checking for surrogates in Chars obtained by indexing an AbstractString (in check_string_abs) has been addressed. Do we consider part of the AbstractString interface that getindex must return valid codepoints (as I think we should)?

@ScottPJones
Copy link
Contributor Author

I think that is a whole separate issue, and definitely should not be addressed here.
What I implemented correctly handles the way things currently are.
(I am not saying that it should not be addressed, I've already stated many times that I think the string types and Char should only contain 100% valid Unicode data... but that is a separate issue)

@ScottPJones
Copy link
Contributor Author

@nalimilan I opened a new PR (I must be a glutton for punishment!) where I think it might be a good place to discuss the issue you raised. (#11558)

@ScottPJones
Copy link
Contributor Author

@tkelman Maybe, I'll try that out, and look at the native code generated. Thanks!

@ScottPJones
Copy link
Contributor Author

OK, I haven't gotten everything asked for changed (because I want to make sure the performance doesn't suffer, esp. to the point of being a regression compared to the current code), but hopefully this will make people happier, and it still maintains my design goals.
Especially look at the changes to utferror.jl and utfcheck.jl.
Thanks everybody for the time spent reviewing this, I really do appreciate it!

@yuyichao
Copy link
Contributor

yuyichao commented Jun 3, 2015

Not really relavant for this PR but the travis test got killed by OOM killer and the test passes??? WTH?

@yuyichao
Copy link
Contributor

yuyichao commented Jun 3, 2015

@ScottPJones
Copy link
Contributor Author

Yes, what's up with the strange OOM errors? I saw cases before my change, so I don't think it is related.
Anyway, I'd appreciate it if people could take a look at the latest version, I was able to make things more generic without performance degrading (it seems @JeffBezanson's changes did fix the bugs I stumbled across before). Thanks!

@yuyichao
Copy link
Contributor

yuyichao commented Jun 3, 2015

Yeah, the OOM killer has nothing to do with this PR, the dmesg output is added as an indicator for OOM errors. This issue is tracked separately #11553 .

I was just a little surprised that the OOM killer firing doesn't make the test fail......

@ScottPJones
Copy link
Contributor Author

Yes, such a nasty ending definitely should make the test fail (whether it was caused by the PR or not!)

totalchar = num2byte = num3byte = num4byte = 0
@inbounds while pos < len
if T == AbstractString
ch, pos = next(dat, pos)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason this wouldn't work for the vector cases too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please be more specific... wouldn't what work for the vector cases? Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ch, pos = nex(dat, pos), you might not need this if T == AbstractString at all

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! I had just copied that from the older code that walked over an AbstractString... I had thought, for an AbstractString, you couldn't count on the next position being +1, so that's why you had to use next, and start, and endof. Given that that is the way most of the rest of the code dealing with AbstractStrings was coded, could we just leave that as is for now, and I'll try to investigate it more thoroughly later? (and then I could revamp the other places as well, all in other small PR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... or did you mean to change it do always do ch, pos = next(dat, pos)?
I'll take a quick look at the code generated in both cases... that would be great if it boiled down to the same code... 😀

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to always use next, since that's more general and should have the same meaning for simple vectors - and ideally generate nearly the same code

### Returns:
* `UTF8String`
"
function encode_to_utf8{T<:Union{UInt16, UInt32}}(::Type{T}, dat, len)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just encode_to_utf8{T<:Union{UInt16, UInt32}}(dat::AbstractVector{T}, len), rather than redundantly passing T ... isn't dat always going to be some kind of vector of T?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was more for the future, where I want the first argument to really be an Encoding, see @quinnj's nice Strings.jl that he's been working on.

@ScottPJones
Copy link
Contributor Author

@stevengj Are my responses enough to satisfy you on the 3 notes, or do you want things changed? Thanks!

@ScottPJones
Copy link
Contributor Author

@stevengj I just tried to remove those (from ascii.jl, utf8.jl, utf16.jl, and utf32.jl), but ended up with the following fun:

socket.jl
error during bootstrap:
LoadError(at "sysimg.jl" line 128: LoadError(at "socket.jl" line 644: StackOverflowError()))
rec_backtrace at /j/julia/src/task.c:644
eval at /j/julia/usr/lib/libjulia.dylib (unknown line)
jl_parse_eval_all at /j/julia/src/toplevel.c:567
jl_load at /j/julia/src/toplevel.c:610
include at boot.jl:254
jl_apply at /j/julia/src/interpreter.c:55
eval at /j/julia/src/interpreter.c:212
jl_toplevel_eval_flex at /j/julia/src/toplevel.c:517
jl_eval_module_expr at /j/julia/src/toplevel.c:156
jl_parse_eval_all at /j/julia/src/toplevel.c:567
jl_load at /j/julia/src/toplevel.c:610
exec_program at /j/julia/usr/bin/julia (unknown line)
true_main at /j/julia/usr/bin/julia (unknown line)
main at /j/julia/usr/bin/julia (unknown line)

Rewrote a number of the conversions between ASCIIString, UTF8String, and UTF16String.
Rewrote length() for UTF16String().
Improved reverse() for UTF16String().

Added over 150 lines of testing code to detect the above conversion problems

Added (in a gist) code to show other conversion problems not yet fixed:
https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc

Added (in a gist) code to benchmark the performance, to ensure that adding the extra validity
checking did not adversely affect performance (in fact, performance was greatly improved).
https://gist.github.com/ScottPJones/79ed895f05f85f333d84

Updated based on review comments

Changes to error handling and check_string

Rebased against JuliaLang#11575
Updated comment to go before function, not indented by 4

Updated to use unsafe_checkstring

Removed redundant argument documentation
@ScottPJones
Copy link
Contributor Author

Can this go in so we can move on to ripping apart my next set of bug fixes? 😀

@tkelman
Copy link
Contributor

tkelman commented Jun 27, 2015

I think this looks fine now, doesn't look like there are any obvious candidates for much code reuse in what's left.

@ScottPJones
Copy link
Contributor Author

Bump: this code hasn't changed in almost two weeks (just been rebased to keep up with other changes being put into base). Anything preventing it from being merged in now?

@stevengj
Copy link
Member

LGTM.

tkelman added a commit that referenced this pull request Jul 1, 2015
@tkelman tkelman merged commit 9071f14 into JuliaLang:master Jul 1, 2015
@ScottPJones
Copy link
Contributor Author

Thanks, @tkelman!

@ScottPJones ScottPJones deleted the spj/fixutf branch July 1, 2015 03:32
ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jul 1, 2015
Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575
@tkelman
Copy link
Contributor

tkelman commented Jul 1, 2015

Crap, I think I'm going to have to revert this, I'm now getting an error syntax: invalid character "�". This is one case where it would've been ideal to run an immediately pre-merge integration test. Please restore the branch, make a copy of it and file a new PR.

@tkelman
Copy link
Contributor

tkelman commented Jul 1, 2015

Scratch that, I think it's an unrelated problem caused by a mistake I made, will fix.

ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jul 1, 2015
Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575
@ScottPJones
Copy link
Contributor Author

There is also a fix to JSON.jl in JuliaIO/JSON.jl#111

ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jul 6, 2015
Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575
ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jul 9, 2015
Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575
ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jul 9, 2015
Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575

Updated to use unsafe_checkstring, fix comments

Remove conversions from Vector{UInt32}

Move some code from utf32.jl to utf16.jl and utf8.jl, hopefully more logical
ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jul 9, 2015
Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575

Updated to use unsafe_checkstring, fix comments

Remove conversions from Vector{UInt32}

Move some code from utf32.jl to utf16.jl and utf8.jl, hopefully more logical
ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jul 10, 2015
Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575
ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jul 12, 2015
Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575

Updated to use unsafe_checkstring, fix comments

Remove conversions from Vector{UInt32}

Move some code from utf32.jl to utf16.jl and utf8.jl, hopefully more logical
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants