throw ArgumentError when accessing an invalid UTF-8 indice, rather th… #11423

tkelman · 2015-05-24T00:43:33Z

…an attempting to adjust it. fix #7811

cc @vtjnash (any changes to make here? it's your branch, i'm just turning it into a PR) @ScottPJones

ScottPJones · 2015-05-24T01:09:07Z

👍

vtjnash · 2015-05-24T12:29:44Z

I had just forgotten about this. It should be fine to merge

tkelman · 2015-05-27T04:15:42Z

I'm not really qualified to hit merge on anything unicode-related, could someone else review and merge if you agree with this change?

nalimilan · 2015-05-27T07:37:02Z

base/string.jl

@@ -185,10 +185,9 @@ function search(s::AbstractString, c::Chars, i::Integer)
 return 1 <= i <= nextind(s,endof(s)) ? i :
 throw(BoundsError(s, i))
 end
- if i < 1
+ if i < 1 || i > nextind(s,endof(s))


Doesn't look like you're catching all invalid indices here, are you? Shouldn't this be i > endof(s)?

julia> s = "café"
"café"

julia> nextind(s,endof(s))
6

julia> s[5]
ERROR: ArgumentError: invalid UTF-8 character index
 in next at ./utf8.jl:80
 in getindex at string.jl:62
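For readers on a current Julia, the gap between the range check in the hunk and actual index validity can be reproduced roughly as follows. This is only a sketch under assumptions: it uses Julia 1.x names (`lastindex`, `isvalid`) in place of the 0.4-era `endof`, and `in_search_range` is a hypothetical helper that mirrors just the range test from the diff, not the actual Base code.

```julia
s = "café"   # 'é' occupies bytes 4 and 5

# Hypothetical helper: the same range test as in the diff hunk.
in_search_range(s::AbstractString, i::Integer) = 1 <= i <= nextind(s, lastindex(s))

lastindex(s)              # 4, the byte index where 'é' starts
nextind(s, lastindex(s))  # 6, one past the final byte
in_search_range(s, 5)     # true: 5 is inside the accepted range...
isvalid(s, 5)             # false: ...but it points into the middle of 'é'
```

So a pure range test accepts index 5 here; rejecting it takes a separate check of whether the index is at the start of a character.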

nalimilan · 2015-05-27T07:46:46Z

From what I can tell it's OK (but I may have missed details). Regarding my remarks about bounds checking, do as you prefer!

ScottPJones · 2015-05-27T08:24:28Z

base/ascii.jl

+ i == sizeof(s) + 1 && return 0
+ (i < 1 || i > sizeof(s)) && throw(BoundsError(s, i))
+ return c < Char(0x80) ? search(s.data,c%UInt8,i) : 0
+end
 rsearch(s::ASCIIString, c::Char, i::Integer) = c < Char(0x80) ? rsearch(s.data,c%UInt8,i) : 0


Why doesn't this also do the bounds checking that was added for search? Is it handled in the other rsearch method?
I think the bounds check needs to be done before you return 0 (for c < 0x80)
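A minimal sketch of the ordering being asked for, written for Julia 1.x over a raw byte vector rather than the `ASCIIString` type in the diff. `search_ascii` is a hypothetical name, and the accepted index range (1 through `length + 1`) is copied from the `search` lines quoted above; whether `rsearch` should accept exactly the same range is an assumption this sketch does not settle.

```julia
# Hypothetical forward search over ASCII bytes.
# The index is validated before the non-ASCII shortcut, so an out-of-range
# index throws no matter which character is being searched for.
function search_ascii(data::AbstractVector{UInt8}, c::Char, i::Integer)
    n = length(data)
    i == n + 1 && return 0                            # one past the end: nothing left to search
    (i < 1 || i > n) && throw(BoundsError(data, i))   # bounds check comes first
    isascii(c) || return 0                            # a non-ASCII char cannot occur in ASCII data
    b = UInt8(c)
    for k in i:n
        data[k] == b && return k
    end
    return 0
end

data = Vector{UInt8}("abc")
search_ascii(data, 'b', 1)   # 2
search_ascii(data, 'é', 1)   # 0, index is fine and 'é' is simply absent
# search_ascii(data, 'é', 10) throws BoundsError rather than silently returning 0
```

If the `isascii(c)` test ran first, the last call would return 0 and hide the bad index, which is the behavior the comment above is questioning.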

ScottPJones · 2015-05-27T21:35:20Z

Anybody concerned about the issue in rsearch?

ScottPJones · 2015-06-03T14:19:48Z

@tkelman 😀 How about addressing the issues I raised reviewing your PR? Seriously, it would be nice to get these sorts of fixes in...

tkelman · 2015-06-03T14:22:25Z

I don't actually know anything about what this code is doing. I just wanted to ping @vtjnash by opening this so he wouldn't forget about it.

ScottPJones · 2015-06-03T14:28:26Z

Oh, sorry, I saw the "Author" tag, and had forgotten that you'd just put his branch into a PR...

jiahao · 2015-06-03T14:30:50Z

> @tkelman 😀 How about addressing the issues I raised reviewing your PR?

@ScottPJones kindly respect others' time by not pinging them frivolously. This is a borderline violation of our community standards, viz.

> Be aware that most community members contribute on a voluntary basis, so ideas and bug reports are ok, but demands are not.

vtjnash · 2015-06-05T03:59:05Z

> Anybody concerned about the issue in rsearch?

yes, but i was not prepared when making this PR to extend it further and test rsearch as well. that should be a separate commit, and perhaps a separate PR anyways.

ScottPJones · 2015-06-05T04:17:53Z

@jiahao That was NOT a frivolous ping. It was simply an honest mistake, pinging tkelman instead of vtjnash, because of the line at the top: "tkelman wants to merge...", which I'd already said I was sorry for. Your ping could be considered frivolous, OTOH...
@vtjnash OK, that makes sense to do it in another PR, but you didn't address the other issue I raised:

> I think the bounds check needs to be done before you return 0 (for c < 0x80)

vtjnash · 2015-06-05T14:47:48Z

@ScottPJones it is counterproductive to try to point blame back at another volunteer in the community.

> I think the bounds check needs to be done before you return 0 (for c < 0x80)

Is this in the context of rsearch? I believe I covered this case for search.

i'm definitely trying to get back to the PR, but still trying to get caught back up on everything else right now.

ScottPJones · 2015-06-05T19:43:50Z

@vtjnash I wasn't trying to "point blame". I was a bit taken aback by being told it was a "frivolous" ping, when pinging tkelman... which was an honest mistake, because it was he who brought me into this thread in the first place! I just wanted him to remember this issue, since nobody had responded to my comment here, and he was in contact with me constantly on a couple other threads, reviewing the Unicode PRs I've been trying to get in. Hence the smiley face. If I'd known that this PR was really you, I wouldn't have said anything, assuming you could be too busy to respond yet.

Looking at it again (and not on my phone), that is rsearch; you have taken care of it for search.
rsearch will need to be fixed another day.

tkelman · 2015-06-05T20:21:27Z

Arguing about pinging back and forth is indeed unproductive. Given the new unicode error handling framework from #11573, the patch in this PR would need to be adapted to fit into it to resolve the conflict.

vtjnash · 2015-06-06T06:06:42Z

> Given the new unicode error handling framework from #11573, the patch in this PR would need to be adapted to fit into it to resolve the conflict.

I've pushed an update for that.

> Why doesn't this also do the bounds checking that was added for search? Is it handled in the other rsearch method?

afaict, none of the rsearch methods do this. also, as a breadcrumb, i have not attempted to add validation of the endpoint index passed to a few of these methods (getindex(s::UTF8String, r::UnitRange{Int}) or SubString)

ScottPJones · 2015-06-06T06:22:13Z

Ok, good point about the other places lacking validation. 👍 to this change

ScottPJones · 2015-07-20T18:47:42Z

I think I see a problem with part of this change, in next(): it now simply throws an index error if the index is not at a valid UTF-8 starting byte. The problem is that, since strings are not validated in Julia, the index might have hit a Latin1 character, for example. The caller then gets an invalid index error, instead of next() skipping the bad byte and returning a 0xfffd replacement character, as it does elsewhere [or giving an invalid character error].
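To make the two policies concrete, here is a rough sketch over an unvalidated byte vector, in Julia 1.x. Every name in it (`seqlen`, `iscont`, `next_lenient`, `next_strict`) is hypothetical and none of this is the Base implementation. The decoder is deliberately simplified: it does not reject overlong encodings or surrogates.

```julia
const REPLACEMENT = '\ufffd'

# Expected sequence length judged from the first byte; 0 if the byte cannot start a sequence.
function seqlen(b::UInt8)
    b < 0x80 && return 1
    0xc2 <= b <= 0xdf && return 2
    0xe0 <= b <= 0xef && return 3
    0xf0 <= b <= 0xf4 && return 4
    return 0
end

iscont(b::UInt8) = (b & 0xc0) == 0x80   # continuation bytes look like 10xxxxxx

# Lenient policy: a bad byte yields U+FFFD and advances by one byte.
function next_lenient(data::Vector{UInt8}, i::Int)
    b = data[i]
    n = seqlen(b)
    n == 0 && return (REPLACEMENT, i + 1)   # stray continuation byte or invalid lead byte
    n == 1 && return (Char(b), i + 1)       # plain ASCII
    # Truncated sequence, or a following byte that is not a continuation byte.
    if i + n - 1 > length(data) || !all(iscont, view(data, i+1:i+n-1))
        return (REPLACEMENT, i + 1)
    end
    cp = UInt32(b) & (0xff >> (n + 1))      # payload bits of the lead byte
    for k in i+1:i+n-1
        cp = (cp << 6) | (data[k] & 0x3f)   # append 6 payload bits per continuation byte
    end
    return (Char(cp), i + n)
end

# Strict policy: an index that lands on a continuation byte is an error.
function next_strict(data::Vector{UInt8}, i::Int)
    iscont(data[i]) && throw(ArgumentError("invalid UTF-8 character index $i"))
    return next_lenient(data, i)
end

latin1 = UInt8[0x61, 0xa9, 0x62]   # "a©b" encoded as Latin1, not valid UTF-8
next_lenient(latin1, 2)            # ('\ufffd', 3)
# next_strict(latin1, 2) throws ArgumentError, although index 2 is a natural stepping point here
```

In the Latin1 buffer the byte 0xa9 has the same bit pattern as a UTF-8 continuation byte, so the strict policy cannot tell "index in the middle of a character" apart from "bad data at a legitimate position".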

tkelman added the domain:unicode Related to unicode characters and encodings label May 24, 2015

nalimilan reviewed May 27, 2015

ScottPJones reviewed May 27, 2015

vtjnash self-assigned this Jun 6, 2015

vtjnash force-pushed the jn/invalid_utf8_idx branch from de8f898 to 6bd2aeb Compare June 6, 2015 04:02

throw UnicodeError when accessing an invalid UTF-8 indice, rather tha…

3e53553

…n attempting to adjust it (in getindex, next, search). fix #7811

vtjnash force-pushed the jn/invalid_utf8_idx branch from 6bd2aeb to 3e53553 Compare June 6, 2015 04:27

vtjnash added a commit that referenced this pull request Jun 10, 2015

Merge pull request #11423 from JuliaLang/jn/invalid_utf8_idx

e504e76

throw ArgumentError when accessing an invalid UTF-8 indice

vtjnash merged commit e504e76 into master Jun 10, 2015

vtjnash deleted the jn/invalid_utf8_idx branch June 10, 2015 18:34
