Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

throw ArgumentError when accessing an invalid UTF-8 indice, rather th… #11423

Merged
merged 1 commit into from
Jun 10, 2015

Conversation

tkelman
Copy link
Contributor

@tkelman tkelman commented May 24, 2015

…an attempting to adjust it. fix #7811

cc @vtjnash (any changes to make here? it's your branch, i'm just turning it into a PR) @ScottPJones

@tkelman tkelman added the domain:unicode Related to unicode characters and encodings label May 24, 2015
@ScottPJones
Copy link
Contributor

👍

@vtjnash
Copy link
Sponsor Member

vtjnash commented May 24, 2015

I had just forotten about this. It should be fine to merge

@tkelman
Copy link
Contributor Author

tkelman commented May 27, 2015

I'm not really qualified to hit merge on anything unicode-related, could someone else review and merge if you agree with this change?

@@ -185,10 +185,9 @@ function search(s::AbstractString, c::Chars, i::Integer)
return 1 <= i <= nextind(s,endof(s)) ? i :
throw(BoundsError(s, i))
end
if i < 1
if i < 1 || i > nextind(s,endof(s))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't look like you're catching all invalid indices here, are you? Shouldn't this be i > endof(s)?

julia> s = "café"
"café"

julia> nextind(s,endof(s))
6

julia> s[5]
ERROR: ArgumentError: invalid UTF-8 character index
 in next at ./utf8.jl:80
 in getindex at string.jl:62

@nalimilan
Copy link
Member

From what I can tell it's OK (but I may have missed details). Regarding my remarks about bounds checking, do as you prefer!

i == sizeof(s) + 1 && return 0
(i < 1 || i > sizeof(s)) && throw(BoundsError(s, i))
return c < Char(0x80) ? search(s.data,c%UInt8,i) : 0
end
rsearch(s::ASCIIString, c::Char, i::Integer) = c < Char(0x80) ? rsearch(s.data,c%UInt8,i) : 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why doesn't this also do the bounds checking that was added for search? Is it handled in the other rsearch method?
I think the bounds check needs to be done before you return 0 (for c < 0x80)

@ScottPJones
Copy link
Contributor

Anybody concerned about the issue in rsearch?

@ScottPJones
Copy link
Contributor

@tkelman 😀 How about addressing the issues I raised reviewing your PR? Seriously, it would be nice to get these sorts of fixes in...

@tkelman
Copy link
Contributor Author

tkelman commented Jun 3, 2015

I don't actually know anything about what this code is doing. I just wanted to ping @vtjnash by opening this so he wouldn't forget about it.

@ScottPJones
Copy link
Contributor

Oh, sorry, I saw the "Author" tag, and had forgotten that you'd just put his branch into a PR...

@jiahao
Copy link
Member

jiahao commented Jun 3, 2015

@tkelman 😀 How about addressing the issues I raised reviewing your PR?

@ScottPJones kindly respect others' time by not pinging them frivolously. This is a borderline violation of our community standards, viz.

Be aware that most community members contribute on a voluntary basis, so ideas and bug reports are ok, but demands are not.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jun 5, 2015

Anybody concerned about the issue in rsearch?

yes, but i was not prepared when making this PR to extend it further and test rsearch as well. that should be a separate commit, and perhaps a separate PR anyways.

@ScottPJones
Copy link
Contributor

@jiahao That was NOT a frivolous ping. It was simply an honest mistake, pinging tkelman instead of vtjnash, because of the line at the top: "tkelman wants to merge...", which I'd already said I was sorry for. Your ping could be considered frivolous, OTOH...
@vtjnash OK, that makes sense to do it in another PR, but you didn't address the other issue I raised:

I think the bounds check needs to be done before you return 0 (for c < 0x80)

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jun 5, 2015

@ScottPJones it is counterproductive to try to point blame back at another volunteer in the community.

I think the bounds check needs to be done before you return 0 (for c < 0x80)

Is this in the context of rsearch? I believe I covered this case for search.

i'm definitely trying to get back to the PR, but still trying to get caught back up on everything else right now.

@ScottPJones
Copy link
Contributor

@vtjnash I wasn't trying to "point blame". I was a bit taken aback by being told it was a "frivolous" ping, when pinging tkelman... which was an honest mistake, because it was he who brought me into this thread in the first place! I just wanted him to remember this issue, since nobody had responded to my comment here, and he was in contact with me constantly on a couple other threads, reviewing the Unicode PRs I've been trying to get in. Hence the smiley face. If I'd known that this PR was really you, I wouldn't have said anything, assuming you could be too busy to respond yet.

Looking again at it (and not from on my phone), that is rsearch, you have taken care of it for search.
rsearch will need to be fixed another day.

@tkelman
Copy link
Contributor Author

tkelman commented Jun 5, 2015

Arguing about pinging back and forth is indeed unproductive. Given the new unicode error handling framework from #11573, the patch in this PR would need to be adapted to fit into it to resolve the conflict.

@vtjnash vtjnash self-assigned this Jun 6, 2015
…n attempting to adjust it (in getindex, next, search). fix #7811
@vtjnash
Copy link
Sponsor Member

vtjnash commented Jun 6, 2015

Given the new unicode error handling framework from #11573, the patch in this PR would need to be adapted to fit into it to resolve the conflict.

I've pushed an update for that.

Why doesn't this also do the bounds checking that was added for search? Is it handled in the other rsearch method?

afaict, none of the rsearch methods do this. also, as a breadcrumb, i have not attempted to add validation the endpoint index passed to a few of these methods (getindex(s::UTF8String, r::UnitRange{Int}) or SubString)

@ScottPJones
Copy link
Contributor

Ok, good point about the other places lacking validation. 👍 to this change

vtjnash added a commit that referenced this pull request Jun 10, 2015
throw ArgumentError when accessing an invalid UTF-8 indice
@vtjnash vtjnash merged commit e504e76 into master Jun 10, 2015
@vtjnash vtjnash deleted the jn/invalid_utf8_idx branch June 10, 2015 18:34
@ScottPJones
Copy link
Contributor

I think I see a problem with part of this change, in next(), where it now simply gives an index error if it is not at a valid UTF-8 starting character. The problem is, since strings are not validated in Julia, that the index might have hit a Latin1 character, for example, and then get an invalid index error, instead of skipping the bad byte, and returning a 0xfffd replacement character, as it does elsewhere, [or give an invalid character error].

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SubString{UTF8String} and UTF8String have different slicing behaviour
6 participants