-
-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UTF-8 decoding should reject overlong sequences and surrogates #25452
Comments
This is by design (see discussion in #24420). If you compare characters, you get a comparison-based on exact encoding. If you want to convert an overlong encoding to an integer value, you get the code point that it encodes, regardless of whether it is standard or overlong. This allows you to express two different things in natural ways:
Those are both distinct, useful questions one can ask, which can be expressed straightforwardly in this scheme. Note that there is a bug here, which is that this should return false: julia> Unicode.isvalid("\xf0\x80\x80\x80"[1])
true The alternative design would be to throw an error upon attempting to convert an overlong encoding to an integer value. Instead, you'd have to have a different function for getting the code point value of an overlong character encoding. Or maybe a character normalization function? Unclear, but it requires more API surface to express the same thing the current design does with less. The trade off would be that it would be harder to accidentally match an overlong encoded character. I'm not sure whether that's a good thing or a bad thing though. |
It's quite clear that overlong sequences and surrogates are invalid UTF-8, as invalid as anything. Getting the code point value of an overlong encoding is not a normal thing to do, and I think it should require some other function. If data has a stray byte, like |
I'm unclear on what is so "sinister" about overlong encodings. Sure, they're invalid, but calling them sinister is a bit dramatic. Modified UTF-8 is quite common and intentionally uses |
It has to do with input validation, e.g. checking that a string doesn't have certain characters. Allowing decoding of overlong sequences provides a way for characters to evade such checking. But I also think it's better to treat all invalid UTF-8 uniformly instead of throwing errors only in some cases. Yes, there needs to be some way to look at the actual data without an error, but that also applies equally to all invalid data. |
Ok, can you propose a different design then? |
My proposal is for anything where |
And how would you get the corresponding code point value if you didn't want an error? E.g. if you were looking for an overlong |
I think that's really a second-order concern. Modified UTF-8 and CESU-8 are the only exceptions, and most languages seem to consider them marginal and support them in external packages if at all. For example JavaCall could have a function to transcode CESU-8/Modified UTF-8 and a function to test whether a Char is an overlong |
In general I agree we should be careful about overlong encodings, but I'm not sure the fact that OTOH what can be dangerous is that one would check Other than doing that, the only way to protect against it is to call |
My interpretation of the standard is that a correct UTF-8 decoder should not accept invalid data, and a correct encoder should not produce it. In our case Other than that, the argument is very simple: if I agree |
It's not clear how to define "decoding" in the Julia case, but it sounds weird to me to consider that |
Decoding means converting a UTF-8 sequence to a code point, which I do think throwing an error when comparing an invalid character to a valid one might make sense. I don't know if it was discussed, or if it's possible within the design, but it might be possible to represent valid characters as their code point (restoring the representation of |
Ok, I'm sold on being conservative here. If it turns out to be a problem, we can always relax it. So the actionable items here are:
n = '0' <= c <= '9' ? n<<4 + c - '0' :
'a' <= c <= 'f' ? n<<4 + c - 'a' + 10 :
'A' <= c <= 'F' ? n<<4 + c - 'A' + 10 : break This is a real case that was an error in Base, I already parenthesized this to avoid the problem: Lines 333 to 335 in 310667b
However, this change might break people's code since intermediate character values can be invalid that would later come back into the valid range at the end of the computation. In this case, the end value is supposed to be an integer in any case and the parenthesized version is more efficient. I suspect that will generally be the case, so this is fine. |
All but the last now throw InvalidCharErrors: julia> UInt32("\xf0\x80\x80\x80"[1])
ERROR: Base.InvalidCharError{Char}('\xf0\x80\x80\x80')
Stacktrace:
[1] invalid_char(::Char) at ./char.jl:85
[2] UInt32(::Char) at ./char.jl:130
[3] top-level scope at REPL[34]:1
julia> UInt32("\xc0\x80"[1])
ERROR: Base.InvalidCharError{Char}('\xc0\x80')
Stacktrace:
[1] invalid_char(::Char) at ./char.jl:85
[2] UInt32(::Char) at ./char.jl:130
[3] top-level scope at REPL[35]:1
julia> UInt32("\xc0\xae"[1])
ERROR: Base.InvalidCharError{Char}('\xc0\xae')
Stacktrace:
[1] invalid_char(::Char) at ./char.jl:85
[2] UInt32(::Char) at ./char.jl:130
[3] top-level scope at REPL[36]:1
julia> UInt32("\ud800"[1])
0x0000d800 The last one is safe from the objection given above since it's 0xd800 is not the code point of any valid character and julia> isvalid("\ud800"[1])
false So I think this can be closed (and probably should have been before 1.0). |
Hi, I'd like to propose a cleaner approach (IMHO). Not confident to optimize the ascii fast forward in unsafe, but I think state machine would describe the problem better and is more readable. As a bonus, the state machine could be made available for other more advanced parsers. My case is for ANSI escape sequences which may not be valid UTF-8 but are embedded in UTF-8 streams.
|
Are you proposing a different behavior or a different implementation or something else entirely? |
Hi @StefanKarpinski, I'd suggest the state machine approach as a replacement for Rust utf8 handling. I've just realized I'm in the wrong thread then as this is about Julia. I've created a separate issue for rust. |
These should be treated as invalid data; see e.g. rust-lang/rust#3787
The following calls to
UInt32
should throw errors:Otherwise you get things like this:
The text was updated successfully, but these errors were encountered: