Bug: some invalid UTF-8 MIGHT be accepted when it should be rejected #24
Comments
Okay, this part seems very possibly the case, as:

```
$ jstrdecode '\u006f\u0302'
ô
```

but in the display of text in debug output it shows:

which suggests that maybe it's because of that. I don't know.
For the output of:

```
$ jstrdecode '\u006f\u0302\u00f4'
ôô
```

Are those the same character? See this though:
Those do indeed seem different. (A screenshot would be better to show it, though.)
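They are in fact different character sequences: `\u006f\u0302` decodes to two code points, `o` followed by U+0302 COMBINING CIRCUMFLEX ACCENT, while `\u00f4` is the single precomposed code point U+00F4. The two are canonically equivalent (a normalizing renderer displays them identically) but they are distinct UTF-8 byte sequences. A small sketch, using the standard UTF-8 encodings of these code points (illustrative only, not jparse's code):

```c
#include <string.h>

/* UTF-8 for "o" (U+006F) followed by U+0302 COMBINING CIRCUMFLEX ACCENT */
static const unsigned char combining[3] = { 0x6F, 0xCC, 0x82 };

/* UTF-8 for the precomposed U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX */
static const unsigned char precomposed[2] = { 0xC3, 0xB4 };

/* Return 1 if the two encodings are byte-identical (they are not). */
int same_bytes(void)
{
    return sizeof(combining) == sizeof(precomposed) &&
           memcmp(combining, precomposed, sizeof(precomposed)) == 0;
}
```

So a decoder that emits both forms unchanged is behaving correctly; whether the two render identically depends on the terminal or editor, which may explain the difference seen in the debug output.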
Has this issue been resolved?
I'm not sure if it has, but I think it's lower priority. I opened it only so the other one could be closed and this could be looked at later. If you think the above does not suggest a problem, then I can certainly close it. That is why I used the word 'MIGHT': I am not sure. It seems like a problem, but it might not be.
Feel free to keep it open if that helps: it is your repo and this is your call. It seems, given your doubt, that this issue should remain open. Nevertheless, the question from GH-issuecomment-2427610604 in the other repo applies even if this issue remains open.
Yes. I see you committed to sync it up. That's good. Just improvements in testing, really, plus the fun. It might help to keep this open at least for now, but it also seems like one of those cases where, unless an actual problem shows up, it could be closed. I opened it as a reminder, though. My guess is that nothing much (or anything at all?) will come of this, and in that case it will be closed. The spec, unfortunately, is ambiguous on some points, and that might cause concern that things are not right, whether or not they are.
Just changed the name here, as the decoding issues should be resolved. But I have a thought about encoding/decoding which I'll put in issue #14.
Okay, so this in the standard seems to be wrong or obsolete or something:

as these can all be part of code points that create different characters. So it's a matter of checking what else cannot be there, to determine whether anything not yet rejected should be rejected.
In fact, and this is curious, in the Unicode standard:
but in the UTF-8 RFC it says otherwise. And trying that code point:

```
$ jstrencode '\u00ff'
ÿ
```

My guess is that this issue is probably resolved. But I'm trying to determine that, hopefully soon, as it would be nice to know if this is fine. Unfortunately, the JSON test suite we found had a file in the good directory with UTF-8 bytes that seem to be ones that should be rejected. I have to look at that file again and see if that change has to be undone. It would be nice if we could match all files. Anyway, just thought you'd like an update here. I'll let you know where I get. Safe travels!
Suggestion, @xexyl: see if @leobru has an opinion about this Unicode / UTF-8 issue. If we recall correctly, he has some working knowledge of that space and perhaps has had involvement with the Unicode Consortium at some level: perhaps it was suggesting something for them to include as a new Unicode symbol? If nothing else, he could test the
Good idea. Leo, would you be willing to help with this, please? For instance, does everything that is rejected seem correct, and do you see anything that should be rejected that is not? These are the checks that are made:
```c
/* Start of the "not character" range. */
#define UNI_NOT_CHAR_MIN 0xFDD0
/* End of the "not character" range. */
#define UNI_NOT_CHAR_MAX 0xFDEF
```

The test is:

```c
return x >= UNI_NOT_CHAR_MIN && x <= UNI_NOT_CHAR_MAX;
```

If that returns true, then it is rejected.

```c
/* Surrogate pair zone. */
#define UNI_SUR_HIGH_START 0xD800
#define UNI_SUR_HIGH_END 0xDBFF
#define UNI_SUR_LOW_START 0xDC00
#define UNI_SUR_LOW_END 0xDFFF
```

The test is:

```c
val >= UNI_SUR_HIGH_START && val <= UNI_SUR_LOW_END
```

If true, then it is rejected as an invalid surrogate pair.
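One possibly relevant detail: U+FDD0..U+FDEF is not the only set of noncharacters. Unicode also designates the last two code points of every plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, up through U+10FFFE, U+10FFFF) as noncharacters, so a range test on the FDD0 block alone does not catch them. A hedged sketch of the full rule (the macro names mirror those above; this is not jparse's actual code):

```c
#include <stdint.h>

#define UNI_NOT_CHAR_MIN 0xFDD0  /* start of the contiguous noncharacter block */
#define UNI_NOT_CHAR_MAX 0xFDEF  /* end of the contiguous noncharacter block   */

/*
 * Return non-zero if cp is a Unicode noncharacter: either in the
 * U+FDD0..U+FDEF block, or one of the last two code points of any
 * plane (U+nFFFE / U+nFFFF for planes 0 through 16).
 */
int is_unicode_noncharacter(uint32_t cp)
{
    if (cp >= UNI_NOT_CHAR_MIN && cp <= UNI_NOT_CHAR_MAX) {
        return 1;
    }
    return cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE;
}
```

Note that, as discussed later in this thread, rejecting noncharacters is not actually required of a conforming parser; this sketch only shows what the complete noncharacter set looks like.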
< 0x80 is one byte. Anything else is rejected. Now, the standard says things like it has no octet with the bytes; `jstrencode '\u00ff'` shows what it describes it as:

```
$ jstrencode '\u00ff'
ÿ
```

Your experience and knowledge, @leobru, would be greatly appreciated, with our thanks!
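For what it's worth, the apparent conflict over FF may dissolve once code points and octets are kept separate: U+00FF is a valid code point, but its UTF-8 encoding is the two octets 0xC3 0xBF, and RFC 3629's statement is only that certain raw octets (0xC0, 0xC1, 0xF5..0xFF) never occur in well-formed UTF-8. A hedged encoder sketch illustrating this (not jparse's actual code):

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Encode one Unicode scalar value as UTF-8 into buf (at least 4 bytes).
 * Returns the number of bytes written, or 0 for surrogates and
 * out-of-range values, which have no UTF-8 encoding.
 */
size_t utf8_encode(uint32_t cp, unsigned char *buf)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;                       /* surrogates are not scalar values */
    }
    if (cp < 0x80) {                    /* 1 byte: U+0000..U+007F */
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 2 bytes: U+0080..U+07FF */
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                 /* 3 bytes: U+0800..U+FFFF */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {                /* 4 bytes: U+10000..U+10FFFF */
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                           /* beyond U+10FFFF */
}

/* Demonstration: U+00FF encodes as the octets 0xC3 0xBF, never as 0xFF. */
int u00ff_is_c3_bf(void)
{
    unsigned char b[4];
    return utf8_encode(0xFF, b) == 2 && b[0] == 0xC3 && b[1] == 0xBF;
}
```

So `jstrencode '\u00ff'` printing `ÿ` is consistent with RFC 3629: the program emits the two-octet sequence C3 BF, and the lone octet FF never appears.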
I am discovering that some of these code points that were rejected should not be rejected. A compliant program can accept noncharacter code points, it seems, so that check will be removed, and the JSON test suite tool will then match the parser. A later standard changed what can be allowed, and that includes other characters that were rejected. Thus the files that had to be moved, diverging from the JSON test suite, will in a commit (soon) match the test suite tool, which means that we should completely pass it! I think that with that commit this issue can actually be marked as resolved. I might close it, but if Leo wants to provide his insight I would still appreciate it. Nevertheless, this new change will necessitate an update to mkiocccentry. Still, with this in mind, it means that all bugs can be resolved.
Closing this as complete. If you, @leobru, believe that there are things that should be rejected, please do let me know. I'll see about updating the comment so that the tests that were removed (which blocked those code points) are no longer referenced.
Leo may not be following this repo and may be unaware of its relationship with the "other repo". |
I'm sure he is not. If you have a recommended way to contact him to ask for his feedback, please do let me know. I still believe this issue is now resolved, however, and I'm going to do some tests of non-ASCII characters too. Can you think of anything else? I'll be doing a pull request in mkiocccentry soon.
Not at this time, and best wishes.
Great! And best wishes to you too! I have one more thing to do with the website, and then I think that new issue can be marked complete. Well, one more entry has to be updated. I'm unsure if I can do that right now, but if not I'll get to it today or else tomorrow.
Is there an existing issue for this?
Describe the bug
My understanding is that the JSON spec mandates that invalid UTF-8 be rejected. I do not believe this is done in every case. Part of this is what seems to be ambiguous wording, but it might also be a lack of understanding of certain parts of the spec, maybe even of UTF-8 itself.
What you expect
If invalid UTF-8 is given, it should be rejected. Unfortunately, there is not only ambiguous wording, but some things claimed to be invalid code points are actually valid according to https://www.fileformat.info/info/unicode/index.htm.
Environment
jparse_bug_report.sh output
n/a
Anything else?
Some files that were supposedly invalid JSON were removed from the bad/ subdirectory, as they actually contain valid code points.

But some code points that are supposed to be rejected are, it seems, actually accepted. It is my understanding that the JSON spec mandates that invalid UTF-8 is rejected.
But a spec says for example that the bytes
This document might be of use: https://www.unicode.org/versions/Unicode16.0.0/core-spec/.
According to https://datatracker.ietf.org/doc/html/rfc3629 the octet FF never appears, but https://www.fileformat.info/info/unicode/char/00ff/index.htm says U+00FF is valid, so which is it? Or is this a misinterpretation of the standard? Or was this earlier on and there is an update? Or perhaps it's U+FF00 that should be rejected?

In fact, from https://www.fileformat.info/info/unicode/utf8.htm we see:
Which, as I interpret it, means that the byte can be there?
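One reading that reconciles the two sources: the code point U+00FF is valid (UTF-8 encodes it as the two octets 0xC3 0xBF), while the single octet 0xFF can never appear anywhere in a well-formed UTF-8 byte stream. A hedged sketch of the byte-level rule implied by RFC 3629 (illustrative only, not jparse's code):

```c
/*
 * Per RFC 3629, UTF-8 encodes only U+0000..U+10FFFF with no overlong
 * forms, so the octets 0xC0 and 0xC1 (overlong two-byte leads) and
 * 0xF5..0xFF can never appear in well-formed UTF-8.  A full validator
 * must also check continuation bytes, overlong forms, and surrogates.
 */
int is_forbidden_utf8_octet(unsigned char octet)
{
    return octet == 0xC0 || octet == 0xC1 || octet >= 0xF5;
}
```

Under this reading, "the octet FF never appears" is a statement about bytes on the wire, not about which code points exist.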
I have also seen the text:
Finally, I saw yesterday from https://www.unicode.org/versions/Unicode16.0.0/UnicodeStandard-16.0.pdf that:
Now the second part happens to test out okay, but the first part does not. Is this simply a rendering problem? Maybe. When trying the code points in `jstrdecode`, the output looks different from the output in, say, a text file.