Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: some invalid UTF-8 MIGHT be accepted when it should be rejected #24

Closed
1 task done
xexyl opened this issue Oct 21, 2024 · 17 comments
Closed
1 task done

Bug: some invalid UTF-8 MIGHT be accepted when it should be rejected #24

xexyl opened this issue Oct 21, 2024 · 17 comments
Assignees
Labels
bug Something isn't working

Comments

@xexyl
Copy link
Owner

xexyl commented Oct 21, 2024

Is there an existing issue for this?

  • I have searched for existing issues and did not find anything like this

Describe the bug

My understanding is that the JSON spec mandates that invalid UTF-8 is rejected. I do not believe this is done in every case. Part of this is what seems to be some ambiguous wording but it might also be a lack of understanding of certain parts of it, maybe even UTF-8 itself.

What you expect

If invalid UTF-8 is given it should be rejected. Unfortunately there is not only ambiguous wording but some things claimed to be invalid code points are actually valid according to https://www.fileformat.info/info/unicode/index.htm.

Environment

  • OS: n/a
  • Device: n/a
  • Compiler: n/a

jparse_bug_report.sh output

n/a

Anything else?

Some files that were supposedly invalid JSON were removed from the bad/ subdirectory as they actually are valid code points.

But some code points that are supposed to be rejected, it seems, are actually accepted. It is my understanding that the JSON spec mandates that invalid UTF-8 is rejected.

But a spec says for example that the bytes

This document might be of use: https://www.unicode.org/versions/Unicode16.0.0/core-spec/.

According to https://datatracker.ietf.org/doc/html/rfc3629 the octet FF never appears but https://www.fileformat.info/info/unicode/char/00ff/index.htm says U+00FF is valid so which is it? Or is this a misinterpretation of the standard? Or was this earlier on and there is an update? Or perhaps it's U+FF00 that should be rejected?

In fact, from: https://www.fileformat.info/info/unicode/utf8.htm we see:

The value of each individual byte indicates its UTF-8 function, as follows:

00 to 7F hex (0 to 127): first and only byte of a sequence.
80 to BF hex (128 to 191): continuing byte in a multi-byte sequence.
C2 to DF hex (194 to 223): first byte of a two-byte sequence.
E0 to EF hex (224 to 239): first byte of a three-byte sequence.
F0 to FF hex (240 to 255): first byte of a four-byte sequence.

Which, as I interpret it, means that the byte can be there?

I have also seen the text:

Page Separator. FF is commonly used as a page separator, and it should be interpreted that way in text. When displaying on the screen, it causes the text after the separator to be forced to the next page. It is interpreted in the same way as the LS for line breaking, in parsing, or in input segmentation such as readline. FF does not interrupt a paragraph, as paragraphs can and do span page boundaries.

Finally I saw yesterday from https://www.unicode.org/versions/Unicode16.0.0/UnicodeStandard-16.0.pdf that:

U+00F4 is equivalent to U+006F followed by U+0302, but not
equivalent to U+0302 followed by U+006F.

Now the second part happens to test out okay but the first part does not test okay. Is this simply a rendering problem? Maybe. When trying in jstrdecode the code points it looks different from the output in say a text file.

@xexyl
Copy link
Owner Author

xexyl commented Oct 21, 2024

Finally I saw yesterday from https://www.unicode.org/versions/Unicode16.0.0/UnicodeStandard-16.0.pdf that:

U+00F4 is equivalent to U+006F followed by U+0302, but not
equivalent to U+0302 followed by U+006F.

Now the second part happens to test out okay but the first part does not test okay. Is this simply a rendering problem? Maybe. When trying in jstrdecode the code points it looks different from the output in say a text file.

Okay this part seems very possibly the case as...

$ jstrdecode '\u006f\u0302'
ô

but in the display of text in debug output it shows:

debug[3]: \u0302\u006f != \u006f\u0302: ô != ̂o: true

which suggests that maybe it's because of that. I don't know.

@xexyl
Copy link
Owner Author

xexyl commented Oct 21, 2024

Finally I saw yesterday from https://www.unicode.org/versions/Unicode16.0.0/UnicodeStandard-16.0.pdf that:

U+00F4 is equivalent to U+006F followed by U+0302, but not
equivalent to U+0302 followed by U+006F.

Now the second part happens to test out okay but the first part does not test okay. Is this simply a rendering problem? Maybe. When trying in jstrdecode the code points it looks different from the output in say a text file.

Okay this part seems very possibly the case as...

$ jstrdecode '\u006f\u0302'
ô

but in the display of text in debug output it shows:

debug[3]: \u0302\u006f != \u006f\u0302: ô != ̂o: true

which suggests that maybe it's because of that. I don't know.

For the output of jstrdecode though:

$ jstrdecode '\u006f\u0302\u00f4'
ôô

Are those the same character? vim search suggests yes. Which suggests that there could be a problem.

See this though:

jstrdecode '\u0302\u006f\u006f\u0302'
̂oô

Those do indeed seem different. (A screenshot would be better to show it though.)

@xexyl xexyl changed the title Bug: some invalid UTF-8 **MIGHT** be accepted when it should be rejected Bug: some invalid UTF-8 MIGHT be accepted when it should be rejected and there MIGHT be some decoding issues Oct 21, 2024
@lcn2
Copy link
Contributor

lcn2 commented Oct 21, 2024

Has this issue been resolved?

@xexyl
Copy link
Owner Author

xexyl commented Oct 21, 2024

Has this issue been resolved?

I'm not sure if it has. But it's I think lower priority. I opened it only so the other one could be closed and this can be looked at later.

If you think that the above does not suggest a problem then I can certainly close it. That is why I used the word 'MIGHT'; I am not sure. It seems like it but it might not be too.

@lcn2
Copy link
Contributor

lcn2 commented Oct 21, 2024

Has this issue been resolved?

I'm not sure if it has. But it's I think lower priority. I opened it only so the other one could be closed and this can be looked at later.

If you think that the above does not suggest a problem then I can certainly close it. That is why I used the word 'MIGHT'; I am not sure. It seems like it but it might not be too.

Feel free to keep it open if that helps: it is your repo and this your call. It seems, given your doubt, that this issue should remain open.

Nevertheless the question from GH-issuecomment-2427610604 in the other repo applies even if this issue remains open.

@xexyl
Copy link
Owner Author

xexyl commented Oct 21, 2024

Has this issue been resolved?

I'm not sure if it has. But it's I think lower priority. I opened it only so the other one could be closed and this can be looked at later.
If you think that the above does not suggest a problem then I can certainly close it. That is why I used the word 'MIGHT'; I am not sure. It seems like it but it might not be too.

Feel free to keep it open if that helps: it is your repo and this your call. It seems, given your doubt, that this issue should remain open.

Nevertheless the question from GH-issuecomment-2427610604 in the other repo applies even if this issue remains open.

Yes. I see you committed to sync it up. That's good. Just improvements in testing, really, plus the fun -E option.

It might help to keep this open at least for now but it seems also like it might be one of those cases where unless an actual problem shows up it could be closed. I opened it as a reminder though. My guess is nothing much (or anything at all?) will occur with this, and in that case it will be closed. The spec unfortunately is ambiguous on some points and that might cause the concern that things might not be right, whether or not they are.

@xexyl xexyl changed the title Bug: some invalid UTF-8 MIGHT be accepted when it should be rejected and there MIGHT be some decoding issues Bug: some invalid UTF-8 MIGHT be accepted when it should be rejected Oct 30, 2024
@xexyl
Copy link
Owner Author

xexyl commented Oct 30, 2024

Just changed the name here as the decoding issues should be resolved. But I have a thought about encoding/decoding which I'll put in issue #14.

@xexyl
Copy link
Owner Author

xexyl commented Nov 1, 2024

Okay so this in the standard seems to be wrong or obsolete or something:

The octet values C0, C1, F5 to FF never appear.

as these can all be part of a code point that create different characters. So it's a matter of what else cannot be there that we have to check, to determine if anything else that's not yet rejected should be rejected.

@xexyl
Copy link
Owner Author

xexyl commented Nov 1, 2024

In fact .. and this is curious, in the Unicode standard:

While it causes no ambiguity in UTF-8, Unicode assigns a graphic character ("Latin Small Letter Y with
Diaeresis") to U+00FF (octets C3 B0 in UTF-8).

but in the UTF-8 RFC it says otherwise. And trying that code point:

$ jstrencode '\u00ff'
ÿ

My guess is that this issue is probably resolved. But I'm trying to determine that, hopefully soon, as it would be nice to know if this is fine. Unfortunately the JSON test suite we found had a file with UTF-8 bytes that seem to be some that should be rejected that was in the good directory. I have to look at that file again and see if that change has to be undone. It would be nice if we could match all files.

Anyway just thought you'd like an update here. I'll let you know where I get.

Safe travels!

@lcn2
Copy link
Contributor

lcn2 commented Nov 1, 2024

Suggestion, @xexyl: See if @leobru has an opinion about this Unicode / UTF-8 issue.

If we recall correctly, he has some working knowledge of that space and perhaps has had involvement with the Unicode consortium at some level: perhaps it was something them to include new Unicode symbol?

If nothing else, he could test the jstrdecode(1) using one of his non-ASCII "keyboards" to see how well it works with, say, with Russian characters.

@xexyl
Copy link
Owner Author

xexyl commented Nov 1, 2024

Suggestion, @xexyl: See if @leobru has an opinion about this Unicode / UTF-8 issue.

If we recall correctly, he has some working knowledge of that space and perhaps has had involvement with the Unicode consortium at some level: perhaps it was something them to include new Unicode symbol?

If nothing else, he could test the jstrdecode(1) using one of his non-ASCII "keyboards" to see how well it works with, say, with Russian characters.

Good idea. Leo, would you be willing to help with this, please?

For instance, does it appear that everything that is rejected seems correct - and do you see anything that should be rejected that is not?

These are the checks that are made:

  • non character:
/* Start of the "not character" range. */
#define UNI_NOT_CHAR_MIN    0xFDD0
/* End of the "not character" range. */
#define UNI_NOT_CHAR_MAX    0xFDEF

The test is:

return x >= UNI_NOT_CHAR_MIN && x <= UNI_NOT_CHAR_MAX;

If that returns true then it is rejected.

  • bytes ending in either 0xFFFF or 0xFFFE are rejected
  • surrogate pairs:
/* Surrogate pair zone. */
#define UNI_SUR_HIGH_START  0xD800
#define UNI_SUR_HIGH_END    0xDBFF
#define UNI_SUR_LOW_START   0xDC00
#define UNI_SUR_LOW_END     0xDFFF

The test is:

val >= UNI_SUR_HIGH_START && val <= UNI_SUR_LOW_END

... if true then it's rejected as an invalid surrogate pair.

  • Finally, the range of bytes:

< 0x80 is one byte.
< 0x800 is two bytes.
< 0x10000 is three bytes.
< 0x110000 is four bytes.

Anything else is rejected.

Now the standard says things like it has no octet with the bytes FF but even the Unicode standard says otherwise, and doing something like:

jstrencode '\u00ff'

shows what it describes it as:

$ jstrencode '\u00ff'
ÿ

Your experience and knowledge, @leobru, would be greatly appreciated with our thanks!

@xexyl
Copy link
Owner Author

xexyl commented Nov 1, 2024

I am discovering that some of these code points that were rejected should not be rejected. A compliant program can accept non-character code points, it seems, so that will be removed from the check, and it will then allow the JSON test suite tool to match the parser. A later standard changed what can be allowed and that includes other characters that were rejected.

Thus the files that had to be moved and thus diverging from the JSON test suite, will in a commit (soon), match the test suite tool, which means that we should completely pass it!

I think that with that commit this issue can actually be marked as resolved. I might close it but if Leo wants to provide his insight I would still appreciate it. Nevertheless this new change will necessitate an update to mkiocccentry.

Still with this in mind it'll mean that all bugs can be resolved.

@xexyl
Copy link
Owner Author

xexyl commented Nov 1, 2024

Closing this as complete. If you, @leobru, believe that there are things that should be rejected, please do let me know. I'll see about updating the comment so that the tests that were removed (to block the code points) are no longer there.

@xexyl xexyl closed this as completed Nov 1, 2024
@lcn2
Copy link
Contributor

lcn2 commented Nov 1, 2024

Leo may not be following this repo and may be unaware of its relationship with the "other repo".

@xexyl
Copy link
Owner Author

xexyl commented Nov 1, 2024

Leo may not be following this repo and may be unaware of its relationship with the "other repo".

I'm sure he is not. If you have a recommended way to contact him to ask for his feedback, please do let me know.

I still believe this issue is now resolved, however, and I'm going to do some tests of non-ASCII characters too.

Can you think of anything else? I'll be doing a pull request in mkiocccentry soon.

@lcn2
Copy link
Contributor

lcn2 commented Nov 1, 2024

Can you think of anything else? I'll be doing a pull request in mkiocccentry soon.

Not at this time and best wishes.

@xexyl
Copy link
Owner Author

xexyl commented Nov 1, 2024

Can you think of anything else? I'll be doing a pull request in mkiocccentry soon.

Not at this time and best wishes.

Great! And best wishes to you too! I have one more to do with the website and then I think that new issue can be marked complete. Well one more entry has to be updated. Unsure if I can do that right now or not but if I do not do it now I'll get to it today or else tomorrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants