Bug: some invalid UTF-8 MIGHT be accepted when it should be rejected #24

xexyl · 2024-10-21T14:17:29Z

Is there an existing issue for this?

I have searched for existing issues and did not find anything like this

Describe the bug

My understanding is that the JSON spec mandates that invalid UTF-8 is rejected. I do not believe this is done in every case. Part of this is what seems to be some ambiguous wording but it might also be a lack of understanding of certain parts of it, maybe even UTF-8 itself.

What you expect

If invalid UTF-8 is given it should be rejected. Unfortunately there is not only ambiguous wording but some things claimed to be invalid code points are actually valid according to https://www.fileformat.info/info/unicode/index.htm.

Environment

OS: n/a
Device: n/a
Compiler: n/a

jparse_bug_report.sh output

n/a

Anything else?

Some files that were supposedly invalid JSON were removed from the bad/ subdirectory as they actually are valid code points.

But some code points that are supposed to be rejected, it seems, are actually accepted. It is my understanding that the JSON spec mandates that invalid UTF-8 is rejected.

But a spec says for example that the bytes

This document might be of use: https://www.unicode.org/versions/Unicode16.0.0/core-spec/.

According to https://datatracker.ietf.org/doc/html/rfc3629 the octet FF never appears but https://www.fileformat.info/info/unicode/char/00ff/index.htm says U+00FF is valid so which is it? Or is this a misinterpretation of the standard? Or was this earlier on and there is an update? Or perhaps it's U+FF00 that should be rejected?

In fact, from: https://www.fileformat.info/info/unicode/utf8.htm we see:

The value of each individual byte indicates its UTF-8 function, as follows:

00 to 7F hex (0 to 127): first and only byte of a sequence.
80 to BF hex (128 to 191): continuing byte in a multi-byte sequence.
C2 to DF hex (194 to 223): first byte of a two-byte sequence.
E0 to EF hex (224 to 239): first byte of a three-byte sequence.
F0 to FF hex (240 to 255): first byte of a four-byte sequence.

Which, as I interpret it, means that the byte can be there?

I have also seen the text:

Page Separator. FF is commonly used as a page separator, and it should be interpreted that way in text. When displaying on the screen, it causes the text after the separator to be forced to the next page. It is interpreted in the same way as the LS for line breaking, in parsing, or in input segmentation such as readline. FF does not interrupt a paragraph, as paragraphs can and do span page boundaries.

Finally I saw yesterday from https://www.unicode.org/versions/Unicode16.0.0/UnicodeStandard-16.0.pdf that:

U+00F4 is equivalent to U+006F followed by U+0302, but not
equivalent to U+0302 followed by U+006F.

Now the second part happens to test out okay but the first part does not test okay. Is this simply a rendering problem? Maybe. When trying in jstrdecode the code points it looks different from the output in say a text file.

The text was updated successfully, but these errors were encountered:

xexyl · 2024-10-21T14:23:39Z

Finally I saw yesterday from https://www.unicode.org/versions/Unicode16.0.0/UnicodeStandard-16.0.pdf that:
U+00F4 is equivalent to U+006F followed by U+0302, but not
equivalent to U+0302 followed by U+006F.
Now the second part happens to test out okay but the first part does not test okay. Is this simply a rendering problem? Maybe. When trying in jstrdecode the code points it looks different from the output in say a text file.

Okay this part seems very possibly the case as...

$ jstrdecode '\u006f\u0302'
ô

but in the display of text in debug output it shows:

debug[3]: \u0302\u006f != \u006f\u0302: ô != ̂o: true

which suggests that maybe it's because of that. I don't know.

xexyl · 2024-10-21T14:30:05Z

Finally I saw yesterday from https://www.unicode.org/versions/Unicode16.0.0/UnicodeStandard-16.0.pdf that:
U+00F4 is equivalent to U+006F followed by U+0302, but not
equivalent to U+0302 followed by U+006F.
Now the second part happens to test out okay but the first part does not test okay. Is this simply a rendering problem? Maybe. When trying in jstrdecode the code points it looks different from the output in say a text file.
Okay this part seems very possibly the case as...
$ jstrdecode '\u006f\u0302'
ô
but in the display of text in debug output it shows:
debug[3]: \u0302\u006f != \u006f\u0302: ô != ̂o: true
which suggests that maybe it's because of that. I don't know.

For the output of jstrdecode though:

$ jstrdecode '\u006f\u0302\u00f4'
ôô

Are those the same character? vim search suggests yes. Which suggests that there could be a problem.

See this though:

jstrdecode '\u0302\u006f\u006f\u0302'
̂oô

Those do indeed seem different. (A screenshot would be better to show it though.)

lcn2 · 2024-10-21T19:58:27Z

Has this issue been resolved?

xexyl · 2024-10-21T20:00:55Z

Has this issue been resolved?

I'm not sure if it has. But it's I think lower priority. I opened it only so the other one could be closed and this can be looked at later.

If you think that the above does not suggest a problem then I can certainly close it. That is why I used the word 'MIGHT'; I am not sure. It seems like it but it might not be too.

lcn2 · 2024-10-21T20:08:47Z

Has this issue been resolved?

I'm not sure if it has. But it's I think lower priority. I opened it only so the other one could be closed and this can be looked at later.

If you think that the above does not suggest a problem then I can certainly close it. That is why I used the word 'MIGHT'; I am not sure. It seems like it but it might not be too.

Feel free to keep it open if that helps: it is your repo and this your call. It seems, given your doubt, that this issue should remain open.

Nevertheless the question from GH-issuecomment-2427610604 in the other repo applies even if this issue remains open.

xexyl · 2024-10-21T20:12:54Z

Has this issue been resolved?

I'm not sure if it has. But it's I think lower priority. I opened it only so the other one could be closed and this can be looked at later.
If you think that the above does not suggest a problem then I can certainly close it. That is why I used the word 'MIGHT'; I am not sure. It seems like it but it might not be too.

Feel free to keep it open if that helps: it is your repo and this your call. It seems, given your doubt, that this issue should remain open.

Nevertheless the question from GH-issuecomment-2427610604 in the other repo applies even if this issue remains open.

Yes. I see you committed to sync it up. That's good. Just improvements in testing, really, plus the fun -E option.

It might help to keep this open at least for now but it seems also like it might be one of those cases where unless an actual problem shows up it could be closed. I opened it as a reminder though. My guess is nothing much (or anything at all?) will occur with this, and in that case it will be closed. The spec unfortunately is ambiguous on some points and that might cause the concern that things might not be right, whether or not they are.

xexyl · 2024-10-30T13:15:29Z

Just changed the name here as the decoding issues should be resolved. But I have a thought about encoding/decoding which I'll put in issue #14.

xexyl · 2024-11-01T12:50:11Z

Okay so this in the standard seems to be wrong or obsolete or something:

The octet values C0, C1, F5 to FF never appear.

as these can all be part of a code point that create different characters. So it's a matter of what else cannot be there that we have to check, to determine if anything else that's not yet rejected should be rejected.

xexyl · 2024-11-01T12:57:04Z

In fact .. and this is curious, in the Unicode standard:

While it causes no ambiguity in UTF-8, Unicode assigns a graphic character ("Latin Small Letter Y with
Diaeresis") to U+00FF (octets C3 B0 in UTF-8).

but in the UTF-8 RFC it says otherwise. And trying that code point:

$ jstrencode '\u00ff'
ÿ

My guess is that this issue is probably resolved. But I'm trying to determine that, hopefully soon, as it would be nice to know if this is fine. Unfortunately the JSON test suite we found had a file with UTF-8 bytes that seem to be some that should be rejected that was in the good directory. I have to look at that file again and see if that change has to be undone. It would be nice if we could match all files.

Anyway just thought you'd like an update here. I'll let you know where I get.

Safe travels!

lcn2 · 2024-11-01T13:01:34Z

Suggestion, @xexyl: See if @leobru has an opinion about this Unicode / UTF-8 issue.

If we recall correctly, he has some working knowledge of that space and perhaps has had involvement with the Unicode consortium at some level: perhaps it was something them to include new Unicode symbol?

If nothing else, he could test the jstrdecode(1) using one of his non-ASCII "keyboards" to see how well it works with, say, with Russian characters.

xexyl · 2024-11-01T13:11:50Z

Suggestion, @xexyl: See if @leobru has an opinion about this Unicode / UTF-8 issue.

If we recall correctly, he has some working knowledge of that space and perhaps has had involvement with the Unicode consortium at some level: perhaps it was something them to include new Unicode symbol?

If nothing else, he could test the jstrdecode(1) using one of his non-ASCII "keyboards" to see how well it works with, say, with Russian characters.

Good idea. Leo, would you be willing to help with this, please?

For instance, does it appear that everything that is rejected seems correct - and do you see anything that should be rejected that is not?

These are the checks that are made:

non character:

/* Start of the "not character" range. */
#define UNI_NOT_CHAR_MIN    0xFDD0
/* End of the "not character" range. */
#define UNI_NOT_CHAR_MAX    0xFDEF

The test is:

return x >= UNI_NOT_CHAR_MIN && x <= UNI_NOT_CHAR_MAX;

If that returns true then it is rejected.

bytes ending in either 0xFFFF or 0xFFFE are rejected
surrogate pairs:

/* Surrogate pair zone. */
#define UNI_SUR_HIGH_START  0xD800
#define UNI_SUR_HIGH_END    0xDBFF
#define UNI_SUR_LOW_START   0xDC00
#define UNI_SUR_LOW_END     0xDFFF

The test is:

val >= UNI_SUR_HIGH_START && val <= UNI_SUR_LOW_END

... if true then it's rejected as an invalid surrogate pair.

Finally, the range of bytes:

< 0x80 is one byte.
< 0x800 is two bytes.
< 0x10000 is three bytes.
< 0x110000 is four bytes.

Anything else is rejected.

Now the standard says things like it has no octet with the bytes FF but even the Unicode standard says otherwise, and doing something like:

jstrencode '\u00ff'

shows what it describes it as:

$ jstrencode '\u00ff'
ÿ

Your experience and knowledge, @leobru, would be greatly appreciated with our thanks!

xexyl · 2024-11-01T14:34:01Z

I am discovering that some of these code points that were rejected should not be rejected. A compliant program can accept non-character code points, it seems, so that will be removed from the check, and it will then allow the JSON test suite tool to match the parser. A later standard changed what can be allowed and that includes other characters that were rejected.

Thus the files that had to be moved and thus diverging from the JSON test suite, will in a commit (soon), match the test suite tool, which means that we should completely pass it!

I think that with that commit this issue can actually be marked as resolved. I might close it but if Leo wants to provide his insight I would still appreciate it. Nevertheless this new change will necessitate an update to mkiocccentry.

Still with this in mind it'll mean that all bugs can be resolved.

xexyl · 2024-11-01T14:46:26Z

Closing this as complete. If you, @leobru, believe that there are things that should be rejected, please do let me know. I'll see about updating the comment so that the tests that were removed (to block the code points) are no longer there.

lcn2 · 2024-11-01T14:49:31Z

Leo may not be following this repo and may be unaware of its relationship with the "other repo".

xexyl · 2024-11-01T14:51:24Z

Leo may not be following this repo and may be unaware of its relationship with the "other repo".

I'm sure he is not. If you have a recommended way to contact him to ask for his feedback, please do let me know.

I still believe this issue is now resolved, however, and I'm going to do some tests of non-ASCII characters too.

Can you think of anything else? I'll be doing a pull request in mkiocccentry soon.

lcn2 · 2024-11-01T14:55:10Z

Can you think of anything else? I'll be doing a pull request in mkiocccentry soon.

Not at this time and best wishes.

xexyl · 2024-11-01T14:56:31Z

Can you think of anything else? I'll be doing a pull request in mkiocccentry soon.

Not at this time and best wishes.

Great! And best wishes to you too! I have one more to do with the website and then I think that new issue can be marked complete. Well one more entry has to be updated. Unsure if I can do that right now or not but if I do not do it now I'll get to it today or else tomorrow.

xexyl added the bug Something isn't working label Oct 21, 2024

xexyl self-assigned this Oct 21, 2024

xexyl mentioned this issue Oct 21, 2024

Bug: Some Unicode symbols encoded as \uHexHexHexHex in JSON strings are not decoded properly #13

Closed

7 tasks

xexyl changed the title ~~Bug: some invalid UTF-8 **MIGHT** be accepted when it should be rejected~~ Bug: some invalid UTF-8 MIGHT be accepted when it should be rejected and there MIGHT be some decoding issues Oct 21, 2024

xexyl changed the title ~~Bug: some invalid UTF-8 MIGHT be accepted when it should be rejected and there MIGHT be some decoding issues~~ Bug: some invalid UTF-8 MIGHT be accepted when it should be rejected Oct 30, 2024

xexyl closed this as completed Nov 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug: some invalid UTF-8 MIGHT be accepted when it should be rejected #24

Bug: some invalid UTF-8 MIGHT be accepted when it should be rejected #24

xexyl commented Oct 21, 2024 •

edited

Loading

xexyl commented Oct 21, 2024 •

edited

Loading

xexyl commented Oct 21, 2024 •

edited

Loading

lcn2 commented Oct 21, 2024

xexyl commented Oct 21, 2024

lcn2 commented Oct 21, 2024

xexyl commented Oct 21, 2024

xexyl commented Oct 30, 2024

xexyl commented Nov 1, 2024

xexyl commented Nov 1, 2024

lcn2 commented Nov 1, 2024

xexyl commented Nov 1, 2024

xexyl commented Nov 1, 2024

xexyl commented Nov 1, 2024

lcn2 commented Nov 1, 2024

xexyl commented Nov 1, 2024

lcn2 commented Nov 1, 2024 •

edited

Loading

xexyl commented Nov 1, 2024

Bug: some invalid UTF-8 MIGHT be accepted when it should be rejected #24

Bug: some invalid UTF-8 MIGHT be accepted when it should be rejected #24

Comments

xexyl commented Oct 21, 2024 • edited Loading

Is there an existing issue for this?

Describe the bug

What you expect

Environment

jparse_bug_report.sh output

Anything else?

xexyl commented Oct 21, 2024 • edited Loading

xexyl commented Oct 21, 2024 • edited Loading

lcn2 commented Oct 21, 2024

xexyl commented Oct 21, 2024

lcn2 commented Oct 21, 2024

xexyl commented Oct 21, 2024

xexyl commented Oct 30, 2024

xexyl commented Nov 1, 2024

xexyl commented Nov 1, 2024

lcn2 commented Nov 1, 2024

xexyl commented Nov 1, 2024

xexyl commented Nov 1, 2024

xexyl commented Nov 1, 2024

lcn2 commented Nov 1, 2024

xexyl commented Nov 1, 2024

lcn2 commented Nov 1, 2024 • edited Loading

xexyl commented Nov 1, 2024

xexyl commented Oct 21, 2024 •

edited

Loading

xexyl commented Oct 21, 2024 •

edited

Loading

xexyl commented Oct 21, 2024 •

edited

Loading

lcn2 commented Nov 1, 2024 •

edited

Loading