
Bug: Some Unicode symbols encoded as \uHexHexHexHex in JSON strings are not decoded properly #13

Closed
7 tasks done
lcn2 opened this issue Sep 9, 2024 · 222 comments
Labels
bug: Something isn't working
Low priority: Low priority at this time

Comments

@lcn2
Contributor

lcn2 commented Sep 9, 2024

Is there an existing issue for this?

  • I have searched for existing issues and did not find anything like this

Describe the bug

Certain strings do not appear to be decoded properly by the jstrdecode(1) tool.

SUGGESTION: See GH-issuecomment-2350854096 for some updated commentary on this issue.

Some background

Consider this JSON document containing just a JSON string:

"œßåé"

When you enter that JSON document into https://jsonlint.com/ it shows that the JSON is valid.

Moreover, the jparse(1) tool also shows that the JSON document is valid:

$ jparse -J 3 -s '"œßåé"'
JSON tree[3]:	lvl: 0	type: JTYPE_STRING	len{p,c:q}: 8	value:	"\xc5\x93\xc3\x9f\xc3\xa5\xc3\xa9"
in parse_json(): JSON debug[1]: valid JSON

It is GOOD that the JSON parser is happy with this JSON document.

FYI: Here is what hexdump(1) shows for the string (we include the enclosing double quotes and trailing newline for consistency with the JSON file):

$ echo '"œßåé"' | hexdump -C
00000000  22 c5 93 c3 9f c3 a5 c3  a9 22 0a                 |"........".|
0000000b

Here is how jstrencode(1) processes the string:

$ jstrencode '"œßåé"'
\"œßåé\"

This is CORRECT: the double quotes are backslash-escaped (i.e., \") because within a JSON string, all double quotes MUST be backslashed.

Now if we use -Q, the enclosing double quotes are ignored:

$ jstrencode -Q '"œßåé"'
œßåé

If we consult https://codebeautify.org/json-encode-online and put "œßåé" on the input side, we see that the output side shows the same string.

Evidence of a decoding bug

Now, some JSON tools, such as jsp (see https://github.com/kjozsa/jsp), encode "œßåé" as:

$ echo '"œßåé"' | jsp --no-color --indent 4
"\u0153\u00df\u00e5\u00e9"

When we put the "\u0153\u00df\u00e5\u00e9" string into the input side of https://codebeautify.org/json-encode-online, the output side shows "œßåé".

However if we give the "\u0153\u00df\u00e5\u00e9" string to jstrdecode(1), we get something odd:

$ jstrdecode "\u0153\u00df\u00e5\u00e9"
S���

Using hexdump(1), we see the expected decoded output:

$ echo 'œßåé' | hexdump -C
00000000  c5 93 c3 9f c3 a5 c3 a9  0a                       |.........|
00000009

However this is what jstrdecode(1) produces:

$ jstrdecode "\u0153\u00df\u00e5\u00e9" | hexdump -C
00000000  01 53 df e5 e9 0a                                 |.S....|
00000006

This suggests that the JSON string decoding may be incorrect.
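
For reference, the bytes 01 53 above are just the raw 16-bit value of U+0153 split into two bytes; a correct decoder would instead serialize the code point as UTF-8. Here is a minimal sketch of that serialization step (a hypothetical helper for illustration, not the repo's actual code):

    #include <stdint.h>
    #include <stddef.h>

    /*
     * utf8_emit - serialize a Unicode code point as UTF-8
     *
     * Hypothetical helper for illustration only: writes 1 to 4 bytes
     * into buf (which must have room for 4) and returns the byte count,
     * or 0 if cp is not a Unicode scalar value.
     */
    static size_t
    utf8_emit(uint32_t cp, unsigned char *buf)
    {
        if (cp <= 0x7F) {                       /* U+0000..U+007F: 1 byte */
            buf[0] = (unsigned char)cp;
            return 1;
        } else if (cp <= 0x7FF) {               /* U+0080..U+07FF: 2 bytes */
            buf[0] = (unsigned char)(0xC0 | (cp >> 6));
            buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp <= 0xFFFF) {              /* U+0800..U+FFFF: 3 bytes */
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                return 0;                       /* lone surrogate: invalid */
            }
            buf[0] = (unsigned char)(0xE0 | (cp >> 12));
            buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        } else if (cp <= 0x10FFFF) {            /* U+10000..U+10FFFF: 4 bytes */
            buf[0] = (unsigned char)(0xF0 | (cp >> 18));
            buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
            return 4;
        }
        return 0;                               /* beyond U+10FFFF: invalid */
    }

With such a helper, \u0153 would yield c5 93 and \u00df would yield c3 9f, matching the expected hexdump(1) output above.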

What you expect

We expect the output of

jstrdecode "\u0153\u00df\u00e5\u00e9"

to be the same as:

echo 'œßåé'

Environment

- OS:SO
- Device:eciveD
- Compiler:relipmoC

bug_report.sh output

n/a

Anything else?

Consider the result of using -v 3:

$ jstrdecode -v 3 "\u0153\u00df\u00e5\u00e9"
debug[1]: enclose in quotes: false
debug[1]: newline output: true
debug[1]: silence warnings: false
debug[1]: processing arg: 0: <\u0153\u00df\u00e5\u00e9>
debug[3]: arg length: 24
debug[3]: decode length: 5
S���

The debug messages look OK, but the output is not:

$ jstrdecode "\u0153\u00df\u00e5\u00e9" | hexdump -C
00000000  01 53 df e5 e9 0a                                 |.S....|
00000006

which is not the same as the expected decoded output:

$ echo 'œßåé' | hexdump -C
00000000  c5 93 c3 9f c3 a5 c3 a9  0a                       |.........|
00000009

So it appears that strings with various \u<hex><hex><hex><hex> might not be decoded correctly.

UPDATE 0C

SUGGESTION: See GH-issuecomment-2350854096 for some updated commentary on this issue.

UPDATE 0D - todo list

It appears that the decoding issue itself is resolved. However a few things have to be done before this can be marked fixed.

  • We should make sure that an error or warning is issued when invalid UTF-8 codes are given.
  • We need to make sure that the unicode boolean is updated properly.
  • We need to fix the test cases in jstr_test.sh that had to be temporarily disabled.

Final todo:

  • Clean up the files json_utf8.c and json_utf8.h to include only things we need, making sure to keep credit.

I believe that with those done, as long as things look good, we can mark this as complete.

Side todo:

  • Sync this all to mkiocccentry (this can be done before this of course).
  • If all is good, update the IOCCC website repo to use jstrdecode instead of jsp.
@lcn2 lcn2 added the bug Something isn't working label Sep 9, 2024
@lcn2
Contributor Author

lcn2 commented Sep 9, 2024

How and why we filed 3 bug reports

We found something that seemed odd regarding JSON string encoding and decoding. It relates to the bin tools from the "Other repo" that we are working on.

While investigating, we discovered the issue with the man page (issue #11). We would have simply fixed the man page, but while investigating further we found the issue with the quotes (issue #12).

The encode / decode puzzled us. Strings with a wide variety of non-ASCII characters (like "àßç") seemed to be valid within JSON strings without the need to "encode" them as "\u". However, other JSON implementations seemed to do just that.

It appears that among the multiple "bogons" of the JSON spec is that there is a "choice": one can use "é" or one can use "\uc3a9" (because é in UTF-8 is c3 a9). Both are valid ways to refer to the same thing. In fact, one can refer to the ASCII "a" as itself or as "\u0061" (ASCII "a" is 0x61). One cannot compare two JSON strings by a simple byte comparison. One MUST decode the JSON string and then compare byte for byte (groan / sigh).

So that is how we started trying to use jstrencode(1) and jstrdecode(1) to play around encoding and decoding. That is how we discovered what seems to be a decode issue (issue #13).

It is a good thing we did discover these issues now. They do block a fix/improvement to how the "other repo" will be able to render web pages that contain non-ASCII characters (such as in an author's name or an award title). Those happen to work because the JSON so-called spec chose to render the symbols. Had someone produced a string using the equivalent "\u" escapes, things would have been broken (at least in terms of how the web pages would have looked).

Anyway we wanted you to know why we were looking into JSON string encoding/decoding, and why we filed these issues.

Now we will return to the bin tools (we are nearly done, but for these issues and the need for another tool) so we can then return to the submit server, get the screen snapshots, and complete one of the TODOs needed for the Great Fork Merge.

@lcn2
Contributor Author

lcn2 commented Sep 9, 2024

2 proposed new options for jstrencode

  • In GH-issuecomment-2337677465, we see that there are two ways to "represent" a symbol: as the symbol itself and as \u<hex><hex><hex><hex>.

How should the jstrencode(1) and jstrdecode(1) tools deal with these two ways to represent a symbol?

By default, jstrencode(1) would encode symbols as themselves unless they were one of the symbols that must be backslash-escaped (such as the double-quote, the newline, the NUL byte, etc.).

With jstrencode -a, every symbol must be encoded as \u<hex><hex><hex><hex> according to the 2-byte encoded hex value. Here -a stands for "all".

NOTE: With -a, the double-quote, instead of being backslash-escaped as \", would be encoded as \u0022 (the double-quote symbol has the value 0x22). And lower-case a would be encoded as \u0061 (a has the value 0x61).

With jstrencode -8, any symbol with a value > 127 MUST be encoded as \u<hex><hex><hex><hex> according to the 2-byte encoded hex value. Here -8 stands for "encode if the value does not fit in 7-bit ASCII (i.e., is >= 2^7)".

Of course, the jstrdecode(1) tool will always decode \u<hex><hex><hex><hex> back into the original symbol, so -a and -8 do not apply to that tool.
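
For illustration, the proposed -a emission for a BMP code point could be as simple as formatting the code point as a 4-hex-digit escape. A hypothetical sketch (the helper name is illustrative, not an existing jstrencode interface):

    #include <stdio.h>

    /*
     * Hypothetical sketch of the proposed -a behavior: always emit the
     * \u form of a BMP code point.
     */
    static void
    emit_u_escape(unsigned int cp)
    {
        printf("\\u%04x", cp);
    }

    int
    main(void)
    {
        emit_u_escape(0x22);    /* prints \u0022 for the double-quote */
        emit_u_escape(0x61);    /* prints \u0061 for lower-case a */
        putchar('\n');
        return 0;
    }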

UPDATE 0

Do we need to encode at 2-byte UTF-8 or ???

On the one hand, ß is UTF-8 c3 9f, so \uc39f.

On the other hand, ß has a byte value of 0xdf, so \u00df. We DO have examples of external applications encoding characters this way, so at a minimum jstrdecode(1) needs to be able to decode \u00df back to ß.

Which is needed for JSON encoding?
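
For what it is worth, RFC 4627 section 2.5 (and later RFC 8259 section 7) define the \u escape in terms of the character's code point, not its UTF-8 bytes, so the two candidate readings work out as:

    \u00df  ->  code point U+00DF (ß)  ->  UTF-8 bytes c3 9f
    \uc39f  ->  code point U+C39F (a Hangul syllable), NOT ß

That reading says the \u00df form is what JSON encoding calls for; the c3 9f bytes only appear when the decoded code point is serialized back out as UTF-8.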

UPDATE 1

According to RFC 4627 Section 3:

3.  Encoding

   JSON text SHALL be encoded in Unicode.  The default encoding is
   UTF-8.

   Since the first two characters of a JSON text will always be ASCII
   characters [RFC0020], it is possible to determine whether an octet
   stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
   at the pattern of nulls in the first four octets.

           00 00 00 xx  UTF-32BE
           00 xx 00 xx  UTF-16BE
           xx 00 00 00  UTF-32LE
           xx 00 xx 00  UTF-16LE
           xx xx xx xx  UTF-8

Does that provide the vital clue?
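
Transcribing that table into a check is straightforward. A sketch (the helper name is hypothetical), with the caveat that the later RFC 8259 tightened this by requiring UTF-8 outright for JSON exchanged between systems:

    #include <stddef.h>

    /*
     * detect_json_encoding - classify a JSON octet stream per RFC 4627
     *
     * Hypothetical helper for illustration only: applies the RFC 4627
     * section 3 table to the pattern of NUL bytes in the first four
     * octets, defaulting to UTF-8.
     */
    static const char *
    detect_json_encoding(const unsigned char *p, size_t len)
    {
        if (len >= 4) {
            if (p[0] == 0 && p[1] == 0 && p[2] == 0 && p[3] != 0) return "UTF-32BE";
            if (p[0] == 0 && p[1] != 0 && p[2] == 0 && p[3] != 0) return "UTF-16BE";
            if (p[0] != 0 && p[1] == 0 && p[2] == 0 && p[3] == 0) return "UTF-32LE";
            if (p[0] != 0 && p[1] == 0 && p[2] != 0 && p[3] == 0) return "UTF-16LE";
        }
        return "UTF-8";
    }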

@xexyl
Owner

xexyl commented Sep 9, 2024

Environment

- OS:SO
- Device:eciveD
- Compiler:relipmoC

:-)

BTW: did you ever notice that the word device is in deceive if you rearrange letters a bit?

@xexyl
Owner

xexyl commented Sep 9, 2024

I am waking up more now. The AC might be helping there. Unfortunately I am away from the computer for a little bit.

Your use of Eszett did make me check. It seems that's missing from the UTF-8 map in the mkiocccentry repo.

If I can determine the right \u value I can add it when I am back, but since I was here I thought I might finish writing my comment (which I believe is almost done, though I do have a bit more to say when I return).

Some of these bug reports can be addressed easily, though the new options will of course take more time.

UPDATE 0

Oh! You do have it .. sharp S, but you don't have it mapped to ss, which it should be. I'll fix that.

UPDATE 1

Eszett fixed in ioccc-src/mkiocccentry@f1cb9e6.

@xexyl
Owner

xexyl commented Sep 9, 2024

Okay now back to more serious details.

Caveat

I am very tired so I might be missing something with the thought below. If so I'll either see it later when more alert/awake/clear-headed or maybe what I write will make you think of something.

With that out of the way...

First things first (actually this is the second thing :-) or third if you count the above comment first)

It seems like it might be a good idea to figure out what has to map to what. And since there are so many characters, the question I have is whether we need a table like struct encode jenc[] found in json_parse.c. But maybe it's more like the table struct utf8_posix_map hmap[] in mkiocccentry's soup/utf8_posix_map.c. Or perhaps it's kind of a combination of the two?

I am not sure how this might go, though, due to the way the jenc table works, in particular it holding 8-bit values.

But if we need to encode the Eszett (again as per your example) then what should it be encoded to?

We have found external applications that do encode ß as \u00df. So clearly jstrdecode(1) needs to convert \u00df back into ß. So at a minimum this needs to happen.

Will other external applications encode ß as \uc39f, and if so will jstrdecode(1) need to convert \uc39f back into ß as well?

At a QUICK look at json_parse.c it seems like the reason jparse.c works is due to the function json_conv_string(). The function that calls that is parse_json_string() and it does so like:

str = json_conv_string(string, len, true);

and the boolean indicates:

 *      quote   true ==>  ignore JSON double quotes: both ptr[0] & ptr[len-1]
 *                        must be '"'

Of course it's entirely possible that at this quick glance I am parsing :-) this wrong :-( but I wonder if this might be something to look at. If that is how the parser does it (parse_json_string() is in jparse.y, so it seems likely) then the question is whether that is why jparse gets it right. If it does, maybe it's something to consider? Do we need (in this case at least?) the json_decode()? I'd say it's not as simple as replacing it, given what that function also does, but the question is: what is different here?

Again this is assuming that I'm not missing something obvious at my quick glance.

UPDATE 0

I was thinking of the reverse step, so the above is not going to work even if it were as simple as swapping it in (with some modifications) ... which it almost certainly would not be. Even so, it's interesting that that function does seem to work okay while the decode one does not.

@xexyl
Owner

xexyl commented Sep 9, 2024

Ah .. but of course it's the reverse. The parser does the opposite. So it definitely is not so simple. I wonder if something similar can be done though? I'm not sure. I'm hoping I can read all three bug reports more carefully soon. But I also wonder about the other repo's is_available.sh script we were talking about me writing. I might do that sometime soon. But I do have other things I have to do today too.

@xexyl
Owner

xexyl commented Sep 9, 2024

QUESTION - priority wrt the IOCCC

As I have other things I have to do including over at mkiocccentry but also things unrelated to programming, I just wanted to ask what the priority of this issue is versus the IOCCC.

As a programmer, and I'm sure you are very familiar with this, this is an issue that I do not like having, but I also know there is such a thing as prioritising so I am asking this here. I hope to look at the is_available.sh script sometime soon but I have those other things I have to do too so I'm not sure what I'll get done today.

@xexyl
Owner

xexyl commented Sep 9, 2024

Another funny thing I noticed is reading from stdin seems wrong. Look:

$ jstrdecode 
foo bar
Warning: json_decode: found non-\-escaped char: 0x0a
Warning: jstrdecode_stream: error while encoding stdin buffer

... when I hit ctrl-d obviously. But look:

$ echo foo bar | jstrdecode 
Warning: json_decode: found non-\-escaped char: 0x0a
Warning: jstrdecode_stream: error while encoding stdin buffer

What am I missing? I don't recall this behaviour at all. In fact in the man page there's an example where this works.

@xexyl
Owner

xexyl commented Sep 9, 2024

My thinking is that I should resolve the jstrdecode problems brought up (new option included) and commit and then we can worry about this one.

But I must leave for now possibly for the day.

@xexyl
Owner

xexyl commented Sep 9, 2024

My thinking is that I should resolve the jstrdecode problems brought up (new option included) and commit and then we can worry about this one.

Actually I wonder if this should be looked at now .. might require a change in how the above is done. But maybe not. I don't know how to even begin to address it however.

@xexyl
Owner

xexyl commented Sep 9, 2024

I'm afraid that that bug has to be fixed first in order to properly test this. The trouble is that, at this time, I'm not sure where to begin with it other than checking that warning message (of course). But what that will show is another matter.

@xexyl
Owner

xexyl commented Sep 9, 2024

The problematic code seems to be in json_parse.c:

            /*
             * disallow characters that should have been escaped
             */
            switch (c) {
            case '\b':  /*fallthrough*/
            case '\t':  /*fallthrough*/
            case '\n':  /*fallthrough*/
            case '\f':  /*fallthrough*/
            case '\r':
                /* error - clear allocated length */
                if (retlen != NULL) {
                    *retlen = 0;
                }
                warn(__func__, "found non-\\-escaped char: 0x%02x", (uint8_t)c);
                return NULL;
                break;

But this is coming from sending EOF.

@xexyl
Owner

xexyl commented Sep 9, 2024

I am guessing that it is from read_all() not removing the EOF or else the decode function needs to stop at EOF. But of course what if the string has an EOF in the middle? I guess one could check that it's EOF at the end of the string but as I did not write those functions and I am way too tired to be doing this I will have to stop for now and probably not continue until tomorrow.

By all means please advise on what you think there. Of course since there is very possibly a bug in the decode function anyway ...

@xexyl
Owner

xexyl commented Sep 9, 2024

This diff:

-    buf = json_decode(input, inputlen, &bufsiz, NULL);
+    buf = json_decode(input, inputlen-1, &bufsiz, NULL);

solves the problem with the EOF being seen in the decode function, at the end. But whether it will cause more problems or not I do not know. What do you think?
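
If the real culprit turns out to be the trailing newline added by the shell rather than an EOF byte, a narrower guard may be to strip a trailing newline only when present, instead of always dropping the last byte. A hedged sketch of that idea, reusing the names from the diff above:

    /*
     * Sketch: strip at most one trailing newline, if present, before
     * decoding, rather than unconditionally shortening the buffer.
     */
    if (inputlen > 0 && input[inputlen - 1] == '\n') {
        --inputlen;
    }
    buf = json_decode(input, inputlen, &bufsiz, NULL);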

@xexyl
Owner

xexyl commented Sep 9, 2024

Returning tomorrow, hopefully after better sleep. Unfortunately (and not at all surprising) the excessive heat alert was extended from ending at 2000 hours tonight to 2000 hours tomorrow, so that's not fun, but hopefully it's not much longer than that. Not that the alert ending means it won't still be scorching, but any lower temperature is better.

@xexyl
Owner

xexyl commented Sep 9, 2024

UPDATE 0a

One issue may be the difference between the UTF-8 encoding and the 8-bit value. The ß symbol in UTF-8 is c3 9f. But as a 1-byte value ß is 0xdf.

What is needed here?

Good question.

We have found external applications that do encode ß as \u00df. So clearly jstrdecode(1) needs to convert \u00df back into ß. So at a minimum this needs to happen.

I guess the table would help there? Not looking at it right now and just replying with this because I had started it earlier and did not want to leave it unfinished before I shut down.

Will other external applications encode ß as \uc39f, and if so will jstrdecode(1) need to convert \uc39f back into ß as well?

That would be quite unfortunate. What do you think we should do there?

@lcn2
Contributor Author

lcn2 commented Sep 10, 2024

QUESTION - priority wrt the IOCCC

As I have other things I have to do including over at mkiocccentry but also things unrelated to programming, I just wanted to ask what the priority of this issue is versus the IOCCC.

As a programmer, and I'm sure you are very familiar with this, this is an issue that I do not like having, but I also know there is such a thing as prioritising so I am asking this here. I hope to look at the is_available.sh script sometime soon but I have those other things I have to do too so I'm not sure what I'll get done today.

Issue impacts

Issue #13 could be said to impact progress on the Great Fork Merge, as it potentially impacts the non-ASCII characters on the web site in 4 places. There are "hack-a-rounds" where one might replace the non-ASCII characters with ASCII quasi-equivalents, but at a significant loss of looking nice.

Issue #13 would potentially impact new authors with non-ASCII characters in their names where there isn't a good ASCII quasi-equivalent. One could say that it impacts the I in IOCCC.

Solving issue #13 might require a fix to some jparse JSON string decoding related code, not just to the jstrdecode(1) tool. That would certainly involve a new release of the mkiocccentry repo: something that should happen before we ask the public to try out the repo code (which is post Great Fork Merge and pre opening of IOCCC28).

Current non-ASCII character impact

Here are the 4 places where non-ASCII characters are found in JSON. There are "hack-a-rounds" where one might replace the non-ASCII characters with ASCII quasi-equivalents, but at a significant loss of looking nice.

  • 2000/bmeyer - michael.pgm image text

OK, there is a workaround to change the fancy quote character to an ASCII single quote.

  • 2020/kurdyukov4/.entry.json - award title

One could change "lámatyávë" into ASCII "lamatyave", but that would be a shame to do so.

  • author/Anton_Algmyr.json - Author full name

One could change "Anton Älgmyr" into ASCII "Anton Algmyr", but that would be a shame to do so.

  • author/Ondrej_Adamovsky.json - Author full name

One could change "Ondřej Adamovský" into ASCII "Ondrej Adamovsky", but that would be a shame to do so.

Of course, the above 4 places then cause these web pages to be impacted as a result of the above 4 issues:

  • authors.html
  • location.html
  • years.html
  • 2000/bmeyer/index.html
  • 2018/algmyr/index.html
  • 2019/adamovsky/index.html
  • 2020/kurdyukov4/index.html

The 4 "hack-a-rounds" noted above could address the above web pages, but that would be a shame to do so.

Priority

If there is a fix for this decoding problem, then it should be applied before the Great Fork Merge. If that is not possible, then we could force the 4 "hack-a-rounds" noted above for now, and then potentially have to fix it before we ask for a public trial of this repo. We might have to put up a "KNOWN BUG" notice for this repo, however.

Our suggestion is that you look into the issue enough to determine the type of solution needed and to assess the time/complexity of the fix. Depending on what you find, we might opt to fix things before the Great Fork Merge, or before the public trial of this repo, or before IOCCC28, or ...

p.s.

We could have made a fix to issue #11 and issue #12, but we didn't want to complicate your looking into issue #13, so we stayed out of all 3 rather than potentially impacting your efforts on the more important issue #13.

@lcn2
Contributor Author

lcn2 commented Sep 10, 2024

At a QUICK look at json_parse.c it seems like the reason jparse.c works is due to the function json_conv_string(). The function that calls that is parse_json_string() and it does so like:

str = json_conv_string(string, len, true);

and the boolean indicates:

 *      quote   true ==>  ignore JSON double quotes: both ptr[0] & ptr[len-1]
 *                        must be '"'

Of course it's entirely possible that at this quick glance I am parsing :-) this wrong :-( but I wonder if this might be something to look at. If that is how the parser does it (parse_json_string() is in jparse.y, so it seems likely) then the question is whether that is why jparse gets it right. If it does, maybe it's something to consider? Do we need (in this case at least?) the json_decode()? I'd say it's not as simple as replacing it, given what that function also does, but the question is: what is different here?

Again this is assuming that I'm not missing something obvious at my quick glance.

UPDATE 0

I was thinking of the reverse step, so the above is not going to work even if it were as simple as swapping it in (with some modifications) ... which it almost certainly would not be. Even so, it's interesting that that function does seem to work okay while the decode one does not.

The json_decode() function is almost certainly the place the fix will impact.

@lcn2
Contributor Author

lcn2 commented Sep 10, 2024

My thinking is that I should resolve the jstrdecode problems brought up (new option included) and commit and then we can worry about this one.

Actually I wonder if this should be looked at now .. might require a change in how the above is done. But maybe not. I don't know how to even begin to address it however.

A potential change to the API is a reason to fix it sooner rather than later. Hopefully it is just a fix to json_decode(), which would be less impactful.

@lcn2
Contributor Author

lcn2 commented Sep 10, 2024

Another funny thing I noticed is reading from stdin seems wrong. Look:

$ jstrdecode 
foo bar
Warning: json_decode: found non-\-escaped char: 0x0a
Warning: jstrdecode_stream: error while encoding stdin buffer

... when I hit ctrl-d obviously. But look:

$ echo foo bar | jstrdecode 
Warning: json_decode: found non-\-escaped char: 0x0a
Warning: jstrdecode_stream: error while encoding stdin buffer

What am I missing? I don't recall this behaviour at all. In fact in the man page there's an example where this works.

In JSON encoded strings, a "naked" newline (0x0a) is NOT allowed. Such a character must be encoded as the 2 ASCII characters "\n".

FYI, this works:

echo -n foo bar | jstrdecode

This also works:

printf "%s %s\\\n" foo bar | jstrdecode

As does this using gnu sed:

echo foo bar | gsed -z 's/\n/\\n/g' | jstrdecode

To see what is happening, try with -Q:

$ echo foo bar | gsed -z 's/\n/\\n/g' | jstrdecode -Q
"foo bar
"

UPDATE 0

No to GH-issuecomment-2339232202 and No to GH-issuecomment-2339246212.

Our guess is that it is NOT related to some EOF or the input being one character too long, but rather that a naked newline is NOT allowed within a JSON encoded string. One MUST encode it with these 2 ASCII characters: \n

Or are we missing something?

@lcn2
Contributor Author

lcn2 commented Sep 10, 2024

UPDATE 0a

One issue may be the difference between the UTF-8 encoding and the 8-bit value. The ß symbol in UTF-8 is c3 9f. But as a 1-byte value ß is 0xdf.
What is needed here?

Good question.

We have found external applications that do encode ß as \u00df. So clearly jstrdecode(1) needs to convert \u00df back into ß. So at a minimum this needs to happen.

I guess the table would help there? Not looking at it right now and just replying with this because I had started it earlier and did not want to leave it unfinished before I shut down.

Will other external applications encode ß as \uc39f, and if so will jstrdecode(1) need to convert \uc39f back into ß as well?

That would be quite unfortunate. What do you think we should do there?

We do not know. You will have to try and research the problem.

SUGGESTION

The problem seems to be on the "JSON string decoding" side. Perhaps the code that converts the 6 ASCII characters \u[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f] into one or more bytes is broken?
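
Worth noting while in that code: \u escapes only name BMP code points directly; characters above U+FFFF (emoji, for example) arrive as a UTF-16 surrogate pair of two consecutive escapes, which a correct decoder must combine before serializing. A minimal sketch of the pairing step (hypothetical helper; assumes the two 4-hex-digit values are already parsed):

    #include <stdint.h>

    /*
     * combine_u_escapes - combine two \uXXXX values into one code point
     *
     * Hypothetical helper for illustration only. hi is the first escape
     * value and lo the next one (if any). Sets *used_pair to 1 when both
     * values were consumed as a surrogate pair. Returns the code point,
     * or UINT32_MAX on an unpaired surrogate.
     */
    static uint32_t
    combine_u_escapes(uint16_t hi, uint16_t lo, int *used_pair)
    {
        *used_pair = 0;
        if (hi >= 0xD800 && hi <= 0xDBFF) {     /* high (leading) surrogate */
            if (lo < 0xDC00 || lo > 0xDFFF) {
                return UINT32_MAX;              /* unpaired high surrogate */
            }
            *used_pair = 1;
            return 0x10000 + (((uint32_t)hi - 0xD800) << 10) + ((uint32_t)lo - 0xDC00);
        }
        if (hi >= 0xDC00 && hi <= 0xDFFF) {
            return UINT32_MAX;                  /* lone low surrogate */
        }
        return hi;                              /* ordinary BMP code point */
    }

For example, \uD83D\uDE00 pairs up to U+1F600, which then serializes to the 4 UTF-8 bytes f0 9f 98 80.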

@lcn2
Contributor Author

lcn2 commented Sep 10, 2024

We believe we have addressed all of the current questions that still need answering at this time. If we've missed something or something else needs to be clarified, please ask again.

@xexyl
Owner

xexyl commented Sep 10, 2024

Ahhh right. The newline! Of course. With the hex value that should be obvious.

But the problem there is: how does one enter EOF when they do not specify an arg? The input will end up with a newline even if they do something like...

jstrdecode 
foo^D

will show something like:

$ ./jstrdecode -v 5
debug[1]: enclose in quotes: false
debug[1]: newline output: true
debug[1]: silence warnings: false
debug[1]: about to encode all data on input stream
foo^D

Or is this a bug in macOS Terminal?

@xexyl
Owner

xexyl commented Sep 10, 2024

UPDATE 0a

One issue may be the difference between the UTF-8 encoding and the 8-bit value. The ß symbol in UTF-8 is c3 9f. But as a 1-byte value ß is 0xdf.
What is needed here?

Good question.

We have found external applications that do encode ß as \u00df. So clearly jstrdecode(1) needs to convert \u00df back into ß. So at a minimum this needs to happen.

I guess the table would help there? Not looking at it right now and just replying with this because I had started it earlier and did not want to leave it unfinished before I shut down.

Will other external applications encode ß as \uc39f, and if so will jstrdecode(1) need to convert \uc39f back into ß as well?

That would be quite unfortunate. What do you think we should do there?

We do not know. You will have to try and research the problem.

SUGGESTION

The problem seems to be on the "JSON string decoding" side. Perhaps the code that converts the 6 ASCII characters \u[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f] into one or more bytes is broken?

Are you referring to the jenc[] table used by both json_decode() and json_encode()? At a glance it seems that is indeed it, but the question is what needs to change. I think that if we're looking at having more than one value for each (or some) char(s) then we might need to expand the table. What do you think?

@xexyl
Owner

xexyl commented Sep 10, 2024

Ahhh right. The newline! Of course. With the hex value that should be obvious.

But the problem there is: how does one enter EOF when they do not specify an arg? The input will end up with a newline even if they do something like...

jstrdecode 
foo^D

will show something like:

$ ./jstrdecode -v 5
debug[1]: enclose in quotes: false
debug[1]: newline output: true
debug[1]: silence warnings: false
debug[1]: about to encode all data on input stream
foo^D

Or is this a bug in macOS Terminal?

Something tells me it might be a bug in macOS Terminal ... which would be unfortunate. But I think that because I know I used to be able to do this; I did so for the man page and all was good. So the question is how to get past this one.

As for how to handle the quotes on stdin versus args when some arg can be '-', the answer might be a dynamic array of strings that are appended to and then printed at the end. I have a little bit of time but I am not sure if I can complete that task before I have to go. Later today hopefully I'll have more time and energy.

@xexyl
Owner

xexyl commented Sep 10, 2024

I think I solved this bug .. except that I cannot test typing EOF at the command line.

@xexyl
Owner

xexyl commented Sep 10, 2024

Before I commit

(... or until I do more testing, if you don't get back to me) I want to make sure I have used the dyn_array facility correctly, as this is the first time I have done so. It appears to be okay, but since this is my first time using the facility I want to run it by you to make sure that I didn't miss something. The test example code suggests I have it right, but it's entirely possible I missed something (I guess).

To create a dynamic array for a char * I should do:

    array = dyn_array_create(sizeof(char *), JSON_CHUNK, JSON_CHUNK, true);

QUESTION

Should I use JSON_CHUNK or just CHUNK? I think JSON_CHUNK due to the fact it's JSON related.

Next, to append each value I should do:

    if (dyn_array_append_value(array, &buf)) {
        dbg(DBG_LOW, "moved data after appending buf: %s", buf);
    }

and to get the length of the array I should do:

array_len = dyn_array_tell(array);

and to then write each element I can do something like:

   for (idx = 0; idx < array_len; ++idx) {
        /*
         * write starting escaped quote if requested
         */
        if (esc_quotes) {
            fprint(stdout, "%s", "\\\"");
        }
        buf = dyn_array_value(array, char *, idx);
        bufsiz = (uintmax_t)strlen(buf);
        errno = 0;              /* pre-clear errno for warnp() */
        outputlen = fwrite(buf, 1, bufsiz, stdout);
        if (outputlen != bufsiz) {
            warnp(__func__, "error: write of %ju bytes of arg: %jd returned: %ju",
                            (uintmax_t)bufsiz, idx, (uintmax_t)outputlen);
            success = false;
        }

        /*
         * write ending escaped quote if requested
         */
        if (esc_quotes) {
            fprint(stdout, "%s", "\\\"");
        }
    }

and finally after everything to free the array:

    /*  
     * free dynamic array
     */ 
    if (array != NULL) {
        dyn_array_free(array);
        array = NULL;
    }

Have I missed anything? If not then this bug at least is fixed, though I know there are other issues.

I know that the append function requires a pointer to a pointer so I had to do a &buf not just buf. Not doing that will segfault. I can run this through valgrind and verify things are okay but you might want something done. But of course those could come later I suppose.

UPDATE 0

Okay there is one mistake above I can see. The optind reference. That obviously has to be removed or done differently.

UPDATE 1

Fixed in above code. As long as arg 0 can be for no arg specified (read from stdin only) then that should be fine.

@xexyl
Owner

xexyl commented Oct 8, 2024

I will make this commit shortly and then close this. However if you discover a problem please open again and we can work it out.
I'll sync over to mkiocccentry too.

make all.recreate_clone && make all.update_from_clone && make release && git status

is a good check. :-)

Indeed .. doing soon.

@xexyl
Owner

xexyl commented Oct 8, 2024

Okay closing this as resolved. If this turns out to not be true please feel free to open again or if it seems like another issue then by all means open a new issue.

I do want to see about cleaning up json_utf8.[ch] but not today. Have to go do other things now.

Good luck in being fully recovered by tomorrow! (And if asymptomatic now then hopefully not contagious too.)

@xexyl xexyl closed this as completed Oct 8, 2024
@lcn2
Contributor Author

lcn2 commented Oct 8, 2024

Congratulations on resolving a tricky and difficult issue, @xexyl ‼️

@xexyl
Owner

xexyl commented Oct 8, 2024

Congratulations on resolving a tricky and difficult issue, @xexyl ‼️

Thank you very much! And likewise to you too. Let's hope it is indeed all that's needed. Best wishes on a good night sleep for you!

@xexyl
Owner

xexyl commented Oct 8, 2024

Leaving for the day .. if any problems are found please let me know (or if you have a fix, make a pull request). But of course also make sure to get some rest. I am happy this is resolved though, or so it seems.

Have a great day! I look forward to seeing those scripts updated in the other repo.

xexyl added a commit that referenced this issue Oct 9, 2024
Remove has_nul in struct json_string as UTF-8 should, it is my
understanding, not have a NUL byte.

Fix path in jsemcgen.sh. The repo we know that uses this might have a
problem here that has to be fixed but the path was broken for this repo
so it has to be done, and as the script has an option to change the tool
path, it should not be a problem as this is just the default path.

Update all tools and the release to be the same version after issue #13 was
resolved: "1.2.0 2024-10-09". "1.2.0" was chosen because it was the first one >
some of the versions and the others could be bumped up to it without any harm.
@xexyl
Owner

xexyl commented Oct 10, 2024

So.. do we need to do anything with:

jstrdecode: Beginning tests of JSON decode/encode functionality ...
jstrdecode: XXX - JSON encode/decode tests not yet written - XXX
jstrdecode: ... passed JSON encode/decode tests

?
Because the jstr_test.sh script does a number of checks for this. If you think that it should be done in code then please either open a pull request OR let me know what you think it should be. If you think it should be fine as is then I can remove that message from jstrencode and jstrdecode.c.

Because the jstr_test.sh script does a number of checks for this, the "XXX - JSON encode/decode tests not yet written - XXX" messages should be removed.

Well as you know .. I decided to have some fun with adding it anyway. :-)

I had to decide which emojis to try. I almost considered the ridiculous one I found yesterday: they actually have an emoji for a pregnant MAN! I was also considering the other sex. But this seemed going too far as not everyone appreciates such dark humour (as it is some might not appreciate the one I did do but oh well).

I had the thought of something with a Ring. Ah .. I should have also done a ring and a volcano! Oh well: maybe next time. Or not. Probably not worth it. Though I don't know .. it seems pretty tempting. I wonder. Well, maybe if I have time before it's frozen.

I have another comment to look at from you, I know, and I'm going for that now.

@lcn2
Contributor Author

lcn2 commented Oct 10, 2024

Thanks to the fixed jstrdecode(1) tool, JSON files in that "other repo" can now use \uHexHexHexHex in JSON strings while the HTML files produced have the correct decoded Unicode characters.

UPDATE 0

See commit da0c000f79c6a77ce0e6edf8e971ff2bcdc2a893 in the other repo.

@xexyl
Owner

xexyl commented Oct 11, 2024

So.. do we need to do anything with:

jstrdecode: Beginning tests of JSON decode/encode functionality ...
jstrdecode: XXX - JSON encode/decode tests not yet written - XXX
jstrdecode: ... passed JSON encode/decode tests

?
Because the jstr_test.sh script does a number of checks for this. If you think that it should be done in code then please either open a pull request OR let me know what you think it should be. If you think it should be fine as is then I can remove that message from jstrencode and jstrdecode.c.

Because the jstr_test.sh script does a number of checks for this, the "XXX - JSON encode/decode tests not yet written - XXX" messages should be removed.

Well as you know .. I decided to have some fun with adding it anyway. :-)

I had to decide which emojis to try. I almost considered the ridiculous one I found yesterday: they actually have an emoji for a pregnant MAN! I was also considering the other sex. But this seemed going too far as not everyone appreciates such dark humour (as it is some might not appreciate the one I did do but oh well).

With commit 0deffa6 I actually did this. I'm rotten, I know, but it's only done if a certain option and level is used. :-) Lower levels have other fun things though I probably should make it so it checks the consistency too. I'll do that sometime soon.

UPDATE 0

Okay also see commit d2586d9 :-)

@xexyl
Owner

xexyl commented Oct 11, 2024

Thanks to the fixed jstrdecode(1) tool, JSON files in that "other repo" can now use \uHexHexHexHex in JSON strings while the HTML files produced have the correct decoded Unicode characters.

Very nice!

UPDATE 0

See commit da0c000f79c6a77ce0e6edf8e971ff2bcdc2a893 in the other repo.

Pulled it some hours ago - glad it works out well! It took some work but I'm glad I focused on it even though we had a workaround.

@xexyl
Owner

xexyl commented Oct 13, 2024

Unfortunately I am not sure that this bug is completely resolved :( My understanding is that invalid UTF-8 bytes are not allowed and should be rejected.

Well...

jstrdecode '\uff00'    # try this at the command line and you get something else, an indication that it's wrong

FF00 is NOT a valid UTF-8 code point! But on the other hand 1000 is, so we can't just check whether the second byte is 00 and the first is not (0000 is also valid, of course).

The code used suggests that code points that start with FF (or have that byte at all) are invalid, but FF01 is actually not invalid! Neither are a bunch of others with FF in them.

So I'm not sure what to do here.
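
One possible untangling of this: UTF-8 well-formedness and code point assignment are different layers. Any scalar value from U+0000 through U+10FFFF, excluding the surrogate range U+D800 through U+DFFF, can legally be serialized as UTF-8; whether a given code point (such as U+FF00) is assigned to a character is a separate Unicode Character Database question, and most JSON parsers only enforce the former. A sketch of just the scalar-value test:

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * is_unicode_scalar - can cp legally be serialized as UTF-8?
     *
     * This checks well-formedness only; it says nothing about whether
     * the code point is assigned to a character.
     */
    static bool
    is_unicode_scalar(uint32_t cp)
    {
        return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
    }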

@xexyl
Owner

xexyl commented Oct 13, 2024

Okay, one bug was fixed just now with commit 1a33062. It is recommended, it seems, that if a code point is in the so-called non-character range, the code point should be set to the replacement character. Unfortunately this broke make test, so I have to fix that. I was going to go afk to exercise a bit but might avoid that as I have to go out today. Not sure yet.

UPDATE 0

Upon further inspection it seems a properly conforming program can either reject it or change it to a designated replacement character. As it seems better to reject it outright, and since that would allow us to not diverge even further from the JSON test suite, I have rolled back that part of the commit.

I am unfortunately not having an easy time seeing enough where I can easily check formatting issues so the website is not an option right now.

@xexyl
Owner

xexyl commented Oct 13, 2024

Commit a774970 fixes more bugs where it seems that both 0xFF and 0xFE are valid code points, despite what some say. So that check was removed also.

Still have the problem that 0xff00 is invalid but 0xFF01 is not, and neither is 0x1000, so it's not that any code point starting with FF is bad, and code points can end in 00 too. If I could find the rule it would be easier to check, but it seems wrong to just put in a check for that specific byte unless that's all there is to it. But I do not know if that's the case.

@xexyl
Owner

xexyl commented Oct 13, 2024

I noticed there also need to be checks for bad continuation bytes. I began this but I'm too tired to do more and I have to leave for a bit soon, so I won't continue this. It seems improbable I'll be able to work on the website too, but maybe I'll be more awake later on. When I get back I have to work on some other things first, but if I have time after that I'll see how it goes. Anyway, at least one issue is resolved.

@xexyl
Owner

xexyl commented Oct 13, 2024

I noticed there also need to be checks for bad continuation bytes. I began this but I'm too tired to do more and I have to leave for a bit soon, so I won't continue this. It seems improbable I'll be able to work on the website too, but maybe I'll be more awake later on. When I get back I have to work on some other things first, but if I have time after that I'll see how it goes. Anyway, at least one issue is resolved.

FYI: this will require, it appears, a change in how the checks are done. It might involve another function. It needs to act on the whole string, I'm afraid. There is a function I wrote that does act on the string, but it was meant to detect the UTF-8 length; it needs to be refactored (I think) and cleaned up, and then with the help of another function (or maybe more than one) it can do things right. It is necessary to have the length function do some checks and return a negative number on error, but it's getting too complicated to have in a single function. Also, some of the checks are in the encode function, and it might be better to have a new function that validates the string first. It would call the length function but then do all the checks. I think that would be cleaner.

But then the fact that some code points that are known to not be valid are not detected means that there is a problem. Of course, for all I know the continuation byte checks could solve this one, but I don't know.

Anyway I have to leave for now .. back later. Might or might not resume this today.

@xexyl
Owner

xexyl commented Oct 13, 2024

This confuses me:

  • The octet values C0, C1, F5 to FF never appear.

.. since there are code points that have FF in them, if not some of the others (and maybe there are). What am I missing?

I have to take care of other things now but can hopefully look at this later or tomorrow or the next day. I hope to look at the website too. If I have more time later today I will see if I can do that.
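
A point that may resolve this: the RFC 3629 rule is about octets in the encoded byte stream, not about the hex digits of a code point number. U+FF01 has FF in its code point, yet it encodes to the octets ef bc 81, so the forbidden 0xFF octet never appears. A small demonstration:

    #include <stdio.h>

    int
    main(void)
    {
        /*
         * U+FF01 (FULLWIDTH EXCLAMATION MARK) as UTF-8: the octets are
         * ef bc 81 -- the never-appearing octet 0xFF is absent even
         * though the code point's hex digits contain FF.
         */
        const unsigned char fullwidth_bang[] = { 0xEF, 0xBC, 0x81 };
        size_t i;

        for (i = 0; i < sizeof(fullwidth_bang); ++i) {
            printf("%02x ", fullwidth_bang[i]);
        }
        putchar('\n');          /* prints: ef bc 81 */
        return 0;
    }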

@xexyl
Owner

xexyl commented Oct 14, 2024

A clue is from the RFC on UTF-8. It says:


NOTE -- The authoritative definition of UTF-8 is in [UNICODE]. This
grammar is believed to describe the same thing Unicode describes, but
does not claim to be authoritative. Implementors are urged to rely
on the authoritative source, rather than on this ABNF.


So I can look at the Unicode spec instead.

However, I'll worry about this after the IOCCC website is in order.

UPDATE 0

Another one that does not work right, for example:

jstrdecode "\u1AB1"

So that has to be fixed too. Anyway that's all for another day.

@xexyl
Owner

xexyl commented Oct 16, 2024

Reopening this bug #13 as it seems that some decoding bugs / problems still exist. This should be considered lower priority until the IOCCC website is done, and maybe until after IOCCC28, but I want this open to remind me (though the changed code I have would also remind me, this might be easier).

I'll have to reply to some of those comments in time. Sadly this does not seem resolved, at least as I understand it. I hope that the one above, \u1ab1, is just a font issue, but I fear it's not. Not sure. Anyway, even if it is a font issue, there are some that are not correct that are being decoded (or not rejected).

@xexyl xexyl reopened this Oct 16, 2024
@xexyl xexyl added the Low priority Low priority at this time label Oct 16, 2024
@xexyl
Owner

xexyl commented Oct 20, 2024

Actually there might not be a problem with \u1ab1 after all. Trying it in HTML form in Safari, I get the same symbol I have seen for other incorrect code points (or symbols that cannot be printed?).

Perhaps this bug is really solved and it's a font (or something like it) issue? If that's the case then the only remaining thing is that it still passes through some (supposedly) invalid UTF-8.

Do you see the same thing with the code point? What about the html version: &#x1ab1;?

Supposedly it is a valid code point but maybe not every system can display it?

Do you think this bug might actually be solved? I'm inclined to think it might in fact be fixed. That would be good, though I see I might want to modify a check in json_parse.c.
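
One plausible explanation, offered as an assumption to check against the Unicode charts: U+1AB1 sits in the Combining Diacritical Marks Extended block, i.e., it is a combining mark meant to attach to a preceding base character, so many fonts show a placeholder when it is rendered alone. Printing it after a base letter may look different:

    #include <stdio.h>

    int
    main(void)
    {
        /*
         * U+1AB1 encodes to the UTF-8 octets e1 aa b1. Printed after a
         * base letter, a font that supports the block should stack the
         * combining mark over the 'a' rather than show a lone placeholder.
         */
        printf("a\xe1\xaa\xb1\n");
        return 0;
    }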

@lcn2
Contributor Author

lcn2 commented Oct 21, 2024

Supposedly it is a valid code point but maybe not every system can display it?

Very likely. Just because the Unicode committee declares a code point to be "valid" does NOT mean that everywhere will instantly have the symbol to render, let alone render it at the correct size or even as the correct symbol.

@lcn2
Contributor Author

lcn2 commented Oct 21, 2024

Do you think this bug might actually be solved? I'm inclined to think it might in fact be fixed. That would be good, though I see I might want to modify a check in json_parse.c.

It feels like it is fixed and this issue can be closed.

General comment

Producing the correct "encoded string and decoded bits" is the job of jstrdecode(1) and "friends". Doing something useful with the "decoded bits" is the job of the "things doing the displaying", and all of its "friends".

For most of the common Unicode, the "things doing the displaying" (should) do a reasonable job of something useful with the "decoded bits". The more esoteric Unicode "decoded bits" may be less certain to do something useful, and new Unicode "decoded bits" would have to depend on app / library / system updates connected to the "things doing the displaying".

We suggest you focus on producing the correct "encoded string" in jstrencode(1) (and friends) and the correct "decoded bits" in jstrdecode(1) (and friends), and NOT worry about what the "things doing the displaying" do, as that is well out of scope of this repo.

@xexyl
Owner

xexyl commented Oct 21, 2024

Supposedly it is a valid code point but maybe not every system can display it?

Very likely. Just because the Unicode committee declares a code point to be "valid" does NOT mean that everywhere will instantly have the symbol to render, let alone render it at the correct size or even as the correct symbol.

That's quite true. I'm glad you agree!

@xexyl
Owner

xexyl commented Oct 21, 2024

Do you think this bug might actually be solved? I'm inclined to think it might in fact be fixed. That would be good, though I see I might want to modify a check in json_parse.c.

It feels like it is fixed and this issue can be closed.

Thanks. On the other hand I think another bug exists in the decoding: some invalid UTF-8 is accepted. I'll maybe open a new bug for that.

General comment

Producing the correct "encoded string and decoded bits" is the job of jstrdecode(1) and "friends". Doing something useful with the "decoded bits" is the job of the "things doing the displaying", and all of its "friends".

Meaning the terminal emulator etc., right?

For most of the common Unicode, the "things doing the displaying" (should) do a reasonable job of something useful with the "decoded bits". The more esoteric Unicode "decoded bits" may be less certain to do something useful, and new Unicode "decoded bits" would have to depend on app / library / system updates connected to the "things doing the displaying".

That's true. And at the console it might not show anything at all (other than a character indicating it can't display anything). For instance, at the server that's exactly what happens. I don't know how to get it to display emojis but it's not that important either. It would be great if there were a way to determine if a system can display them or not, say for testing purposes, but if there is I don't know what it is, and in any case it probably is far out of scope anyway.

We suggest you focus on producing the correct "encoded string" in jstrencode(1) (and friends) and the correct "decoded bits" in jstrdecode(1) (and friends), and NOT worry about what the "things doing the displaying" do, as that is well out of scope of this repo.

That's true - thanks. Even so, invalid UTF-8 is apparently supposed to be rejected by correct JSON parsers. Unfortunately I'm not clear on all of these details: some of the wording even appears ambiguous.

@xexyl
Owner

xexyl commented Oct 21, 2024

Closing this issue but I have opened a new issue #24.

Please look at it when you can. I will be looking at the FAQ over at the IOCCC repo in a bit. I have some things I have to do first.

UPDATE 0

Obviously the IOCCC is more important than this. Thanks for your feedback!

@xexyl xexyl closed this as completed Oct 21, 2024
xexyl added a commit that referenced this issue Nov 14, 2024
Renamed utf8encode() to utf8_to_unicode() to be less confusing: although converting a code point is called encoding (by all sources we have seen) in JSON, according to encoders **AND** decoders out there, a code point in a string should be converted to unicode.

A number of bugs have been identified in jstrencode(1) during discussion in 'the other repo' (or one of the 'other repos'). This is in jstrencode(1) now (as of yesterday); prior to yesterday it was in jstrdecode(1) due to the unfortunate swap in names. This swap happened because, while focusing on issue #13 (the decoding - which turned out to be encoding - bug of \uxxxx), the fact that the jstrencode(1) tool is not strictly UTF-8 but rather JSON was lost sight of. The man page has had these bugs added, so it is important to remove them when the bugs are fixed. A new issue #28 has been opened for these problems.