Bug: Some Unicode symbols encoded as \uHexHexHexHex in JSON strings are not decoded properly #13
Comments
How and why we filed 3 bug reports

We found something that seemed odd regarding JSON string encoding and decoding. It relates to the bin tools from the "Other repo" that we are working on. While investigating, we discovered the issue with the man page (issue #11). We would have simply fixed the man page, but while investigating further we found the issue with the quotes (issue #12). The encode / decode puzzled us. Strings with a wide variety of non-ASCII characters (like "àßç") seemed to be valid within JSON strings without the need to "encode" them as "\u" escapes. However other JSON implementations seemed to do just that. It appears that among the multiple "bogons" of the JSON spec is that there is a "choice": one can use "é" or one can use "\u00e9" (because "é" is Unicode code point U+00E9). Both are valid ways to refer to the same thing. In fact, one can refer to the ASCII "a" as "a" or as "\u0061" (ASCII "a" is 0x61). So one cannot compare two JSON strings by a simple byte comparison. One MUST decode the JSON strings and then compare byte for byte (groan / sigh). So that is how we started down this path.

It is a good thing we did discover these issues now. They do block a fix/improvement to how the "other repo" will be able to render web pages that contain non-ASCII characters (such as in an author's name or an award title). Those pages happen to work because the JSON in question uses the literal symbols. Had someone produced a string using the equivalent "\u" escapes, things would have been broken (at least in terms of how the web pages would have looked).

Anyway, we wanted you to know why we were looking into JSON string encoding/decoding, and why we filed these issues. Now we will return to the bin tools (we are nearly done, but for these issues and the need for another tool) so that we can then return to the submit server, get the screen snapshots, and complete one of the TODOs needed for the Great Fork Merge. |
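To make the "two spellings, one string" point concrete, here is a minimal standalone sketch (illustrative only, not the jparse implementation) of the code point to UTF-8 step: decoding "\u00e9" yields the code point U+00E9, and encoding U+00E9 as UTF-8 yields exactly the bytes that a literal "é" already occupies in the document.

#include <stdio.h>
#include <string.h>

/* encode a Unicode code point as UTF-8; returns the number of bytes
 * written to buf (buf must have room for 4 bytes), or 0 if cp cannot
 * be encoded (surrogates and values beyond U+10FFFF) */
static size_t codepoint_to_utf8(unsigned long cp, unsigned char *buf)
{
    if (cp < 0x80) {
        buf[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;                                   /* UTF-16 surrogate */
    } else if (cp < 0x10000) {
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp <= 0x10FFFF) {
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

int main(void)
{
    unsigned char utf8[4];
    size_t len = codepoint_to_utf8(0x00E9, utf8);   /* from "\u00e9" */

    /* the same two bytes a literal "é" occupies in a UTF-8 JSON document */
    printf("same bytes as literal é: %s\n",
           (len == 2 && memcmp(utf8, "\xc3\xa9", 2) == 0) ? "yes" : "no");
    return 0;
}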
A proposal for 2 new options to jstrencode
How should the new options behave?

UPDATE 0
Do we need to encode at 2-byte UTF-8 or ??? Which is needed for JSON encoding?

UPDATE 1
According to RFC 4627 Section 3: "JSON text SHALL be encoded in Unicode. The default encoding is UTF-8."
Does that provide the vital clue? |
:-) BTW: did you ever notice that the word |
I am waking up more now. The AC might be helping there. Unfortunately I am away from the computer for a little bit. Your use of Eszett did make me check: it seems that it is missing from the UTF-8 map in the mkiocccentry repo. If I can determine the right mapping I will add it. Some of these bug reports can be addressed easily, though with the new options it will of course take more time.

UPDATE 0
Oh! You do have it .. as sharp S, but you don't have it as Eszett.

UPDATE 1
Eszett fixed in ioccc-src/mkiocccentry@f1cb9e6. |
Okay, now back to more serious details.

Caveat
I am very tired, so I might be missing something with the thought below. If so, I'll either see it later when I am more alert/awake/clear-headed, or maybe what I write will make you think of something. With that out of the way...

First things first (actually this is the second thing :-) or third if you count the above comment first)
It seems like it might be a good idea to figure out what has to map to what. And since there are so many characters, the question I have is whether we need a table like the existing one. I am not sure how this might go, though, due to the way the existing code works. But if we need to encode the Eszett (again, as per your example) then what should it be encoded to?
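For what it is worth, the Eszett maps cleanly in either direction: the JSON escape \u00df and the UTF-8 byte pair 0xC3 0x9F both denote U+00DF. If a table really is wanted, one hypothetical shape is sketched below; the struct name, field names and entries are illustrative and are not taken from the repo.

#include <stdio.h>
#include <string.h>

/* hypothetical mapping between a raw UTF-8 symbol and its JSON \u escape;
 * the struct name and the entries are illustrative only */
struct utf8_escape_map {
    const char *symbol;   /* raw UTF-8 bytes as they appear in a string */
    const char *escape;   /* the equivalent JSON \uHHHH escape          */
};

static const struct utf8_escape_map escape_map[] = {
    { "\xc3\x9f", "\\u00df" },   /* U+00DF ß LATIN SMALL LETTER SHARP S      */
    { "\xc3\xa9", "\\u00e9" },   /* U+00E9 é LATIN SMALL LETTER E WITH ACUTE */
    { "\xc5\x93", "\\u0153" },   /* U+0153 œ LATIN SMALL LIGATURE OE         */
};

int main(void)
{
    /* look up the Eszett and print its escape form */
    for (size_t i = 0; i < sizeof(escape_map) / sizeof(escape_map[0]); ++i) {
        if (strcmp(escape_map[i].symbol, "\xc3\x9f") == 0) {
            printf("Eszett encodes as %s\n", escape_map[i].escape);
        }
    }
    return 0;
}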
At a QUICK look at json_parse.c it seems like the relevant call might be:

str = json_conv_string(string, len, true);

and the boolean indicates:

* quote	true ==> ignore JSON double quotes: both ptr[0] & ptr[len-1]
*		must be '"'

Of course it's entirely possible that at this quick glance I am missing something. Again, this is assuming that I'm not missing something obvious at my quick glance.

UPDATE 0
I was thinking of the reverse step, so the above is not going to work even if it were as simple as swapping it (with some modifications) ... which it almost certainly would not be. Even so, it's interesting that that function does seem to work okay while the decode one does not. |
Ah .. but of course it's the reverse. The parser does the opposite. So it definitely is not so simple. I wonder if something similar can be done though? I'm not sure. I'm hoping I can read all three bug reports more carefully soon. But I also wonder about the other repo's |
QUESTION - priority wrt the IOCCC
As I have other things I have to do, including over at mkiocccentry but also things unrelated to programming, I just wanted to ask what the priority of this issue is versus the IOCCC. As a programmer, and I'm sure you are very familiar with this, this is an issue that I do not like having, but I also know there is such a thing as prioritising, so I am asking this here. I hope to look at the |
Another funny thing I noticed: reading from stdin seems wrong. Look:

$ jstrdecode
foo bar
Warning: json_decode: found non-\-escaped char: 0x0a
Warning: jstrdecode_stream: error while encoding stdin buffer

... that is what happens when I hit EOF. The same thing happens when piping:

$ echo foo bar | jstrdecode
Warning: json_decode: found non-\-escaped char: 0x0a
Warning: jstrdecode_stream: error while encoding stdin buffer

What am I missing? I don't recall this behaviour at all. In fact in the man page there's an example where this works. |
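One way to see where the 0x0a in that warning comes from is to hex-dump the input: a terminal session (and echo without -n) ends the input with a newline, and that byte reaches the decoder. This is a small standalone diagnostic sketch, not part of the repo:

#include <stdio.h>

/* read all of stdin and print each byte in hex; running
 * "echo foo bar | ./a.out" shows a trailing 0x0a, which is the byte
 * the decoder is warning about */
int main(void)
{
    int c;

    while ((c = getchar()) != EOF) {
        printf("0x%02x ", (unsigned char)c);
    }
    putchar('\n');
    return 0;
}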
My thinking is that I should resolve the jstrdecode problems brought up (new option included) and commit and then we can worry about this one. But I must leave for now possibly for the day. |
Actually I wonder if this should be looked at now .. might require a change in how the above is done. But maybe not. I don't know how to even begin to address it however. |
I'm afraid that that bug has to be fixed first in order to properly test this. Trouble is at this time I'm not sure where to begin with it other than checking that warning message (of course). But what that'll show is another matter. |
The problematic code seems to be in json_parse.c:

/*
* disallow characters that should have been escaped
*/
switch (c) {
case '\b': /*fallthrough*/
case '\t': /*fallthrough*/
case '\n': /*fallthrough*/
case '\f': /*fallthrough*/
case '\r':
/* error - clear allocated length */
if (retlen != NULL) {
*retlen = 0;
}
warn(__func__, "found non-\\-escaped char: 0x%02x", (uint8_t)c);
return NULL;
	    break;

But this is coming from sending EOF. |
I am guessing about where it is coming from. By all means please advise on what you think there. Of course, since there is very possibly a bug in the decode function anyway ... |
This diff:

- buf = json_decode(input, inputlen, &bufsiz, NULL);
+ buf = json_decode(input, inputlen-1, &bufsiz, NULL);

solves the problem with the EOF being seen in the decode function, at the end. But whether it will cause more problems or not, I do not know. What do you think? |
Returning tomorrow, hopefully with better sleep. Unfortunately (not at all surprising) the excessive heat alert was changed from ending at 2000 hours tonight to 2000 hours tomorrow, so that's not fun, but hopefully it's not much longer than that. Even once the alert is no longer in place it might still be scorching, but any lower is better than not lower, or even higher. |
Good question.
I guess the table would help there? Not looking at it right now and just replying with this because I had started it earlier and I wanted to get it sent before I shut down.
That would be quite unfortunate. What do you think we should do there? |
Issue impacts
Issue #13 could be said to impact progress on the Great Fork Merge, as it potentially impacts the non-ASCII characters on the web site in 4 places. There are "hack-a-rounds" where one might replace the non-ASCII characters with an ASCII quasi-equivalent, but at a significant loss of looking nice. Issue #13 would potentially impact new authors with non-ASCII characters in their names where there isn't a good ASCII quasi-equivalent. One could say that it impacts the I in IOCCC. Solving issue #13 might require a fix to some jparse JSON string decoding related code, not just to the tools.

Current non-ASCII character impact
Here are the 4 places where non-ASCII characters are found in JSON. There are "hack-a-rounds" where one might replace the non-ASCII characters with an ASCII quasi-equivalent, but at a significant loss of looking nice.
OK, there is a workaround to change the fancy quote character to an ASCII single quote.
One could change "lámatyávë" into ASCII "lamatyave", but that would be a shame to do so.
One could change "Anton Älgmyr" into ASCII "Anton Algmyr", but that would be a shame to do so.
One could change "Ondřej Adamovský" into ASCII "Ondrej Adamovsky"", but that would be a shame to do so. Of course, the above 4 places then cause these web pages to be impacted as a result of the above 4 issues:
The 4 "hack-a-rounds" noted above could address the above web pages, but that would be a shame to do so. PriorityIf there is a fix for this decoding problem, then it should be applied before the Great Fork Merge. If that is not possible, then we could force the 4 "hack-a-rounds" noted above for now, and then potentially have to fix it before we ask for a public trial of this repo. We might have to put up a "KNOWN BUG" notice for this repo, however. Our suggestion is that you look into the issue enough to determine the type of solution needed and to access the time/complexity of the fix. Depending on that you find, we might opt to fix things before the Great Fork Merge, or before the public trial of this repo, or before IOCCC28, or ... p.s.We could have made a fix to issue #11 and issue #12, but we didn't want to complicate your looking into issue #13 so we stayed out of a 3 rather than potentially impacting your efforts on the more important issue #13. |
The |
A potential change to the API is a reason to fix it sooner than later. Hopefully it is just a fix to the |
In JSON encoded strings, a "naked" newline character (0x0a) is NOT allowed. Such characters must be encoded as the 2 ASCII characters "\n".

FYI, this works:

echo -n foo bar | jstrdecode

This also works:

printf "%s %s\\\n" foo bar | jstrdecode

As does this using GNU sed:

echo foo bar | gsed -z 's/\n/\\n/g' | jstrdecode

To see what is happening, try with -Q:

$ echo foo bar | gsed -z 's/\n/\\n/g' | jstrdecode -Q
"foo bar
" UPDATE 0No to GH-issuecomment-2339232202 and No to GH-issuecomment-2339246212. Our guess is that it is NOT related to some EOF or the input being one character too long, but rather than a naked newline is NOT allowed within a JSON encoded string. One MUST encode with these 2 ASCII characters: \n Or are we missing something? |
We do not know. You will have to try and research the problem.

SUGGESTION
The problem seems to be on the "JSON string decoding" side. Perhaps the code that converts the 6 ASCII characters of a "\uHHHH" sequence needs a closer look. |
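As a point of reference, the 6 ASCII characters in question are the backslash, the letter u, and 4 hex digits. A minimal standalone sketch (illustrative only, not the repo's decoder) of turning those 6 characters into a Unicode code point could be:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* parse a 6-character JSON escape such as "\u00df" into a code point;
 * returns -1 if the escape is malformed */
static long parse_u_escape(const char *s)
{
    char hex[5];

    if (strlen(s) < 6 || s[0] != '\\' || s[1] != 'u') {
        return -1;
    }
    for (int i = 0; i < 4; ++i) {
        if (!isxdigit((unsigned char)s[2 + i])) {
            return -1;              /* non-hex digit in the escape */
        }
        hex[i] = s[2 + i];
    }
    hex[4] = '\0';
    return strtol(hex, NULL, 16);
}

int main(void)
{
    printf("U+%04lX\n", (unsigned long)parse_u_escape("\\u00df"));  /* prints U+00DF */
    return 0;
}

The resulting code point then still has to be turned into UTF-8 bytes before it can be written out or compared.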
We believe we have addressed all of the current questions that still need answering at this time. If we've missed something or something else needs to be clarified, please ask again. |
Ahhh right. The newline! Of course. With the hex value that should be obvious. But the problem there is: how does one enter EOF when they do not specify an arg? It will end up with a newline even if they do something like...

jstrdecode
foo^D

... which will show something like:

$ ./jstrdecode -v 5
debug[1]: enclose in quotes: false
debug[1]: newline output: true
debug[1]: silence warnings: false
debug[1]: about to encode all data on input stream
foo^D

Or is this a bug in macOS Terminal? |
Are you referring to |
Something tells me it might be a bug in macOS Terminal ... which would be unfortunate. But I think that because I know I used to be able to do this, and did so for the man page, and all was good. So the question is how to get past this one. As for how to go about the quotes on stdin versus args when some arg can be '-', the answer might be using a dynamic array of strings that are added and then, at the end, everything is printed. I have a little bit of time but I am not sure if I can complete that task before I have to go. Later today hopefully I'll have more time and energy. |
I think I solved this bug .. except that I cannot test typing EOF at the command line. |
Before I commit (or until I do more testing, if you don't get back to me) I want to make sure I have used the dyn_array facility correctly, as this is the first time I have done so. It appears to be okay, but since this is my first time using the facility I want to run it by you to make sure that I didn't miss something. The test example code suggests I have it right but it's entirely possible I missed something (I guess).

To create a dynamic array of string pointers I do:

array = dyn_array_create(sizeof(char *), JSON_CHUNK, JSON_CHUNK, true);

QUESTION
Should I use something different there?

Next, to append each value I should do:

if (dyn_array_append_value(array, &buf)) {
dbg(DBG_LOW, "moved data after appending buf: %s", buf);
}

and to get the length of the array I should do:

array_len = dyn_array_tell(array);

and then to write each element I can do something like:

for (idx = 0; idx < array_len; ++idx) {
/*
* write starting escaped quote if requested
*/
if (esc_quotes) {
fprint(stdout, "%s", "\\\"");
}
buf = dyn_array_value(array, char *, idx);
bufsiz = (uintmax_t)strlen(buf);
errno = 0; /* pre-clear errno for warnp() */
outputlen = fwrite(buf, 1, bufsiz, stdout);
if (outputlen != bufsiz) {
warnp(__func__, "error: write of %ju bytes of arg: %jd returned: %ju",
(uintmax_t)bufsiz, idx, (uintmax_t)outputlen);
success = false;
}
/*
* write ending escaped quote if requested
*/
if (esc_quotes) {
fprint(stdout, "%s", "\\\"");
}
}

and finally, after everything, to free the array:

/*
* free dynamic array
*/
if (array != NULL) {
dyn_array_free(array);
array = NULL;
}
Have I missed anything? If not, then this bug at least is fixed, though I know there are other issues. I know that the append function requires a pointer to a pointer, so I had to pass &buf.

UPDATE 0
Okay, there is one mistake above I can see: the optind reference. That obviously has to be removed or done differently.

UPDATE 1
Fixed in above code. As long as arg 0 can mean that no arg was specified (read from stdin only), then that should be fine. |
Indeed .. doing soon. |
Okay closing this as resolved. If this turns out to not be true please feel free to open again or if it seems like another issue then by all means open a new issue. I do want to see about cleaning up json_utf8.[ch] but not today. Have to go do other things now. Good luck in being fully recovered by tomorrow! (And if asymptomatic now then hopefully not contagious too.) |
Congratulations on resolving a tricky and difficult issue, @xexyl |
Thank you very much! And likewise to you. Let's hope it is indeed all that's needed. Best wishes on a good night's sleep for you! |
Leaving for the day .. if any problems are found please let me know (or if you have a fix, make a pull request). But of course also make sure to get some rest. I am happy this is resolved though, or so it seems. Have a great day! I look forward to seeing those scripts updated in the other repo. |
Remove has_nul in struct json_string as UTF-8 should, it is my understanding, not have a NUL byte. Fix path in jsemcgen.sh: the repo we know uses this might have a problem here that has to be fixed, but the path was broken for this repo so it had to be done, and as the script has an option to change the tool path, it should not be a problem as this is just the default path. Update all tools and the release to be the same version after issue #13 was resolved: "1.2.0 2024-10-09". "1.2.0" was chosen because it was the first version greater than some of the existing versions, and the others could be bumped up to it without any harm.
Well as you know .. I decided to have some fun with adding it anyway. :-) I had to decide which emojis to try. I almost considered the ridiculous one I found yesterday: they actually have an emoji for a pregnant MAN! I was also considering the other sex. But this seemed going too far as not everyone appreciates such dark humour (as it is some might not appreciate the one I did do but oh well). I had the thought of something with a Ring. Ah .. I should have done also a ring and a volcano! Oh well: maybe next time. Or not. Probably not worth it. Though I don't know .. it seems pretty tempting. I wonder. Well maybe if I have time before it's frozen. I have another comment to look at from you, I know, and I'm going for that now. |
Thanks for the fix.

UPDATE 0
See commit da0c000f79c6a77ce0e6edf8e971ff2bcdc2a893 in the other repo. |
With commit 0deffa6 I actually did this. I'm rotten, I know, but it's only done if a certain option and level is used. :-) Lower levels have other fun things, though I probably should make it so it checks the consistency too. I'll do that sometime soon.

UPDATE 0
Okay, also see commit d2586d9 :-) |
Very nice!
Pulled it some hours ago - glad it works out well! It took some work but I'm glad I focused on it even though we had a workaround. |
Unfortunately I am not sure that this bug is completely resolved :( My understanding is that invalid UTF-8 bytes are not allowed and should be rejected. Well...

jstrdecode '\uff00'
# try this at the command line and you get something else though, an indication that it's wrong

FF00 is NOT a valid UTF-8 code point! But on the other hand, the code used suggests that code points that start with certain values are let through. So I'm not sure what to do here. |
Okay, one bug was fixed just now with commit 1a33062. It is recommended, it seems, that if a code point is in the so-called non-character range, the code point should be set to the replacement character. Unfortunately this broke make test so I have to fix that. I was going to go afk to exercise a bit but might avoid that as I have to go out today. Not sure yet.

UPDATE 0
Upon further inspection it seems a properly conforming program can either reject it or change it to a designated replacement character. As it seems better to reject it outright, and since that would allow us to not diverge even further from the JSON test suite, I have rolled back that part of the commit. I am unfortunately not having an easy time seeing well enough to easily check formatting issues, so the website is not an option right now. |
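For reference, the replacement character in question is U+FFFD, which is the 3 UTF-8 bytes 0xEF 0xBF 0xBD. Below is a minimal sketch of one common notion of "cannot be encoded" (surrogate code points and values beyond the Unicode range); it is illustrative only and not the repo's checks. A conforming decoder may either reject such a code point or substitute U+FFFD:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* surrogate code points and values beyond U+10FFFF have no UTF-8 form;
 * a decoder may reject them or emit U+FFFD (0xEF 0xBF 0xBD) instead */
static bool codepoint_encodable(unsigned long cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return false;              /* UTF-16 surrogates, not characters */
    }
    if (cp > 0x10FFFF) {
        return false;              /* beyond the Unicode code space     */
    }
    return true;
}

int main(void)
{
    unsigned long tests[] = { 0x00DF, 0xD800, 0x110000 };

    for (size_t i = 0; i < sizeof(tests) / sizeof(tests[0]); ++i) {
        printf("U+%04lX: %s\n", tests[i],
               codepoint_encodable(tests[i]) ? "encodable" : "reject or U+FFFD");
    }
    return 0;
}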
Commit a774970 fixes more bugs where it seems that both cases were mishandled. Still have the problem noted above, though. |
I noticed there also needs to be checks for bad continuation bytes. I began this but I'm too tired to do more, and I have to leave for a bit soon, so I won't continue this. It seems improbable I'll be able to work on the website too, but maybe I'll be more awake later on. When I get back I have to work on some other things first, but if I have time after that I'll see how it goes. Anyway, at least one issue is resolved. |
FYI: this will require, it appears, a change in how the checks are done. It might involve another function. It needs to act on the whole string, I'm afraid. There is a function that does that which I wrote, but it was meant to detect the UTF-8 length; it needs to be refactored (I think) and cleaned up, and then with the help of another function (or maybe more than one) it can do things right. It is necessary to have the length function do some checks and return a negative number on error, but it's getting too complicated to keep in a single function. Also, some of the checks are in the encode function, and it might be better to have a new function that validates the string first. It would call the length function but then do all the checks. I think that would be cleaner. But then the fact that some code points that are known to not be valid are not detected means that there is a problem. Of course the continuation byte problem could, for all I know, solve this one, but I don't know. Anyway I have to leave for now .. back later. Might or might not resume this today. |
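For what it is worth, the continuation-byte rule being described is mechanical: the lead byte says how many bytes follow, and each following byte must be of the form 10xxxxxx (0x80..0xBF). A standalone sketch of such a check (illustrative only; it does not reject overlong forms or surrogates, and it is not the repo's length/validation code) could be:

#include <stddef.h>
#include <stdio.h>

/* return the number of bytes in the UTF-8 sequence starting at s[0],
 * or 0 if the lead byte or any continuation byte is malformed */
static size_t utf8_sequence_len(const unsigned char *s, size_t remaining)
{
    size_t len;

    if (remaining == 0) {
        return 0;
    }
    if (s[0] < 0x80) {
        return 1;                          /* plain ASCII                  */
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2;                           /* 110xxxxx                     */
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3;                           /* 1110xxxx                     */
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4;                           /* 11110xxx                     */
    } else {
        return 0;                          /* stray continuation or invalid lead */
    }
    if (len > remaining) {
        return 0;                          /* truncated sequence           */
    }
    for (size_t i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80) {
            return 0;                      /* bad continuation byte        */
        }
    }
    return len;
}

int main(void)
{
    const unsigned char ok[]  = { 0xC3, 0x9F };   /* ß, a valid 2-byte sequence  */
    const unsigned char bad[] = { 0xC3, 0x41 };   /* 0x41 is not a continuation  */

    printf("ok:  %zu\n", utf8_sequence_len(ok, sizeof(ok)));    /* prints 2 */
    printf("bad: %zu\n", utf8_sequence_len(bad, sizeof(bad)));  /* prints 0 */
    return 0;
}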
This confuses me:
.. since there are code points that have this property. I have to take care of other things now but can hopefully look at this later, or tomorrow, or the next day. I hope to look at the website too. If I have more time later today I will see if I can do that. |
A clue is from the RFC on UTF-8. It says:

NOTE -- The authoritative definition of UTF-8 is in [UNICODE].

So I can look at the Unicode spec instead. However, I'll worry about this after the IOCCC website is in order.

UPDATE 0
Another one that does not work right, for example:

jstrdecode "\u1AB1"

So that has to be fixed too. Anyway, that's all for another day. |
Reopening this bug #13 as it seems that some decoding bugs / problems still exist. This should be considered lower priority until the IOCCC website is done, and maybe until after IOCCC28, but I want this open to remind me (though the changed code I have would also remind me, this might be easier). I'll have to reply to some of those comments in time. Sadly this does not seem resolved, at least as I understand it. I hope that the one up there, |
Actually, there might not be a problem with that output after all. Perhaps this bug is really solved and it's a font (or something like it) issue? If that's the case then the only remaining thing is that it still passes through some (supposedly) invalid UTF-8. Do you see the same thing with that code point? What about the HTML version of it? Supposedly it is a valid code point but maybe not every system can display it? Do you think this bug might actually be solved? I'm inclined to think it might in fact be fixed. That would be good, though I see I might want to modify a check in the code. |
Very likely. Just because the Unicode committee declares a code point to be "valid" does NOT mean that everything will instantly have the symbol to render, let alone render it at the correct size, or even render the correct symbol. |
It feels like it is fixed and this issue can be closed.

General comment
Producing the correct "encoded string and decoded bits" is the job of these tools. For most of the common Unicode, the "things doing the displaying" (should) do a reasonable job of something useful with the "decoded bits". The more esoteric Unicode "decoded bits" may be less certain to do something useful, and new Unicode "decoded bits" would have to depend on app / library / system updates connected to the "things doing the displaying". We suggest you focus on producing the correct "encoded string" in these tools. |
That's quite true. I'm glad you agree! |
Thanks. On the other hand I think another bug exists in the decoding: some invalid UTF-8 is accepted. I'll maybe open a new bug for that.
Meaning the terminal emulator etc., right?
That's true. And at the console it might not show anything at all (other than a character indicating it can't display anything). For instance at the server that's what happens exactly. I don't know how to get it to display emojis but it's not that important either. It would be great if there was a way to determine if a system can display them or not, say for testing purposes, but if there is I don't know what it is and in any case it probably is far out of scope anyway.
That's true - thanks. Even so, apparently invalid UTF-8 is supposed to be rejected by correct JSON parsers. Unfortunately I'm not clear on all of these things: some of the details seem to have ambiguous wording. |
Closing this issue but I have opened a new issue #24. Please look at it when you can. I will be looking at the FAQ over at the IOCCC repo in a bit. I have some things I have to do first.

UPDATE 0
Obviously the IOCCC is more important than this. Thanks for your feedback! |
Renamed utf8encode() to utf8_to_unicode() to be less confusing: although converting a code point to Unicode is called encoding (by all sources we have seen), in JSON, according to encoders **AND** decoders out there, a code point in a string should be converted to Unicode. A number of bugs have been identified in jstrencode(1) during discussion in 'the other repo' (or one of the 'other repos'). The functionality is in jstrencode(1) now (as of yesterday); prior to yesterday it was in jstrdecode(1) due to the unfortunate swap in names. That swap happened because, when focusing on issue #13 (the decoding - which turned out to be encoding - bug of \uxxxx), sight was lost of the fact that the jstrencode(1) tool is not strictly UTF-8 but rather JSON. The man page has had these bugs documented, so it is important to remove those notes when the bugs are fixed. A new issue #28 has been opened for these problems.
Is there an existing issue for this?
Describe the bug
Certain strings seem to not be decoded properly by the jstrdecode(1) tool.

SUGGESTION: See GH-issuecomment-2350854096 for some updated commentary on this issue.
Some background
Consider this JSON document containing just a JSON string:
When you enter that JSON document into https://jsonlint.com/ it shows that the JSON is valid.

Moreover, the jparse(1) tool also shows that the JSON document is valid:

It is GOOD that the JSON parser is happy with this JSON document.
FYI: Here is what hexdump(1) shows about the string (we include the enclosing double quotes and trailing newline for consistency with the JSON file):

Here is how jstrencode(1) processes the string:

This is CORRECT as double quotes are back-slashed (i.e., \"): within a JSON string, all double quotes MUST be backslashed.
Now if we use -Q, the enclosing double quotes are ignored:

If we consult https://codebeautify.org/json-encode-online and we put "œßåé" on the input side, we see that the output side shows the same string.

Evidence of a decoding bug
Now, some JSON tools, such as jsp (see https://github.com/kjozsa/jsp), encode "œßåé" as:

"\u0153\u00df\u00e5\u00e9"

When we put the "\u0153\u00df\u00e5\u00e9" string into the input side of https://codebeautify.org/json-encode-online, the output side shows "œßåé".

However, if we give the "\u0153\u00df\u00e5\u00e9" string to jstrdecode(1), we get something odd:

$ jstrdecode "\u0153\u00df\u00e5\u00e9"
S���
Using hexdump(1) we see the expected decoded output:

However, this is what jstrdecode(1) produces:

This suggests that the JSON string decoding may be incorrect.
What you expect
We expect that the output of
jstrdecode "\u0153\u00df\u00e5\u00e9"
to be the same as:
Environment
bug_report.sh output
n/a
Anything else?
Consider the result of using -v 3:

The debug messages look OK, but the output is not:

which is not the same as the expected decoded output:
So it appears that strings with various \u<hex><hex><hex><hex> sequences might not be decoded correctly.

UPDATE 0C
SUGGESTION: See GH-issuecomment-2350854096 for some updated commentary on this issue.
UPDATE 0D - todo list
It appears that the decoding issue itself is resolved. However a few things have to be done before this can be marked fixed.
Final todo:
I believe that with those done, as long as things look good, we can mark this as complete.
Side todo: