-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug] bin tools are using jstrencode(1) when they should be decoding JSON encoded strings #2752
Comments
We remain concerned with the issue of JSON string encoding and decoding. We suspect that the details of GH-issuecomment-2471493104 were either unclear or glossed over. We don't have access to the shell now, but when we return we plan to do more testing. Nevertheless what we saw in the code concerned us enough to file this bug and halt the Great Fork Merge process. |
Okay. Without having the time to read this I am admittedly confused because taking UTF-8 to Unicode is encoding and it seemed to work. But I will have to look at this tomorrow. Sorry. |
I just read it and it doesn't make sense. It sounds like a terminology issue but taking a UTF-8 code point and converting it into a Unicode symbol is encoding not decoding. Please explain what you are getting at. I will look at this tomorrow. |
UTF-8 is an encoding of Unicode stuff. JSON string encoding of "real" strings, decoding of JSON encoded strings into "real" strings is another matter. When "\uD83D\uDD25" one expects to get the "real" string of:
P.S.The man page for
The term "JSON decoded strings" is somewhat meaningless. JSON doesn't decode strings. The so-called JSON specification requires all strings to be encoded, and thus produce JSON encoded strings. See GH-ssuecomment-2471493104. The man page for
|
All sources say the opposite of what you're saying though. That's why I swapped the terms. The man page can be updated of course as can documentation. |
JSON requires strings to be encoded. This JSON string encoding has nil to do with how UTF-8 encodes Unicode stuff. Quoting from GH-issuecomment-2471493104: JSON string encoding, at a minimum, requires the string to be surrounded by double quotes. At a minimum, encoding will result in the prepending and appending a double quote character. JSON string encoding ALSO requires one to convert things like ASCII newlines into "\n". There are other important back-slashing requirements such as dealing with double quotes are within the "real" string, backslashes, tabs, etc. (that need to be handed during the JSON string encoding process) Quoting from GH-issuecomment-2471493104 again: Encoding this "real" string:
into this JSON encoded string: "This \"string\" has a newline\nin the middle and at the end\n" The above is JSON string encoding. Now the question of what to do with non-ASCII stuff IS a matter for Unicode / UTF-8 encoding and decoding. But that is NOT JSON string encoding, nor is it decoding of JSON encoded strings. The issue 13 in the other repo raised a concern about JSON encoded strings that contained \uHexHexHexHex-stuff was being handed as a side effect of decoding JSON encoded strings. Tools such as $ echo '"œßåé"' | jsp --no-color --indent 4
"\u0153\u00df\u00e5\u00e9" NOTE: The pipe input above is a JSON encoded string. The output produced by the BTW: we think the The "\u0153\u00df\u00e5\u00e9" and produce the original "real" string: Instead `jstrencode(1) does this! We don't mind if, in the process of JSON string encoding (what So by all means, let And by all means, let We just need to ALSO be sure that when Both "\uD83D\uDD25" and "🔥" are valid JSON encoded strings. They should convert into the same real string 🔥 as well. We hope this helps. |
If there is a comment bug in the source comments, then those should be fixed. We looked at the comment for the Are we missing something? There certainly could have been some copy and paste errors when code for one side of encoding was converted into decoding, for example. UPDATE 0We recommend to prioritize on GH-issuecomment-2472008592 and the core of this issue first, before worrying about source code comments and man pages. |
It could be a comment bug yes: when I tried to (later) correct the comments by swapping from one function to the other, it might have been a mistake. However ... I see something really odd.
Okay so then it maybe isn't a comment bug?
It could be. But if the encoding comment looks correct and I swapped it then it seems like it might be right and this is not an issue? But even so see below.
Of course. I'll post a new comment with something funny though. |
Here's something funny. All sources say that converting const json_to_encode = {
name: "\u0f0f"
};
const json_to_decode = '{"name": "\u0f0f"}';
const json_decoded = JSON.parse(json_to_decode);
const json_encoded = JSON.stringify(json_to_encode);
document.writeln(json_decoded.name);
document.write(json_encoded); shows in the html file:
which suggests that BOTH encoding and decoding convert the Now given that the function we have is What do you think? |
Meanwhile I had an interesting idea based on this, as I was waking up (a common thing of programmers as you surely know :-) ). What if the jstrencode/jstrdecode tools had an option to parse the string as JSON? It would be something like (I think - I would have to look at it more and only just did) something like... For encoding:
and for decoding:
Now what would the purpose of this be? Perhaps sanity checks or maybe for some experiment or else because part of the parsing is the encoding (which might suggest that only the encoding one should have the option but that's why the post/pre actions). |
Yes. But the point is that when UTF-8
Aha. Now I wonder about this. Perhaps the problem is that some of the options need to be moved from one tool to the other? That seems likely. Though in that case will the output of the UTF-8 code points for the website then show the right string? If it adds
Okay so it seems likely that some of the options/functionality has to be moved over too, and the names alone cannot be swapped? For instance: $ cat nl
$ jstrdecode < nl
\n\n
$ jstrencode < nl
Warning: json_encode: found non-\-escaped char: 0x0a
Warning: jstrencode_stream: error while encoding stdin buffer
Warning: main: error while encoding processing stdin
? Unless there is also some confusion with the terms?
Okay and I see... $ jstrdecode < foo.txt
This \"string\" has a newline\nin the middle and at the end\n But the thing is how do we determine when to encode and when to decode, then? I can see what you mean here: JSON with a
Hmm ... okay so that might be something to consider too. The question we have I think then: what do we do here? Perhaps the tool names should be reverted again but then we have to decide how to proceed with the code points?
Yes.
Right. It is encoded. Which suggests that the decoded is the
When it converts it to
That's because all the sources say that that IS encoding, not decoding. Except for the example I gave above which suggests both do it. And this is why the tools here use the jstrencode tool, not the jstrdecode tool. So minus the fact that the javascript example suggests that both encoding and decoding should print out the fire based on the emoji I gave (above), it should be good, unless you want to quibble about terminology. But since below you talk about how both should do it then it seems like that matters less if at all. Now as you say though, the json encoding/decoding is not the same thing as unicode. On the other hand you did raise the problem of it not doing it at all.
I wonder if there is a way to do it for both like the javascript above shows.
It does this already indeed.
Well jstrdecode will take the fire emoji and output the fire emoji: $ cat fire.json | jstrdecode -n
🔥\n but it appears that the Still it should I believe also output the emoji for both encoding and decoding.
Is this the real problem then? The tool names should be swapped back and both should print out the fire emoji from the fire code point? This way the json encoded strings will be correctly encoded but still print out the encoded form? I think that sounds reasonable but how to go about it I'm not sure yet.
That's true. And that's what happens with |
I guess the following. Please correct me if I'm wrong.
I have a thought on what might allow this to happen but I am not sure. I have to look at the code too. |
.. first step is to swap the names again. That'll be fun. |
Okay the filenames and terms are swapped. The next step would be to make sure that both encode and decode convert code points to unicode symbols. I think I have an idea how to do this. But I will have to go afk very soon. |
Made a commit ... not pushing yet. Have to go afk. Once I'm back I'll work on the unicode problem. Then we can figure out which tool belongs in the website. Hoping I can manage this today. |
Ugh. The real problem is that the json_encode() function (after name change) uses the table ... not the parsing manually. I don't know how to fix that yet. I'll ponder it as I'm afk (or part of the time). |
Well I have an idea but unfortunately it might have to be done first .. but that means the table access will be wrong. This is because the '` for example in |
Just pushed the changes noted above. In a bit I will look at seeing if I can figure out the encoding/decoding of code points. |
Hmm .. my initial idea will not work. This is turning into a nightmare. |
Have another thought. Looking into it. |
That does not work either because the table converts |
I think i got it! Have to do more testing ... |
The one problem is that doing... $ jstrencode '\' fails when it should print |
I discovered another bug too ... check this: $ jstrencode '\a'
'\\a Should not have the first character there, but just UPDATE 0Or perhaps that is a display issue? |
Recommend PrioritiesWe recommend a priority on moving to a state where we can do a "soft code freeze 🥶" for the mkiocccentry repo while making sure that In particular, fixing stuff (code, code comments,man pages, doc) that flipped around for We are aware that there are questions and issues relating to blackslashes and perhaps Unicode/UTF-8 matters, and perhaps other jparse matters. While those are interesting and important, they may not necessarily be a problem for the existing web site and thus are not a top priority. Once mkiocccentry repo is stable and in a position to have a "soft code freeze 🥶", we can consider those "questions and issues" because by then, And by consider those "questions and issues", assess the size of the task to investigate 🔬, determine the effort to answer open questions, assess the potential impact on the web site, and if there is enough time, install a code change into the mkiocccentry repo before the "hard code freeze 🧊" and final release prior to IOCCC28. We hope this helps. UPDATE 0aQuestionIs the mkiocccentry repo "in sync" with the jparse repo? We ask this because we thought we say some slight differences in |
I fixed the man page mix up .. thanks! Not in mkiocccentry just yet. I have to sync it over. It happened due to rushing to get the commit in before I left yesterday for an appointment. As for jstrdecode and quotes: it's not that simple but I'll find the thread you wrote it in and reply to that later on .. very tired. Woke up at stupid o'clock today :( |
Ah, it's this thread. Okay I'll try replying now but might take a bit of time. |
Thanks for pointing this out! I forgot to do the name swap when I was rushing yesterday. I did that a few minutes ago. Syncing to mkiocccentry now too. |
And no: jparse in mkiocccentry is not up to date because I made some slight changes yesterday. Nothing vital but something that I felt should be done. I'll sync it with the man pages as well. Still the jstrdecode issue with quotes will take more time to consider and it might need an option to say it's a quote. I'll explain in a bit. |
Ugh .. I just found yet another problem from the rush! I'll fix that too. |
Doing a sync ... will then reply to other comments if I can. Otherwise later on today I'm sure. |
Unfortunately it's not that simple. Because this is also valid json: true as is -5 and so on. If we expected quotes then it would be in error. Instead we need an option to jstrdecode that says the arg is a string and then if that arg is used we expect a
There are but not in the same way you refer to.
Okay but in this case it should actually be that using it means TO enclose the output in double quotes. Why? Because otherwise we'll get something like: $ jstrencode -- -5
"-5" which is wrong. I suggest the option
In this case it has to be the opposite for reasons I cited; it should be that it IS a string. Otherwise the new option you suggest would have to be used most the time which is annoying to the user. Here too I suggest
There were options that deal with quotes but not like this. As I see it the following options should be added to jstrencode: -s assume args are strings, enclosing output in quotes and jstrdecode should have: -s assume arg is a string, expecting a leading and trailing " I hope to work on these today but as I said I am extremely tired .. woke up too early :( I hope I get more energy sometime soon.
That was fixed in jparse and mkiocccentry (along with other problems discovered). Thanks! |
Well the good thing is that after you merge the commit I made today this part should be fine. Also a mistake with the rush to get those important fixes in yesterday was repaired.
This should, as noted above, be fine, though I have not tested it with the jparse subdirectory of mkiocccentry. However given that it's now synced up (though not merged in the master branch) it should be fine.
I agree there. I can always work on it in the interim of course.
Great. I hope that the new options I described will be easy enough. I don't know if you want those in the mkiocccentry repo or not. I can guess that you do want them in and I also guess you want the minimum version of the jstrdecode in the bin tools (here) updated too. I can do that for sure.
That makes sense.
It is now thanks to your comment .. you were right: that file did have some changes as did some others. Some of them I am not surprised as I made those changes after the pull request. Others surprised me. I thought I had synced them. But obviously I didn't. The man page files were swapped too thanks to you. I guess I shouldn't have tried to rush it when I had to leave early morning but at least it should be fine now and we also found some other issues in the process. |
!sknahT .did yllautca ti ,yako ... ton did tI |
I guess that this issue here though can be closed as complete? Unless you want the new options to the encode/decode tools in first. I'm going to go afk a bit .. when back I hope I can work on the new options. |
The The jstrencode now needs its version. |
The jstrencode version was done a bit ago .. doing a make test. Increased version of each to 2.5 as it seems like a significant update. Also you might enjoy the modification to the help string of the After these are synced to mkiocccentry I'll do a make install from the mkiocccentry subdirectory and then do a make www though I can't imagine there will be any problems. I will also update the minimum version of the tool in the bin/ scripts. |
Well okay a test did fail .. I wonder why. Probably the new flag but I'll check. |
Hmm ... strange issue. Working on it. |
Oh. I think I see. Investigating. |
Solved. Will do a commit soonish .. then sync to mkiocccentry then work on the website scripts. |
Committed to the jparse repo. Will now make sure it goes well in test workflow. Assuming all is good I'll sync to mkiocccentry and then update the bin/ tools here. After that I'll run |
The So the above JSON examples to not apply. |
Thank you for resolving this imporant issue @xexyl |
Hmm .... that is a valid point indeed given the name. However the examples have always shown that and I think that also came from you, though perhaps not. But the name would suggest that it's only strings. On the one hand it would be nice to have a general encoder/decoder but on the other ... I wonder what should be done. Probably need to rethink this and then maybe take the -s option away. It might be good to still have the option I was working on though, to validate as JSON or not. But it might not. I'll close the pull request over at the mkiocccentry repo for now. |
Or actually I'll turn it into a draft. And you're welcome. |
My guess is the following. Given the name I can get rid of the The library change is still good, however. Then I can also use the new options I added that have the parser validate it. That being said if we do this then examples have to be heavily modified. |
Hello from eastern Greenland 🇬🇱 👋 |
Actually .. it turns out that the use of the library update is very useful! It's just that we now will assume it's only strings. I have added a If I have already removed the |
And yet .. sadly, due to human arrogance, it's getting more green - in colour, and less green in environment :( Anyway glad travels are going well! Stay safe and well! |
Unsure if jstrdecode is supposed to quote the output now .. I think not since it's decoding, not encoding. I'll do it that way but if you can think of a case where an option would be good for that, please let me know. |
By default, If |
Thanks for the reminder! |
Is there an existing issue for this?
Describe the bug
Tools such as:
Are using
jstrencode(1)
when they need to take JSON encoded strings and decode them into "real strings" for use in markdown and HTML pages.What you expect
Tools such as:
would use
jstrdecode(1)
to decode JSON encoded strings, converting them into "real" strings.Environment
Anything else?
Things may have gone amiss with commit e58bd97 (dated: Fri Nov 1 06:45:02 2024 -0700)
This seems to be linked to commit 0a7c9673fa7797f1e9c2c87dea377edb03816f03 (dated: Thu Oct 31 11:57:38 2024 -0700) from the "other repo".
UPDATE 0
This is a Great Fork Merge show stopper.
The text was updated successfully, but these errors were encountered: