Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: the concept of jstrencode(1) appears to be both very wrong and very buggy #28

Open
1 task done
xexyl opened this issue Nov 14, 2024 · 30 comments
Open
1 task done
Labels
bug Something isn't working

Comments

@xexyl
Copy link
Owner

xexyl commented Nov 14, 2024

Is there an existing issue for this?

  • I have searched for existing issues and did not find anything like this

Describe the bug

Based on JavaScript encoding of JSON, there appear to be multiple problems with our tool.

First of all, it appears that the \ being converted to a \\ is wrong. I will give examples in the what we should expect section.

The next problem is that code points should be converted to unicode symbols just like with jstrdecode(1). This is how it is with JavaScript too: both encode and decode need to do this.

Another problem is that the \ escape chars should not be done the way we have it. See what to expect for examples.

What you expect

In order of the problems above, here are examples of what jstrencode(1) does and what JavaScript does.

If we have the JavaScript:

const json_to_encode = {
          name: "\u0f0ff\bo"
        };

it SHOULD (i.e. this is what JavaScript does) encode to:

{"name":"༏f\bo"}

but our tool converts the string to:

$  jstrencode '"\u0f0ff\bo"'
\"\\u0f0ff\\bo\"

Notice the double \ before the b! Now on the subject of the escaped quotes in the beginning see the anything else section.

Now as far as the code points go, javascript of:

const json_to_encode = {
          name: "\u0f0f"
        };

converts to:

{"name":"༏"}

but our tool does this:

$ jstrencode '{"name": "\u0f0f"}'
{\"name\": \"\\u0f0f\"}

or as just a string:

$ jstrencode '"\u0f0f"'
\"\\u0f0f\"

Now if the \uxxxx was converted to a unicode symbol it might just be that the \" surrounding the output would be the difference but I am not sure of this.

The third problem appears to be even worse. There might be other cases where something like this happens but anyway the JavaScript:

const json_to_encode = {
          name: "\u0f0ff\co"
        };

... turns into:

{"name":"༏fco"}

Notice how the \c has the \ silently removed and the c is by itself (or rather it's after the unicode symbol and before the o).

But our tool does something extremely wrong. First as a string by itself:

$ jstrencode '"\u0f0ff\co"'

0;xexyl@xexyz:~$

.. or as the json with {}:

$ jstrencode {"name":"\u0f0ff\co"}

0;xexyl@xexyz:~$

As can be seen the encoding concept we have appears to be totally wrong.

Environment

  • OS: linux for tool tests, macOS for JavaScript but really should be n/a
  • Device: n/a
  • Compiler: n/a

jparse_bug_report.sh output

n/a

Anything else?

As for the escaped " surrounding the string. I guess it depends on how this tool will be used but even so it would appear to be the wrong default.

As for the doubling of \ it also depends on how we need to use it.

But clearly the \ of non-valid escape chars seems to be wrong.

Of course it could be that it's because I am tired or because I do not really know JavaScript but it might not be. I think it probably isn't either of those. But of course depending on what our tool needs to do that deviates from a normal json encoder ....

@xexyl
Copy link
Owner Author

xexyl commented Nov 14, 2024

One thing that should be done and which I will do is to rename the utf8encode() function to not suggest encoding or decoding but rather unicode. This way there is no confusion in the decode function.

xexyl added a commit that referenced this issue Nov 14, 2024
Renamed utf8encode() to utf8_to_unicode() to be less confusing as although
converting a code point to unicode is called encoding (from all sources we have
seen) in JSON, according to encoders **AND** decoders out there, a code point in a
string should be converted to unicode.

A number of bugs have been identified in jstrencode(1) during discussion in
'the other repo' (or one of the 'other repos'). This is in jstrencode(1) now
(as of yesterday); prior to yesterday it was in jstrdecode(1) due to the
unfortunate swap in names. This swap happened because when focusing on issue #13
(the decoding - which turned out to be encoding - bug of \uxxxx) focus of the
fact that the jstrencode(1) tool is not strictly UTF-8 but rather JSON
was lost. The man page has had these bugs added so it is important to
remove them when the bugs are fixed. A new issue #28 has been opened for
these problems.
@xexyl
Copy link
Owner Author

xexyl commented Nov 14, 2024

Something I do not understand, looking at the so-called json spec and the JavaScript output, is that for the non valid \- characters, it would appear that the \ should just be part of the string. But then in JavaScript it looks like the \ is removed and the character is there.

I presume that this is not understanding the standard correctly (if we can call something about it 'correct' :-) ) or because I am tired and not reading the grammar right. I rather suspect it's a bit of both.

@lcn2
Copy link
Contributor

lcn2 commented Nov 16, 2024

We are in the process of making some minor edits to jstrdecode(1) (and friends). We will apply these same edits to the mkiocccmetry repo. This in order to fix some minor issues with make quick_www and make www in the "other (temp-test-ioccc) repo".

We do not expect these edits to me major, just a few tweaks needed.

Once the make quick_www and make www work well, we will issue a PR with respect to this repo as well as an commit update to the mkiocccmetry repo and finally a commit to the "other (temp-test-ioccc) repo".

Please, @xexyl, stand by / try to withhold any edits and PRs for short term. We will advise when we are done.

@xexyl
Copy link
Owner Author

xexyl commented Nov 16, 2024

We are in the process of making some minor edits to jstrdecode(1) (and friends). We will apply these same edits to the mkiocccmetry repo. This in order to fix some minor issues with make quick_www and make www in the "other (temp-test-ioccc) repo".

We do not expect these edits to me major, just a few tweaks needed.

Once the make quick_www and make www work well, we will issue a PR with respect to this repo as well as an commit update to the mkiocccmetry repo and finally a commit to the "other (temp-test-ioccc) repo".

Please, @xexyl, stand by / try to withhold any edits and PRs for short term. We will advise when we are done.

Thank you! I won't make any edits. I am done for the day and given that (as I brought up in the other thread) I don't even know what you have in mind for the functionality I probably can't do much anyway.

@lcn2
Copy link
Contributor

lcn2 commented Nov 17, 2024

We have a solution that appears to work well for make www. We are doing to dinner and then we will come back a re-test. If all does well we will issue a PR for this repo as well as commits for the other repos.

@lcn2
Copy link
Contributor

lcn2 commented Nov 17, 2024

We have a solution that appears to work well for make www. We are doing to dinner and then we will come back a re-test. If all does well we will issue a PR for this repo as well as commits for the other repos.

UPDATE 0

When we return after dinner, we plan to update the documentation as per our minor edits.

@lcn2
Copy link
Contributor

lcn2 commented Nov 17, 2024

NOTE: In all of these example, we do NOT have a trailing newline.

Regarding a string with a trailing backslash, and here our examples, consider this string:

foo\

When encoded by jstrencode(1), would produce, by default, the following JSON string:

"foo\\"

All good. But let's now mess with things. 🤭

What should happen if we were to give this to jstrdecode(1):

"foo\"

In the above, we have 4 characters foo\ enclosed with double quotes.

What should happen? An ERROR should be thrown.

Depending on how you want to slice the problem, one could consider the string to end with \" and thus lacks the necessary trailing end double quote, OR the string does have enclosing double quotes but has a dangling backslash.

Either way, the JSON encoded string is invalid and the resulting JSON is not valid.

It's your choice as to what error you'd prefer to throw.

We recommend such a test case be added to the "invalid JSON" test suite.

@lcn2
Copy link
Contributor

lcn2 commented Nov 17, 2024

We have a solution that appears to work well for make www. We are doing to dinner and then we will come back a re-test. If all does well we will issue a PR for this repo as well as commits for the other repos.

UPDATE 0

When we return after dinner, we plan to update the documentation as per our minor edits.

See PR #29

UPDATE 0

With pr #29 and release 1.6.13 2024-11-16 of the mkiocccentry repo and commit c47f3abbe47cc63c33a5f4b99ac6ca35cb773dfe of the temp-test-ioccc repo we have now entered a soft code freeze for the mkiocccentry repo.

UPDATE 1a

Please see, @xexyl, GH-issuecomment-2480929883.

Please also see GH-issuecomment-2481266359.

@xexyl
Copy link
Owner Author

xexyl commented Nov 17, 2024

Good things above. I'll reply now. Unfortunately did not have time to address anything before the mkiocccentry sync but these things are not really important for there I guess. I don't think you addressed everything I brought up in this issue but we can discuss those later.

@xexyl
Copy link
Owner Author

xexyl commented Nov 17, 2024

NOTE: In all of these example, we do NOT have a trailing newline.

Regarding a string with a trailing backslash, and here our examples, consider this string:

foo\

When encoded by jstrencode(1), would produce, by default, the following JSON string:

"foo\\"

All good. But let's now mess with things. 🤭

What should happen if we were to give this to jstrdecode(1):

"foo\"

In the above, we have 4 characters foo\ enclosed with double quotes.

What should happen? An ERROR should be thrown.

Depending on how you want to slice the problem, one could consider the string to end with \" and thus lacks the necessary trailing end double quote, OR the string does have enclosing double quotes but has a dangling backslash.

Either way, the JSON encoded string is invalid and the resulting JSON is not valid.

It's your choice as to what error you'd prefer to throw.

We recommend such a test case be added to the "invalid JSON" test suite.

Other than this issue there are others as well.

It appears to me that we would have to keep track of each character in the decoding code. That would complicate matters greatly. We deliberately designed the regexp in the parser/lexer to not worry about certain details since we could take care of them in the code outside the parser/scanner. But the ending quotes would be easily solved by passing the strings through the parser for validation.

Some might say this would be overly complicated but it might be the cleaner way since otherwise we would have to add even more code to the decode functions (and maybe encode too?). It would be far easier to use the parser to do this and after all, isn't that the point of it - to verify it's valid and to provide a way to traverse it? We do not have to traverse it but it using the parse function would sure simplify these tools when it comes to errors.

What do you think?

@xexyl
Copy link
Owner Author

xexyl commented Nov 17, 2024

Unsure but it might also help with some of the other problems in this issue. Not sure. Just a thought. I'll look back at this thread and the OT one tomorrow. Got behind in things I have to do today so doing those things now.

@lcn2
Copy link
Contributor

lcn2 commented Nov 17, 2024

NOTE: In all of these example, we do NOT have a trailing newline.
Regarding a string with a trailing backslash, and here our examples, consider this string:

foo\

When encoded by jstrencode(1), would produce, by default, the following JSON string:

"foo\\"

All good. But let's now mess with things. 🤭
What should happen if we were to give this to jstrdecode(1):

"foo\"

In the above, we have 4 characters foo\ enclosed with double quotes.
What should happen? An ERROR should be thrown.
Depending on how you want to slice the problem, one could consider the string to end with \" and thus lacks the necessary trailing end double quote, OR the string does have enclosing double quotes but has a dangling backslash.
Either way, the JSON encoded string is invalid and the resulting JSON is not valid.
It's your choice as to what error you'd prefer to throw.
We recommend such a test case be added to the "invalid JSON" test suite.

Other than this issue there are others as well.

It appears to me that we would have to keep track of each character in the decoding code. That would complicate matters greatly. We deliberately designed the regexp in the parser/lexer to not worry about certain details since we could take care of them in the code outside the parser/scanner. But the ending quotes would be easily solved by passing the strings through the parser for validation.

Some might say this would be overly complicated but it might be the cleaner way since otherwise we would have to add even more code to the decode functions (and maybe encode too?). It would be far easier to use the parser to do this and after all, isn't that the point of it - to verify it's valid and to provide a way to traverse it? We do not have to traverse it but it using the parse function would sure simplify these tools when it comes to errors.

What do you think?

We do NOT think that internal string formatting, such as trialing backslashes and Unicode/UTF-8, belong in the bison/flex scanner/parser. We strongly recommend against involving the bison/flex scanner/parser in details such as determine if a JSON strings properly encoded.

@lcn2
Copy link
Contributor

lcn2 commented Nov 17, 2024

We need to move on and focus on the Great Fork Merge and getting ready for the IOCCC28.

After the IOCCC28 closes, and the judges go on a vacation, and we address calc v3 fork, we can return to considering improvements to this JSON parser.

UPDATE 0

Sorry (tm Canada 🇨🇦).

@xexyl
Copy link
Owner Author

xexyl commented Nov 17, 2024

I just meant if we were to invoke it it would be simple to solve these problems. If not though how do you propose to solve the problems?

My point was that the parser would flag these as errors and the problem would be solved. We would not have to update the scanner/parser at all.

But if you can think of something better or easier please let me know.

@xexyl
Copy link
Owner Author

xexyl commented Nov 17, 2024

We need to move on and focus on the Great Fork Merge and getting ready for the IOCCC28.

After the IOCCC28 closes, and the judges go on a vacation, and we address calc v3 fork, we can return to considering improvements to this JSON parser.

UPDATE 0

Sorry (tm Canada 🇨🇦).

I'm looking forward to it so no problem. I also have other things to concern myself with as well.

These do not seem critical, certainly, even if it'd be better if it was resolved. The example you gave is easy to see what is happening but solving it is more complicated than figuring out what it is.

Anyway best wishes. We can (as I think I said already) talk about this another time after the next IOCCC.

@lcn2
Copy link
Contributor

lcn2 commented Nov 17, 2024

I just meant if we were to invoke it it would be simple to solve these problems. If not though how do you propose to solve the problems?

My point was that the parser would flag these as errors and the problem would be solved. We would not have to update the scanner/parser at all.

But if you can think of something better or easier please let me know.

Let the routines that form the parse tree call functions that detect what the scanner/parser thinks of as a string, and if the JSON string is not properly encoded (for whatever reason), raise an error and not return the parse tree. That way, the scanner/parser is agnostic as to the internal content of things such as strings. Meanwhile from the point of an application calling the JSON scanner/parser, a bogus encoding of a JSON string is still flagged as an error.

We hope this helps.

UPDATE 0

Some typos fixed.

UPDATE 1

Time to move on for us as we need to focus on the two significant open Great Fork Merge TODOs.

@xexyl
Copy link
Owner Author

xexyl commented Nov 18, 2024

I just meant if we were to invoke it it would be simple to solve these problems. If not though how do you propose to solve the problems?
My point was that the parser would flag these as errors and the problem would be solved. We would not have to update the scanner/parser at all.
But if you can think of something better or easier please let me know.

Let the routines that form the parse tree call functions that detect what the scanner/parser thinks of as a string, and if the JSON string is not properly encoded (for whatever reason), raise an error and not return the parse tree. That way, the scanner/parser is agnostic as to the internal content of things such as strings. Meanwhile from the point of an application calling the JSON scanner/parser, a bogus encoding of a JSON string is still flagged as an error.

That, as I read it, is exactly what I was getting at. Using the parser can determine if the string is properly encoded. If it's not an error is raised and the decoding function is not called. If it is valid JSON it can be called.

Also as an extra sanity check we could invoke the JSON parser on the encode tool after encoding, to verify that it's valid JSON.

I'll do that but I have some other things in my mind to do with jparse. First one I think will improve error messages when working with strings and not files.

We hope this helps.

Yes it does.

UPDATE 0

Some typos fixed.

Ever notice that TYPOS is an anagram for SPOT + Y? And TYPO is an anagram for POT+Y? :-)

UPDATE 1

Time to move on for us as we need to focus on the two significant open Great Fork Merge TODOs.

Great!

Today I have a blood draw, unfortunately. Not that vampires really bother me but I'd rather not go out. Also this place might insist I sit down and I really don't want to esp as I always change clothes after sitting down in another's chair and our washer is not working right (hopefully it can be repaired Wednesday). So I'm doing everything I can to not change outer clothes until it's fixed.

Hope you have a great day! In a while I will probably be afk a bit but I do hope to do a number of things here today. I also have other things to do too of course.

Hope you slept well or you're having a nice sleep! I also hope that if you're back home that you had a nice holiday and you're safe and sound and well!

THANK YOU for the feedback! It won't solve all the issues I raised in this thread but it will solve one that you came up with (I did not even think of the example you gave but it's a great one to fix).

@xexyl
Copy link
Owner Author

xexyl commented Nov 18, 2024

That new functionality is added .. just running make test. If all is good I'll then work on the jstrdecode issue. But I might be going afk before that: not sure.

@xexyl
Copy link
Owner Author

xexyl commented Nov 18, 2024

Next thing actually is fixing the man page ... for newer functions and strings. But I have to go afk now for a while.

Once I've done the man pages I'll then return to jstrdecode. Already added the new functionality (as in committed) of the string parsing wrt better error messages.

@xexyl
Copy link
Owner Author

xexyl commented Nov 18, 2024

I ended up updating the library in a significant way. I noticed that some function names are ambiguous. Thus the names were renamed. However given that these are internal functions it should not matter. Namely:

-extern struct json *parse_json_string(char const *string, size_t len);
-extern struct json *parse_json_bool(char const *string);
-extern struct json *parse_json_null(char const *string);
-extern struct json *parse_json_number(char const *string);
-extern struct json *parse_json_array(struct json *elements);
-extern struct json *parse_json_member(struct json *name, struct json *value);
+extern struct json *json_parse_string(char const *string, size_t len);
+extern struct json *json_parse_bool(char const *string);
+extern struct json *json_parse_null(char const *string);
+extern struct json *json_parse_number(char const *string);
+extern struct json *json_parse_array(struct json *elements);
+extern struct json *json_parse_member(struct json *name, struct json *value);

They were too similar to the parse_json() functions in name.

I also updated the jparse.3 man page.

Unfortunately I also accidentally ODed on a medication so I am feeling quite awful. This should not happened but unfortunate circumstances led to this. I had a blood draw but that was postponed until Friday. I'll be fine though.

I hope to look at the parse call in jstrdecode but I'm not sure if I'll get to that today. If I don't today I won't until Wednesday at least most likely.

However I think once it is done it'll be worth it. It won't solve all problems here but it will solve some.

Of course none of this needs to be synced to mkiocccentry right now .. can wait until after IOCCC28 unless of course you need the jstrdecode bug fixes applied for the website.

I hope you're doing well with the Great Fork Merge! I do have other things I have to do and given what happened it seems that I'll have to start doing that soon. Though I would think the adding the parser to jstrdecode won't take much effort: though it'll also require a new option, namely -J. It might also be good to have an option that says to NOT use the parser. I can think of a few cases like the jstr_test.sh file where this might be useful but I'll have to wait and see .. ideally would be good to fix that script too.

@xexyl
Copy link
Owner Author

xexyl commented Nov 18, 2024

Functionality added .. have to fix jstr_test.sh. If all is good I'll commit and that'll be all I do today - well after updating documentation I'll commit.

@xexyl
Copy link
Owner Author

xexyl commented Nov 18, 2024

Hmm .. strange. The input '"foo\"' did cause an error. Still the parser being there is a good thing so I'll commit soon. Then I must go.

UPDATE 0

Well okay it only warned but now it shows an error.

@xexyl
Copy link
Owner Author

xexyl commented Nov 18, 2024

Oh great .. now there is a linker error in linux:

/usr/bin/ld: ../libjparse.a(json_parse.o): in function `json_process_floating':
/home/xexyl/code/jparse/json_parse.c:3083: undefined reference to `floorl'
/usr/bin/ld: /home/xexyl/code/jparse/json_parse.c:3111: undefined reference to `floor'
/usr/bin/ld: /home/xexyl/code/jparse/json_parse.c:3133: undefined reference to `floorl'

@xexyl
Copy link
Owner Author

xexyl commented Nov 18, 2024

Nothing should have changed here .. no idea how to fix. Feature test macros do not work :(

@xexyl
Copy link
Owner Author

xexyl commented Nov 18, 2024

This problem has been going on for some time it seems. But I don't know why. Not good.

@xexyl
Copy link
Owner Author

xexyl commented Nov 18, 2024

My concern is what if it's something that's a blocker for people in linux using it for mkiocccentry? I'll open a bug report and then I must leave.

@xexyl
Copy link
Owner Author

xexyl commented Nov 18, 2024

Interesting .. even adding -lm to LDFLAGS and I do not see it in the compile line. Hmm....

@xexyl
Copy link
Owner Author

xexyl commented Nov 18, 2024

Got it. It was util_test in test_jparse. Okay that's a relief. Will fix it but that means it'll have to be updated in mkiocccentry. Sorry :(

@xexyl
Copy link
Owner Author

xexyl commented Nov 19, 2024

Oh great .. now there is a linker error in linux:

/usr/bin/ld: ../libjparse.a(json_parse.o): in function `json_process_floating':
/home/xexyl/code/jparse/json_parse.c:3083: undefined reference to `floorl'
/usr/bin/ld: /home/xexyl/code/jparse/json_parse.c:3111: undefined reference to `floor'
/usr/bin/ld: /home/xexyl/code/jparse/json_parse.c:3133: undefined reference to `floorl'

With commit ef45fa8 this horrible mess should be prevented in the future. This caused the problems in mkiocccentry and another problem here too. The -lm was not even needed it turns out, all because of the makefile.local file I had on the server!

If you wish me to remove that functionality from dyn_array and dbg I can do that but not until tomorrow. I know mkiocccentry is in code freeze status so that can happen later. I just hope I can fix my clone .. but might have to wait until another commit is pushed to it as it seems that it might want to merge (for reasons beyond my comprehension) commits that were closed.

Good luck with the Great Fork Merge and good day! It's going to be a difficult, long day and that's why I'm stopping now .. working on other things for the time being but probably a bad idea to work here until tomorrow.

@lcn2
Copy link
Contributor

lcn2 commented Nov 19, 2024

TIME TO PAUSE

We urgently need to FOCUS on IOCCC28 matters and this will have to respectfully delay for months responding to comments about this repo. Unless we do that, IOCCC28 is not going to start this calendar year ... and we don't want to deal with the consequences of that!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants