Enhancement: finish the C-based general JSON parser #156
Comments
In support of this issue, we plan to add a dynamic array interface. This will be needed to support building structures for JSON arrays, for example. |
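For readers following along, here is a minimal sketch of what a dynamic array interface of this kind could look like. The names `struct dyn_array`, `dyn_array_create()`, `dyn_array_append()` and `dyn_array_free()` are hypothetical and are not taken from the repo; the doubling growth policy is likewise just an assumption for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable array of fixed-size elements (NOT the repo's actual API). */
struct dyn_array {
    size_t elm_size;    /* size of one element in bytes */
    size_t count;       /* number of elements in use */
    size_t allocated;   /* number of elements allocated */
    void *data;         /* element storage, may move when the array grows */
};

/* Create an empty array; returns NULL if elm_size is 0 or allocation fails. */
struct dyn_array *
dyn_array_create(size_t elm_size, size_t start_count)
{
    struct dyn_array *arr;

    if (elm_size == 0) {
        return NULL;
    }
    if (start_count == 0) {
        start_count = 1;
    }
    arr = calloc(1, sizeof(*arr));
    if (arr == NULL) {
        return NULL;
    }
    arr->data = calloc(start_count, elm_size);  /* calloc checks for overflow */
    if (arr->data == NULL) {
        free(arr);
        return NULL;
    }
    arr->elm_size = elm_size;
    arr->allocated = start_count;
    return arr;
}

/* Append a copy of *elm, doubling the storage when full; 0 on success, -1 on failure. */
int
dyn_array_append(struct dyn_array *arr, void const *elm)
{
    assert(arr != NULL && elm != NULL);
    if (arr->count == arr->allocated) {
        size_t new_alloc;
        void *p;

        /* refuse to grow if the new byte count would overflow size_t */
        if (arr->allocated > SIZE_MAX / 2 / arr->elm_size) {
            return -1;
        }
        new_alloc = arr->allocated * 2;
        p = realloc(arr->data, new_alloc * arr->elm_size);
        if (p == NULL) {
            return -1;
        }
        arr->data = p;
        arr->allocated = new_alloc;
    }
    memcpy((char *)arr->data + arr->count * arr->elm_size, elm, arr->elm_size);
    arr->count++;
    return 0;
}

/* Free the array and its storage; a NULL array is ignored. */
void
dyn_array_free(struct dyn_array *arr)
{
    if (arr != NULL) {
        free(arr->data);
        free(arr);
    }
}
```

A JSON array builder could append one element per parsed value (for example a `struct json *`) without knowing the element count in advance. Because the storage may move on growth, callers should not hold pointers into `data` across appends.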
Sounds good. We can then discuss it more here. Hope you're having a nice sleep my friend! EDIT: I hope to in the coming days write some information on where I am with this as in what is done so far and what I'm thinking on etc. It might very well not be until the middle of next week but we'll see. |
We will assume that, for compound JSON items (such as arrays, objects with members, and members), their sub-components will be parsed before the parent object is parsed. For example the JSON parser will need to parse the zero or more elements of an array before the parent JSON array is finally parsed. So when parsing: [
"foo" : "bar",
"curds" : 123,
"when" : { "fizz" : "bin" }
] The array object will be completely parsed after the 3 elements are parsed, so the CORRECTION: The cut and paste of info was botched, sorry (tm Canada :) ). Reentering this later with more info. |
Not sure if I understand this sentence particularly the first part. Mind clarifying?
I'll wait and see what you have in mind with the array code as it'll allow me to better come up with any questions/thoughts (I think). |
Okay I'll be off for the day now. I might have a chance to reply to messages later but not sure of that. Tomorrow morning I will have a bit of time but later in the morning until the end of the day I won't be able to interact with you so I hope you have a great Saturday. I am sure I'll do some things or at least reply to some messages though. Good day! |
When a compound JSON item is parsed, such as a JSON array, the sub-items of the compound JSON item need to be parsed before the parse of the compound JSON item is completed. What is not clear, because it depends on how Take this compound JSON item: "curds" : 123 Somewhere in the timeline, the following JSON: "curds" will be identified as a string and then sometime The And somewhere in the timeline, the following JSON: 123 will be identified as an integer and then sometime The So now you have two The JSON parser, have identified a member will call by a (as yet to be written) function that creates |
From the above comment, functions such as the The ...
struct json_integer *foo;
struct json *ret;
...
ret = calloc(1, sizeof(struct json));
if (ret == NULL) {
...
}
...
foo = calloc(1, sizeof(struct json_integer));
if (foo == NULL) {
...
}
...
ret->type = JTYPE_INT;
ret->element.integer = foo;
ret->parent = NULL;
ret->prev = NULL;
ret->next = NULL;
...
return ret; OK, with better variable names and more comments 👍, but hopefully you get the concept. We will start to make the needed changes. |
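To make the concept above concrete, here is a self-contained sketch with the elided parts filled in under stated assumptions. The member names `type`, `element.integer`, `parent`, `prev`, `next` and the constant `JTYPE_INT` come from the fragment above; everything else (the `enum item_type` name, the `value` field, `new_int_node()`, `free_int_node()`, and returning NULL on allocation failure) is hypothetical and may differ from what the repo ends up doing.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Only the node type this sketch needs; the enum name is an assumption. */
enum item_type { JTYPE_UNSET = 0, JTYPE_INT };

/* Hypothetical payload for an integer item. */
struct json_integer {
    intmax_t value;     /* the converted integer value */
};

/* Parse tree node, following the member names used in the thread. */
struct json {
    enum item_type type;
    union {
        struct json_integer *integer;
    } element;
    struct json *parent;
    struct json *prev;
    struct json *next;
};

/* Allocate an unlinked JTYPE_INT node; returns NULL on allocation failure. */
struct json *
new_int_node(intmax_t value)
{
    struct json *ret;
    struct json_integer *item;

    ret = calloc(1, sizeof(*ret));
    if (ret == NULL) {
        return NULL;
    }
    item = calloc(1, sizeof(*item));
    if (item == NULL) {
        free(ret);      /* do not leak the outer node */
        return NULL;
    }
    item->value = value;

    /* the node starts out unlinked; the parser links it into the tree later */
    ret->type = JTYPE_INT;
    ret->element.integer = item;
    ret->parent = NULL;
    ret->prev = NULL;
    ret->next = NULL;
    return ret;
}

/* Free a node created by new_int_node(); a NULL node is ignored. */
void
free_int_node(struct json *node)
{
    if (node == NULL) {
        return;
    }
    assert(node->type == JTYPE_INT);
    free(node->element.integer);
    free(node);
}
```

Keeping the node unlinked at creation time matches the idea that sub-items are built first and attached to their parent only once the parent is completed.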
Good morning my friend! I hope you're having a nice sleep though given the time; I cannot sleep so I decided to sit up.
I understand that all too well but I think we all do.
That's what I was thinking with the sentence above this one (okay, the sentence above the sentence above this one! :) ): how could I force the order? On the other hand though it makes sense now that I've read it after a long sleep (though I wish I was still asleep as I'll be awake many hours today though for good reasons).
Right.
Okay. That makes sense. Actually though I'm still unsure how to use the: %union json_type {
struct json *json;
...
} because in the lexer I'd have e.g.: yylval->json but this means it has to be allocated ... and then inside that there has to be the right allocations and assignments. Thus I think I have to change it to: %union json_type {
struct json json;
...
} and that would greatly simplify things. In fact I'll try doing that today (if I remember to do so after I write my replies to your comments).
Right. I was thinking of that as I was reading the above as well and not sure how to address that.
That sounds reasonable. If you're up to doing that that would be of help: I mean writing the function that sets the name and value in |
That's an interesting point too because in the book they use a cast as well so that they have a single type (but cast when necessary). On the other hand since
Of course.
Thanks! That will be of great help! |
Now as for the The problem is of course that the parser needs token names as well and I figured the prefix UPDATE 0: commit cccbff7 takes care of the changing the
UPDATE 1: I don't plan on doing anything else but as it's still early it's possible I do. Tomorrow will be a slow day almost certainly though if nothing else we can discuss some things. |
With commit 8f9e9f9 we have added code to generate JSON parse tree nodes from the fundamental JSON types:
Each of the above interfaces are given a pointer to the beginning character of the JSON item and a length. This pointer/length interface allows one to point to text within a larger character block. The text does NOT need to be NUL but terminated. Consider this JSON: { "foo" : 123 } For the sake of this example, assume: char *foo = "{ \"foo\" : 123 }";
/* ^-- bar == foo+10 */
char *bar = foo+10; /* point at the '1' in the above JSON string */ One may call struct json *node = NULL;
...
node = malloc_json_conv_int(bar, 3); These malloc_json_conv_foo() functions return a pointer to a The malloc_json_conv_foo() functions will never return NULL. In the extreme case of a For each malloc_json_conv_foo() function, there is a malloc_json_conv_foo_str() function. char *bar = "123";
struct json *node = NULL;
...
node = malloc_json_conv_int(bar, NULL); Like malloc_json_conv_foo() functions , they will NOT return NULL. And as before, in the extreme case of a The malloc_json_conv_foo_str() function interface includes a pointer to a string size, instead of a string length. If that pointer is NULL (as was in the above example) no length is stored. However one can be given the length of the string as a side effect of the call as in: char *bar = "123";
struct json *node = NULL;
size_t len;
...
node = malloc_json_conv_int(bar, &len); One should use the malloc_json_conv_foo() functions to point to a substring inside a larger JSON block, and use the malloc_json_conv_foo_str() function interface to point to a C string that contains the JSON item in question. Neither the malloc_json_conv_foo() functions nor the malloc_json_conv_foo_str() functions modify the text they are given. Moreover each one of them mallocs a C string containing a copy of the JSON item. In each of the above examples, Each JSON parse tree node has an important structure member: When The The JSON string conversion does NOT include the JSON double quotes. Thus: struct json *node = NULL;
char *foo = "foo";
char *bar = "\"bar\"";
...
node = malloc_json_conv_string_str(foo, NULL, false); /* correct */
...
node = malloc_json_conv_string_str(bar, NULL, false); /* incorrect */ The JSON integer and floating point conversion routines will ignore leading and trailing whitespace. Normally in a JSON document, there should be NO such leading and trailing whitespace. However the C string the numerical value functions ignore leading and trailing whitespace. The |
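The difference between the pointer/length form and the C string form can be sketched as follows. This is not the repo's code: `sketch_conv_int()` and `sketch_conv_int_str()` are hypothetical names, they return a bool plus an output value rather than a `struct json *`, and `strtoimax()` is used purely for illustration (it accepts a leading `+`, for instance, so it is not a JSON number validator).

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Convert len bytes starting at ptr into an integer.
 * The bytes need not be NUL terminated and are never modified.
 * Returns true and sets *out on success, false otherwise.
 */
bool
sketch_conv_int(char const *ptr, size_t len, intmax_t *out)
{
    char *copy;
    char *end = NULL;
    intmax_t val;
    bool ok;

    assert(out != NULL);
    if (ptr == NULL || len == 0) {
        return false;
    }

    /* ignore leading and trailing whitespace */
    while (len > 0 && isspace((unsigned char)ptr[0])) {
        ++ptr;
        --len;
    }
    while (len > 0 && isspace((unsigned char)ptr[len - 1])) {
        --len;
    }
    if (len == 0) {
        return false;
    }

    /* work on a NUL terminated copy so the caller's text stays untouched */
    copy = malloc(len + 1);
    if (copy == NULL) {
        return false;
    }
    memcpy(copy, ptr, len);
    copy[len] = '\0';

    errno = 0;
    val = strtoimax(copy, &end, 10);
    ok = (errno == 0 && end == copy + len);     /* all bytes must be consumed */
    free(copy);
    if (ok) {
        *out = val;
    }
    return ok;
}

/*
 * C string variant: the length is computed with strlen() and, if retlen
 * is non-NULL, reported back to the caller as a side effect.
 */
bool
sketch_conv_int_str(char const *str, size_t *retlen, intmax_t *out)
{
    size_t len;

    if (str == NULL) {
        return false;
    }
    len = strlen(str);
    if (retlen != NULL) {
        *retlen = len;
    }
    return sketch_conv_int(str, len, out);
}
```

With the example document from this comment, `sketch_conv_int(foo + 10, 3, &v)` would see only the three bytes `123`, while the `_str` form is the convenient choice when the item already sits in its own C string.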
We plan to work on functions that form compound JSON nodes such as |
With commit 3179f21 the dynamic array facility has been added. TODO: Change read_all() to use dynamic array facility, then |
I guess the but was accidentally added; looking at the code I can only presume so since the length is passed in as a parameter and that's used in the call to Is there a reason throughout the code you use
Right. This makes sense.
Of course we can't know the length of the number like that. However in the lexer we could assign a
Okay. I will have to think on how to do this. It might be obvious but not in my current state: I was awake quite late for me and awake at 0400 so I'm not awake enough. I probably won't fully process this until I start looking at the code and maybe the book. I actually do think chapter 3 will be of value: it does create a tree for the calculator mostly through bison. Of course other chapters will also be of value but I think chapter 3 will be quite valuable.
This makes sense.
Good to know this. I just looked at the string version and I see it's just a simplified form.
Is there a mistake here? Because the function prototype looks like: struct json *
malloc_json_conv_int(char const *str, size_t len) Ah. Perhaps you made a typo for
Would you please give an example of each if you have any in mind?
Makes sense.
In which case we cannot rely on the struct being valid.
Right.
I noticed the boolean but I haven't really looked at the code to see what that means - yet.
Okay this is an interesting one then. Since JSON strings actually have the quotes (json string type) maybe we need to either have these functions strip off the outer JTYPE_STRING "(\"(((?=\\)\\([\"\\\/bfnrt]|u[0-9a-fA-F]{4}))|[^\"\\\0-\x1F\x7F]+)*\")" Of course we don't want to include the
No leading whitespace because the regex will extract exactly the values? Or you mean something else? The last sentence makes sense. |
I see there's a lot to process here. If you have any examples in mind that would be great. But I think it might be better to wait on the other helper routines and structs as some of it could change.
I'll wait on this to be done before I ask anything or make any comments. What else has to be done btw? |
We have more edits along the way in Once that has been coded and tested, then we will be ready to code the JSON compound functions. |
The duplicate the JSON integer string as done using Thanks @xexyl. |
Sure. There are other places that use Should the functions be renamed too to reflect that they use |
Sounds good. I eagerly await to see what you come up with! I'll probably be a few days before I can do anything much but I'm sure each day I'll do a bit. |
The point of the example was to show that the JSON surrounding Assume this JSON document in memory and is pointed by { "foobar" : 12379 } So you don't have to count, by hand, bytes in the above string, here are a few facts about that string:
In this case you would NOT want to use The conversion call you would want to make (using those above hand counted byte addresses :) ) is: node = malloc_json_conv_string(ptr+3, 6, strict); NOTE: The boolean After the above call:
Had
Accordingly you would NOT want to call node = malloc_json_conv_int(ptr+13, 5); After the above call:
Had
Because Continuing just for a few more of the
Because
Containing for a bit (pun intended):
Because the above mentioned values are
We hope this helps, @xexyl. UPDATE: The above example has been expanded on and has undergone minor corrections as there is only so much you can do on an iPad. :-) |
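The hand counted offsets in the example above can be checked with a few lines of C. `dup_span()` and `demo_spans()` are hypothetical helpers written for this note, not functions from the repo; the offsets 3/6 and 13/5 are the ones used in the conversion calls above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return a malloced, NUL terminated copy of len bytes at ptr, or NULL on failure. */
char *
dup_span(char const *ptr, size_t len)
{
    char *copy;

    assert(ptr != NULL);
    copy = malloc(len + 1);
    if (copy == NULL) {
        return NULL;
    }
    memcpy(copy, ptr, len);
    copy[len] = '\0';
    return copy;
}

/* Print the name and value spans of the example document without copying. */
void
demo_spans(void)
{
    char const *ptr = "{ \"foobar\" : 12379 }";

    /* %.*s prints at most the given number of bytes, so no NUL is needed */
    printf("name:  %.*s\n", 6, ptr + 3);    /* prints "name:  foobar" */
    printf("value: %.*s\n", 5, ptr + 13);   /* prints "value: 12379" */
}
```

Note that `ptr + 3` skips the opening double quote and a length of 6 stops before the closing one, which is why the string conversion never sees the JSON quotes.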
Right but I was getting at how the regex in the scanner will include the
I've saved the link and put it in a file for me to look at the comment later on when I have more energy. If you actually answered the above concern in the text I removed then no need to address it - though maybe a quick message saying so would be good just in case I happen to miss it when I look at it more later on. Either way I'm sure the detailed comment will be of value so thank you. |
Yes the JSON surrounding JSON surrounding Of course, the tricky bit is that not every instance of those lexical characters is lexical. If they appear inside a JSON encoded string, they are text to be decoded: {
"foo" : "{}[]:,"
} UPDATE: While your JTYPE_STRING does need the surrounding |
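As an illustration of the point about `{}[]:,` inside strings, here is a small sketch that counts only the structural characters that appear outside of JSON strings. It is not how the flex scanner in the repo works (there the string regex consumes the whole string as one token); `count_structural()` is a hypothetical helper that just tracks whether it is inside a string and whether the previous byte was a backslash.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Count the characters { } [ ] : , that appear outside of JSON strings
 * in the first len bytes of text. Inside a string those bytes are data.
 */
size_t
count_structural(char const *text, size_t len)
{
    bool in_string = false;     /* true while between double quotes */
    bool escaped = false;       /* true if the previous byte was a backslash */
    size_t count = 0;
    size_t i;

    assert(text != NULL);
    for (i = 0; i < len; ++i) {
        char c = text[i];

        if (in_string) {
            if (escaped) {
                escaped = false;        /* this byte is escaped, skip it */
            } else if (c == '\\') {
                escaped = true;
            } else if (c == '"') {
                in_string = false;      /* closing quote */
            }
        } else if (c == '"') {
            in_string = true;           /* opening quote */
        } else if (c != '\0' && strchr("{}[]:,", c) != NULL) {
            ++count;
        }
    }
    return count;
}
```

For the document shown above, only the outer `{`, the `:` and the outer `}` are counted; the six characters inside the string value are not.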
The above document has been expanded on and has undergone minor corrections as there is only so much you can do on an iPad. :-) You might want to refetch it @xexyl.
If you wrote that on the iPad I commend you! That's impressive!
No worries. I just saved the link so I can open it and read it as it stands but thanks for the notice. |
Which means that somewhere they have to be removed so where is the right place? To be clear I'm referring to the fact that I went to test it but I do see there's a problem with the parser syntax so I can't test it. I might work on that next but it won't be today.
I'll have to address this later possibly once I have the parser correct in the above mentioned way. |
So do we think this issue is entirely resolved then ? That would be great! |
With commit 1390b4c error messages now have locations! Check the log:
|
Here's another file: $ ./jparse test/test_JSON/general.json/bad/all.random.text.json
Warning: yylex: invalid token: 0x26 = <&>
syntax error at line 1 on column 1: &
ERROR[1]: ./jparse: invalid JSON Should we then get rid of the warning line ? Or I suppose we could improve it by merging them? Though I'm not sure how it looked before. Let's see with another clone: $ ./jparse test/test_JSON/general.json/bad/all.random.text.json
Warning: yylex: invalid token: 0x26 = <&>
syntax error line: 1: &
ERROR[1]: ./jparse: invalid JSON Looks like we could merge them. What do you think ? It might also be that we (though I doubt it) could show all the contents of strings as we tried to do the other day (we'd have to skip scanning of course to test it and I really don't think it will work but it seems like we had more than one place where the token was referred to anyway). Then again that was strings so I don't know. Still what do you think ? Should we merge the two lines by removing the warning and letting |
The mention of the line number seems helpful. The invalid token warning is also useful. Could the Warning: yylex: line be improved to include the line number? If so then both could be combined. If not, then perhaps both are OK? Those are just our thoughts for you, @xexyl, to consider and make a decision about. |
Something like the below ? $ ./jparse test/test_JSON/general.json/bad/all.random.text.json
Warning: yylex: at line 1: invalid token: 0x26 = <&>
syntax error at line 1 on column 1: &
ERROR[1]: ./jparse: invalid JSON Or did you have something else in mind ? |
Or maybe better yet: $ cat test.json
&
$ ./jparse test.json
Warning: yylex: at line 1 column 3: invalid token: 0x26 = <&>
syntax error at line 1 on column 3: &
ERROR[1]: ./jparse: invalid JSON ? |
That is more helpful! Yes, that is what we had in mind. |
Even better idea! |
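For anyone curious how a line and column can be derived for such a warning, here is a minimal sketch. It is not the code from the commit: `struct text_pos`, `locate()` and `format_invalid_token()` are hypothetical, and the 1-based column convention is an assumption (the column values jparse itself prints may be counted differently). Only the message layout is copied from the output shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* A position within a text block (both members are 1-based in this sketch). */
struct text_pos {
    size_t line;
    size_t column;
};

/* Compute the line and column of byte offset off within the C string text. */
struct text_pos
locate(char const *text, size_t off)
{
    struct text_pos pos = { 1, 1 };
    size_t len;
    size_t i;

    assert(text != NULL);
    len = strlen(text);
    if (off > len) {
        off = len;      /* clamp so we never read past the terminator */
    }
    for (i = 0; i < off; ++i) {
        if (text[i] == '\n') {
            ++pos.line;         /* a newline starts a new line ... */
            pos.column = 1;     /* ... at column 1 */
        } else {
            ++pos.column;
        }
    }
    return pos;
}

/* Format an invalid token warning into buf; returns what snprintf() returns. */
int
format_invalid_token(char *buf, size_t size, struct text_pos pos, unsigned char tok)
{
    assert(buf != NULL);
    return snprintf(buf, size,
                    "Warning: yylex: at line %zu column %zu: invalid token: 0x%02x = <%c>",
                    pos.line, pos.column, (unsigned int)tok, tok);
}
```

In a flex scanner the same information is usually maintained incrementally as tokens are matched rather than recomputed from the start of the text; the scan from offset 0 here just keeps the sketch short.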
A commit that's not yet published:
So later on it'll be visible to you. UPDATE 0: Please see 82705d7.
Looking at the book I wonder if this might be of use for the NUL / other lower bytes issue:
\"[^\"\n]*\" { yylval.string = yytext; return QSTRING; }
\"[^\"\n]*$ { warning("Unterminated string"); yylval.string = yytext; return QSTRING; }
What do you think ? Of course it might be out of scope at this point and could be discussed more when I split this to another repo. My guess is that's the case but it is something that I wonder about anyway. |
With commit de1561a we now have filenames in error messages! There are a couple possible things that might need to be addressed but they should be easy to fix (except for - at this time at least - the filename has to be global along with the other variables we're forced to have global .. not sure how to do this otherwise for now). The log looks like:
I forgot to add examples but since the log is rather long maybe that's better. Here are a couple. With empty files it won't get the filename as I explained above (though again that's just a guess): $ ./jparse test/test_JSON/general.json/bad/empty.json
ERROR[31]: low_byte_scan: len: 0 <= 0
syntax error at line 0 column 0: empty text
ERROR[1]: ./jparse: invalid JSON What about invalid bytes ? The filename also won't be shown because the parser is never called. But now let's see about other invalid files including stdin: $ ./jparse -
"test":
syntax error node type JTYPE_STRING in file - at line 1 column 7: <:>
ERROR[1]: ./jparse: invalid JSON What about an actual file ? Try it: ./jparse test/test_JSON/general.json/bad/auth.missing-line.92.json
syntax error in file test/test_JSON/general.json/bad/auth.missing-line.92.json at line 91 column 47: empty text
ERROR[1]: ./jparse: invalid JSON (It turns out that in some cases empty text will give a filename but perhaps that's because it's not an empty file ... not sure). One more: $ ./jparse test/test_JSON/general.json/bad/info.string.1.json
Warning: yylex: at line 14 column 15: invalid token: 0x22 = <">
syntax error in file test/test_JSON/general.json/bad/info.string.1.json at line 14 column 15: <">
ERROR[1]: ./jparse: invalid JSON Going to try resting now or at least take a break ... please let me know if the fact the function shares the same name as a global is a problem even if it's actually the same variable. If it is I can rename it or if you have a better thought I'll take care of that. Or you can skip that specific commit or I can roll it back if necessary. Hope you're sleeping well my friend! |
With commit ffb89dd the shadowed variable no longer exists and an obscure problem was resolved as well. I'm going to look at the other comment now and I hope I can reply or at least think about it. |
So... the scanner is now re-entrant! See commit c5daa50:
I wonder if you would like to take care of the todo? That would be helpful. I can then update the man pages and other documentation. This would officially make the json parser entirely complete unless we learn something that needs fixing! There is the one caveat that I can address where filename and the number of errors are globals but as I noted the debug level cannot be made local. Still it's a great accomplishment! As for why I tackled it: it's been annoying me for many days and I had the right amount of focus and energy and patience .. and I am not sure what else to do at this point save maybe the man pages for the json semantics tables which as I said I'm not as clear on the details. Thank you! UPDATE 0: No need to do this after all. Done in commit bd49395. All that needs to be done now is make filename local and maybe num_errors should also be local .. if we even need it. I'm not sure we do? What do you think? |
The |
Right. I'm not sure it is needed now. If it is how to go about it is I think tricky. Originally it was useful but I don't think it's useful now. Removing it would mean the only global related to the parser is filename - well that and the debug variable but as I said that's global no matter what. I will maybe look at this sometime soon. The filename one I'm not so clear on how to resolve yet. |
Does the filename one need to be opened as a new issue? |
I wouldn't think so. It's on my mind I just have to come up with how to do it probably after both a bit more reading and experimenting with it. I will possibly look at it in the coming days. Once it's done the only thing left to do with this would be to test more than one parser but that's entirely out of scope of this repo so I'm not going to even consider it. |
With commit 0ff387d I have fixed a segfault under linux in chkentry which happened as a result of us no longer being able to use |
With commit a3683bc a memory leak in jparse was also fixed. |
With c506150 the low byte scan function had return values (in some cases) fixed. This was discovered when looking at the I think if we enforce that the What do you think about this? -- I might be taking a break real soon but I hope to look at the filename issue later on .. though if I had to guess whether I'll get that done today I'd say it's not too likely. However I might start feeling a bit better a while later. I did at least sleep in some but not as much as I'd like of course. I could have possibly stayed asleep a bit longer but I felt like I would have to sit up soon so I did and now I might or might not be able to lie down again. Anyway hope you're sleeping well! |
With commit 55b6aad the filename is no longer global! This did require some changes with some of the functions (calling semantics) but everything should work fine. I did not test it under fedora but I will do that a bit later. I don't foresee any problems with this. So with the exception of debug level which we cannot control and the number of errors which we probably no longer need we're entirely re-entrant! UPDATE 0: Tested under CentOS and fedora linux. All good. Also helped me discover a compilation error that for some reason was not picked up under macOS. See commit fb400c7. It seemed kind of wrong at the time but since it didn't issue a compilation error I didn't think anything beyond it. |
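The general idea behind moving the filename (and possibly the error count) out of globals can be sketched like this. How the repo actually threads the filename through the flex and bison generated code is not shown in this thread, so `struct parse_ctx` and `report_syntax_error()` are purely hypothetical; only the message layout follows the output shown in earlier comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-parse state that would otherwise live in globals. */
struct parse_ctx {
    char const *filename;   /* name used in error messages, "-" for stdin */
    size_t num_errors;      /* errors reported for this parse only */
};

/*
 * Record a syntax error against ctx and format the message into buf.
 * Nothing here touches global state, so two parses can be interleaved.
 * Returns what snprintf() returns.
 */
int
report_syntax_error(struct parse_ctx *ctx, char *buf, size_t size,
                    size_t line, size_t column, char const *text)
{
    assert(ctx != NULL && ctx->filename != NULL);
    assert(buf != NULL && text != NULL);
    ctx->num_errors++;
    return snprintf(buf, size,
                    "syntax error in file %s at line %zu column %zu: %s",
                    ctx->filename, line, column, text);
}
```

Each caller owns its own `struct parse_ctx`, so the error count of one parse is not disturbed by another. The cost is the one mentioned above: the calling semantics of every function that needs the filename have to change to accept the extra argument.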
Just ticked this: ...and as for commit f8a8030 and changing the json parser version: good call. The API changes besides the fact it's re-entrant now should suggest it as well (though maybe you also refer to that .. it kind of would be the API as well). I just noticed there is a macOS update and an iOS update and probably an iPadOS update so I'm going to look at doing those but after that maybe (or during - download and installation for the latter two and download for the former) I hope to find something to do with the repo as I don't feel like anything else (though I'm sure I'll come up with something). Anyway thought you'd like to see the ticked task. I guess you do agree with it? |
As for So what's left to do with the parser is besides making any necessary changes as the public reviews it and/or after IOCCCMOCK I should update the documentation. But some of that will also not happen until after those mentioned events. |
Yes. |
Great. |
The other day you made a comment that made me wonder if I should actually consider making a separate repo of this code before the IOCCCMOCK is run. Did I misinterpret this? Either way what do you think? Here are my thoughts on that: Pros for making a new repo now
There probably are other things as well. Cons against doing it now
There probably are other cons as well. :-) (or maybe that's :-( ? :-) ) Comments, thoughts and suggestions welcome! |
There is value in the general public using the tool and providing feedback. While some of that feedback may be IOCCC specific, other feedback is likely to be JSON parser specific. What we learn from that will be valuable to the JSON parser code. So we suggest that the Cons against doing it now outweigh the pros. |
Okay but I am not sure about what you mean. It seems a bit ambiguous. It might be because I am tired or it might be something else. Do you suggest it should be done now? If so we can discuss how to go about it. I am not at the laptop but I can say that I know the following has to be done:
But what other files should be copied over? I guess some will have to be modified too but in what way I do not know. If you think it should be done and you get back to me on these I can start working on it hopefully starting tomorrow. I will quickly look at the other threads and then I am done with this for the day. Will be getting ready to sleep in a bit. I hope you have a good night! This morning I was able to sleep in a little bit (just a little bit) which is good. I am hoping to reverse the process so that in time I wake up somewhere between 430 to 530 in the morning. I am okay waking up at that hour. Anyway let me know what you think about the above and I will get back to you most likely within 24 hours from that point. Good night! Edit 0: I am very sorry for the appalling typo for man page where it said msn. Typing on the phone is very often a big problem for me. Maybe I have fat fingers or maybe I just am terrible typing on the phone but either way it seems every time I make even a somewhat lengthy message I make at least a typo if not several! |
We think you should wait .. and there is lots more to do in related IOCCC stuff |
That's better for me too.
I'm kind of at a loss as to what else to do besides the last two man pages. What else comes to mind? I'm holding off on going through everything to look for inconsistencies and typos etc. until we're ready for that review. But what else needs to be done? I have a zoom meeting in under an hour. After that I'll be leaving for the day though I might be able to reply to comments a while later. |
As I said in the other thread I am also just typing on my phone and I am about to go get some sleep but I can answer your questions tomorrow @lcn2.
As I also said I will be gone most of Saturday and probably will take a few days to recover.
Finally feel free to edit the title and this message or else let me know what you think is a good start.
We can then discuss what I have so far and how we should proceed with the parser.
I will say quickly before I go that I have kind of held back a bit to see where the structs (or the functions acting on them) go as I think it will be helpful to have something more to work on. Once the structs are sorted it should be easier to actually write the rules and actions in the lexer and parser.
I am not sure but there very possibly is missing grammar too.
Hope this is a good start for the issue. Have a good rest of your day and look forward to seeing more here my friend!
Have a safe trip home when you go back.
TODO: