
Enhancement: finish the C-based general JSON parser #156

Closed
1 task done
xexyl opened this issue Apr 15, 2022 · 1,000 comments
Assignees
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@xexyl
Contributor

xexyl commented Apr 15, 2022

As I said in the other thread I am also just typing on my phone and I am about to go get some sleep but I can answer your questions tomorrow @lcn2.

As I also said I will be gone most of Saturday and probably will take a few days to recover.

Finally feel free to edit the title and this message or else let me know what you think is a good start.

We can then discuss what I have so far and how we should proceed with the parser.

I will say quickly before I go that I have kind of held back a bit to see where the structs (or the functions acting on them) go as I think it will be helpful to have something more to work on. Once the structs are sorted it should be easier to actually write the rules and actions in the lexer and parser.

I am not sure but there very possibly is missing grammar too.

Hope this is a good start for the issue. Have a good rest of your day and I look forward to seeing more here my friend!

Have a safe trip home when you go back.

TODO:

  • Make better error messages for invalid input (and all errors in general).
@lcn2
Contributor

lcn2 commented Apr 15, 2022

In support of this issue, we plan to add a dynamic array interface. This will be needed to support building structures for JSON arrays, for example.

@xexyl
Contributor Author

xexyl commented Apr 15, 2022

In support of this issue, we plan to add a dynamic array interface. This will be needed to support building structures for JSON arrays, for example.

Sounds good. We can then discuss it more here.

Hope you're having a nice sleep my friend!

EDIT: I hope to in the coming days write some information on where I am with this as in what is done so far and what I'm thinking on etc. It might very well not be until the middle of next week but we'll see.

@xexyl xexyl changed the title Finish the JSON parser Discuss the details of the JSON parser so it can be finished Apr 15, 2022
@lcn2
Contributor

lcn2 commented Apr 15, 2022

We will assume that compound JSON aspects (such as arrays, objects with members, member) we will presume that their sub-components will be parsed first before the parent object is parsed.

For example the JSON parser will need to parse the zero or more elements of an array before the parent JSON array is finally parsed. So when parsing:

[
     "foo" : "bar",
     "curds" : 123,
     "when" : { "fizz" : "bin" }
]

The array object will be completely parsed after the 3 elements are parsed, so the struct json_array will be finalized after the three struct json_member structures (the last one with a value that points to another struct json_member) are finalized.

CORRECTION: The cut and paste of info was botched, sorry (tm Canada :) ). Reentering this later with more info.

@xexyl
Contributor Author

xexyl commented Apr 15, 2022

We will assume that compound JSON aspects (such as arrays, objects with members, member) we will presume that their sub-components will be parsed first before the parent object is parsed.

Not sure if I understand this sentence particularly the first part. Mind clarifying?

For example the JSON parser will need to parse the zero or more elements of an array before the parent JSON array is finally parsed. So when parsing:

[
     "foo" : "bar",
     "curds" : 123,
     "when" : { "fizz" : "bin" }
]

The array object will be completely parsed after the 3 elements are parsed, so the struct json_array will be finalized after the three struct json_member structures (the last one with a value that points to another struct json_member) are finalized.

I'll wait and see what you have in mind with the array code as it'll allow me to better come up with any questions/thoughts (I think).

@xexyl
Contributor Author

xexyl commented Apr 15, 2022

Okay I'll be off for the day now. I might have a chance to reply to messages later but not sure of that. Tomorrow morning I will have a bit of time but later in the morning until the end of the day I won't be able to interact with you so I hope you have a great Saturday. I am sure I'll do some things or at least reply to some messages though.

Good day!

@lcn2
Contributor

lcn2 commented Apr 16, 2022

Not sure if I understand this sentence particularly the first part. Mind clarifying?

Cut and paste of text on an iPhone failed, sorry (tm Canada :) ).

When a compound JSON item is parsed, such as a JSON array, the sub-items of the compound JSON item need to be parsed before the parse of the compound JSON item is completed.

What is not clear, because it depends on how bison / flex generated C code works, is when things happen.

Take this compound JSON item:

"curds" : 123

Somewhere in the timeline, the following JSON:

"curds"

will be identified as a string and then sometime malloc_json_conv_string() will be called with the curds string.

The struct json_string pointer returned by malloc_json_conv_string() will be loaded into a struct json and the type will be set to JTYPE_STRING by a (as yet to be written) function.

And somewhere in the timeline, the following JSON:

123

will be identified as an integer and then sometime malloc_json_conv_int() will be called with the 123 string.

The struct json_integer pointer returned by malloc_json_conv_int() will be loaded into a struct json and the type will be set to JTYPE_INT by a (as yet to be written) function.

So now you have two struct json structures, one for "curds" and one for 123. One is the name and one is the value for a struct json_member.

The JSON parser, having identified a member, will call a (as yet to be written) function that creates a struct json_member and sets the name and value accordingly. The struct json_member pointer returned by that (as yet to be written) function will be loaded into a struct json and the type will be set to JTYPE_MEMBER.

@lcn2
Contributor

lcn2 commented Apr 16, 2022

From the above comment, functions such as the malloc_json_conv_int() function should return a struct json pointer (not a struct json_integer pointer).

The struct json should also be filled out in the malloc_json_conv_int() function, along the lines of the following concept:

...
struct json_integer *foo;
struct json *ret;
...
ret = calloc(1, sizeof(struct json));
if (ret == NULL) {
    ...
}
...
foo = calloc(1, sizeof(struct json_integer));
if (foo == NULL) {
    ...
}
...
ret->type = JTYPE_INT;
ret->element.integer = foo;
ret->parent = NULL;
ret->prev = NULL;
ret->next = NULL;
...
return ret;

OK, with better variable names and more comments 👍, but hopefully you get the concept.

We will start to make the needed changes.

@xexyl
Contributor Author

xexyl commented Apr 16, 2022

Good morning my friend! I hope you're having a nice sleep though given the time; I cannot sleep so I decided to sit up.

Not sure if I understand this sentence particularly the first part. Mind clarifying?

Cut and paste of text on an iPhone failed, sorry (tm Canada :) ).

I understand that all too well but I think we all do.

When a compound JSON item is parsed, such as a JSON array, the sub-items of the compound JSON item need to be parsed before the parse of the compound JSON item is completed.

What is not clear, because it depends on how bison / flex generated C code works, is when things happen.

That's what I was thinking with the sentence above this one (okay, the sentence above the sentence above this one! :) ): how could I force the order?

On the other hand though it makes sense now that I've read it after a long sleep (though I wish I was still asleep as I'll be awake many hours today though for good reasons).

Take this compound JSON item:

"curds" : 123

Somewhere in the timeline, the following JSON:

"curds"

will be identified as a string and then sometime malloc_json_conv_string() will be called with the curds string.

Right.

The struct json_string pointer returned by malloc_json_conv_string() will be loaded into a struct json and the type will be set to JTYPE_STRING by a (as yet to be written) function.

Okay. That makes sense. Actually though I'm still unsure how to use the:

%union json_type { 
struct json *json;
...
}

because in the lexer I'd have e.g.:

yylval->json

but this means it has to be allocated ... and then inside that there has to be the right allocations and assignments.

Thus I think I have to change it to:

%union json_type { 
struct json json;
...
}

and that would greatly simplify things. In fact I'll try doing that today (if I remember to do so after I write my replies to your comments).
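For what it's worth, here is a sketch of how the pointer member could still be used directly in the lexer, assuming the conversion functions return an allocated struct json * (as proposed elsewhere in this thread). The token name JSON_NUMBER and the reentrant yylval-as-pointer convention are assumptions, not the repo's actual names:

```
/* in the grammar file */
%union json_type {
    struct json *json;
    ...
}

/* in the lexer: the node returned by the conversion function is already
 * allocated, so it can be assigned directly and nothing extra needs to
 * be allocated for yylval itself */
{JSON_NUMBER}   {
                    yylval->json = malloc_json_conv_int(yytext, (size_t)strlen(yytext));
                    return JSON_NUMBER;
                }
```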

And somewhere in the timeline, the following JSON:

123

will be identified as an integer and then sometime malloc_json_conv_int() will be called with the 123 string.

The struct json_integer pointer returned by malloc_json_conv_int() will be loaded into a struct json and the type will be set to JTYPE_INT by a (as yet to be written) function.

So now you have two struct json structures, one for "curds" and one for 123. One is the name and one is the value for a struct json_member.

Right. I was thinking of that as I was reading the above as well and not sure how to address that.

The JSON parser, having identified a member, will call a (as yet to be written) function that creates a struct json_member and sets the name and value accordingly. The struct json_member pointer returned by that (as yet to be written) function will be loaded into a struct json and the type will be set to JTYPE_MEMBER.

That sounds reasonable. If you're up to doing that that would be of help: I mean writing the function that sets the name and value in struct json_member *. Is that something you had in mind to do?

@xexyl
Contributor Author

xexyl commented Apr 16, 2022

From the above comment, functions such as the malloc_json_conv_int() function should return a struct json pointer (not a struct json_integer pointer).

That's an interesting point too because in the book they use a cast as well so that they have a single type (but cast when necessary). On the other hand since struct json will have the union and the type (for the union) it might not need a cast in that case. What will have to change, of course, is code that calls the function(s).

Thestruct json should also be filled out in themalloc_json_conv_int() function should do so along the following concept:

...
struct json_integer *foo;
struct json *ret;
...
ret = calloc(1, sizeof(struct json));
if (ret == NULL) {
    ...
}
...
foo = calloc(1, sizeof(struct json_integer));
if (foo == NULL) {
    ...
}
...
ret->type = JTYPE_INT;
ret->element.integer = foo;
ret->parent = NULL;
ret->prev = NULL;
ret->next = NULL;
...
return ret;

OK, with better variable names and more comments 👍, but hopefully you get the concept.

Of course.

We will start to make the needed changes.

Thanks! That will be of great help!

@xexyl
Contributor Author

xexyl commented Apr 16, 2022

Now as for the JTYPE_ versus JSON_ I'm still not sure what to do. I wish you or I had thought of opening a separate issue (this issue) earlier on because I had noted in the other thread an idea I had about this subject but finding it might be difficult: though perhaps in Mail.app I can do a search in that thread for JTYPE_.

The problem is of course that the parser needs token names as well and I figured the prefix JSON_ would be ideal. However I'm starting to wonder if for the parser that JTYPE_ would be better (esp since the code nobody will look at .. without nightmares at least :)) ) and the parse tree related struct/union should have JSON_ prefix. What do you think about that? Maybe I should do that. On the other hand maybe there's a better solution. Any thoughts on that?

UPDATE 0: commit cccbff7 takes care of the changing the struct json * to struct json as well as the above prefix changes. I trust that you'll appreciate and hopefully laugh at the commit log wrt the prefixes:

    The JTYPE_ enum in jparse.h have been renamed with the prefix JSON_ and
    the JSON_ in the lexer and parser have been prefixed instead with JTYPE_
    as nobody in their right mind will look at the parser or scanner code
    but people will look at the other code. Actually nobody in their left
    (wrong) mind would look at the generated code either because anyone who
    does will have horrible nightmares about spaghetti code with very long
    tentacles attached to the pasta that tries to grab them and pull them
    into its mouth before consuming them with insatiable craving for humans
    in its eyes. Okay you get the point: JTYPE_ is a better name for the
    ugly, offensive code that can't be maintained whereas JSON_ is better
    for code that can be maintained and is not ugly or offensive.

UPDATE 1: I don't plan on doing anything else but as it's still early it's possible I do. Tomorrow will be a slow day almost certainly though if nothing else we can discuss some things.

@lcn2
Contributor

lcn2 commented Apr 17, 2022

With commit 8f9e9f9 we have added code to generate JSON parse tree nodes from the fundamental JSON types:

  • integer - malloc_json_conv_int()
  • floating point - malloc_json_conv_float()
  • string - malloc_json_conv_string()
  • boolean - malloc_json_conv_bool()
  • null - malloc_json_conv_null()

Each of the above interfaces is given a pointer to the beginning character of the JSON item and a length. This pointer/length interface allows one to point to text within a larger character block. The text does NOT need to be NUL byte terminated.

Consider this JSON:

{ "foo" : 123 }

For the sake of this example, assume:

char *foo = "{ \"foo\" : 123 }";
/*                       ^-- bar == foo+10 */

char *bar = foo+10;   /* point at the '1' in the above JSON string */

One may call malloc_json_conv_int() as follows:

struct json *node = NULL;
...
node = malloc_json_conv_int(bar, 3);

These malloc_json_conv_foo() functions return a pointer to a struct json, a node in the JSON parse tree.
The JSON parse tree node is NOT linked. I.e., the parent, prev, and next pointers are NULL.
This is because it will be the responsibility of the function's caller to link the JSON parse tree node into the overall parse tree.

The malloc_json_conv_foo() functions will never return NULL. In the extreme case of a malloc() (well, actually calloc()) failure, these functions will NOT return.

For each malloc_json_conv_foo() function, there is a malloc_json_conv_foo_str() function.
The malloc_json_conv_foo_str() function interface is given a pointer to a NUL terminated string:

char *bar = "123";
struct json *node = NULL;
...
node = malloc_json_conv_int(bar, NULL);

Like the malloc_json_conv_foo() functions, they will NOT return NULL. And as before, in the extreme case of a malloc() (well, actually calloc()) failure, these functions will NOT return.

The malloc_json_conv_foo_str() function interface includes a pointer to a string size, instead of a string length. If that pointer is NULL (as in the above example) no length is stored. However, one can be given the length of the string as a side effect of the call, as in:

char *bar = "123";
struct json *node = NULL;
size_t len;
...
node = malloc_json_conv_int(bar, &len);

One should use malloc_json_conv_foo() functions to point to a substring inside a larger JSON block, and use malloc_json_conv_foo_str() function interface to point to a C string that contains the JSON item in question.

Neither the malloc_json_conv_foo() functions nor the malloc_json_conv_foo_str() functions modify the text they are given. Moreover, each of them mallocs a C string containing a copy of the JSON item. In each of the above examples, node->as_str points to a copy of the JSON item as a NUL terminated C string. Moreover, node->as_str_len contains the length of that NUL terminated C string.

Each JSON parse tree node has an important structure member: node->converted. This is a boolean that indicates if the conversion function was able to convert or not. If node->converted == false, then the conversion was NOT possible.

When node->converted == false, only as_str and as_str_len contain valid values. All other aspects of the JSON element union are invalid. Thus before you use other aspects of the JSON element union, you MUST check the node->converted value first.

The malloc_json_conv_string() and malloc_json_conv_string_str() interfaces include a final boolean argument to indicate whether the JSON string conversion should be done under "strict" rules (or NOT).

The JSON string conversion does NOT include the JSON double quotes. Thus:

struct json *node = NULL;
char *foo = "foo";
char *bar = "\"bar\"";
...
node = malloc_json_conv_string_str(foo, NULL, false);  /* correct */
...
node = malloc_json_conv_string_str(bar, NULL, false);  /* incorrect */

The JSON integer and floating point conversion routines will ignore leading and trailing whitespace. Normally in a JSON document there should be NO such leading or trailing whitespace; nevertheless, the numerical value functions ignore it in the C string they are given. The node->as_str will contain a malloced copy of the string with any whitespace stripped off. The orig_len will give the original length (pre-stripping) of the string, whereas as_str_len contains the length of the duplicated string that might have been stripped. Thus if node->orig_len == node->as_str_len, then the original JSON string containing the numeric value did NOT contain any leading or trailing whitespace.

@lcn2
Contributor

lcn2 commented Apr 17, 2022

We plan to work on functions that form compound JSON nodes such as JSON_OBJECT, JSON_MEMBER, and JSON_ARRAY next.

@lcn2
Contributor

lcn2 commented Apr 17, 2022

With commit 3179f21 the dynamic array facility has been added.

TODO: Change read_all() to use dynamic array facility, then
finally use dynamic array facility for managing compound JSON parser tree nodes (JSON_OBJECT, JSON_MEMBER, and JSON_ARRAY).

@xexyl
Contributor Author

xexyl commented Apr 17, 2022

With commit 8f9e9f9 we have added code to generate JSON parse tree nodes from the fundamental JSON types:

  • integer - malloc_json_conv_int()
  • floating point - malloc_json_conv_float()
  • string - malloc_json_conv_string()
  • boolean - malloc_json_conv_bool()
  • null - malloc_json_conv_null()

Each of the above interfaces are given a pointer to the beginning character of the JSON item and a length. This pointer/length interface allows one to point to text within a larger character block. The text does NOT need to be NUL but terminated.

I guess the but was accidentally added; looking at the code I can only presume so since the length is passed in as a parameter and that's used in the call to malloc. A question for you on the subject of malloc though:

Is there a reason throughout the code you use malloc instead of calloc? I always use calloc as a safety measure, and although you always initialise the struct after allocation I wonder if there's a reason you don't use calloc? And in at least one function you use both, I see (the string one: the only one I took a look at).

Consider this JSON:

{ "foo" : 123 }

For the sake of this example, assume:

char *foo = "{ \"foo\" : 123 }";
/*                       ^-- bar == foo+10 */

char *bar = foo+10;   /* point at the '1' in the above JSON string */

Right. This makes sense.

One may call malloc_json_conv_int() as follows:

struct json *node = NULL;
...
node = malloc_json_conv_int(bar, 3);

Of course we can't know the length of the number like that. However in the lexer we could assign a size_t the value of strlen(yytext) so that we do have the length. But perhaps with what you have built here this is not actually necessary?

These malloc_json_conv_foo() functions return a pointer to a struct json, a node in the JSON parse tree. The JSON parse tree node is NOT linked. I.e., the parent, prev, and next pointers are NULL. This is because it will be the responsibility of the function's caller to link the JSON parse tree node into the overall parse tree.

Okay. I will have to think on how to do this. It might be obvious but not in my current state: I was awake quite late for me and awake at 0400 so I'm not awake enough. I probably won't fully process this until I start looking at the code and maybe the book. I actually do think chapter 3 will be of value: it does create a tree for the calculator mostly through bison. Of course other chapters will also be of value but I think chapter 3 will be quite valuable.

The malloc_json_conv_foo() functions will never return NULL. In the extreme case of a malloc() (well, actually calloc()) failure, these functions will NOT return.

This makes sense.

For each malloc_json_conv_foo() function, there is a malloc_json_conv_foo_str() function. The malloc_json_conv_foo_str() function interface is given a pointer to a NUL terminated string:

Good to know this. I just looked at the string version and I see it's just a simplified form.

char *bar = "123";
struct json *node = NULL;
...
node = malloc_json_conv_int(bar, NULL);

Is there a mistake here? Because the function prototype looks like:

struct json *
malloc_json_conv_int(char const *str, size_t len)

Ah. Perhaps you made a typo for malloc_json_conv_int_str? That seems what it is (and below suggests it too) but want to be sure I am not missing something.

Like the malloc_json_conv_foo() functions, they will NOT return NULL. And as before, in the extreme case of a malloc() (well, actually calloc()) failure, these functions will NOT return.

The malloc_json_conv_foo_str() function interface includes a pointer to a string size, instead of a string length. If that pointer is NULL (as was in the above example) no length is stored. However one can be given the length of the string as a side effect of the call as in:

char *bar = "123";
struct json *node = NULL;
size_t len;
...
node = malloc_json_conv_int(bar, &len);

One should use malloc_json_conv_foo() functions to point to a substring inside a larger JSON block, and use malloc_json_conv_foo_str() function interface to point to a C string that contains the JSON item in question.

Would you please give an example of each if you have any in mind?

Neither the malloc_json_conv_foo() functions nor the malloc_json_conv_foo_str() functions modify the text they are given. Moreover, each of them mallocs a C string containing a copy of the JSON item. In each of the above examples, node->as_str points to a copy of the JSON item as a NUL terminated C string. Moreover, node->as_str_len contains the length of that NUL terminated C string.

Makes sense.

Each JSON parse tree node has an important structure member: node->converted. This is a boolean that indicates if the conversion function was able to convert or not. If node->converted == false, then the conversion was NOT possible.

In which case we cannot rely on the struct being valid.

When node->converted == false, only as_str and as_str_len contain valid values. All other aspects of the JSON element union are invalid. Thus before you use other aspects of the JSON element union, you MUST check the node->converted value first.

Right.

The malloc_json_conv_string() and malloc_json_conv_string_str() interfaces include a final boolean argument to indicate whether the JSON string conversion should be done under "strict" rules (or NOT).

I noticed the boolean but I haven't really looked at the code to see what that means - yet.

The JSON string conversion does NOT include the JSON double quotes. Thus:

struct json *node = NULL;
char *foo = "foo";
char *bar = "\"bar\"";
...
node = malloc_json_conv_string_str(foo, NULL, false);  /* correct */
...
node = malloc_json_conv_string_str(bar, NULL, false);  /* incorrect */

Okay this is an interesting one then. Since JSON strings actually have the quotes (json string type) maybe we need to either have these functions strip off the outer "s or else have a function that does this. What are your thoughts here? This btw is because of the regex:

JTYPE_STRING            "(\"(((?=\\)\\([\"\\\/bfnrt]|u[0-9a-fA-F]{4}))|[^\"\\\0-\x1F\x7F]+)*\")"

Of course we don't want to include the "s always since not all types have quotes so this has to be done specially for strings. Should the function(s) (One or both? Which ones if only one?) strip the outer "s? I guess that might be a good idea but only strip off the outermost ones.

The JSON integer and floating point conversion routines will ignore leading and trailing whitespace. Normally in a JSON document there should be NO such leading or trailing whitespace; nevertheless, the numerical value functions ignore it in the C string they are given. The node->as_str will contain a malloced copy of the string with any whitespace stripped off. The orig_len will give the original length (pre-stripping) of the string, whereas as_str_len contains the length of the duplicated string that might have been stripped. Thus if node->orig_len == node->as_str_len, then the original JSON string containing the numeric value did NOT contain any leading or trailing whitespace.

No leading whitespace because the regex will extract exactly the values? Or you mean something else? The last sentence makes sense.

@xexyl
Contributor Author

xexyl commented Apr 17, 2022

With commit 3179f21 the dynamic array facility has been added.

I see there's a lot to process here. If you have any examples in mind that would be great. But I think it might be better to wait on the other helper routines and structs as some of it could change.

TODO: Change read_all() to use dynamic array facility, then finally use dynamic array facility for managing compound JSON parser tree nodes (JSON_OBJECT, JSON_MEMBER, and JSON_ARRAY).

I'll wait on this to be done before I ask anything or make any comments.

What else has to be done btw?

@lcn2
Contributor

lcn2 commented Apr 17, 2022

With commit 3179f21 the dynamic array facility has been added.

I see there's a lot to process here. If you have any examples in mind that would be great. But I think it might be better to wait on the other helper routines and structs as some of it could change.

TODO: Change read_all() to use dynamic array facility, then finally use dynamic array facility for managing compound JSON parser tree nodes (JSON_OBJECT, JSON_MEMBER, and JSON_ARRAY).

I'll wait on this to be done before I ask anything or make any comments.

What else has to be done btw?

We have more edits along the way in dyn_alloc.c and dyn_alloc.h to better support read_all(). We will then probably expand dyn_alloc_test to use read_all() to read the binary of itself into memory.

Once that has been coded and tested, then we will be ready to code the JSON compound functions.

@lcn2
Contributor

lcn2 commented Apr 17, 2022

Is there a reason throughout the code you use malloc instead of calloc? I always use calloc as a safety measure, and although you always initialise the struct after allocation I wonder if there's a reason you don't use calloc? And in at least one function you use both, I see (the string one: the only one I took a look at).

The duplication of the JSON integer string was done using malloc() and will be changed to use calloc().

Thanks @xexyl.

@xexyl
Contributor Author

xexyl commented Apr 17, 2022

Is there a reason throughout the code you use malloc instead of calloc? I always use calloc as a safety measure, and although you always initialise the struct after allocation I wonder if there's a reason you don't use calloc? And in at least one function you use both, I see (the string one: the only one I took a look at).

The duplication of the JSON integer string was done using malloc() and will be changed to use calloc().

Thanks @xexyl.

Sure. There are other places that use malloc as well and maybe those should use calloc() also? In the code I've added (as I think I might have said) I use calloc() so those don't need to be converted.

Should the functions be renamed too to reflect that they use calloc() instead of malloc()?

@xexyl
Contributor Author

xexyl commented Apr 17, 2022

With commit 3179f21 the dynamic array facility has been added.

I see there's a lot to process here. If you have any examples in mind that would be great. But I think it might be better to wait on the other helper routines and structs as some of it could change.

TODO: Change read_all() to use dynamic array facility, then finally use dynamic array facility for managing compound JSON parser tree nodes (JSON_OBJECT, JSON_MEMBER, and JSON_ARRAY).

I'll wait on this to be done before I ask anything or make any comments.
What else has to be done btw?

We have more edits along the way in dyn_alloc.c and dyn_alloc.h to better support read_all(). We will then probably expand dyn_alloc_test to use read_all() to read the binary of itself into memory.

Once that has been coded and tested, then we will be ready to code the JSON compound functions.

Sounds good. I eagerly await what you come up with! It'll probably be a few days before I can do anything much, but I'm sure each day I'll do a bit.

@lcn2
Contributor

lcn2 commented Apr 17, 2022

Would you please give an example of each if you have any in mind?
...
Okay this is an interesting one then. Since JSON strings actually have the quotes (json string type) maybe we need to either have these functions strip off the outer "s or else have a function that does this. What are your thoughts here? This btw is because of the regex:

JTYPE_STRING            "(\"(((?=\\)\\([\"\\\/bfnrt]|u[0-9a-fA-F]{4}))|[^\"\\\0-\x1F\x7F]+)*\")"

Of course we don't want to include the "s always since not all types have quotes so this has to be done specially for strings. Should the function(s) (One or both? Which ones if only one?) strip the outer "s? I guess that might be a good idea but only strip off the outermost ones.

The point of the example was to show that the JSON surrounding "'s should not be given to malloc_json_conv_string() nor to malloc_json_conv_string_str().

Assume this JSON document is in memory and is pointed to by ptr:

{ "foobar" : 12379 }

So you don't have to count, by hand, bytes in the above string, here are a few facts about that string:

  • The address ptr+0 refers to the {.
  • The address ptr+2 refers to the first ".
  • The address ptr+3 refers to the f.
  • The address ptr+8 refers to the r.
  • The address ptr+9 refers to the second ".
  • The encoded JSON string length is 6.
  • The address ptr+13 refers to the 1.
  • The address ptr+17 refers to the 9.
  • The JSON integer length is 5.
  • The address ptr+19 refers to the }.
  • The address ptr+20 refers to the NUL at the end of the JSON document placed there by read_all().

In this case you would NOT want to use malloc_json_conv_string_str() because the string in memory is NOT NUL byte terminated at ptr+9.

The conversion call you would want to make (using those above hand counted byte addresses :) ) is:

node = malloc_json_conv_string(ptr+3, 6, strict);

NOTE: The boolean strict may be true or false depending on if strict JSON is in effect.

After the above call:

  • node->type == JSON_STRING
  • node->element.string.as_str - a malloced string containing just foobar followed by a NUL byte
  • node->element.string.as_str_len == 6
  • node->element.string.converted == true
  • node->parent == NULL
  • node->next == NULL
  • node->prev == NULL

Had node->element.string.converted been false, then the ABOVE JSON parse tree structure fields would have been the ONLY JSON parse tree structure fields with valid values. However, because node->element.string.converted == true we know that the rest of the struct json_string fields are valid. In particular:

  • node->element.string.str is the decoded string foobar followed by a NUL byte
  • node->element.string.str_len == 6 - the length of the decoded string
  • node->element.string.same == true - the decoded string is the same as encoded string
  • node->element.string.has_nul == false - the decoded string does NOT have a NUL byte inside it
  • node->element.string.slash == false - no / in the decoded string
  • node->element.string.posix_safe == true - decoded string is "POSIX portable safe plus +"
  • node->element.string.first_alphanum == true - decoded string starts with an alphanumeric character
  • node->element.string.upper == false - no UPPER case characters are in the decoded string

Accordingly you would NOT want to call malloc_json_conv_int_str(ptr+13, NULL) because this would try to convert `12379 }` (everything up to the NUL byte) which is not what you want. Instead you would call (using those above hand counted byte addresses :) ):

node = malloc_json_conv_int(ptr+13, 5);

After the above call:

  • node->type == JSON_INT
  • node->element.integer.as_str - a malloced string containing just 12379 followed by a NUL byte
  • node->element.integer.as_str_len == 5
  • node->element.integer.converted == true
  • node->parent == NULL
  • node->next == NULL
  • node->prev == NULL

Had node->element.integer.converted been false, then the ABOVE JSON parse tree structure fields would have been the ONLY JSON parse tree structure fields with valid values. However, because node->element.integer.converted == true we know that the rest of the struct json_integer fields are valid. In particular:

  • node->element.integer.orig_len == 5 - original integer string did NOT have leading nor trailing whitespace
  • node->element.integer.is_negative == false
  • node->element.integer.int8_sized == false

Because node->element.integer.int8_sized == false, we know that node->element.integer.as_int8 does NOT contain a valid value and so we ignore it.

Continuing just for a few more of the struct json_integer fields:

  • node->element.integer.int16_sized == true

Because node->element.integer.int16_sized == true, we know the node->element.integer.as_int16 has a valid value:

  • node->element.integer.as_int16 == 12379

Continuing for a bit (pun intended):

  • node->element.integer.int32_sized == true
  • node->element.integer.uint32_sized == true
  • node->element.integer.int64_sized == true
  • node->element.integer.uint64_sized == true
  • node->element.integer.int_sized == true
  • node->element.integer.uint_sized == true
  • ...
  • node->element.integer.umaxint_sized == true

Because the above mentioned values are true, their corresponding values are valid:

  • node->element.integer.as_int32 == 12379
  • node->element.integer.as_uint32 == 12379
  • node->element.integer.as_int64 == 12379
  • node->element.integer.as_uint64 == 12379
  • node->element.integer.as_int == 12379
  • node->element.integer.as_uint == 12379
  • ...
  • node->element.integer.as_umaxint == 12379

We hope this helps, @xexyl.

UPDATE: The above example has been expanded on and has undergone minor corrections as there is only so much you can do on an iPad. :-)

@xexyl
Copy link
Contributor Author

xexyl commented Apr 17, 2022

Would you please give an example of each if you have any in mind?
...
Okay this is an interesting one then. Since JSON strings actually have the quotes (json string type) maybe we need to either have these functions strip off the outer "s or else have a function that does this. What are your thoughts here? This btw is because of the regex:

JTYPE_STRING            "(\"(((?=\\)\\([\"\\\/bfnrt]|u[0-9a-fA-F]{4}))|[^\"\\\0-\x1F\x7F]+)*\")"

Of course we don't want to include the "s always since not all types have quotes so this has to be done specially for strings. Should the function(s) (One or both? Which ones if only one?) strip the outer "s? I guess that might be a good idea but only strip off the outermost ones.

The point of the example was to show that the JSON surrounding "'s should not be given to malloc_json_conv_string() nor to malloc_json_conv_string_str().

Right but I was getting at how the regex in the scanner will include the "s so should they be stripped off since the functions should not be passed the outer "s?

We hope this helps, @xexyl.

I've saved the link and put it in a file for me to look at the comment later on when I have more energy. If you actually answered the above concern in the text I removed then no need to address it - though maybe a quick message saying so would be good just in case I happen to miss it when I look at it more later on. Either way I'm sure the detailed comment will be of value so thank you.

@lcn2
Copy link
Contributor

lcn2 commented Apr 17, 2022

Right but I was getting at how the regex in the scanner will include the "s so should they be stripped off since the functions should not be passed the outer "s?

Yes the JSON surrounding "'s are lexical.

JSON surrounding "'s are lexical just like {, }, :, [, ], , are lexical.

Of course, the tricky bit is that not every instance of those lexical characters is lexical. If they appear inside a JSON encoded string, they are text to be decoded:

{
"foo" : "{}[]:,"
}

UPDATE: While your JTYPE_STRING does need the surrounding "'s what you provide to, say, malloc_json_conv_string() will be one byte beyond the first surrounding " and the length will be 2 bytes short of the matched string to NOT pass in the final surrounding ".

@lcn2
Copy link
Contributor

lcn2 commented Apr 17, 2022

I've saved the link and put it in a file for me to look at the comment later on when I have more energy. If you actually answered the above concern in the text I removed then no need to address it - though maybe a quick message saying so would be good just in case I happen to miss it when I look at it more later on. Either way I'm sure the detailed comment will be of value so thank you.

The above document has been expanded on and has undergone minor corrections as there is only so much you can do on an iPad. :-)

You might want to refetch it @xexyl.

@xexyl
Copy link
Contributor Author

xexyl commented Apr 17, 2022

I've saved the link and put it in a file for me to look at the comment later on when I have more energy. If you actually answered the above concern in the text I removed then no need to address it - though maybe a quick message saying so would be good just in case I happen to miss it when I look at it more later on. Either way I'm sure the detailed comment will be of value so thank you.

The above document has been expanded on and has undergone minor corrections as there is only so much you can do on an iPad. :-)

If you wrote that on the iPad I commend you! That's impressive!

You might want to refetch it @xexyl.

No worries. I just saved the link so I can open it and read it as it stands but thanks for the notice.

@xexyl
Copy link
Contributor Author

xexyl commented Apr 17, 2022

Right but I was getting at how the regex in the scanner will include the "s so should they be stripped off since the functions should not be passed the outer "s?

Yes the JSON surrounding "'s are lexical.

Which means that somewhere they have to be removed so where is the right place? To be clear I'm referring to the fact that yytext will include the outer "s when a string is matched.

I went to test it but I do see there's a problem with the parser syntax so I can't test it. I might work on that next but it won't be today.

JSON surrounding "'s are lexical just like {, }, :, [, ], , are lexical.

Of course, the tricky bit is that not every instance of those lexical characters is lexical. If they appear inside a JSON encoded string, they are text to be decoded:

{
"foo" : "{}[]:,"
}

I'll have to address this later possibly once I have the parser correct in the above mentioned way.

@lcn2
Copy link
Contributor

lcn2 commented Apr 17, 2022

Should the functions be renamed too to reflect that they use calloc() instead of malloc()?

Good idea, @xexyl, commit 463ec30 just did that.

@xexyl
Copy link
Contributor Author

xexyl commented Apr 17, 2022

Should the functions be renamed too to reflect that they use calloc() instead of malloc()?

Good idea, @xexyl, commit 463ec30 just did that.

Thanks.

@xexyl
Copy link
Contributor Author

xexyl commented Apr 17, 2022

Should the functions be renamed too to reflect that they use calloc() instead of malloc()?

Good idea, @xexyl, commit 463ec30 just did that.

Thanks.

I just pushed in the current pull request a fix - you forgot to update jint.c and jfloat.c.

@xexyl
Copy link
Contributor Author

xexyl commented Nov 6, 2022

Congratulations @xexyl this is a significant accomplishment!

I was going to say the same thing to you so I'll do so now: congratulations and a job well done on this! Indeed it is very significant!

Thank you 🙏

Of course. It was wonderful to work with you on it! Do you want me to modify the scanning of NUL bytes to be the other bytes in that range (<= 1f) ? If so I can maybe do it tomorrow. Leaving for now .. well I'll have my phone but shutting laptop down and will be doing other things now.
Enjoy the rest of your day!

See commit 52f04e6 to resolve this low byte pre-scan issue.

So do we think this issue is entirely resolved then ? That would be great!

@xexyl
Copy link
Contributor Author

xexyl commented Nov 6, 2022

With commit 1390b4c error messages now have locations!

Check the log:

Improve error messages in jparse

Added locations to error messages! Now when a syntax error is
encountered one might see for example:

    $ ./jparse test/test_JSON/general.json/bad/auth.missing-line.70.json
    syntax error at line 69 on column 3: }
    ERROR[1]: ./jparse: invalid JSON

I chose this file in particular as it does show a problem: if you look
at that file it looks like it's actually at column 17. This is because
tabs are there and are being counted as one. This should possibly be
fixed but maybe not.

The next step would be to add filenames to the error message but this is
an improvement anyway.

@xexyl
Copy link
Contributor Author

xexyl commented Nov 6, 2022

With commit 1390b4c error messages now have locations!

Check the log:

Improve error messages in jparse

Added locations to error messages! Now when a syntax error is
encountered one might see for example:

    $ ./jparse test/test_JSON/general.json/bad/auth.missing-line.70.json
    syntax error at line 69 on column 3: }
    ERROR[1]: ./jparse: invalid JSON

I chose this file in particular as it does show a problem: if you look
at that file it looks like it's actually at column 17. This is because
tabs are there and are being counted as one. This should possibly be
fixed but maybe not.

The next step would be to add filenames to the error message but this is
an improvement anyway.

Here's another file:

$ ./jparse  test/test_JSON/general.json/bad/all.random.text.json 
Warning: yylex: invalid token: 0x26 = <&>
syntax error at line 1 on column 1: &
ERROR[1]: ./jparse: invalid JSON

Should we then get rid of the warning line ? Or I suppose we could improve it by merging them? Though I'm not sure how it looked before. Let's see with another clone:

$ ./jparse  test/test_JSON/general.json/bad/all.random.text.json 
Warning: yylex: invalid token: 0x26 = <&>
syntax error line: 1: &
ERROR[1]: ./jparse: invalid JSON

Looks like we could merge them. What do you think ? It might also be that we (though I doubt it) could show all the contents of strings as we tried to do the other day (we'd have to skip scanning of course to test it and I really don't think it will work but it seems like we had more than one place where the token was referred to anyway). Then again that was strings so I don't know. Still what do you think ? Should we merge the two lines by removing the warning and letting yyerror() handle it ?

@lcn2
Copy link
Contributor

lcn2 commented Nov 6, 2022

With commit 1390b4c error messages now have locations!
Check the log:

Improve error messages in jparse

Added locations to error messages! Now when a syntax error is
encountered one might see for example:

    $ ./jparse test/test_JSON/general.json/bad/auth.missing-line.70.json
    syntax error at line 69 on column 3: }
    ERROR[1]: ./jparse: invalid JSON

I chose this file in particular as it does show a problem: if you look
at that file it looks like it's actually at column 17. This is because
tabs are there and are being counted as one. This should possibly be
fixed but maybe not.

The next step would be to add filenames to the error message but this is
an improvement anyway.

Here's another file:

$ ./jparse  test/test_JSON/general.json/bad/all.random.text.json 
Warning: yylex: invalid token: 0x26 = <&>
syntax error at line 1 on column 1: &
ERROR[1]: ./jparse: invalid JSON

Should we then get rid of the warning line ? Or I suppose we could improve it by merging them? Though I'm not sure how it looked before. Let's see with another clone:

$ ./jparse  test/test_JSON/general.json/bad/all.random.text.json 
Warning: yylex: invalid token: 0x26 = <&>
syntax error line: 1: &
ERROR[1]: ./jparse: invalid JSON

Looks like we could merge them. What do you think ? It might also be that we (though I doubt it) could show all the contents of strings as we tried to do the other day (we'd have to skip scanning of course to test it and I really don't think it will work but it seems like we had more than one place where the token was referred to anyway). Then again that was strings so I don't know. Still what do you think ? Should we merge the two lines by removing the warning and letting yyerror() handle it ?

The mention of the line number seems helpful. The invalid token warning is also useful.

Could the Warning: yylex: line be improved to include the line number? If so then both could be combined. If not, then perhaps both are OK?

Those are just our thoughts for you, @xexyl, to consider and make a decision about.

@xexyl
Copy link
Contributor Author

xexyl commented Nov 6, 2022

With commit 1390b4c error messages now have locations!
Check the log:

Improve error messages in jparse

Added locations to error messages! Now when a syntax error is
encountered one might see for example:

    $ ./jparse test/test_JSON/general.json/bad/auth.missing-line.70.json
    syntax error at line 69 on column 3: }
    ERROR[1]: ./jparse: invalid JSON

I chose this file in particular as it does show a problem: if you look
at that file it looks like it's actually at column 17. This is because
tabs are there and are being counted as one. This should possibly be
fixed but maybe not.

The next step would be to add filenames to the error message but this is
an improvement anyway.

Here's another file:

$ ./jparse  test/test_JSON/general.json/bad/all.random.text.json 
Warning: yylex: invalid token: 0x26 = <&>
syntax error at line 1 on column 1: &
ERROR[1]: ./jparse: invalid JSON

Should we then get rid of the warning line ? Or I suppose we could improve it by merging them? Though I'm not sure how it looked before. Let's see with another clone:

$ ./jparse  test/test_JSON/general.json/bad/all.random.text.json 
Warning: yylex: invalid token: 0x26 = <&>
syntax error line: 1: &
ERROR[1]: ./jparse: invalid JSON

Looks like we could merge them. What do you think ? It might also be that we (though I doubt it) could show all the contents of strings as we tried to do the other day (we'd have to skip scanning of course to test it and I really don't think it will work but it seems like we had more than one place where the token was referred to anyway). Then again that was strings so I don't know. Still what do you think ? Should we merge the two lines by removing the warning and letting yyerror() handle it ?

The mention of the line number seems helpful. The invalid token warning is also useful.

Could the Warning: yylex: line be improved to include the line number? If so then both could be combined. If not, then perhaps both are OK?

Those are just our thoughts for you, @xexyl, to consider and make a decision about.

Something like the below ?

$ ./jparse  test/test_JSON/general.json/bad/all.random.text.json
Warning: yylex: at line 1: invalid token: 0x26 = <&>
syntax error at line 1 on column 1: &
ERROR[1]: ./jparse: invalid JSON

Or did you have something else in mind ?

@xexyl
Copy link
Contributor Author

xexyl commented Nov 6, 2022

Or maybe better yet:

$ cat test.json 
  &
$ ./jparse  test.json 
Warning: yylex: at line 1 column 3: invalid token: 0x26 = <&>
syntax error at line 1 on column 3: &
ERROR[1]: ./jparse: invalid JSON

?

@lcn2
Copy link
Contributor

lcn2 commented Nov 7, 2022

With commit 1390b4c error messages now have locations!
Check the log:

Improve error messages in jparse

Added locations to error messages! Now when a syntax error is
encountered one might see for example:

    $ ./jparse test/test_JSON/general.json/bad/auth.missing-line.70.json
    syntax error at line 69 on column 3: }
    ERROR[1]: ./jparse: invalid JSON

I chose this file in particular as it does show a problem: if you look
at that file it looks like it's actually at column 17. This is because
tabs are there and are being counted as one. This should possibly be
fixed but maybe not.

The next step would be to add filenames to the error message but this is
an improvement anyway.

Here's another file:

$ ./jparse  test/test_JSON/general.json/bad/all.random.text.json 
Warning: yylex: invalid token: 0x26 = <&>
syntax error at line 1 on column 1: &
ERROR[1]: ./jparse: invalid JSON

Should we then get rid of the warning line ? Or I suppose we could improve it by merging them? Though I'm not sure how it looked before. Let's see with another clone:

$ ./jparse  test/test_JSON/general.json/bad/all.random.text.json 
Warning: yylex: invalid token: 0x26 = <&>
syntax error line: 1: &
ERROR[1]: ./jparse: invalid JSON

Looks like we could merge them. What do you think ? It might also be that we (though I doubt it) could show all the contents of strings as we tried to do the other day (we'd have to skip scanning of course to test it and I really don't think it will work but it seems like we had more than one place where the token was referred to anyway). Then again that was strings so I don't know. Still what do you think ? Should we merge the two lines by removing the warning and letting yyerror() handle it ?

The mention of the line number seems helpful. The invalid token warning is also useful.
Could the Warning: yylex: line be improved to include the line number? If so then both could be combined. If not, then perhaps both are OK?
Those are just our thoughts for you, @xexyl, to consider and make a decision about.

Something like the below ?

$ ./jparse  test/test_JSON/general.json/bad/all.random.text.json
Warning: yylex: at line 1: invalid token: 0x26 = <&>
syntax error at line 1 on column 1: &
ERROR[1]: ./jparse: invalid JSON

That is more helpful! Yes, that is what we had in mind.

@lcn2
Copy link
Contributor

lcn2 commented Nov 7, 2022

Or maybe better yet:

$ cat test.json 
  &
$ ./jparse  test.json 
Warning: yylex: at line 1 column 3: invalid token: 0x26 = <&>
syntax error at line 1 on column 3: &
ERROR[1]: ./jparse: invalid JSON

?

Even better idea!

@xexyl
Copy link
Contributor Author

xexyl commented Nov 7, 2022

Or maybe better yet:

$ cat test.json 
  &
$ ./jparse  test.json 
Warning: yylex: at line 1 column 3: invalid token: 0x26 = <&>
syntax error at line 1 on column 3: &
ERROR[1]: ./jparse: invalid JSON

?

Even better idea!

A commit that's not yet published:

Improve invalid token warning at lexer level

For instance a file that has:

    $ cat test.json
       &test

would show:

    Warning: yylex: at line 1 column 4: invalid token: 0x26 = <&>
    syntax error at line 1 on column 4: &
    ERROR[1]: ./jparse: invalid JSON

It might be possible to simply change the warning to use yyerror() but
if so it'll be done in a future commit.

It's important to note that unfortunately strlen("\t") == 1 so columns
can be incorrect according to what one might see (and what might be
detected in an editor) and the YYLTYPE structure member first_column. If
there's a solution to this I don't yet know it.

So later on it'll be visible to you.

UPDATE 0

Please see 82705d7.

@xexyl
Copy link
Contributor Author

xexyl commented Nov 7, 2022

Looking at the book I wonder if this might be of use for the NUL / other lower bytes issue:

We would like to detect an unterminated quoted string. One potential solution is to add a new rule to catch unterminated strings as we did in the SQL parser in Chapter 4. If a quoted string runs all the way to the end of the line without a closing quote, we print an error:

\"[^\"\n]*\" { yylval.string = yytext; return QSTRING; }
\"[^\"\n]*$ {  warning("Unterminated string"); yylval.string = yytext; return QSTRING; }

This technique of accepting not quite valid input and then reporting it with an error or warning is a powerful one that can be used to improve the error reporting of the compiler. If we had not added this rule, the compiler would have reported the generic “syntax error” message; by reporting the specific error, we can tell the user precisely what to fix. Later in this chapter, we will describe ways to resynchronize and attempt to continue operation after such errors.

What do you think ?

Of course it might be out of scope at this point and could be discussed more when I split this to another repo. My guess is that's the case but it is something that I wonder about anyway.

@xexyl
Copy link
Contributor Author

xexyl commented Nov 7, 2022

With commit de1561a we now have filenames in error messages!

There are a couple possible things that might need to be addressed but they should be easy to fix (except that - at this time at least - the filename has to be global along with the other variables we're forced to have global .. not sure how to do this otherwise for now).

The log looks like:

Include filename in json parser errors (yyerror())

We now include the filename in error messages! This is unfortunately (at
least at this time) done via a global variable declared in jparse.h
(along with other globals that we are forced to use). This could be
considered a work in progress, maybe.

There is one thing I'm not so sure about but if necessary I will change
it at another time (must leave after this). Fortunately it's quite easy
to change. This thing is that prior to calling parse_json_file() (which
takes a char const *filename) we set filename to what we used to pass
into the function directly and then use filename instead. That takes
place in two locations: jsemtblgen.c and jparse_main.c. The possible
problem is that parse_json_file() has a parameter 'filename' so this
might or might not be considered a problem even if it's actually the
same variable. In all places it's a char const *filename.

Should this be a problem it can be easily fixed. Since we rely on other
global variables (until such a time - if this ever comes to pass - that
the scanner also becomes re-entrant) the global nature of filename
should not (at least according to my mind at this hour ? :-) ) be a
problem. If it becomes a problem this commit can be either fixed or
rolled back but it is ideal to have the filename.

A word about stdin: we of course use the typical - and that's exactly
what's printed in that case. I thought about whether or not it should
say "stdin" instead but I decided against that because it might suggest
that there is a file called stdin in the current directory. Of course
the same could be said for '-' itself as it's also possible to have a
file by that name. However that is less likely than 'stdin'. Another
option is possibly _stdin_ or something like that but that can be
decided later.

One more note of filename being global. The parser does not allow more
than one filename or string so there should be no problem of files
changing. However since I specifically first set 'filename' to the
correct argument (passed into main) this wouldn't be a problem anyway.
Thus if we ever do decide to allow for more than one string or file this
should not become a problem.

With this commit I believe error messages are complete as far as
functionality goes (this could change as we discover additional issues
of course). I have tested it with several files. One might note that in
some cases the filename is not printed. An example is when there are
invalid bytes in the file. Another example is an empty file. I believe
in the latter case it's because the action of setting the
yylloc->filename is never allowed to happen since nothing was scanned.
This is just a guess on my part and might or might not be the case. As
for the invalid bytes it's because the parser is never called at this
point: it's a fatal error if invalid bytes are found.

Whether error recovery should be added is another matter but I believe
that that is out of scope of this repo (this might also be out of scope
but it's been bugging me for a while that filenames are not shown so I
finally added it).

I forgot to add examples but since the log is rather long maybe that's better. Here are a couple.

With empty files it won't get the filename as I explained above (though again that's just a guess):

$ ./jparse test/test_JSON/general.json/bad/empty.json 
ERROR[31]: low_byte_scan: len: 0 <= 0 
syntax error at line 0 column 0: empty text
ERROR[1]: ./jparse: invalid JSON

What about invalid bytes ? The filename also won't be shown because the parser is never called. But now let's see about other invalid files including stdin:

$ ./jparse -
"test":
syntax error node type JTYPE_STRING in file - at line 1 column 7: <:>
ERROR[1]: ./jparse: invalid JSON

What about an actual file ? Try it:

 ./jparse  test/test_JSON/general.json/bad/auth.missing-line.92.json 
syntax error in file test/test_JSON/general.json/bad/auth.missing-line.92.json at line 91 column 47: empty text
ERROR[1]: ./jparse: invalid JSON

(It turns out that in some cases empty text will give a filename but perhaps that's because it's not an empty file ... not sure).

One more:

$ ./jparse  test/test_JSON/general.json/bad/info.string.1.json        
Warning: yylex: at line 14 column 15: invalid token: 0x22 = <">
syntax error in file test/test_JSON/general.json/bad/info.string.1.json at line 14 column 15: <">
ERROR[1]: ./jparse: invalid JSON

Going to try resting now or at least take a break ... please let me know if the fact the function shares the same name as a global is a problem even if it's actually the same variable. If it is I can rename it or if you have a better thought I'll take care of that. Or you can skip that specific commit or I can roll it back if necessary.

Hope you're sleeping well my friend!

@xexyl
Copy link
Contributor Author

xexyl commented Nov 7, 2022

With commit ffb89dd the shadowed variable no longer exists and an obscure problem was resolved as well.

I'm going to look at the other comment now and I hope I can reply or at least think about it.

@xexyl
Copy link
Contributor Author

xexyl commented Nov 8, 2022

So... the scanner is now re-entrant! See commit c5daa50:

With this commit the lexer is now also re-entrant (with two caveats and
one todo below)!

This required quite a few changes with both the scanner and the parser
including use of getter and setter functions as well as a local variable
of yyscan_t (actually a void *). In the case of the YY_USER_ACTION macro
it seems the use of the getter macros are not needed but everywhere else
is.

The first caveat is that the filename is still a global. This should be
fixed. If we want to be strictly technical the num_errors variable also
is global and probably should be fixed.

The second is that it appears that yydebug is always global and cannot
be enabled or disabled per scanner/parser instance.

The todo (and right now a caveat) is that now run_flex.sh has to
generate a new file: jparse.lex.ref.h and then copy it to jparse.lex.h
with sorry.tm.ca.h prepended to it just like the run_bison.sh script
does for jparse.tab.ref.h and jparse.tab.h. Once that is done the
Makefile can be updated for this rule. Right now jparse.y includes
jparse.lex.ref.h but once the script has been updated this needs to be
changed to jparse.lex.h. It cannot be included in jparse.h.

Due to the nature of this commit I have it in a different branch so that
any problems can be sorted. However something I just remembered is that
earlier today I made some commits of some of these files on the updates
branch so perhaps this is a problem. If it is I will correct it.

I wonder if you would like to take care of the todo? That would be helpful. I can then update the man pages and other documentation. This would officially make the json parser entirely complete unless we learn something that needs fixing! There is the one caveat that I can address where filename and the number of errors are globals but as I noted the debug level cannot be made local. Still it's a great accomplishment!

As for why I tackled it: it's been annoying me for many days and I had the right amount of focus and energy and patience .. and I am not sure what else to do at this point save maybe the man pages for the json semantics tables which as I said I'm not as clear on the details.

Thank you!

UPDATE 0

No need to do this after all. Done in commit bd49395. All that needs to be done now is make filename local and maybe num_errors should also be local .. if we even need it. I'm not sure we do? What do you think?

@lcn2
Copy link
Contributor

lcn2 commented Nov 8, 2022

All that needs to be done now is make filename local and maybe num_errors should also be local .. if we even need it. I'm not sure we do? What do you think?

The num_errors, if it is needed at all, could be a local variable.

@xexyl
Copy link
Contributor Author

xexyl commented Nov 8, 2022

All that needs to be done now is make filename local and maybe num_errors should also be local .. if we even need it. I'm not sure we do? What do you think?

The num_errors, if it is needed at all, could be a local variable.

Right. I'm not sure it is needed now. If it is how to go about it is I think tricky. Originally it was useful but I don't think it's useful now. Removing it would mean the only global related to the parser is filename - well that and the debug variable but as I said that's global no matter what.

I will maybe look at this sometime soon. The filename one I'm not so clear on how to resolve yet.

@lcn2
Copy link
Contributor

lcn2 commented Nov 8, 2022

I will maybe look at this sometime soon. The filename one I'm not so clear on how to resolve yet.

Does the filename one need to be opened as a new issue?

@xexyl
Copy link
Contributor Author

xexyl commented Nov 8, 2022

I will maybe look at this sometime soon. The filename one I'm not so clear on how to resolve yet.

Does the filename one need to be opened as a new issue?

I wouldn't think so. It's on my mind I just have to come up with how to do it probably after both a bit more reading and experimenting with it. I will possibly look at it in the coming days. Once it's done the only thing left to do with this would be to test more than one parser but that's entirely out of scope of this repo so I'm not going to even consider it.

@xexyl
Copy link
Contributor Author

xexyl commented Nov 9, 2022

With commit 0ff387d I have fixed a segfault under linux in chkentry which happened as a result of us no longer being able to use yyin in the scanner.

@xexyl
Copy link
Contributor Author

xexyl commented Nov 9, 2022

With commit a3683bc a memory leak in jparse was also fixed.

@xexyl
Copy link
Contributor Author

xexyl commented Nov 9, 2022

With commit c506150 the low-byte scan function had its return values fixed (in some cases). This was discovered when looking at the num_errors global. It seems to me that in some cases that boolean is used where is_valid is not, and so it's still necessary. But is it?

I think if we enforce that the bool *is_valid is not NULL, we can always just set *is_valid = false on error, and then the tools can use that to identify that the JSON is invalid. This would, I believe, allow us to get rid of num_errors, which I think is ideal.

What do you think about this?

--

I might be taking a break real soon but I hope to look at the filename issue later on .. though if I had to guess whether I'll get that done today I'd say it's not too likely. However I might start feeling a bit better a while later. I did at least sleep in some but not as much as I'd like of course. I could have possibly stayed asleep a bit longer but I felt like I would have to sit up soon so I did and now I might or might not be able to lie down again. Anyway hope you're sleeping well!

@xexyl
Copy link
Contributor Author

xexyl commented Nov 9, 2022

With commit 55b6aad the filename is no longer global!

This did require some changes to the calling semantics of some of the functions, but everything should work fine. I did not test it under Fedora but I will do that a bit later. I don't foresee any problems with this.

So, with the exception of the debug level (which we cannot control) and the number of errors (which we probably no longer need), we're entirely re-entrant!

UPDATE 0

Tested under CentOS and Fedora Linux. All good. It also helped me discover a compilation error that for some reason was not picked up under macOS. See commit fb400c7. The code seemed kind of wrong at the time, but since it didn't trigger a compilation error I didn't think anything more of it.

@xexyl
Copy link
Contributor Author

xexyl commented Nov 10, 2022

Just ticked this:

json-parser-task-done

...and as for commit f8a8030 and changing the JSON parser version: good call. The API changes, besides the fact that it's re-entrant now, should suggest it as well (though maybe you also refer to that .. re-entrancy kind of is part of the API as well).

I just noticed there is a macOS update, an iOS update, and probably an iPadOS update, so I'm going to look at doing those. After that (or during - download and installation for the latter two, and download for the former) I hope to find something to do with the repo, as I don't feel like doing anything else (though I'm sure I'll come up with something).

Anyway thought you'd like to see the ticked task. I guess you do agree with it?

@xexyl
Copy link
Contributor Author

xexyl commented Nov 10, 2022

As for num_errors, see commit 02a9308: all global variables are now gone from the parser and scanner! The bool *is_valid must not be NULL, but that's not a problem, and I did of course enforce it.

So what's left to do with the parser, besides making any necessary changes as the public reviews it and/or after IOCCCMOCK, is to update the documentation. But some of that will also not happen until after those events.

@lcn2
Copy link
Contributor

lcn2 commented Nov 10, 2022

Anyway thought you'd like to see the ticked task. I guess you do agree with it?

Yes.

@xexyl
Copy link
Contributor Author

xexyl commented Nov 10, 2022

Anyway thought you'd like to see the ticked task. I guess you do agree with it?

Yes.

Great.

@xexyl
Copy link
Contributor Author

xexyl commented Nov 13, 2022

The other day you made a comment that made me wonder if I should actually consider making a separate repo of this code before the IOCCCMOCK is run. Did I misinterpret this? Either way what do you think? Here are my thoughts on that:

Pros for making a new repo now

  • We can make sure it can be self-contained.
  • We could invite a wider audience in using it (assuming anyone wants to :-) ) so that we might discover problems as well.
  • We can also have a chance to test the re-entrancy (though it should be fine it's obviously wayyyyy out of scope for this repo to test it and it's obviously not needed here).

There probably are other things as well.

Cons against doing it now

  • If we wait until after the mock contest we can make changes from things we learn which might make it better in the first place (this is a strong one but some of the pros above are also quite strong I would think?).
  • It would save me the time and effort - for now of course - to write more documentation and doing the actual isolation of files etc. :-) (this is actually a big one atm though I suspect we'll be working on this one together).
  • If we do do it now and more problems are discovered it might cause us to focus on those problems instead of the contest and it's highly unlikely any problems discovered will be relevant to the contest since we have extensive test suites for the tools including both the parser and chkentry! (This last one I thought of right as I was about to click 'Comment' but it's also a big reason to not do it now .. maybe ?)

There probably are other cons as well. :-) (or maybe that's :-( ? :-) )

Comments, thoughts and suggestions welcome!

@lcn2
Copy link
Contributor

lcn2 commented Nov 14, 2022

The other day you made a comment that made me wonder if I should actually consider making a separate repo of this code before the IOCCCMOCK is run. Did I misinterpret this? Either way what do you think? Here are my thoughts on that:

Pros for making a new repo now

  • We can make sure it can be self-contained.
  • We could invite a wider audience in using it (assuming anyone wants to :-) ) so that we might discover problems as well.
  • We can also have a chance to test the re-entrancy (though it should be fine it's obviously wayyyyy out of scope for this repo to test it and it's obviously not needed here).

There probably are other things as well.

Cons against doing it now

  • If we wait until after the mock contest we can make changes from things we learn which might make it better in the first place (this is a strong one but some of the pros above are also quite strong I would think?).
  • It would save me the time and effort - for now of course - to write more documentation and doing the actual isolation of files etc. :-) (this is actually a big one atm though I suspect we'll be working on this one together).
  • If we do do it now and more problems are discovered it might cause us to focus on those problems instead of the contest and it's highly unlikely any problems discovered will be relevant to the contest since we have extensive test suites for the tools including both the parser and chkentry! (This last one I thought of right as I was about to click 'Comment' but it's also a big reason to not do it now .. maybe ?)

There probably are other cons as well. :-) (or maybe that's :-( ? :-) )

Comments, thoughts and suggestions welcome!

There is value in the general public using the tool and providing feedback. While some of that feedback may be IOCCC specific, other feedback is likely to be JSON parser specific. What we learn from that will be valuable to the JSON parser code.

So we suggest that the Cons against doing it now outweigh the pros.

@xexyl
Copy link
Contributor Author

xexyl commented Nov 14, 2022

Okay, but I am not sure what you mean. It seems a bit ambiguous. It might be because I am tired or it might be something else.

Do you suggest it should be done now? If so, we can discuss how to go about it. I am not at the laptop, but I can say that I know the following has to be done:

  • Make :) a new Makefile for the repo.
  • Copy the run flex and run bison scripts.
  • Make the jparse.md file the new README.md.
  • Copy jparse.l and .y plus the backup files to the repo.
  • Copy the man page.

But what other files should be copied over? I guess some will have to be modified too but in what way I do not know.

If you think it should be done and you get back to me on these I can start working on it hopefully starting tomorrow.

I will quickly look at the other threads and then I am done with this for the day. Will be getting ready to sleep in a bit. I hope you have a good night! This morning I was able to sleep in a little bit (just a little bit) which is good. I am hoping to reverse the process so that in time I wake up somewhere between 430 to 530 in the morning. I am okay waking up at that hour.

Anyway let me know what you think about the above and I will get back to you most likely within 24 hours from that point. Good night!

Edit 0

I am very sorry for the appalling typo for man page, where it said msn. Typing on the phone is very often a big problem for me. Maybe I have fat fingers, or maybe I am just terrible at typing on the phone, but either way it seems that every time I write even a somewhat lengthy message I make at least one typo, if not several!

@lcn2
Copy link
Contributor

lcn2 commented Nov 15, 2022

Okay but I am not sure about what you mean. It seems a bit ambiguous. It might be because I am tired or it might be something else.

Do you suggest it should be done now? If so we can discuss how to go about it. I am not at the laptop but I can say that I know the following has to be done:

  • Make :) a new Makefile for the repo.

  • Copy the run flex and run bison scripts.

  • Make the jparse.md file be a new README.md.

  • Copy jparse.l and .y plus the backup files to the repo.

  • Copy the man page.

But what other files should be copied over? I guess some will have to be modified too but in what way I do not know.

If you think it should be done and you get back to me on these I can start working on it hopefully starting tomorrow.

I will quickly look at the other threads and then I am done with this for the day. Will be getting ready to sleep in a bit. I hope you have a good night! This morning I was able to sleep in a little bit (just a little bit) which is good. I am hoping to reverse the process so that in time I wake up somewhere between 430 to 530 in the morning. I am okay waking up at that hour.

Anyway let me know what you think about the above and I will get back to you most likely within 24 hours from that point. Good night!

Edit 0

I am very sorry for the appalling typo for man page where it said msn. Typing on the phone is very often a big problem for me. Maybe I have fat fingers or maybe I just am terrible typing on the phone but either way it seems every time I make even a somewhat lengthy message I make at least a typo if not several!

We think you should wait .. and there is lots more to do in related IOCCC stuff

@xexyl
Copy link
Contributor Author

xexyl commented Nov 15, 2022

We think you should wait ..

That's better for me too.

... and there is lots more to do in related IOCCC stuff

I'm kind of at a loss for what else to do besides the last two man pages. What else comes to mind? I'm holding off on going through everything to look for inconsistencies and typos etc. until we're ready for that review. But what else needs to be done?

I have a Zoom meeting in under an hour. After that I'll be leaving for the day, though I might be able to reply to comments a while later.
