Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comply with RFC 8259 in strict mode #194

Open
wants to merge 17 commits into
base: master
Choose a base branch
from

Conversation

dominickpastore
Copy link

@dominickpastore dominickpastore commented May 25, 2020

This PR brings strict mode in compliance with RFC 8259 (as well as some other changes):

At the same time, this led to some significant changes to non-strict mode (consistent with comments I previously made on issue #154). In summary, non-strict mode is now limited to the following three deviations from RFC 8259: 1) Primitives can be made of any character that doesn't have special meaning in JSON (whitespace and any of {}[],:"), 2) Strings can contain any character and invalid backslash escapes, and 3) Primitives can be object keys, not just strings. Reasoning for these changes is below. (But see also comment later.)

Unenclosed objects, missing values, and missing commas

Non-strict mode previously allowed the following examples to parse:

"server": "127.0.0.1"
"port": 80
"message": "Hello world"
{"a":1 "b":2}
{
  "a":1,
  "b"
}

I believe this was intentional; however, this relaxed validation carried over into strict mode in several places (see issues #52, #131, #158). I added a simple state machine to the parser to fix the issues in strict mode, but fixing strict mode without affecting non-strict mode would have been difficult. As a result, non-strict mode no longer allows the above examples. I can think of ways we could make it work, but the added complexity didn't seem worth it.

Arrays and objects as keys

Non-strict mode no longer allows arrays and objects as object keys. This eliminates issue #193, but breaks backward compatibility for some non-RFC-8259-compliant JSON. Here is the reasoning:

  • When we add a value to a key, we add one to the key's size and the key becomes the value's parent. This does a couple nice things:
    • Keys can be identified by checking for non-arrays and non-objects with size 1
    • The total number of tokens in any JSON object (or array) is the recursive sum of that object's size and its children's sizes, plus one for the object itself (very convenient for skipping array items or members of an object)
  • When arrays and objects are keys, not only are those promises broken, but implementation is non-trivial, since size is already used for their own contents. This is the cause of issue Parsing issues with objects and arrays as keys in non-strict mode #193. I could think of three potential fixes:
    • Don't increment size for a key when it's an array or object. This fixes Parsing issues with objects and arrays as keys in non-strict mode #193 but still breaks the promises above, and adds noticeable complexity to the code. I don't like this solution.
    • Differentiate between container-child relationship and key-value relationship. Add a way to identify keys without using size: either a new key field in jsmntok_t or modify jsmntype_t:
      typedef enum {
        JSMN_UNDEFINED = 0x0,
        JSMN_OBJECT = 0x1,
        JSMN_ARRAY = 0x2,
        JSMN_STRING = 0x4,
        JSMN_PRIMITIVE = 0x8,
        JSMN_OBJECT_KEY = 0x11,
        JSMN_ARRAY_KEY = 0x12,
        JSMN_STRING_KEY = 0x14,
        JSMN_PRIMITIVE_KEY = 0x18,
      } jsmntype_t;
      
      and then add a keytok field alongside supertok in jmsn_parser. This could be enabled only when JSMN_STRICT is not defined. I don't hate this idea, but it is extra work for what is ultimately non-RFC-8259-compliant JSON. It still doesn't seem with the effort.
    • Don't allow objects and arrays to be keys, sidestepping the issue entirely. This minimizes the amount of extra code for non-strict mode. This is the solution I chose. If really missed, we could add array/object keys back as a separate change.

Conclusion

These are some pretty significant changes to jsmn, but seeing the number of issues about invalid input being accepted, I hope they are well received. I have some ideas for future improvements also, but I will discuss that on the Roadmap issue (#159).

@dominickpastore
Copy link
Author

dominickpastore commented May 25, 2020

Pull request is in draft mode for now as I need to make a couple changes to fix parsing multiple JSON objects at once and improve validation for primitives, but I wanted to go ahead and get it out there in case there is any potential discussion.

Ensure primitives are "true", "false", "null", or an RFC 8259 compliant
number. (Still need to add test cases.)
String parsing previously did not differ between strict and non-strict
modes, but was not fully compliant with RFC 8259. RFC 8259 requires that
control characters (code points < 0x20) be escaped. This is now enforced
in strict mode. In addition, non-strict mode now does *no* validations
on string contents, much like primitives in non-strict mode.
@dominickpastore
Copy link
Author

The remaining changes are now complete, and I've added test cases for the new functionality. Marking as "Ready to review."

@dominickpastore dominickpastore marked this pull request as ready for review May 27, 2020 02:52
@dominickpastore dominickpastore mentioned this pull request May 27, 2020
@themobiusproject
Copy link

This is pretty close to what I have come up with myself. I went about the problem from two sides, one being to extend jsmn_t, the second adding an extra field to the jsmn_parser.

My jsmn_t ends up being:

typedef enum {
  JSMN_UNDEFINED = 0x00,
  JSMN_OBJECT    = 0x01,    //!< Object
  JSMN_ARRAY     = 0x02,    //!< Array
  JSMN_STRING    = 0x04,    //!< String
  JSMN_PRIMITIVE = 0x08,    //!< Other primitive: number, boolean (true/false) or null

#ifdef JSMN_STRICT
  JSMN_ANY_TYPE  = 0x0F,    //!< (OBJECT | ARRAY | STRING | PRIMITIVE)

  JSMN_CLOSE     = 0x10,    //!< Close OBJECT '}' or ARRAY ']'
  JSMN_DELIMITER = 0x20,    //!< Colon ':' after KEY, Comma ',' after VALUE

  JSMN_KEY       = 0x40,    //!< is a key
  JSMN_VALUE     = 0x80,    //!< is a value
#endif
} jsmntype_t;

and the jsmn_parser ends up as

typedef struct jsmn_parser {
  jsmnint_t pos;            //!< offset in the JSON string
  jsmnint_t toknext;        //!< next token to allocate
  jsmnint_t toksuper;       //!< superior token node, e.g. parent object or array
#ifdef JSMN_STRICT
  jsmnenumtype_t expected;  //!< Expected jsmn type(s)
#endif
} jsmn_parser;

With this, after the opening of an object '{' or an array '[' I can set the set the next item that is expected:

if (token->type & JSMN_OBJECT) {
  parser->expected = (JSMN_STRING | JSMN_CLOSE);
} else {
  parser->expected = (JSMN_ANY_TYPE | JSMN_CLOSE);
}

and then check parser->expected against what the next item actually is

if ((parser->expected & JSMN_PRIMITIVE) == 0) {
  // error handling
}
// continue with parsing

This allows for bitwise checks for error handling and while being able to put in strong rules for what can follow what.

@themobiusproject
Copy link

I also like leaving the combinations (minus JSMN_ALL_TYPE) implicit, but I could also extend jsmntype_t to include the explicit classes you have, though I would suggest letting bitwise logic set the values for you.

typedef enum {
  JSMN_UNDEFINED = 0x00,
  JSMN_OBJECT    = 0x01,    //!< Object
  JSMN_ARRAY     = 0x02,    //!< Array
  JSMN_STRING    = 0x04,    //!< String
  JSMN_PRIMITIVE = 0x08,    //!< Other primitive: number, boolean (true/false) or null

#ifdef JSMN_STRICT
  JSMN_ANY_TYPE  = 0x0F,    //!< (OBJECT | ARRAY | STRING | PRIMITIVE)

  JSMN_CLOSE     = 0x10,    //!< Close OBJECT '}' or ARRAY ']'
  JSMN_DELIMITER = 0x20,    //!< Colon ':' after KEY, Comma ',' after VALUE

  JSMN_KEY       = 0x40,    //!< is a key
  JSMN_VALUE     = 0x80,    //!< is a value

  JSMN_OBJECT_KEY    = (JSMN_KEY | JSMN_OBJECT),
  JSMN_ARRAY_KEY     = (JSMN_KEY | JSMN_ARRAY),
  JSMN_STRING_KEY    = (JSMN_KEY | JSMN_STRING),
  JSMN_PRIMITIVE_KEY = (JSMN_KEY | JSMN_PRIMITIVE),
#endif
} jsmntype_t;

@dominickpastore
Copy link
Author

This is pretty close to what I have come up with myself. I went about the problem from two sides, one being to extend jsmn_t, the second adding an extra field to the jsmn_parser.

My jsmn_t ends up being:

typedef enum {
  JSMN_UNDEFINED = 0x00,
  JSMN_OBJECT    = 0x01,    //!< Object
  JSMN_ARRAY     = 0x02,    //!< Array
  JSMN_STRING    = 0x04,    //!< String
  JSMN_PRIMITIVE = 0x08,    //!< Other primitive: number, boolean (true/false) or null

#ifdef JSMN_STRICT
  JSMN_ANY_TYPE  = 0x0F,    //!< (OBJECT | ARRAY | STRING | PRIMITIVE)

  JSMN_CLOSE     = 0x10,    //!< Close OBJECT '}' or ARRAY ']'
  JSMN_DELIMITER = 0x20,    //!< Colon ':' after KEY, Comma ',' after VALUE

  JSMN_KEY       = 0x40,    //!< is a key
  JSMN_VALUE     = 0x80,    //!< is a value
#endif
} jsmntype_t;

and the jsmn_parser ends up as

typedef struct jsmn_parser {
  jsmnint_t pos;            //!< offset in the JSON string
  jsmnint_t toknext;        //!< next token to allocate
  jsmnint_t toksuper;       //!< superior token node, e.g. parent object or array
#ifdef JSMN_STRICT
  jsmnenumtype_t expected;  //!< Expected jsmn type(s)
#endif
} jsmn_parser;

With this, after the opening of an object '{' or an array '[' I can set the set the next item that is expected:

if (token->type & JSMN_OBJECT) {
  parser->expected = (JSMN_STRING | JSMN_CLOSE);
} else {
  parser->expected = (JSMN_ANY_TYPE | JSMN_CLOSE);
}

and then check parser->expected against what the next item actually is

if ((parser->expected & JSMN_PRIMITIVE) == 0) {
  // error handling
}
// continue with parsing

This allows for bitwise checks for error handling and while being able to put in strong rules for what can follow what.

That is an interesting approach. Do you plan to share the implementation in a pull request?

@dominickpastore
Copy link
Author

I also like leaving the combinations (minus JSMN_ALL_TYPE) implicit, but I could also extend jsmntype_t to include the explicit classes you have, though I would suggest letting bitwise logic set the values for you.

typedef enum {
  JSMN_UNDEFINED = 0x00,
  JSMN_OBJECT    = 0x01,    //!< Object
  JSMN_ARRAY     = 0x02,    //!< Array
  JSMN_STRING    = 0x04,    //!< String
  JSMN_PRIMITIVE = 0x08,    //!< Other primitive: number, boolean (true/false) or null

#ifdef JSMN_STRICT
  JSMN_ANY_TYPE  = 0x0F,    //!< (OBJECT | ARRAY | STRING | PRIMITIVE)

  JSMN_CLOSE     = 0x10,    //!< Close OBJECT '}' or ARRAY ']'
  JSMN_DELIMITER = 0x20,    //!< Colon ':' after KEY, Comma ',' after VALUE

  JSMN_KEY       = 0x40,    //!< is a key
  JSMN_VALUE     = 0x80,    //!< is a value

  JSMN_OBJECT_KEY    = (JSMN_KEY | JSMN_OBJECT),
  JSMN_ARRAY_KEY     = (JSMN_KEY | JSMN_ARRAY),
  JSMN_STRING_KEY    = (JSMN_KEY | JSMN_STRING),
  JSMN_PRIMITIVE_KEY = (JSMN_KEY | JSMN_PRIMITIVE),
#endif
} jsmntype_t;

It seems you have already noticed, but for the benefit of other readers, I use a new jsmnstate_t enum in the same way as your extended jsmntype_t:

typedef enum {
  JSMN_STATE_ROOT = 0x0,          /* At root */
  JSMN_STATE_OBJ_NEW = 0x16,      /* Expecting object key or } */
  JSMN_STATE_OBJ_KEY = 0x12,      /* Expecting object key */
  JSMN_STATE_OBJ_COLON = 0x13,    /* Expecting object colon */
  JSMN_STATE_OBJ_VAL = 0x10,      /* Expecting object value */
  JSMN_STATE_OBJ_COMMA = 0x15,    /* Expecting object comma or } */
  JSMN_STATE_ARRAY_NEW = 0x24,    /* Expecting array item or ] */
  JSMN_STATE_ARRAY_ITEM = 0x20,   /* Expecting array item */
  JSMN_STATE_ARRAY_COMMA = 0x25   /* Expecting array comma or ] */
} jsmnstate_t;

I probably could have used explicit constants for individual bits to make the code more readable and maintainable (in fact, I probably should add it still):

typedef enum {
  JSMN_DELIMITER = 0x1,         /* Expecting colon if JSMN_KEY, comma if not */
  JSMN_KEY = 0x2,               /* Expecting an object key */
  JSMN_CAN_CLOSE = 0x4,         /* Object or array can be closed */
  
  JSMN_CONTAINER_OBJ = 0x10,    /* Inside an object */
  JSMN_CONTAINER_ARRAY = 0x20,  /* Inside an array */
  
  JSMN_STATE_ROOT = 0x0,		/* At root */
  JSMN_STATE_OBJ_NEW = (JSMN_CONTAINER_OBJ | JSMN_KEY | JSMN_CAN_CLOSE),
  JSMN_STATE_OBJ_KEY = (JSMN_CONTAINER_OBJ | JSMN_KEY),
  JSMN_STATE_OBJ_COLON = (JSMN_CONTAINER_OBJ | JSMN_KEY | JSMN_DELIMITER),
  JSMN_STATE_OBJ_VAL = (JSMN_CONTAINER_OBJ),
  JSMN_STATE_OBJ_COMMA = (JSMN_CONTAINER_OBJ | JSMN_DELIMITER | JSMN_CAN_CLOSE),
  JSMN_STATE_ARRAY_NEW = (JSMN_CONTAINER_ARRAY | JSMN_CAN_CLOSE),
  JSMN_STATE_ARRAY_ITEM = (JSMN_CONTAINER_ARRAY),
  JSMN_STATE_ARRAY_COMMA = (JSMN_CONTAINER_ARRAY | JSMN_DELIMITER | JSMN_CAN_CLOSE)
} jsmnstate_t;

The reason for the explicit combinations, though, is those are the actual values that the parser state can have at any given time.

@dominickpastore
Copy link
Author

One other thing I forgot to put in the description: calling jsmn_parse() with tokens set to NULL skips a lot of the validation that would otherwise happen, for the sake of performance. (The assumption being that such a call would almost always be followed by a call where tokens has been allocated, so no need to do all the validations twice.)

I will add an extra bullet point to the description for this.

@dominickpastore
Copy link
Author

@themobiusproject I made some improvements based on your suggestion to use bitwise logic for the enum values. I think it helps readability a lot.

@dominickpastore
Copy link
Author

dominickpastore commented Jun 1, 2020

I've added a branch with a fourth non-strict extension in my own fork: valueless-keys. It adds the ability to parse objects with missing values, e.g.:

{
  "a": 1,
  "b",
  "c": 3
}

I already have a bunch of pull requests pending here, so I'm not going to submit another one right now, but I thought I should at least mention it since this was originally allowed under non-strict mode.

@themobiusproject
Copy link

Let me finish up cleaning up a few bits so all of the previous tests pass again and I will at least put in a link to my fork. I have a lot of changes beyond this (cmake, visual studio, xcode projects, doxygen comments, tests in cmocka, extra json.{c,h} for wrapping jsmn) that don't necessarily go with just this issue and would have to pull the correct pieces for a clean pull request.

@themobiusproject
Copy link

@dominickpastore I'm forked over at https://github.com/themobiusproject/jsmn. I am going to start moving pieces to their own branches to try some pull requests.

master, right now, only has two tests left as FIXME and they are non-strict tests which I haven't been looking into. I cleared out some of the original strict type checks as the parser->expected versions covered them.

I also like your idea of setting a flag to avoid strict checks a second time if it has already been done once to find the number of tokens.

The reason for the explicit combinations, though, is those are the actual values that the parser state can have at any given time.

Also, regardless of how long the enum is, it won't touch change compile size or performance, so having each and every option does seem like a good idea; I'll just ignore looking at it in the code.

@dominickpastore
Copy link
Author

@dominickpastore I'm forked over at https://github.com/themobiusproject/jsmn. I am going to start moving pieces to their own branches to try some pull requests.

Just want to clarify: if you weren't already planning to create pull requests, don't feel like you have to just because I brought it up. I was just curious.

@themobiusproject
Copy link

I was looking at it previously because I like what I have come up with and would be happy to share it with others, especially through the official repo. Since this pushed me to finish the bugs that I had, I was going to get back to it.

case '\r':
case '\n':
case ' ':
case ',':
case ':':

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the rfc, I don't believe that a primitive can be followed by a ':', '{', or '[' and can only be followed by '\0' when it is the lone element in the json string.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This switch's purpose is simply to ensure there aren't extra characters in the primitive, like 1.234true. As long as the character after is not allowed in a primitive, the check is satisfied. Proper validation of the next character happens elsewhere (in the main parsing loop).

This is important because jsmn will parse multiple "JSON texts" (in RFC 8259 terminology) in one string, if there are more than one present. For example, this is perfectly valid input for jsmn and will simply parse to two separate arrays:

["a", "b", "c"] [1, 2, 3]

Since this pull request adds the ability for strings and primitives to be complete JSON texts on their own, this check ensures that 1.234true will be rejected rather than parsed as two separate primitives. (Strictly speaking, you could say they should be parsed as two primitives, since consecutive objects and arrays need not be separated by whitespace to be parsed by jsmn. But, consider a case like 1.234.5: this would become ambiguous under those rules. It could be 1.2 34.5 or 1.23 4.5. For that matter, even 1.234true could be 1.234 true, 1.23 4 true, 1.2 34 true, etc. Requiring at least one non-primitive character as separation resolves this ambiguity.)

Parsing multiple JSON texts is also the reason we might see a { or [ following a primitive. : cannot follow a primitive in strict RFC 8259 compliance, but it can in non-strict mode. In another pull request (#195), I add the ability to select specific non-strict features, so it is possible to have strict primitive validation but allow primitives as object keys.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parsing multiple JSON texts is also the reason we might see a { or [ following a primitive. : cannot follow a primitive in strict RFC 8259 compliance, but it can in non-strict mode. In another pull request (#195), I add the ability to select specific non-strict features, so it is possible to have strict primitive validation but allow primitives as object keys.

For the cases of :, {, and [, your non-strict mode starts down on line 367 and the code up here is all for strict RFC 8259 compliance. It is true that \0 should be caught later if the end of the js string is deeper in an object or array, but it is an easy and inexpensive check right here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are all valid input strings in strict mode:

  • true["a", "b", "c"] (Parses as a primitive followed by an array. [ follows a primitive.)
  • true{"a": 1} (Parses as a primitive followed by an object. { follows a primitive.)
  • true (Parses as an unenclosed primitive. \0 follows a primitive.)

Then, regarding the colon:

: can never follow a primitive in strict mode, but I put it there anticipating enhancements like PR #195, which is built on top of this one. I encourage anyone reading to take a look at that PR for the full details, but to summarize the relevant part: It takes each use of #if[n]def JSMN_STRICT and replaces it with #if[n]def JSMN_PERMISSIVE_PRIMITIVES, JSMN_PERMISSIVE_STRINGS, or JSMN_PRIMITIVE_KEYS, allowing specific non-strict features to be selected. Notably, this means if JSMN_PRIMITIVE_KEYS is defined but JSMN_PERMISSIVE_PRIMITIVES is not, that : in the switch statement becomes important.

@noodlecollie
Copy link

I have a quick question regarding the improvements to strict mode: do these changes also fix the following?

{ 
  "array":
  [ 
    "missing a comma"
    "between these items"
  ]
}

Even with strict mode enabled, this results in the same tokens as if a comma had been present between the two items. jsonlint.com rejects the JSON above as not being valid. If it's fixed in this PR then great; otherwise, I'll make a separate issue.

@dominickpastore
Copy link
Author

@x6herbius, this PR should fix that. That said, I'm not sure if this repository has been abandoned. The maintainers seem to have disappeared. Thus, while I had some other ideas for improvements and even submitted PRs for two of them (#195, #196), I've given up on making further contributions.

I do have a couple more changes I made on my own fork (see valueless-keys and unsigned-types branches there), but I don't plan to submit PRs for them unless these existing PRs get accepted.

PR #197 may also be worth checking out. They've made similar sorts of improvements (including proper RFC compliance, I believe) through a different style of implementation.

@pt300
Copy link
Collaborator

pt300 commented Oct 19, 2020

@dominickpastore I'm still here. I haven't done much as it's been difficult for me due to how big the changes feel. While JSMN is not in the best shape at the moment, it is being dependent on by quite some people and I honestly fear breaking anything just because I did not analyse the changes properly.

To give this repo a bit of life I will try to make a new experimental branch and possibly merge some of the PRs there. Maybe that way it will be possible to get better feedback from community about the changes. And I hope you won't feel like your contributions are a waste.

@dominickpastore
Copy link
Author

dominickpastore commented Oct 19, 2020

@pt300 That's good to hear. I was a bit concerned when @themobiusproject and I added some discussion to some of the roadmap issues (#159 and #154) and it seemed to go unnoticed.

I don't want to assume my PRs are the best for the project, either (especially now that there is also PR #197 for a lot of the same stuff). I made the changes because they were what I needed. But they did align with some parts of the roadmap, so I figured I'd submit them to help kickstart the discussion again.

Other thing I would say: This is just my opinion, but it sounds like you are putting too much pressure on yourself. I'd bet, chances are, as long as JSMN can parse standard JSON, 90% of people will be happy. For the rest, it's up to them to speak up if they have a problem with the roadmap. And even if they don't speak up, nothing is forcing them to stop using the current version--that's the beauty of open source.

@themobiusproject
Copy link

@dominickpastore @pt300 I am still hanging around here as well.

Both @dominickpastore's PR #194, #195, and #196 and my PR #197 are all valid options. I had once tried to break my PR into smaller changes but continued having chicken and egg problems so all of the changes are lumped into one. I also will comment, @pt300, dominick's and my changes won't merge together as we have both made many fundamental changes that won't jive together (though I know I have swiped some ideas from dominick's code to fix issues in mine).

I certainly don't feel like my contributions have been a waste. There was someone who posted in my PR with a problem that he was having so it seems like someone has been able to benefit what I have done.

@alexanderankin
Copy link

it may be useful to note which branch is the most up to date and recommended for new projects, on whatever repository/fork, while issues of backwards compatibility and breaking get resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment