Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: JSON Schema-to-GBNF additionalProperties bugs (and other minor quirks) #7789

Closed
HanClinto opened this issue Jun 6, 2024 · 2 comments
Closed
Labels
bug-unconfirmed low severity Used to report low severity bugs in llama.cpp (e.g. cosmetic issues, non critical UI glitches) stale

Comments

@HanClinto
Copy link
Collaborator

HanClinto commented Jun 6, 2024

What happened?

While debugging json-schema-to-gbnf grammars, I noticed a few bugs / quirks and wanted to write them down somewhere.

additionalProperties seems to default to false (not matching spec).

By default, additional properties should be permitted. However, providing a schema like:

{
  "type": "object",
  "properties": {
    "number": { "type": "number" },
    "street_name": { "type": "string" },
    "street_type": { "enum": ["Street", "Avenue", "Boulevard"] }
  }
}

Then it correctly passes on these strings:

{ "number": 1600, "street_name": "Pennsylvania", "street_type":"Avenue"}
{ "street_name": "Pennsylvania" }
{ "number": 1600, "street_name": "Pennsylvania" }
{}

But then it improperly fails on the string:

{ "number": 1600, "street_name": "Pennsylvania", "street_type":"Avenue", "direction":"NW"}

This is clearly given in the json-schema docs as an example of a string that should match this schema, so we're doing something wrong.

Explicit "additionalProperties"=true behavior is even worse.

If we change the above grammar to:

{
  "type": "object",
  "properties": {
    "number": { "type": "number" },
    "street_name": { "type": "string" },
    "street_type": { "enum": ["Street", "Avenue", "Boulevard"] }
  },
  "additionalProperties": true
}

Then things really start to go awry. These strings should all pass (indeed, they passed before when we didn't explicitly set anything for additionalProperties, but instead are failing now:

{"number":1600,"street_name":"Pennsylvania","street_type":"Avenue"}
{ "street_name": "Pennsylvania" }
{ "number": 1600, "street_name": "Pennsylvania" }

And our sample with an additional property still doesn't match:

{ "number": 1600, "street_name": "Pennsylvania", "street_type":"Avenue", "direction":"NW"}

The only string that matches out of the original is the empty object ({}).

Looking at the generated GBNF, there is some weird stuff going on. Here is the GBNF with additionalProperties set implicitly:

char ::= [^"\\] | "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])
decimal-part ::= [0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9])?)?)?)?)?)?)?)?)?)?)?)?)?)?)?
integral-part ::= [0-9] | [1-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9])?)?)?)?)?)?)?)?)?)?)?)?)?)?)?
number ::= ("-"? integral-part) ("." decimal-part)? ([eE] [-+]? integral-part)? space
number-kv ::= "\"number\"" space ":" space number
number-rest ::= ( "," space street-name-kv )? street-name-rest
root ::= "{" space  (number-kv number-rest | street-name-kv street-name-rest | street-type-kv )? "}" space
space ::= " "?
street-name-kv ::= "\"street_name\"" space ":" space string
street-name-rest ::= ( "," space street-type-kv )?
street-type ::= "\"Street\"" | "\"Avenue\"" | "\"Boulevard\""
street-type-kv ::= "\"street_type\"" space ":" space street-type
string ::= "\"" char* "\"" space

And here is the GBNF from additionalProperties set explicitly to true:

additional-kv ::= string ":" space additional-value
additional-kvs ::= additional-kv ( "," space additional-kv )*
additional-value ::= object
array ::= "[" space ( value ("," space value)* )? "]" space
boolean ::= ("true" | "false") space
char ::= [^"\\] | "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])
decimal-part ::= [0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9])?)?)?)?)?)?)?)?)?)?)?)?)?)?)?
integral-part ::= [0-9] | [1-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9] ([0-9])?)?)?)?)?)?)?)?)?)?)?)?)?)?)?
null ::= "null" space
number ::= ("-"? integral-part) ("." decimal-part)? ([eE] [-+]? integral-part)? space
number-kv ::= "\"number\"" space ":" space number
number-rest ::= ( "," space street-name-kv )? street-name-rest
object ::= "{" space ( string ":" space value ("," space string ":" space value)* )? "}" space
root ::= "{" space  (number-kv number-rest | street-name-kv street-name-rest | street-type-kv street-type-rest | additional-kvs )? "}" space
space ::= " "?
street-name-kv ::= "\"street_name\"" space ":" space string
street-name-rest ::= ( "," space street-type-kv )? street-type-rest
street-type ::= "\"Street\"" | "\"Avenue\"" | "\"Boulevard\""
street-type-kv ::= "\"street_type\"" space ":" space street-type
street-type-rest ::= additional-kvs
string ::= "\"" char* "\"" space
value ::= object | array | string | number | boolean | null

The key differences to note here are how street-type-rest is now being defined (even though it was never defined in the original), and additional-kvs seems to be getting appended to each property without a comma in between (nor an optional flag).

I haven't yet wrapped my brain around what all is going on with that, but I wanted to lay out how far I'd gotten on my own.

Unlike strings, enums don't support spaces between properties and values.

This is definitely in the "quirk" more than "bug" category, but when using a schema like:

{
  "type": "object",
  "properties": {
    "number": { "type": "number" },
    "street_name": { "type": "string" },
    "street_type": { "enum": ["Street", "Avenue", "Boulevard"] }
  }
}

Then validating against it means that:

{ "number": 1600, "street_type":"Avenue"}

is a valid string, but adding spaces around the enum value causes either of the following to fail:

{ "number": 1600, "street_type": "Avenue"}
{ "number": 1600, "street_type":"Avenue" }

Interestingly, adding spaces around a string value works fine, and these match the generated grammar just fine:

{ "number": 1600, "street_name": "Pennsylvania" }

Unsupported Attributes

We should probably build a list of unsupported attributes and note them in the documentation -- some that I've noticed thus far:

  • exclusiveMinimum (probably can't be handled for anything except for special cases of 0, requiring either the presence of a - or not)
  • uniqueItems -- not sure how we could support this without a regex engine that supports capture groups and lookbehinds and whatnot.

Name and Version

version: 3093 (7672ade)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.4.0

What operating system are you seeing the problem on?

Mac

Relevant log output

No response

@HanClinto HanClinto added bug-unconfirmed low severity Used to report low severity bugs in llama.cpp (e.g. cosmetic issues, non critical UI glitches) labels Jun 6, 2024
HanClinto added a commit to HanClinto/llama.cpp that referenced this issue Jun 6, 2024
…ing the ability to automatically output improperly failing grammars to debug output files so they can more easily be examined in the gbnf-validator program.
@ochafik
Copy link
Collaborator

ochafik commented Jun 9, 2024

@HanClinto thanks for reporting these!

We should probably build a list of unsupported attributes and note them in the documentation

Agree! Been wanting to add a JSON schemas section within grammars/README.md, I'll crack at it and will probably also update the stale grammar examples.

HanClinto added a commit to HanClinto/llama.cpp that referenced this issue Jun 12, 2024
…ing the ability to automatically output improperly failing grammars to debug output files so they can more easily be examined in the gbnf-validator program.
HanClinto added a commit to HanClinto/llama.cpp that referenced this issue Jun 12, 2024
…ing the ability to automatically output improperly failing grammars to debug output files so they can more easily be examined in the gbnf-validator program.
HanClinto added a commit to HanClinto/llama.cpp that referenced this issue Jun 22, 2024
…ing the ability to automatically output improperly failing grammars to debug output files so they can more easily be examined in the gbnf-validator program.
HanClinto added a commit that referenced this issue Jun 22, 2024
* Adding simple bare-bones test for end-to-end integration test for json validation against auto-generated JSON-schema grammars.

* Adding additional examples as documented in #7789 . Also adding the ability to automatically output improperly failing grammars to debug output files so they can more easily be examined in the gbnf-validator program.

* Uncommenting formerly commented tests so that they fail for others who are attempting to reproduce the bugs.

* Merging improved schema test methods added by @ochafik in #7797

* Adding #define to temporarily remove failing tests so that this PR can pass CI, but still be useful for other PRs that want to leverage the framework.

* Fixing nits from ochafik. Removing escape slashes, adding additional failing cases, fixing some other strings.

* Fixing grammar indentation to be consistent throughout file.
arthw pushed a commit to arthw/llama.cpp that referenced this issue Jun 30, 2024
* Adding simple bare-bones test for end-to-end integration test for json validation against auto-generated JSON-schema grammars.

* Adding additional examples as documented in ggerganov#7789 . Also adding the ability to automatically output improperly failing grammars to debug output files so they can more easily be examined in the gbnf-validator program.

* Uncommenting formerly commented tests so that they fail for others who are attempting to reproduce the bugs.

* Merging improved schema test methods added by @ochafik in ggerganov#7797

* Adding #define to temporarily remove failing tests so that this PR can pass CI, but still be useful for other PRs that want to leverage the framework.

* Fixing nits from ochafik. Removing escape slashes, adding additional failing cases, fixing some other strings.

* Fixing grammar indentation to be consistent throughout file.
@github-actions github-actions bot added the stale label Jul 10, 2024
@HanClinto
Copy link
Collaborator Author

As of #7840 and #7797, I believe this is resolved. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug-unconfirmed low severity Used to report low severity bugs in llama.cpp (e.g. cosmetic issues, non critical UI glitches) stale
Projects
None yet
Development

No branches or pull requests

2 participants