-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow generating arbitrary (schemaless) JSON #892
Conversation
Codecov ReportAttention: Patch coverage is
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #892 +/- ##
==========================================
+ Coverage 56.80% 60.43% +3.63%
==========================================
Files 63 64 +1
Lines 4625 4658 +33
==========================================
+ Hits 2627 2815 +188
+ Misses 1998 1843 -155 ☔ View full report in Codecov by Sentry. |
@@ -297,19 +336,21 @@ def _gen_json( | |||
) | |||
raise ValueError(f"Unsupported type in schema: {target_type}") | |||
|
|||
raise ValueError(f"Can't process JSON node: {json_schema}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this still get raised somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. The fact that empty schemas imply arbitrary json makes it such that we can't use the absence of keys to raise this exception. I expect that the schema validation we do also implies that we can't get conflicting or incompatible keys... So maybe all that's left is some assertion about extra/unhandled keys?
E.g. we should raise errors if users ask to match a pattern, provide an integer range, etc. -- any other situation you can think of?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One pitfall to be aware of here if you decide to implement more complex checking before falling back to any
generation is that there are a few human-readable-only JSON schema elements: https://json-schema.org/understanding-json-schema/reference/annotations
So e.g. {"description": "Any arbitrary JSON object"}
is considered an empty schema under the spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some logic for validating keys of the json schema to ensure we only allow keys that are either on a whitelist of "implemented" or "ignored" (title, description, ...). An explicit blacklist of "things we don't support" feels like a far more fragile approach, hence me not taking it.
@riedgar-ms would you mind taking a look at the general approach to validation?
Note that tests are now failing because the validation raises an exception telling us that we don't support the required
key that we used in a lot of our tests for object. Fixing that added to my TODOs. It's weirdly non-trivial to get the commas right when separating mixtures of optional and non-optional values...
…ell with setting temperature)
This reverts commit 9bdd40d.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this!
tests/library/test_json.py
Outdated
@@ -638,6 +639,7 @@ def test_bad_with_prefix( | |||
): | |||
schema_obj = { | |||
"prefixItems": self.prefix_schema_obj, | |||
"items": False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed? If you're only adding capabilities, surely the existing tests should remain the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Essentially the old test and behavior were wrong. According to the json schema specification, "Omitting this keyword has the same assertion behavior as an empty schema.". Previously, we were treating an omitted "items" keyword as "no items not explicitly provided by prefixItems are allowed", but the behavior should instead be "put anything you want here".
I'll take a look to see if we can be a little more explicit about this behavior in the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. So we need to make sure this is added wherever it's needed, and then in the new 'unconstrained' tests that it works both ways - when "items": True
explicitly, and when items
is omitted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some tests for these too :)
tests/library/test_json.py
Outdated
@@ -861,6 +863,39 @@ def test_nested_ref(self, temperature): | |||
# The actual check | |||
generate_and_check(target_obj, schema_obj, desired_temperature=temperature) | |||
|
|||
@pytest.mark.parametrize("temperature", [None, 0.1, 1]) | |||
def test_multiple_refs_to_same_def(self, temperature): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like the extra test case
tests/library/test_json.py
Outdated
@@ -323,7 +323,8 @@ def test_bad_object(self, bad_string, good_bytes, failure_byte, allowed_bytes): | |||
"type": "object", | |||
"properties": { | |||
"a" : {"type": "integer"} | |||
} | |||
}, | |||
"additionalProperties": false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this change needed to an existing test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as the "items" discussion above. Not specifying additionalProperties is the same as specifying an empty schema, i.e. "anything goes". Old test behavior enforced that leaving this empty implied "no additionalProperties", which is wrong
tests/library/test_json.py
Outdated
|
||
class TestEmptySchemas: | ||
empty_schema = "{}" | ||
nested_empty_schema = """{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also do a case like
{
"properties" : {
"a": {},
"b": "number"
},
"type" : "object"
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, what if a #def
has an empty schema (or a nested one)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good ideas -- will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added both of these tests!
…n be punted to future PR
@riedgar-ms ready for a final review when you have time. |
"description", | ||
"default", | ||
"examples", | ||
"required", # TODO: implement and remove from ignored list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decided to add "required" to the ignored key list for now. Currently the grammar is just a little bit over-zealous and considers every property of an object to be required, so this isn't a terrible failure mode. "Fixing" this felt beyond the scope of this PR, but I added a new issue to track this: #906
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a solid addition, thanks!
Calling `guidance.json` with an empty schema generates arbitrary JSON. This closes guidance-ai#887 -- to quote @wjn0, there are several motivations for this: - APIs such as OpenAI allow users to request only valid JSON be generated sans schema, so in some sense this would give feature parity for local LLMs. - Large JSON schemas often include "arbitrary" properties, e.g. properties that are allowed to be any valid JSON value: https://json-schema.org/understanding-json-schema/basics#hello-world!
Calling
guidance.json
with an empty schema generates arbitrary JSON.This closes #887 -- to quote @wjn0, there are several motivations for this: