Allow generating arbitrary (schemaless) JSON #892

hudson-ai · 2024-06-06T18:28:47Z

Calling `guidance.json` with an empty schema generates arbitrary JSON.
This closes #887 -- to quote @wjn0, there are several motivations for this:

- APIs such as OpenAI allow users to request only valid JSON be generated sans schema, so in some sense this would give feature parity for local LLMs.
- Large JSON schemas often include "arbitrary" properties, e.g. properties that are allowed to be any valid JSON value: https://json-schema.org/understanding-json-schema/basics#hello-world!
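
To make "arbitrary JSON" concrete: an empty schema places no constraint on the value, so the only structure left to enforce is the recursive JSON syntax itself (a value is an object, array, string, number, boolean, or null). The sketch below is a standard-library illustration of that recursion, not the grammar this PR adds; `random_json_value` and its parameters are hypothetical names chosen for the example.

```python
import json
import random


def random_json_value(rng: random.Random, depth: int = 0, max_depth: int = 3):
    """Return a random JSON-compatible Python value (illustrative only).

    Mirrors the recursive shape of JSON:
    value := object | array | string | number | true | false | null
    """
    scalars = [
        lambda: None,
        lambda: rng.choice([True, False]),
        lambda: rng.randint(-100, 100),
        lambda: round(rng.uniform(-1, 1), 3),
        lambda: "".join(rng.choice("abc") for _ in range(rng.randint(0, 4))),
    ]
    containers = [
        # Arrays and objects recurse, which is what makes the value "arbitrary".
        lambda: [random_json_value(rng, depth + 1, max_depth)
                 for _ in range(rng.randint(0, 3))],
        lambda: {f"k{i}": random_json_value(rng, depth + 1, max_depth)
                 for i in range(rng.randint(0, 3))},
    ]
    # Stop recursing once the depth limit is reached.
    choices = scalars if depth >= max_depth else scalars + containers
    return rng.choice(choices)()


rng = random.Random(0)
samples = [random_json_value(rng) for _ in range(5)]
for sample in samples:
    # Every sample serializes to valid JSON and parses back unchanged.
    assert json.loads(json.dumps(sample)) == sample
```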

codecov-commenter · 2024-06-06T18:36:58Z

Codecov Report

Attention: Patch coverage is 93.33333% with 1 line in your changes missing coverage. Please review.

Project coverage is 60.43%. Comparing base (14862a1) to head (84ce68c).
Report is 2 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| guidance/library/_json.py | 93.33% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #892      +/-   ##
==========================================
+ Coverage   56.80%   60.43%   +3.63%     
==========================================
  Files          63       64       +1     
  Lines        4625     4658      +33     
==========================================
+ Hits         2627     2815     +188     
+ Misses       1998     1843     -155

riedgar-ms · 2024-06-06T19:06:36Z

guidance/library/_json.py

@@ -297,19 +336,21 @@ def _gen_json(
            )
        raise ValueError(f"Unsupported type in schema: {target_type}")

-    raise ValueError(f"Can't process JSON node: {json_schema}")


Shouldn't this still get raised somewhere?

Good question. Because an empty schema implies arbitrary JSON, we can no longer use the absence of keys as the trigger for this exception. I expect that the schema validation we already do also rules out conflicting or incompatible keys... So maybe all that's left is some assertion about extra/unhandled keys?

E.g. we should raise errors if users ask to match a pattern, provide an integer range, etc. -- any other situation you can think of?

One pitfall to be aware of if you decide to implement more complex checking before falling back to "any" generation: JSON Schema has a few keywords that are purely human-readable annotations (https://json-schema.org/understanding-json-schema/reference/annotations).

So e.g. {"description": "Any arbitrary JSON object"} is considered an empty schema under the spec.
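
A minimal sketch of that pitfall, assuming the annotation keywords listed on the page linked above: a schema counts as "effectively empty" when nothing but annotations remains. The helper name `is_effectively_empty` and the constant `ANNOTATION_KEYWORDS` are hypothetical and are not part of guidance.

```python
# Annotation keywords carry no validation meaning (assumed set, taken from
# the JSON Schema annotations reference).
ANNOTATION_KEYWORDS = {
    "title", "description", "default", "examples",
    "deprecated", "readOnly", "writeOnly",
}


def is_effectively_empty(schema) -> bool:
    """Return True if the schema accepts any JSON value."""
    if schema is True:
        # The boolean schema `true` is equivalent to the empty schema {}.
        return True
    if isinstance(schema, dict):
        # Only annotations (or nothing at all) means no constraints.
        return all(key in ANNOTATION_KEYWORDS for key in schema)
    return False
```

With this check, `{"description": "Any arbitrary JSON object"}` is treated the same as `{}`, while `{"type": "string"}` is not.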

I added some logic for validating the keys of the JSON schema, so that we only allow keys that are on a whitelist of either "implemented" or "ignored" (title, description, ...) keys. An explicit blacklist of "things we don't support" feels far more fragile, which is why I didn't take that approach.
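
The whitelist idea can be sketched roughly as follows. This is an illustration under assumptions, not the code in `guidance/library/_json.py`: the two key sets and the name `check_schema_keys` are hypothetical, and the real lists of implemented and ignored keys may differ.

```python
# Hypothetical key sets for illustration; the PR's actual lists may differ.
IMPLEMENTED_KEYS = {
    "type", "properties", "additionalProperties", "items", "prefixItems",
    "enum", "anyOf", "allOf", "$ref", "$defs",
}
IGNORED_KEYS = {"title", "description", "default", "examples"}


def check_schema_keys(schema: dict) -> None:
    """Raise ValueError if the schema uses a key that is neither handled nor ignored."""
    unknown = set(schema) - IMPLEMENTED_KEYS - IGNORED_KEYS
    if unknown:
        raise ValueError(f"Unsupported JSON schema keys: {sorted(unknown)}")


# Annotations pass silently; an unhandled constraint such as "pattern" fails loudly.
check_schema_keys({"type": "object", "description": "anything"})
```

The design point is that a new, unrecognized keyword fails by default instead of being silently dropped, which is what makes a whitelist less fragile than a blacklist.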

@riedgar-ms would you mind taking a look at the general approach to validation?

Note that tests are now failing because the validation raises an exception telling us that we don't support the "required" key, which a lot of our object tests use. Fixing that is now on my TODO list. It's weirdly non-trivial to get the commas right when separating mixtures of optional and non-optional values...
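
To illustrate why the commas are awkward: whether a comma precedes a property depends on whether any earlier property was actually emitted, so the number of valid layouts doubles with each optional property. The sketch below simply enumerates those layouts for properties in declaration order; `object_bodies` is a hypothetical helper and says nothing about how a grammar should encode this.

```python
from itertools import product


def object_bodies(properties: list[str], required: set[str]) -> list[str]:
    """Enumerate the key layouts an object may take, in declaration order.

    Each optional property is either present or absent; commas appear only
    between properties that are actually emitted.
    """
    optional = [p for p in properties if p not in required]
    bodies = []
    for mask in product([False, True], repeat=len(optional)):
        present = required | {p for p, keep in zip(optional, mask) if keep}
        emitted = [f'"{p}": ...' for p in properties if p in present]
        bodies.append("{" + ", ".join(emitted) + "}")
    return bodies


# One required and one optional property give two layouts.
for body in object_bodies(["a", "b"], required={"a"}):
    print(body)
```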

riedgar-ms

Thank you for working on this!

riedgar-ms · 2024-06-12T16:47:03Z

tests/library/test_json.py

@@ -638,6 +639,7 @@ def test_bad_with_prefix(
    ):
        schema_obj = {
            "prefixItems": self.prefix_schema_obj,
+            "items": False,


Why is this needed? If you're only adding capabilities, surely the existing tests should remain the same?

Essentially the old test and behavior were wrong. According to the JSON Schema specification, "Omitting this keyword has the same assertion behavior as an empty schema." Previously, we were treating an omitted "items" keyword as "no items other than those explicitly provided by prefixItems are allowed", but the behavior should instead be "put anything you want here".

I'll take a look to see if we can be a little more explicit about this behavior in the tests.

Understood. So we need to make sure this is added wherever it's needed, and then in the new 'unconstrained' tests that it works both ways - when "items": True explicitly, and when items is omitted.

Added some tests for these too :)

riedgar-ms · 2024-06-12T16:48:01Z

tests/library/test_json.py

@@ -861,6 +863,39 @@ def test_nested_ref(self, temperature):
        # The actual check
        generate_and_check(target_obj, schema_obj, desired_temperature=temperature)

+    @pytest.mark.parametrize("temperature", [None, 0.1, 1])
+    def test_multiple_refs_to_same_def(self, temperature):


Like the extra test case

riedgar-ms · 2024-06-12T16:48:19Z

tests/library/test_json.py

@@ -323,7 +323,8 @@ def test_bad_object(self, bad_string, good_bytes, failure_byte, allowed_bytes):
            "type": "object",
            "properties": {
                "a" : {"type": "integer"}
-            }
+            },
+            "additionalProperties": false


Why is this change needed to an existing test?

Same as the "items" discussion above. Not specifying additionalProperties is the same as specifying an empty schema, i.e. "anything goes". The old test behavior enforced that omitting it implied "no additionalProperties", which is wrong.
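
Both threads come down to the same default. A minimal sketch, assuming we only care whether extra items or extra properties are permitted at all (it does not validate them against a non-boolean sub-schema); `array_ok` and `object_ok` are hypothetical helpers, not guidance functions.

```python
def array_ok(instance: list, schema: dict) -> bool:
    """Check whether items beyond `prefixItems` are permitted."""
    extra = instance[len(schema.get("prefixItems", [])):]
    # An omitted "items" behaves like the empty schema {}: anything goes.
    items = schema.get("items", {})
    return not extra or items is not False


def object_ok(instance: dict, schema: dict) -> bool:
    """Check whether keys beyond `properties` are permitted."""
    extra_keys = set(instance) - set(schema.get("properties", {}))
    # An omitted "additionalProperties" also behaves like {}.
    additional = schema.get("additionalProperties", {})
    return not extra_keys or additional is not False


open_array = {"prefixItems": [{"type": "integer"}]}
closed_array = {"prefixItems": [{"type": "integer"}], "items": False}
print(array_ok([1, "extra"], open_array), array_ok([1, "extra"], closed_array))
```

This is why the existing "bad" test cases needed an explicit `"items": False` or `"additionalProperties": false` to keep rejecting the extra values.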

riedgar-ms · 2024-06-12T16:52:40Z

tests/library/test_json.py

+
+class TestEmptySchemas:
+    empty_schema = "{}"
+    nested_empty_schema = """{


Could you also do a case like

{ "properties" : { "a": {}, "b": { "type": "number" } }, "type" : "object" }

Also, what if a definition (an entry under $defs) has an empty schema (or a nested one)?

Good ideas -- will do

Added both of these tests!

hudson-ai · 2024-06-19T00:44:40Z

@riedgar-ms ready for a final review when you have time.

hudson-ai · 2024-06-19T00:46:43Z

guidance/library/_json.py

+    "description",
+    "default",
+    "examples",
+    "required", # TODO: implement and remove from ignored list


Decided to add "required" to the ignored key list for now. Currently the grammar is just a little bit over-zealous and considers every property of an object to be required, so this isn't a terrible failure mode. "Fixing" this felt beyond the scope of this PR, but I added a new issue to track this: #906

riedgar-ms

Looks like a solid addition, thanks!

hudson-ai added 4 commits June 6, 2024 11:21

initial attempt at allowing 'any' json schemas

d24e09c

typo

accf649

Move constant strings to top-level

9bdd40d

Move ANY closure to _gen_json_any

97c7c2e

hudson-ai requested a review from riedgar-ms June 6, 2024 18:29

Make default (no-arg) json grammar generate arbitrary json

9d11b0c

riedgar-ms reviewed Jun 6, 2024

hudson-ai added 14 commits June 6, 2024 14:14

Encapsulate ANY definition dict inside closure; cache

e8a6cb3

Cache other closure

8a10303

Add test to ensure caching doesn't affect multiple refs to same def

cec5f50

Tests for generating json from empty schema

e5cf7f7

Top-level caching == bad idea (globally shared grammar doesn't play well with setting temperature)

efa50e2

Allow items and properties to take on boolean values

c211566

Simplify implementation of _gen_json_any -- no definitions or closures needed

d3b4b04

remove all caching for now

4de5074

Revert "Move constant strings to top-level"

3b89ce3

This reverts commit 9bdd40d.

typing, documentation

84ce68c

Keyword enum

c22f614

Add basic validation of json schemas to ensure we handle all keys

a4fb809

3.10 compat

a74fcfa

rename validation method

7628736

riedgar-ms reviewed Jun 12, 2024

hudson-ai added 6 commits June 18, 2024 12:29

Merge branch 'main' into json_any

1629021

temporarily add required to ignored list of keys so implementation can be punted to future PR

8ca6f6d

test case where object has one empty schema and one nonempty schema

4b0e6c3

Test cases for items/additionalProperties empty, {}, True, False

ca6907c

consolidate empty items and additionalProperties tests

92181d7

Add test for empty definition

4dc18c1

hudson-ai changed the title ~~[WIP] Allow generating arbitrary (schemaless) JSON~~ Allow generating arbitrary (schemaless) JSON Jun 19, 2024

hudson-ai commented Jun 19, 2024

riedgar-ms approved these changes Jun 19, 2024

Merge branch 'main' into json_any

a51b64d

hudson-ai merged commit e2c5b3d into guidance-ai:main Jun 19, 2024
99 of 103 checks passed
