Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow generating arbitrary (schemaless) JSON #892

Merged
merged 26 commits into from
Jun 19, 2024

Conversation

hudson-ai
Copy link
Collaborator

Calling guidance.json with an empty schema generates arbitrary JSON.
This closes #887 -- to quote @wjn0, there are several motivations for this:

@hudson-ai hudson-ai requested a review from riedgar-ms June 6, 2024 18:29
@codecov-commenter
Copy link

codecov-commenter commented Jun 6, 2024

Codecov Report

Attention: Patch coverage is 93.33333% with 1 line in your changes missing coverage. Please review.

Project coverage is 60.43%. Comparing base (14862a1) to head (84ce68c).
Report is 2 commits behind head on main.

Files Patch % Lines
guidance/library/_json.py 93.33% 1 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #892      +/-   ##
==========================================
+ Coverage   56.80%   60.43%   +3.63%     
==========================================
  Files          63       64       +1     
  Lines        4625     4658      +33     
==========================================
+ Hits         2627     2815     +188     
+ Misses       1998     1843     -155     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -297,19 +336,21 @@ def _gen_json(
)
raise ValueError(f"Unsupported type in schema: {target_type}")

raise ValueError(f"Can't process JSON node: {json_schema}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this still get raised somewhere?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. The fact that empty schemas imply arbitrary json makes it such that we can't use the absence of keys to raise this exception. I expect that the schema validation we do also implies that we can't get conflicting or incompatible keys... So maybe all that's left is some assertion about extra/unhandled keys?

E.g. we should raise errors if users ask to match a pattern, provide an integer range, etc. -- any other situation you can think of?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One pitfall to be aware of here if you decide to implement more complex checking before falling back to any generation is that there are a few human-readable-only JSON schema elements: https://json-schema.org/understanding-json-schema/reference/annotations

So e.g. {"description": "Any arbitrary JSON object"} is considered an empty schema under the spec.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some logic for validating keys of the json schema to ensure we only allow keys that are either on a whitelist of "implemented" or "ignored" (title, description, ...). An explicit blacklist of "things we don't support" feels like a far more fragile approach, hence me not taking it.

@riedgar-ms would you mind taking a look at the general approach to validation?

Note that tests are now failing because the validation raises an exception telling us that we don't support the required key that we used in a lot of our tests for object. Fixing that added to my TODOs. It's weirdly non-trivial to get the commas right when separating mixtures of optional and non-optional values...

Copy link
Collaborator

@riedgar-ms riedgar-ms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this!

@@ -638,6 +639,7 @@ def test_bad_with_prefix(
):
schema_obj = {
"prefixItems": self.prefix_schema_obj,
"items": False,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed? If you're only adding capabilities, surely the existing tests should remain the same?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially the old test and behavior were wrong. According to the json schema specification, "Omitting this keyword has the same assertion behavior as an empty schema.". Previously, we were treating an omitted "items" keyword as "no items not explicitly provided by prefixItems are allowed", but the behavior should instead be "put anything you want here".

I'll take a look to see if we can be a little more explicit about this behavior in the tests.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. So we need to make sure this is added wherever it's needed, and then in the new 'unconstrained' tests that it works both ways - when "items": True explicitly, and when items is omitted.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some tests for these too :)

@@ -861,6 +863,39 @@ def test_nested_ref(self, temperature):
# The actual check
generate_and_check(target_obj, schema_obj, desired_temperature=temperature)

@pytest.mark.parametrize("temperature", [None, 0.1, 1])
def test_multiple_refs_to_same_def(self, temperature):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like the extra test case

@@ -323,7 +323,8 @@ def test_bad_object(self, bad_string, good_bytes, failure_byte, allowed_bytes):
"type": "object",
"properties": {
"a" : {"type": "integer"}
}
},
"additionalProperties": false
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change needed to an existing test?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the "items" discussion above. Not specifying additionalProperties is the same as specifying an empty schema, i.e. "anything goes". Old test behavior enforced that leaving this empty implied "no additionalProperties", which is wrong


class TestEmptySchemas:
empty_schema = "{}"
nested_empty_schema = """{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also do a case like

{
  "properties" : {
    "a": {},
    "b": "number"
  },
  "type" : "object"
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, what if a #def has an empty schema (or a nested one)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good ideas -- will do

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added both of these tests!

@hudson-ai hudson-ai changed the title [WIP] Allow generating arbitrary (schemaless) JSON Allow generating arbitrary (schemaless) JSON Jun 19, 2024
@hudson-ai
Copy link
Collaborator Author

@riedgar-ms ready for a final review when you have time.

"description",
"default",
"examples",
"required", # TODO: implement and remove from ignored list
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided to add "required" to the ignored key list for now. Currently the grammar is just a little bit over-zealous and considers every property of an object to be required, so this isn't a terrible failure mode. "Fixing" this felt beyond the scope of this PR, but I added a new issue to track this: #906

Copy link
Collaborator

@riedgar-ms riedgar-ms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a solid addition, thanks!

@hudson-ai hudson-ai merged commit e2c5b3d into guidance-ai:main Jun 19, 2024
99 of 103 checks passed
Dobiichi-Origami pushed a commit to Dobiichi-Origami/guidance that referenced this pull request Jun 24, 2024
Calling `guidance.json` with an empty schema generates arbitrary JSON.
This closes guidance-ai#887 -- to quote @wjn0, there are several motivations for
this:
- APIs such as OpenAI allow users to request only valid JSON be
generated sans schema, so in some sense this would give feature parity
for local LLMs.
- Large JSON schemas often include "arbitrary" properties, e.g.
properties that are allowed to be any valid JSON value:
https://json-schema.org/understanding-json-schema/basics#hello-world!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow generating arbitrary (schemaless) JSON
4 participants