
Grammar Builder for JSON Schema #619

Merged

Conversation

riedgar-ms
Collaborator

@riedgar-ms riedgar-ms commented Feb 8, 2024

A very basic guidance grammar builder for JSON schema. It handles the basic types, although the set of accepted strings is more restricted than full JSON (starting restrictive is deliberate: loosening the grammar later is preferable to rejecting things it previously allowed).

There is no support for cross references; that will have to wait until a future PR. There is also no support for constraints. I do not expect constraints such as multipleOf or exclusiveMinimum to be supportable. It is possible that some of the constraints such as minItems, maxLength and pattern may be supportable.
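As a hypothetical illustration (not taken from the PR itself), a schema limited to the basic types the builder targets -- no cross references, no constraints -- might look like this, together with a compact document the resulting grammar should admit:

```python
import json

# Made-up example schema: only basic types, no "$ref" cross references
# and no constraints such as multipleOf, exclusiveMinimum, or pattern.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "scores": {"type": "array", "items": {"type": "number"}},
    },
}

# A conforming document, printed in the compact encoding the builder emits:
document = {"name": "Ada", "age": 36, "scores": [1.5, 2.0]}
print(json.dumps(document, separators=(",", ":")))
# → {"name":"Ada","age":36,"scores":[1.5,2.0]}
```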

@riedgar-ms riedgar-ms changed the title [WIP] Grammar Builder for JSON Schema Grammar Builder for JSON Schema Feb 8, 2024
@slundberg
Collaborator

This is great @riedgar-ms! Do you have any thoughts about how this might connect to Pydantic as well? For example, if people have a pydantic spec, should they export a schema from pydantic and then we build a grammar from that? I know there was some pydantic work in #559, and I was wondering if this could be a layer for that. Thoughts?

@riedgar-ms
Collaborator Author

> This is great @riedgar-ms! Do you have any thoughts about how this might connect to Pydantic as well? For example, if people have a pydantic spec, should they export a schema from pydantic and then we build a grammar from that? I know there was some pydantic work in #559, and I was wondering if this could be a layer for that. Thoughts?

I was assuming that once this was in place, we could use:
https://docs.pydantic.dev/latest/concepts/json_schema/
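For illustration, Pydantic v2 exposes `model_json_schema()` for exactly this kind of export; the grammar builder could then consume the resulting dict. (The `User` model here is a made-up example, not code from the PR.)

```python
from pydantic import BaseModel

# Hypothetical pydantic model; model_json_schema() exports a
# JSON-schema dict that a schema-to-grammar builder could consume.
class User(BaseModel):
    name: str
    age: int

schema = User.model_json_schema()
print(schema["properties"])
```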

@Harsha-Nori
Collaborator

> One thing -- I really liked that I could use Annotated type annotations to specify constraints, e.g. patterns that string fields have to match on.
>
> I couldn't preserve that behavior without making the EarleyParser scream at me, though, despite generative forward passes working perfectly (i.e. the EarleyParser raised exceptions for strings that were generated via the grammar itself). Not sure if that's a bug in the parser or if using it with the gen interface is not a supported use-case.

@hudson-ai, thanks for all the enthusiasm and progress here! Do you have a code snippet to reproduce the Parser issue on your fork? I'm happy to take a look.

Comment on lines +11 to +17
def to_compact_json(target: any) -> str:
    # See 'Compact Encoding':
    # https://docs.python.org/3/library/json.html
    # Since this is ultimately about the generated
    # output, we don't need to worry about pretty printing
    # and whitespace
    return json.dumps(target, separators=(",", ":"))
Collaborator


Seems reasonable to generate compact JSONs, which users can then pretty-print themselves.
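For reference, the `separators` argument from the stdlib `json` docs is what produces the compact form; a quick sketch of the difference:

```python
import json

# json.dumps defaults to (', ', ': ') separators, which insert spaces;
# separators=(",", ":") removes them, giving the most compact encoding.
data = {"a": [1, 2], "b": "x"}
print(json.dumps(data))                         # {"a": [1, 2], "b": "x"}
print(json.dumps(data, separators=(",", ":")))  # {"a":[1,2],"b":"x"}
```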

@hudson-ai
Collaborator

> @hudson-ai, thanks for all the enthusiasm and progress here! Do you have a code snippet to reproduce the Parser issue on your fork? I'm happy to take a look.

My pleasure. This library is the best thing since sliced bread, and I really wanted the ability to generate structured data according to a schema -- pure coincidence that we made parallel implementations on the same day!

I changed the implementation of gen_str on my fork, which ameliorates this issue, but here's a minimal(ish) snippet to reproduce:

from guidance._grammar import Byte, GrammarFunction, Join, Select, select
from guidance.library._gen import gen
from guidance._parser import EarleyCommitParser, ParserException

_QUOTE = Byte(b'"')
_COMMA = Byte(b',')
_OPEN_BRACKET = Byte(b'[')
_CLOSE_BRACKET = Byte(b']')

def check_string_with_grammar(input_string: str, grammar: GrammarFunction):
    parser = EarleyCommitParser(grammar)

    print(f"Checking {input_string}")
    for c in input_string:
        print(f"Working on: {c}")
        print(f"Valid next bytes: {parser.valid_next_bytes()}")
        next_byte = bytes(c, encoding="utf8")
        print(f"Consuming: {next_byte}")
        parser.consume_byte(next_byte)

def gen_str():
    # using _SAFE_STR instead of gen here fixes the error
    return Join([_QUOTE, gen(stop='"'), _QUOTE])

def gen_list_of_str() -> GrammarFunction:
    s = Select([], capture_name=None, recursive=True)
    s.values = [gen_str(), Join([s,  _COMMA,  gen_str()])]
    return _OPEN_BRACKET + select([_CLOSE_BRACKET, Join([s, _CLOSE_BRACKET])])

def test_list_of_str():
    example = '["a","b","c","d"]'
    grammar = gen_list_of_str()
    check_string_with_grammar(example, grammar)

# This errs, but it shouldn't?
test_list_of_str()

@hudson-ai
Collaborator

I hypothesize that the issue has to do with the stop kwarg in gen (although check_string_with_grammar('"cake"', gen_str()) is fine...)

@hudson-ai
Collaborator

@Harsha-Nori I split the parser problem out into its own issue, as it doesn't really belong in this PR. Hope I was clear enough over there: #624

Comment on lines 20 to 29
def check_string_with_grammar(input_string: str, grammar: GrammarFunction):
    parser = EarleyCommitParser(grammar)

    print(f"Checking {input_string}")
    for c in input_string:
        print(f"Working on: {c}")
        print(f"Valid next bytes: {parser.valid_next_bytes()}")
        next_byte = bytes(c, encoding="utf8")
        print(f"Consuming: {next_byte}")
        parser.consume_byte(next_byte)
Collaborator


I think we should shift this logic over to the GrammarFunction.match() method -- I'd like to gradually make that method the main check for grammar compliance across the codebase.

Collaborator Author


Done

Comment on lines 67 to 72
for c in simple_json_string:
    print(f"Working on: {c}")
    print(f"Valid next bytes: {parser.valid_next_bytes()}")
    next_byte = bytes(c, encoding="utf8")
    print(f"Consuming: {next_byte}")
    parser.consume_byte(next_byte)
Collaborator


I assume this can leverage the check_string_with_grammar function defined earlier, right?

Collaborator Author


This should all be changed to use GrammarFunction.match()

@Harsha-Nori
Collaborator

@riedgar-ms I think this looks great so far! Would be nice to have a test that leverages an actual guidance.models object -- can be Mock, if we need to -- to verify that the end-to-end grammar constrained generation works against a full model object.

@riedgar-ms
Collaborator Author

> @riedgar-ms I think this looks great so far! Would be nice to have a test that leverages an actual guidance.models object -- can be Mock, if we need to -- to verify that the end-to-end grammar constrained generation works against a full model object.

Added a smoke test using the Mock model.

hudson-ai added a commit to hudson-ai/guidance that referenced this pull request Feb 14, 2024
@Harsha-Nori
Collaborator

Thanks @riedgar-ms for writing this up! I think there are a few more open questions here -- layering in commit_point() appropriately, the use of Byte strings vs. regular strings, etc. -- but we can handle those iteratively later. Really nice to see some solid tests here, including use of the end-to-end guidance.models.Mock. I'm comfortable merging this as a WIP for now, since it isn't user-facing yet and we don't want to diverge too much from main; we can keep working together on pushing support further. Great stuff!

@Harsha-Nori Harsha-Nori merged commit 3217a29 into guidance-ai:main Feb 15, 2024
5 checks passed
@Harsha-Nori
Collaborator

Also, thanks @hudson-ai for all the comments and implementations you've done on your fork! Excited to see this continue to progress with your help too :).

@riedgar-ms riedgar-ms deleted the riedgar-ms/json-schema-to-grammar-01 branch March 15, 2024 12:00