
Grammar Builder for JSON Schema #619

Merged

Conversation

riedgar-ms
Collaborator

@riedgar-ms riedgar-ms commented Feb 8, 2024

A very basic guidance grammar builder for JSON schema. It handles the basic types, although the set of accepted strings is more restricted than full JSON (starting restrictive is deliberate: loosening the grammar later is preferable to rejecting things it previously allowed).

There is no support for cross references; that will have to wait until a future PR. There is also no support for constraints. I do not expect constraints such as multipleOf or exclusiveMinimum to be supportable. It is possible that some of the constraints such as minItems, maxLength and pattern may be supportable.
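As a hypothetical illustration (not taken from the PR itself), a schema limited to the basic types the builder targets -- no cross references, no constraints -- might look like this, together with a compact document the resulting grammar should admit:

```python
import json

# Made-up example schema: only basic types, no "$ref" cross references
# and no constraints such as multipleOf, exclusiveMinimum, or pattern.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "scores": {"type": "array", "items": {"type": "number"}},
    },
}

# A conforming document, printed in the compact encoding the builder emits:
document = {"name": "Ada", "age": 36, "scores": [1.5, 2.0]}
print(json.dumps(document, separators=(",", ":")))
# → {"name":"Ada","age":36,"scores":[1.5,2.0]}
```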

@riedgar-ms riedgar-ms changed the title [WIP] Grammar Builder for JSON Schema Grammar Builder for JSON Schema Feb 8, 2024
@slundberg
Collaborator

This is great @riedgar-ms! Do you have any thoughts about how this might connect to Pydantic as well? For example, if people have a pydantic spec, should they export a schema from pydantic and then we build a grammar from that? I know there was some pydantic work in #559, and I was wondering if this could be a layer for that. Thoughts?

@riedgar-ms
Collaborator Author

> This is great @riedgar-ms! Do you have any thoughts about how this might connect to Pydantic as well? For example, if people have a pydantic spec, should they export a schema from pydantic and then we build a grammar from that? I know there was some pydantic work in #559, and I was wondering if this could be a layer for that. Thoughts?

I was assuming that once this was in place, we could use:
https://docs.pydantic.dev/latest/concepts/json_schema/
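For illustration, Pydantic v2 exposes `model_json_schema()` for exactly this kind of export; the grammar builder could then consume the resulting dict. (The `User` model here is a made-up example, not code from the PR.)

```python
from pydantic import BaseModel

# Hypothetical pydantic model; model_json_schema() exports a
# JSON-schema dict that a schema-to-grammar builder could consume.
class User(BaseModel):
    name: str
    age: int

schema = User.model_json_schema()
print(schema["properties"])
```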

@Harsha-Nori
Collaborator

> One thing -- I really liked that I could use Annotated type annotations to specify constraints, e.g. patterns that string fields have to match on.
>
> I couldn't preserve that behavior without making the EarleyParser scream at me, though, despite generative forward passes working perfectly (i.e. the EarleyParser raised exceptions for strings that were generated via the grammar itself). Not sure if that's a bug in the parser or if using it with the gen interface is not a supported use-case.

@hudson-ai, thanks for all the enthusiasm and progress here! Do you have a code snippet to reproduce the Parser issue on your fork? I'm happy to take a look.

Comment on lines +11 to +17
def to_compact_json(target: any) -> str:
    # See 'Compact Encoding':
    # https://docs.python.org/3/library/json.html
    # Since this is ultimately about the generated
    # output, we don't need to worry about pretty printing
    # and whitespace
    return json.dumps(target, separators=(",", ":"))
Collaborator


Seems reasonable to generate compact JSONs, which users can then pretty-print themselves.
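For reference, the `separators` argument from the stdlib `json` docs is what produces the compact form; a quick sketch of the difference:

```python
import json

# json.dumps defaults to (', ', ': ') separators, which insert spaces;
# separators=(",", ":") removes them, giving the most compact encoding.
data = {"a": [1, 2], "b": "x"}
print(json.dumps(data))                         # {"a": [1, 2], "b": "x"}
print(json.dumps(data, separators=(",", ":")))  # {"a":[1,2],"b":"x"}
```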

@hudson-ai
Collaborator

> @hudson-ai, thanks for all the enthusiasm and progress here! Do you have a code snippet to reproduce the Parser issue on your fork? I'm happy to take a look.

My pleasure. This library is the best thing since sliced bread, and I really wanted the ability to generate structured data according to a schema -- pure coincidence that we made parallel implementations on the same day!

I changed the implementation of gen_str on my fork, which ameliorates this issue, but here's a minimal(ish) snippet to reproduce:

from guidance._grammar import Byte, GrammarFunction, Join, Select, select
from guidance.library._gen import gen
from guidance._parser import EarleyCommitParser, ParserException

_QUOTE = Byte(b'"')
_COMMA = Byte(b',')
_OPEN_BRACKET = Byte(b'[')
_CLOSE_BRACKET = Byte(b']')

def check_string_with_grammar(input_string: str, grammar: GrammarFunction):
    parser = EarleyCommitParser(grammar)

    print(f"Checking {input_string}")
    for c in input_string:
        print(f"Working on: {c}")
        print(f"Valid next bytes: {parser.valid_next_bytes()}")
        next_byte = bytes(c, encoding="utf8")
        print(f"Consuming: {next_byte}")
        parser.consume_byte(next_byte)

def gen_str():
    # using _SAFE_STR instead of gen here fixes the error
    return Join([_QUOTE, gen(stop='"'), _QUOTE])

def gen_list_of_str() -> GrammarFunction:
    s = Select([], capture_name=None, recursive=True)
    s.values = [gen_str(), Join([s,  _COMMA,  gen_str()])]
    return _OPEN_BRACKET + select([_CLOSE_BRACKET, Join([s, _CLOSE_BRACKET])])

def test_list_of_str():
    example = '["a","b","c","d"]'
    grammar = gen_list_of_str()
    check_string_with_grammar(example, grammar)

# This errs, but it shouldn't?
test_list_of_str()

@hudson-ai
Collaborator

I hypothesize that the issue has to do with the stop kwarg in gen (although check_string_with_grammar('"cake"', gen_str()) is fine...)

@hudson-ai
Collaborator

@Harsha-Nori I split the parser problem out into its own issue, as it doesn't really belong in this PR. Hope I was clear enough over there: #624

Comment on lines 20 to 29
def check_string_with_grammar(input_string: str, grammar: GrammarFunction):
    parser = EarleyCommitParser(grammar)

    print(f"Checking {input_string}")
    for c in input_string:
        print(f"Working on: {c}")
        print(f"Valid next bytes: {parser.valid_next_bytes()}")
        next_byte = bytes(c, encoding="utf8")
        print(f"Consuming: {next_byte}")
        parser.consume_byte(next_byte)
Collaborator


I think we should shift this logic over to the GrammarFunction.match() method -- I'd like to gradually make that method the main check for grammar compliance across the codebase.

Collaborator Author


Done

Comment on lines 67 to 72
for c in simple_json_string:
    print(f"Working on: {c}")
    print(f"Valid next bytes: {parser.valid_next_bytes()}")
    next_byte = bytes(c, encoding="utf8")
    print(f"Consuming: {next_byte}")
    parser.consume_byte(next_byte)
Collaborator


I assume this can leverage the check_string_with_grammar function defined earlier, right?

Collaborator Author


This should all be changed to use GrammarFunction.match()

@Harsha-Nori
Collaborator

@riedgar-ms I think this looks great so far! Would be nice to have a test that leverages an actual guidance.models object -- can be Mock, if we need to -- to verify that the end-to-end grammar constrained generation works against a full model object.

@riedgar-ms
Collaborator Author

> @riedgar-ms I think this looks great so far! Would be nice to have a test that leverages an actual guidance.models object -- can be Mock, if we need to -- to verify that the end-to-end grammar constrained generation works against a full model object.

Added a smoke test using the Mock model.

hudson-ai added a commit to hudson-ai/guidance that referenced this pull request Feb 14, 2024
@Harsha-Nori
Collaborator

Thanks @riedgar-ms for writing this up! I think there are a few more open questions here -- layering in commit_point() appropriately, the use of Byte strings vs. regular strings, etc. -- but we can handle those iteratively later. Really nice to see some solid tests here, including use of the end-to-end guidance.models.Mock. I'm comfortable merging this as a WIP for now, since it isn't user-facing yet and we don't want to diverge too much from main; we can keep working together on pushing support further. Great stuff!

@Harsha-Nori Harsha-Nori merged commit 3217a29 into guidance-ai:main Feb 15, 2024
5 checks passed
@Harsha-Nori
Collaborator

Also, thanks @hudson-ai for all the comments and implementations you've done on your fork! Excited to see this continue to progress with your help too :).

@riedgar-ms riedgar-ms deleted the riedgar-ms/json-schema-to-grammar-01 branch March 15, 2024 12:00