Grammar Builder for JSON Schema #619
…ema-to-grammar-01
This is great @riedgar-ms! Do you have any thoughts about how this might connect to Pydantic as well? For example, if people have a pydantic spec, should they export a schema from pydantic and then we build a grammar from that? I know there was some pydantic work in #559, and I was wondering if this could be a layer for that. Thoughts?
I was assuming that once this was in place, we could use:
@hudson-ai, thanks for all the enthusiasm and progress here! Do you have a code snippet to reproduce the Parser issue on your fork? I'm happy to take a look.
def to_compact_json(target: any) -> str:
    # See 'Compact Encoding':
    # https://docs.python.org/3/library/json.html
    # Since this is ultimately about the generated
    # output, we don't need to worry about pretty printing
    # and whitespace
    return json.dumps(target, separators=(",", ":"))
Seems reasonable to generate compact JSONs, which users can then pretty-print themselves.
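For readers unfamiliar with the compact encoding discussed here, the following is a minimal sketch using only the standard library. The `record` value is made up for illustration; it shows that compact output carries no optional whitespace and that a consumer can re-indent it afterwards without changing the data.

```python
import json

# Illustrative data only; not taken from the PR.
record = {"name": "a", "tags": ["x", "y"], "count": 3}

# Compact encoding: no space after "," or ":".
compact = json.dumps(record, separators=(",", ":"))

# A consumer who wants readable output can re-serialize with indentation.
pretty = json.dumps(json.loads(compact), indent=2)

print(compact)
print(pretty)
```

Both strings decode to the same object, so restricting the grammar to the compact form does not lose information.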
My pleasure. This library is the best thing since sliced bread, and I really wanted the ability to generate structured data according to a schema -- pure coincidence that we made parallel implementations on the same day! I changed the implementation of

from guidance._grammar import Byte, GrammarFunction, Join, Select, select
from guidance.library._gen import gen
from guidance._parser import EarleyCommitParser, ParserException

_QUOTE = Byte(b'"')
_COMMA = Byte(b',')
_OPEN_BRACKET = Byte(b'[')
_CLOSE_BRACKET = Byte(b']')

def check_string_with_grammar(input_string: str, grammar: GrammarFunction):
    parser = EarleyCommitParser(grammar)
    print(f"Checking {input_string}")
    for c in input_string:
        print(f"Working on: {c}")
        print(f"Valid next bytes: {parser.valid_next_bytes()}")
        next_byte = bytes(c, encoding="utf8")
        print(f"Consuming: {next_byte}")
        parser.consume_byte(next_byte)

def gen_str():
    # using _SAFE_STR instead of gen here fixes the error
    return Join([_QUOTE, gen(stop='"'), _QUOTE])

def gen_list_of_str() -> GrammarFunction:
    s = Select([], capture_name=None, recursive=True)
    s.values = [gen_str(), Join([s, _COMMA, gen_str()])]
    return _OPEN_BRACKET + select([_CLOSE_BRACKET, Join([s, _CLOSE_BRACKET])])

def test_list_of_str():
    example = '["a","b","c","d"]'
    grammar = gen_list_of_str()
    check_string_with_grammar(example, grammar)

# This errs, but it shouldn't?
test_list_of_str()
I hypothesize that the issue has to do with the
@Harsha-Nori I split the parser problem out into its own issue as it doesn't really belong in this PR. Hope I was clear enough over there.
tests/test_json_schema_to_grammar.py
Outdated
def check_string_with_grammar(input_string: str, grammar: GrammarFunction):
    parser = EarleyCommitParser(grammar)

    print(f"Checking {input_string}")
    for c in input_string:
        print(f"Working on: {c}")
        print(f"Valid next bytes: {parser.valid_next_bytes()}")
        next_byte = bytes(c, encoding="utf8")
        print(f"Consuming: {next_byte}")
        parser.consume_byte(next_byte)
I think we should shift this logic over to the GrammarFunction.match() method -- I'd slowly like to use that method as the main check for grammar compliance across the codebase
Done
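A side note on the character loop in check_string_with_grammar quoted above: converting one character at a time yields exactly one byte per step only for ASCII input, which is all the test strings in this thread use. The sketch below assumes the parser expects a single byte per consume_byte call, which this thread does not confirm. `iter_single_bytes` is a hypothetical helper and not part of guidance.

```python
def iter_single_bytes(text: str):
    """Yield each UTF-8 byte of `text` as a length-1 bytes object."""
    for value in text.encode("utf8"):
        yield bytes([value])


# Per-character conversion, as in the quoted loop: one item per character.
per_char = [bytes(c, encoding="utf8") for c in "é"]

# Per-byte conversion: one item per encoded byte.
per_byte = list(iter_single_bytes("é"))

print(per_char)  # a single two-byte item
print(per_byte)  # two one-byte items
```

For pure-ASCII strings such as '["a","b","c","d"]' the two approaches produce identical sequences, so the difference only matters once tests include non-ASCII text.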
tests/test_json_schema_to_grammar.py
Outdated
for c in simple_json_string:
    print(f"Working on: {c}")
    print(f"Valid next bytes: {parser.valid_next_bytes()}")
    next_byte = bytes(c, encoding="utf8")
    print(f"Consuming: {next_byte}")
    parser.consume_byte(next_byte)
I assume this can leverage the check_string_with_grammar function defined earlier, right?
This should all be changed to use GrammarFunction.match()
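The refactor requested in this thread amounts to replacing hand-written byte loops in each test with one reusable yes/no check. The sketch below illustrates that pattern only. `ToyLiteralParser` and `matches` are hypothetical stand-ins written for this example; they do not reproduce the real GrammarFunction.match() signature or the EarleyCommitParser behaviour, neither of which is shown here.

```python
class ToyLiteralParser:
    """Hypothetical stand-in for a byte-level parser; accepts one fixed literal."""

    def __init__(self, literal: bytes):
        self._literal = literal
        self._pos = 0

    def valid_next_bytes(self) -> set:
        # The only acceptable next byte is the next byte of the literal.
        if self._pos < len(self._literal):
            return {self._literal[self._pos : self._pos + 1]}
        return set()

    def consume_byte(self, next_byte: bytes) -> None:
        if next_byte not in self.valid_next_bytes():
            raise ValueError(f"Unexpected byte {next_byte!r} at position {self._pos}")
        self._pos += 1

    def is_complete(self) -> bool:
        return self._pos == len(self._literal)


def matches(make_parser, text: str) -> bool:
    """Return True if `text` is fully accepted by a fresh parser, else False."""
    parser = make_parser()
    try:
        for value in text.encode("utf8"):
            parser.consume_byte(bytes([value]))
    except ValueError:
        return False
    return parser.is_complete()


def make_toy_parser():
    return ToyLiteralParser(b'["a"]')


print(matches(make_toy_parser, '["a"]'))  # accepted in full
print(matches(make_toy_parser, '["a"'))   # rejected: input ends early
```

With a helper of this shape, each test reduces to a single assertion on the result rather than a repeated loop with print statements.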
@riedgar-ms I think this looks great so far! Would be nice to have a test that leverages an actual
Added a smoke test using the
Thanks @riedgar-ms for writing this up! I think there are a few more open questions here -- layering in
Also, thanks @hudson-ai for all the comments and implementations you've done on your fork! Excited to see this continue to progress with your help too :).
A very basic guidance grammar builder for JSON schema. Handles the basic types, although the set of valid strings is more restricted than JSON (because making new things acceptable is preferable to rejecting things previously allowed).

There is no support for cross references; that will have to wait until a future PR. There is also no support for constraints. I do not expect constraints such as multipleOf or exclusiveMinimum to be supportable. It is possible that some of the constraints such as minItems, maxLength and pattern may be supportable.
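To make the remark about minItems-style constraints concrete, here is a minimal sketch that uses a regular expression as a stand-in for a grammar. This is not how the PR builds grammars. `SAFE_STRING` and `array_of_strings_pattern` are hypothetical names, and the narrow string form (no quotes or backslashes inside) is an assumption chosen for illustration. The idea is that an item-count bound constrains the shape of the output, so it can be written as bounded repetition.

```python
import re

# Assumption: a deliberately narrow string form with no embedded quotes or escapes.
SAFE_STRING = r'"[^"\\]*"'


def array_of_strings_pattern(min_items: int, max_items: int) -> re.Pattern:
    """Build a pattern for a compact JSON array of strings with bounded length."""
    if min_items < 0 or max_items < min_items:
        raise ValueError("need 0 <= min_items <= max_items")
    if max_items == 0:
        body = ""
    else:
        # One leading item, then a bounded number of ",item" repetitions.
        extra_min = max(min_items - 1, 0)
        extra = f"(?:,{SAFE_STRING}){{{extra_min},{max_items - 1}}}"
        body = f"{SAFE_STRING}{extra}"
        if min_items == 0:
            # The empty array is allowed, so the whole body is optional.
            body = f"(?:{body})?"
    return re.compile(rf"\[{body}\]")


one_to_three = array_of_strings_pattern(1, 3)
print(bool(one_to_three.fullmatch('["a","b"]')))
print(bool(one_to_three.fullmatch('[]')))
```

The sketch covers only an item-count bound. It says nothing about numeric constraints such as multipleOf, which depend on the value of a number and not on the shape of the text.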