[Feature] Add extra support to JSON string schema #927

Merged

Conversation

riedgar-ms
Collaborator

@riedgar-ms riedgar-ms commented Jun 25, 2024

Looking to add support for pattern, minLength and maxLength to strings in JSON schema. This should address half of #925

This has led to the creation of a few new library functions to help out, and their consolidation into a new _sequences.py library file. It is also hooked into _regex.py.

@hudson-ai
Collaborator

Thanks for getting this started! Just a few comments for you as you're working on this...

  1. I'm a little bit wary about exposing direct regex generation via pattern as it will be difficult to enforce that a user's regex is properly escaped and will lead to something that is JSON loads-able... Although maybe all we have to do is document that using this functionality is "at your own risk"
  2. In the same vein, it may be impossible to enforce both a pattern and min/max length...
  3. I think you can do away with the direct select call inside _gen_json_string and just use a regex (optionally adding {min,max})... Take a look at the regex I sent on the related issue -- may be helpful
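
Point 3 can be illustrated with the standard library alone. This is a minimal sketch, not the regex from the related issue and not guidance's implementation: the per-character pattern `JSON_CHAR` and the helper name `json_string_regex` are assumptions made for this example. It wraps a single-character pattern in a `{min,max}` quantifier and checks that what it accepts is still loadable by `json.loads`.

```python
import json
import re

# Assumed pattern for ONE character of a JSON string body: an unescaped
# character (no quote, backslash, or control character), a simple escape,
# or a \uXXXX escape.
JSON_CHAR = r'(?:[^"\\\x00-\x1f]|\\["\\/bfnrt]|\\u[0-9a-fA-F]{4})'


def json_string_regex(min_length=0, max_length=None):
    """Hypothetical helper: regex for a quoted JSON string with length bounds."""
    upper = "" if max_length is None else str(max_length)
    return f'"{JSON_CHAR}{{{min_length},{upper}}}"'


pattern = re.compile(json_string_regex(min_length=2, max_length=4))

for candidate in ['"ab"', '"a\\nb"', '"a"', '"abcde"']:
    if pattern.fullmatch(candidate):
        # Anything the regex accepts should also be valid JSON.
        print(candidate, "accepted, decodes to", repr(json.loads(candidate)))
    else:
        print(candidate, "rejected")
```

Note that this counts each escape sequence as one unit, so a surrogate pair written as two `\u` escapes would count as two even though it decodes to a single code point.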

@riedgar-ms
Collaborator Author

For (1), I'd absolutely say that it's "at your own risk" functionality.

For (2), I'm quite confident that we cannot simultaneously enforce both a pattern and a min/maxLength. Imagine something like "pattern": "aaaa", "maxLength": 2 which is impossible to satisfy.
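
The conflict can be confirmed by brute force. This sketch uses Python's `re` module as a stand-in for the schema's pattern semantics (an assumption: JSON Schema patterns are unanchored, which `re.search` approximates), and the helper name `satisfiable` is hypothetical.

```python
import itertools
import re


def satisfiable(pattern, max_length, alphabet="ab"):
    """Return True if some string over `alphabet` with at most `max_length`
    characters contains a match for `pattern` (hypothetical helper)."""
    regex = re.compile(pattern)
    for length in range(max_length + 1):
        for chars in itertools.product(alphabet, repeat=length):
            if regex.search("".join(chars)):
                return True
    return False


print(satisfiable("aaaa", max_length=2))  # False: no string that short can match
print(satisfiable("aaaa", max_length=4))  # True: "aaaa" itself fits
```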

Of more immediate concern is that my negative test cases are failing really weirdly :-/

@hudson-ai
Collaborator

Ah, the {b'"'} != {b'"'} isn't as weird as it looks -- it's a repr problem. Annoyingly valid_next_bytes doesn't actually return a set of bytes; it returns a set of Bytes and ByteRanges (which look like bytes in their reprs).
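
A minimal reproduction of this kind of repr confusion, using a hypothetical stand-in class (`FakeByte`) rather than guidance's actual grammar types:

```python
class FakeByte:
    """Hypothetical stand-in for a grammar node wrapping a single byte."""

    def __init__(self, byte):
        self.byte = byte

    def __repr__(self):
        # Mimics a repr that is indistinguishable from a plain bytes literal.
        return repr(self.byte)


returned = {FakeByte(b'"')}  # what a grammar API might hand back
expected = {b'"'}            # what the test compares against

print(returned, expected)    # both sets print identically
print(returned == expected)  # False: the elements have different types

# Comparing the underlying bytes avoids the misleading failure message.
print({item.byte for item in returned} == expected)  # True
```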

@riedgar-ms
Collaborator Author

That was on my "to investigate" list; was suspecting something like that.

Of more immediate concern: if you look closely at the failing tests, I believe you'll find that they ought to be passing.

Comment on lines 26 to 27
from ._at_most_n_repeats import at_most_n_repeats
from ._exactly_n_repeats import exactly_n_repeats
Collaborator

Could be nice to use/reuse these to reduce redundant code in the regex implementation :)

Comment on lines 123 to 129
lm += exactly_n_repeats(value=select(STRING_CHARS), n_repeats=min_length)
lm += at_most_n_repeats(value=select(STRING_CHARS), n_repeats=(max_length - min_length))
else:
lm += select(
STRING_CHARS,
recurse=True,
)
Collaborator

This functionality feels reusable! I.e. repeat(value, min, max) with sane behaviors when min and max are unset. This logic happens in the regex code too, so you should be able to consolidate
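
One way such a helper could treat unset bounds is sketched below. It operates on regex fragments rather than guidance grammars so that it runs standalone; the name `repeat` and its three arguments come from the comment above, and everything else (parameter names, defaults, error handling) is an assumption.

```python
import re


def repeat(value, min_length=None, max_length=None):
    """Sketch: repeat the regex fragment `value` between min and max times.

    An unset minimum means zero; an unset maximum means unbounded.
    """
    low = 0 if min_length is None else min_length
    if low < 0:
        raise ValueError("min_length must be non-negative")
    if max_length is None:
        return f"(?:{value}){{{low},}}"
    if max_length < low:
        raise ValueError("max_length must be >= min_length")
    # Same decomposition as the diff: exactly `low` copies,
    # then at most (max_length - low) more.
    return f"(?:{value}){{{low}}}(?:{value}){{0,{max_length - low}}}"


print(repeat("[ab]"))        # (?:[ab]){0,}
print(repeat("[ab]", 1, 3))  # (?:[ab]){1}(?:[ab]){0,2}
print(bool(re.fullmatch(repeat("[ab]", 1, 3), "aba")))   # True
print(bool(re.fullmatch(repeat("[ab]", 1, 3), "abab")))  # False
```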

@codecov-commenter

codecov-commenter commented Jun 26, 2024

Codecov Report

Attention: Patch coverage is 98.30508% with 1 line in your changes missing coverage. Please review.

Project coverage is 59.18%. Comparing base (ba86754) to head (bd30725).

Files                        Patch %   Lines
guidance/library/_json.py    96.00%    1 Missing ⚠️

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #927      +/-   ##
==========================================
+ Coverage   57.18%   59.18%   +1.99%     
==========================================
  Files          64       63       -1     
  Lines        4711     4733      +22     
==========================================
+ Hits         2694     2801     +107     
+ Misses       2017     1932      -85     

@pytest.mark.parametrize(
["bad_string", "good_bytes", "failure_byte", "allowed_bytes"],
[
('""', b'"', b'"', None),
Collaborator Author

I had hoped to be able to use Select(STRING_CHARS)._values for the 'allowed_bytes' in this 'too short' case, but that wasn't quite getting it right. So I'm allowing None to be passed, to turn off that check

@riedgar-ms riedgar-ms changed the title [WIP] [Feature] Add extra support to JSON string schema [Feature] Add extra support to JSON string schema Jun 26, 2024
model += at_most_n_repeats(value=value, n_repeats=(max_length - min_length))
elif min_length is not None:
model += exactly_n_repeats(value=value, n_repeats=min_length)
model += select([optional(value)], recurse=True)
Collaborator

Can you reuse zero_or_more here to reduce code duplication?

Collaborator Author

I had actually been planning on reimplementing those two functions in terms of sequence(). Of course, something isn't quite right, and the errors I'm getting are very Pythonic.

Collaborator

Oh fair enough. Sorry about your python 😂

Collaborator Author

There is one test involving the mocked model which is interfering with zero_or_more() being a synonym for sequence(). It's related to the max_tokens argument on gen, so not a core part of this PR, though.

Collaborator

Would you open an issue if it feels worthy? :)

Collaborator Author

Done. It could also be something strange in the way the mock model is working. But it would be good to run down.

@hudson-ai
Collaborator

Looking good! One ask -- could you maybe reuse the sequence code inside of _regex.py so we don't have two implementations of it?

Collaborator

@hudson-ai hudson-ai left a comment

LGTM! Thanks @riedgar-ms

@riedgar-ms riedgar-ms merged commit 137130f into guidance-ai:main Jun 27, 2024
100 checks passed
@riedgar-ms riedgar-ms deleted the riedgar-ms/json-string-length-01 branch June 27, 2024 23:16