[Feature] Enable user-passed separators keyword to guidance.json #1044

Merged
hudson-ai merged 75 commits into guidance-ai:main on Oct 22, 2024

hudson-ai
Collaborator

Main changes:

  1. Refactor json generation code to live in a class with guidance-decorated methods (allows separators to be set at the instance level rather than having to pass them to every function call; note that definitions are now also stored on the instance)
  2. Add separators kwarg that controls the strings we will use for item and key separators (", " and ": " by default). This mirrors the json.dumps API.
  3. Add whitespace_flexible kwarg to enable old default behavior of "letting the LLM decide" on whitespace. This is no longer the default behavior, as it is substantially less efficient from a guidance-acceleration standpoint.
  4. Deprecate compact kwarg in favor of separators = (",", ":")
  5. Proper handling of const and enum types to ensure correct lexical boundaries (these were previously incorrectly handled from the point of view of whitespace flexibility).
  6. Change existing tests to reflect the new default behavior
  7. Add explicit whitespace tests for cases where separators or whitespace_flexible are passed
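For reference, the json.dumps behavior that the separators kwarg is described as mirroring can be checked with the standard library alone. This is a minimal sketch that does not call guidance; `obj` is an illustrative value, not something from the PR.

```python
import json

obj = {"a": 1, "b": [1, 2]}  # illustrative value only

# Default json.dumps separators are (", ", ": ") when indent is None.
default_style = json.dumps(obj)

# The compact form that item 4 suggests in place of the deprecated compact kwarg.
compact_style = json.dumps(obj, separators=(",", ":"))

print(default_style)
print(compact_style)
```

Both strings parse back to the same object; only the separator strings differ.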

Comment on lines 952 to 957
if whitespace_flexible:
    skip_regex = r"[\x20\x0A\x0D\x09]+"
    # Strip whitespace from separators since we'll handle whitespace ourselves
    separators = (separators[0].strip(), separators[1].strip())
else:
    skip_regex = None
Collaborator Author

Could use some input on this. Is it sensible to allow users to pass both separators and whitespace_flexible? If so, is it sensible to strip whitespace off of the separators?
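To make the question concrete, here is the reviewed branch wrapped in a standalone function. The function name `resolve_whitespace` and its return shape are hypothetical, introduced only for illustration; the body follows the six reviewed lines.

```python
import re


def resolve_whitespace(separators, whitespace_flexible):
    """Hypothetical wrapper around the reviewed lines; returns (separators, skip_regex)."""
    if whitespace_flexible:
        # Space, line feed, carriage return, tab: the four JSON whitespace characters.
        skip_regex = r"[\x20\x0A\x0D\x09]+"
        # Whitespace is handled by skip_regex, so drop it from the separators.
        separators = (separators[0].strip(), separators[1].strip())
    else:
        skip_regex = None
    return separators, skip_regex


flexible = resolve_whitespace((", ", ": "), whitespace_flexible=True)
rigid = resolve_whitespace((", ", ": "), whitespace_flexible=False)
print(flexible)
print(rigid)
```

With whitespace_flexible set, a user-passed (", ", ": ") is silently reduced to (",", ":"), which is the behavior being asked about.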

Collaborator

That's a bit tricky... The best guide would probably be what json.dumps() does when you set both separators and indent, but I know that's not quite the same (especially for commas). I don't think we should get too stuck on that point, though, so long as we produce valid JSON. If users have a format which absolutely must be followed, then they can always do json.dumps(json.loads(....), separators=...., indent=.....) themselves.
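A minimal sketch of that post-hoc reformatting, using only the standard library. The `raw` string is a made-up stand-in for generated output, and the separators and indent values are arbitrary choices for illustration.

```python
import json

# Stand-in for JSON produced by a model; not taken from the PR.
raw = '{"name":"Ada","tags":["x","y"]}'

# Round-trip through the parser, then re-serialize in the required format.
pretty = json.dumps(json.loads(raw), separators=(",", ": "), indent=2)

print(pretty)
```

The data is unchanged by the round trip; only the whitespace and separators are rewritten.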

Collaborator Author

Agreed... also, planning on allowing indent to be passed in the future (mirroring the json.dumps interface), but we need to add a bit of machinery to the low-level parser code to support that. I don't think it's actually imperative we give the whitespace_flexible option, especially once we enable indent. But I am not super confident on that, as we really don't know how whitespace-rigidity affects distribution shift...

Collaborator Author

For dumps, the behavior is that separators defaults to

  • (", ", ": ") if indent is None
  • (",", ": ") otherwise

indent makes no modifications to separators if they are passed by the user.
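The json.dumps behavior described here can be observed directly. A small sketch with an illustrative `obj`; note the trailing space before each newline when the user passes ", " together with indent.

```python
import json

obj = {"a": 1, "b": 2}  # illustrative value only

# indent is None: separators default to (", ", ": ").
no_indent = json.dumps(obj)

# indent is set: the item separator defaults to "," so lines carry no trailing space.
with_indent = json.dumps(obj, indent=2)

# User-passed separators are used as given, even alongside indent.
user_seps = json.dumps(obj, indent=2, separators=(", ", ": "))

print(repr(no_indent))
print(repr(with_indent))
print(repr(user_seps))
```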

It may be sensible to do something similar with whitespace_flexible, defaulting to (",", ":") but otherwise "trusting" that the user means what they say when they pass their own. I think that non-None values of indent should raise exceptions when combined with whitespace_flexible though. What do you think @riedgar-ms?
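One way to read this proposal, sketched as a standalone helper. The name `resolve_separators`, its signature, and the error message are all hypothetical; indent is not supported in this PR, so this is a sketch of the suggestion rather than of the merged code.

```python
def resolve_separators(separators=None, whitespace_flexible=False, indent=None):
    """Hypothetical helper sketching the proposed defaulting rules."""
    # Proposed: indent and whitespace_flexible are mutually exclusive.
    if whitespace_flexible and indent is not None:
        raise ValueError("indent cannot be combined with whitespace_flexible")
    # Trust the user: explicitly passed separators are never modified.
    if separators is not None:
        return tuple(separators)
    # Flexible whitespace is handled elsewhere, so default to bare punctuation.
    if whitespace_flexible:
        return (",", ":")
    # Otherwise follow the json.dumps defaults.
    if indent is not None:
        return (",", ": ")
    return (", ", ": ")


print(resolve_separators())
print(resolve_separators(whitespace_flexible=True))
print(resolve_separators(separators=(" , ", " : "), whitespace_flexible=True))
```

Under this reading, the stripping step discussed earlier would no longer be needed, since separators are only chosen when the user has not supplied them.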

Collaborator

That sounds like a reasonable approach. I really wouldn't get too hung up on this, though:

  1. So long as we produce valid JSON, the users will have all manner of tools for reformatting it later
  2. Any extra constraints on the output need documentation stating that forcing a particular whitespace format (in any form) risks pushing the model 'off distribution', and that documentation should refer users back to the previous point

@hudson-ai
Collaborator Author

I'm open to allowing users to pass regular expressions for the seps, but note that adding "flexible whitespace" to those isn't going to give the same behavior as top-level flexible whitespace (which can, for example, add newlines and indents after opening braces).
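The difference can be illustrated with plain regular expressions over one fixed object shape. This is only a sketch using Python's `re`, not guidance's grammar machinery; the patterns, the `WS` name, and the sample strings are assumptions made for the example.

```python
import re

# Zero or more JSON whitespace characters (assumed pattern for this sketch).
WS = r"[\x20\x0A\x0D\x09]*"

# Whitespace allowed only as part of the separators themselves.
item_sep = WS + "," + WS
key_sep = WS + ":" + WS
sep_only = re.compile(r'\{"a"' + key_sep + "1" + item_sep + '"b"' + key_sep + r"2\}")

# Whitespace allowed between any two tokens (top-level skipping).
tokens = [r"\{", '"a"', ":", "1", ",", '"b"', ":", "2", r"\}"]
top_level = re.compile(WS.join(tokens))

spaced = '{"a" : 1 ,\n"b": 2}'           # extra whitespace only around separators
pretty = '{\n  "a": 1,\n  "b": 2\n}'     # newline and indent after the opening brace

print(bool(sep_only.fullmatch(spaced)), bool(sep_only.fullmatch(pretty)))
print(bool(top_level.fullmatch(spaced)), bool(top_level.fullmatch(pretty)))
```

The separator-only pattern accepts `spaced` but rejects `pretty`, because nothing in it can consume the whitespace that follows the opening brace.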

@hudson-ai
Collaborator Author

Not included in this PR, but planned future work: add indent kwarg (again mirroring the json.dumps API).

@codecov-commenter

codecov-commenter commented Oct 8, 2024

Codecov Report

Attention: Patch coverage is 96.56652% with 8 lines in your changes missing coverage. Please review.

Project coverage is 63.89%. Comparing base (917fe35) to head (ede4b67).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| guidance/library/_json.py | 96.56% | 8 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1044      +/-   ##
==========================================
- Coverage   71.96%   63.89%   -8.07%     
==========================================
  Files          63       63              
  Lines        4769     4795      +26     
==========================================
- Hits         3432     3064     -368     
- Misses       1337     1731     +394     
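The figures in this diff can be cross-checked with a few lines of arithmetic. The sketch below assumes coverage is hits divided by lines; the reported percentages are consistent with truncation to two decimals, though whether Codecov truncates in general is an assumption.

```python
def coverage_bp(hits, lines):
    """Coverage in basis points (hundredths of a percent), truncated."""
    return hits * 10000 // lines


base = {"lines": 4769, "hits": 3432, "misses": 1337}
head = {"lines": 4795, "hits": 3064, "misses": 1731}

# Hits and misses should account for every line on both sides.
consistent = all(s["hits"] + s["misses"] == s["lines"] for s in (base, head))

base_bp = coverage_bp(base["hits"], base["lines"])
head_bp = coverage_bp(head["hits"], head["lines"])
delta_bp = head_bp - base_bp

print(consistent)
print(f"{base_bp / 100:.2f}% -> {head_bp / 100:.2f}% ({delta_bp / 100:+.2f})")
```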


@hudson-ai
Collaborator Author

@Harsha-Nori @riedgar-ms ping for visibility

Collaborator

@riedgar-ms left a comment

Thanks!

@hudson-ai hudson-ai merged commit 855ce5b into guidance-ai:main Oct 22, 2024
100 checks passed