-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] Enable user-passed separators
keyword to guidance.json
#1044
Conversation
…ethods (no unbound methods)
guidance/library/_json.py
Outdated
if whitespace_flexible: | ||
skip_regex = r"[\x20\x0A\x0D\x09]+" | ||
# Strip whitespace from separators since we'll handle whitespace ourselves | ||
separators = (separators[0].strip(), separators[1].strip()) | ||
else: | ||
skip_regex = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use some input on this. Is it sensible to allow users to pass both separators
and whitespace_flexible
? If so, is it sensible to strip whitespace off of the separators?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a bit tricky.... the best guide would probably be what json.dumps()
does when you set both separators
and indent
, but I know that's not quite the same (especially for commas). I don't think we should get too stuck on that point, though, so long as we produce valid JSON. If users have a format which absolutely must be followed, then they can always do json.dumps(json.loads(....), separators=...., indent=.....)
themselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed... also, planning on allowing indent
to be passed in the future (mirroring the json.dumps interface), but we need to add a bit of machinery to the low-level parser code to support that. I don't think it's actually imperative we give the whitespace_flexible option, especially once we enable indent. But I am not super confident on that, as we really don't know how whitespace-rigidity affects distribution shift...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For dumps
, the behavior is that separators
defaults to
(", ", ": ")
ifindent
isNone
(",", ": ")
otherwise
indent
makes no modifications to separators
if they are passed by the user.
It may be sensible to do something similar with whitespace_flexible
, defaulting to (",", ":")
but otherwise "trusting" that the user means what they say when they pass their own. I think that non-None
values of indent
should raise exceptions when combined with whitespace_flexible
though. What do you think @riedgar-ms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds like a reasonable approach. I really wouldn't get too hung up on this, though:
- So long as we produce valid JSON, the users will have all manner of tools for reformatting it later
- Any extra constraints on the output need to have documentation to the effect that forcing a particular whitespace format (in all forms) risks pushing the model 'off distribution' and then refer them back to the previous point
I'm open to allowing users to pass regular expressions for the seps, but note that adding "flexible whitespace" to those isn't going to give the same behavior as top-level flexible whitespace (which can, for example, add newlines and indents after opening braces). |
Not included in this PR, but planned future work: add |
Codecov ReportAttention: Patch coverage is
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #1044 +/- ##
==========================================
- Coverage 71.96% 63.89% -8.07%
==========================================
Files 63 63
Lines 4769 4795 +26
==========================================
- Hits 3432 3064 -368
- Misses 1337 1731 +394 ☔ View full report in Codecov by Sentry. |
@Harsha-Nori @riedgar-ms ping for visibility |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Main changes:
definitions
are now also stored on the instance)separators
kwarg that controls the strings we will use for item and key separators (", "
and": "
by default). This mirrors thejson.dumps
API.whitespace_flexible
kwarg to enable old default behavior of "letting the LLM decide" on whitespace. This is no longer the default behavior, as it is substantially less efficient from a guidance-acceleration standpoint.compact
kwarg in favor ofseparators = (",", ":")
const
andenum
types to ensure correct lexical boundaries (these were previously incorrectly handled from the point of view of whitespace flexibility).separators
orwhitespace_flexible
are passed