Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change JSON serialization to custom json.dumps #31100

Merged

Conversation

junrae6454
Copy link
Contributor

What does this PR do?

  • This PR modifies the apply_chat_template function to improve handling of non-ASCII characters in JSON output by setting ensure_ascii to False. This change ensures that characters such as "안녕?" and emojis are correctly rendered in their original form rather than as escaped sequences.
  • In the original tojson implementation of Jinja2, the characters "<", ">", "&", and "'" are escaped to ensure safe HTML output. However, in apply_chat_template, we use json.dumps directly to preserve the original content as much as possible.
    jinja2 default tojson

relative issue

#31041
#31079

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@CISC
Copy link
Contributor

CISC commented May 29, 2024

Yes, this makes sense, didn't catch the escaping of HTML characters.

May I suggest also removing the sort_keys parameter so it goes back to the default of not sorting keys, this way the caller has control over the order, which can improve inference.

@junrae6454
Copy link
Contributor Author

@Rocketknight1
When you have a moment, could you please review my latest pull request? Your feedback would be greatly appreciated.

@Rocketknight1
Copy link
Member

Sorry for the delay! This makes sense to me, however there is a risk of affecting other Jinja implementations. @xenova how does huggingface/jinja deal with |tojson?

@CISC
Copy link
Contributor

CISC commented Jun 12, 2024

@Rocketknight1 AFAICT huggingface.js/jinja doesn't support tojson filter.

@xenova
Copy link
Contributor

xenova commented Jun 12, 2024

No, not yet, but it should be really easy to implement!

@Rocketknight1
Copy link
Member

Okay! In that case, pinging @amyeroberts for core maintainer review.

Brief summary: We often want to flatten lists/dicts into a string in a chat template, especially when it deals with tools. Jinja was originally designed for templating in webpages, and as a result it escapes special HTML characters by default, but this is not what we want in a chat template. This PR swaps the default tojson filter in Jinja for Python's json.dumps() function, which doesn't have this awkward feature.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM - thanks for updating!

@Rocketknight1
Copy link
Member

@junrae6454 Thanks for the PR! I made one small change - I moved the lambda function out into a proper function, and exposed a couple of the other arguments, as I feel like they might be useful later for rendering JSON.

Will merge once tests pass

@Rocketknight1 Rocketknight1 merged commit 17896f6 into huggingface:main Jun 12, 2024
21 checks passed
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Jun 14, 2024
* Change JSON serialization to custom json.dumps to prevent escaping of "<", ">", "&", "'"

* caller has control over the order, remove sort_key=True

* Move tojson into a proper function and expose a couple of other args

---------

Co-authored-by: jun.4 <jun.4@kakaobrain.com>
Co-authored-by: Matt <rocketknight1@gmail.com>
itazap pushed a commit that referenced this pull request Jun 17, 2024
* Change JSON serialization to custom json.dumps to prevent escaping of "<", ">", "&", "'"

* caller has control over the order, remove sort_key=True

* Move tojson into a proper function and expose a couple of other args

---------

Co-authored-by: jun.4 <jun.4@kakaobrain.com>
Co-authored-by: Matt <rocketknight1@gmail.com>
itazap pushed a commit that referenced this pull request Jun 17, 2024
* Change JSON serialization to custom json.dumps to prevent escaping of "<", ">", "&", "'"

* caller has control over the order, remove sort_key=True

* Move tojson into a proper function and expose a couple of other args

---------

Co-authored-by: jun.4 <jun.4@kakaobrain.com>
Co-authored-by: Matt <rocketknight1@gmail.com>
itazap pushed a commit that referenced this pull request Jun 17, 2024
* Change JSON serialization to custom json.dumps to prevent escaping of "<", ">", "&", "'"

* caller has control over the order, remove sort_key=True

* Move tojson into a proper function and expose a couple of other args

---------

Co-authored-by: jun.4 <jun.4@kakaobrain.com>
Co-authored-by: Matt <rocketknight1@gmail.com>
itazap pushed a commit that referenced this pull request Jun 18, 2024
* Change JSON serialization to custom json.dumps to prevent escaping of "<", ">", "&", "'"

* caller has control over the order, remove sort_key=True

* Move tojson into a proper function and expose a couple of other args

---------

Co-authored-by: jun.4 <jun.4@kakaobrain.com>
Co-authored-by: Matt <rocketknight1@gmail.com>
itazap pushed a commit that referenced this pull request Jun 20, 2024
* Change JSON serialization to custom json.dumps to prevent escaping of "<", ">", "&", "'"

* caller has control over the order, remove sort_key=True

* Move tojson into a proper function and expose a couple of other args

---------

Co-authored-by: jun.4 <jun.4@kakaobrain.com>
Co-authored-by: Matt <rocketknight1@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants