
Ensure JSON representation is compact. #3363

Closed
BERRADA-Omar opened this issue Oct 24, 2024 Discussed in #3204 · 5 comments
Labels
perf Issues relating to performance

Comments

@BERRADA-Omar
Contributor

Discussed in #3204

Originally posted by alleml May 17, 2024
httpx.Request unconditionally escapes the provided json to ASCII characters by invoking json.dumps without any arguments. Currently there is no way to pass the ensure_ascii=False argument to the Request instance (or, even better, to the client).

It leads to the following scenario:

>>> httpx.Request(method="POST", url="http://localhost:8000/test", json={"field": "cześć"}).content
b'{"field": "cze\\u015b\\u0107"}'

when I need properly encoded unicode bytes: b"cze\xc5\x9b\xc4\x87"
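The difference comes down to the ensure_ascii parameter of the stdlib's json.dumps, which httpx does not expose. A minimal sketch reproducing the two behaviors with the stdlib alone:

```python
import json

payload = {"field": "cześć"}

# Default behavior (ensure_ascii=True): non-ASCII characters are
# replaced with \uXXXX escape sequences, as httpx currently does.
escaped = json.dumps(payload).encode("utf-8")

# With ensure_ascii=False the text is kept as raw UTF-8 bytes.
raw = json.dumps(payload, ensure_ascii=False).encode("utf-8")

print(escaped)  # b'{"field": "cze\\u015b\\u0107"}'
print(raw)      # b'{"field": "cze\xc5\x9b\xc4\x87"}'
```

Both forms are valid JSON and decode to the same string; the second is simply the more compact representation.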

@BERRADA-Omar
Contributor Author

BERRADA-Omar commented Oct 24, 2024

This bug has a huge impact on the openai package (1.52.0).
The openai package uses httpx to make its requests, and it turns out that when we make calls in French, the GPT models receive text in the wrong encoding. The encoding should be UTF-8!

For example:

  • input text in French: "salut ça va ? très bien ou pas bien ?"
  • GPT receives in the payload: "salut \u00e7a va ? tr\u00e8s bien ou pas bien ?"

This could have a huge impact on text understanding by LLMs, and could also generate higher costs for intensive LLM users.
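The inflation is easy to measure at the byte level with the stdlib alone (token counts depend on the model's tokenizer, so this sketch only demonstrates the payload-size effect):

```python
import json

text = "salut ça va ? très bien ou pas bien ?"

# Each non-ASCII character ("ç", "è") becomes a 6-byte \uXXXX escape
# under the default ensure_ascii=True...
escaped = json.dumps({"content": text})

# ...but only 2 UTF-8 bytes with ensure_ascii=False.
compact = json.dumps({"content": text}, ensure_ascii=False)

print(len(escaped.encode("utf-8")), ">", len(compact.encode("utf-8")))
```

For escape-heavy languages (accented text, Cyrillic, CJK) the escaped form can be several times larger than the raw UTF-8 form.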

Here's a comparison in the official openai tokenizer to prove the point:

  • UTF-8 correct encoding ==> 11 tokens for GPT-4o
  • ASCII encoding ==> 21 tokens!

(tokenizer screenshots omitted)

Please have this issue fixed.

@BERRADA-Omar
Contributor Author

For those who are looking for a workaround, you can override the method that causes the issue in the entry point of your app (main.py for example).

The patch would be :

from typing import Any

import httpx._content
from httpx import ByteStream
from json import dumps as json_dumps

def custom_httpx_encode_json(json: Any) -> tuple[dict[str, str], ByteStream]:
    # Serialize without escaping non-ASCII characters.
    body = json_dumps(json, ensure_ascii=False).encode("utf-8")
    content_length = str(len(body))
    content_type = "application/json"
    headers = {"Content-Length": content_length, "Content-Type": content_type}
    return headers, ByteStream(body)

# Monkeypatch httpx's internal JSON encoder to work around the issue.
httpx._content.encode_json = custom_httpx_encode_json

@tomchristie
Member

The openai package uses httpx to make its requests, and it turns out that when we make calls in French, the GPT models receive text in the wrong encoding.

We're compliant with the JSON spec here, and parsers will decode the escape sequences into the correct unicode text. Though you are correct that a more compact format would be sensible here. Would you like to submit a pull request resolving this?

The following related issues could all be resolved together...

  • We should use ensure_ascii=False for more compact text representations.
  • We should use separators = (',', ':') for more compact list and object representations.
  • We should use allow_nan=False to disallow invalid Infinity and NaN representations.

Useful review points are the UNICODE_JSON, COMPACT_JSON and STRICT_JSON settings in REST framework where we previously worked through these design issues.
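The three settings listed above can be combined in a single json.dumps call; a minimal sketch (compact_json_dumps is a hypothetical helper name, not an httpx API):

```python
import json

def compact_json_dumps(obj):
    # ensure_ascii=False: keep unicode text as raw UTF-8 (UNICODE_JSON)
    # separators=(",", ":"): drop spaces after "," and ":" (COMPACT_JSON)
    # allow_nan=False: reject NaN/Infinity, which are invalid JSON (STRICT_JSON)
    return json.dumps(obj, ensure_ascii=False, separators=(",", ":"),
                      allow_nan=False)

print(compact_json_dumps({"a": 1, "b": "cześć"}))  # {"a":1,"b":"cześć"}
```

With allow_nan=False, json.dumps raises ValueError on float("nan") or float("inf") instead of emitting the non-standard NaN/Infinity tokens.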

@BERRADA-Omar
Contributor Author


Thank you for your reply.

I will submit a pull request to resolve the issue.
Regards

@tomchristie tomchristie changed the title httpx forces converting json content to ascii Ensure JSON representation is compact. Oct 25, 2024
@tomchristie tomchristie added the perf Issues relating to performance label Oct 25, 2024
BERRADA-Omar added a commit to BERRADA-Omar/httpx that referenced this issue Oct 26, 2024
tomchristie added a commit that referenced this issue Oct 28, 2024
@tomchristie
Member

Thanks @BERRADA-Omar
