
Ensure JSON representation is compact. #3363

Closed
BERRADA-Omar opened this issue Oct 24, 2024 Discussed in #3204 · 5 comments
Labels
perf Issues relating to performance

Comments

@BERRADA-Omar
Contributor

Discussed in #3204

Originally posted by alleml May 17, 2024
httpx.Request unconditionally escapes the provided json to ASCII characters by invoking json.dumps without any arguments. Currently there is no way to pass the ensure_ascii=False argument to the Request instance (or, even better, to the client).

It leads to the following scenario:

>>> httpx.Request(method="POST", url="http://localhost:8000/test", json={"field": "cześć"}).content
b'{"field": "cze\\u015b\\u0107"}'

when I need properly encoded unicode bytes: b"cze\xc5\x9b\xc4\x87"
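The difference comes down to the ensure_ascii parameter of the stdlib's json.dumps, which httpx does not expose. A minimal sketch reproducing the two behaviors with the stdlib alone:

```python
import json

payload = {"field": "cześć"}

# Default behavior (ensure_ascii=True): non-ASCII characters are
# replaced with \uXXXX escape sequences, as httpx currently does.
escaped = json.dumps(payload).encode("utf-8")

# With ensure_ascii=False the text is kept as raw UTF-8 bytes.
raw = json.dumps(payload, ensure_ascii=False).encode("utf-8")

print(escaped)  # b'{"field": "cze\\u015b\\u0107"}'
print(raw)      # b'{"field": "cze\xc5\x9b\xc4\x87"}'
```

Both forms are valid JSON and decode to the same string; the second is simply the more compact representation.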

@BERRADA-Omar
Contributor Author

BERRADA-Omar commented Oct 24, 2024

This bug has a huge impact on the openai package (1.52.0).
The openai package uses httpx to make its requests, and it turns out that when we make calls in French, the GPT models receive text in the wrong encoding. The encoding should be UTF-8!

For example:

  • input text in French: "salut ça va ? très bien ou pas bien ?"
  • GPT receives in the payload: "salut \u00e7a va ? tr\u00e8s bien ou pas bien ?"

This could have a huge impact on text understanding by LLMs, and could also generate higher costs for intensive LLM users.
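The inflation is easy to measure at the byte level with the stdlib alone (token counts depend on the model's tokenizer, so this sketch only demonstrates the payload-size effect):

```python
import json

text = "salut ça va ? très bien ou pas bien ?"

# Each non-ASCII character ("ç", "è") becomes a 6-byte \uXXXX escape
# under the default ensure_ascii=True...
escaped = json.dumps({"content": text})

# ...but only 2 UTF-8 bytes with ensure_ascii=False.
compact = json.dumps({"content": text}, ensure_ascii=False)

print(len(escaped.encode("utf-8")), ">", len(compact.encode("utf-8")))
```

For escape-heavy languages (accented text, Cyrillic, CJK) the escaped form can be several times larger than the raw UTF-8 form.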

Here's a comparison in the official openai tokenizer to prove the point:

  • UTF-8 correct encoding ==> 11 tokens for GPT-4o
  • ASCII encoding ==> 21 tokens!

(tokenizer screenshots omitted)

Please have this issue fixed.

@BERRADA-Omar
Contributor Author

For those who are looking for a workaround, you can override the method that causes the issue in the entry point of your app (main.py for example).

The patch would be :

from typing import Any

import httpx._content
from httpx import ByteStream
from json import dumps as json_dumps

def custom_httpx_encode_json(json: Any) -> tuple[dict[str, str], ByteStream]:
    # Serialize without escaping non-ASCII characters.
    body = json_dumps(json, ensure_ascii=False).encode("utf-8")
    content_length = str(len(body))
    content_type = "application/json"
    headers = {"Content-Length": content_length, "Content-Type": content_type}
    return headers, ByteStream(body)

# Monkeypatch httpx's internal JSON encoder to work around the issue.
httpx._content.encode_json = custom_httpx_encode_json

@tomchristie
Member

The openai package uses httpx to make its requests, and it turns out that when we make calls in French, the GPT models receive text in the wrong encoding.

We're compliant with the JSON spec here, and parsers will decode the escape sequences into the correct unicode text. Though you are correct that a more compact format would be sensible here. Would you like to submit a pull request resolving this?

The following related issues could all be resolved together...

  • We should use ensure_ascii=False for more compact text representations.
  • We should use separators = (',', ':') for more compact list and object representations.
  • We should use allow_nan=False to disallow invalid Infinity and NaN representations.

Useful review points are the UNICODE_JSON, COMPACT_JSON and STRICT_JSON settings in REST framework where we previously worked through these design issues.
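The three settings listed above can be combined in a single json.dumps call; a minimal sketch (compact_json_dumps is a hypothetical helper name, not an httpx API):

```python
import json

def compact_json_dumps(obj):
    # ensure_ascii=False: keep unicode text as raw UTF-8 (UNICODE_JSON)
    # separators=(",", ":"): drop spaces after "," and ":" (COMPACT_JSON)
    # allow_nan=False: reject NaN/Infinity, which are invalid JSON (STRICT_JSON)
    return json.dumps(obj, ensure_ascii=False, separators=(",", ":"),
                      allow_nan=False)

print(compact_json_dumps({"a": 1, "b": "cześć"}))  # {"a":1,"b":"cześć"}
```

With allow_nan=False, json.dumps raises ValueError on float("nan") or float("inf") instead of emitting the non-standard NaN/Infinity tokens.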

@BERRADA-Omar
Contributor Author


Thank you for your reply.

I will submit a pull request to resolve the issue.
Regards

@tomchristie tomchristie changed the title httpx forces converting json content to ascii Ensure JSON representation is compact. Oct 25, 2024
@tomchristie tomchristie added the perf Issues relating to performance label Oct 25, 2024
BERRADA-Omar added a commit to BERRADA-Omar/httpx that referenced this issue Oct 26, 2024
tomchristie added a commit that referenced this issue Oct 28, 2024
@tomchristie
Member

Thanks @BERRADA-Omar
