Ensure JSON representation is compact. #3363
This bug has a huge impact on the `openai` package (1.52.0).
It can hurt text understanding by LLMs and also generate extra costs for intensive LLM users. Here's a comparison in the official OpenAI tokenizer to prove the point: ASCII encoding ==> 21 tokens! Please have this issue fixed.
For those who are looking for a workaround, you can override the method that causes the issue in the entry point of your app (`main.py`, for example). The patch would be:

```python
from typing import Any
from json import dumps as json_dumps

import httpx._content
from httpx import ByteStream


def custom_httpx_encode_json(json: Any) -> tuple[dict[str, str], ByteStream]:
    # Disable ASCII escaping so non-ASCII characters are sent as raw UTF-8
    body = json_dumps(json, ensure_ascii=False).encode("utf-8")
    content_length = str(len(body))
    content_type = "application/json"
    headers = {"Content-Length": content_length, "Content-Type": content_type}
    return headers, ByteStream(body)


# Monkey-patch httpx's internal JSON encoder to fix the UTF-8 encoding issue
httpx._content.encode_json = custom_httpx_encode_json
```
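To see the size difference the patch makes, here is a stdlib-only sketch (the payload and field name are illustrative) comparing the body produced by the default `json.dumps` behavior with the patched UTF-8 variant:

```python
import json

# Illustrative payload containing non-ASCII text (Polish "cześć")
payload = {"text": "cześć"}

# Default: every non-ASCII character becomes a 6-character \uXXXX escape
ascii_body = json.dumps(payload).encode("utf-8")

# Patched: raw UTF-8, no escaping
utf8_body = json.dumps(payload, ensure_ascii=False).encode("utf-8")

print(len(ascii_body), len(utf8_body))  # 27 19
```

The `Content-Length` header computed from the body shrinks accordingly, and the savings grow with the proportion of non-ASCII text in the payload.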
We're compliant with the JSON spec here, and parsers will decode text escape sequences into the correct Unicode text. Though you are correct that a more compact format would be sensible here. Would you like to submit a pull request resolving this? The following related issues could all be resolved together...

Useful review points are the …
Thank you for your reply. I will submit a pull request to resolve the issue.
Thanks @BERRADA-Omar |
Discussed in #3204
Originally posted by alleml May 17, 2024
`httpx.Request` blindly dumps the provided `json` to ASCII characters by invoking `json.dumps` without any parameters. Currently there is no way to provide the `ensure_ascii=False` argument to the `Request` instance (or, even better, to the client). It leads to the following scenario: `json.dumps` produces ASCII escape sequences when I need properly encoded Unicode bytes:

```
b"cze\xc5\x9b\xc4\x87"
```
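The scenario above can be reproduced with the standard library alone; the string `"cześć"` is the one whose UTF-8 bytes appear above:

```python
import json

s = "cześć"

# Default: non-ASCII characters are escaped to \uXXXX sequences
print(json.dumps(s))                      # "cze\u015b\u0107"

# With ensure_ascii=False the raw characters are kept...
print(json.dumps(s, ensure_ascii=False))  # "cześć"

# ...so encoding yields the expected UTF-8 bytes
print(s.encode("utf-8"))                  # b'cze\xc5\x9b\xc4\x87'
```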