Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change default encoding to utf-8 in normalize_header_key and normalize_header_value functions #3238

Open
ZM25XC opened this issue Jul 11, 2024 · 5 comments

Comments

@ZM25XC
Copy link

ZM25XC commented Jul 11, 2024

Description

I have encountered decoding errors with some requests that use ASCII encoding. Changing the default encoding to UTF-8 resolves these errors. I propose updating the normalize_header_key and normalize_header_value functions in _utils.py to use UTF-8 as the default encoding.

Steps to Reproduce

  1. Call normalize_header_key or normalize_header_value with a non-ASCII string and no encoding specified.
  2. Observe the decoding failure with ASCII encoding.
  3. Change the encoding to UTF-8 and observe that the error is resolved.

Example Code

header_key_unicode = "内容类型"
normalized_key_unicode = normalize_header_key(header_key_unicode, lower=True)
# This raises a UnicodeEncodeError with ASCII encoding.

normalized_key_unicode_utf8 = normalize_header_key(header_key_unicode, lower=True, encoding="utf-8")
print(normalized_key_unicode_utf8)  # Works correctly with UTF-8 encoding.

Proposed Solution

Modify the _utils.py file to use UTF-8 as the default encoding:

def normalize_header_key(
    value: str | bytes,
    lower: bool,
    encoding: str | None = None,
) -> bytes:
    """
    Coerce str/bytes into a strictly byte-wise HTTP header key.
    """
    if isinstance(value, bytes):
        bytes_value = value
    else:
        bytes_value = value.encode(encoding or "utf-8")

    return bytes_value.lower() if lower else bytes_value

def normalize_header_value(value: str | bytes, encoding: str | None = None) -> bytes:
    """
    Coerce str/bytes into a strictly byte-wise HTTP header value.
    """
    if isinstance(value, bytes):
        return value
    return value.encode(encoding or "utf-8")

Rationale

Using UTF-8 as the default encoding ensures that the functions can handle a wider range of input values without raising an error. UTF-8 encoding is capable of encoding a larger set of characters compared to ASCII.

@iamjatinyadav
Copy link

hey guys, can I work on this?

@ZM25XC
Copy link
Author

ZM25XC commented Jul 14, 2024

hey guys, can I work on this?
Thank you for your interest in this issue! I am currently working on a Pull Request to address this problem. If you have any suggestions or would like to review the changes, your input would be greatly appreciated.

@tomchristie
Copy link
Member

tomchristie commented Jul 23, 2024

Heya, thanks for the consideration... I think this may be valid, tho could you re-frame it so that you're describing the issue against public API. For example describe this using httpx.Headers(...). We can then unpick exactly what should and should not be valid.

Edit: Okay, I see PR #3241 now. We don't want to support utf-8 in header keys which have a constrained set of allowed characters. I do think it's a sensible default for header values tho.

@ZM25XC
Copy link
Author

ZM25XC commented Jul 24, 2024

嘿,谢谢你的考虑......我认为这可能是有效的,您能否重新构建它,以便您描述针对公共 API 的问题。例如,使用 来描述这一点。然后,我们可以准确地选择哪些内容应该有效,哪些内容不应该有效。httpx.Headers(...)

编辑:好的,我现在看到 PR #3241。我们不希望在具有一组受约束的允许字符的标头键中支持 utf-8。我确实认为这是标头值的明智默认值。

In a request, if the headers contain non-ASCII characters and the encoding is not specified, an error will be raised. For example:
Clip_2024-07-24_20-39-32

from httpx import Headers

x = Headers(
    {
        "Referer": "https://www.google.com/search?q=テスト",
    }
)
print(x)

In the example code, the Referer field contains the Japanese word テスト, which will raise an error and interrupt the request.

The same issue can occur in responses. If the server's response does not specify the encoding, httpx will use ASCII encoding to decode テスト, resulting in an error.

Therefore, I suggest changing the default encoding from ASCII to UTF-8 to better handle such cases.
Clip_2024-07-24_20-42-15

@Cycloctane
Copy link

@ZM25XC UTF-8 is not accepted in headers. You can only use ISO-8859-1 in HTTP headers according to rfc.

If you want to put utf8 characters in the url and pass it to referer, you should url encode all utf8 characters first. In fact, this is how this url actually is.

from httpx import Headers
from urllib.parse import quote

x = Headers(
    {
        "Referer": f"https://www.google.com/search?q={quote('テスト')}",
    }
)
print(x)
# Headers({'referer': 'https://www.google.com/search?q=%E3%83%86%E3%82%B9%E3%83%88'})

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants