add more type checking in params/form-data #3170

trim21 · 2024-04-17T11:54:22Z

Initially raised as discussion bytes in form is handled in a very strange way #3168

Problem

we should check input value types. Current behavoir which simply convert input value to str(v) is not very ideal.

for example:

import httpx


r = httpx.post(
    "https://httpbin.org/post",
    data={
        "content": bytes.fromhex(
            "4f5f13d16da9f10d013933673a1ef4e7b974dbe3"
        ),  # random non-printable bytes
    },
)
print(r.json()["form"])

will get {'content': "b'O_\\x13\\xd1m\\xa9\\xf1\\r\\x0193g:\\x1e\\xf4\\xe7\\xb9t\\xdb\\xe3'"}

Proposal

add type checking and more coerce.

for example:

form

convert int/float to str with str(v)
bool (Represent as JSON-style "true"/"false")
None (Represent as empty string)
Enum (Use .value and then as above.)
TypeError for anything else.

query

except about, query should support bytes with percent encoding and do not throw.

multipart_data

not sure how to handle this....

This will cause breaking changes, so we should consider do a minor version bump.

related: #1500 #1539 #1608

httpx/httpx/_content.py

Lines 136 to 149 in 4b85e6c

    
           def encode_urlencoded_data( 
        
               data: RequestData, 
        
           ) -> tuple[dict[str, str], ByteStream]: 
        
               plain_data = [] 
        
               for key, value in data.items(): 
        
                   if isinstance(value, (list, tuple)): 
        
                       plain_data.extend([(key, primitive_value_to_str(item)) for item in value]) 
        
                   else: 
        
                       plain_data.append((key, primitive_value_to_str(value))) 
        
               body = urlencode(plain_data, doseq=True).encode("utf-8") 
        
               content_length = str(len(body)) 
        
               content_type = "application/x-www-form-urlencoded" 
        
               headers = {"Content-Length": content_length, "Content-Type": content_type} 
        
               return headers, ByteStream(body)

httpx/httpx/_utils.py

Lines 56 to 68 in 4b85e6c

    
           def primitive_value_to_str(value: PrimitiveData) -> str: 
        
               """ 
        
               Coerce a primitive data type into a string value. 
        
               Note that we prefer JSON-style 'true'/'false' for boolean values here. 
        
               """ 
        
               if value is True: 
        
                   return "true" 
        
               elif value is False: 
        
                   return "false" 
        
               elif value is None: 
        
                   return "" 
        
               return str(value)

The text was updated successfully, but these errors were encountered:

trim21 · 2024-05-31T13:38:16Z

there is also another issue, query string should support bytes natively.

for example:

import httpx
import urllib.parse

r = httpx.get(
    "https://httpbin.org/get",
    params={"q": bytes.fromhex("E1EE0E2734986F5419BB6CB6252BD9377183440E")},
)

print(urllib.parse.quote(bytes.fromhex("E1EE0E2734986F5419BB6CB6252BD9377183440E")))

print(r.text)

expected url should be https://httpbin.org/get?q=%E1%EE%0E%274%98oT%19%BBl%B6%25%2B%D97q%83D%0E, instead of ValueError or TypeError, or currently https://httpbin.org/get?q=b\"\\xe1\\xee\\x0e'4\\x98oT\\x19\\xbbl\\xb6%25%2B\\xd97q\\x83D\\x0e\"

tomchristie · 2024-06-03T10:43:17Z

query string should support bytes natively.

I'm not convinced by that. Why?

trim21 · 2024-06-03T10:47:31Z

query string should support bytes natively.

I'm not convinced by that. Why?

because percent encoding can take native bytes as input

trim21 · 2024-06-03T10:53:08Z

when we say it support "string", it's actually assuming percent encoded bytes is utf8 content.

consider this:

from urllib.parse import quote, quote_from_bytes


utf8_s = "你好"

assert quote(utf8_s) == quote_from_bytes(utf8_s.encode())

https://example.com/?q=你好 is actually https://example.com/?q=%E4%BD%A0%E5%A5%BD

that's also what urllib.parse.quote is doing by default, it encode str (utf8 by default) to bytes then use quote_from_bytes

tomchristie · 2024-06-03T11:18:33Z

I think this is confusing a couple of separate issues...

Should str-coercion be performed.
Should bytes be a supported type.

I'm think there's a good case that we shouldn't str-coerce types outside of the expected range. I don't currently think there's a good case that we should support bytes.

trim21 · 2024-06-03T11:24:02Z

I would vote for both...

consider this:

httpx.get("https://example.com/?q=%E1%EE%0E%274%98oT%19%BBl%B6%25%2B%D97q%83D%0E", params={"a": 1})

looks fine, right?

but actually request url is ?q=%EF%BF%BD%EF%BF%BD%0E%274%EF%BF%BDoT%19%EF%BF%BDl%EF%BF%BD%25%2B%EF%BF%BD7q%EF%BF%BDD%0E&a=1, which it not right.

correct value of q is bytes.fromhex('e1ee0e2734986f5419bb6cb6252bd9377183440e'), but httpx send url with q bytes.fromhex('efbfbdefbfbd0e2734efbfbd6f5419efbfbd6cefbfbd252befbfbd3771efbfbd440e')

also send bytes as query value is supported by requests， which is handled correctly

cknv · 2024-06-03T11:34:29Z

Just to chip in a bit here:

I have been using httpx to interact with a legacy system and as test client to mimic that legacy system calling back into our starlette based API.
Some of the query fields can occasionally be bytes, these bytes are either just bytes that only has meaning to systems outside of my control, or UTF-16 encoded strings (which as an aside can also mean that I have to produce a mixed content query string).

In order to get the correct values through httpx and not 'b"value"' I have had to resort to manually building the query myself with the stdlib and appending it on the URL.
This does work, but it took some time to arrive at and I wouldn't exactly call it "ergonomic".

But I also realize that my case is not common.

tomchristie · 2024-06-03T16:57:58Z

consider this (...)

httpx.get("https://example.com/?q=%E1%EE%0E%274%98oT%19%BBl%B6%25%2B%D97q%83D%0E", params={"a": 1})

This is valid, tho simplify it...

>>> r = httpx.get('https://www.example.com?q=%EE', params={'a': '1'})
>>> r.request.url
URL('https://www.example.com?q=%EF%BF%BD&a=1')

Simplify it more...

>>> httpx.QueryParams('q=%EE')
QueryParams('q=%EF%BF%BD')

Eh?

Root cause is...

>>> from urllib.parse import unquote
>>> from urllib.parse import quote
>>> quote(unquote('%EE'))  # stdlib behaving similarly
'%EF%BF%BD'
>>> unquote('%EE')  # The hex code here isn't a valid UTF-8 codepoint, and is being replaced.
'�'
>>> unquote('%EE', errors='strict')  # We'd raise an error if we decoded it with 'strict'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/tom.christie/.pyenv/versions/3.10.6/lib/python3.10/urllib/parse.py", line 667, in unquote
    append(unquote_to_bytes(bits[i]).decode(encoding, errors))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xee in position 0: unexpected end of data

trim21 · 2024-06-03T17:01:42Z

consider this (...)

httpx.get("https://example.com/?q=%E1%EE%0E%274%98oT%19%BBl%B6%25%2B%D97q%83D%0E", params={"a": 1})

This is valid, tho simplify it...

>>> r = httpx.get('https://www.example.com?q=%EE', params={'a': '1'})
>>> r.request.url
URL('https://www.example.com?q=%EF%BF%BD&a=1')

Simplify it more...

>>> httpx.QueryParams('q=%EE')
QueryParams('q=%EF%BF%BD')

Eh?

Root cause is...

>>> from urllib.parse import unquote
>>> from urllib.parse import quote
>>> quote(unquote('%EE'))  # stdlib behaving similarly
'%EF%BF%BD'
>>> unquote('%EE')  # The hex code here isn't a valid UTF-8 codepoint, and is being replaced.
'�'
>>> unquote('%EE', errors='strict')  # We'd raise an error if we decoded it with 'strict'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/tom.christie/.pyenv/versions/3.10.6/lib/python3.10/urllib/parse.py", line 667, in unquote
    append(unquote_to_bytes(bits[i]).decode(encoding, errors))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xee in position 0: unexpected end of data

yes, that's a problem, I think.

if quote_from_bytes(unquote_to_bytes("%EE")) then everything is fine, in this case.

tomchristie · 2024-06-03T17:10:25Z

@cknv Could you give a (simplified) example please?

cknv · 2024-06-03T19:08:45Z

Sure @tomchristie! Get ready for some flashback to the dark ages, before UTF8 won the encoding wars.

from urllib.parse import urlencode
import httpx

client = httpx.Client()

values = {
    "expected-as-utf8": "Hello world",
    "expected-as-int": 1234,
    "expected-as-latin1": "Hello world".encode("latin1"),
    "expected-as-binary": b"\x01\x02\x03\x04",
}

encoded_query = urlencode(values)

# replaced .get with .build_request
# as making requests does not seem important for demonstrating my workaround
correctly_encoded_request = client.build_request(
    method="GET",
    url=f"http://localhost?{encoded_query}",
)
incorrecty_encoded_request = client.build_request(
    method="GET",
    url="http://localhost",
    params=values,
)

# compare and contrast:
print(correctly_encoded_request.url.params)
print(incorrecty_encoded_request.url.params)

When actually rereading the code, it wasn't UTF-16 that was used, but instead actually latin1, so I replaced that in the example, although either of them would need to be encoded into bytes.

Narrowing our focus to just the plain byte value, it should be expected-as-binary=%01%02%03%04, and my workaround indeed does produce that.
Without the workaround and just feeding the values dict into the request builders params the value is reprd (via str) and produces expected-as-binary=b%27%5Cx01%5Cx02%5Cx03%5Cx04%27 which is the percent encoded version of expected-as-binary=b'\x01\x02\x03\x04'.

The latin1 value works without the workaround as long as you stick to the subset that is identical with ASCII, but we do at times get data through that isn't just that subset.

To add some context the server receiving the request is written in C, and has a very lax attitude to encodings because it's all just bytes anyway.

If httpx would percent encode bytes, I would probably be able to remove that workaround. However I would not be surprised if there are other ways to send bytes over http either established in an RFC or just by tradition, if so I would not know which one is the most appropriate, but it appears like percent encoding is at least in the RFC.

Finally, I would like to reiterate that I know that this is way outside the mainstream, I have mostly accepted my workaround, but I at least wanted to raise my hand when someone else expressed a similar problem.

Graeme22 · 2024-07-30T16:24:50Z

+1 for using .value for Enums in param encoding

trim21 · 2024-08-27T14:52:57Z

any news on this？

tomchristie · 2024-08-27T18:21:19Z

Currently I'd suggest we sharpen up on this and raise type errors on non-string values... #3176 (comment)

HTML form encoding only supports strings, let's help our users be explicit here.

trim21 · 2024-08-27T23:51:38Z

HTML form encoding only supports strings, let's help our users be explicit here.

This is limited by html/browser, not HTTP protocol.

httpx is a HTTP client for Python, not a browser. We should support more low level operations. For example, we won't do CORS check like browser, right?

Also this is not entirely true. HTML form is post method with application/x-www-form-urlencoded or multipart/form-data encoded data as body, application/x-www-form-urlencoded is same encoding we used in url query, which suport any binary data.

And in non-utf8 worlds, browsers allow you send non-utf8 string as form value. but we don't have gbk string in python, so we need to use bytes.

And browsers also have URLSearchParams API supporting any binary data.

https://developer.mozilla.org/en-US/docs/Web/API/URLSearchParams#preserving_plus_signs

this not only a question should bytes be a supported type, it's a question should we support non-utf8 encoded string

Which I think the answer should be YES.

trim21 · 2024-08-28T01:01:11Z

Currently I'd suggest we sharpen up on this and raise type errors on non-string values... #3176 (comment)

HTML form encoding only supports strings, let's help our users be explicit here.

After seconds thoughts, I think HTML have different definitions of string as we have in Python worlds.

In HTML, string can be any encoding， but it's false in python.

tomchristie · 2024-09-27T10:57:55Z

I'm going to close this off in favor of the broader issue #3176.

trim21 changed the title ~~add more type checking in headers/params/form-data~~ add more type checking in params/form-data Apr 17, 2024

trim21 mentioned this issue Apr 17, 2024

fix: add type checking in form data #3173

Open

3 tasks

trim21 mentioned this issue May 31, 2024

feat: support bytes as query value #3212

Open

3 tasks

trim21 mentioned this issue Aug 27, 2024

feat: add max reponse body limit go-resty/resty#830

Merged

trim21 mentioned this issue Sep 25, 2024

Misleading error when header value is None #3313

Closed

tomchristie closed this as completed Sep 27, 2024

trim21 mentioned this issue Sep 27, 2024

Stricter typing for request parameters. #3176

Open

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add more type checking in params/form-data #3170

add more type checking in params/form-data #3170

trim21 commented Apr 17, 2024 •

edited

Loading

trim21 commented May 31, 2024 •

edited

Loading

tomchristie commented Jun 3, 2024

trim21 commented Jun 3, 2024

trim21 commented Jun 3, 2024 •

edited

Loading

tomchristie commented Jun 3, 2024

trim21 commented Jun 3, 2024 •

edited

Loading

cknv commented Jun 3, 2024 •

edited

Loading

tomchristie commented Jun 3, 2024

trim21 commented Jun 3, 2024 •

edited

Loading

tomchristie commented Jun 3, 2024

cknv commented Jun 3, 2024

Graeme22 commented Jul 30, 2024

trim21 commented Aug 27, 2024

tomchristie commented Aug 27, 2024

trim21 commented Aug 27, 2024 •

edited

Loading

trim21 commented Aug 28, 2024

tomchristie commented Sep 27, 2024

add more type checking in params/form-data #3170

add more type checking in params/form-data #3170

Comments

trim21 commented Apr 17, 2024 • edited Loading

Problem

Proposal

form

query

multipart_data

trim21 commented May 31, 2024 • edited Loading

tomchristie commented Jun 3, 2024

trim21 commented Jun 3, 2024

trim21 commented Jun 3, 2024 • edited Loading

tomchristie commented Jun 3, 2024

trim21 commented Jun 3, 2024 • edited Loading

cknv commented Jun 3, 2024 • edited Loading

tomchristie commented Jun 3, 2024

trim21 commented Jun 3, 2024 • edited Loading

tomchristie commented Jun 3, 2024

cknv commented Jun 3, 2024

Graeme22 commented Jul 30, 2024

trim21 commented Aug 27, 2024

tomchristie commented Aug 27, 2024

trim21 commented Aug 27, 2024 • edited Loading

trim21 commented Aug 28, 2024

tomchristie commented Sep 27, 2024

trim21 commented Apr 17, 2024 •

edited

Loading

trim21 commented May 31, 2024 •

edited

Loading

trim21 commented Jun 3, 2024 •

edited

Loading

trim21 commented Jun 3, 2024 •

edited

Loading

cknv commented Jun 3, 2024 •

edited

Loading

trim21 commented Jun 3, 2024 •

edited

Loading

trim21 commented Aug 27, 2024 •

edited

Loading