Fix crash on multipart/form-data post #1743
1e4c3ba
to
f48499c
Compare
@@ -409,7 +409,8 @@ def post(self):
             out.add(field.name, ff)
         else:
             value = yield from field.read(decode=True)
-            if content_type.startswith('text/'):
+            if content_type is None or \
+                    content_type.startswith('text/'):
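For context, a minimal sketch of why the original check fails and what the patched condition does. The helper name `should_decode` is hypothetical and not part of aiohttp; it only mirrors the condition in the diff.

```python
def should_decode(content_type):
    """Mirror the patched condition: decode when the part has no
    Content-Type, or when its Content-Type is a text/* type."""
    return content_type is None or content_type.startswith('text/')


# The unpatched check calls a str method on None, which raises.
try:
    None.startswith('text/')
except AttributeError as exc:
    print('unpatched check fails:', exc)

print(should_decode(None))
print(should_decode('text/plain'))
print(should_decode('image/png'))
```

Whether `None` should map to "decode" is exactly the question discussed in the review comments.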
I think that in the case of None, you really cannot be sure whether it is safe to decode or not. Better to leave the data as is.
Yes, I have been thinking about this too, but most post() use cases do not return bytes. When a user calls post(), he probably wants the same kind of result whether the data is multipart or url-encoded.
If the user cares about the raw data (bytes), he can call multipart() directly and process the post data himself.
When you post a form with fields like textboxes in a browser like Firefox, e.g.
<form method="post" enctype="multipart/form-data">
<input type="hidden" name="p1" value="v1"/>
<input type="submit"/>
</form>
The browser usually does not set a Content-Type header for such a subpart of the post.
Files are not affected by this commit.
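To illustrate, here is a sketch of the kind of body such a form produces, parsed with the standard library `email` package rather than with aiohttp. The boundary string `XYZ` and the exact bytes are made up for the example; real browsers generate their own boundary.

```python
from email import message_from_bytes

# Hand-written multipart/form-data body; the boundary "XYZ" is an assumption.
raw = (
    b'Content-Type: multipart/form-data; boundary=XYZ\r\n'
    b'\r\n'
    b'--XYZ\r\n'
    b'Content-Disposition: form-data; name="p1"\r\n'
    b'\r\n'
    b'v1\r\n'
    b'--XYZ--\r\n'
)

message = message_from_bytes(raw)
part = message.get_payload()[0]

# The field name comes from the Content-Disposition header.
field_name = part.get_param('name', header='content-disposition')
# The part carries no Content-Type header at all, so this lookup gives None.
content_type_header = part['Content-Type']
value = part.get_payload(decode=True).strip()

print(field_name, content_type_header, value)
```

A parser that reads the header value directly therefore has to handle `None` for ordinary text inputs.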
I agree with @kxepal: we should not decode data if we do not know the content type.
It would be very hard to reason about an exception if one occurred in this code.
It is a hard decision, and I have an open mind about it. There are three ways to handle this situation:
- When no Content-Type is provided, assume it is a utf-8 string
- When no Content-Type is provided, always keep it as bytes
- When no Content-Type is provided, first try to parse it as a utf-8 string (with "strict"), and if an exception occurs, return the raw bytes
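The three options can be sketched as small helpers. The function names are hypothetical and none of this is aiohttp code; it only shows how the return type differs between the options.

```python
def decode_assume_utf8(raw: bytes) -> str:
    """Option 1: always treat the value as UTF-8 text (may raise)."""
    return raw.decode('utf-8')


def keep_bytes(raw: bytes) -> bytes:
    """Option 2: never decode; the caller decides what to do."""
    return raw


def decode_or_bytes(raw: bytes):
    """Option 3: try strict UTF-8, fall back to the raw bytes."""
    try:
        return raw.decode('utf-8', errors='strict')
    except UnicodeDecodeError:
        return raw


text = 'caf\u00e9'.encode('utf-8')   # valid UTF-8
binary = b'\xff\xfe\x00\x01'          # not valid UTF-8

print(decode_or_bytes(text), decode_or_bytes(binary))
```

Option 3 never raises, but the caller can no longer rely on a single return type.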
Each has its own advantages and disadvantages. I'm looking at the code that processes application/x-www-form-urlencoded data, which is:
data = yield from self.read()
if data:
    charset = self.charset or 'utf-8'
    out.extend(
        parse_qsl(
            data.rstrip().decode(charset),
            encoding=charset))
Notice that this piece of code assumes the charset to be utf-8 when no charset is provided through the Content-Type header (note that a %NN-encoded character is really a byte). It always decodes the data into a string. So I suggest using the same strategy for the multipart/form-data format.
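A standalone sketch of that url-encoded behaviour, using only `urllib.parse.parse_qsl` from the standard library. The wrapper `parse_urlencoded` is a hypothetical name, and its `charset` argument stands in for `self.charset` in the snippet quoted in this comment.

```python
from urllib.parse import parse_qsl


def parse_urlencoded(data: bytes, charset=None):
    """Decode a url-encoded body, defaulting to UTF-8 when no charset is known."""
    charset = charset or 'utf-8'
    return parse_qsl(data.rstrip().decode(charset), encoding=charset)


# %C3%A9 is the two-byte UTF-8 encoding of U+00E9.
utf8_pairs = parse_urlencoded(b'name=caf%C3%A9&p1=v1\r\n')
# %E9 is the single-byte latin-1 encoding of the same character.
latin1_pairs = parse_urlencoded(b'name=caf%E9', 'latin-1')

print(utf8_pairs, latin1_pairs)
```

In both cases the caller receives `str` values, never `bytes`.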
As I have said, a lot of browsers do not send a Content-Type header for sub-parts of the form data; most of the time they are indeed encoded as utf-8. There is nothing a developer can do about this. If multipart/form-data post data is parsed into bytes, a developer is forced to check the data type returned by post() every time if he wants to accept both formats. Deciding not to decode a bytes object is easy, but the user may be surprised to see that the return types for multipart/form-data and application/x-www-form-urlencoded are so different. And he would also have a hard time when some tools or browsers actually provide the Content-Type header.
Once we reach a conclusion, maybe we should add it to the documentation.
"When no Content-Type is provided, always keep it as bytes"
That will be fine in all cases. Browsers are just another kind of HTTP client with their own specifics.
"As I have said, a lot of browsers do not send a Content-Type header for sub-parts of the form data; most of the time they are indeed encoded as utf-8."
They actually do this for simple input fields, not file inputs. I'm worried about the "most of the time" part of your post, but in any case, there is no reason here to give browsers any preference.
Well, according to RFC 7578
4.4. Content-Type Header Field for Each Part
Each part MAY have an (optional) "Content-Type" header field, which
defaults to "text/plain". If the contents of a file are to be sent,
the file data SHOULD be labeled with an appropriate media type, if
known, or "application/octet-stream".
So it really SHOULD be treated as "text/plain"... And if "text/plain" is decoded to unicode with utf-8 as the default encoding, the same should apply to content without a Content-Type header.
I also tested the simple HTML page with Firefox, Internet Explorer and Edge; they all send the text without a Content-Type header, even when the input field contains non-ASCII characters.
Anyway, if you do not change your mind, I don't mind changing the logic to what you are proposing.
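A small sketch of that default, using the standard library `email.message.Message` to parse the header value. The helper name `effective_type_and_charset` is hypothetical, and falling back to utf-8 when no charset parameter is present is an assumption taken from this discussion, not something RFC 7578 section 4.4 itself states.

```python
from email.message import Message


def effective_type_and_charset(content_type_header):
    """Return (media type, charset) for a part, treating a missing
    Content-Type header as text/plain."""
    part = Message()
    if content_type_header is not None:
        part['Content-Type'] = content_type_header
    # get_content_type() falls back to text/plain when the header is absent.
    media_type = part.get_content_type()
    # Assumption: use utf-8 when the part names no charset.
    charset = part.get_content_charset() or 'utf-8'
    return media_type, charset


print(effective_type_and_charset(None))
print(effective_type_and_charset('text/plain; charset=ISO-8859-1'))
print(effective_type_and_charset('application/octet-stream'))
```

With a helper like this, the `startswith('text/')` check never sees `None`.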
FYI
https://tools.ietf.org/html/rfc7578#page-5
and also these chapters
5.1.2. Interpreting Forms and Creating multipart/form-data Data
Some applications of this specification will supply a character
encoding to be used for interpretation of the multipart/form-data
body. In particular, HTML 5 [W3C.REC-html5-20141028] uses
o the content of a "_charset_" field, if there is one;
o the value of an accept-charset attribute of the <form> element, if
there is one;
o the character encoding of the document containing the form, if it
is US-ASCII compatible;
o otherwise, UTF-8.
5.1.3. Parsing and Interpreting Form Data
While this specification provides guidance for the creation of
multipart/form-data, parsers and interpreters should be aware of the
variety of implementations. File systems differ as to whether and
how they normalize Unicode names, for example. The matching of form
elements to form-data parts may rely on a fuzzier match. In
particular, some multipart/form-data generators might have followed
the previous advice of [RFC2388] and used the "encoded-word" method
of encoding non-ASCII values, as described in [RFC2047]:
encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
Others have been known to follow [RFC2231], to send unencoded UTF-8,
or even to send strings encoded in the form-charset.
For this reason, interpreting multipart/form-data (even from
conforming generators) may require knowing the charset used in form
encoding in cases where the _charset_ field value or a charset
parameter of a "text/plain" Content-Type header field is not
supplied.
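As a rough sketch of what these two sections imply for a parser: pick a charset from whatever is known, and be ready to decode RFC 2047 encoded-words. Both helper names are hypothetical, and the precedence order in `pick_charset` is an assumption for illustration, not something the RFC mandates for servers. Only `email.header.decode_header` and `make_header` are real standard library calls.

```python
from email.header import decode_header, make_header


def pick_charset(part_charset=None, charset_field=None, form_charset=None):
    """Return the first known charset, else UTF-8.

    Assumed order: charset parameter of the part, then the value of a
    "_charset_" form field, then a charset supplied by the application.
    """
    return part_charset or charset_field or form_charset or 'utf-8'


def decode_encoded_word(value: str) -> str:
    """Decode RFC 2047 encoded-words; plain strings pass through unchanged."""
    return str(make_header(decode_header(value)))


print(pick_charset())
print(pick_charset(charset_field='iso-8859-1'))
print(decode_encoded_word('=?UTF-8?B?w6k=?='))
```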
Love RFC references! Thanks for them. I guess RFC 7578 section 4.4 states pretty clearly what to do in this case, so we can follow it.
@fafhrd91 are you OK with that as well?
I am ok
3c618da
to
a6dae6d
Compare
@hubo1016 please add yourself to the contributors list
@fafhrd91 Done. Where and what should I add to CHANGES.rst?
Add it to the 2.0 branch; I am planning to release 2.0.3 today. Thanks!
@hubo1016 do not worry about CHANGES.rst, I will add the entry
What do these changes do?
When multipart/form-data is posted to an aiohttp server and processed by request.post(), any field without a filename and without a "Content-Type" header crashes the request at the content_type.startswith("text/") check. Many browsers and tools generate this kind of post data.
Are there changes in behavior for the user?
No. Opinions may differ on whether to decode the data to a unicode string or leave it as bytes, but either is better than crashing.
Related issue number
Checklist
CONTRIBUTORS.txt
CHANGES.rst
#issue_number
format at the end of changelog message. Use Pull Request number if there are no issues for PR or PR covers the issue only partially.