-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't quote boundary parameter for multipart unless necessary #2545
Conversation
Also accept byte-strings, although only in a limited form due to an architectural restriction. Also properly quote boundaries when they contain `"` or `\` characters. Some servers don't understand quoted boundaries in multipart documents, which is why it should only be used when necessary.
c63ed75
to
1aaf422
Compare
Codecov Report
@@ Coverage Diff @@
## master #2545 +/- ##
==========================================
+ Coverage 97.78% 97.78% +<.01%
==========================================
Files 36 36
Lines 7225 7236 +11
Branches 1260 1262 +2
==========================================
+ Hits 7065 7076 +11
Misses 55 55
Partials 105 105
Continue to review full report at Codecov.
|
aiohttp/helpers.py
Outdated
# ; any VCHAR, except delimiters | ||
# VCHAR = %x21-7E | ||
valid_tchar_regex = br"^[!#$%&'*+\-.^_`|~\w]+$" | ||
if re.match(valid_tchar_regex, value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use compiled regular exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain the rationale for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compiled regexs are faster, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but only if they are cached properly so they aren't recompiled every time. Python does some internal caching for regular expressions already, so unless we need a lot of regular expressions, the internal cache should cover most use cases.
In fact, if we don't cache the regex and compile it every time we need it, we actually degrade performance compared to having Python (or the re module) handle caching automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose to cache regexps at module level, like [here](https://github.com/aio-libs/aiohttp/blob/master/aiohttp/helpers.py#L452
re
module has 512 slots for LRU cache of compiled regexps.
Thus the program may degrade performance if other code uses non-complied exceptions for solving own tasks. If the total regexps number exceeds 512 -- re
will recompile them constantly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function will only be run once for every multipart document sent, so I believe the performance impact will be minimal at bestworst, but we can do it anyway if that's your standard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
- It is the aiohttp standard.
- The function runs once per multipart document but for high-load site it could clash with other handlers and their functionality.
aiohttp/helpers.py
Outdated
return value | ||
|
||
invalid_qdtext_char_regex = br"[\x00-\x08\x0A-\x1F\x7F]" | ||
if re.match(invalid_qdtext_char_regex, value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compile regex
aiohttp/helpers.py
Outdated
quoted_value_content = value.replace(b'\\', b'\\\\') # %x5C | ||
quoted_value_content = value.replace(b'"', b'\\"') # %x22 | ||
|
||
return b'"' + quoted_value_content + b'"' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why bytes here in and out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The HTTP spec operates on bytes, which is why I designed this helper function to follow that. The only reason why we need to convert back to a string is because the underlying Payload API requires a string for header values. Or at least I think so. If it doesn't, I'll be glad to remove the conversion back to a unicode string.
I could of course special-case the function for just our situation here, but it should go into multipart.py then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, we need to use bytes in the actual multipart body, so we should try to convert exactly that onto a potentially quoted parameter value.
aiohttp/multipart.py
Outdated
except (UnicodeEncodeError, UnicodeDecodeError): | ||
raise ValueError('boundary should contain ASCII only chars') \ | ||
from None | ||
boundary_value = format_parameter_value(self._boundary).decode('ascii') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it looks strange. We take string boundary, turns it into bytes, and then turn it back to stirng to apply formatting. This wouldn't be a bottleneck for sure, but looks like useless work.
aiohttp/multipart.py
Outdated
except UnicodeEncodeError: | ||
raise ValueError('boundary should contains ASCII only chars') | ||
ctype = 'multipart/{}; boundary="{}"'.format(subtype, boundary) | ||
if isinstance(boundary, bytes): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the case for binary boundary input?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is true to how it is actually used and defined in the spec. In this particular situations, users don't get more value out of it because we have to convert the boundary back to a unicode string for the Payload header. Otherwise characters above 0x80 could be used inside the boundary.
Ideally, all headers would work on byte strings (only or additionally), but I didn't consider this to be in the scope of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*bytes above 0x80 could be used. HTTP is an ascii-readable byte-based protocol.
aiohttp/multipart.py
Outdated
if isinstance(boundary, bytes): | ||
boundary.decode('ascii') | ||
self._boundary = boundary | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this if-else
still exists, would be good to move default boundary generation here into another branch. No point to do isinstance check if boundary passed as None since we know that it will not be a bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would complicate (or at least worsen code beauty) of the next part when we try to catch encoding errors.
aiohttp/multipart.py
Outdated
self._boundary = boundary | ||
else: | ||
self._boundary = boundary.encode('ascii') | ||
except (UnicodeEncodeError, UnicodeDecodeError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend to minimize exception catch scope.
Regarding the header stuff, I think I can implement byte string headers and automatic conversion using Regarding the motivation of that suggested change: Also, what about my |
Aiohttp treats header value as an utf-8. I think a lot of servers does the same nowadays. |
Pseudonymin in |
Yes, historically aiohttp uses utf-8 for headers with a backdoor for getting bytes for communication with malformed client/server, maybe we need a hooks for sending custom bytes too. Anyway for me there is no reason for switching to byte-based headers, especially for boundary -- the value perfectly fits in ascii, isn't it? |
No, technically bytes 0x80 to 0xff are allowed as well. When quoted, that is. It's the obs-text (or char?) rule in the abnf I included in the issue and in the comment for the helper function. (I'm on mobile, otherwise I'd quote it myself.) Again, if you don't feel good about the weird conversion between bytes and unicode, I can remove it and it won't be backwards incompatible. It wouldn't allow any bytes higher than 0x7f, but the current implementation doesn't as well and it's perfectly backwards compatible. It just bothers me in a way that the specification isn't respected entirely. |
@FichteFoll don't get me wrong.
We can return to discussion if somebody will complain aiohttp in not supporting non-ascii boundaries. |
I agree with this sentiment. Will adjust the PR tomorrow. Do you want me to squash? Otherwise I'd just commit on top of this so we can keep the other commit in case this ever comes up. It might end up being useful, maybe. |
Just merge master and add required changes. |
aiohttp/helpers.py
Outdated
|
||
invalid_qdtext_char_regex = br"[\x00-\x08\x0A-\x1F\x7F]" | ||
if re.match(invalid_qdtext_char_regex, value): | ||
raise ValueError("parameter value contains invalid characters") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The line is not covered by tests
Also - fix an error when escaping backslashes in the quoted parameter, - cache compiled regular expressions (by aiohttp standards), and - cover all lines in tests.
There is little use in having this property public. If a use case arises, it could easily be made public in the future.
@kxepal please review again |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGFM, but there are two bits remains that concerns me.
aiohttp/multipart.py
Outdated
if re.match(self._valid_tchar_regex, value): | ||
return value.decode('ascii') # cannot fail | ||
|
||
if re.match(self._invalid_qdtext_char_regex, value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you mean re.search
. Because match will not match anything if string starts with valid chars.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call. Fixed that and another issue with the previous regex not considering that $
also matches before a newline.
tests/test_multipart.py
Outdated
def test_bad_boundary(self): | ||
with self.assertRaises(ValueError): | ||
aiohttp.multipart.MultipartWriter(boundary='тест') | ||
with self.assertRaises(ValueError): | ||
aiohttp.multipart.MultipartWriter(boundary='\n') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about boundary=test\n
? (;
@FichteFoll, Thanks! @asvetlov? |
.. attribute:: boundary | ||
|
||
The byte-string representation of the boundary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type bothers me.
We pass optional str
value into constructor but the property is byteish.
Maybe the method should return self._boundary.decode('ascii')
?
It is a breaking change but we could do it in aiohttp 3.0.
Amount of affected user code is very low IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This behavior is unchanged. I just documented it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I see.
But maybe now is the time for behavior changing?
Current one is confusing a little.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, imo everything should be saved as byte-strings anyway (as I explained before), so having the boundary as a byte-string doesn't seem far off. It's only used that way, too, so either we'd add an additional property that converts the boundary from a unicode to a byte string, or we encode it at every point we need it (not a fan).
I'd say we leave this for now. I would rather spend time to convert all header strings to byte-strings rather than unwrapping the boundary here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fear it was bytes all the time. Let's change that for 3.0! This could be done with separate PR with breaking change notice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's merge.
Technically single Pull Request can have several records in CHANGES/
but I'm ok with separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@FichteFoll I have a feeling that all headers should be either strings or bytes.
All means really all, including request.headers
etc.
This change in too intrusive, the ship has sailed many years ago.
Also I think 99% aiohttp users are happy with string headers but byte header values will confuse them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this should ideally be standardized on a single header type. For compatibility, all unicode strings could be converted to byte-strings using ASCII encoding, so that the outgoing API doesn't change much except that it allows byte-strings for all headers too.
However, this goes way beyond the scope of this PR and would be a large architectural consideration that I'm sure you're going to have a much better time deciding on. It might be something worth considering for the next major release, i.e. 3.0, as long as someone puts in the work.
@FichteFoll thanks! |
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a [new issue] for related bugs. |
What do these changes do?
Only quote multipart boundary when necessary.
Also accept byte-strings, although only in a limited form due to anarchitectural restriction.
Also properly quote boundaries when they contain
"
or\
characters.Are there changes in behavior for the user?
Boundaries provided by the user, which contain
"
or\
characters, are properly escaped and invalid characters such as\n
cause an exception.Related issue number
Fixes #2544.
Checklist
CONTRIBUTORS.txt
CHANGES
folder<issue_id>.<type>
for example (588.bug)issue_id
change it to the pr id after creating the pr.feature
: Signifying a new feature..bugfix
: Signifying a bug fix..doc
: Signifying a documentation improvement..removal
: Signifying a deprecation or removal of public API..misc
: A ticket has been closed, but it is not of interest to users.Is it okay for me to add myself to
CONTRIBUTORS.txt
with my pseudonym? I saw nobody else doing it, which is why I didn't for now.