
multipart.BodyPartReader appends CRLF #454

Closed
vbraun opened this issue Aug 2, 2015 · 7 comments

@vbraun

vbraun commented Aug 2, 2015

I'm trying to handle binary files uploaded with a Polymer file-upload component (https://github.com/winhowes/file-upload). On the browser side (Chrome) it just sticks the file in a FormData object, so I believe the sent form is correct. I'm using

        reader = aiohttp.MultipartReader.from_response(request)
        first_part = yield from reader.next()
        data = yield from first_part.read()

to parse it on the server side and always get a trailing CRLF added to data.

As an example, the browser sends

[...]
0100500  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0     003
0100520  \0  \0  \0  \0  \0  \0   +  \0  \0  \0  \0  \0  \0  \0  \0  \0
0100540  \0  \0  \0  \0  \0  \0 001  \0  \0  \0  \0  \0  \0  \0  \0  \0
0100560  \0  \0  \0  \0  \0  \0  \r  \n   -   -   -   -   -   -   W   e
0100600   b   K   i   t   F   o   r   m   B   o   u   n   d   a   r   y
0100620   9   X   r   p   F   i   8   S   X   a   A   2   9   t   P   v
0100640   -   -  \r  \n
0100644

and data (incorrectly) ends with [...]\0\0\0\0\0\0\r\n. Looking at the aiohttp tests, this is by design:

class PartReaderTestCase(TestCase):

    def setUp(self):
        super().setUp()
        self.boundary = b'--:'

    def test_next(self):
        obj = aiohttp.multipart.BodyPartReader(
            self.boundary, {}, Stream(b'Hello, world!\r\n--:'))
        result = yield from obj.next()
        self.assertEqual(b'Hello, world!\r\n', result)

But that seems wrong to me: the CRLF is supposed to be part of the boundary and should NOT be included in the body part.

The relevant RFC http://www.w3.org/Protocols/rfc1341/7_2_Multipart.html says: "NOTE: The CRLF preceding the encapsulation line is considered part of the boundary so that it is possible to have a part that does not end with a CRLF (line break). Body parts that must be considered to end with line breaks, therefore, should have two CRLFs preceding the encapsulation line, the first of which is part of the preceding body part, and the second of which is part of the encapsulation boundary."
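
To make the RFC rule concrete, here is a small hypothetical payload (my own example, not taken from aiohttp) using the same --: boundary as the test quoted above:

# A part that is NOT meant to end with a line break: the single CRLF before
# the closing boundary belongs to the boundary, not to the body.
raw = (b'--:\r\n'
       b'\r\n'
       b'Hello, world!'
       b'\r\n--:--\r\n')
expected_body = b'Hello, world!'

# A part that IS meant to end with a line break needs two CRLFs:
# the first belongs to the body, the second to the boundary.
raw_with_newline = (b'--:\r\n'
                    b'\r\n'
                    b'Hello, world!\r\n'
                    b'\r\n--:--\r\n')
expected_body_with_newline = b'Hello, world!\r\n'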

@kxepal feature #273

@kxepal
Member

kxepal commented Aug 2, 2015

Mmm... true. And I know how I've been avoiding that problem: in all my cases binary data falls into the read_chunk branch, since it always has a content length defined, while length-less parts always fit on one line. Eventually.

The problem is in the readline method, which is used when the body part length is unknown, since that's the only safe way to parse it. However, it also means that to fix this issue it has to look ahead one line in order to strip the accidental CRLF from the last data line. I always thought the _unread buffer was a hack, but it seems it has to be a legit solution.
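
Roughly, the look-ahead could work like this (a minimal sketch with my own naming, not the actual aiohttp code; lines is assumed to yield CRLF-terminated lines from the stream):

def iter_part_lines(lines, boundary=b'--:'):
    """Yield the lines of one body part.  The CRLF ending the last data
    line belongs to the closing boundary, so it gets stripped.  Sketch only."""
    pending = None
    for line in lines:
        if line.startswith(boundary):   # looked ahead: the next line is the boundary
            if pending is not None:
                yield pending[:-2] if pending.endswith(b'\r\n') else pending
            return
        if pending is not None:
            yield pending
        pending = line

# With the data from the test case quoted above:
#   list(iter_part_lines(iter([b'Hello, world!\r\n', b'--:'])))  ->  [b'Hello, world!']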

The test, therefore, validates the wrong behaviour (:

Will prepare the fix soon. If you have your own ideas about it, feel free to share (:

@kxepal
Member

kxepal commented Aug 2, 2015

@vbraun see #455 PR

@vbraun
Author

vbraun commented Aug 2, 2015

Thanks, I'll give it a try tomorrow!

I also didn't find a way to read in chunks except by using readline; that's not really ideal either, as there is no guarantee that there are any newlines. The boundary marker has a fixed maximal length (70 bytes), so it should be easy enough to read and scan in chunks. Am I missing something?

@kxepal
Member

kxepal commented Aug 2, 2015

@vbraun That's easy:

reader = aiohttp.MultipartReader.from_response(request)
first_part = yield from reader.next()
while True:
    chunk = yield from first_part.read_chunk()  # falsy once the part is exhausted
    if not chunk:
        break
    fobj.write(chunk)  # fobj: any writable file object

The only requirement is that the body part must define a Content-Length header.

Ideally, of course, it would be possible to read length-less parts in chunks too, but it's hard: the boundary may be split across two or, even worse, 3+ chunks, and I don't know what a sane algorithm would be to correctly merge such partial reads, avoid false-positive look-ahead reads, and so on. Too many issues, pitfalls and edge cases for one missing header.
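
For what it's worth, the usual trick for a delimiter that straddles chunk borders is to hold back a short tail of each chunk and rescan it together with the next one. A rough, hypothetical sketch (read is any callable returning up to n bytes; delimiter would be b'\r\n--' plus the boundary, so the held-back tail stays tiny thanks to the 70-byte cap mentioned above):

def iter_body_chunks(read, delimiter, chunk_size=8192):
    """Yield body data up to (but not including) delimiter.  Sketch only;
    a real reader would also have to hand everything from the delimiter
    onward back to the outer MultipartReader."""
    buf = b''
    while True:
        data = read(chunk_size)
        buf += data
        idx = buf.find(delimiter)
        if idx != -1:
            yield buf[:idx]          # everything before the delimiter is body
            return
        if not data:                 # stream ended without a delimiter
            if buf:
                yield buf
            return
        # hold back enough bytes that a delimiter split across two
        # chunks is still found on the next pass
        keep = len(delimiter) - 1
        if len(buf) > keep:
            yield buf[:-keep]
            buf = buf[-keep:]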

asvetlov added a commit that referenced this issue Aug 3, 2015
#454: Don't append CRLF for the last read line
@asvetlov
Member

asvetlov commented Aug 3, 2015

Fixed by #455

asvetlov closed this as completed Aug 3, 2015
@vbraun
Author

vbraun commented Aug 3, 2015

Thanks for the quick fix!

And yes, I should have said that Chrome doesn't send a Content-Length header so read_chunk doesn't work.

The boundary is guaranteed to be <= 70 bytes so all you have to do is read ahead by 70 bytes. Just wrapping a stream to stop at given boundaries is straightforward. What makes chunked reads difficult with the current implementation is that this is not abstracted into a stream wrapper but handled across MultipartReader and BodyPartReader, handing _unread pieces around.

@lock

lock bot commented Oct 29, 2019

This thread has been automatically locked since there has not been
any recent activity after it was closed. Please open a new issue for
related bugs.

If you feel like there are important points made in this discussion,
please include those excerpts in the new issue.

lock bot added the outdated label Oct 29, 2019
lock bot locked as resolved and limited conversation to collaborators Oct 29, 2019