Parsing is extremely slow if there are a lot of CR LF characters in the stream #67

vgovtva · 2023-07-20T14:02:17Z

Noticed a slowness in the code when trying to parse objects with a lot of '\r\n' characters in a row.
It seems that the code does too many callbacks to on_part_data, causing potentially many operations, which are not at all necessary (see example and output below).

py version: python 3.9.2
python-multipart version: 0.0.6

Here's a demonstration

$ cat form.py
import timeit
import requests
from multipart.multipart import parse_options_header, MultipartParser


class TestFormDataParser:
    def __init__(self, headers):
        # Parse the Content-Type header to get the multipart boundary.
        self.headers = headers
        _, params = parse_options_header(self.headers['Content-Type'])
        self.charset = params.get(b'charset', 'utf-8')
        if type(self.charset) == bytes:
            self.charset = self.charset.decode('latin-1')
        try:
            self.boundary = params[b'boundary']
        except KeyError:
            raise MultiPartException('Missing boundary in multipart.')

        self.callbacks = {
            'on_part_begin': lambda: None,
            'on_part_data': self.on_part_data,
            'on_part_end': lambda: None,
            'on_header_field': lambda *_: None,
            'on_header_value': lambda *_: None,
            'on_header_end': lambda: None,
            'on_headers_finished': lambda: None,
            'on_end': lambda: None,
        }
        self.f = MultipartParser(self.boundary, self.callbacks)
        self.on_data_calls = 0

    def on_part_data(self, *_):
        self.on_data_calls += 1

    def parse(self, data):
        self.f.write(data)
        self.f.finalize()
        print(f'on_part_data was called {self.on_data_calls} times')

def test():

    test_data = [
        (b'fo' * 300000 + b'\r\n' + b'foobar', "b'fo' * 300000 + b'\\r\\n' + b'foobar',"),
        (b'fo' * 290000 + b'\r\n' * 10000 + b'foobar',"b'fo' * 290000 + b'\\r\\n' * 10000 + b'foobar'"),
        (b'fo' * 150000 + b'\r\n' * 150000 + b'foobar',"b'fo' * 150000 + b'\r\n' * 150000 + b'foobar'"),
        ((b'fo' + b'\r\n') * 150000 + b'foobar',"(b'fo' + b'\\r\\n') * 150000 + b'foobar'"),
        ((b'fo' + b'\r\n' * 5) * 50000 + b'foobar',"(b'fo' + b'\\r\\n' * 5) * 50000 + b'foobar'"),
        ((b'fo' + b'\r\n' * 50) * 5000 + b'foobar',"(b'fo' + b'\\r\\n' * 50) * 5000 + b'foobar'"),
        (b'fo' + b'\r\n' * 300000 + b'foobar',"b'fo' + b'\\r\\n' * 300000 + b'foobar'"),
    ]

    def run_it(data):
        prep = requests.Request('POST', 'http://example.com', files={'sample': sample, 'other': b'baz'}).prepare()
        parser = TestFormDataParser(prep.headers)
        parser.parse(prep.body)

    _min = 100000
    _max = 0
    iters = 1
    for sample, data_format in test_data:
        sample_size = len(sample)
        crlf_count = sample.count(b'\r\n')
        measurement_time = timeit.timeit(lambda: run_it(sample), setup="pass",number=iters)
        print(f'data format: {data_format}')
        print(f'It took {measurement_time:.3f}s to run {iters} iter with sample size: {sample_size}.')
        print(f'Sample contain {crlf_count} \\r\\n characters.\n\n')
        if measurement_time < _min:
            _min = measurement_time
        if measurement_time > _max:
            _max = measurement_time

    print(f'Biggest difference is {_max-_min:.2f}s ({_max/_min*100:.2f}%) for sample of the same size')

if __name__ == '__main__':
    test()

and here's the output that I get:

on_part_data was called 4 times
data format: b'fo' * 300000 + b'\r\n' + b'foobar',
It took 0.005s to run 1 iter with sample size: 600008.
Sample contain 1 \r\n characters.


on_part_data was called 9986 times
data format: b'fo' * 290000 + b'\r\n' * 10000 + b'foobar'
It took 0.057s to run 1 iter with sample size: 600006.
Sample contain 10000 \r\n characters.


on_part_data was called 149986 times
data format: b'fo' * 150000 + b'
' * 150000 + b'foobar'
It took 1.337s to run 1 iter with sample size: 600006.
Sample contain 150000 \r\n characters.


on_part_data was called 14 times
data format: (b'fo' + b'\r\n') * 150000 + b'foobar'
It took 0.005s to run 1 iter with sample size: 600006.
Sample contain 150000 \r\n characters.


on_part_data was called 14 times
data format: (b'fo' + b'\r\n' * 5) * 50000 + b'foobar'
It took 0.004s to run 1 iter with sample size: 600006.
Sample contain 250000 \r\n characters.


on_part_data was called 170002 times
data format: (b'fo' + b'\r\n' * 50) * 5000 + b'foobar'
It took 1.463s to run 1 iter with sample size: 510006.
Sample contain 250000 \r\n characters.


on_part_data was called 299986 times
data format: b'fo' + b'\r\n' * 300000 + b'foobar'
It took 2.096s to run 1 iter with sample size: 600008.
Sample contain 300000 \r\n characters.


Biggest difference is 2.09s (56888.98%) for sample of the same size

In our use case this can lead to some really slow upload times. For example, a 9MB file filled with CR LF can take 2000s+, and in applications based on FastAPI or Starlette (web frameworks), this may lead to blocking the main thread for seconds and up to a minute at a time (depending on size of chunks in the request stream).

This is an edge case, so to say, but because of it, we are forced to use a different formdata parser.

(If any python webserver uses this to parse incoming data, one can quite easily DoS the service)

The text was updated successfully, but these errors were encountered:

rdenadai · 2023-09-21T11:03:10Z

At first glance and investigating a little bit ... looks like this line is being called multiple times by your example code.

https://github.com/andrew-d/python-multipart/blob/master/multipart/multipart.py#L1327

Adding a small elif c in (CR, LF): and keep raising the index seems to work at first glance ... but i wonder for a better approach here.

if boundary[index] == c:
    # If we found a match for our boundary, we send the
    # existing data.
    if index == 0:
        data_callback('part_data')

    # The current character matches, so continue!
    index += 1
elif c in (CR, LF):
    index += 1
else:
    index = 0

Still, in some cases, you still would have lots of calls...

on_part_data was called 4 times
data format: b'fo' * 300000 + b'\r\n' + b'foobar',
It took 0.006s to run 1 iter with sample size: 600008.
Sample contain 1 \r\n characters.

on_part_data was called 3 times
data format: b'fo' * 290000 + b'\r\n' * 10000 + b'foobar'
It took 0.044s to run 1 iter with sample size: 600006.
Sample contain 10000 \r\n characters.

on_part_data was called 3 times
data format: b'fo' * 150000 + b'
' * 150000 + b'foobar'
It took 0.591s to run 1 iter with sample size: 600006.
Sample contain 150000 \r\n characters.

on_part_data was called 14 times
data format: (b'fo' + b'\r\n') * 150000 + b'foobar'
It took 0.005s to run 1 iter with sample size: 600006.
Sample contain 150000 \r\n characters.

on_part_data was called 6 times
data format: (b'fo' + b'\r\n' * 5) * 50000 + b'foobar'
It took 0.005s to run 1 iter with sample size: 600006.
Sample contain 250000 \r\n characters.

on_part_data was called 10002 times
data format: (b'fo' + b'\r\n' * 50) * 5000 + b'foobar'
It took 1.081s to run 1 iter with sample size: 510006.
Sample contain 250000 \r\n characters.

on_part_data was called 3 times
data format: b'fo' + b'\r\n' * 300000 + b'foobar'
It took 1.157s to run 1 iter with sample size: 600008.
Sample contain 300000 \r\n characters.

Biggest difference is 1.15s (24253.20%) for sample of the same size

defnull · 2024-09-22T21:28:17Z

This may justify a CVE, as it can be used to block CPU resources and deny service for other users with a manipulated file upload. Even accidentally, when uploading a real file with just \r\n in it.

defnull · 2024-09-22T21:47:39Z

While bench-marking different multipart implementations, I stumbled across a second (likely related) problem: Throughput goes down drastically if a file upload consists of characters that are also present in the boundary.

Here are the results for a 1MB upload with different content, and a boundary of ------------------------WqclBHaXe8KIsoSum4zfZ6:

"x" * 2**20: 579.44 MB/s
"X" * 2**20: 1.99MB/s
"\r\n" * 2**19: 0.76MB/s

Kludex · 2024-09-28T10:20:42Z

I've rerun the script mentioned in the description, and it seems we solved on #137.

Thanks @jhnstrk 🙏

jhnstrk mentioned this issue Mar 31, 2024

Improve performance, especially in data with many CR-LF #137

Merged

Kludex closed this as completed Sep 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parsing is extremely slow if there are a lot of CR LF characters in the stream #67

Parsing is extremely slow if there are a lot of CR LF characters in the stream #67

vgovtva commented Jul 20, 2023

rdenadai commented Sep 21, 2023

defnull commented Sep 22, 2024

defnull commented Sep 22, 2024 •

edited

Loading

Kludex commented Sep 28, 2024

Parsing is extremely slow if there are a lot of CR LF characters in the stream #67

Parsing is extremely slow if there are a lot of CR LF characters in the stream #67

Comments

vgovtva commented Jul 20, 2023

rdenadai commented Sep 21, 2023

defnull commented Sep 22, 2024

defnull commented Sep 22, 2024 • edited Loading

Kludex commented Sep 28, 2024

defnull commented Sep 22, 2024 •

edited

Loading