
Significantly lower throughput compared to open #184

Closed

jayantj opened this issue Apr 6, 2018 · 11 comments

jayantj commented Apr 6, 2018

I was iterating over a large CSV file and noticed a significant performance drop when nothing was changed except swapping open for smart_open.

# Iterate over csv

import time
import csv

from smart_open import smart_open  # swapped in for open() in the second run

def report_time_iterate_rows(file_name, report_every=100000):
    start = time.time()
    last = start
    with open(file_name, 'r') as f:  # or: smart_open(file_name, 'r')
        reader = csv.reader(f)
        for i, line in enumerate(reader, start=1):
            if not (i % report_every):
                current = time.time()
                time_taken = current - last
                print('Time taken for %d rows: %.2f seconds, %.2f rows/s' % (
                    report_every, time_taken, report_every / time_taken))
                last = current
    total = time.time() - start
    print('Total: %d rows, %.2f seconds, %.2f rows/s' % (
        i, total, i / total))

report_time_iterate_rows('file.csv')

Output with open:

Time taken for 100000 rows: 0.08 seconds, 1222907.59 rows/s
Time taken for 100000 rows: 0.08 seconds, 1217525.99 rows/s
Time taken for 100000 rows: 0.08 seconds, 1223503.33 rows/s
Time taken for 100000 rows: 0.08 seconds, 1247851.67 rows/s
Time taken for 100000 rows: 0.08 seconds, 1245898.25 rows/s
Time taken for 100000 rows: 0.08 seconds, 1238971.91 rows/s
...

Output with smart_open:

Time taken for 100000 rows: 0.37 seconds, 272099.79 rows/s
Time taken for 100000 rows: 0.37 seconds, 272198.68 rows/s
Time taken for 100000 rows: 0.37 seconds, 273532.88 rows/s
Time taken for 100000 rows: 0.37 seconds, 272889.00 rows/s
Time taken for 100000 rows: 0.37 seconds, 272412.42 rows/s
...

Unfortunately, the file I'm using contains sensitive data, so I can't share it, but I assume this should be reproducible with any file with a large number of lines. Information about the file:
Number of lines: 25206601
File size: 2707135791 bytes (~2.7 GB)

mpenkov (Collaborator) commented Apr 6, 2018

Thank you for the report. Are you using Python 2 or 3?

jayantj (Author) commented Apr 6, 2018

Python 3.6.4

menshikh-iv added the bug label Apr 6, 2018
menshikh-iv (Contributor) commented

@jayantj can you share the number of columns and the line-length distribution? This will be really helpful for generating "fake" data and reproducing the problem.

jayantj (Author) commented Apr 6, 2018

Sure. Here's a random sample of 1000 line lengths:

30, 161, 71, 162, 84, 19, 28, 100, 32, 253, 37, 39, 191, 119, 75, 26, 44, 64, 230, 71, 71, 71, 45, 22, 78, 155, 32, 38, 45, 64,
121, 51, 76, 22, 148, 76, 38, 53, 154, 51, 65, 50, 361, 31, 99, 75, 137, 45, 46, 62, 53, 37, 23, 63, 26, 276, 26, 44, 142, 64, 117, 76, 57, 647, 99, 52, 113, 114, 42, 271,
58, 26, 54, 26, 74, 52, 89, 51, 68, 51, 403, 51, 40, 72, 458, 43, 267, 148, 96, 38, 103, 83, 74, 23, 30, 332, 27, 30, 23, 106, 62, 61, 99, 43, 49, 482, 39, 179, 73, 443, 64,
58, 26, 74, 123, 152, 45, 376, 43, 331, 132, 34, 27, 57, 61, 29, 138, 42, 83, 60, 51, 21, 34, 57, 39, 28, 52, 54, 50, 236, 44, 37, 44, 54, 64, 22, 105, 20, 182, 110, 44, 65,
44, 46, 264, 76, 55, 39, 83, 36, 75, 121, 80, 63, 151, 71, 45, 38, 22, 130, 56, 57, 44, 78, 114, 66, 54, 85, 71, 26, 50, 40, 107, 62, 170, 35, 30, 57, 80, 30, 155, 112, 121,
117, 90, 277, 84, 217, 386, 24, 29, 100, 36, 105, 709, 22, 49, 307, 90, 51, 1493, 26, 45, 77, 30, 26, 154, 93, 31, 60, 85, 218, 54, 75, 24, 54, 40, 70, 37, 251, 38, 81, 55, 46,
44, 150, 49, 198, 248, 68, 48, 69, 67, 25, 32, 24, 230, 1532, 79, 44, 118, 56, 188, 120, 60, 131, 132, 39, 50, 56, 74, 50, 107, 134, 273, 46, 258, 120, 99, 27, 65, 39, 80,
74, 30, 44, 63, 91, 61, 21, 73, 267, 79, 26, 22, 85, 311, 17, 121, 53, 58, 44, 165, 319, 108, 83, 27, 82, 555, 50, 142, 54, 25, 183, 111, 51, 27, 66, 70, 775, 31, 29, 234,
876, 18, 55, 55, 74, 64, 52, 147, 18, 91, 1280, 90, 108, 259, 50, 33, 45, 112, 65, 66, 23, 120, 91, 196, 140, 390, 47, 72, 24, 41, 66, 62, 26, 87, 77, 56, 88, 28, 26, 147, 99,
50, 158, 44, 78, 59, 37, 31, 59, 84, 42, 142, 22, 150, 53, 25, 70, 257, 89, 38, 99, 213, 24, 99, 2926, 65, 58, 42, 40, 17, 54, 26, 41, 28, 49, 89, 60, 48, 39, 97, 58, 575,
102, 68, 100, 68, 101, 38, 38, 43, 78, 48, 93, 141, 39, 168, 96, 21, 26, 40, 93, 122, 48, 92, 291, 99, 35, 625, 44, 40, 64, 148, 308, 26, 51, 144, 26, 26, 40, 80, 34, 30, 99,
65, 289, 31, 36, 38, 108, 24, 38, 27, 87, 426, 67, 72, 112, 94, 44, 50, 68, 72, 23, 51, 68, 28, 264, 36, 167, 29, 70, 45, 57, 41, 69, 36, 35, 44, 58, 43, 216, 58, 57, 22, 55,
13, 65, 227, 36, 24, 121, 45, 49, 49, 87, 66, 26, 203, 32, 46, 32, 56, 179, 437, 70, 149, 44, 54, 123, 157, 21, 45, 65, 26, 146, 668, 29, 23, 31, 268, 100, 66, 1339, 73, 44,
92, 47, 47, 45, 48, 35, 377, 161, 43, 94, 97, 30, 63, 360, 44, 99, 344, 26, 115, 160, 99, 205, 64, 47, 290, 57, 104, 36, 158, 300, 20, 40, 200, 92, 57, 32, 42, 62, 34, 68,
180, 142, 182, 56, 44, 84, 225, 95, 72, 38, 132, 109, 82, 23, 94, 40, 389, 44, 31, 53, 80, 57, 116, 37, 51, 47, 25, 169, 44, 25, 99, 97, 174, 115, 44, 55, 110, 70, 68, 70, 26,
298, 307, 86, 74, 40, 122, 176, 50, 44, 22, 67, 44, 99, 99, 34, 20, 64, 340, 33, 47, 19, 150, 132, 158, 51, 296, 50, 310, 449, 201, 326, 75, 53, 66, 26, 119, 223, 96, 74, 38,
279, 31, 207, 44, 249, 99, 197, 240, 23, 59, 44, 38, 181, 111, 31, 26, 86, 97, 148, 106, 289, 37, 48, 23, 26, 45, 64, 46, 172, 28, 50, 270, 362, 104, 61, 64, 34, 174, 65, 87,
84, 249, 22, 44, 198, 24, 45, 68, 872, 70, 37, 32, 44, 38, 99, 24, 901, 272, 377, 40, 32, 68, 182, 26, 350, 888, 26, 174, 150, 69, 397, 38, 151, 37, 287, 49, 102, 14, 52,
44, 44, 121, 28, 178, 102, 131, 50, 57, 35, 113, 55, 29, 125, 75, 72, 162, 54, 45, 38, 40, 49, 15, 44, 83, 58, 20, 111, 140, 235, 63, 262, 50, 72, 44, 99, 35, 89, 44, 45, 162,
277, 268, 74, 50, 83, 71, 60, 26, 124, 40, 39, 42, 34, 37, 32, 1371, 32, 26, 24, 77, 85, 71, 163, 44, 130, 282, 40, 51, 103, 53, 91, 161, 40, 58, 172, 147, 63, 55, 25, 28,
241, 81, 44, 65, 203, 41, 50, 288, 86, 244, 51, 108, 45, 50, 81, 152, 56, 218, 99, 88, 196, 323, 180, 128, 41, 146, 67, 145, 57, 78, 62, 44, 43, 25, 82, 30, 42, 67, 115, 333,
78, 44, 85, 53, 19, 23, 28, 158, 51, 93, 330, 40, 14, 23, 51, 26, 63, 56, 46, 537, 48, 425, 119, 33, 170, 99, 167, 26, 29, 44, 1724, 494, 36, 58, 26, 152, 236, 44, 80, 50, 35,
127, 432, 249, 30, 116, 281, 57, 34, 54, 35, 50, 283, 174, 74, 64, 111, 30, 70, 62, 52, 104, 65, 71, 68, 67, 65, 103, 26, 26, 68, 26, 25, 97, 510, 96, 13, 39, 36, 59, 66,
166, 78, 80, 86, 192, 23, 74, 43, 65, 45, 23, 131, 71, 30, 61, 24, 73, 112, 15, 102, 54, 113, 188, 60, 55, 33, 57, 453, 27, 131, 61, 201, 44, 292, 22, 55, 109, 65, 46, 82, 47,
146, 345, 73, 98, 57, 65, 44, 143, 36, 92, 359, 157, 469, 244, 50, 180, 193, 392, 26, 42, 167, 21, 44, 105, 130, 223, 41, 45, 59, 107, 44, 97, 121, 74, 29, 99, 103, 70, 57
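
For anyone trying to reproduce this without the original file, here's a small hypothetical helper (not from the original report; the file name and line count come from the stats above) that writes a fake file whose line lengths are sampled from this distribution:

import random

# Sketch: generate a fake file for reproduction. Extend sample_lengths with
# the full sample above; this assumes the posted lengths include the newline.
sample_lengths = [30, 161, 71, 162, 84, 19, 28, 100, 32, 253]

with open('file.csv', 'w') as f:
    for _ in range(25206601):  # number of lines in the original file
        f.write('x' * (random.choice(sample_lengths) - 1) + '\n')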

jayantj (Author) commented Apr 6, 2018

Re: columns, there is just 1 column. The problem is reproducible simply by iterating over the lines in the file, though; in fact, it is even more pronounced.

open:

Time taken for 100000 rows: 0.02 seconds, 5294701.89 rows/s
Time taken for 100000 rows: 0.02 seconds, 5307834.63 rows/s
Time taken for 100000 rows: 0.02 seconds, 5074041.28 rows/s
Time taken for 100000 rows: 0.02 seconds, 5089989.44 rows/s
Time taken for 100000 rows: 0.02 seconds, 5262482.75 rows/s
...

smart_open:

Time taken for 100000 rows: 0.30 seconds, 330003.69 rows/s
Time taken for 100000 rows: 0.30 seconds, 329707.18 rows/s
Time taken for 100000 rows: 0.30 seconds, 329591.37 rows/s
Time taken for 100000 rows: 0.30 seconds, 330451.14 rows/s
Time taken for 100000 rows: 0.30 seconds, 328751.27 rows/s
...

open is about 15x faster.

Code:

# Iterate over lines

import time

from smart_open import smart_open  # swapped in for open() in the second run

def report_time_iterate_rows(file_name, report_every=100000):
    start = time.time()
    last = start
    with open(file_name, 'r') as f:  # or: smart_open(file_name, 'r')
        for i, line in enumerate(f, start=1):
            if not (i % report_every):
                current = time.time()
                time_taken = current - last
                print('Time taken for %d rows: %.2f seconds, %.2f rows/s' % (
                    report_every, time_taken, report_every / time_taken))
                last = current
    total = time.time() - start
    print('Total: %d rows, %.2f seconds, %.2f rows/s' % (
        i, total, i / total))

report_time_iterate_rows('file.csv')

piskvorky (Owner) commented Apr 6, 2018

That's super weird -- AFAIK, smart_open just returns open for local files, no wrapping at all, no extra indirection. It's exactly the same object underneath. Interesting mystery!

mpenkov (Collaborator) commented Apr 6, 2018

@piskvorky that's not entirely true, and I'm to blame here :)

I think the problem has to do with UTF-8 decoding. You're opening the file directly in r mode, and Python does the decoding for you behind the scenes. smart_open opens the file in binary mode first, and then applies a codecs.getreader (or .getwriter) to it. Ideally, these two should have the same performance, but I wouldn't be too surprised if the slightly different implementations account for the difference.

https://github.com/RaRe-Technologies/smart_open/blob/master/smart_open/smart_open_lib.py#L575
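
To make the two code paths concrete, here is a minimal sketch (just the shape of each approach, not the actual smart_open internals) for a local UTF-8 file:

import codecs

path = 'file.csv'  # any local text file

# Path 1: the built-in open in text mode; decoding happens inside the
# C-accelerated io stack.
f_builtin = open(path, 'r', encoding='utf-8')

# Path 2: what smart_open effectively does; open in binary first, then wrap
# the raw stream in a codecs StreamReader, whose readline logic is pure Python.
f_wrapped = codecs.getreader('utf-8')(open(path, 'rb'))

f_builtin.close()
f_wrapped.close()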

The reason why smart_open does this (first bytes, then text) is that it's easy to apply this abstraction to all our use cases (compression, codecs, reading from remote sources like S3, etc). I don't know if there's a way to maintain this flexibility and keep the performance. If it really matters, we could handle reading and writing local text files as a separate edge case. It wouldn't be particularly hard to do, but each additional edge case adds a little bit of cruft.

jayantj (Author) commented Apr 6, 2018

I had a quick look at the code, and there does seem to be some wrapping: a codecs.getreader call is made, and the raw file object is wrapped inside the decoder.

Using the same code as above, but with the file object wrapped the same way:

# Iterate over lines

import codecs
import time

def report_time_iterate_rows(file_name, report_every=100000):
    start = time.time()
    last = start
    with codecs.getreader('utf8')(open(file_name, 'rb')) as f:
        for i, line in enumerate(f, start=1):
            if not (i % report_every):
                current = time.time()
                time_taken = current - last
                print('Time taken for %d rows: %.2f seconds, %.2f rows/s' % (
                    report_every, time_taken, report_every / time_taken))
                last = current
    total = time.time() - start
    print('Total: %d rows, %.2f seconds, %.2f rows/s' % (
        i, total, i / total))

report_time_iterate_rows('file.csv')

Throughput:

Time taken for 100000 rows: 0.31 seconds, 325395.62 rows/s
Time taken for 100000 rows: 0.31 seconds, 325199.84 rows/s
Time taken for 100000 rows: 0.31 seconds, 327534.66 rows/s
Time taken for 100000 rows: 0.31 seconds, 325983.61 rows/s
Time taken for 100000 rows: 0.31 seconds, 326569.66 rows/s
Time taken for 100000 rows: 0.31 seconds, 326363.33 rows/s
...

That seems pretty close to the smart_open results.
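
For comparison, wrapping the same binary stream in io.TextIOWrapper (the class the built-in open returns in text mode) should be much closer to open's throughput than the codecs wrapper; a minimal sketch, assuming the same file:

import io
import time

# Sketch: iterate through a TextIOWrapper over the binary stream. Since this
# is what open() builds internally for text mode, decoding stays in C.
start = time.time()
with io.TextIOWrapper(open('file.csv', 'rb'), encoding='utf-8') as f:
    n = sum(1 for _ in f)
print('%d lines in %.2f seconds' % (n, time.time() - start))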

jayantj (Author) commented Apr 6, 2018

Ah I see you beat me to it @mpenkov :)

piskvorky (Owner) commented Apr 6, 2018

Aha, that must be some new development. I wasn't aware of that; thanks for clarifying.

Yes, local files are definitely a "corner case" worth handling in an optimal way, IMO. A 15x performance hit doesn't look good.

mpenkov (Collaborator) commented Apr 7, 2018

The reason I refer to it as an "edge case" is that smart_open can only rely on the built-in open under certain conditions:

  1. Opening a local file
  2. Ignore file extension is set to True (by default, it's False)
  3. Encoding is left at default (encoding keyword argument)
  4. Anything I've missed?

If all of the above hold, then we can shortcut the process I described earlier and just use the built-in open. If any of them don't, then we have to take the long way around.

Luckily, the conditions above are fairly easy to detect, from what I understand; a rough sketch of the check is below.
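
(A hypothetical sketch; the function name and signature are made up, and this is not the actual smart_open code:)

def can_shortcut_to_builtin_open(uri, ignore_extension=False, encoding=None):
    # Mirror the three conditions listed above.
    is_local_file = '://' not in uri     # 1. opening a local file, not s3:// etc.
    no_decompression = ignore_extension  # 2. extension-based compression handling is off
    default_encoding = encoding is None  # 3. encoding left at its default
    return is_local_file and no_decompression and default_encoding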

@menshikh-iv Have I missed anything?
