
Significantly lower throughput compared to open #184

Closed

jayantj opened this issue Apr 6, 2018 · 11 comments

jayantj commented Apr 6, 2018

I was iterating over a large CSV file and noticed a significant performance drop when nothing was changed except swapping open for smart_open.

# Iterate over csv

import time
import csv

from smart_open import smart_open  # swapped in for open() in the second run

def report_time_iterate_rows(file_name, report_every=100000):
    start = time.time()
    last = start
    with open(file_name, 'r') as f:  # or: smart_open(file_name, 'r')
        reader = csv.reader(f)
        for i, line in enumerate(reader, start=1):
            if not (i % report_every):
                current = time.time()
                time_taken = current - last
                print('Time taken for %d rows: %.2f seconds, %.2f rows/s' % (
                    report_every, time_taken, report_every / time_taken))
                last = current
    total = time.time() - start
    print('Total: %d rows, %.2f seconds, %.2f rows/s' % (
        i, total, i / total))

report_time_iterate_rows('file.csv')

Output with open:

Time taken for 100000 rows: 0.08 seconds, 1222907.59 rows/s
Time taken for 100000 rows: 0.08 seconds, 1217525.99 rows/s
Time taken for 100000 rows: 0.08 seconds, 1223503.33 rows/s
Time taken for 100000 rows: 0.08 seconds, 1247851.67 rows/s
Time taken for 100000 rows: 0.08 seconds, 1245898.25 rows/s
Time taken for 100000 rows: 0.08 seconds, 1238971.91 rows/s
...

Output with smart_open:

Time taken for 100000 rows: 0.37 seconds, 272099.79 rows/s
Time taken for 100000 rows: 0.37 seconds, 272198.68 rows/s
Time taken for 100000 rows: 0.37 seconds, 273532.88 rows/s
Time taken for 100000 rows: 0.37 seconds, 272889.00 rows/s
Time taken for 100000 rows: 0.37 seconds, 272412.42 rows/s
...

Unfortunately, the file I'm using contains sensitive data, so I can't share it, but I assume this should be reproducible with any file with a large number of lines. Information about the file:
Number of lines: 25206601
File size: 2707135791 bytes (~2.7 GB)

mpenkov (Collaborator) commented Apr 6, 2018

Thank you for the report. Are you using Python 2 or 3?

jayantj (Author) commented Apr 6, 2018

Python 3.6.4

menshikh-iv added the bug label Apr 6, 2018
menshikh-iv (Contributor) commented

@jayantj can you share the number of columns and the line-length distribution? This will be really helpful for generating "fake" data and reproducing the problem.

jayantj (Author) commented Apr 6, 2018

Sure. Here's a random sample of 1000 line lengths:

30, 161, 71, 162, 84, 19, 28, 100, 32, 253, 37, 39, 191, 119, 75, 26, 44, 64, 230, 71, 71, 71, 45, 22, 78, 155, 32, 38, 45, 64,
121, 51, 76, 22, 148, 76, 38, 53, 154, 51, 65, 50, 361, 31, 99, 75, 137, 45, 46, 62, 53, 37, 23, 63, 26, 276, 26, 44, 142, 64, 117, 76, 57, 647, 99, 52, 113, 114, 42, 271,
58, 26, 54, 26, 74, 52, 89, 51, 68, 51, 403, 51, 40, 72, 458, 43, 267, 148, 96, 38, 103, 83, 74, 23, 30, 332, 27, 30, 23, 106, 62, 61, 99, 43, 49, 482, 39, 179, 73, 443, 64,
58, 26, 74, 123, 152, 45, 376, 43, 331, 132, 34, 27, 57, 61, 29, 138, 42, 83, 60, 51, 21, 34, 57, 39, 28, 52, 54, 50, 236, 44, 37, 44, 54, 64, 22, 105, 20, 182, 110, 44, 65,
44, 46, 264, 76, 55, 39, 83, 36, 75, 121, 80, 63, 151, 71, 45, 38, 22, 130, 56, 57, 44, 78, 114, 66, 54, 85, 71, 26, 50, 40, 107, 62, 170, 35, 30, 57, 80, 30, 155, 112, 121,
117, 90, 277, 84, 217, 386, 24, 29, 100, 36, 105, 709, 22, 49, 307, 90, 51, 1493, 26, 45, 77, 30, 26, 154, 93, 31, 60, 85, 218, 54, 75, 24, 54, 40, 70, 37, 251, 38, 81, 55, 46,
44, 150, 49, 198, 248, 68, 48, 69, 67, 25, 32, 24, 230, 1532, 79, 44, 118, 56, 188, 120, 60, 131, 132, 39, 50, 56, 74, 50, 107, 134, 273, 46, 258, 120, 99, 27, 65, 39, 80,
74, 30, 44, 63, 91, 61, 21, 73, 267, 79, 26, 22, 85, 311, 17, 121, 53, 58, 44, 165, 319, 108, 83, 27, 82, 555, 50, 142, 54, 25, 183, 111, 51, 27, 66, 70, 775, 31, 29, 234,
876, 18, 55, 55, 74, 64, 52, 147, 18, 91, 1280, 90, 108, 259, 50, 33, 45, 112, 65, 66, 23, 120, 91, 196, 140, 390, 47, 72, 24, 41, 66, 62, 26, 87, 77, 56, 88, 28, 26, 147, 99,
50, 158, 44, 78, 59, 37, 31, 59, 84, 42, 142, 22, 150, 53, 25, 70, 257, 89, 38, 99, 213, 24, 99, 2926, 65, 58, 42, 40, 17, 54, 26, 41, 28, 49, 89, 60, 48, 39, 97, 58, 575,
102, 68, 100, 68, 101, 38, 38, 43, 78, 48, 93, 141, 39, 168, 96, 21, 26, 40, 93, 122, 48, 92, 291, 99, 35, 625, 44, 40, 64, 148, 308, 26, 51, 144, 26, 26, 40, 80, 34, 30, 99,
65, 289, 31, 36, 38, 108, 24, 38, 27, 87, 426, 67, 72, 112, 94, 44, 50, 68, 72, 23, 51, 68, 28, 264, 36, 167, 29, 70, 45, 57, 41, 69, 36, 35, 44, 58, 43, 216, 58, 57, 22, 55,
13, 65, 227, 36, 24, 121, 45, 49, 49, 87, 66, 26, 203, 32, 46, 32, 56, 179, 437, 70, 149, 44, 54, 123, 157, 21, 45, 65, 26, 146, 668, 29, 23, 31, 268, 100, 66, 1339, 73, 44,
92, 47, 47, 45, 48, 35, 377, 161, 43, 94, 97, 30, 63, 360, 44, 99, 344, 26, 115, 160, 99, 205, 64, 47, 290, 57, 104, 36, 158, 300, 20, 40, 200, 92, 57, 32, 42, 62, 34, 68,
180, 142, 182, 56, 44, 84, 225, 95, 72, 38, 132, 109, 82, 23, 94, 40, 389, 44, 31, 53, 80, 57, 116, 37, 51, 47, 25, 169, 44, 25, 99, 97, 174, 115, 44, 55, 110, 70, 68, 70, 26,
298, 307, 86, 74, 40, 122, 176, 50, 44, 22, 67, 44, 99, 99, 34, 20, 64, 340, 33, 47, 19, 150, 132, 158, 51, 296, 50, 310, 449, 201, 326, 75, 53, 66, 26, 119, 223, 96, 74, 38,
279, 31, 207, 44, 249, 99, 197, 240, 23, 59, 44, 38, 181, 111, 31, 26, 86, 97, 148, 106, 289, 37, 48, 23, 26, 45, 64, 46, 172, 28, 50, 270, 362, 104, 61, 64, 34, 174, 65, 87,
84, 249, 22, 44, 198, 24, 45, 68, 872, 70, 37, 32, 44, 38, 99, 24, 901, 272, 377, 40, 32, 68, 182, 26, 350, 888, 26, 174, 150, 69, 397, 38, 151, 37, 287, 49, 102, 14, 52,
44, 44, 121, 28, 178, 102, 131, 50, 57, 35, 113, 55, 29, 125, 75, 72, 162, 54, 45, 38, 40, 49, 15, 44, 83, 58, 20, 111, 140, 235, 63, 262, 50, 72, 44, 99, 35, 89, 44, 45, 162,
277, 268, 74, 50, 83, 71, 60, 26, 124, 40, 39, 42, 34, 37, 32, 1371, 32, 26, 24, 77, 85, 71, 163, 44, 130, 282, 40, 51, 103, 53, 91, 161, 40, 58, 172, 147, 63, 55, 25, 28,
241, 81, 44, 65, 203, 41, 50, 288, 86, 244, 51, 108, 45, 50, 81, 152, 56, 218, 99, 88, 196, 323, 180, 128, 41, 146, 67, 145, 57, 78, 62, 44, 43, 25, 82, 30, 42, 67, 115, 333,
78, 44, 85, 53, 19, 23, 28, 158, 51, 93, 330, 40, 14, 23, 51, 26, 63, 56, 46, 537, 48, 425, 119, 33, 170, 99, 167, 26, 29, 44, 1724, 494, 36, 58, 26, 152, 236, 44, 80, 50, 35,
127, 432, 249, 30, 116, 281, 57, 34, 54, 35, 50, 283, 174, 74, 64, 111, 30, 70, 62, 52, 104, 65, 71, 68, 67, 65, 103, 26, 26, 68, 26, 25, 97, 510, 96, 13, 39, 36, 59, 66,
166, 78, 80, 86, 192, 23, 74, 43, 65, 45, 23, 131, 71, 30, 61, 24, 73, 112, 15, 102, 54, 113, 188, 60, 55, 33, 57, 453, 27, 131, 61, 201, 44, 292, 22, 55, 109, 65, 46, 82, 47,
146, 345, 73, 98, 57, 65, 44, 143, 36, 92, 359, 157, 469, 244, 50, 180, 193, 392, 26, 42, 167, 21, 44, 105, 130, 223, 41, 45, 59, 107, 44, 97, 121, 74, 29, 99, 103, 70, 57
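
For anyone trying to reproduce this without the original file, here's a small hypothetical helper (not from the original report; the file name and line count come from the stats above) that writes a fake file whose line lengths are sampled from this distribution:

import random

# Sketch: generate a fake file for reproduction. Extend sample_lengths with
# the full sample above; this assumes the posted lengths include the newline.
sample_lengths = [30, 161, 71, 162, 84, 19, 28, 100, 32, 253]

with open('file.csv', 'w') as f:
    for _ in range(25206601):  # number of lines in the original file
        f.write('x' * (random.choice(sample_lengths) - 1) + '\n')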

jayantj (Author) commented Apr 6, 2018

Re: columns, there is just 1 column. The problem is reproducible simply by iterating over the lines in the file, though; in fact, it is even more pronounced.

open:

Time taken for 100000 rows: 0.02 seconds, 5294701.89 rows/s
Time taken for 100000 rows: 0.02 seconds, 5307834.63 rows/s
Time taken for 100000 rows: 0.02 seconds, 5074041.28 rows/s
Time taken for 100000 rows: 0.02 seconds, 5089989.44 rows/s
Time taken for 100000 rows: 0.02 seconds, 5262482.75 rows/s
...

smart_open:

Time taken for 100000 rows: 0.30 seconds, 330003.69 rows/s
Time taken for 100000 rows: 0.30 seconds, 329707.18 rows/s
Time taken for 100000 rows: 0.30 seconds, 329591.37 rows/s
Time taken for 100000 rows: 0.30 seconds, 330451.14 rows/s
Time taken for 100000 rows: 0.30 seconds, 328751.27 rows/s
...

open is about 15x faster.

Code:

# Iterate over lines

import time

from smart_open import smart_open  # swapped in for open() in the second run

def report_time_iterate_rows(file_name, report_every=100000):
    start = time.time()
    last = start
    with open(file_name, 'r') as f:  # or: smart_open(file_name, 'r')
        for i, line in enumerate(f, start=1):
            if not (i % report_every):
                current = time.time()
                time_taken = current - last
                print('Time taken for %d rows: %.2f seconds, %.2f rows/s' % (
                    report_every, time_taken, report_every / time_taken))
                last = current
    total = time.time() - start
    print('Total: %d rows, %.2f seconds, %.2f rows/s' % (
        i, total, i / total))

report_time_iterate_rows('file.csv')

piskvorky (Owner) commented Apr 6, 2018

That's super weird -- AFAIK, smart_open just returns open for local files, no wrapping at all, no extra indirection. It's exactly the same object underneath. Interesting mystery!

mpenkov (Collaborator) commented Apr 6, 2018

@piskvorky that's not entirely true, and I'm to blame here :)

I think the problem has to do with UTF-8 decoding. You're opening the file directly in r mode, and Python does the decoding for you behind the scenes. smart_open opens the file in binary mode first, and then applies a codecs.getreader (or .getwriter) to it. Ideally, these two should have the same performance, but I wouldn't be too surprised if the slightly different implementations account for the difference.

https://github.com/RaRe-Technologies/smart_open/blob/master/smart_open/smart_open_lib.py#L575
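
To make the two code paths concrete, here is a minimal sketch (just the shape of each approach, not the actual smart_open internals) for a local UTF-8 file:

import codecs

path = 'file.csv'  # any local text file

# Path 1: the built-in open in text mode; decoding happens inside the
# C-accelerated io stack.
f_builtin = open(path, 'r', encoding='utf-8')

# Path 2: what smart_open effectively does; open in binary first, then wrap
# the raw stream in a codecs StreamReader, whose readline logic is pure Python.
f_wrapped = codecs.getreader('utf-8')(open(path, 'rb'))

f_builtin.close()
f_wrapped.close()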

The reason why smart_open does this (first bytes, then text) is that it's easy to apply this abstraction to all our use cases (compression, codecs, reading from remote sources like S3, etc). I don't know if there's a way to maintain this flexibility and keep the performance. If it really matters, we could handle reading and writing local text files as a separate edge case. It wouldn't be particularly hard to do, but each additional edge case adds a little bit of cruft.

jayantj (Author) commented Apr 6, 2018

I had a quick look at the code, and there does seem to be some wrapping: a codecs.getreader call is made, and the raw file object is wrapped inside the decoder.

Using the same code as above, but with the file object wrapped the same way:

# Iterate over lines

import codecs
import time

def report_time_iterate_rows(file_name, report_every=100000):
    start = time.time()
    last = start
    with codecs.getreader('utf8')(open(file_name, 'rb')) as f:
        for i, line in enumerate(f, start=1):
            if not (i % report_every):
                current = time.time()
                time_taken = current - last
                print('Time taken for %d rows: %.2f seconds, %.2f rows/s' % (
                    report_every, time_taken, report_every / time_taken))
                last = current
    total = time.time() - start
    print('Total: %d rows, %.2f seconds, %.2f rows/s' % (
        i, total, i / total))

report_time_iterate_rows('file.csv')

Throughput:

Time taken for 100000 rows: 0.31 seconds, 325395.62 rows/s
Time taken for 100000 rows: 0.31 seconds, 325199.84 rows/s
Time taken for 100000 rows: 0.31 seconds, 327534.66 rows/s
Time taken for 100000 rows: 0.31 seconds, 325983.61 rows/s
Time taken for 100000 rows: 0.31 seconds, 326569.66 rows/s
Time taken for 100000 rows: 0.31 seconds, 326363.33 rows/s
...

That seems pretty close to the smart_open results.
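
For comparison, wrapping the same binary stream in io.TextIOWrapper (the class the built-in open returns in text mode) should be much closer to open's throughput than the codecs wrapper; a minimal sketch, assuming the same file:

import io
import time

# Sketch: iterate through a TextIOWrapper over the binary stream. Since this
# is what open() builds internally for text mode, decoding stays in C.
start = time.time()
with io.TextIOWrapper(open('file.csv', 'rb'), encoding='utf-8') as f:
    n = sum(1 for _ in f)
print('%d lines in %.2f seconds' % (n, time.time() - start))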

jayantj (Author) commented Apr 6, 2018

Ah I see you beat me to it @mpenkov :)

piskvorky (Owner) commented Apr 6, 2018

Aha, that must be some new development. I wasn't aware of that; thanks for clarifying.

Yes, local files are definitely a "corner case" worth handling in an optimal way, IMO. A 15x performance hit doesn't look good.

mpenkov (Collaborator) commented Apr 7, 2018

The reason I refer to it as an "edge case" is that smart_open can only rely on the built-in open under certain conditions:

  1. Opening a local file
  2. Ignore file extension is set to True (by default, it's False)
  3. Encoding is left at default (encoding keyword argument)
  4. Anything I've missed?

If all of the above hold, then we can shortcut the process I described earlier and just use the built-in open. If any of them don't, then we have to take the long way around.

Luckily, the conditions above are fairly easy to detect, from what I understand; a rough sketch of the check is below.
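
(A hypothetical sketch; the function name and signature are made up, and this is not the actual smart_open code:)

def can_shortcut_to_builtin_open(uri, ignore_extension=False, encoding=None):
    # Mirror the three conditions listed above.
    is_local_file = '://' not in uri     # 1. opening a local file, not s3:// etc.
    no_decompression = ignore_extension  # 2. extension-based compression handling is off
    default_encoding = encoding is None  # 3. encoding left at its default
    return is_local_file and no_decompression and default_encoding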

@menshikh-iv Have I missed anything?
