Significantly lower throughput compared to open
#184
Comments
Thank you for the report. Are you using Python 2 or 3?

Python 3.6.4
@jayantj can you share the number of columns and the line-length distribution? This will be really helpful for generating "fake" data and reproducing the problem.
Sure. Random sample of 1000 line lengths:

Re: columns, there is just 1 column. The problem is reproducible by simply iterating over lines in the file, though; in fact, it is even more pronounced there. Output with open:
Output with smart_open:

open is about 15x faster. Code:

# Iterate over lines
import time

from smart_open import smart_open  # imported for the smart_open variant (see the sketch below)

def report_time_iterate_rows(file_name, report_every=100000):
    start = time.time()
    last = start
    with open(file_name, 'r') as f:
        for i, line in enumerate(f, start=1):
            if not (i % report_every):
                current = time.time()
                time_taken = current - last
                print('Time taken for %d rows: %.2f seconds, %.2f rows/s' % (
                    report_every, time_taken, report_every / time_taken))
                last = current
    total = time.time() - start
    print('Total: %d rows, %.2f seconds, %.2f rows/s' % (
        i, total, i / total))

report_time_iterate_rows(file_name)  # file_name: path to the local CSV (defined elsewhere)
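For the smart_open measurement, presumably only the opening call changes; a minimal sketch, assuming the smart_open function (imported above) accepts the same mode string as the built-in open:

from smart_open import smart_open

# Same timing loop as above; only the line that opens the file differs.
# file_name is the same local CSV path as in report_time_iterate_rows.
with smart_open(file_name, 'r') as f:  # instead of open(file_name, 'r')
    for i, line in enumerate(f, start=1):
        pass  # identical reporting logic as above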
That's super weird -- AFAIK, smart_open simply delegates to the built-in open for plain local files, so there should be no difference.
@piskvorky that's not entirely true, and I'm to blame here :) I think the problem has to do with UTF-8 decoding.

You're opening the file directly in 'r' mode, and Python does the decoding for you behind the scenes. smart_open opens the file in binary mode first, and then applies a codecs.getreader (or .getwriter) to it. Ideally, these two should have the same performance, but I wouldn't be too surprised if the slightly different implementations account for the difference.

The reason why smart_open does this (first bytes, then text) is that it's easy to apply this abstraction to all our use cases (compression, codecs, reading from remote sources like S3, etc.). I don't know if there's a way to maintain this flexibility and keep the performance. If it really matters, we could handle reading and writing local text files as a separate edge case. It wouldn't be particularly hard to do, but each additional edge case adds a little bit of cruft.
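To illustrate that layering (a minimal sketch, not smart_open's actual code, with placeholder file paths): the same codec layer can sit on top of any binary stream, which is why the bytes-first design generalizes to compressed and remote sources.

import codecs
import gzip

# The codec layer is agnostic to where the bytes come from:
with codecs.getreader('utf-8')(open('data.txt', 'rb')) as fin:  # plain local file
    for line in fin:
        pass

with codecs.getreader('utf-8')(gzip.open('data.txt.gz', 'rb')) as fin:  # gzipped file
    for line in fin:
        pass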
I had a quick look at the code and there does seem to be some wrapping. Using the same code as above, with the file object wrapped in a similar codecs reader:

# Iterate over lines
import codecs
import time

def report_time_iterate_rows(file_name, report_every=100000):
    start = time.time()
    last = start
    with codecs.getreader('utf8')(open(file_name, 'rb')) as f:
        for i, line in enumerate(f, start=1):
            if not (i % report_every):
                current = time.time()
                time_taken = current - last
                print('Time taken for %d rows: %.2f seconds, %.2f rows/s' % (
                    report_every, time_taken, report_every / time_taken))
                last = current
    total = time.time() - start
    print('Total: %d rows, %.2f seconds, %.2f rows/s' % (
        i, total, i / total))

report_time_iterate_rows(file_name)

Throughput:
Seems pretty close to the smart_open results.
Ah, I see you beat me to it @mpenkov :)
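One plausible explanation for the gap, worth noting here: in CPython, the built-in open returns an io.TextIOWrapper, whose decoding and line splitting are implemented in C, while codecs.StreamReader does that work in pure Python. A minimal sketch of keeping the bytes-first layering but using io.TextIOWrapper as the text layer (an assumption about a possible fix, not what smart_open does):

import io

# Same bytes-first layering, but with the C-implemented io.TextIOWrapper
# instead of the pure-Python codecs.StreamReader as the decoding layer.
# 'data.txt' is a placeholder path; any binary stream would work here.
with io.TextIOWrapper(open('data.txt', 'rb'), encoding='utf-8') as f:
    for line in f:
        pass  # process the decoded line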
Aha, that must be some new development. I wasn't aware of that, thanks for clarifying. Yes, local files are definitely a "corner case" worth handling in an optimal way IMO. A 15x slowdown doesn't look good.
The reason I refer to it as an "edge case" is that smart_open can only rely on the built-in open under certain conditions:

- the file lives on the local file system (not a remote source like S3)
- the file is not compressed
- any decoding required can be handled by the built-in open
If all of the above are true, then we can shortcut the process I described earlier and just use the built-in open. If any of them are false, then we have to take the long way around. Luckily, the conditions above are fairly easy to detect, from what I understand. @menshikh-iv Have I missed anything?
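A hypothetical sketch of that shortcut (the function name, the URI test, and the compression check are all illustrative assumptions, not smart_open's actual code):

import codecs

def open_maybe_fast_path(path, mode='r', encoding='utf-8'):
    # Illustrative checks only; real detection would be more careful.
    is_remote = '://' in path                       # e.g. s3://bucket/key
    is_compressed = path.endswith(('.gz', '.bz2'))
    if not is_remote and not is_compressed and 'b' not in mode:
        # Fast path: let the built-in open handle buffering and decoding in C.
        return open(path, mode, encoding=encoding)
    # Slow path (local example only): bytes first, then a codec layer on top;
    # remote sources would substitute their own binary stream here.
    return codecs.getreader(encoding)(open(path, 'rb'))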
I was iterating over a large csv file using open vs smart_open, and noticed a significant performance drop when nothing was changed but open -> smart_open.

Output with open:

Output with smart_open:

Unfortunately, the file I'm using is sensitive data, so I can't share it, but I assume this should be reproducible with any file with a large number of lines. Information about the file:
Number of lines: 25206601
File size: 2707135791 bytes (~2.7 GB)