perf for read_csv with parse_dates #16914

Closed
cottrell opened this issue Jul 13, 2017 · 1 comment
Labels
Duplicate Report Duplicate issue or pull request

Comments

@cottrell (Contributor)

Caching (memoizing) the date_parser function in read_csv might be an easy perf improvement. It doesn't appear to be cached at the moment, unless I'm missing something.

In [55]: df = pd.DataFrame([datetime.datetime.today()] * 1000000)

In [56]: df.to_csv('j', index=False)

In [57]: !gzip j

In [58]: %time df = pd.read_csv('j.gz')
CPU times: user 703 ms, sys: 68.7 ms, total: 772 ms
Wall time: 774 ms

In [59]: d = {df['0'][0]: datetime.datetime.today()}

In [60]: %time s = df['0'].map(d)
CPU times: user 84.8 ms, sys: 14.8 ms, total: 99.6 ms
Wall time: 99.2 ms

In [61]: %time df = pd.read_csv('j.gz', parse_dates=['0'])
CPU times: user 1.49 s, sys: 88.7 ms, total: 1.58 s
Wall time: 1.58 s
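To illustrate the idea: a minimal, self-contained sketch of the memoization being proposed (not pandas internals). Real-world CSVs often repeat the same timestamp string many times, so caching the string-to-datetime conversion means strptime runs once per distinct value instead of once per row. The `parse_date` helper and format string here are hypothetical, chosen to mirror the single-repeated-value example above.

```python
import datetime
import functools

# Hypothetical sketch: memoize the string -> datetime conversion so that
# repeated date strings are parsed only once.
@functools.lru_cache(maxsize=None)
def parse_date(s):
    return datetime.datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f")

# One million rows but only one distinct value, mirroring the example above.
values = ["2017-07-13 09:30:00.000001"] * 1_000_000
parsed = [parse_date(v) for v in values]

print(parsed[0])
print(parse_date.cache_info().misses)  # strptime ran only once
```

With a single distinct value, the cache records one miss and 999,999 hits, which is where the speedup in the `map`-based timing above comes from.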
@chris-b1 (Contributor)

Yep, caching certainly could help; this is a duplicate of #11665. PR welcome!

@chris-b1 chris-b1 added this to the No action milestone Jul 13, 2017
@chris-b1 chris-b1 added the Duplicate Report Duplicate issue or pull request label Jul 13, 2017