datetime optimization #9594
Comments
For reading csvs with |
@charles-cooper Can you give a reproducible example? And is this approach still faster when using |
Quick example now that I have a Python terminal in front of me:

    import StringIO  # Python 2 module, as used in this session
    import pandas as pd

    class memoize:
        # Cache results of `function`, keyed by its positional arguments.
        def __init__(self, function):
            self.function = function
            self.memoized = {}
        def __call__(self, *args):
            try:
                return self.memoized[args]
            except KeyError:
                self.memoized[args] = self.function(*args)
                return self.memoized[args]

    def to_datetime(x):
        return pd.to_datetime(x, '%Y/%m/%d')

    def time_read_csv(date_parser, num_rows=50000):
        # Build an in-memory CSV with one date repeated num_rows times, then parse it.
        s = StringIO.StringIO()
        pd.DataFrame([{'d': '2014/01/01', 'a': 0}] * num_rows).to_csv(s, index=False)
        s.seek(0)
        return pd.read_csv(s, parse_dates=['d'], date_parser=date_parser)

    In [83]: %timeit time_read_csv(to_datetime)
    1 loops, best of 3: 4.59 s per loop
    In [84]: %timeit time_read_csv(memoize(to_datetime))
    10 loops, best of 3: 185 ms per loop

Looks like pandas read_sql doesn't accept arbitrary date parsers, but you could read the date column as strings and apply a memoized date parser after the initial read. |
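A minimal sketch of the "read as strings, parse afterwards" idea from the comment above. It is an illustration under stated assumptions, not what pandas does internally: the standard-library sqlite3 module stands in for read_sql so the snippet runs without extra dependencies, and the table `t`, its columns, and the `%Y/%m/%d` format are hypothetical, chosen to mirror the sample data in this thread.

```python
import sqlite3
from datetime import datetime

# Hypothetical in-memory table standing in for a real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (d TEXT, a INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("2014/01/01", 0), ("2014/01/02", 1)] * 500,
)

# Initial read: the date column comes back as plain strings.
rows = conn.execute("SELECT d, a FROM t ORDER BY rowid").fetchall()
conn.close()

# Parse each distinct string once, then look the result up per row.
distinct = {d for d, _ in rows}
lookup = {s: datetime.strptime(s, "%Y/%m/%d") for s in distinct}
parsed_rows = [(lookup[d], a) for d, a in rows]
```

Here strptime runs once per distinct date string rather than once per row; every other row costs only a dictionary lookup.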
there was a whole discussion of this in #9377 so the simple heuristic is this:
With the addendum if you have a format that Trying to memoize is fine, but not necessary as that is what There was some talk of adding this to the docs (in |
Thanks for the feedback all, Charles |
turns out |
Hi,
I noticed that datetime parsing for large sql/csv tables is really slow. Would it be acceptable to use a technique where repeated calculations are cached? For example, instead of:
use
The reason this works is that string hashing, comparison, and dictionary insertion are much faster than strptime.
For tables where the same dates are repeated many times, this can result in an orders-of-magnitude speedup.
Thanks
Charles
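The code snippets from the post above did not survive on this page, so the following is only a sketch of the caching idea it describes, not the author's original code. It assumes a fixed `%Y/%m/%d` format and uses the standard-library `functools.lru_cache`; the names `parse_date` and `DATE_FORMAT` are hypothetical.

```python
from datetime import datetime
from functools import lru_cache

DATE_FORMAT = "%Y/%m/%d"  # assumed format, matching the thread's sample data

@lru_cache(maxsize=None)
def parse_date(text):
    """Parse one date string; repeated inputs are served from the cache."""
    return datetime.strptime(text, DATE_FORMAT)

# A column with many repeats of a few distinct dates.
column = ["2014/01/01", "2014/01/02", "2014/01/01"] * 1000
parsed = [parse_date(s) for s in column]

# cache_info() reports how often strptime actually ran (misses)
# versus how often a cached result was reused (hits).
info = parse_date.cache_info()
print(info.misses, info.hits)
```

The benefit depends on how often values repeat: a column of mostly unique timestamps gains little and pays for the extra dictionary entries.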