Hi,
I noticed that datetime parsing for large SQL/CSV tables is really slow. Would it be acceptable to use a technique where repeated calculations are cached? For example, instead of:
```python
def parse_date(date_str):
    return datetime.datetime.strptime(date_str, FMT)

def parse_date_col(str_col):
    return [parse_date(date_str) for date_str in str_col]
```
use:
```python
def parse_date(date_str):
    return datetime.datetime.strptime(date_str, FMT)

def parse_date_col(str_col):
    cache = dict()
    for date_str in str_col:
        if date_str not in cache:
            cache[date_str] = parse_date(date_str)
    return [cache[date_str] for date_str in str_col]
```
This works because string hashing, comparison, and dictionary insertion are much faster than strptime. For tables where the same dates appear many times, this can yield an orders-of-magnitude speedup.
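For what it's worth, the same memoization can also be written with the standard library's `functools.lru_cache`, which avoids managing the dict by hand. A minimal sketch (the `FMT` value here is an assumed example; the real format would come from the table):

```python
import datetime
import functools

FMT = "%Y-%m-%d"  # assumed example format, not from the original code

@functools.lru_cache(maxsize=None)
def parse_date(date_str):
    # Each distinct string is parsed with strptime exactly once;
    # repeated strings are served from the cache.
    return datetime.datetime.strptime(date_str, FMT)

def parse_date_col(str_col):
    return [parse_date(date_str) for date_str in str_col]
```

The behavior is identical to the explicit-dict version, except the cache persists across calls to `parse_date_col`, which may or may not be desirable depending on memory constraints.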
Thanks
Charles