datetime optimization #9594

Closed
@charles-cooper

Description

Hi,
I noticed that datetime parsing for large SQL/CSV tables is really slow. Would it be acceptable to use a technique where repeated calculations are cached? For example, instead of:

def parse_date(date_str):
    return datetime.datetime.strptime(date_str, FMT)

def parse_date_col(str_col):
    return [parse_date(date_str) for date_str in str_col]

use

def parse_date(date_str):
    return datetime.datetime.strptime(date_str, FMT)

def parse_date_col(str_col):
    cache = dict()
    for date_str in str_col:
        if date_str not in cache:
            cache[date_str] = parse_date(date_str)
    return [cache[date_str] for date_str in str_col]
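For reference, the standard library's `functools.lru_cache` expresses the same memoization without a hand-managed dict. A minimal sketch (the `FMT` value here is an assumed example format, not from the original report):

```python
import datetime
from functools import lru_cache

FMT = "%Y-%m-%d"  # assumed example format for illustration

@lru_cache(maxsize=None)
def parse_date(date_str):
    # strptime runs only on the first occurrence of each distinct string
    return datetime.datetime.strptime(date_str, FMT)

def parse_date_col(str_col):
    # repeated strings hit the cache instead of re-parsing
    return [parse_date(s) for s in str_col]
```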

This works because string hashing, comparison, and dictionary insertion are all much faster than strptime.

For tables where dates are repeated many times, this can result in an orders-of-magnitude speedup.
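The claim is easy to check with a micro-benchmark. The sketch below uses an assumed `%Y-%m-%d` format and synthetic data with only 10 distinct dates across 100,000 rows, so the cached version does 10 strptime calls instead of 100,000:

```python
import datetime
import timeit

FMT = "%Y-%m-%d"  # assumed example format

def parse_date(date_str):
    return datetime.datetime.strptime(date_str, FMT)

def parse_date_col(str_col):
    # naive version: one strptime call per row
    return [parse_date(s) for s in str_col]

def parse_date_col_cached(str_col):
    # cached version: one strptime call per distinct string
    cache = {}
    for s in str_col:
        if s not in cache:
            cache[s] = parse_date(s)
    return [cache[s] for s in str_col]

# 100,000 rows drawn from only 10 distinct dates
col = ["2015-03-%02d" % (d + 1) for d in range(10)] * 10000

plain = timeit.timeit(lambda: parse_date_col(col), number=1)
cached = timeit.timeit(lambda: parse_date_col_cached(col), number=1)
print("plain: %.3fs  cached: %.3fs" % (plain, cached))
```

The actual ratio depends on how many duplicates the column contains; with all-unique dates the cache adds a small overhead instead.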

Thanks
Charles


Labels

IO CSV (read_csv, to_csv), Performance (memory or execution speed)
