Optimize read_datetime #1019
Conversation
I reviewed the performance files using SnakeViz and they reflect the performance upgrade.
year=int(date_str[:4]) if date_str[:4] != "" else 0,
month=int(date_str[4:6]) if date_str[4:6] != "" else 0,
day=int(date_str[-2:] if date_str[-2:] != "" else 0),
Suggested change:
- year=int(date_str[:4]) if date_str[:4] != "" else 0,
- month=int(date_str[4:6]) if date_str[4:6] != "" else 0,
- day=int(date_str[-2:] if date_str[-2:] != "" else 0),
+ year=int(date_str[:4] or 0),
+ month=int(date_str[4:6] or 0),
+ day=int(date_str[-2:] or 0),
hour=int(time_str[-12:-10]) if time_str[-12:-10] != "" else 0,
minute=int(time_str[-10:-8]) if time_str[-10:-8] != "" else 0,
second=int(time_str[-8:-6]) if time_str[-8:-6] != "" else 0,
microsecond=int(time_str[6:12]) if time_str[6:12] != "" else 0,
Suggested change:
- hour=int(time_str[-12:-10]) if time_str[-12:-10] != "" else 0,
- minute=int(time_str[-10:-8]) if time_str[-10:-8] != "" else 0,
- second=int(time_str[-8:-6]) if time_str[-8:-6] != "" else 0,
- microsecond=int(time_str[6:12]) if time_str[6:12] != "" else 0,
+ hour=int(time_str[-12:-10] or 0),
+ minute=int(time_str[-10:-8] or 0),
+ second=int(time_str[-8:-6] or 0),
+ microsecond=int(time_str[6:12] or 0),
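The placement of the parentheses matters in these suggestions. A minimal stand-alone sketch (made-up values, not DataJoint's actual code) showing why `int(s or 0)` is the safe form for possibly-empty slices, while `int(s) or 0` is not:

```python
date_str = "20240315"  # hypothetical packed YYYYMMDD string

year = int(date_str[:4] or 0)    # "2024" -> 2024
month = int(date_str[4:6] or 0)  # "03"   -> 3
day = int(date_str[-2:] or 0)    # "15"   -> 15

empty = ""
# "" is falsy, so int() receives the fallback 0 instead of the string.
assert int(empty or 0) == 0

# With the parentheses outside, int("") raises before `or` can apply.
try:
    int(empty) or 0
except ValueError:
    print("int('') raises ValueError")
```

Slicing a too-short string yields `""` rather than raising, which is why the empty-string case has to be handled before `int()` sees the value.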
date = (
    datetime.date(year=date // 10000, month=(date // 100) % 100, day=date % 100)
    if date >= 0
    else None
)
time = (
    datetime.time(
        hour=(time // 10000000000) % 100,
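For context, here is a self-contained sketch of the floor-division decoding quoted above. The packed layouts (YYYYMMDD for dates, HHMMSS followed by six microsecond digits for times) and the function name are assumptions for illustration, not DataJoint's exact code:

```python
import datetime

def decode(date: int, time: int):
    """Decode a date packed as YYYYMMDD and a time packed as
    HHMMSSffffff (microseconds in the low six digits)."""
    d = (
        datetime.date(year=date // 10000, month=(date // 100) % 100, day=date % 100)
        if date >= 0
        else None  # negative sentinel means "no date"
    )
    t = datetime.time(
        hour=(time // 10_000_000_000) % 100,
        minute=(time // 100_000_000) % 100,
        second=(time // 1_000_000) % 100,
        microsecond=time % 1_000_000,
    )
    return d, t

d, t = decode(20240315, 134501500000)
print(d, t)  # 2024-03-15 13:45:01.500000
```

Each field is extracted purely with integer arithmetic, so no intermediate strings are allocated per value.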
The string parsing takes longer according to %%timeit tests and is more verbose.
The way I found that this results in a speedup: I profiled the unpacking of an np.array of 100000 datetime objects, then overloaded the read_datetime method and profiled unpack again. In the cProfile results, the string-processing method showed less total time spent in read_datetime. But it seems that when you profile it you get different results.
Do you want to tag up on this sometime? I also have a Python script that generates the cProfile performance profile, which you can visualize with a Python package called snakeviz; I can send you the script and we can look at it together to see whether there is an error in my logic.
I will attach pictures of the visualized performance profiles below.
Specifically, I looked at the decrease in total execution time as well as the tottime column, which represents the total time spent in each method across all calls. Please let me know what you think @dimitri-yatsenko
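The %%timeit comparison mentioned above can be reproduced with a small stand-alone benchmark. This times a single field extraction with made-up input, not the full read_datetime path, so it only indicates the relative cost of the two expression styles:

```python
import timeit

setup = 'date_str = "20240315"'
variants = {
    "conditional": 'int(date_str[:4]) if date_str[:4] != "" else 0',
    "or-idiom": 'int(date_str[:4] or 0)',
}

for label, stmt in variants.items():
    # Time one million evaluations of each expression.
    t = timeit.timeit(stmt, setup=setup, number=1_000_000)
    print(f"{label}: {t:.3f} s")
```

Micro-benchmarks like this and whole-program cProfile runs can legitimately disagree: cProfile adds per-call overhead that inflates functions with many short calls, which may explain the conflicting measurements discussed here.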
I just don't see a compelling reason why the string processing would produce a speedup. I think the blob deserialization is slow because Python needs to loop through the numbers, calling datetime.date separately for each time and each date.
A real speedup can potentially be produced by using numpy.datetime64 type support.
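To illustrate the direction this suggests: a hedged sketch of decomposing packed YYYYMMDD integers into datetime64 values with vectorized arithmetic. The packed layout and sample values are assumptions for illustration; this is not DataJoint code:

```python
import numpy as np

# Hypothetical packed YYYYMMDD integers.
packed = np.array([20240315, 20231201, 19991231], dtype=np.int64)

years = packed // 10000
months = (packed // 100) % 100
days = packed % 100

# Assemble datetime64[D] values arithmetically from the Unix epoch:
# no per-element datetime.date() calls, no Python-level loop.
dates = (
    (
        np.datetime64("1970", "Y") + (years - 1970).astype("timedelta64[Y]")
    ).astype("datetime64[M]")
    + (months - 1).astype("timedelta64[M]")
).astype("datetime64[D]") + (days - 1).astype("timedelta64[D]")

print(dates)  # ['2024-03-15' '2023-12-01' '1999-12-31']
```

Because every step operates on whole arrays, the interpreter loop that the comment above identifies as the bottleneck disappears entirely.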
Yes, I also think we could speed up the process by multiprocessing the unpacking of arrays.
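A minimal sketch of that idea, splitting the decoding across worker processes with multiprocessing.Pool. Here decode_one is a hypothetical stand-in for the per-element blob decoding, not DataJoint's actual reader:

```python
import datetime
from multiprocessing import Pool

def decode_one(packed: int) -> datetime.date:
    # Hypothetical stand-in for decoding one packed YYYYMMDD value.
    return datetime.date(packed // 10000, (packed // 100) % 100, packed % 100)

if __name__ == "__main__":
    packed = [20240315, 20231201, 19991231] * 1000
    # chunksize batches work items to reduce inter-process traffic.
    with Pool(processes=4) as pool:
        dates = pool.map(decode_one, packed, chunksize=500)
    print(len(dates), dates[0])
```

One caveat worth noting: process startup and pickling overhead can swamp the gain for small arrays, so this only pays off for large result sets.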
np.datetime64 did not exist when we made the original time serializer.
For now, we can recommend the workaround of storing datetimes as int64, as shown. We can add native support for the datetime64 data type later, which would eliminate the need to convert into uint64 and back.
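The int64 workaround can be sketched as a simple round-trip. The epoch unit (microseconds) and sample values are assumptions for illustration:

```python
import numpy as np

# Store datetime64 values as plain int64 (counts of a fixed unit since
# the Unix epoch), then restore them with the inverse cast.
stamps = np.array(
    ["2024-03-15T13:45:01.500000", "1999-12-31T23:59:59"],
    dtype="datetime64[us]",
)

as_int = stamps.astype(np.int64)            # microseconds since 1970-01-01
restored = as_int.astype("datetime64[us]")  # exact inverse, no precision loss

assert (restored == stamps).all()
print(as_int)
```

As long as the same unit is used in both directions, the round-trip is exact, which is what makes int64 a workable storage format until native datetime64 support lands.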
superseded by #1036
Switching from floor division to string processing results in a ~35% speedup in execution time for read_datetime, as measured with cProfile.