ENH: Add dateutil timezone support (GH4688) #4689

prossahl · 2013-08-27T16:31:45Z

This adds dateutil time zone support as a compliment to pytz. The tests are designed to demonstrate that the same result is obtained using either library except where the libraries themselves differ (at the onset of DST).

There is a fair amount of tricky coding here in tslib.pyx as it is accessing private members of the time zone object, cacheing them and performing calculations on those cached values.

This should close #4688.

prossahl · 2013-08-28T08:01:23Z

The 2.6 and 3.2 and 3.3 builds are failing. I'll reproduce these locally and update a fix.

jtratner · 2013-08-28T17:09:59Z

Why would we want to make this change? Is there some additional functionality or flexibility we get from using dateutil's timezone support? (For example, if this removed a dependency on pytz, that could be a net benefit).

A few other items that would be helpful:

Perf tests (using test_perf.sh)
Depending on how extensive the change is, an example or description in the documentation showing how you choose to use dateutil vs. pytz.

Finally, if this is supposed to remove a pytz dependency, can you try removing the pytz imports and seeing if this will work without it (and therefore confirm that timezones aren't getting changed under the hood - maybe you're already doing this).

jreback · 2013-08-28T17:35:29Z

also pandas would like to possibly remove the dateutil dep in the future (which will take some work); so pytz is the preferred backend

wesm · 2013-08-29T00:57:00Z

One problem is that dateutil is used widely in production so eliminating its use may be hard / not worth it. But not completely clear to me.

prossahl · 2013-08-29T11:11:17Z

I added some rationale issue 4688 here but the intent was not to remove pytz but to prevent dateutil users from a situation where a Timestamp can be constructed with a dateutil time zone but it is not convertible to another time zone.

@jtratner I'll do the performance test and add some doc. on the comparison between pytz and dateutil.

jreback · 2013-08-29T11:31:36Z

can you show an example of the above? how a Timestamp cannot be converted to another time zone?

prossahl · 2013-08-29T12:04:07Z

Sure, in the case of dateutil:

>>> import pandas as pd
>>> import dateutil
>>> ts = pd.Timestamp('2013-08-30 12:00', tz=dateutil.tz.gettz('US/Eastern'))
>>> ts.astimezone(tz=dateutil.tz.gettz('US/Pacific'))
Exception AttributeError: "'NoneType' object has no attribute 'toordinal'" in 'pandas.tslib._localize_tso' ignored
<Timestamp: 2013-08-30 16:00:00>

The converted Timestamp is naive and has the UTC value.

jreback · 2013-08-29T12:51:32Z

it seems that this could easily be solved by always using a stringified tz (instead of the object directly), which would then use pytz internally

prossahl · 2013-08-29T14:09:29Z

Agreed. Indeed that was option (b) I identified here: #4688 (comment)

However there is some performance impact in that there is a (small) cost associated with creating a new pytz and a (larger) cost for the cache of keeping the (effectively duplicate) data from both pytz and dateutil zones.
The cache cost would be reduced if the dateutil was converted to pytz on construction and in astimezone() so only pytz data was in the cache.

However the intent of my change was to allow users to use Pandas with either library as some people prefer dateutil over pytz.

jreback · 2013-08-29T14:28:12Z

can u quantify to perf / mem cost of these options? (don't go crazy just some simple benchmarks)

jtratner · 2013-08-30T20:57:05Z

pandas/tseries/tests/test_timezones.py

+
+    def _assert_two_datetime_values_same(self, a, b):
+        err_code = self._two_datetime_values_same(a, b)
+        self.assertEquals(err_code, 0, '%s != %s with err_code: %d' % (repr(a), repr(b), err_code))


I get what you're trying to do, but it would probably be cleaner to use the assert_attr_equal function from pandas/util/testing

def _two_values_same_attributes(self, a, b, attrs): for attr in attrs: assert_attr_equal(attr, a, b)

That way you'll get an error message that specifies exactly which attribute is off. If you want to collect them, you could run through the attrs first and then raise an error at the end:

non_equal = [attr for attr in attrs if getattr(a, attr) != getattr(b, attr)] if non_equal: raise AssertionError("Attributes not equal: %r for objects %s and %s" % (attrs, a, b))

The error code part is just hard to debug.

jtratner · 2013-08-30T21:01:49Z

Should there be a test case for doing arithmetic with a pytz timezone-aware series and a dateutil timezone-aware series/single object? Don't need to run through every operation, but would be nice to test subtraction (which should result in timedeltas). I looked through your test cases and I didn't see something like this, but I'm not sure if it's in scope (especially because I don't use datetime things frequently).

prossahl · 2013-10-16T15:11:26Z

I have added the Series subtraction test and finished the refactor of the tslib timezone cache so that the Travis builds pass. So as far as I can see this is good to go!

jreback · 2013-10-16T15:42:01Z

can you run a perf test and post relevant results (e.g. timeseries_*)....

test_perf.sh -b <commit prior> -t <lastetst commit of yours>

prossahl · 2013-10-17T10:42:49Z

Hi @jreback ,
Ran vbench and here are the results for the *time* benchmarks:

* cat vb_suite.log | grep time
series_timestamp_compare                     |   3.2133 |   4.2440 |   0.7571 |
timestamp_series_compare                     |   3.2264 |   4.2516 |   0.7589 |
reshape_pivot_time_series                    | 200.9457 | 238.4533 |   0.8427 |
timeseries_timestamp_downsample_mean         |   5.5393 |   6.0937 |   0.9090 |
timeseries_add_irregular                     |  21.1617 |  23.1246 |   0.9151 |
query_datetime_series                        |  27.3857 |  29.8944 |   0.9161 |
timeseries_to_datetime_YYYYMMDD              |  11.0830 |  12.0927 |   0.9165 |
plot_timeseries_period                       |  75.2217 |  81.9450 |   0.9180 |
timeseries_infer_freq                        |  10.4183 |  11.1361 |   0.9355 |
datetimeindex_infer_dst                      |   4.0346 |   4.2646 |   0.9461 |
timeseries_asof_nan                          |   8.3126 |   8.7557 |   0.9494 |
timeseries_asof                              |   9.0500 |   9.5197 |   0.9507 |
timeseries_1min_5min_mean                    |   0.7903 |   0.8213 |   0.9623 |
timeseries_period_downsample_mean            |  12.7714 |  13.2470 |   0.9641 |
timeseries_sort_index                        |  22.5794 |  23.4093 |   0.9645 |
timeseries_1min_5min_ohlc                    |   0.8423 |   0.8703 |   0.9679 |
timeseries_asof_single                       |   0.0513 |   0.0530 |   0.9685 |
datetime_index_union                         |   0.0937 |   0.0963 |   0.9728 |
datetime_index_intersection                  |   0.3873 |   0.3970 |   0.9754 |
datetimeindex_add_offset                     |   0.2637 |   0.2703 |   0.9756 |
query_datetime_index                         | 610.0246 | 622.7791 |   0.9795 |
timeseries_slice_minutely                    |   0.0896 |   0.0910 |   0.9852 |
datetimeindex_normalize                      |   3.5630 |   3.6083 |   0.9874 |
datetimeindex_unique                         |   0.1566 |   0.1580 |   0.9914 |
timeseries_to_datetime_iso8601               |   5.8773 |   5.8760 |   1.0002 |
timeseries_large_lookup_value                |   0.0407 |   0.0406 |   1.0020 |
index_datetime_union                         |  14.8304 |  14.2953 |   1.0374 |
index_datetime_intersection                  |  14.9283 |  14.3057 |   1.0435 |
timeseries_timestamp_tzinfo_cons             |   0.0237 |   0.0203 |   1.1641 |
Target [d5c0b20] : ENH: Add dateutil timezone support (GH4688)

What do you think?

jtratner · 2013-10-17T12:07:30Z

Can you run that last one again and see what you get? Could just be noise.

prossahl · 2013-10-17T12:59:21Z

Not noise methinks, results from 7 runs:

timeseries_timestamp_tzinfo_cons             |   0.0240 |   0.0193 |   1.2428 |
timeseries_timestamp_tzinfo_cons             |   0.0257 |   0.0187 |   1.3745 |
timeseries_timestamp_tzinfo_cons             |   0.0237 |   0.0203 |   1.1641 |
timeseries_timestamp_tzinfo_cons             |   0.0269 |   0.0193 |   1.3951 |
timeseries_timestamp_tzinfo_cons             |   0.0240 |   0.0213 |   1.1269 |
timeseries_timestamp_tzinfo_cons             |   0.0237 |   0.0190 |   1.2469 |
timeseries_timestamp_tzinfo_cons             |   0.0237 |   0.0216 |   1.0956 |

Average 1.235

prossahl · 2013-10-17T16:57:46Z

@jtratner What do you think? I have had a close look at my changes and can't see any added bottlenecks.

jtratner · 2013-10-17T21:06:12Z

I have a number of suspicions/comments (and you should try running cython
-a to see which calls will stay c-level vs. Python-level):

I'm assuming that one or more of these functions is being called many many
times, otherwise most of these things would be small. The major issue (I
think) is that you've introduced a number of places where Cython has to
bump up to Python-level for hasattr/getattr calls. The best way to resolve
this would be to have separate functions for pytz vs. dateutil instead of
determining which tz you have each time. We should set up this module to
return the function to apply and avoid the determination each time (and you
need to make the assumption that you aren't going to have mixed tzs or
whatever) or delegate the iteration to Cython so it can call the function
from there.

I think if you define a few globals that cover some of the underlying
attributes (like ), etc. you could get some speedups. You can also use
cython -a <filename> to see which calls will stay c-level vs.
Python-level (the more yellow, the more Python required).

I believe

In addition to that, you've added a number of getattr/hasattr calls
(module.something.another.thing means 3 getattr calls, all of which need
to go to Python level) throughout, particularly the checks for which type
of tz you have.

You've changed the is_utc function from using a predefined global
(_u_utc) to doing 2 Python getattr lookups: dateutil.tz.tzutc.
Your functions to test also add two more hasattr checks which cause two
additional calls up to Python-level.
If we're assuming pytz is the standard and you don't want to have
separate functions, it could make more sense to make the code try doing it
as pytz first, catch the attribute error and then use a different code
path.
Seems like you could inline each of the cpdef'd functions (and then if
you feel like it's actually needed outside of Cython code, provide a Python
wrapper class that's def'd).
_is_fixed_offset has gone from one attribute checks + return to 3-6
attribute checks at the minimum, plus boolean comparisons, etc.
_tz_cache_key does at least 2 Python lookups (for the pytz checks) + a
call to str and type conversion (this may be the same).
You get two python lookups from pytz.tzinfo.BaseTzInfo, probably better
to just define a global BaseTzInfo=pytz.tzinfo.BaseTzInfo.

Perhaps the best option is to test for pytz vs. dateutil once then use
two different functions (similar to what we do for code generation for
different types). Heck, you could do it at Python level and then pass to
different functions depending on which you have (or use a dict too lookup
(i.e.): {'pytz': pytz_func, 'dateutil': dateutil_func}/

Not sure whether isinstance checks would be different (I doubt it)

prossahl · 2013-10-18T15:21:47Z

@jtratner Thanks so much for taking the time to do such a detailed analysis. I haven't seen any significant speed improvement with these optimisations yet. I'm on a different project next week and I'll return to this after completing that. Thanks again.

jtratner · 2013-10-18T16:06:43Z

Bummer - I'm mostly spitballing here anyway - maybe I'll see if any of them
actually have an effect this weekend (actually see if my comments are
useful).

jreback · 2014-02-16T12:08:19Z

closing as stale

@prossahl if u would like to come back to this let us know
has merit

dbew · 2014-03-04T12:11:30Z

I've picked this up in https://github.com/orgs/ahlmss where @prossahl left off. We're still interested in pursuing the work in GH4688 but not through this PR. We've found some problems with the code in this PR so we're happy to close it and open a new request related to the original issue.

The problem we found is that conversion between pytz and dateutil can't be round-tripped - and this PR implicitly converts from dateutil to pytz.

I'm working on a patch which adds full support for dateutil (e.g. create TimeStamp, DatetimeIndex, DataFrame with dateutil timezones, avoid conversions where possible) but fails some tests. I'll be looking at doing some work on this over the next few weeks to try and get it working fully.

This would only really be worthwhile for us if we can get it merged into pandas (otherwise we'll end up in merging-hell at every pandas release). Are you and the rest of the pandas team still interested in dateutil support?

dbew · 2014-03-04T12:11:37Z

I've picked this up in https://github.com/orgs/ahlmss where @prossahl left off. We're still interested in pursuing the work in GH4688 but not through this PR. We've found some problems with the code in this PR so we're happy to close it and open a new request related to the original issue.

The problem we found is that conversion between pytz and dateutil can't be round-tripped - and this PR implicitly converts from dateutil to pytz.

I'm working on a patch which adds full support for dateutil (e.g. create TimeStamp, DatetimeIndex, DataFrame with dateutil timezones, avoid conversions where possible) but fails some tests. I'll be looking at doing some work on this over the next few weeks to try and get it working fully.

This would only really be worthwhile for us if we can get it merged into pandas (otherwise we'll end up in merging-hell at every pandas release). Are you and the rest of the pandas team still interested in dateutil support?

jreback · 2014-03-04T12:48:42Z

I think this would be a worthwhile addition, but you are right. needs to be integrated in a more seemless manner. go for a PR and will close this then (just so this is still a reminder)

jtratner reviewed Aug 30, 2013
View reviewed changes

ENH: Add dateutil timezone support (GH4688)

d5c0b20

jorisvandenbossche mentioned this pull request Oct 16, 2013

ENH: Scatterplot Method added #3473

Merged

jreback closed this Jan 3, 2014

jreback reopened this Jan 3, 2014

jreback closed this Feb 16, 2014

dbew mentioned this pull request Apr 25, 2014

ENH: Support dateutil timezones. GH4688. #6968

Merged

Uh oh!

ENH: Add dateutil timezone support (GH4688) #4689

ENH: Add dateutil timezone support (GH4688) #4689

Uh oh!

Conversation

prossahl commented Aug 27, 2013

Uh oh!

prossahl commented Aug 28, 2013

Uh oh!

jtratner commented Aug 28, 2013

Uh oh!

jreback commented Aug 28, 2013

Uh oh!

wesm commented Aug 29, 2013

Uh oh!

prossahl commented Aug 29, 2013

Uh oh!

jreback commented Aug 29, 2013

Uh oh!

prossahl commented Aug 29, 2013

Uh oh!

jreback commented Aug 29, 2013

Uh oh!

prossahl commented Aug 29, 2013

Uh oh!

jreback commented Aug 29, 2013

Uh oh!

jtratner Aug 30, 2013

Choose a reason for hiding this comment

Uh oh!

jtratner commented Aug 30, 2013

Uh oh!

prossahl commented Oct 16, 2013

Uh oh!

jreback commented Oct 16, 2013

Uh oh!

prossahl commented Oct 17, 2013

Uh oh!

jtratner commented Oct 17, 2013

Uh oh!

prossahl commented Oct 17, 2013

Uh oh!

prossahl commented Oct 17, 2013

Uh oh!

jtratner commented Oct 17, 2013

Uh oh!

prossahl commented Oct 18, 2013

Uh oh!

jtratner commented Oct 18, 2013

Uh oh!

jreback commented Feb 16, 2014

Uh oh!

dbew commented Mar 4, 2014

Uh oh!

dbew commented Mar 4, 2014

Uh oh!

jreback commented Mar 4, 2014

Uh oh!

Uh oh!