Skip to content

ENH: Add dateutil timezone support (GH4688) #4689

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

prossahl
Copy link
Contributor

This adds dateutil time zone support as a compliment to pytz. The tests are designed to demonstrate that the same result is obtained using either library except where the libraries themselves differ (at the onset of DST).

There is a fair amount of tricky coding here in tslib.pyx as it is accessing private members of the time zone object, cacheing them and performing calculations on those cached values.

This should close #4688.

@prossahl
Copy link
Contributor Author

The 2.6 and 3.2 and 3.3 builds are failing. I'll reproduce these locally and update a fix.

@jtratner
Copy link
Contributor

Why would we want to make this change? Is there some additional functionality or flexibility we get from using dateutil's timezone support? (For example, if this removed a dependency on pytz, that could be a net benefit).

A few other items that would be helpful:

  1. Perf tests (using test_perf.sh)
  2. Depending on how extensive the change is, an example or description in the documentation showing how you choose to use dateutil vs. pytz.

Finally, if this is supposed to remove a pytz dependency, can you try removing the pytz imports and seeing if this will work without it (and therefore confirm that timezones aren't getting changed under the hood - maybe you're already doing this).

@jreback
Copy link
Contributor

jreback commented Aug 28, 2013

also pandas would like to possibly remove the dateutil dep in the future (which will take some work); so pytz is the preferred backend

@wesm
Copy link
Member

wesm commented Aug 29, 2013

One problem is that dateutil is used widely in production so eliminating its use may be hard / not worth it. But not completely clear to me.

@prossahl
Copy link
Contributor Author

I added some rationale issue 4688 here but the intent was not to remove pytz but to prevent dateutil users from a situation where a Timestamp can be constructed with a dateutil time zone but it is not convertible to another time zone.

@jtratner I'll do the performance test and add some doc. on the comparison between pytz and dateutil.

@jreback
Copy link
Contributor

jreback commented Aug 29, 2013

can you show an example of the above? how a Timestamp cannot be converted to another time zone?

@prossahl
Copy link
Contributor Author

Sure, in the case of dateutil:

>>> import pandas as pd
>>> import dateutil
>>> ts = pd.Timestamp('2013-08-30 12:00', tz=dateutil.tz.gettz('US/Eastern'))
>>> ts.astimezone(tz=dateutil.tz.gettz('US/Pacific'))
Exception AttributeError: "'NoneType' object has no attribute 'toordinal'" in 'pandas.tslib._localize_tso' ignored
<Timestamp: 2013-08-30 16:00:00>

The converted Timestamp is naive and has the UTC value.

@jreback
Copy link
Contributor

jreback commented Aug 29, 2013

it seems that this could easily be solved by always using a stringified tz (instead of the object directly), which would then use pytz internally

@prossahl
Copy link
Contributor Author

Agreed. Indeed that was option (b) I identified here: #4688 (comment)

However there is some performance impact in that there is a (small) cost associated with creating a new pytz and a (larger) cost for the cache of keeping the (effectively duplicate) data from both pytz and dateutil zones.
The cache cost would be reduced if the dateutil was converted to pytz on construction and in astimezone() so only pytz data was in the cache.

However the intent of my change was to allow users to use Pandas with either library as some people prefer dateutil over pytz.

@jreback
Copy link
Contributor

jreback commented Aug 29, 2013

can u quantify to perf / mem cost of these options? (don't go crazy just some simple benchmarks)


def _assert_two_datetime_values_same(self, a, b):
err_code = self._two_datetime_values_same(a, b)
self.assertEquals(err_code, 0, '%s != %s with err_code: %d' % (repr(a), repr(b), err_code))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get what you're trying to do, but it would probably be cleaner to use the assert_attr_equal function from pandas/util/testing

def _two_values_same_attributes(self, a, b, attrs):
    for attr in attrs:
        assert_attr_equal(attr, a, b)

That way you'll get an error message that specifies exactly which attribute is off. If you want to collect them, you could run through the attrs first and then raise an error at the end:

    non_equal = [attr for attr in attrs if getattr(a, attr) != getattr(b, attr)]
    if non_equal:
        raise AssertionError("Attributes not equal: %r for objects %s and %s" % (attrs, a, b))

The error code part is just hard to debug.

@jtratner
Copy link
Contributor

Should there be a test case for doing arithmetic with a pytz timezone-aware series and a dateutil timezone-aware series/single object? Don't need to run through every operation, but would be nice to test subtraction (which should result in timedeltas). I looked through your test cases and I didn't see something like this, but I'm not sure if it's in scope (especially because I don't use datetime things frequently).

@prossahl
Copy link
Contributor Author

I have added the Series subtraction test and finished the refactor of the tslib timezone cache so that the Travis builds pass. So as far as I can see this is good to go!

@jreback
Copy link
Contributor

jreback commented Oct 16, 2013

can you run a perf test and post relevant results (e.g. timeseries_*)....

test_perf.sh -b <commit prior> -t <lastetst commit of yours>

@prossahl
Copy link
Contributor Author

Hi @jreback ,
Ran vbench and here are the results for the *time* benchmarks:

* cat vb_suite.log | grep time
series_timestamp_compare                     |   3.2133 |   4.2440 |   0.7571 |
timestamp_series_compare                     |   3.2264 |   4.2516 |   0.7589 |
reshape_pivot_time_series                    | 200.9457 | 238.4533 |   0.8427 |
timeseries_timestamp_downsample_mean         |   5.5393 |   6.0937 |   0.9090 |
timeseries_add_irregular                     |  21.1617 |  23.1246 |   0.9151 |
query_datetime_series                        |  27.3857 |  29.8944 |   0.9161 |
timeseries_to_datetime_YYYYMMDD              |  11.0830 |  12.0927 |   0.9165 |
plot_timeseries_period                       |  75.2217 |  81.9450 |   0.9180 |
timeseries_infer_freq                        |  10.4183 |  11.1361 |   0.9355 |
datetimeindex_infer_dst                      |   4.0346 |   4.2646 |   0.9461 |
timeseries_asof_nan                          |   8.3126 |   8.7557 |   0.9494 |
timeseries_asof                              |   9.0500 |   9.5197 |   0.9507 |
timeseries_1min_5min_mean                    |   0.7903 |   0.8213 |   0.9623 |
timeseries_period_downsample_mean            |  12.7714 |  13.2470 |   0.9641 |
timeseries_sort_index                        |  22.5794 |  23.4093 |   0.9645 |
timeseries_1min_5min_ohlc                    |   0.8423 |   0.8703 |   0.9679 |
timeseries_asof_single                       |   0.0513 |   0.0530 |   0.9685 |
datetime_index_union                         |   0.0937 |   0.0963 |   0.9728 |
datetime_index_intersection                  |   0.3873 |   0.3970 |   0.9754 |
datetimeindex_add_offset                     |   0.2637 |   0.2703 |   0.9756 |
query_datetime_index                         | 610.0246 | 622.7791 |   0.9795 |
timeseries_slice_minutely                    |   0.0896 |   0.0910 |   0.9852 |
datetimeindex_normalize                      |   3.5630 |   3.6083 |   0.9874 |
datetimeindex_unique                         |   0.1566 |   0.1580 |   0.9914 |
timeseries_to_datetime_iso8601               |   5.8773 |   5.8760 |   1.0002 |
timeseries_large_lookup_value                |   0.0407 |   0.0406 |   1.0020 |
index_datetime_union                         |  14.8304 |  14.2953 |   1.0374 |
index_datetime_intersection                  |  14.9283 |  14.3057 |   1.0435 |
timeseries_timestamp_tzinfo_cons             |   0.0237 |   0.0203 |   1.1641 |
Target [d5c0b20] : ENH: Add dateutil timezone support (GH4688)

What do you think?

@jtratner
Copy link
Contributor

Can you run that last one again and see what you get? Could just be noise.

@prossahl
Copy link
Contributor Author

Not noise methinks, results from 7 runs:

timeseries_timestamp_tzinfo_cons             |   0.0240 |   0.0193 |   1.2428 |
timeseries_timestamp_tzinfo_cons             |   0.0257 |   0.0187 |   1.3745 |
timeseries_timestamp_tzinfo_cons             |   0.0237 |   0.0203 |   1.1641 |
timeseries_timestamp_tzinfo_cons             |   0.0269 |   0.0193 |   1.3951 |
timeseries_timestamp_tzinfo_cons             |   0.0240 |   0.0213 |   1.1269 |
timeseries_timestamp_tzinfo_cons             |   0.0237 |   0.0190 |   1.2469 |
timeseries_timestamp_tzinfo_cons             |   0.0237 |   0.0216 |   1.0956 |

Average 1.235

@prossahl
Copy link
Contributor Author

@jtratner What do you think? I have had a close look at my changes and can't see any added bottlenecks.

@jtratner
Copy link
Contributor

I have a number of suspicions/comments (and you should try running cython
-a to see which calls will stay c-level vs. Python-level):

I'm assuming that one or more of these functions is being called many many
times, otherwise most of these things would be small. The major issue (I
think) is that you've introduced a number of places where Cython has to
bump up to Python-level for hasattr/getattr calls. The best way to resolve
this would be to have separate functions for pytz vs. dateutil instead of
determining which tz you have each time. We should set up this module to
return the function to apply and avoid the determination each time (and you
need to make the assumption that you aren't going to have mixed tzs or
whatever) or delegate the iteration to Cython so it can call the function
from there.

I think if you define a few globals that cover some of the underlying
attributes (like ), etc. you could get some speedups. You can also use
cython -a <filename> to see which calls will stay c-level vs.
Python-level (the more yellow, the more Python required).

I believe

In addition to that, you've added a number of getattr/hasattr calls
(module.something.another.thing means 3 getattr calls, all of which need
to go to Python level) throughout, particularly the checks for which type
of tz you have.

  1. You've changed the is_utc function from using a predefined global
    (_u_utc) to doing 2 Python getattr lookups: dateutil.tz.tzutc.
  2. Your functions to test also add two more hasattr checks which cause two
    additional calls up to Python-level.
  3. If we're assuming pytz is the standard and you don't want to have
    separate functions, it could make more sense to make the code try doing it
    as pytz first, catch the attribute error and then use a different code
    path.
  4. Seems like you could inline each of the cpdef'd functions (and then if
    you feel like it's actually needed outside of Cython code, provide a Python
    wrapper class that's def'd).
  5. _is_fixed_offset has gone from one attribute checks + return to 3-6
    attribute checks at the minimum, plus boolean comparisons, etc.
  6. _tz_cache_key does at least 2 Python lookups (for the pytz checks) + a
    call to str and type conversion (this may be the same).
  7. You get two python lookups from pytz.tzinfo.BaseTzInfo, probably better
    to just define a global BaseTzInfo=pytz.tzinfo.BaseTzInfo.

Perhaps the best option is to test for pytz vs. dateutil once then use
two different functions (similar to what we do for code generation for
different types). Heck, you could do it at Python level and then pass to
different functions depending on which you have (or use a dict too lookup
(i.e.): {'pytz': pytz_func, 'dateutil': dateutil_func}/

Not sure whether isinstance checks would be different (I doubt it)

@prossahl
Copy link
Contributor Author

@jtratner Thanks so much for taking the time to do such a detailed analysis. I haven't seen any significant speed improvement with these optimisations yet. I'm on a different project next week and I'll return to this after completing that. Thanks again.

@jtratner
Copy link
Contributor

Bummer - I'm mostly spitballing here anyway - maybe I'll see if any of them
actually have an effect this weekend (actually see if my comments are
useful).

@jreback jreback closed this Jan 3, 2014
@jreback jreback reopened this Jan 3, 2014
@jreback
Copy link
Contributor

jreback commented Feb 16, 2014

closing as stale

@prossahl if u would like to come back to this let us know
has merit

@jreback jreback closed this Feb 16, 2014
@dbew
Copy link
Contributor

dbew commented Mar 4, 2014

I've picked this up in https://github.com/orgs/ahlmss where @prossahl left off. We're still interested in pursuing the work in GH4688 but not through this PR. We've found some problems with the code in this PR so we're happy to close it and open a new request related to the original issue.

The problem we found is that conversion between pytz and dateutil can't be round-tripped - and this PR implicitly converts from dateutil to pytz.

I'm working on a patch which adds full support for dateutil (e.g. create TimeStamp, DatetimeIndex, DataFrame with dateutil timezones, avoid conversions where possible) but fails some tests. I'll be looking at doing some work on this over the next few weeks to try and get it working fully.

This would only really be worthwhile for us if we can get it merged into pandas (otherwise we'll end up in merging-hell at every pandas release). Are you and the rest of the pandas team still interested in dateutil support?

1 similar comment
@dbew
Copy link
Contributor

dbew commented Mar 4, 2014

I've picked this up in https://github.com/orgs/ahlmss where @prossahl left off. We're still interested in pursuing the work in GH4688 but not through this PR. We've found some problems with the code in this PR so we're happy to close it and open a new request related to the original issue.

The problem we found is that conversion between pytz and dateutil can't be round-tripped - and this PR implicitly converts from dateutil to pytz.

I'm working on a patch which adds full support for dateutil (e.g. create TimeStamp, DatetimeIndex, DataFrame with dateutil timezones, avoid conversions where possible) but fails some tests. I'll be looking at doing some work on this over the next few weeks to try and get it working fully.

This would only really be worthwhile for us if we can get it merged into pandas (otherwise we'll end up in merging-hell at every pandas release). Are you and the rest of the pandas team still interested in dateutil support?

@jreback
Copy link
Contributor

jreback commented Mar 4, 2014

I think this would be a worthwhile addition, but you are right. needs to be integrated in a more seemless manner. go for a PR and will close this then (just so this is still a reminder)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Add dateutil timezone support
5 participants