API: add DatetimeBlockTZ #8260 #10477

jreback · 2015-06-30T15:44:09Z

ToDos:

~~- [ ] get_values/values - make consistent~~
~~- [ ] maybe move DatetimeTZBlock.shift mostly to DatetimeIndex.shift~~

Also

This cleans up the internal blocks calling conventions a bit
Fixes a bug in DatetimeIndex and localizing when NaT's are present

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
timestamp_tz_ops_diff1                       |   0.1804 | 159.4527 |   0.0011 | # note these are 10k elements
timestamp_tz_ops_diff2                       |   2.5350 | 156.4047 |   0.0162 | # note these are 10k elements
timeseries_timestamp_downsample_mean         |   3.1467 |   3.3040 |   0.9524 |
timestamp_series_compare                     |   9.0797 |   9.1290 |   0.9946 |
timestamp_ops_diff2                          |  19.7570 |  19.6819 |   1.0038 | # this is 1M elements
series_timestamp_compare                     |   9.3430 |   9.0226 |   1.0355 |
timestamp_ops_diff1                          |   9.7457 |   9.0450 |   1.0775 | # this is 1M elements
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234

Target [8502474] : API: add Block.make_block
API: add DatetimeBlockWithTZ #8260
Base   [16a44ad] : Merge pull request #10199 from jreback/gil
PERF: releasing the GIL, #8882

Demo

In [1]:   df = DataFrame({'A' : date_range('20130101',periods=3),
   ...:                    'B' : date_range('20130101',periods=3,tz='US/Eastern'),
   ...:                    'C' : date_range('20130101',periods=3,tz='CET')})

In [2]:    df
Out[2]: 
           A                   B                   C
0 2013-01-01 2013-01-01 05:00:00 2012-12-31 23:00:00
1 2013-01-02 2013-01-02 05:00:00 2013-01-01 23:00:00
2 2013-01-03 2013-01-03 05:00:00 2013-01-02 23:00:00

In [3]:    df.dtypes
Out[3]: 
A                datetime64[ns]
B    datetime64[ns, US/Eastern]
C           datetime64[ns, CET]
dtype: object

In [4]: df.B
Out[4]: 
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
Name: B, dtype: datetime64[ns, US/Eastern]

In [5]: df.B.dt.tz_localize(None)
Out[5]: 
0   2013-01-01
1   2013-01-02
2   2013-01-03
dtype: datetime64[ns]

jreback · 2015-06-30T15:44:30Z

cc @ssanderson

ssanderson · 2015-06-30T15:50:27Z

Ooh, shiny! I'll try this out tonight.

jorisvandenbossche · 2015-07-01T11:15:10Z

@jreback Nice! I will try to go through it one of the coming days

ssanderson · 2015-07-01T23:27:37Z

@jreback one quirk I'm noticing is that if you construct a DataFrame directly from a DatetimeIndex, you lose the tz information:

In [1]: dates = date_range('2014-01-01', periods=10, tz='UTC')

In [2]: from_dict = DataFrame({'a': dates})

In [3]: from_dict.dtypes
Out[3]:
a    datetime64[ns, UTC]
dtype: object

In [4]: from_index = DataFrame(dates)

In [5]: from_index.dtypes
Out[5]:
0    datetime64[ns]
dtype: object

Is this expected behavior?
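
The comparison above can be written as a small standalone check. This is only a sketch, assuming a pandas release that already contains the fix made later in this thread; the variable names mirror the session above.

```python
import pandas as pd

# tz-aware index, as in the session above
dates = pd.date_range("2014-01-01", periods=10, tz="UTC")

# Path 1: construct from a dict of columns
from_dict = pd.DataFrame({"a": dates})

# Path 2: construct directly from the DatetimeIndex
from_index = pd.DataFrame(dates)

# Both construction paths are expected to keep the timezone on the column
dict_dtype = from_dict.dtypes.iloc[0]
index_dtype = from_index.dtypes.iloc[0]
print(dict_dtype)
print(index_dtype)
```

If the two printed dtypes differ, the direct-construction path is dropping the tz as reported here.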

ssanderson · 2015-07-01T23:31:08Z

Also surprising: if I take a series of dtype datetime64[ns, UTC] and call .values on it, I get a DatetimeIndex rather than an ndarray:

In [7]: from_dict['a'].values
Out[7]:
DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06', '2014-01-07', '2014-01-08',
               '2014-01-09', '2014-01-10'],
              dtype='datetime64[ns]', freq='D', tz='UTC')

The result I'd expect here is what you actually get from doing

In [21]: from_dict['a'].values.values
Out[21]:
array(['2013-12-31T19:00:00.000000000-0500',
       '2014-01-01T19:00:00.000000000-0500',
       '2014-01-02T19:00:00.000000000-0500',
       '2014-01-03T19:00:00.000000000-0500',
       '2014-01-04T19:00:00.000000000-0500',
       '2014-01-05T19:00:00.000000000-0500',
       '2014-01-06T19:00:00.000000000-0500',
       '2014-01-07T19:00:00.000000000-0500',
       '2014-01-08T19:00:00.000000000-0500',
       '2014-01-09T19:00:00.000000000-0500'], dtype='datetime64[ns]')

jreback · 2015-07-01T23:34:12Z

do you have the current one?

this all works (it might not have in a prior one)

jreback · 2015-07-01T23:36:54Z

This is the latest jreback@19ec61d I think

In [3]: from_dict
Out[3]: 
           a
0 2014-01-01
1 2014-01-02
2 2014-01-03
3 2014-01-04
4 2014-01-05
5 2014-01-06
6 2014-01-07
7 2014-01-08
8 2014-01-09
9 2014-01-10

In [4]: from_dict.dtypes
Out[4]: 
a    datetime64[ns, UTC]
dtype: object

Series.values IS a DatetimeIndex; that is how it's implemented. It's very similar to how Sparse/Categorical are done. This preserves the tz info inside the object (it actually keeps the freq too), rather than relying upon an ndarray impl and passing the tz around. Much cleaner this way.

ssanderson · 2015-07-01T23:36:56Z

I'm testing with a development install of this branch:

(pandas)[~/clones/pandas]@(tz:2422fe50fc)$ git log HEAD^..HEAD
commit 2422fe50fcc5aac541a0be8d67a4e86309b3e2ec
Author: Jeff Reback <jeff@reback.net>
Date:   Sat Jun 27 17:55:29 2015 -0400

    API: add DatetimeBlockTZ #8260

jreback · 2015-07-01T23:37:14Z

@ssanderson pretty old branch. pull my latest.

jreback · 2015-07-01T23:39:56Z

commit 757bbf92c926d4584d01bce419a576c7cb831fce
Author: Jeff Reback <jeff@reback.net>
Date:   Wed Jul 1 19:38:16 2015 -0400

    start on csv

commit 2422fe50fcc5aac541a0be8d67a4e86309b3e2ec
Author: Jeff Reback <jeff@reback.net>
Date:   Sat Jun 27 17:55:29 2015 -0400

    API: add DatetimeBlockTZ #8260

everything works except for csv round-tripping (not sure I can get it, as I can't repro the tz upon readback, but we'll see) and to/from hdf5 (but soon).

jreback · 2015-07-01T23:40:35Z

@ssanderson I have been amending that commit, so for sure update.

ssanderson · 2015-07-01T23:41:35Z

I pulled 2422fe50fcc5aac541a0be8d67a4e86309b3e2ec ~20 minutes ago. The commit hash would change if the content had changed.

jreback · 2015-07-01T23:42:17Z

did you make? (it has a tad bit of cython code changes)

ssanderson · 2015-07-01T23:43:14Z

I did a pip install -e . in a fresh virtualenv. What in the above that I posted is different from what you're seeing?

jreback · 2015-07-01T23:44:00Z

sorry, I didn't see what you meant above. DataFrame(new_index), hang on a sec. Didn't have a test for that.

ssanderson · 2015-07-01T23:44:37Z

The from_dict case is working as expected modulo the unexpected type of .values, which sounds like it's actually as-designed. The one that I think is incorrect is direct construction from a DatetimeIndex.

jreback · 2015-07-01T23:47:38Z

very subtle path difference here....fixing

In [6]: DataFrame(date_range('20130101',periods=3,tz='US/Eastern')).dtypes
Out[6]: 
0    datetime64[ns]
dtype: object

In [7]: DataFrame(date_range('20130101',periods=3,tz='US/Eastern',name='foo')).dtypes
Out[7]: 
foo    datetime64[ns, US/Eastern]
dtype: object

jreback · 2015-07-01T23:54:40Z

@ssanderson fixed

jreback · 2015-07-01T23:55:38Z

if you have a chance I'd like to see how you are actually using it; if you could post some small sample code, that would be great. I can time things, but having a use case is even better.

ssanderson · 2015-07-01T23:56:01Z

@jreback I think DataFrame indexing is broken with columns of tz-aware dtype:

In [2]: df = DataFrame({'a': date_range('2014-01-01', periods=10, tz='UTC')})

In [3]: df
Out[3]:
           a
0 2014-01-01
1 2014-01-02
2 2014-01-03
3 2014-01-04
4 2014-01-05
5 2014-01-06
6 2014-01-07
7 2014-01-08
8 2014-01-09
9 2014-01-10

In [4]: df.iloc[5]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-4-1d4dcfe8a425> in <module>()
----> 1 df.iloc[5]

/home/ssanderson/clones/pandas/pandas/core/indexing.pyc in __getitem__(self, key)
   1187             return self._getitem_tuple(key)
   1188         else:
-> 1189             return self._getitem_axis(key, axis=0)
   1190
   1191     def _getitem_axis(self, key, axis=0):

/home/ssanderson/clones/pandas/pandas/core/indexing.pyc in _getitem_axis(self, key, axis)
   1478                 self._is_valid_integer(key, axis)
   1479
-> 1480             return self._get_loc(key, axis=axis)
   1481
   1482     def _convert_to_indexer(self, obj, axis=0, is_setter=False):

/home/ssanderson/clones/pandas/pandas/core/indexing.pyc in _get_loc(self, key, axis)
     87
     88     def _get_loc(self, key, axis=0):
---> 89         return self.obj._ixs(key, axis=axis)
     90
     91     def _slice(self, obj, axis=0, kind=None):

/home/ssanderson/clones/pandas/pandas/core/frame.pyc in _ixs(self, i, axis)
   1727                     copy=True
   1728                 else:
-> 1729                     new_values = self._data.fast_xs(i)
   1730
   1731                     # if we are a copy, mark as such

/home/ssanderson/clones/pandas/pandas/core/internals.pyc in fast_xs(self, loc)
   2899         """
   2900         if len(self.blocks) == 1:
-> 2901             return self.blocks[0].values[:, loc]
   2902
   2903         items = self.items

/home/ssanderson/clones/pandas/pandas/tseries/base.pyc in __getitem__(self, key)
     93             attribs['freq'] = freq
     94
---> 95             result = getitem(key)
     96             if result.ndim > 1:
     97                 return result

IndexError: too many indices for array
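
The failing case above reduces to a short regression-style check. This is a sketch assuming a pandas release in which the problem has been fixed; it only verifies that positional row access works and that the scalar keeps its timezone.

```python
import pandas as pd

df = pd.DataFrame({"a": pd.date_range("2014-01-01", periods=10, tz="UTC")})

# Positional row access on a frame whose only column is tz-aware
row = df.iloc[5]

# The scalar that comes back should still carry the timezone
value = row["a"]
print(value, value.tz)
```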

jreback · 2015-07-02T00:07:19Z

hmm, let me look.

jreback · 2015-09-04T02:57:27Z

ok, I changed this. So now .values -> 'external values', and ._values -> 'internal values'. These are currently the same for everything except DatetimeTZ. This allows an internal implementation, and we can have an external .values that is different.

so:

In [1]: s = Series(date_range('20130101',periods=3,tz='US/Eastern'))

In [2]: s
Out[2]: 
0    2013-01-01 00:00:00-05:00
1    2013-01-02 00:00:00-05:00
2    2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]

In [3]: s.values
Out[3]: 
array([Timestamp('2013-01-01 00:00:00-0500', tz='US/Eastern'),
       Timestamp('2013-01-02 00:00:00-0500', tz='US/Eastern'),
       Timestamp('2013-01-03 00:00:00-0500', tz='US/Eastern')], dtype=object)

In [4]: s._values
Out[4]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00', '2013-01-03 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq='D')

jorisvandenbossche · 2015-09-04T12:32:49Z

@jreback Thanks for this change!

Did some quick testing, and some more feedback:

What do you guys think the dtype of .values should be?

- object with Timestamp objects -> this is what the PR currently does, and it is also backwards compatible (as this is how tz-aware datetime data are stored in a frame now, as objects)
- datetime64 -> I think this is the more useful return value, if the reason that you access the raw .values numpy array is to do some performant operation on it. The tz can always be accessed separately (s.dt.tz) if you want to keep it alongside the numpy array. Also, I think there is no easy way to get the datetime64 values from an object array of Timestamps?
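
To make the two candidate return types concrete, here is a hedged sketch using public API from recent pandas releases (`Series.to_numpy`, `Series.dt.tz_convert`, `pd.to_datetime`). It is not what the PR implements internally; it only shows both representations and one way to get from the object array back to datetime64.

```python
import pandas as pd

s = pd.Series(pd.date_range("20130101", periods=3, tz="US/Eastern"))

# Option 1: object array holding tz-aware Timestamp objects
as_objects = s.to_numpy(dtype=object)

# Option 2: naive datetime64 array (UTC instants), with the tz kept on the side
as_datetime64 = s.dt.tz_convert(None).to_numpy()
tz = s.dt.tz

# From the object array back to datetime64: parse as UTC, then drop the tz
recovered = pd.to_datetime(as_objects, utc=True).tz_localize(None).to_numpy()

print(as_objects.dtype, as_datetime64.dtype, tz)
print((recovered == as_datetime64).all())
```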

Further, I noticed:

In [18]: df = DataFrame({'A' : date_range('20130101',periods=3), 
                         'B' : date_range('20130101',periods=3,tz='US/Eastern'),
                         'C' : date_range('20130101',periods=3,tz='CET')})

In [19]: df
Out[19]:
           A                          B                          C
0 2013-01-01  2013-01-01 00:00:00-05:00  2013-01-01 00:00:00+01:00
1 2013-01-02  2013-01-02 00:00:00-05:00  2013-01-02 00:00:00+01:00
2 2013-01-03  2013-01-03 00:00:00-05:00  2013-01-03 00:00:00+01:00

In [20]: s = df['B']

In [21]: s.astype('datetime64[ns]')

AttributeError: 'DatetimeIndex' object has no attribute 'to_dense'

jreback · 2015-09-04T12:57:05Z

fixed the bug:

In [2]: df.astype('datetime64[ns]')
Out[2]: 
           A                   B                   C
0 2013-01-01 2013-01-01 05:00:00 2012-12-31 23:00:00
1 2013-01-02 2013-01-02 05:00:00 2013-01-01 23:00:00
2 2013-01-03 2013-01-03 05:00:00 2013-01-02 23:00:00

In [3]: df.astype('datetime64[ns]').dtypes
Out[3]: 
A    datetime64[ns]
B    datetime64[ns]
C    datetime64[ns]
dtype: object

For the first question: the values would be converted to UTC and displayed as if by astype('datetime64[ns]'):

In [18]: df['B'].astype('datetime64[ns]')
Out[18]: 
0   2013-01-01 05:00:00
1   2013-01-02 05:00:00
2   2013-01-03 05:00:00
Name: B, dtype: datetime64[ns]

In [19]: df['B'].astype('datetime64[ns]').values
Out[19]: 
array(['2013-01-01T00:00:00.000000000-0500',
       '2013-01-02T00:00:00.000000000-0500',
       '2013-01-03T00:00:00.000000000-0500'], dtype='datetime64[ns]')

I could be on board with that, except again you lose the fact that this has a meaningful tz, but if you are really accessing .values then you know what you are doing and want a numpy array anyhow
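
The conversion described here (shift to UTC, then drop the tz) can also be spelled out with the `.dt` accessor. This is a sketch under the assumption of a recent pandas release, where a direct `astype` from a tz-aware to a tz-naive dtype may be rejected; the two result names are illustrative only.

```python
import pandas as pd

s = pd.Series(pd.date_range("20130101", periods=3, tz="US/Eastern"), name="B")

# Convert to UTC and drop the timezone in one step
as_utc_naive = s.dt.tz_convert(None)

# Alternatively, keep the local wall-clock times and drop the timezone
as_wall_naive = s.dt.tz_localize(None)

print(as_utc_naive)
print(as_wall_naive)
```

The first form matches the 05:00:00 values shown in the session above; the second keeps midnight local time, so the choice depends on which meaning the caller needs.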

jreback · 2015-09-04T13:37:24Z

ok making this change. it actually is exposing some bugs.....:)

jreback · 2015-09-04T14:06:17Z

ok, updated as I described above.

ssanderson · 2015-09-04T14:27:45Z

👍 from me on this proposal. If I'm accessing .values, it's almost always because I care about performance or because I'm doing something numpy-specific.

TomAugspurger · 2015-09-04T14:40:27Z

Also +1 on .values returning a numpy array of datetime64s (without tz).
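
A sketch of the usage pattern this proposal is aimed at, assuming a recent pandas release in which `Series.values` on a tz-aware column returns a naive datetime64 array in UTC: take the raw array for numpy-level work, keep the tz on the side, and re-attach it afterwards.

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.date_range("20130101", periods=3, tz="US/Eastern"))

# Raw numpy array: naive datetime64 values, expressed in UTC
raw = s.values
tz = s.dt.tz  # keep the timezone separately

# Fast numpy-level work on the raw array, e.g. consecutive differences
steps = np.diff(raw)

# Re-attach the timezone when a tz-aware result is needed again
rebuilt = pd.DatetimeIndex(raw).tz_localize("UTC").tz_convert(tz)

print(raw.dtype, steps)
print(rebuilt)
```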

jreback · 2015-09-04T16:29:44Z

ok, pls have a final look if desired.

jorisvandenbossche · 2015-09-05T10:06:59Z

Final comment: the series .values now gives you the datetime64 values, but when there are multiple columns, these are still the Timestamp objects. This seems a bit inconsistent:

In [10]: df['B'].values
Out[10]:
array(['2013-01-01T06:00:00.000000000+0100',
       '2013-01-02T06:00:00.000000000+0100',
       '2013-01-03T06:00:00.000000000+0100'], dtype='datetime64[ns]')

In [11]: df.values
Out[11]:
array([[Timestamp('2013-01-01 00:00:00'),
        Timestamp('2013-01-01 00:00:00-0500', tz='US/Eastern'),
        Timestamp('2013-01-01 00:00:00+0100', tz='CET')],
       [Timestamp('2013-01-02 00:00:00'),
        Timestamp('2013-01-02 00:00:00-0500', tz='US/Eastern'),
        Timestamp('2013-01-02 00:00:00+0100', tz='CET')],
       [Timestamp('2013-01-03 00:00:00'),
        Timestamp('2013-01-03 00:00:00-0500', tz='US/Eastern'),
        Timestamp('2013-01-03 00:00:00+0100', tz='CET')]], dtype=object)
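
The difference between the two calls can be checked by dtype rather than by reading the repr. A minimal sketch, assuming a recent pandas release; it reuses the three-column frame from earlier in the thread.

```python
import pandas as pd

df = pd.DataFrame({
    "A": pd.date_range("20130101", periods=3),
    "B": pd.date_range("20130101", periods=3, tz="US/Eastern"),
    "C": pd.date_range("20130101", periods=3, tz="CET"),
})

# Single tz-aware column: a datetime64 numpy array
col_values = df["B"].values

# Whole frame with mixed naive / tz-aware columns: interleaved as objects
frame_values = df.values

print(col_values.dtype)
print(frame_values.dtype)
print(type(frame_values[0, 1]))
```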

jreback · 2015-09-05T14:11:39Z

hmm, the interleaving on DataFrame.values could go either way. IOW, if you had an object column mixed in then that would be correct (it ends up as an object array and nothing is cast), but I suppose if it's just mixed datetime-likes then I can coerce.

jorisvandenbossche · 2015-09-05T14:27:18Z

Ah, yes, that is true. In this case, if it's not too difficult, I would say to coerce them all to datetime64, but leaving as is, is also not that strange then

fix scalar comparisons vs None generally
fix NaT formatting in Series
TST: skip postgresql test with tz's
update for msgpack
Conflicts: pandas/core/base.py pandas/core/categorical.py pandas/core/format.py pandas/tests/test_base.py pandas/util/testing.py
full interop for tz-aware Series & timedeltas pandas-dev#10763

jreback · 2015-09-05T15:41:13Z

I have left it as is. This way it is very consistent and does not suddenly change if you add, say, an 'object' field or whatever. We could always adjust this later.

jreback · 2015-09-05T16:17:20Z

ok, bombs away....

the initial impl was only a week or so.....2 months to make it work properly....:>

API: add DatetimeBlockTZ #8260

jreback added the Performance (Memory or execution speed performance), Internals (Related to non-user accessible pandas implementation), and Timezones (Timezone data dtype) labels Jun 30, 2015

jreback added this to the 0.17.0 milestone Jun 30, 2015

jreback force-pushed the tz branch from 81fd4c7 to 81ea84f Compare June 30, 2015 15:46

jreback force-pushed the tz branch 2 times, most recently from d86c20d to 4b9316a Compare July 1, 2015 02:20

jreback force-pushed the tz branch 2 times, most recently from 19ec61d to 2422fe5 Compare July 1, 2015 22:09

jreback force-pushed the tz branch 3 times, most recently from 6f7a514 to eaa28ad Compare September 4, 2015 02:54

jreback force-pushed the tz branch 2 times, most recently from 616cbc5 to d3ae2a3 Compare September 4, 2015 03:22

jreback mentioned this pull request Sep 4, 2015

DataFrame combine_first() loses timezone information for datetime columns #10567

Closed

jreback force-pushed the tz branch 2 times, most recently from 33e747b to b340e76 Compare September 4, 2015 14:05

jreback force-pushed the tz branch from b340e76 to 5db6476 Compare September 4, 2015 15:34

jreback force-pushed the tz branch from 5db6476 to 0e06cca Compare September 4, 2015 20:09

jreback force-pushed the tz branch from 0e06cca to 27b7b1e Compare September 5, 2015 15:39

jreback added a commit that referenced this pull request Sep 5, 2015

Merge pull request #10477 from jreback/tz

666540f

API: add DatetimeBlockTZ #8260

jreback merged commit 666540f into pandas-dev:master Sep 5, 2015

jreback mentioned this pull request Oct 23, 2015

Performance drop when using timezone-aware DateTimeIndex #10192

Closed

jreback mentioned this pull request Aug 19, 2016

Index.unique() should always return an Index object of the same type #13395

Closed

jreback mentioned this pull request Mar 20, 2017

API: Series.values with tz-aware should return object array of Timestamps #15750

Closed
