API: add DatetimeBlockTZ #8260 #10477


Merged
merged 1 commit into from
Sep 5, 2015

Conversation

jreback
Contributor

@jreback jreback commented Jun 30, 2015

closes #8260
closes #10763

ToDos:

  • doc updates
  • test with Series.dt.*
  • test with csv/HDF5
  • NaT setting is broken at the moment
  • HDF5 example from 0.16.2

- [ ] get_values/values - make consistent
- [ ] maybe move DatetimeTZBlock.shift mostly to DatetimeIndex.shift

Also

  • This cleans up the internal blocks calling conventions a bit
  • Fixes a bug in DatetimeIndex and localizing when NaT's are present
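The NaT localization fix mentioned in the last bullet can be checked with a small sketch. This is an illustration only, assuming a reasonably recent pandas; the variable names are made up for the example and are not taken from the PR's test suite.

```python
import pandas as pd

# A naive index that contains a missing value (NaT).
naive = pd.DatetimeIndex(["2013-01-01", pd.NaT, "2013-01-03"])

# Localizing should attach the zone to the valid entries and leave NaT as NaT.
localized = naive.tz_localize("US/Eastern")

print(localized)
print(localized.isna().tolist())
```

The missing entry is expected to stay missing, while the valid entries become tz-aware timestamps.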
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
timestamp_tz_ops_diff1                       |   0.1804 | 159.4527 |   0.0011 | # note these are 10k element
timestamp_tz_ops_diff2                       |   2.5350 | 156.4047 |   0.0162 | # note these are 10k elements
timeseries_timestamp_downsample_mean         |   3.1467 |   3.3040 |   0.9524 |
timestamp_series_compare                     |   9.0797 |   9.1290 |   0.9946 |
timestamp_ops_diff2                          |  19.7570 |  19.6819 |   1.0038 | # this is 1M elements
series_timestamp_compare                     |   9.3430 |   9.0226 |   1.0355 |
timestamp_ops_diff1                          |   9.7457 |   9.0450 |   1.0775 | # this is 1M elements
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234

Target [8502474] : API: add Block.make_block
API: add DatetimeBlockWithTZ #8260
Base   [16a44ad] : Merge pull request #10199 from jreback/gil
PERF: releasing the GIL, #8882
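The benchmark bodies are not shown in this thread. As a rough sketch, assuming `diff1` corresponds to `Series.diff()` and `diff2` to `s - s.shift()` (both are assumptions, not confirmed here), a similar measurement on a 10k-element tz-aware Series could look like this:

```python
import time

import pandas as pd

N = 10_000  # the table notes that the tz benchmarks use 10k elements

# One-second spacing; passing a Timedelta avoids frequency-alias differences
# between pandas versions.
s = pd.Series(
    pd.date_range("2013-01-01", periods=N, freq=pd.Timedelta(seconds=1), tz="US/Eastern")
)

start = time.perf_counter()
d1 = s.diff()  # assumed shape of the "diff1" benchmark
t1 = time.perf_counter() - start

start = time.perf_counter()
d2 = s - s.shift()  # assumed shape of the "diff2" benchmark
t2 = time.perf_counter() - start

print(f"diff: {t1 * 1e3:.3f} ms, subtract-shift: {t2 * 1e3:.3f} ms")
```

Absolute timings will differ by machine and pandas version; only the two results being element-wise identical is something to rely on.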

Demo

In [1]:   df = DataFrame({'A' : date_range('20130101',periods=3),
   ...:                    'B' : date_range('20130101',periods=3,tz='US/Eastern'),
   ...:                    'C' : date_range('20130101',periods=3,tz='CET')})

In [2]:    df
Out[2]: 
           A                   B                   C
0 2013-01-01 2013-01-01 05:00:00 2012-12-31 23:00:00
1 2013-01-02 2013-01-02 05:00:00 2013-01-01 23:00:00
2 2013-01-03 2013-01-03 05:00:00 2013-01-02 23:00:00

In [3]:    df.dtypes
Out[3]: 
A                datetime64[ns]
B    datetime64[ns, US/Eastern]
C           datetime64[ns, CET]
dtype: object

In [4]: df.B
Out[4]: 
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
Name: B, dtype: datetime64[ns, US/Eastern]

In [5]: df.B.dt.tz_localize(None)
Out[5]: 
0   2013-01-01
1   2013-01-02
2   2013-01-03
dtype: datetime64[ns]
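For readers who want to reproduce the demo as a script, a minimal sketch follows. It assumes a recent pandas where `pd.DatetimeTZDtype` is available at the top level; exact reprs may differ from the transcript above.

```python
import pandas as pd

df = pd.DataFrame({
    "A": pd.date_range("20130101", periods=3),
    "B": pd.date_range("20130101", periods=3, tz="US/Eastern"),
    "C": pd.date_range("20130101", periods=3, tz="CET"),
})

# Each tz-aware column carries its zone in the dtype; the naive column does not.
for name, dtype in df.dtypes.items():
    tz = getattr(dtype, "tz", None)
    print(name, dtype, "tz-aware" if tz is not None else "naive")

# Dropping the zone keeps the local wall-clock times.
naive_b = df["B"].dt.tz_localize(None)
print(naive_b)
```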

@jreback jreback added the Performance, Internals, and Timezones labels Jun 30, 2015
@jreback jreback added this to the 0.17.0 milestone Jun 30, 2015
@jreback
Contributor Author

jreback commented Jun 30, 2015

cc @ssanderson

@ssanderson
Contributor

Ooh, shiny! I'll try this out tonight.

@jreback jreback force-pushed the tz branch 2 times, most recently from d86c20d to 4b9316a Compare July 1, 2015 02:20
@jorisvandenbossche
Member

@jreback Nice! I will try to go through it one of the coming days

@jreback jreback force-pushed the tz branch 2 times, most recently from 19ec61d to 2422fe5 Compare July 1, 2015 22:09
@ssanderson
Contributor

@jreback one quirk I'm noticing is that if you construct a DataFrame directly from a DatetimeIndex, you lose the tz information:

In [1]: dates = date_range('2014-01-01', periods=10, tz='UTC')

In [2]: from_dict = DataFrame({'a': dates})

In [3]: from_dict.dtypes
Out[3]:
a    datetime64[ns, UTC]
dtype: object

In [4]: from_index = DataFrame(dates)

In [5]: from_index.dtypes
Out[5]:
0    datetime64[ns]
dtype: object

Is this expected behavior?
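A compact way to compare the two construction paths is sketched below. It assumes a pandas version in which both paths keep the timezone (the behaviour this thread converges on); on the branch state reported above, `index_tz` would have come back as `None`.

```python
import pandas as pd

dates = pd.date_range("2014-01-01", periods=10, tz="UTC")

from_dict = pd.DataFrame({"a": dates})  # column built from a dict value
from_index = pd.DataFrame(dates)        # column built directly from the index

# A tz-aware dtype exposes .tz; a naive datetime64 dtype does not.
dict_tz = getattr(from_dict.dtypes.iloc[0], "tz", None)
index_tz = getattr(from_index.dtypes.iloc[0], "tz", None)

print("from dict :", dict_tz)
print("from index:", index_tz)
```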

@ssanderson
Contributor

Also surprising: if I take a series of dtype datetime64[ns, UTC] and call .values on it, I get a DatetimeIndex rather than an ndarray:

In [7]: from_dict['a'].values
Out[7]:
DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06', '2014-01-07', '2014-01-08',
               '2014-01-09', '2014-01-10'],
              dtype='datetime64[ns]', freq='D', tz='UTC')

The result I'd expect here is what you actually get from doing

In [21]: from_dict['a'].values.values
Out[21]:
array(['2013-12-31T19:00:00.000000000-0500',
       '2014-01-01T19:00:00.000000000-0500',
       '2014-01-02T19:00:00.000000000-0500',
       '2014-01-03T19:00:00.000000000-0500',
       '2014-01-04T19:00:00.000000000-0500',
       '2014-01-05T19:00:00.000000000-0500',
       '2014-01-06T19:00:00.000000000-0500',
       '2014-01-07T19:00:00.000000000-0500',
       '2014-01-08T19:00:00.000000000-0500',
       '2014-01-09T19:00:00.000000000-0500'], dtype='datetime64[ns]')

@jreback
Contributor Author

jreback commented Jul 1, 2015

do you have the current one?

this all works (it might not have in a prior one)

@jreback
Contributor Author

jreback commented Jul 1, 2015

This is the latest jreback@19ec61d I think

In [3]: from_dict
Out[3]: 
           a
0 2014-01-01
1 2014-01-02
2 2014-01-03
3 2014-01-04
4 2014-01-05
5 2014-01-06
6 2014-01-07
7 2014-01-08
8 2014-01-09
9 2014-01-10

In [4]: from_dict.dtypes
Out[4]: 
a    datetime64[ns, UTC]
dtype: object

Series.values IS a DatetimeIndex; that is how it's implemented. It's very similar to how Sparse/Categorical are done. This preserves the tz info inside the object (it actually has freq too), rather than relying upon an ndarray impl and passing it around. Much cleaner this way.

@ssanderson
Contributor

I'm testing with a development install of this branch:

(pandas)[~/clones/pandas]@(tz:2422fe50fc)$ git log HEAD^..HEAD
commit 2422fe50fcc5aac541a0be8d67a4e86309b3e2ec
Author: Jeff Reback <jeff@reback.net>
Date:   Sat Jun 27 17:55:29 2015 -0400

    API: add DatetimeBlockTZ #8260

@jreback
Contributor Author

jreback commented Jul 1, 2015

@ssanderson pretty old branch. pull my latest.

@jreback
Contributor Author

jreback commented Jul 1, 2015

commit 757bbf92c926d4584d01bce419a576c7cb831fce
Author: Jeff Reback <jeff@reback.net>
Date:   Wed Jul 1 19:38:16 2015 -0400

    start on csv

commit 2422fe50fcc5aac541a0be8d67a4e86309b3e2ec
Author: Jeff Reback <jeff@reback.net>
Date:   Sat Jun 27 17:55:29 2015 -0400

    API: add DatetimeBlockTZ #8260

everything works except for csv round-tripping (not sure I can get it, as I can't repro the tz upon readback, but we'll see) and to/from hdf5 (but soon).

@jreback
Contributor Author

jreback commented Jul 1, 2015

@ssanderson I have been amending that commit, so for sure update.

@ssanderson
Contributor

I pulled 2422fe50fcc5aac541a0be8d67a4e86309b3e2ec ~20 minutes ago. The commit hash would change if the content had changed.

@jreback
Contributor Author

jreback commented Jul 1, 2015

did you make? (it has a tad bit of cython code changes)

@ssanderson
Contributor

I did a pip install -e . in a fresh virtualenv. What in the above that I posted is different from what you're seeing?

@jreback
Contributor Author

jreback commented Jul 1, 2015

sorry, I didn't see what you meant above. DataFrame(new_index): hang on a sec. Didn't have a test for that.

@ssanderson
Contributor

The from_dict case is working as expected modulo the unexpected type of .values, which sounds like it's actually as-designed. The one that I think is incorrect is direct construction from a DatetimeIndex.

@jreback
Contributor Author

jreback commented Jul 1, 2015

very subtle path difference here....fixing

In [6]: DataFrame(date_range('20130101',periods=3,tz='US/Eastern')).dtypes
Out[6]: 
0    datetime64[ns]
dtype: object

In [7]: DataFrame(date_range('20130101',periods=3,tz='US/Eastern',name='foo')).dtypes
Out[7]: 
foo    datetime64[ns, US/Eastern]
dtype: object

@jreback
Contributor Author

jreback commented Jul 1, 2015

@ssanderson fixed

@jreback
Contributor Author

jreback commented Jul 1, 2015

if you have a chance I'd like to see how you are actually using it; if you could post some small sample code, that would be great. I can time things, but having a use case is even better.

@ssanderson
Contributor

@jreback I think DataFrame indexing is broken with columns of tz-aware dtype:

In [2]: df = DataFrame({'a': date_range('2014-01-01', periods=10, tz='UTC')})

In [3]: df
Out[3]:
           a
0 2014-01-01
1 2014-01-02
2 2014-01-03
3 2014-01-04
4 2014-01-05
5 2014-01-06
6 2014-01-07
7 2014-01-08
8 2014-01-09
9 2014-01-10

In [4]: df.iloc[5]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-4-1d4dcfe8a425> in <module>()
----> 1 df.iloc[5]

/home/ssanderson/clones/pandas/pandas/core/indexing.pyc in __getitem__(self, key)
   1187             return self._getitem_tuple(key)
   1188         else:
-> 1189             return self._getitem_axis(key, axis=0)
   1190
   1191     def _getitem_axis(self, key, axis=0):

/home/ssanderson/clones/pandas/pandas/core/indexing.pyc in _getitem_axis(self, key, axis)
   1478                 self._is_valid_integer(key, axis)
   1479
-> 1480             return self._get_loc(key, axis=axis)
   1481
   1482     def _convert_to_indexer(self, obj, axis=0, is_setter=False):

/home/ssanderson/clones/pandas/pandas/core/indexing.pyc in _get_loc(self, key, axis)
     87
     88     def _get_loc(self, key, axis=0):
---> 89         return self.obj._ixs(key, axis=axis)
     90
     91     def _slice(self, obj, axis=0, kind=None):

/home/ssanderson/clones/pandas/pandas/core/frame.pyc in _ixs(self, i, axis)
   1727                     copy=True
   1728                 else:
-> 1729                     new_values = self._data.fast_xs(i)
   1730
   1731                     # if we are a copy, mark as such

/home/ssanderson/clones/pandas/pandas/core/internals.pyc in fast_xs(self, loc)
   2899         """
   2900         if len(self.blocks) == 1:
-> 2901             return self.blocks[0].values[:, loc]
   2902
   2903         items = self.items

/home/ssanderson/clones/pandas/pandas/tseries/base.pyc in __getitem__(self, key)
     93             attribs['freq'] = freq
     94
---> 95             result = getitem(key)
     96             if result.ndim > 1:
     97                 return result

IndexError: too many indices for array
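The failing case above reduces to positional indexing on a frame whose only column is tz-aware. A minimal regression-style sketch, assuming a pandas version where this indexing path works, could be:

```python
import pandas as pd

df = pd.DataFrame({"a": pd.date_range("2014-01-01", periods=10, tz="UTC")})

# Row access on a single tz-aware column: the path that raised IndexError above.
row = df.iloc[5]

# Scalar access through the column, for comparison.
value = df["a"].iloc[5]

print(row)
print(value)
```

Both access paths should agree on the same tz-aware timestamp for the sixth row.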

@jreback
Contributor Author

jreback commented Jul 2, 2015

hmm, let me look.

@jreback jreback force-pushed the tz branch 3 times, most recently from 6f7a514 to eaa28ad Compare September 4, 2015 02:54
@jreback
Contributor Author

jreback commented Sep 4, 2015

ok, I changed this. So now .values -> 'external values', and ._values -> 'internal values'. These are currently the same for everything except DatetimeTZ. This allows an internal implementation, and we can have an external .values that is different.

so:

In [1]: s = Series(date_range('20130101',periods=3,tz='US/Eastern'))

In [2]: s
Out[2]: 
0    2013-01-01 00:00:00-05:00
1    2013-01-02 00:00:00-05:00
2    2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]

In [3]: s.values
Out[3]: 
array([Timestamp('2013-01-01 00:00:00-0500', tz='US/Eastern'),
       Timestamp('2013-01-02 00:00:00-0500', tz='US/Eastern'),
       Timestamp('2013-01-03 00:00:00-0500', tz='US/Eastern')], dtype=object)

In [4]: s._values
Out[4]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00', '2013-01-03 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq='D')
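The external/internal split can also be inspected through public accessors. The sketch below is an assumption-laden illustration: it relies on `Series.array` and `Series.to_numpy()`, which exist in later pandas releases rather than in this PR, and the comments describe the behaviour of those later releases, not of this commit.

```python
import pandas as pd

s = pd.Series(pd.date_range("20130101", periods=3, tz="US/Eastern"))

# External numpy view: plain datetime64 values (UTC instants, no tz attached).
external = s.values

# Object view: one tz-aware Timestamp per element.
as_objects = s.to_numpy()

# Tz-preserving container backing the Series (public counterpart of ._values).
internal = s.array

print(external.dtype)
print(as_objects.dtype)
print(internal.dtype)
```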

@jorisvandenbossche
Member

@jreback Thanks for this change!

Did some quick testing, and some more feedback:

What do you think the dtype of .values should be?

  • object with Timestamp objects -> this is what it currently does in the PR, and is also backwards compatible (as this is how tz aware datetime data are stored now in a frame, as objects)
  • datetime64 -> I think this is the more useful return value, if the reason that you access the raw .values numpy arrays is to do some performant operation on it. The tz can always be accessed separately (s.dt.tz) if you want to keep it alongside the numpy array. There is also no easy way to get the datetime64 values from an object array of Timestamps, I think?
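The second option implies a round trip: keep the tz separately, work on the raw datetime64 array, then reattach the zone. A hedged sketch of that pattern, plus one possible way to recover a datetime index from an object array of Timestamps (using `pd.to_datetime(..., utc=True)`, which is a suggestion of this example and not something proposed in the thread):

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.date_range("20130101", periods=3, tz="US/Eastern"))

# Keep the zone on the side and work with the raw UTC instants.
tz = s.dt.tz
raw = s.values  # assumed to be a naive datetime64 array in UTC

# Reattach the zone: interpret the instants as UTC, then convert.
restored = pd.Series(pd.DatetimeIndex(raw).tz_localize("UTC").tz_convert(tz))

# Object array of tz-aware Timestamps back to a tz-aware index.
objs = np.array(list(s), dtype=object)
from_objs = pd.to_datetime(objs, utc=True).tz_convert(tz)

print(restored)
print(from_objs)
```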

Further, I noticed:

In [18]: df = DataFrame({'A' : date_range('20130101',periods=3), 
                         'B' : date_range('20130101',periods=3,tz='US/Eastern'),
                         'C' : date_range('20130101',periods=3,tz='CET')})

In [19]: df
Out[19]:
           A                          B                          C
0 2013-01-01  2013-01-01 00:00:00-05:00  2013-01-01 00:00:00+01:00
1 2013-01-02  2013-01-02 00:00:00-05:00  2013-01-02 00:00:00+01:00
2 2013-01-03  2013-01-03 00:00:00-05:00  2013-01-03 00:00:00+01:00

In [20]: s = df['B']

In [21]: s.astype('datetime64[ns]')

AttributeError: 'DatetimeIndex' object has no attribute 'to_dense'

@jreback
Contributor Author

jreback commented Sep 4, 2015

fixed the bug:

In [2]: df.astype('datetime64[ns]')
Out[2]: 
           A                   B                   C
0 2013-01-01 2013-01-01 05:00:00 2012-12-31 23:00:00
1 2013-01-02 2013-01-02 05:00:00 2013-01-01 23:00:00
2 2013-01-03 2013-01-03 05:00:00 2013-01-02 23:00:00

In [3]: df.astype('datetime64[ns]').dtypes
Out[3]: 
A    datetime64[ns]
B    datetime64[ns]
C    datetime64[ns]
dtype: object

For the first, it is converted to UTC and displayed as if it were astype('datetime64[ns]'):

In [18]: df['B'].astype('datetime64[ns]')
Out[18]: 
0   2013-01-01 05:00:00
1   2013-01-02 05:00:00
2   2013-01-03 05:00:00
Name: B, dtype: datetime64[ns]

In [19]: df['B'].astype('datetime64[ns]').values
Out[19]: 
array(['2013-01-01T00:00:00.000000000-0500',
       '2013-01-02T00:00:00.000000000-0500',
       '2013-01-03T00:00:00.000000000-0500'], dtype='datetime64[ns]')

I could be on board with that, except again you lose the fact that this has a meaningful tz, but if you are really accessing .values then you know what you are doing and want a numpy array anyhow
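The UTC-converted naive result shown above can also be spelled out explicitly with the `.dt` accessor. This sketch avoids depending on how `astype` treats tz-aware input, which has varied across pandas versions; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(pd.date_range("20130101", periods=3, tz="US/Eastern"), name="B")

# Naive timestamps expressed in UTC (matches the 05:00:00 values shown above).
utc_naive = s.dt.tz_convert("UTC").dt.tz_localize(None)

# Naive timestamps that keep the local wall-clock time instead.
wall_naive = s.dt.tz_localize(None)

print(utc_naive)
print(wall_naive)
```

Choosing between the two is a question of whether the instant or the local clock reading should survive once the zone is dropped.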

@jreback
Contributor Author

jreback commented Sep 4, 2015

ok making this change. it actually is exposing some bugs.....:)

@jreback jreback force-pushed the tz branch 2 times, most recently from 33e747b to b340e76 Compare September 4, 2015 14:05
@jreback
Contributor Author

jreback commented Sep 4, 2015

ok, updated as I described above.

@ssanderson
Contributor

👍 from me on this proposal. If I'm accessing .values, it's almost always because I care about performance or because I'm doing something numpy-specific.

@TomAugspurger
Contributor

Also +1 on .values returning a numpy array of datetime64s (without tz).

@jreback
Contributor Author

jreback commented Sep 4, 2015

ok, pls have a final look if desired.

@jorisvandenbossche
Member

Final comment: the Series .values now gives you the datetime64 values, but with multiple columns these are still Timestamp objects. This seems a bit inconsistent:

In [10]: df['B'].values
Out[10]:
array(['2013-01-01T06:00:00.000000000+0100',
       '2013-01-02T06:00:00.000000000+0100',
       '2013-01-03T06:00:00.000000000+0100'], dtype='datetime64[ns]')

In [11]: df.values
Out[11]:
array([[Timestamp('2013-01-01 00:00:00'),
        Timestamp('2013-01-01 00:00:00-0500', tz='US/Eastern'),
        Timestamp('2013-01-01 00:00:00+0100', tz='CET')],
       [Timestamp('2013-01-02 00:00:00'),
        Timestamp('2013-01-02 00:00:00-0500', tz='US/Eastern'),
        Timestamp('2013-01-02 00:00:00+0100', tz='CET')],
       [Timestamp('2013-01-03 00:00:00'),
        Timestamp('2013-01-03 00:00:00-0500', tz='US/Eastern'),
        Timestamp('2013-01-03 00:00:00+0100', tz='CET')]], dtype=object)

@jreback
Contributor Author

jreback commented Sep 5, 2015

hmm, the interleaving on a DataFrame.values could go either way. IOW, if you had an object column mixed in then that would be correct (it ends up as an object array and nothing is cast), but I suppose if it's just mixed datetime-like then I can coerce.
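The interleaving behaviour under discussion can be observed directly. A small sketch, assuming the behaviour described in this thread (a single tz-aware column yields datetime64 values, while a frame mixing naive and differently-zoned columns falls back to an object array):

```python
import pandas as pd

df = pd.DataFrame({
    "A": pd.date_range("20130101", periods=3),
    "B": pd.date_range("20130101", periods=3, tz="US/Eastern"),
    "C": pd.date_range("20130101", periods=3, tz="CET"),
})

# One tz-aware column: a numpy datetime64 array.
col_values = df["B"].values

# Mixed naive / tz-aware columns: interleaved into an object array of Timestamps.
frame_values = df.values

print(col_values.dtype)
print(frame_values.dtype)
print(type(frame_values[0, 1]), frame_values[0, 1].tzinfo)
```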

@jorisvandenbossche
Member

Ah, yes, that is true. In this case, if it's not too difficult, I would say to coerce them all to datetime64, but leaving it as is would also not be that strange then

fix scalar comparisons vs None generally

fix NaT formatting in Series

TST: skip postgresql test with tz's

update for msgpack

Conflicts:
	pandas/core/base.py
	pandas/core/categorical.py
	pandas/core/format.py
	pandas/tests/test_base.py
	pandas/util/testing.py

full interop for tz-aware Series & timedeltas pandas-dev#10763
@jreback
Contributor Author

jreback commented Sep 5, 2015

I have left it as is. This way it is very consistent and does not suddenly change if you add, say, an 'object' field or whatever. We could always adjust this later.

@jreback
Contributor Author

jreback commented Sep 5, 2015

ok, bombs away....

the initial impl was only a week or so.....2 months to make it work properly....:>

8 participants