API: Index.take inconsistently handle fill_value #12676

sinhrks · 2016-03-20T15:31:10Z

closes API: Index.take inconsistently handle fill_value #12631
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

jreback · 2016-03-20T15:33:11Z

@sinhrks as an aside (maybe a new issue). I think we should remove allow_fill parameter it duplicates a fill_value of not None. But would have to check more carefully.

jreback · 2016-03-20T15:38:45Z

actually, forget what I said above. The allow_fill=True is meant as an automatic way of setting the fill value based on the data type (IOW, it automatically looks up the _na_value).

sinhrks · 2016-03-20T15:39:04Z

doc/source/whatsnew/v0.18.1.txt

@@ -29,12 +29,16 @@ New features
 Enhancements
 ~~~~~~~~~~~~

+- ``Index.take`` now handles ``allow_fill`` and ``fill_value`` consistently (:issue:`12631`)


Current (updated in this PR) docstring says as below, is it not enough?

If allow_fill=True and fill_value is not None, indices specified by -1 is regarded as NA. If Index doesn't hold NA, raise ValueError

jreback · 2016-03-20T15:45:14Z

pandas/indexes/base.py

+
+        if self._can_hold_na:
+            # only fill if we are passing a non-None fill_value
+            if allow_fill and fill_value is not None:


Actually I think we need to either make both this, and tseries/base take:

either: raise on any negative indexers except -1
or: make a new method to do this, maybe .take_with_fill and restore .take for compat with 'regular' take. That will raise on indices that are not >= -1. These are serving double duty here.

jreback · 2016-03-20T15:46:28Z

moving this comment to main:

Actually I think we need to either make both Index.take, and tseries/base take:

either: raise on any negative indexers except -1
or: make a new method to do this, maybe .take_with_fill and restore .take for compat with 'regular' take. That will raise on indices that are not >= -1. These are serving double duty here.

sinhrks · 2016-03-21T07:16:49Z

To be compat with numpy, how about:

`allow_fill=False` or `fill_value=None`

Allow negative indexer, raise IndexError if indexer exceeds the range.

np.take(np.array([1, 2, 3]), np.array([-2, -1]))
# array([2, 3])

`allow_fill=True` and `fill_value is not None`

Regard -1 as NA, raise ValueError if any other negative values.

jreback · 2016-03-21T12:09:51Z

ok if we change the default to allow_fill=False, there are a LOT of .take (and .take_nd) calls in the code-base, so prob need to change a bit for this to work. I agree that is going to be much better to be explicit though.

So allow_fill=False and fill_value is None -> regular taking, esentially self.values.take(...)
with alow_fill=True or fill_value not None -> -1 == NA's and outside of bounds is a ValueError.

sinhrks · 2016-03-22T13:01:32Z

@jreback Based on the current impl, I don't think we should change the default to allow_fill=False. The current DTI/TDI logic is:

allow_fill=False OR fill_value is None -> regular taking, esentially self.values.take(...)
alow_fill=True AND fill_value is not None -> -1 is filled by pd.NaT

Because current default is allow_fill=True and fill_value=None (first path), adding the above logic to all Index will not change the existing behavior.

jreback · 2016-03-22T13:12:47Z

@sinhrks hmm ok, though I think we should have validation though (e.g. if you try something like

In [4]: dt = pd.DatetimeIndex([pd.Timestamp('20130101'),pd.NaT,pd.Timestamp('20130103')])

In [5]: dt
Out[5]: DatetimeIndex(['2013-01-01', 'NaT', '2013-01-03'], dtype='datetime64[ns]', freq=None)

In [6]: dt.take([-2,-1,0])
Out[6]: DatetimeIndex(['NaT', '2013-01-03', '2013-01-01'], dtype='datetime64[ns]', freq=None)

In [7]: dt.take([-2,-1,0],allow_fill=True,fill_value=pd.NaT)
Out[7]: DatetimeIndex(['NaT', 'NaT', '2013-01-01'], dtype='datetime64[ns]', freq=None)

In [8]: dt = pd.date_range('20130101',periods=3)

In [9]: dt
Out[9]: DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03'], dtype='datetime64[ns]', freq='D')

In [10]: dt.take([-2,-1,0],allow_fill=True,fill_value=pd.NaT)
Out[10]: DatetimeIndex(['2013-01-02', 'NaT', '2013-01-01'], dtype='datetime64[ns]', freq=None)

[10] is wrong, this is treating -2 as a positional taker, and -1 as a NA indicator.

so ok with leaving the API, just want to put in some validation tests so that if we are NA filling, then don't allow negative values (except -1)

sinhrks · 2016-03-30T06:36:29Z

Yes, added a logic to raise ValueError if NA filling and negative indexer less than -1 are specified.

Indexer contains outside of bounds raises IndexError (no change, added tests).

jreback · 2016-03-30T12:43:48Z

pandas/indexes/base.py

+                if mask.any():
+                    taken[mask] = self._na_value
+        else:
+            if allow_fill and fill_value is not None:


can you refactor this function, something like (as an internal function), as repeating this several times.

def _assert_take_fillable(self, indices, allow_fill, fill_value): if allow_fill and fill_value: if (indices < -1).any(): raise .....

Sure, how about this?

jreback · 2016-03-30T23:32:57Z

In [11]: idx
Out[11]: Float64Index([1.0, 2.0, 3.0, 4.0], dtype='float64')

# I think this shoudl give the same as [13]
In [12]: idx.take([2, -5], fill_value=True)
IndexError: index -5 is out of bounds for size 4

In [13]: idx.take([2, -4], fill_value=True)
ValueError: When allow_fill=True and fill_value is not None, all indices must be >= -1

sinhrks · 2016-03-31T00:24:24Z

@jreback You mean always raising ValueError? IndexError is to be compat with numpy.

np.take(np.array([1, 2]), 3)
# IndexError: index 3 is out of bounds for size 2

jreback · 2016-03-31T00:29:43Z

no when in filling mode it should raise ValueError with negative indices ; in 12, the -5 should be caught before the take is done (as its negative and not -1)

sinhrks · 2016-03-31T00:43:20Z

Thanks, understood. Fixed and added tests.

jreback · 2016-03-31T13:26:20Z

pandas/indexes/category.py

-        taken = self.codes.take(indexer)
+    @Appender(_index_shared_docs['take'])
+    def take(self, indices, axis=0, allow_fill=True, fill_value=None):
+        indices = com._ensure_platform_int(indices)


i don't think you need this ensure here (as its inside assert_take_fillabel)?

If we move it to assert_take_fillable, it will be executed duplicatelly when take has additional logic`` like below. No need to care?

https://github.com/pydata/pandas/pull/12676/files#diff-b56ec8d28d045a5a83219fe4d5e4dabfR266

yeah I just realized that!

ok the platform_int thing is a whole other issue! (we shouldn't even use them at all), or convert JUST where needed (only for things like take).

jreback · 2016-03-31T13:38:42Z

@sinhrks thanks! you are fixing some of these internal warts! keep em comin'

related to #4400 Added more tests for sparse indexing. `SparseArray.take`` has optimized logic to omit dense ``np.ndarray`` creation. SparseSeires.iloc` can work with negative indices. Made ``SparseArray.take`` to handle negative indices as the same rule as ``Index`` (#12676) Author: sinhrks <sinhrks@gmail.com> Closes #12796 from sinhrks/sparse_test_at and squashes the following commits: df1f056 [sinhrks] ENH/PERF SparseArray.take indexing

sinhrks added Indexing Related to indexing on series/frames, not to indexes themselves API Design labels Mar 20, 2016

sinhrks added this to the 0.18.1 milestone Mar 20, 2016

sinhrks reviewed Mar 20, 2016
View reviewed changes

sinhrks force-pushed the index_take branch 2 times, most recently from c7af0dc to 97dd513 Compare March 20, 2016 15:42

jreback reviewed Mar 20, 2016
View reviewed changes

sinhrks force-pushed the index_take branch 4 times, most recently from 0644232 to 17cd4ee Compare March 30, 2016 02:24

jreback reviewed Mar 30, 2016
View reviewed changes

sinhrks force-pushed the index_take branch from 17cd4ee to 50af1f7 Compare March 30, 2016 12:59

API: Index.take inconsistently handle fill_value

3c19920

sinhrks force-pushed the index_take branch from 50af1f7 to 3c19920 Compare March 31, 2016 00:42

jreback reviewed Mar 31, 2016
View reviewed changes

jreback closed this in 9b90016 Mar 31, 2016

sinhrks deleted the index_take branch March 31, 2016 13:40

sinhrks mentioned this pull request Apr 4, 2016

ENH/PERF SparseArray.take indexing #12796

Closed

4 tasks

Uh oh!

API: Index.take inconsistently handle fill_value #12676

API: Index.take inconsistently handle fill_value #12676

Uh oh!

Conversation

sinhrks commented Mar 20, 2016

Uh oh!

jreback commented Mar 20, 2016

Uh oh!

jreback commented Mar 20, 2016

Uh oh!

sinhrks Mar 20, 2016

Choose a reason for hiding this comment

Uh oh!

jreback Mar 20, 2016

Choose a reason for hiding this comment

Uh oh!

jreback commented Mar 20, 2016

Uh oh!

sinhrks commented Mar 21, 2016

allow_fill=False or fill_value=None

allow_fill=True and fill_value is not None

Uh oh!

jreback commented Mar 21, 2016

Uh oh!

sinhrks commented Mar 22, 2016

Uh oh!

jreback commented Mar 22, 2016

Uh oh!

sinhrks commented Mar 30, 2016

Uh oh!

jreback Mar 30, 2016

Choose a reason for hiding this comment

Uh oh!

sinhrks Mar 30, 2016

Choose a reason for hiding this comment

Uh oh!

jreback commented Mar 30, 2016

Uh oh!

sinhrks commented Mar 31, 2016

Uh oh!

jreback commented Mar 31, 2016

Uh oh!

sinhrks commented Mar 31, 2016

Uh oh!

jreback Mar 31, 2016

Choose a reason for hiding this comment

Uh oh!

sinhrks Mar 31, 2016

Choose a reason for hiding this comment

Uh oh!

jreback Mar 31, 2016

Choose a reason for hiding this comment

Uh oh!

jreback commented Mar 31, 2016

Uh oh!

Uh oh!

`allow_fill=False` or `fill_value=None`

`allow_fill=True` and `fill_value is not None`