Skip to content

API: Index.take inconsistently handle fill_value #12676

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

sinhrks
Copy link
Member

@sinhrks sinhrks commented Mar 20, 2016

@sinhrks sinhrks added Indexing Related to indexing on series/frames, not to indexes themselves API Design labels Mar 20, 2016
@sinhrks sinhrks added this to the 0.18.1 milestone Mar 20, 2016
@jreback
Copy link
Contributor

jreback commented Mar 20, 2016

@sinhrks as an aside (maybe a new issue). I think we should remove allow_fill parameter it duplicates a fill_value of not None. But would have to check more carefully.

@jreback
Copy link
Contributor

jreback commented Mar 20, 2016

actually, forget what I said above. The allow_fill=True is meant as an automatic way of setting the fill value based on the data type (IOW, it automatically looks up the _na_value).

@@ -29,12 +29,16 @@ New features
Enhancements
~~~~~~~~~~~~

- ``Index.take`` now handles ``allow_fill`` and ``fill_value`` consistently (:issue:`12631`)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current (updated in this PR) docstring says as below, is it not enough?

If allow_fill=True and fill_value is not None, indices specified by -1 is regarded as NA. If Index doesn't hold NA, raise ValueError

@sinhrks sinhrks force-pushed the index_take branch 2 times, most recently from c7af0dc to 97dd513 Compare March 20, 2016 15:42

if self._can_hold_na:
# only fill if we are passing a non-None fill_value
if allow_fill and fill_value is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think we need to either make both this, and tseries/base take:

either: raise on any negative indexers except -1
or: make a new method to do this, maybe .take_with_fill and restore .take for compat with 'regular' take. That will raise on indices that are not >= -1. These are serving double duty here.

@jreback
Copy link
Contributor

jreback commented Mar 20, 2016

moving this comment to main:

Actually I think we need to either make both Index.take, and tseries/base take:

either: raise on any negative indexers except -1
or: make a new method to do this, maybe .take_with_fill and restore .take for compat with 'regular' take. That will raise on indices that are not >= -1. These are serving double duty here.

@sinhrks
Copy link
Member Author

sinhrks commented Mar 21, 2016

To be compat with numpy, how about:

allow_fill=False or fill_value=None

Allow negative indexer, raise IndexError if indexer exceeds the range.

np.take(np.array([1, 2, 3]), np.array([-2, -1]))
# array([2, 3])
allow_fill=True and fill_value is not None

Regard -1 as NA, raise ValueError if any other negative values.

@jreback
Copy link
Contributor

jreback commented Mar 21, 2016

ok if we change the default to allow_fill=False, there are a LOT of .take (and .take_nd) calls in the code-base, so prob need to change a bit for this to work. I agree that is going to be much better to be explicit though.

So allow_fill=False and fill_value is None -> regular taking, esentially self.values.take(...)
with alow_fill=True or fill_value not None -> -1 == NA's and outside of bounds is a ValueError.

@sinhrks
Copy link
Member Author

sinhrks commented Mar 22, 2016

@jreback Based on the current impl, I don't think we should change the default to allow_fill=False. The current DTI/TDI logic is:

  1. allow_fill=False OR fill_value is None -> regular taking, esentially self.values.take(...)
  2. alow_fill=True AND fill_value is not None -> -1 is filled by pd.NaT

Because current default is allow_fill=True and fill_value=None (first path), adding the above logic to all Index will not change the existing behavior.

@jreback
Copy link
Contributor

jreback commented Mar 22, 2016

@sinhrks hmm ok, though I think we should have validation though (e.g. if you try something like

In [4]: dt = pd.DatetimeIndex([pd.Timestamp('20130101'),pd.NaT,pd.Timestamp('20130103')])

In [5]: dt
Out[5]: DatetimeIndex(['2013-01-01', 'NaT', '2013-01-03'], dtype='datetime64[ns]', freq=None)

In [6]: dt.take([-2,-1,0])
Out[6]: DatetimeIndex(['NaT', '2013-01-03', '2013-01-01'], dtype='datetime64[ns]', freq=None)

In [7]: dt.take([-2,-1,0],allow_fill=True,fill_value=pd.NaT)
Out[7]: DatetimeIndex(['NaT', 'NaT', '2013-01-01'], dtype='datetime64[ns]', freq=None)

In [8]: dt = pd.date_range('20130101',periods=3)

In [9]: dt
Out[9]: DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03'], dtype='datetime64[ns]', freq='D')

In [10]: dt.take([-2,-1,0],allow_fill=True,fill_value=pd.NaT)
Out[10]: DatetimeIndex(['2013-01-02', 'NaT', '2013-01-01'], dtype='datetime64[ns]', freq=None)

[10] is wrong, this is treating -2 as a positional taker, and -1 as a NA indicator.

so ok with leaving the API, just want to put in some validation tests so that if we are NA filling, then don't allow negative values (except -1)

@sinhrks sinhrks force-pushed the index_take branch 4 times, most recently from 0644232 to 17cd4ee Compare March 30, 2016 02:24
@sinhrks
Copy link
Member Author

sinhrks commented Mar 30, 2016

Yes, added a logic to raise ValueError if NA filling and negative indexer less than -1 are specified.

Indexer contains outside of bounds raises IndexError (no change, added tests).

if mask.any():
taken[mask] = self._na_value
else:
if allow_fill and fill_value is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you refactor this function, something like (as an internal function), as repeating this several times.

def _assert_take_fillable(self, indices, allow_fill, fill_value):
     if allow_fill and fill_value:
          if (indices < -1).any():
            raise .....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, how about this?

@jreback
Copy link
Contributor

jreback commented Mar 30, 2016

In [11]: idx
Out[11]: Float64Index([1.0, 2.0, 3.0, 4.0], dtype='float64')

# I think this shoudl give the same as [13]
In [12]: idx.take([2, -5], fill_value=True)
IndexError: index -5 is out of bounds for size 4

In [13]: idx.take([2, -4], fill_value=True)
ValueError: When allow_fill=True and fill_value is not None, all indices must be >= -1

@sinhrks
Copy link
Member Author

sinhrks commented Mar 31, 2016

@jreback You mean always raising ValueError? IndexError is to be compat with numpy.

np.take(np.array([1, 2]), 3)
# IndexError: index 3 is out of bounds for size 2

@jreback
Copy link
Contributor

jreback commented Mar 31, 2016

no when in filling mode it should raise ValueError with negative indices ; in 12, the -5 should be caught before the take is done (as its negative and not -1)

@sinhrks
Copy link
Member Author

sinhrks commented Mar 31, 2016

Thanks, understood. Fixed and added tests.

taken = self.codes.take(indexer)
@Appender(_index_shared_docs['take'])
def take(self, indices, axis=0, allow_fill=True, fill_value=None):
indices = com._ensure_platform_int(indices)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think you need this ensure here (as its inside assert_take_fillabel)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we move it to assert_take_fillable, it will be executed duplicatelly when take has additional logic`` like below. No need to care?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I just realized that!

ok the platform_int thing is a whole other issue! (we shouldn't even use them at all), or convert JUST where needed (only for things like take).

@jreback jreback closed this in 9b90016 Mar 31, 2016
@jreback
Copy link
Contributor

jreback commented Mar 31, 2016

@sinhrks thanks! you are fixing some of these internal warts! keep em comin'

@sinhrks sinhrks deleted the index_take branch March 31, 2016 13:40
@sinhrks sinhrks mentioned this pull request Apr 4, 2016
4 tasks
jreback pushed a commit that referenced this pull request Apr 7, 2016
related to #4400

Added more tests for sparse indexing.
`SparseArray.take`` has optimized logic to omit dense ``np.ndarray`` creation.
SparseSeires.iloc` can work with negative indices.
Made ``SparseArray.take`` to handle negative indices as the same rule as ``Index`` (#12676)

Author: sinhrks <sinhrks@gmail.com>

Closes #12796 from sinhrks/sparse_test_at and squashes the following commits:

df1f056 [sinhrks] ENH/PERF SparseArray.take indexing
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: Index.take inconsistently handle fill_value
2 participants