-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
TST/CLN: correctly skip in indexes/common; add test for duplicated #21902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #21902 +/- ##
=======================================
Coverage 92.07% 92.07%
=======================================
Files 169 169
Lines 50684 50684
=======================================
Hits 46668 46668
Misses 4016 4016
Continue to review full report at Codecov.
|
idx = self._holder([indices[0]] * 5) | ||
assert not idx.is_unique | ||
assert idx.has_duplicates | ||
|
||
@pytest.mark.parametrize('keep', ['first', 'last', False]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a duplicate these (pun intended) or are not testing indices duplicated currently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so these are not relevant?
pandas/tests/indexes/common.py: def test_duplicates(self, indices):
pandas/tests/indexes/test_category.py: def test_duplicates(self):
pandas/tests/indexes/test_range.py: def test_duplicates(self):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback No, test_duplicates
(at least for pandas/tests/indexes/common.py
, which is what this PR is about) tests .is_unique
and .has_duplicates
, but not the .duplicated
-method itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you point to the coverage that shows this is NOT tested?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback I never said that it is not tested, just that it is only tested implicitly. Any call to .drop_duplicates
will invoke duplicated
, so obviously the coverage works out.
idx = self._holder([indices[0]] * 5) | ||
assert not idx.is_unique | ||
assert idx.has_duplicates | ||
|
||
@pytest.mark.parametrize('keep', ['first', 'last', False]) | ||
def test_duplicated(self, indices, keep): | ||
if type(indices) is not self._holder: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use isinstance here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback isinstance
of what? I copied the checks from other tests, because I didn't know all the different cases that flow into common.py
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isinstance of self._holder, is this what we do elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not in indexes/common.py
, as far as I can see. There's several isinstance
of course, but never for self._holder
, which can apparently be None
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback If I change
if type(indices) is not self._holder:
to
if not isinstance(indices, self._holder):
I get 9 failures (instead of skips), all of which are from
tests/indexes/test_numeric.TestInt64Index
, when run against DatetimeIndex
, PeriodIndex
or TimedeltaIndex
, mainly because of TypeError: Unsafe NumPy casting, you must explicitly cast
it seems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback Seem to have found something: in my normal workbook, I get
pd.__version__
# '0.24.0.dev0+321.g0fe6ded52.dirty'
isinstance(pd.PeriodIndex, pd.Int64Index)
# False
but for some reason, within tests/indexes/common.py
, this is the opposite:
[...output of test run...]
@pytest.mark.parametrize('keep', ['first', 'last', False])
def test_duplicated(self, indices, keep):
if not isinstance(indices, self._holder):
pytest.skip('Can only check if we know the index type')
if not len(indices) or isinstance(indices, (MultiIndex)):
# MultiIndex tested separately in:
# tests/indexes/multi/test_unique_and_duplicates
pytest.skip('Skip check for empty Index and MultiIndex')
if isinstance(indices, (PeriodIndex, DatetimeIndex)):
# this branch should be impossible for Int64Index
# after the instance-check above!
> raise ValueError(f'{type(indices).__name__}, {self._holder.__name__}, '
f'{isinstance(indices, self._holder)}, {type(indices) is self._holder}')
E ValueError: PeriodIndex, Int64Index, True, False
Same happens for DatetimeIndex
. It does work for the original is not
variant, so I'm leaving that as it is for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened a follow-up: #22211
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, I had gotten confused between instances and classes.
So DatetimeIndex
, PeriodIndex
and TimedeltaIndex
are subclasses of Int64Index
- but are unsafe to cast to Int64...? I guess this is intentional?
In any case, under these circumstances, I'm even more convinced that it's best to just stay with the if type(indices) is not self._holder:
condition.
pandas/tests/indexes/common.py
Outdated
|
||
idx = self._holder(indices) | ||
if idx.has_duplicates: | ||
# We need to be able to control creation of duplicates here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment is a bit obtuse, can you reword
c0d3ec4
to
7f74578
Compare
@jreback All green. Any more feedback / comments? |
pandas/tests/indexes/common.py
Outdated
@@ -37,7 +37,7 @@ def verify_pickle(self, indices): | |||
def test_pickle_compat_construction(self): | |||
# this is testing for pickle compat | |||
if self._holder is None: | |||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this actually hit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know, I didn't write these tests, and the inner workings of indexes/common.py
are not immediately apparent (which files call it, what do they fill indices
with, etc.).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to leave the bare return in there, if that's preferred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, put a breakpoint there and see if you would.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not able to do work on any of this for 2 weeks now. I'll have a look after.
|
||
n, k = len(idx), 10 | ||
duplicated_selection = np.random.choice(n, k * n) | ||
expected = pd.Series(duplicated_selection).duplicated(keep=keep).values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hate to check as a numpy array, much prefer to check the type and use assert_index_equal or assert_series_equal. is this how the other tests are?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Um, index.duplicated()
(without return_inverse
) yields a numpy array - this is the documented signature (I guess because selecting on an Index
really only needs an ndarray
).
All the manual duplicated
-tests actually create their own data, and know what the correct outcome should be. Here, we're feeding tons of different things through that test, so we need to determine - as I'm doing with duplicated_selection
what is actually duplicate; self._holder(idx.values[duplicated_selection])
is then a duplicate Index of the correct type, but we know where its duplicates are (from inspecting duplicated_selection
), and therefore can validate.
@jreback The if-branch you mentioned in |
db29d9b
to
cfa6182
Compare
The travis failure is unrelated. There seems to be a problem with the 3.5 job when collecting parametrized tests, in that the order is not stable, which yields a failure due to different tests
|
@jreback ping. Should be ready to go - travis failure is unrelated. |
@h-vetinari actually no, your PR is causing the fail. you have non-determinism in the test generation. usually this is because the ordering of the fixtures / parameters is based on a dictionary. |
I honestly can't see how this would be the case. I only change tests (no fixtures or anything) in
and the collection errors are in The parametrization of that test is
|
cfa6182
to
f9c9aab
Compare
rebased again, let's see if it helps |
@jreback all green |
thanks @h-vetinari |
Splitting up #21645
duplicated
return
statements (which falsely pass the test) intopytest.skip
.