TST/CLN: correctly skip in indexes/common; add test for duplicated #21902

Merged: 4 commits merged into pandas-dev:master from tst_index_common_skip, Aug 10, 2018

Conversation

h-vetinari
Contributor

Splitting up #21645

codecov bot commented Jul 14, 2018

Codecov Report

Merging #21902 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #21902   +/-   ##
=======================================
  Coverage   92.07%   92.07%           
=======================================
  Files         169      169           
  Lines       50684    50684           
=======================================
  Hits        46668    46668           
  Misses       4016     4016
Flag        Coverage Δ
#multiple   90.48% <ø> (ø) ⬆️
#single     42.34% <ø> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3bcc2bb...f9c9aab.

idx = self._holder([indices[0]] * 5)
assert not idx.is_unique
assert idx.has_duplicates

@pytest.mark.parametrize('keep', ['first', 'last', False])
Contributor

Does this duplicate existing tests (pun intended), or are we not currently testing duplicated for indices?

Contributor Author

@jreback .duplicated is hardly ever tested directly, only indirectly for stuff like .drop_duplicates. Regardless of the changes to .duplicated in #21645, I think duplicated should be tested separately.

Contributor

so these are not relevant?

pandas/tests/indexes/common.py:    def test_duplicates(self, indices):
pandas/tests/indexes/test_category.py:    def test_duplicates(self):
pandas/tests/indexes/test_range.py:    def test_duplicates(self):

Contributor Author

@jreback No, test_duplicates (at least for pandas/tests/indexes/common.py, which is what this PR is about) tests .is_unique and .has_duplicates, but not the .duplicated-method itself.

Contributor

can you point to the coverage that shows this is NOT tested?

Contributor Author

h-vetinari commented Jul 14, 2018

@jreback I never said that it is not tested, just that it is only tested implicitly. Any call to .drop_duplicates will invoke duplicated, so obviously the coverage works out.
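
To illustrate the distinction drawn in this thread, here is a minimal sketch using made-up example data (not taken from the pandas test suite). It shows that .is_unique and .has_duplicates only summarise whether duplicates exist, while .duplicated returns a per-element mask that depends on keep. The final loop reflects the point made above that .drop_duplicates exercises .duplicated only indirectly.

```python
import pandas as pd

# Made-up example data; not from the pandas test suite.
idx = pd.Index(["a", "b", "a", "c", "b"])

# These properties only summarise whether duplicates exist at all.
print(idx.is_unique, idx.has_duplicates)

# .duplicated returns one boolean per element, and depends on `keep`.
masks = {keep: idx.duplicated(keep=keep) for keep in ["first", "last", False]}

# Dropping the flagged positions gives the same result as .drop_duplicates,
# so a drop_duplicates test checks the mask only through its effect.
for keep, mask in masks.items():
    assert idx[~mask].equals(idx.drop_duplicates(keep=keep))
```

A direct test asserts on the mask itself for each keep value, rather than only on the deduplicated result.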

jreback added the Testing label (pandas testing functions or related to the test suite) Jul 14, 2018
jreback added this to the 0.24.0 milestone Jul 14, 2018
idx = self._holder([indices[0]] * 5)
assert not idx.is_unique
assert idx.has_duplicates

@pytest.mark.parametrize('keep', ['first', 'last', False])
def test_duplicated(self, indices, keep):
if type(indices) is not self._holder:
Contributor

can you use isinstance here

Contributor Author

@jreback isinstance of what? I copied the checks from other tests, because I didn't know all the different cases that flow into common.py.

Contributor

jreback commented Jul 16, 2018

isinstance of self._holder, is this what we do elsewhere?

Contributor Author

Not in indexes/common.py, as far as I can see. There are several isinstance checks of course, but never against self._holder, which can apparently be None as well.

Contributor Author

@jreback If I change

if type(indices) is not self._holder:

to

if not isinstance(indices, self._holder):

I get 9 failures (instead of skips), all of which are from tests/indexes/test_numeric.TestInt64Index when run against DatetimeIndex, PeriodIndex or TimedeltaIndex, mainly, it seems, because of TypeError: Unsafe NumPy casting, you must explicitly cast.

Contributor Author

h-vetinari commented Jul 16, 2018

@jreback Seem to have found something: in my normal workbook, I get

pd.__version__
# '0.24.0.dev0+321.g0fe6ded52.dirty'
isinstance(pd.PeriodIndex, pd.Int64Index)
# False

but for some reason, within tests/indexes/common.py, this is the opposite:

    [...output of test run...]
    
    @pytest.mark.parametrize('keep', ['first', 'last', False])
    def test_duplicated(self, indices, keep):
        if not isinstance(indices, self._holder):
            pytest.skip('Can only check if we know the index type')
        if not len(indices) or isinstance(indices, (MultiIndex)):
            # MultiIndex tested separately in:
            # tests/indexes/multi/test_unique_and_duplicates
            pytest.skip('Skip check for empty Index and MultiIndex')
        if isinstance(indices, (PeriodIndex, DatetimeIndex)):
            # this branch should be impossible for Int64Index
            # after the instance-check above!
>           raise ValueError(f'{type(indices).__name__}, {self._holder.__name__}, '
                             f'{isinstance(indices, self._holder)}, {type(indices) is self._holder}')
E           ValueError: PeriodIndex, Int64Index, True, False

Same happens for DatetimeIndex. It does work for the original is not variant, so I'm leaving that as it is for now.

Contributor Author

Opened a follow-up: #22211

Contributor Author

Never mind, I had gotten confused between instances and classes.

So DatetimeIndex, PeriodIndex and TimedeltaIndex are subclasses of Int64Index - but are unsafe to cast to Int64...? I guess this is intentional?

In any case, under these circumstances, I'm even more convinced that it's best to just stay with the if type(indices) is not self._holder: condition.
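
The difference between the two checks, and the instance/class mix-up mentioned above, can be sketched in plain Python. The classes below are hypothetical stand-ins, not the pandas index classes; they only mirror the subclass relationship described in this thread.

```python
class BaseIndex:
    """Hypothetical stand-in for a parent index class."""


class DatetimeLikeIndex(BaseIndex):
    """Hypothetical subclass, mirroring an index that inherits from the parent."""


def exact_type_matches(obj, holder):
    # True only when obj's class is exactly `holder`; subclasses are rejected.
    return type(obj) is holder


def instance_matches(obj, holder):
    # True for instances of `holder` and of any of its subclasses.
    return isinstance(obj, holder)


sub = DatetimeLikeIndex()
print(exact_type_matches(sub, BaseIndex))        # False: subclass is filtered out
print(instance_matches(sub, BaseIndex))          # True: subclass passes the check
print(isinstance(DatetimeLikeIndex, BaseIndex))  # False: a class is not an instance
print(issubclass(DatetimeLikeIndex, BaseIndex))  # True: the right check for classes
```

Under these assumptions, a skip condition based on type(...) is not ... skips subclass instances, whereas one based on isinstance lets them through to the test body.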


idx = self._holder(indices)
if idx.has_duplicates:
# We need to be able to control creation of duplicates here
Contributor

this comment is a bit obtuse, can you reword

h-vetinari force-pushed the tst_index_common_skip branch 2 times, most recently from c0d3ec4 to 7f74578, July 16, 2018 16:16
@h-vetinari
Contributor Author

@jreback All green. Any more feedback / comments?

@@ -37,7 +37,7 @@ def verify_pickle(self, indices):
def test_pickle_compat_construction(self):
# this is testing for pickle compat
if self._holder is None:
return
Contributor

is this actually hit?

Contributor Author

I don't know, I didn't write these tests, and the inner workings of indexes/common.py are not immediately apparent (which files call it, what do they fill indices with, etc.).

Contributor Author

Happy to leave the bare return in there, if that's preferred.

Contributor

well, put a breakpoint there and see if you would.

Contributor Author

I won't be able to work on any of this for the next 2 weeks. I'll have a look after that.
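
For context on why a bare return in a test matters: a test that returns early is reported as passed even though nothing was checked, while an explicit skip is reported as skipped. The sketch below shows this with the standard-library unittest module rather than pytest, purely to stay dependency-free; pytest.skip plays the analogous role in the pandas tests. The class and attribute names are hypothetical.

```python
import unittest


class HolderTests(unittest.TestCase):
    holder = None  # hypothetical stand-in for self._holder

    def test_bare_return(self):
        if self.holder is None:
            return  # counted as a pass although nothing was asserted
        self.assertTrue(callable(self.holder))

    def test_explicit_skip(self):
        if self.holder is None:
            self.skipTest("no holder configured")  # counted as skipped
        self.assertTrue(callable(self.holder))


def run_suite():
    # Run the two tests programmatically and return the collected result.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(HolderTests)
    result = unittest.TestResult()
    suite.run(result)
    return result


result = run_suite()
print(result.testsRun, len(result.skipped), result.wasSuccessful())
```

Both tests run without failures, but only the explicit skip shows up in the skipped list, which makes untested combinations visible in the report.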


n, k = len(idx), 10
duplicated_selection = np.random.choice(n, k * n)
expected = pd.Series(duplicated_selection).duplicated(keep=keep).values
Contributor

I hate to check as a numpy array, much prefer to check the type and use assert_index_equal or assert_series_equal. is this how the other tests are?

Contributor Author

h-vetinari commented Jul 20, 2018

Um, index.duplicated() (without return_inverse) yields a numpy array - this is the documented signature (I guess because selecting on an Index really only needs an ndarray).

All the manual duplicated-tests actually create their own data, and know what the correct outcome should be. Here, we're feeding tons of different things through that test, so we need to determine what is actually duplicated, as I'm doing with duplicated_selection. self._holder(idx.values[duplicated_selection]) is then an Index of the correct type that contains duplicates, and since we know where its duplicates are (from inspecting duplicated_selection), we can validate the result.
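
The validation idea described above can be sketched as follows. This is a simplified illustration, not the test from the PR: the helper name check_duplicated, the seed, and the use of np.random.default_rng and a plain pd.Index (instead of self._holder) are assumptions made for the example. It relies on the input index being unique and non-empty, so that a value repeats exactly where its sampled position repeats.

```python
import numpy as np
import pandas as pd


def check_duplicated(idx, keep, k=10, seed=0):
    """Hypothetical helper: compare Index.duplicated against a mask derived
    from the positions that were sampled."""
    assert len(idx) and idx.is_unique  # the argument relies on unique input
    n = len(idx)
    rng = np.random.default_rng(seed)
    # Sample k * n positions with replacement, so repeats are guaranteed.
    selection = rng.integers(0, n, size=k * n)
    # A position repeats exactly where the corresponding value repeats.
    expected = pd.Series(selection).duplicated(keep=keep).values
    result = pd.Index(idx.values[selection]).duplicated(keep=keep)
    return np.array_equal(result, expected)


idx = pd.Index(["a", "b", "c", "d"])
for keep in ["first", "last", False]:
    assert check_duplicated(idx, keep)
```

The expected mask is computed from the integer positions, so the check does not need to know anything about the element type of the index under test.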

@h-vetinari
Contributor Author

> is this actually hit?

@jreback The if-branch you mentioned in test_pickle_compat_construction was not hit, therefore I removed it. Please re-review.

h-vetinari force-pushed the tst_index_common_skip branch from db29d9b to cfa6182, August 5, 2018 17:38
@h-vetinari
Contributor Author

The travis failure is unrelated. There seems to be a problem with the 3.5 job when collecting parametrized tests: the order is not stable, which yields a failure because different tests were collected:

==================================== ERRORS ====================================
_____________________________ ERROR collecting gw1 _____________________________
Different tests were collected between gw0 and gw1. The difference is:

[...]

 pandas/tests/frame/test_apply.py::TestDataFrameAggregate::()::test_agg_cython_table[axis 0-df0-sum-expected0]
-pandas/tests/frame/test_apply.py::TestDataFrameAggregate::()::test_agg_cython_table[axis 0-df1-func1-expected1]
-pandas/tests/frame/test_apply.py::TestDataFrameAggregate::()::test_agg_cython_table[axis 0-df2-sum-expected2]
+pandas/tests/frame/test_apply.py::TestDataFrameAggregate::()::test_agg_cython_table[axis 0-df1-sum-expected1]
+pandas/tests/frame/test_apply.py::TestDataFrameAggregate::()::test_agg_cython_table[axis 0-df2-func2-expected2]
 pandas/tests/frame/test_apply.py::TestDataFrameAggregate::()::test_agg_cython_table[axis 0-df3-max-expected3]
 pandas/tests/frame/test_apply.py::TestDataFrameAggregate::()::test_agg_cython_table[axis 0-df4-amax-expected4]
 pandas/tests/frame/test_apply.py::TestDataFrameAggregate::()::test_agg_cython_table[axis 0-df5-func5-expected5]

[...]

@h-vetinari
Contributor Author

@jreback ping. Should be ready to go - travis failure is unrelated.

@jreback
Contributor

jreback commented Aug 9, 2018

@h-vetinari actually no, your PR is causing the fail. you have non-determinism in the test generation. usually this is because the ordering of the fixtures / parameters is based on a dictionary.
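
As a general illustration of the failure mode described here (not a diagnosis of this PR), parameters built by iterating an unordered container can come out in a different order in each worker process. Dict iteration order was not guaranteed on Python 3.5, and sets of strings can still iterate differently between processes because of hash randomization. A minimal sketch of one common remedy, sorting before handing the list to the parametrization, is shown below; the helper name and the table contents are hypothetical.

```python
def stable_params(table):
    """Return (name, expected) pairs sorted by name, independent of how
    `table` happens to iterate."""
    return sorted(table.items(), key=lambda item: item[0])


# Hypothetical lookup table; the names and values are illustrative only.
table = {"sum": 3.0, "max": 2.0, "min": 1.0}

# This list would be what gets passed to pytest.mark.parametrize, so every
# worker generates the same test IDs in the same order.
params = stable_params(table)
ids = [name for name, _ in params]
print(ids)  # ['max', 'min', 'sum']
```

Any deterministic ordering works; sorting by name is simply the easiest one to reason about when comparing collected test IDs across workers.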

@h-vetinari
Contributor Author

h-vetinari commented Aug 9, 2018

@jreback

> actually no, your PR is causing the fail

I honestly can't see how this would be the case. I only change tests (no fixtures or anything) in

  • pandas/tests/indexes/common.py
  • pandas/tests/indexes/test_category.py
  • pandas/tests/indexes/test_range.py

and the collection errors are in pandas/tests/frame/test_apply.test_agg_cython_table.

The parametrization of that test is

    @pytest.mark.parametrize("df, func, expected", chain(
        _get_cython_table_params(
            DataFrame(), [
                ('sum', Series()),
                ('max', Series()),
                ('min', Series()),
                ('all', Series(dtype=bool)),
                ('any', Series(dtype=bool)),
                ('mean', Series()),
                ('prod', Series()),
                ('std', Series()),
                ('var', Series()),
                ('median', Series()),
            ]),
        _get_cython_table_params(
            DataFrame([[np.nan, 1], [1, 2]]), [
                ('sum', Series([1., 3])),
                ('max', Series([1., 2])),
                ('min', Series([1., 1])),
                ('all', Series([True, True])),
                ('any', Series([True, True])),
                ('mean', Series([1, 1.5])),
                ('prod', Series([1., 2])),
                ('std', Series([np.nan, 0.707107])),
                ('var', Series([np.nan, 0.5])),
                ('median', Series([1, 1.5])),
            ]),
    ))

h-vetinari force-pushed the tst_index_common_skip branch from cfa6182 to f9c9aab, August 9, 2018 08:05
@h-vetinari
Contributor Author

rebased again, let's see if it helps

@h-vetinari
Contributor Author

@jreback all green

@h-vetinari
Contributor Author

Btw, that issue with the test order is related to #22156, #22157

jreback merged commit c7d6264 into pandas-dev:master Aug 10, 2018
@jreback
Contributor

jreback commented Aug 10, 2018

thanks @h-vetinari

h-vetinari deleted the tst_index_common_skip branch August 10, 2018 17:17
Labels
Testing pandas testing functions or related to the test suite
2 participants