Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sure X and y are sorted equally in select_features #715

Merged
merged 5 commits into from
Jun 20, 2020

Conversation

nils-braun
Copy link
Collaborator

This looks like a quite large bug to me.

In principle, X and y could have the same index entries (what we check), but we never check if the order is the same! So passing in a X with index [0, 1, 2] and a y with index [2, 1, 0] will not raise any problems.

For classification tasks, this is not a problem, as all the significance tests except target_real_feature_real_test do not care about the order of the index, as x and y are re-ordered internally again in all the other tasks. But for target_real_feature_real_test the index is stripped before any re-order could happen - which leads to the bug described in #713.

This PR fixes this. I am a bit puzzled why this was never a problem before :-/
Therefore I added the exact test from the issue, to see any degradation again.
@MaxBenChrist as a general remark: we should add more tests on select_features :-)

@github-actions
Copy link

You have style errors. See them below.

./tests/integrations/examples/test_driftbif_simulation.py:73:121: E501 line too long (127 > 120 characters)

@coveralls
Copy link

coveralls commented Jun 16, 2020

Coverage Status

Coverage decreased (-0.05%) to 97.046% when pulling 1a0106f on bugfix/sorting-of-index into f1fcdd2 on master.

Copy link
Collaborator

@MaxBenChrist MaxBenChrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks nils!

@@ -140,6 +140,11 @@ def select_features(X, y, test_for_binary_target_binary_feature=defaults.TEST_FO
if isinstance(y, np.ndarray):
y = pd.Series(y, index=X.index)

y = y.sort_index()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move the sorting inside the calculate_relevance_table method. That way it gets more robust

@nils-braun nils-braun merged commit 7656f51 into master Jun 20, 2020
@nils-braun nils-braun deleted the bugfix/sorting-of-index branch June 20, 2020 15:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants