Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Crosstab margins ignoring dropna #12614

Closed
wants to merge 6 commits into from
Closed

Conversation

OXPHOS
Copy link
Contributor

@OXPHOS OXPHOS commented Mar 14, 2016

To fix bug #12577 : Crosstab margins ignoring dropna

To fix bug pandas-dev#12577: Crosstab margins ignoring dropna
@OXPHOS OXPHOS changed the title Modified _generate_marginal_results Modified pivot._generate_marginal_results Mar 14, 2016
@jreback
Copy link
Contributor

jreback commented Mar 14, 2016

tests!

@jreback jreback added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Mar 14, 2016
@OXPHOS
Copy link
Contributor Author

OXPHOS commented Mar 14, 2016

I did test several groups of data on my Mac and they worked out well.. Can I have some hints where could go wrong? Thanks!

@jreback
Copy link
Contributor

jreback commented Mar 14, 2016

you need to add tests to the codebase. pandas/tools/tests/test_pivot.py. see the docs

@nickeubank
Copy link
Contributor

@OXPHOS Thanks for the PR!

Contributions to pandas require not only that code be modified, but also that you add tests that ensure the proper behavior. This helps ensure that when other people change code in the future, if they do something that breaks the code you've now written, when they run the test suite (which will include your tests) they'll catch the error.

The section of the dogs that talks about it is here. http://pandas.pydata.org/pandas-docs/stable/contributing.html#test-driven-development-code-writing

As a basic test, you can basically write out the case from the issue report. You should probably put around line 938 in pandas/tools/tests/test_pivot.py

if dropna, pass truncated data to _add_margin
@OXPHOS OXPHOS changed the title Modified pivot._generate_marginal_results BUG: Crosstab margins ignoring dropna Mar 14, 2016
@OXPHOS
Copy link
Contributor Author

OXPHOS commented Mar 14, 2016

Hi @nickeubank , thanks for your instruction and thanks @jreback !

I modified the codes and added tests.
For modification: truncated data pass to _add_margin if dropna == true in pivot.pivot_table()
For tests: added 3 tests in test_pivot.py. Yet I am not sure about the test style (i.e. class name).

I tried to add another test case:
df = DataFrame({'a':[1, np.nan, np.nan, np.nan, np.nan, 2], 'b':[np.nan, 3, 4, 4, 4, np.nan]}) print (crosstab(df.a,df.b, margins=True))

All intersected entries are NaN. However, this case leads to a general fail (even without my modification) and to be honest I don't how what the expected result should be like.

Error Information:

raise ValueError('No objects to concatenate')

ValueError: No objects to concatenate

I assume we need try/except somewhere else upstream (but I am rushing for lunch..)

margins_name=margins_name)
if dropna:
data_dropna = data[data.notnull().all(axis = 1)]
table = _add_margins(table, data_dropna, values, rows=index,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to repeat the if, just do:

if dropna:
   data = data[......]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean like below? I felt unsafe changing data values but if you're okay with it I'll update the codes.

if dropna: data = data[data.notnull().all(axis = 1)] table = _add_margins(table, data, values, rows=index, cols=columns, aggfunc=aggfunc, margins_name=margins_name)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how are you changing data values?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think she means over-writing data. That's fine if all tests clear!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks!

@jreback
Copy link
Contributor

jreback commented Mar 14, 2016

pls add a whatsnew entry

@nickeubank
Copy link
Contributor

@OXPHOS I think that the "general fail" you note where no objects intersect is actually behaving fine -- it's throwing an informative exception, and I can't think of another way it could be handled.

Here are docs for adding a "whatsnew" entry -- basically, this is just a notification file that gives people a headsup about changes. I think this will be in version 0.18.1 (@jreback sound right?)

@jreback
Copy link
Contributor

jreback commented Mar 14, 2016

yep

simplified codes in pivot_table()
updated what’s new entry
updated comments in test_pivot.py
@OXPHOS
Copy link
Contributor Author

OXPHOS commented Mar 15, 2016

... I am so sorry I ran flake8 as the first step for the tests but didn't go back to it right before commit or after modification. I know it took a long time to run through the test line. I'll do better this time.

@OXPHOS
Copy link
Contributor Author

OXPHOS commented Mar 16, 2016

Wow I passed! Thanks @nickeubank and @jreback guys for your help! I saw a conflict with base branch now. Anything else I should do? Thanks!

@jreback jreback added this to the 0.18.1 milestone Mar 16, 2016

df = pd.DataFrame({'a': [1, 2, 2, 2, 2, np.nan],
'b': [3, 3, 4, 4, 4, 4]})
actual = pd.crosstab(df.a, df.b, margins=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explictly add dropna=True for these. tests

then add a test for dropna=False (same set of tests)
and you will find it raises. the level is None fails.

In [10]: pd.crosstab(df.a,df.b,margins=True,dropna=False)
KeyError: 'Level None not found'

@jreback
Copy link
Contributor

jreback commented Mar 16, 2016

looks good. see my comment above. Have to handle another case here.

@OXPHOS
Copy link
Contributor Author

OXPHOS commented Mar 16, 2016

Okay so here's what I found:

  • This is an independent issue. Apparently no previous tests have conditions margins=True and dropna=False at the same time.
  • For now, I added if to table.columns.names level in _add_margins and avoided the error. I am not sure whether we need the same condition in table.rows.names level.
  • But the problem raised from the if not dropna: table = table.reindex_axis(m, axis=1) in pivot_table, which erased all column names. I have been confused by the column names for a while. I was not sure whether the column names would show up on the side. Then it turns out that the show-up of column names completely depends on dropna, which is kinda weird. Apparently this is a problem for a while and I think it's worth fixing. However, I tried to play with table = table.reindex_axis(m, axis=1) but failed. I haven't come up with a good idea to solve the problem so far.

@jreback
Copy link
Contributor

jreback commented Mar 16, 2016

@OXPHOS ok let's do this.

going to merge your current PR.

Pls open an issue for the other case with a simple repro example (and cross reference to this PR).

@OXPHOS
Copy link
Contributor Author

OXPHOS commented Mar 16, 2016

Will open soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants