ENH: Changing Series and DataFrame repr for NaN values #15375

Closed
wesm opened this issue Feb 12, 2017 · 6 comments · Fixed by #29964
Labels
Enhancement Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Output-Formatting __repr__ of pandas objects, to_string

@wesm
Member

wesm commented Feb 12, 2017

With future pandas internal improvements in contemplation, I have often wondered if it would be worth changing the NaN outputs to be NA or NULL instead to reflect the actual semantics of the data. This could be something that's configurable in pandas.options (i.e. showing the semantic value or the physical value).
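
As a rough illustration of the "semantic versus physical" display idea, the existing `na_rep` argument of `Series.to_string` already lets a caller choose how missing values are rendered. This is only a sketch of per-call formatting, not the global option proposed above, and the sample data is made up.

```python
import pandas as pd

# Float Series whose missing entry is physically stored as NaN.
s = pd.Series([1.0, float("nan"), 3.0])

# Default rendering shows the physical value (NaN).
physical = s.to_string()

# na_rep substitutes a label for missing entries in this one call only;
# the underlying data is unchanged.
semantic = s.to_string(na_rep="NA")

print(physical)
print(semantic)
```

A global setting in pandas.options would apply the same substitution to every repr, which is what the proposal is really about.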

@jreback jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Output-Formatting __repr__ of pandas objects, to_string labels Feb 12, 2017
@jreback jreback added this to the 1.0 milestone Feb 12, 2017
@jorisvandenbossche
Member

I would be in favor of that (NA instead of NaN).
The only thing I am wondering is whether we already know if we want to distinguish between NA and NaN for float data in pandas 2.0, where this would theoretically be possible when using bitmasks for NA. If we do want that distinction, it might make sense to hold off on changing the repr until 2.0 instead of 1.0.

@wesm
Member Author

wesm commented Feb 12, 2017

It's somewhat out of scope for this issue, but I've been thinking about the NaN problem in pandas 2.0, and I think for backwards compatibility reasons we're going to be forced to make NaN and NA / NULL equivalent during the transition period. We could later add warnings when NaN is being treated as NA in operations like s[...] = np.nan (which I can attest litters people's pandas code), then add an option (where the default is that NaN and NA are different), and finally remove the option.

I am pretty confident that using bitmaps everywhere will make our code much simpler and faster (i.e. can use SIMD operations on the bitmaps to deal with null analytics and propagation).

@wesm
Member Author

wesm commented Feb 12, 2017

As one example of why things will be faster, we can use bitmaps to eliminate branching in aggregations:

sum_x += values[i] * GetBit(bitmap, i);

compared with

if (values[i] == values[i]) sum_x += values[i];
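
The two loops above can be compared in a small pure-Python sketch. It assumes a packed validity bitmap with least-significant-bit ordering (1 = valid) and that masked slots hold a finite placeholder, since NaN multiplied by 0 is still NaN. The helper names `get_bit`, `sum_with_bitmap`, and `sum_with_nan_check` are hypothetical, and Python will not show the speed difference; the point is only that the first loop has no data-dependent branch.

```python
def get_bit(bitmap: bytes, i: int) -> int:
    """Return bit i of a packed validity bitmap (LSB-first, 1 = valid)."""
    return (bitmap[i >> 3] >> (i & 7)) & 1


def sum_with_bitmap(values, bitmap):
    """Branch-free accumulation: invalid slots are multiplied by 0."""
    total = 0.0
    for i, v in enumerate(values):
        total += v * get_bit(bitmap, i)
    return total


def sum_with_nan_check(values):
    """Sentinel-based accumulation: NaN is the only value with v != v."""
    total = 0.0
    for v in values:
        if v == v:
            total += v
    return total


# Slot 2 is missing in both representations.
masked_values = [1.0, 2.0, 0.0, 4.0]   # placeholder 0.0 in the masked slot
bitmap = bytes([0b00001011])           # bits 0, 1, 3 set: slots 0, 1, 3 valid
nan_values = [1.0, 2.0, float("nan"), 4.0]

bitmap_total = sum_with_bitmap(masked_values, bitmap)
nan_total = sum_with_nan_check(nan_values)
```

Both totals come out equal because the same three valid entries are summed in each case.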

@jorisvandenbossche
Member

Yes, we could have a 'nan_as_missing' option that is first True and could possibly later change (similar to how we now have pd.options.mode.use_inf_as_null option that is False by default).

@jorisvandenbossche
Member

The issue to discuss this is probably: wesm/pandas2#46

@TomAugspurger
Contributor

FWIW, I think this will be effectively closed by #29964. That changes IntegerArray to use pd.NA, which has NA as its repr.

In [1]: import pandas as pd

In [2]: pd.Series(pd.array([1, 2, None]))
Out[2]:
0     1
1     2
2    NA
dtype: Int64

I don't think we'll want to move away from np.nan for ndarrays, and I don't think we want to display nan as NA, since they have different behaviors.
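
A short sketch of the "different behaviors" point, assuming pandas 1.0 or later: comparisons with NaN return an ordinary False, while comparisons with pd.NA propagate the missing value. Note that the released versions I am aware of print the scalar as `<NA>` rather than the bare `NA` shown in the output above.

```python
import pandas as pd

nan = float("nan")

# NaN follows IEEE 754: any equality comparison is simply False.
nan_eq = nan == 1

# pd.NA propagates: the result of the comparison is itself missing.
na_eq = pd.NA == 1

# The two kinds of missing value also render differently in a Series repr.
float_s = pd.Series([1.0, 2.0, nan])                       # float64, shows NaN
int_s = pd.Series(pd.array([1, 2, None], dtype="Int64"))   # masked, shows <NA>

print(repr(float_s))
print(repr(int_s))
```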
