-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PDEP-5: NoRowIndex #49694
PDEP-5: NoRowIndex #49694
Conversation
## Detailed Description | ||
|
||
This would require 3 steps: | ||
1. creation of a ``NoIndex`` object, which would be a subclass of ``RangeIndex`` on which |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious why NoIndex
be a subclass of RangeIndex
? Does it just make things easier to implement?
IMO it feels like NoIndex
should be an abstract-like class that Index and RangeIndex should subclass (and implement the "index-like" functionality
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or why not just None
? (i.e. df.index is None
)
In any case I agree with @mroeschke that it seems strange to make this a subclass of RangeIndex, since we explicitly don't want to have it act as a RangeIndex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expect that under the hood (e.g. at the Manager level) we'd still have an Index object (until/unless we refactor to get the indexes out of the Managers) and a NoIndex object makes sense there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but that could be an internal implementation detail (to avoid having to check internally for if index is not None:
too often). I think what matters here is what we want for the user. In the public .index
property, we could convert a "NoIndex" object into None
, if we think that is the better user API.
(although in the internals at least we don't do much with the index, we mostly pass it through, and passing through None could also work; I suppose it is mostly outside of the internals that a NoIndex object could make an initial implementation easier (less places to update))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding whether it could just be None
- just as an example, if you have
ser = Series([1,2,3])
ser.loc[ser>1] = 4
then it'll call Index.get_loc
.
Line 1132 in dbb2adc
loc = self.index.get_loc(key) |
So rather than re-engineering _set_with_engine
to not use self.index
if index
is None
, I thought it'd be simpler to have some NoIndex
object which overrides get_loc
Then, with get_loc
, append
, and join
overridden, very little would need changing outside of the NoIndex
class itself
As for whether NoIndex
needs to subclass RangeIndex
- perhaps not, but my reasoning was that for aligning and joining, it would act just like RangeIndex
, but might just need to santise its inputs.
join
, for example, could be:
if not isinstance(other, NoIndex):
raise TypeError("Can't join NoIndex with Index")
if not len(self) == len(other):
raise TypeError("Can't join NoIndex objects of different lengths")
return super().join(other, how=how, return_indexers=return_indexers, sort=sort)
Not sure I follow the suggestion about it being an abstract class:
- where would the methods be implemented? If
NoIndex
is an abstract class, what would the concrete class be? - would
Index
then be defined asclass Index(NoIndex, IndexOpsMixin, PandasObject)
?
An alternative would be to have class NoIndex(IndexOpsMixin, PandasObject)
, though it would then require re-implementing many of the Index
methods (like __len__
)
I'll look further into this and will amend the proposal - thanks all for your comments!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I follow the suggestion about it being an abstract class:
Sorry I think "abstract-like" was probably a loaded term to use: I meant more akin to a skeleton class (that you alluded to) where you can define the necessary methods and raise a NotImeplementedError
to "disable" indexing-like behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So rather than re-engineering
_set_with_engine
to not useself.index
ifindex
isNone
, I thought it'd be simpler to have someNoIndex
object which overridesget_loc
Then, with
get_loc
,append
, andjoin
overridden, very little would need changing outside of theNoIndex
class itself
For me, I would say this is still an implementation detail. We could decide to internally use for example ._index
instead of .index
, which would be ensured to return an Index (sub)class instance, so that this can be implemented this way without too much changes (apart from adding the underscore). And this way the public property could still return None
.
For the public API, the question is: would it be useful for the user to have a NoIndex object? (to be clear, I am not necessarily saying the answer is definitly "no", we should probably think through some use cases) How would the user make use of this?
And do we want that the user starts relying on the fact that they can still call all the Index methods on this NoIndex object?
(and for this PDEP, I don't think we necessarily have to agree on the exact implementation approach, but I would like to see a concrete API proposal for the public API)
Is there a proposed result for
I suppose this must no longer be a Series? Edit: Or is the result the values without any column labels? Another one is transpose - which I think will have to lose the column labels. We use transpose internally at times to do axis=1 computations, but we can track the column labels there. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if its better to have this as a dataframe / series constructor argument rather than a global setting. Also @rhshadrach would probably have some good input as to how this aligns with some of the groupby parameters he has been working on
Thanks all for your comments!
Ah, that's a really good one. I hadn't tried that out, and I don't have a good solution. It'd be nice to do In [11]: df
Out[11]:
a b c
1 4 7
2 5 8
3 6 9
In [12]: df.sum()
Out[12]:
a b c
6 15 24 but I think that'd be too much of a breaking change from the regular pandas case. Not keen on having: In [14]: df.sum()
Out[14]:
6
15
24
dtype: int64 because here the row labels would be really meaningful, but also not keen on having and extra column like In [19]: df.sum()
Out[19]:
column sum
a 6
b 15
c 24 I can't see how to get this to conform with "3. Nobody should get an index unless they ask for one" without it being too big of a change to current behaviour. I think I may need to drop "3." from this proposal, and only describe how a |
I would personally find that the better option. Specifically for the reduction example: a DataFrame has named columns, so if those end up in the index in a certain operation, we use those names. So you could say that the user "asks for this", when it does an operation where the columns names are used for row labels. I personally also think that having an option to have to no index to start with (i.e, in IO, constructors, etc) is already useful in itself (and is a smaller scoped change than the "with this option enabled, no index is ever created", since we just have quite some functionality that introduce meaningful indices even when starting from a default range index / no index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason we don't want to call this NoRowIndex? even internally
#as by definition we are allowing this for axis=0
(i guess we could allow no column index but then it's just a 2D numpy array)
Thanks everyone for your comments, really appreciate them - I'll revise the proposal trying to account for everyone's thoughts For now it'll focus exclusively on the |
Do you have a point of view as to what should happen when objects without an index that are a different length get added? Do they truncate to the shortest length, raise an error, or do something else? |
that'd raise - aiming to push a revision with plenty of examples tomorrow |
Thanks all for your reviews I've pushed another commit with a first revision. I've resolved most of the comments as I think I've addressed them in the revision, either in the body or in the FAQ section I've left open the discussion about whether |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One global comment. You may want to avoid contractions, i.e., instead of "wouldn't", say "would not", instead of "didn't", use "did not", etc.
|
||
The suggestion is to add a ``NoRowIndex`` class. Internally, it would act a bit like | ||
a ``RangeIndex``, but some methods would be stricter. This would be one | ||
step towards enabling users who don't want to think about indices to not have to. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"This would be one step towards enabling users who don't want to think about indices." I think the phrase "to not have to" is awkward.
My one general issue is this might be throwing out the baby with the bath water, though I get why it might be easy to implement. It also feels quite dangerous as it will be inconsistent with pretty much how pandas is documented everywhere to behave. This will make stack overflow questions hard to answer as you'll need to know whether this option was set. Row Indices are also broadly useful and declarative. The point of pandas over numpy is that it adds meta-data which controls its behavior. I can't imagine turning them off and retaining the value of Pandas. The problem with indices occurs with implicit and rather arbitrary data alignment during various operations, such as sum above. In other data QLs you need to specify this behavior explicitly via a join clause. Better I think would be to leave indices largely alone but have an option (which can be globally set) to turn it off/on with one of the following: strict mode / error on unaligned indices, off with ignore error (which this seems to be, and it seems very dangerous), or on (current behavior). The dangerous mode, I only include it to be friendly, but I'd advise against it. For stack overflow questions, an answer could often be - try setting strict mode to help debug your problem. Data alignment may be confusing, but it's extremely important as it ensures that the user isn't incorrectly collating unaligned data, which can easily happen. Implicit behavior like this is often the source of confusion, especially when it's not very consistent. My biggest complaint with how it currently automagically works and doesn't warn you when probably you made a mistake. eg: |
Thanks for taking a look - I'll respond to some of your points:
This wouldn't be the dangerous mode you describe, because without row labels, there'd be no concept of misalignment. If you try to sum (or take
pandas doesn't know what you're trying to do, so it can't warn when you "probably made a mistake". An option to warn whenever indexes aren't aligned would be a separate discussion - if you want to take it further, please open a new issue and I'd be happy to take a look. Let's keep this issue's discussion focused on not having row labels. |
These checks aren't expensive, but they do add up (xref #49771). I'm more concerned about the maintenance/testing burde they add up to. |
Thanks all for your reviews and comments, I really appreciate it 🙏 After several discussions, it's become clear that there's not enough support for this to pass, at least in its current state. Hence, I've marked the proposal as 'rejected', and have explained the reasons why in a section at the bottom of the page. As mentioned in the PDEP, this has still been a useful exercise which has resulted in 2 related proposals. It may be that the idea of not having an index can make it into pandas in the future, but it would require a more comprehensive set of changes to be made together. This particular version of the proposal, however, doesn't have a chance of making it. No hard feelings, and thank you all for the time you dedicated to reviewing this 🤗 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WIthdrawn status is okay with me. Thanks for exploring this @MarcoGorelli!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there were some comments from my previous review that weren't addressed, and I added new ones here.
```python | ||
In [12]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=NoRowIndex(3)) | ||
|
||
In [13]: df.loc[df['a']>1, 'b'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For readability, would be useful to have spaces around the >
operator here and a few lines below
|
||
### Columns cannot be NoRowIndex | ||
|
||
This proposal deals exclusively with letting users not have to think about |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to write "This proposal deals exclusively with allowing users to not have to think about"
then the following would all create a ``DataFrame`` with a ``NoRowIndex`` (as they | ||
all call ``default_index``): | ||
|
||
- ``df.reset_index(drop=True)``; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't df.reset_index()
(i.e., with drop=False
) also do the same thing? I see no need to include drop=True
in this example.
**Q: Are indices not really powerful?** | ||
|
||
**A:** Yes! And they're also confusing to many users, even experienced developers. | ||
It's fairly common to see pandas code with ``.reset_index`` scattered around every |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For what it's worth, we often end up using .reset_index()
because we have a MultiIndex
, and there is some data in the index that we need to use in a computation. E.g.:
df = pd.DataFrame({"a":[1,2,3,4]}, index=pd.MultiIndex.from_product([[10, 20], [30, 40]], names=["x", "y"]))
df.reset_index().assign(z=lambda df: df["x"] + df["y"]).set_index(df.index.names)
So your comment about "It's fairly common to see pandas code with .reset_index()
scattered around every other line" implies that the reason that people are doing reset_index()
is because people are worried about indices and alignment.
It might be better to say "Often users are using .reset_index
to avoid issues with indices and alignment."
|
||
**A:** This PDEP would only allow it via the constructor, passing | ||
``index=NoRowIndex(len(df))``. A mode could be introduced to toggle | ||
making that the default, but would be out-of-scope for the current PDEP. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is inconsistent with what you wrote above about having the mode.no_row_index
option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how so? even when I mention it above, I make it clear that this PDEP only deals with the NoRowIndex
class, and that this mode would be out-of-scope for this PDEP
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is text in the PDEP that refers to the mode that makes it seem like it is part of the proposal.
For example, lines 297-298 read:
Conversely, to get rid of the index, then (so long as one has enabled the ``mode.no_row_index`` option)
one could simply do ``df.reset_index(drop=True)``.
When I read that, it seems like the mode is part of the proposal. So I think you may need to qualify all references to mode.no_row_index
to describe what would happen if the mode.no_row_index
option is not available.
Or make a decision to include mode.no_row_index
as part of the proposal. (Which I think makes it easier to understand, IMHO)
**Q: Would ``.loc`` stop working?** | ||
|
||
**A:** No. It would only raise if used for label-based selection. Other uses | ||
of ``.loc``, such as ``df.loc[:, col_1]`` or ``df.loc[mask, col_1]``, would |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might want to use boolean_mask
here to make it clear you're referring to that kind of mask
Apologies, and thanks for your patience + comments |
I think I've addressed the comments - this is withdrawn anyway so arguably doesn't need to be perfect @Dr-Irv is it OK to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK to merge, since it is withdrawn
Initial draft of a proposal to have a "no index by default" mode
@pandas-dev/pandas-core @pandas-dev/pandas-triage would appreciate any thoughts / feedback, thanks 🙏