-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Disallow non-increasing multi-index header arguments #47314
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -198,6 +198,8 @@ def validate_header_arg(header: object) -> None: | |||
raise ValueError("header must be integer or list of integers") | |||
if any(i < 0 for i in header): | |||
raise ValueError("cannot specify multi-index header with negative integers") | |||
if list(header) != sorted(set(header)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe any(h1 >= h2 for h1, h2 in zip(header, header[:1])
so this check can short circuit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The short circuit is a good idea. If header is already sorted, then list(header) != sorted(set(header))
is either somewhat faster or comparable in speed to any(h1 >= h2 for h1, h2 in zip(header, header[:1])
, but if header is not sorted than the short circuit makes a big difference. (I got curious and tested up to len(header) = 10**6.)
If you could find if these are already tested, please reference them in this issue. Otherwise would be good to add test for these |
@@ -61,6 +61,20 @@ def test_negative_multi_index_header(all_parsers, header): | |||
parser.read_csv(StringIO(data), header=header) | |||
|
|||
|
|||
@pytest.mark.parametrize("header", [([0, 0]), ([1, 0])]) | |||
def test_nonincreasing_multi_index_header(all_parsers, header): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be missing something, but the [0, 0]
case seems to work
1 2 3 4 5
1 2 3 4 5
0 6 7 8 9 10
1 11 12 13 14 15
Is there a reason why we don't try to make it work instead of raising an error? I get the code complexity argument, but I haven't looked if this would really add complexity.
Also we should probably deprecate first. I would consider this a bug on our side right now, raising is an unexpected change in behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@phofl Thanks for taking a look at my PR.
I suspect that the complexity of handling decreasing header arguments is beyond my current pandas skills. The current code handles all the permutations of named/unnamed Index/MultiIndex on both axes in both the Python and C parsers, and I'd probably break it while trying to consume header rows out-of-sequence. Since the user can .swaplevel today, and since the current code fails silently on decreasing header arguments, I think raising a ValueError is an improvement. A more clever person may still implement decreasing header arguments in the future, being careful to replace header[-1]
with max(header)
in the current code.
On the other hand, since header=[n, n] does work maybe we should keep it. I can't imagine a use case for it, but that could be a lack of creativity on my part. Would you prefer we keep it so we don't have to log a deprecation and follow up later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since passing the same object twice works right now (also gives the correct result), we would have to deprecate before removing. Can't think of an example either, but this does not mean that it is not out there.
Decreasing headers work already for the python engine, so we would only have to fix the c engine. So I am -1 on raising
Edit: Sorry, they don't work, but it is a trivial fix. Replacing [header[-1] + 1]
with [max(header) + 1]
fixes it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The c engine fix is equally simple, so I would propose just fixing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent! I'll close this PR and attempt the fix. Thanks for the investigation/encouragement.
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.I didn't add separate tests for read_excel, read_html, or read_fwf because they all pass their data to TextParser, where the header validation occurs. Let me know if you'd like separate tests for those.