BUG: Disallow non-increasing multi-index header arguments #47314

ahawryluk · 2022-06-11T21:18:36Z

closes BUG: pandas accepts non-increasing MultiIndex header arguments #47011
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
NA - Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

I didn't add separate tests for read_excel, read_html, or read_fwf because they all pass their data to TextParser, where the header validation occurs. Let me know if you'd like separate tests for those.

mroeschke · 2022-06-13T18:50:26Z

pandas/io/common.py

@@ -198,6 +198,8 @@ def validate_header_arg(header: object) -> None:
            raise ValueError("header must be integer or list of integers")
        if any(i < 0 for i in header):
            raise ValueError("cannot specify multi-index header with negative integers")
+        if list(header) != sorted(set(header)):


Maybe any(h1 >= h2 for h1, h2 in zip(header, header[1:])) so this check can short-circuit

The short circuit is a good idea. If header is already sorted, then list(header) != sorted(set(header)) is comparable to or somewhat faster than any(h1 >= h2 for h1, h2 in zip(header, header[1:])), but if header is not sorted, then the short circuit makes a big difference. (I got curious and tested up to len(header) = 10**6.)
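The two candidate checks discussed above can be compared in a minimal sketch. The function names below are hypothetical and not part of pandas; the pairwise version assumes header supports slicing (a list or tuple) and compares each element with its successor via header[1:].

```python
from typing import Sequence


def nonincreasing_via_sorted(header: Sequence[int]) -> bool:
    """Check from the diff: builds a set and a sorted list before comparing."""
    return list(header) != sorted(set(header))


def nonincreasing_via_pairs(header: Sequence[int]) -> bool:
    """Pairwise check: stops at the first adjacent pair that is out of order."""
    return any(h1 >= h2 for h1, h2 in zip(header, header[1:]))


# Both checks should agree on whether a header list is strictly increasing.
for candidate in ([0, 1, 2], [0, 0], [1, 0], [0, 2, 1]):
    assert nonincreasing_via_sorted(candidate) == nonincreasing_via_pairs(candidate)
    print(candidate, nonincreasing_via_pairs(candidate))
```

Both functions flag exactly the lists that are not strictly increasing; they differ only in how much work they do before an out-of-order pair is found.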

mroeschke · 2022-06-13T18:51:14Z

I didn't add separate tests for read_excel, read_html, or read_fwf because they all pass their data to TextParser, where the header validation occurs. Let me know if you'd like separate tests for those.

If you can find out whether these are already tested, please reference the tests in this issue. Otherwise, it would be good to add tests for them.

phofl · 2022-06-14T07:15:33Z

pandas/tests/io/parser/test_header.py

@@ -61,6 +61,20 @@ def test_negative_multi_index_header(all_parsers, header):
        parser.read_csv(StringIO(data), header=header)


+@pytest.mark.parametrize("header", [([0, 0]), ([1, 0])])
+def test_nonincreasing_multi_index_header(all_parsers, header):


I might be missing something, but the [0, 0] case seems to work

        1   2   3   4   5
        1   2   3   4   5
    0   6   7   8   9  10
    1  11  12  13  14  15

Is there a reason why we don't try to make it work instead of raising an error? I get the code complexity argument, but I haven't looked at whether this would really add complexity.

Also, we should probably deprecate first. I would consider this a bug on our side right now; raising is an unexpected change in behavior.
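A deprecation-first approach, as suggested above, could emit a warning for the [0, 0] case that currently works rather than raising immediately. The following is only a sketch: validate_header_sketch and its message are hypothetical and do not reflect the actual pandas validation code.

```python
import warnings
from typing import Sequence


def validate_header_sketch(header: Sequence[int]) -> None:
    """Hypothetical validator: warn on repeated rows instead of raising."""
    if len(set(header)) != len(header):
        warnings.warn(
            "Repeated rows in a multi-index header are deprecated "
            "(illustrative message only).",
            FutureWarning,
            stacklevel=2,
        )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    validate_header_sketch([0, 0])  # warns, does not raise
    validate_header_sketch([0, 1])  # silent

print([str(w.message) for w in caught])
```

Existing calls keep working during the deprecation period, and users get advance notice before the behavior is removed.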

@phofl Thanks for taking a look at my PR.

I suspect that the complexity of handling decreasing header arguments is beyond my current pandas skills. The current code handles all the permutations of named/unnamed Index/MultiIndex on both axes in both the Python and C parsers, and I'd probably break it while trying to consume header rows out-of-sequence. Since the user can .swaplevel today, and since the current code fails silently on decreasing header arguments, I think raising a ValueError is an improvement. A more clever person may still implement decreasing header arguments in the future, being careful to replace header[-1] with max(header) in the current code.

On the other hand, since header=[n, n] does work, maybe we should keep it. I can't imagine a use case for it, but that could be a lack of creativity on my part. Would you prefer we keep it so we don't have to log a deprecation and follow up later?

Since passing the same object twice works right now (and also gives the correct result), we would have to deprecate before removing it. I can't think of an example either, but that does not mean there isn't one out there.

Decreasing headers already work for the Python engine, so we would only have to fix the C engine. So I am -1 on raising.

Edit: Sorry, they don't work, but it is a trivial fix. Replacing [header[-1] + 1] with [max(header) + 1] fixes it.

The C engine fix is equally simple, so I would propose just fixing it.
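The effect of the max(header) change mentioned above can be illustrated with a simplified sketch. The helpers below are hypothetical and do not reproduce the parser internals; they assume only that the data starts on the row after the last header row consumed.

```python
from typing import Sequence


def first_data_row_last(header: Sequence[int]) -> int:
    """Uses the final entry, which is only correct for increasing headers."""
    return header[-1] + 1


def first_data_row_max(header: Sequence[int]) -> int:
    """Uses the largest entry, so the order of the header rows does not matter."""
    return max(header) + 1


print(first_data_row_last([0, 1]), first_data_row_max([0, 1]))  # 2 2
print(first_data_row_last([1, 0]), first_data_row_max([1, 0]))  # 1 2
```

For header=[1, 0], the last-entry version points at row 1, which is itself a header row, while the max-based version skips past both header rows.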

Excellent! I'll close this PR and attempt the fix. Thanks for the investigation/encouragement.

ahawryluk added 3 commits June 9, 2022 22:05

Fix bug, add test

06dd9a8

Whats new

4a06cc5

Merge branch 'main' into bug_47011

3ab77e7

mroeschke reviewed Jun 13, 2022

phofl reviewed Jun 14, 2022

datapythonista added Bug MultiIndex labels Jun 16, 2022

ahawryluk closed this Jul 7, 2022

ahawryluk deleted the bug_47011 branch July 8, 2022 03:08


ahawryluk commented Jun 11, 2022

mroeschke Jun 13, 2022

ahawryluk Jun 21, 2022

mroeschke commented Jun 13, 2022

phofl Jun 14, 2022

ahawryluk Jul 5, 2022

phofl Jul 6, 2022 (edited)

phofl Jul 6, 2022

ahawryluk Jul 7, 2022
