BUG fix _Unstacker int32 limit in dataframe sizes (pandas-dev#26314) #34827

KaonToPion · 2020-06-16T16:52:04Z

closes #26314 (Pivot / unstack on large data frame does not work int32 overflow)
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

I have tested it with:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(low=0, high=1500000, size=(90000, 2)), columns=['a', 'b'])
df.set_index(['a', 'b']).unstack()

I am not sure if I should add it as a test, it requires quite some memory. I am also hesitant about where the test should be located in case it's added.
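
For a rough sense of scale, here is a minimal sketch (not part of the PR) that counts the unique values in each column of the reproduction above and compares the implied rows-by-columns cell count with the int32 limit. The seed, the variable names, and the 8-bytes-per-cell figure are assumptions made for illustration only; nothing here allocates the unstacked frame.

```python
import numpy as np

# Assumption: a seeded generator stands in for np.random.randint so the
# numbers are reproducible; the shape and value range match the snippet above.
rng = np.random.default_rng(0)
values = rng.integers(low=0, high=1_500_000, size=(90_000, 2))

# Unstacking the inner index level yields one row per unique 'a'
# and one column per unique 'b'.
n_a = int(np.unique(values[:, 0]).size)
n_b = int(np.unique(values[:, 1]).size)

# Python ints do not overflow, so this product is exact.
num_cells = n_a * n_b
int32_max = int(np.iinfo(np.int32).max)

# Assumption: 8 bytes per cell (e.g. float64), purely to gauge memory.
approx_gib = num_cells * 8 / 2**30

print(f"unique a: {n_a}, unique b: {n_b}")
print(f"cells: {num_cells}, exceeds int32: {num_cells > int32_max}")
print(f"approx. memory at 8 bytes per cell: {approx_gib:.1f} GiB")
```

Since at most 90000 unique values exist per column, the cell count is bounded by 90000 * 90000 = 8,100,000,000, which is already well above the int32 maximum of 2,147,483,647.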

jreback

Please always add a test, and make sure it fails first before adding a patch.

jreback · 2020-06-16T17:20:11Z

I am not sure if I should add it as a test, it requires quite some memory. I am also hesitant about where the test should be located in case it's added.

How long does this take, and how much memory does it need?

you can mark tests as well.

jreback · 2020-07-17T11:01:43Z

Can you merge master and see if you can get this passing?

simonjayhawkins · 2020-09-15T08:41:35Z

@KaonToPion closing as stale. Please let me know if you want to continue and the PR will be reopened.

KaonToPion · 2020-10-11T18:22:58Z

@simonjayhawkins could you re-open it? I can pick it up again.

KaonToPion · 2020-10-12T16:07:34Z

I am having trouble with the fix. This is the code:

# Bug fix GH 20601
# If the data frame is too big, the number of unique index combinations
# will cause an int32 overflow on Windows environments.
# We want to check and raise an error before this happens.

num_rows = np.max([index_level.size for index_level in self.new_index_levels])
num_columns = self.removed_level.size

# GH20601: This forces an overflow if the number of cells is too high.

num_cells = np.multiply(num_rows, num_columns, dtype=np.int32)
if num_rows > 0 and num_columns > 0 and num_cells <= 0:
    raise ValueError("Unstacked DataFrame is too big, causing int32 overflow")

The issue right now is that "raise ValueError" causes trouble on all the other systems, so there are two ways out of this:

1. Check with int64:
   num_cells = np.multiply(num_rows, num_columns, dtype=np.int64)
   This solves the issue, but I am not sure how to test it. No matter how much RAM I have (I even tried a machine with 500 GB of RAM), I hit a MemoryError well before I can reach the ValueError raised from num_cells.
2. Delete num_cells, num_rows, num_columns and the raise altogether, and accept the MemoryError that a lot of people are going to hit when calling df.unstack() because their data expands when unstacked.

My first commit tried the first option, but after trying to test it I think the second option is probably the best one.

Any opinions on this, @jreback @simonjayhawkins?
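
To make the difference between the int32 check and an exact check concrete, here is a minimal sketch. The helper names `wrapped_int32_product` and `exceeds_int32` are hypothetical and are not pandas functions; one-element int32 arrays are used so that the multiplication wraps around silently, which is an assumption about how best to show the effect rather than what `_Unstacker` does.

```python
import numpy as np

INT32_MAX = int(np.iinfo(np.int32).max)  # 2_147_483_647


def wrapped_int32_product(num_rows, num_columns):
    """Multiply in int32 arithmetic and let the result wrap around."""
    rows = np.array([num_rows], dtype=np.int32)
    cols = np.array([num_columns], dtype=np.int32)
    return int((rows * cols)[0])


def exceeds_int32(num_rows, num_columns):
    """Compare the exact product (Python ints never overflow) with the limit."""
    return int(num_rows) * int(num_columns) > INT32_MAX


# 90000 * 90000 wraps to a negative number, so a "<= 0" test catches it.
print(wrapped_int32_product(90_000, 90_000))  # -489934592

# 65536 * 65537 = 2**32 + 65536 wraps to a positive number, so a
# "<= 0" test would miss it even though the true product is too large.
print(wrapped_int32_product(65_536, 65_537))  # 65536
print(exceeds_int32(65_536, 65_537))          # True
```

The second case suggests that checking the sign of a wrapped int32 product is not sufficient on its own; comparing an exact (int64 or Python int) product against the int32 maximum avoids that gap. Whether such a check can be reached before a MemoryError in practice is the open question raised above.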

BUG fix _Unstacker int32 limit in dataframe sizes (#26314)

7f430d4

jreback requested changes Jun 16, 2020

jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Jun 16, 2020

simonjayhawkins mentioned this pull request Jul 24, 2020

BUG: ValueError: Cannot convert non-finite values (NA or inf) to integer only when DF exceed certain size #35227

Closed

3 tasks

simonjayhawkins mentioned this pull request Sep 15, 2020

CI: Add stale PR action #36336

Merged

simonjayhawkins closed this Sep 15, 2020

KaonToPion deleted the bug-fix-unstack branch October 11, 2020 18:18

KaonToPion restored the bug-fix-unstack branch October 11, 2020 18:18

KaonToPion deleted the bug-fix-unstack branch October 11, 2020 18:18

KaonToPion restored the bug-fix-unstack branch October 11, 2020 18:18

treuherz mentioned this pull request Oct 13, 2020

VIS: Accept xlabel and ylabel for scatter and hexbin plots #37102

Merged

5 tasks
