BUG fix _Unstacker int32 limit in dataframe sizes (pandas-dev#26314) #34827

Closed
wants to merge 1 commit into from
Conversation

@KaonToPion KaonToPion commented Jun 16, 2020

I have tested it with:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(low=0, high=1500000, size=(90000, 2)), columns=['a', 'b'])
df.set_index(['a', 'b']).unstack()
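For context, a rough back-of-the-envelope check (a sketch, not part of the PR) suggests why this reproducer crosses the int32 limit. It assumes the unstacked result needs roughly one cell per (distinct `a`, distinct `b`) pair; the helper name `expected_distinct` is hypothetical and only the standard library is used.

```python
# Rough size estimate for the reproducer above (standard library only).
INT32_MAX = 2**31 - 1


def expected_distinct(n_draws: int, n_values: int) -> float:
    """Expected number of distinct values after n_draws uniform draws from n_values options."""
    return n_values * (1.0 - (1.0 - 1.0 / n_values) ** n_draws)


# 90000 rows, each column drawn uniformly from [0, 1500000).
n_rows = expected_distinct(90_000, 1_500_000)
n_cols = expected_distinct(90_000, 1_500_000)
estimated_cells = n_rows * n_cols

print(f"estimated cells: {estimated_cells:.3g}, int32 max: {INT32_MAX}")
print("exceeds int32:", estimated_cells > INT32_MAX)
```

Under these assumptions the estimated cell count is several times larger than the largest signed 32-bit integer.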

I am not sure whether I should add it as a test, since it requires quite a lot of memory. I am also unsure where the test should live if it is added.

Contributor

@jreback jreback left a comment

pls always add a test and make sure it fails first before adding a patch

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Jun 16, 2020
@jreback
Contributor

jreback commented Jun 16, 2020

I am not sure whether I should add it as a test, since it requires quite a lot of memory. I am also unsure where the test should live if it is added.

How long does this take, and how much memory does it need?

You can also mark tests.
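As a sketch of what marking could look like: pytest's built-in `skipif` marker lets a memory-hungry test stay in the suite but run only on request. The environment variable `RUN_HIGH_MEMORY` and the test name are hypothetical, and the expected `ValueError` assumes the overflow check discussed in this PR is in place.

```python
import os

import pytest

# Hypothetical opt-in switch; the variable name is an assumption for this sketch.
RUN_HIGH_MEMORY = os.environ.get("RUN_HIGH_MEMORY") == "1"


@pytest.mark.skipif(not RUN_HIGH_MEMORY, reason="memory-hungry; set RUN_HIGH_MEMORY=1 to run")
def test_unstack_too_big_raises():
    # Imported lazily so the module can be collected without numpy/pandas installed.
    np = pytest.importorskip("numpy")
    pd = pytest.importorskip("pandas")

    df = pd.DataFrame(
        np.random.randint(low=0, high=1500000, size=(90000, 2)),
        columns=["a", "b"],
    )
    # Assumes the size check raises before any large allocation happens.
    with pytest.raises(ValueError, match="int32 overflow"):
        df.set_index(["a", "b"]).unstack()
```

A project may instead register its own markers for slow or high-memory tests; which ones exist here is not stated in this thread.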

@jreback
Contributor

jreback commented Jul 17, 2020

can you merge master and see if you can get the checks passing

@simonjayhawkins
Member

@KaonToPion closing as stale. please LMK if you want to continue and the PR will be reopened.

@KaonToPion KaonToPion deleted the bug-fix-unstack branch October 11, 2020 18:18
@KaonToPion KaonToPion restored the bug-fix-unstack branch October 11, 2020 18:18
@KaonToPion
Author

@simonjayhawkins could you re-open it? I can pick it up again.

@KaonToPion
Author

I am having trouble with the fix. This is the code:


# Bug fix GH 20601
# If the DataFrame is too big, the number of unique index combinations
# will cause an int32 overflow in Windows environments.
# We want to check and raise an error before this happens.

num_rows = np.max([index_level.size for index_level in self.new_index_levels])
num_columns = self.removed_level.size

# GH20601: This forces an overflow if the number of cells is too high.

num_cells = np.multiply(num_rows, num_columns, dtype=np.int32)
if num_rows > 0 and num_columns > 0 and num_cells <= 0:
    raise ValueError("Unstacked DataFrame is too big, causing int32 overflow")
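To make the overflow concrete, here is a small standard-library emulation of signed 32-bit wraparound (a sketch; `wrap_int32` and `check_fires` are hypothetical helpers, not NumPy or pandas functions). It also illustrates one caveat of the `num_cells <= 0` condition: a wrapped product can come out positive, in which case the check would not fire.

```python
def wrap_int32(value: int) -> int:
    """Reduce an arbitrary Python int to the value a signed 32-bit integer would hold."""
    return (value + 2**31) % 2**32 - 2**31


def check_fires(num_rows: int, num_columns: int) -> bool:
    """Mirror the condition from the snippet above, applied to the wrapped product."""
    num_cells = wrap_int32(num_rows * num_columns)
    return num_rows > 0 and num_columns > 0 and num_cells <= 0


# 90_000 * 90_000 = 8.1e9 wraps to a negative number, so the check fires.
print(wrap_int32(90_000 * 90_000) < 0, check_fires(90_000, 90_000))

# 65_536 * 65_537 = 2**32 + 65_536 wraps to +65_536, so the check stays silent.
print(wrap_int32(65_536 * 65_537), check_fires(65_536, 65_537))
```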

The issue right now is that the "raise ValueError" causes problems on all the other systems, so there are two ways out of this:

  1. Checking with int64
    num_cells = np.multiply(num_rows, num_columns, dtype=np.int64)
    This solves the issue, but I am not sure how to test it: no matter how much RAM I have (I even tried a machine with 500 GB of RAM), I hit a MemoryError long before I can trigger the ValueError through num_cells.

  2. Deleting num_cells, num_rows, num_columns and the raise altogether, and accepting the MemoryError that a lot of people will hit when calling df.unstack() because the unstacked result is too large.

My first commit tried the first option, but after trying to test it I think the second option is probably the better one.
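One possible way around the testing problem with option 1 (a sketch under assumptions, not the approach taken in this PR) is to move the size check into a small pure function that takes only the two sizes. It can then be tested with plain integers and no large allocation. The name `would_overflow_int32` is hypothetical; Python integers do not overflow, so the comparison itself plays the role of the int64 check.

```python
INT32_MAX = 2**31 - 1


def would_overflow_int32(num_rows: int, num_columns: int) -> bool:
    """Return True if num_rows * num_columns does not fit in a signed 32-bit integer."""
    # int() converts NumPy integer scalars to Python ints, which never wrap.
    return int(num_rows) * int(num_columns) > INT32_MAX


def test_would_overflow_int32():
    assert not would_overflow_int32(0, 10)
    assert not would_overflow_int32(46_340, 46_340)  # 2_147_395_600 still fits
    assert would_overflow_int32(46_341, 46_341)      # 2_147_488_281 does not
    assert would_overflow_int32(65_536, 65_537)      # would wrap to a positive int32


test_would_overflow_int32()
```

The constructor could then call such a helper with the two sizes and raise when it returns True, so the error path is exercised without ever building a large frame.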

Any opinions on this, @jreback @simonjayhawkins?

Labels
Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Development

Successfully merging this pull request may close these issues.

Pivot / unstack on large data frame does not work int32 overflow
3 participants