PERF: Inefficient data representation when building dataframe from 2D NumPy array #50756
Comments
I think we might have a chance to fix this with Copy-on-Write. With CoW enabled we have to (or at least should) copy a given array anyway, to avoid mutations of the array propagating to the DataFrame. When making that copy, we could also change the memory layout.
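A minimal sketch of that idea in plain NumPy (not pandas internals): since CoW already requires copying the user's array, the copy could switch to column-major order at the same time, e.g. via `numpy.asfortranarray`:

```python
import numpy as np

# Sketch of the idea: CoW needs to copy the user's array anyway,
# so the copy might as well change the memory layout.
arr = np.arange(8).reshape(2, 4)       # C-contiguous: rows are adjacent

copied = np.asfortranarray(arr)        # copies AND switches to column-major
print(arr.flags["C_CONTIGUOUS"])       # True
print(copied.flags["F_CONTIGUOUS"])    # True: each column is now contiguous
print(np.shares_memory(arr, copied))   # False: it is a real copy
```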
xref #44871
I'm thinking about the data that is loaded into frames from the …
Most of them are not using 2D NumPy arrays, as far as I know.
@jbrockmendel the PR you reference here, as I understand it, tried to avoid changing the memory layout in … The speed-ups you reported there seem to come mostly from methods that make a copy (and thus get faster because the copy itself becomes faster). But now with CoW enabled, many of those methods won't copy at all, so that benefit is no longer relevant (e.g. the …
I opened a PR to force the memory layout to column major for …
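For reference, a user-side sketch of the same idea with today's public API (assuming, as discussed above, that a column-major input yields contiguous columns inside the frame):

```python
import numpy as np
import pandas as pd

arr = np.random.rand(10_000_000, 2)                  # row-major by default

df_default = pd.DataFrame(arr)                       # column values strided in memory
df_colmajor = pd.DataFrame(np.asfortranarray(arr))   # column values contiguous
```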
NumPy data representation, by default, keeps rows together in memory. In this example:

```python
numpy.array([[1, 2, 3, 4],
             [5, 6, 7, 8]])
```

Operating (e.g. adding up) over the row 1, 2, 3, 4 will be fast, since the values are together in the buffer and CPU caches will be used efficiently. If we operate at column level, over 1, 5, caches won't be used efficiently, since the values are far apart in memory, and performance will be significantly worse.

The problem is that if I create a dataframe from a 2D NumPy array (e.g. `pandas.DataFrame(2d_numpy_array_with_default_strides)`), pandas will build the dataframe with the same shape, meaning that every column in the array will be a column in pandas, and operating over the columns of the dataframe will be inefficient. Given a dataframe with 2 columns and 10M rows, adding up the column values is one order of magnitude slower:
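The original benchmark isn't preserved in this copy of the issue; a minimal sketch of the comparison described above (illustrative names, timings vary by machine):

```python
import timeit
import numpy as np
import pandas as pd

c_order = np.random.rand(10_000_000, 2)    # rows contiguous (NumPy default)
f_order = np.asfortranarray(c_order)       # columns contiguous

df_c = pd.DataFrame(c_order)
df_f = pd.DataFrame(f_order)

# Summing one column walks all 10M rows: with the C-ordered input the
# column's values are spaced apart in memory, with the F-ordered input
# they are adjacent.
print(timeit.timeit(lambda: df_c[0].sum(), number=10))
print(timeit.timeit(lambda: df_f[0].sum(), number=10))
```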
Not sure what a good solution is. I don't think we want to change the behavior to load NumPy arrays transposed, and I'm not sure rewriting the NumPy array to fix the strides is a great option either. Personally, I'd check the strides of the provided array and, if they are inefficient, show the user a warning with a descriptive message explaining the problem and how to fix it.
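A rough sketch of what such a stride check could look like (`warn_if_row_major` is a hypothetical helper, not an existing pandas function):

```python
import warnings
import numpy as np
from pandas.errors import PerformanceWarning

def warn_if_row_major(arr: np.ndarray) -> None:
    # Hypothetical helper: the columns of a 2D row-major array are not
    # contiguous, so column-wise operations on the resulting frame
    # will be cache-inefficient.
    if arr.ndim == 2 and min(arr.shape) > 1 and not arr.flags["F_CONTIGUOUS"]:
        warnings.warn(
            "Building a DataFrame from a row-major (C-ordered) array; "
            "column operations may be slow. Consider passing "
            "np.asfortranarray(arr) instead.",
            PerformanceWarning,
            stacklevel=2,
        )

warn_if_row_major(np.ones((3, 3)))  # emits the warning
```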