PERF: Inefficient data representation when building dataframe from 2D NumPy array #50756

Closed
datapythonista opened this issue Jan 15, 2023 · 6 comments · Fixed by #57459
Labels
Performance Memory or execution speed performance

Comments

@datapythonista
Member

By default, NumPy stores a 2D array in row-major (C) order, so the values of each row sit together in memory. In this example:

numpy.array([[1, 2, 3, 4],
             [5, 6, 7, 8]])

Operating (e.g. adding up) over 1, 2, 3, 4 will be fast, since the values are adjacent in the buffer and CPU caches will be used efficiently. If we operate at column level, over 1, 5, caches won't be used efficiently since the values are far apart in memory, and performance will be significantly worse.
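As a small illustration of the layout described above, the strides and contiguity flags of the example array can be inspected directly. This is only a sketch: the explicit int64 dtype is an assumption made so that the byte counts are predictable.

```python
import numpy

# Same example array as above; int64 is assumed so each value takes 8 bytes.
arr = numpy.array([[1, 2, 3, 4],
                   [5, 6, 7, 8]], dtype=numpy.int64)

# Row-major (C) layout: the next value in a row is 8 bytes away,
# the next value in a column is a full row (4 * 8 = 32 bytes) away.
print(arr.flags["C_CONTIGUOUS"])  # True
print(arr.strides)                # (32, 8)

# Order of the values in the underlying buffer: rows first.
flat = arr.ravel(order="K").tolist()

# A column-major (Fortran) copy keeps the values of each column together.
col_major = numpy.asfortranarray(arr)
print(col_major.strides)          # (8, 16)
```

With only eight values the difference is invisible, but the same stride pattern is what separates consecutive column values in a large array.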

The problem is that if I create a dataframe from a 2D NumPy array (e.g. pandas.DataFrame(2d_numpy_array_with_default_strides)), pandas will build the dataframe with the same shape and memory layout, meaning that every column in the array becomes a column in pandas, and operating over the columns of the dataframe will be inefficient.

Given a dataframe with 2 columns and 10M rows, adding up the column values is about an order of magnitude slower with the default layout:

>>> import numpy
>>> import pandas

>>> df_default = pandas.DataFrame(numpy.random.rand(10_000_000, 2))
>>> df_efficient = pandas.DataFrame(numpy.random.rand(2, 10_000_000).T)

>>> %timeit df_default.sum()
340 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit df_efficient.sum()
23.4 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

I'm not sure what a good solution is. I don't think we want to change the behavior to load NumPy arrays transposed, and I'm not sure that rewriting the NumPy array to fix the strides is a great option either. Personally, I'd check the strides of the provided array and, if they are inefficient, show the user a warning with a descriptive message explaining the problem and how to fix it.
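A rough sketch of what such a check could look like is below. The helper name `warn_if_row_major` and the wording of the message are hypothetical, and this is not how pandas implements anything; it only shows that the layout can be detected from the array's contiguity flags.

```python
import warnings

import numpy


def warn_if_row_major(values):
    """Hypothetical helper (not part of pandas): warn when a 2D array
    stores rows contiguously, so column-wise operations are strided."""
    is_row_major = (
        values.ndim == 2
        and values.shape[0] > 1
        and values.shape[1] > 1
        and values.flags.c_contiguous
        and not values.flags.f_contiguous
    )
    if is_row_major:
        warnings.warn(
            "The array is stored in row-major order, so operations over "
            "columns may be slow. Consider numpy.asfortranarray(values).",
            UserWarning,
            stacklevel=2,
        )
    return is_row_major
```

Under these assumptions, `warn_if_row_major(numpy.random.rand(10_000_000, 2))` would warn, while the transposed array used for `df_efficient` above would not.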

@datapythonista datapythonista added the Performance Memory or execution speed performance label Jan 15, 2023
@phofl
Member

phofl commented Jan 15, 2023

I think we might have a chance to fix this with Copy-on-Write. When using CoW we have to (or at least should) copy a given array so that later changes to the array don't propagate to the DataFrame. When doing the copy, we could also change the memory layout.
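At the NumPy level, the idea of copying and changing the layout in one step can be sketched as follows. This is only an illustration with a made-up `user_array`; it says nothing about where or how pandas would perform the copy.

```python
import numpy

# Array as a user would typically pass it: row-major (C order).
user_array = numpy.random.rand(1_000, 2)

# A single copy that both detaches the data from the user's array
# and rewrites it in column-major (Fortran) order.
copied = numpy.array(user_array, order="F", copy=True)

print(copied.flags.f_contiguous)                # True
print(numpy.shares_memory(user_array, copied))  # False

# Each column of the copy is now a contiguous 1D block.
first_column = copied[:, 0]
print(first_column.flags.c_contiguous)          # True
```

Since the copy has to happen anyway to protect the DataFrame from later changes to `user_array`, choosing the output order at that point avoids a second pass over the data.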

@jbrockmendel
Member

> When doing the copy, we could also change the memory layout

xref #44871

@topper-123
Contributor

I'm thinking about the data that is loaded into frames by the read_* functions: e.g. is the data from read_sql laid out optimally? In those functions, the data passed to the DataFrame isn't really coming from the user, but from a pandas function. It could be worth checking whether the data from the read_* functions is laid out optimally.

@phofl
Member

phofl commented Apr 2, 2023

Most of them are not using 2D NumPy arrays, as far as I know.

@jorisvandenbossche
Member

> > When doing the copy, we could also change the memory layout
>
> xref #44871

@jbrockmendel the PR you reference here, as I understand it, tried to avoid changing the memory layout in copy. So the fact that you saw some slowdowns in the arithmetic ASVs seems to support that it is good to force the memory layout when you have to copy anyway?

The speed-ups you reported there seem to come mostly from methods that make a copy (and thus become faster because the copy becomes faster). But now with CoW enabled, many of those methods will not copy anyway, so this benefit isn't relevant anymore (e.g. the rename and concat benchmarks that showed the biggest speedup).

@jorisvandenbossche
Member

I opened a PR to force the memory layout to column major for DataFrame(ndarray) -> #57459
