PERF: Inefficient data representation when building dataframe from NumPy array using copy=True #52438

Open
topper-123 opened this issue Apr 5, 2023 · 0 comments
Labels
Performance Memory or execution speed performance

Contributor

topper-123 commented Apr 5, 2023

>>> import numpy as np
>>> import pandas as pd
>>> data = np.random.rand(10_000_000, 2)
>>> df = pd.DataFrame(data, copy=True)  # frame built by the constructor with copy=True
>>> %timeit df.sum()
164 ms ± 1.12 ms per loop
>>> df2 = df.copy()  # same data, copied via DataFrame.copy()
>>> %timeit df2.sum()
13.2 ms ± 48.3 µs per loop

I'd expect pandas to lay out the ndarray internally the same way in both cases, so that df.sum() and df2.sum() perform alike. IMO the problem is in how the constructor makes its copy.

xref #50756, where the similar issue with pd.DataFrame(data, copy=None) is discussed.
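One way to check whether the two frames really differ in memory layout is to look at the contiguity flags of the array each one exposes. The sketch below is only an illustration: the `layout` helper is a hypothetical name, it assumes that `to_numpy()` on a single-dtype frame reflects the internal storage order (usually true, but not guaranteed), and the result will depend on the pandas version, so no expected output is shown.

```python
import numpy as np
import pandas as pd


def layout(df: pd.DataFrame) -> str:
    """Report the memory order of the 2D array returned by to_numpy().

    Hypothetical helper for illustration; assumes to_numpy() mirrors the
    frame's internal storage for a single-dtype frame.
    """
    arr = df.to_numpy()
    c_order = bool(arr.flags["C_CONTIGUOUS"])
    f_order = bool(arr.flags["F_CONTIGUOUS"])
    if c_order and f_order:
        return "both"
    if c_order:
        return "C"
    if f_order:
        return "F"
    return "neither"


# Smaller array than in the report, so the check runs quickly.
data = np.random.rand(1_000, 2)
df = pd.DataFrame(data, copy=True)  # copy made by the constructor
df2 = df.copy()                     # copy made by DataFrame.copy()

layouts = {"constructor copy=True": layout(df), "df.copy()": layout(df2)}
print(layouts)
```

If the two entries differ, the column-wise reduction in `df.sum()` is walking memory with a different stride in each frame, which would be consistent with the timing gap reported above.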

@topper-123 topper-123 changed the title PERF: Inefficient data representation when building dataframe from 2D NumPy array using copy=True PERF: Inefficient data representation when building dataframe from NumPy array using copy=True Apr 5, 2023
@jbrockmendel jbrockmendel added the Performance Memory or execution speed performance label May 3, 2023