Optimize memory usage with pandas input. #8927
Comments
I didn't read the code very carefully, but if you make your |
I can't be sure. That's internal to pandas and Arrow, so I have to assume that even if it's true today, it can change in the future.
Related: pandas-dev/pandas#51463.
I'd recommend giving Copy-on-Write a shot if you are concerned about inefficient memory usage. We removed a lot of deep copies and generally made things more efficient (I wouldn't recommend using it with pandas < 2.0 though).
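For anyone who wants to try this, here is a minimal sketch of opting in to Copy-on-Write through the `mode.copy_on_write` option, assuming pandas 2.0 or later. The frame and its column names (`qid`, `f0`, `f1`) are made up for illustration and are not taken from the XGBoost code. Whether a given call actually avoids a physical copy is up to pandas internals, so the sketch only demonstrates the semantics: the result of `drop` can be modified without touching the original frame.

```python
import warnings

import pandas as pd

# Opt in to Copy-on-Write (exposed as an option from pandas 2.0).
# Warnings are silenced in case a newer pandas reports the option as deprecated.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    pd.set_option("mode.copy_on_write", True)

# Hypothetical ranking data: a query id column plus two feature columns.
df = pd.DataFrame(
    {"qid": [0, 0, 1], "f0": [0.1, 0.2, 0.3], "f1": [1.0, 2.0, 3.0]}
)

# Under Copy-on-Write, pandas may defer the data copy here until one of
# the two frames is written to.
features = df.drop(columns=["qid"])

# Writing to the result never leaks back into the original frame.
features.loc[0, "f0"] = 9.9
print(df.loc[0, "f0"])        # still the original value
print(features.loc[0, "f0"])  # the modified value
```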
No. That is independent of the dtype.
I thought the point of Arrow was that the columns are stored separately, whereas the pandas default is to store columns of the same dtype in a 2D NumPy array, which would obviously need to be reallocated if you add or drop a column.
You can still use views without reallocating the arrays. The problem is a bit different though: pandas allows in-place modification, e.g. mutating objects in place. Most operations perform defensive copies to avoid side effects.
The `qid` column introduced in "Support sklearn cross validation for ranker." #8859 is actually quite expensive, as the pandas `drop` method makes a data copy. After some profiling, extracting a dictionary of columns actually saves memory (a reduction of about 6GB for 5-fold CV with istella-s).
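To make the two approaches concrete, here is a rough sketch of splitting `qid` from the features with `drop` versus extracting a dictionary of columns. The helper names `split_qid_with_drop` and `split_qid_as_dict` and the feature names are hypothetical, and this is not the actual XGBoost implementation. Whether `to_numpy()` returns a view or a copy depends on the dtype and the pandas version, so the memory saving should be confirmed by profiling.

```python
import numpy as np
import pandas as pd


def split_qid_with_drop(df: pd.DataFrame):
    """Separate qid using DataFrame.drop, which builds a new DataFrame."""
    qid = df["qid"].to_numpy()
    features = df.drop(columns=["qid"])
    return features, qid


def split_qid_as_dict(df: pd.DataFrame):
    """Separate qid by collecting the remaining columns into a dict.

    No new DataFrame is created; each entry is whatever array pandas
    hands back for that column.
    """
    qid = df["qid"].to_numpy()
    columns = {
        name: df[name].to_numpy() for name in df.columns if name != "qid"
    }
    return columns, qid


# Hypothetical ranking data for illustration only.
df = pd.DataFrame(
    {"qid": [0, 0, 1], "f0": [0.1, 0.2, 0.3], "f1": [1.0, 2.0, 3.0]}
)

columns, qid = split_qid_as_dict(df)
print(list(columns))  # feature names without qid
print(qid)
```

The dictionary variant leaves it to the consumer to assemble the columns, which is what makes it possible to skip the intermediate frame.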