Optimize memory usage with pandas input. #8927

Open
trivialfis opened this issue Mar 16, 2023 · 6 comments

Comments

@trivialfis
Member

trivialfis commented Mar 16, 2023

  • The special qid column introduced in "Support sklearn cross validation for ranker" (#8859) is quite expensive, because the pandas drop method makes a copy of the data. Profiling shows that extracting a dictionary of columns instead saves memory (a reduction of about 6GB for 5-fold CV with istella-s).
  • We might want to iterate through the columns in C, as we currently do for cuDF.
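
For readers following along, here is a minimal sketch of the two approaches from the first bullet. The frame, the column names `f0`/`f1`, and the sizes are hypothetical and are not taken from the XGBoost code. Whether either path actually copies depends on the pandas version and on settings, so the sketch only shows the shape of each approach rather than measuring memory.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two feature columns plus the special "qid" column.
df = pd.DataFrame(
    {
        "f0": np.arange(6, dtype=np.float32),
        "f1": np.arange(6, dtype=np.float32) * 2,
        "qid": np.array([0, 0, 1, 1, 2, 2], dtype=np.int32),
    }
)

# Approach 1: drop the qid column, which builds a new DataFrame.
features_df = df.drop(columns=["qid"])

# Approach 2: pull the columns out one by one into a plain dict,
# leaving the original frame untouched.
columns = {name: df[name] for name in df.columns if name != "qid"}
qid = df["qid"]

print(features_df.shape, list(columns), qid.tolist())
```

The dict keeps one Series per feature, so downstream code can walk the columns individually instead of asking pandas for a second frame.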
@s-banach

I didn't read the code very carefully, but if you make your qid column a pyarrow-backed pandas Series, can it then be added and dropped without copying the other columns?

@trivialfis
Member Author

I can't be sure. That's internal to pandas and Arrow, so I have to assume that even if it's true today, it can change in the future.

@trivialfis
Member Author

Related: pandas-dev/pandas#51463.

@phofl

phofl commented Jul 30, 2023

I'd recommend giving Copy-on-Write a shot if you are concerned about inefficient memory usage. We removed a lot of deep copies and generally made things more efficient (I wouldn't recommend using it with pandas < 2.0, though).

> I didn't read the code very carefully, but if you make your qid column a pyarrow-backed pandas Series, can it then be added and dropped without copying the other columns?

No. That is independent of the dtype.
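
A rough way to try the Copy-on-Write suggestion is sketched below. It assumes a pandas version that exposes the `mode.copy_on_write` option (newer releases may treat the option differently), and the helper `drop_shares_memory` and the toy columns are hypothetical. The result is printed rather than asserted, since it can vary across versions.

```python
import numpy as np
import pandas as pd


def drop_shares_memory(df: pd.DataFrame, column: str, probe: str) -> bool:
    """Return True if `probe` in the dropped frame overlaps df's buffer."""
    dropped = df.drop(columns=[column])
    return bool(
        np.shares_memory(df[probe].to_numpy(), dropped[probe].to_numpy())
    )


# Build and drop inside the context so both steps see the same setting.
with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame(
        {
            "f0": np.arange(6, dtype=np.float64),
            "qid": np.array([0, 0, 1, 1, 2, 2], dtype=np.int64),
        }
    )
    shared_with_cow = drop_shares_memory(df, "qid", "f0")

print("drop() result shares memory under Copy-on-Write:", shared_with_cow)
```

Running the same helper without the option enabled gives a quick comparison on a given installation.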

@s-banach

I thought the point of Arrow was that the columns are stored separately, whereas the pandas default is to store columns of the same dtype in a 2D NumPy array, which would obviously need to be reallocated if you add or drop a column.

@phofl

phofl commented Jul 30, 2023

You can still use views without reallocating the arrays. The problem is a bit different though:

pandas allows in-place modification, i.e. mutating objects in place, so most operations perform defensive copies to avoid side effects.
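
The trade-off can be illustrated with plain NumPy, used here only as a stand-in for the arrays that back a DataFrame (the variable names are hypothetical). A view avoids reallocation but lets writes leak into the parent, which is the side effect a defensive copy prevents.

```python
import numpy as np

base = np.zeros(4)

# A slice is a view: no new buffer, but writes reach the parent array.
view = base[:2]
view[0] = 1.0

# A defensive copy costs memory, but later writes stay isolated.
safe = base[:2].copy()
safe[1] = 5.0

print(base)
print(np.shares_memory(base, view), np.shares_memory(base, safe))
```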

3 participants