Optimize memory usage with pandas input. #8927

trivialfis · 2023-03-16T10:16:41Z

The special qid column introduced in "Support sklearn cross validation for ranker." (#8859) is actually quite expensive, because the pandas drop method makes a copy of the data. After some profiling, extracting a dictionary of columns instead saves memory (a reduction of about 6GB for 5-fold CV with istella-s).
We might want to iterate through the columns in C, as we currently do for cuDF.
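A minimal sketch of the "dictionary of columns" idea, for illustration only. The helper name `split_qid` and the toy data are hypothetical and are not taken from the XGBoost code base; the sketch also assumes unique column names. `Series.to_numpy()` does not guarantee a zero-copy result, so any memory saving depends on the dtypes involved.

```python
import pandas as pd


def split_qid(df: pd.DataFrame, qid_name: str = "qid"):
    """Separate the query-id column from the features without DataFrame.drop.

    Hypothetical helper: returns the qid values and a dict that maps each
    remaining column name to its values, one column at a time.
    """
    qid = df[qid_name].to_numpy()
    columns = {
        name: df[name].to_numpy()
        for name in df.columns
        if name != qid_name
    }
    return qid, columns


# Toy data standing in for a ranking dataset.
df = pd.DataFrame(
    {
        "qid": [0, 0, 1, 1],
        "f0": [0.1, 0.2, 0.3, 0.4],
        "f1": [1.0, 2.0, 3.0, 4.0],
    }
)

qid, features = split_qid(df)
print(list(features))
```

Walking the columns one by one avoids building a second DataFrame that holds every feature column, which is what a copying `drop` would do.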

s-banach · 2023-05-11T17:40:51Z

I didn't read the code very carefully, but if you make your qid column a pyarrow-backed pandas Series, can it then be added and dropped without copying the other columns?

trivialfis · 2023-05-12T21:18:11Z

I can't be sure. That's internal to pandas and Arrow, so I have to assume that even if it's true today, it can change in the future.

trivialfis · 2023-06-02T20:01:09Z

Related: pandas-dev/pandas#51463.

phofl · 2023-07-30T16:20:23Z

I'd recommend giving Copy-on-Write a shot if you are concerned about inefficient memory usage. We removed a lot of deep copies and generally made things more efficient (I wouldn't recommend using it with pandas < 2.0, though).

> I didn't read the code very carefully, but if you make your qid column a pyarrow-backed pandas Series, can it then be added and dropped without copying the other columns?

No. That is independent of the dtype.
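A rough sketch of how the Copy-on-Write suggestion could be tried, assuming pandas >= 2.0, where the `mode.copy_on_write` option is available (newer releases may enable this behavior by default and treat the option as deprecated). The column names and data are made up for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: pandas >= 2.0. Opt in to Copy-on-Write for this session.
pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame(
    {
        "qid": [0, 0, 1, 1],
        "f0": [0.1, 0.2, 0.3, 0.4],
        "f1": [1.0, 2.0, 3.0, 4.0],
    }
)

# With Copy-on-Write, drop is expected to defer the copy until one of the
# two objects is modified.
features = df.drop(columns=["qid"])

# Check whether the feature data is still shared at this point. The result
# depends on the pandas version and on whether Copy-on-Write is active.
shared = np.shares_memory(df["f0"].to_numpy(), features["f0"].to_numpy())
print("shares memory before write:", shared)

# Writing to the result must never leak back into the original frame.
features.iloc[0, 0] = 99.0
print(df.loc[0, "f0"], features.iloc[0, 0])
```

The original frame stays unchanged after the write in either mode; what Copy-on-Write changes is when the copy is paid for.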

s-banach · 2023-07-30T16:54:15Z

I thought the point of arrow was that the columns are stored separately, whereas the pandas default is to store columns of the same dtype in a 2d numpy array, which would obviously need to be reallocated if you add or drop a column.

phofl · 2023-07-30T17:04:14Z

You can still use views without reallocating the arrays. The problem is a bit different though:

pandas allows in-place modifications, i.e. mutating objects in place. Most operations therefore perform defensive copies to avoid side effects.
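The distinction between views and defensive copies can be illustrated with plain NumPy. This is only a stand-in for same-dtype columns stored in one 2D array, not a description of pandas internals; the array name `block` is made up.

```python
import numpy as np

# Four rows, three same-dtype columns stored in a single 2D array.
block = np.arange(12, dtype=np.float64).reshape(4, 3)

# "Dropping" the first column with basic slicing yields a view:
# nothing is reallocated.
without_first = block[:, 1:]
assert np.shares_memory(block, without_first)

# The price of a view: an in-place write is visible through the original.
without_first[0, 0] = -1.0
assert block[0, 1] == -1.0

# A defensive copy avoids that side effect but duplicates the data.
defensive = block[:, 1:].copy()
defensive[0, 0] = 100.0
assert block[0, 1] == -1.0
assert not np.shares_memory(block, defensive)
```

So storing columns separately is not what decides whether data gets copied; the copy is a guard against in-place mutation leaking between objects.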

trivialfis added feature-request performance labels Mar 16, 2023
