Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add param arrow_to_pandas_kwargs to read_dataframe + decrease memory usage #273

Merged

Conversation

theroggy
Copy link
Member

@theroggy theroggy commented Aug 17, 2023

Related to #262 and resolves #241

@theroggy theroggy marked this pull request as draft August 17, 2023 15:13
@theroggy theroggy marked this pull request as ready for review August 17, 2023 15:30
@jorisvandenbossche
Copy link
Member

There are a lot of options in the arrow -> pandas to_pandas conversion that users might want to tweak. So maybe a more general solution is to provide a way to pass through keywords for to_pandas.

For this split_blocks keyword explicitly, I would prefer to follow the default of pyarrow (maybe pyarrow should consider changing that default, though).

The self_destruct should indeed be possible to always call. Although I wonder how much difference that makes, since the table variable goes out of scope anyway when leaving the function, so I would expect that this should get cleaned-up that way.

@theroggy
Copy link
Member Author

theroggy commented Aug 17, 2023

For this split_blocks keyword explicitly, I would prefer to follow the default of pyarrow (maybe pyarrow should consider changing that default, though).

The self_destruct should indeed be possible to always call. Although I wonder how much difference that makes, since the table variable goes out of scope anyway when leaving the function, so I would expect that this should get cleaned-up that way.

I did a quick test to see the impact of this change on the peak memory usage of read_dataframe... I restarted the python process between each test, to avoid any influence of the order I ran the tests.

Obviously it is just one specific case, so not sure if it is comparable for other files, but it is better to have one test than no test :-).

I used the following script/file, as it was the motive to make the change: the script crashed with memory errors on my laptop. I ran the test on a real computer.
The file read is an openstreetmap (.pbf) file of 540 MB with 5.732.130 rows and 26 columns.

import psutil
import pyogrio

url = "https://download.geofabrik.de/europe/germany/baden-wuerttemberg-latest.osm.pbf"
pgdf = pyogrio.read_dataframe(url, use_arrow=True, sql="SELECT * FROM multipolygons")
print(psutil.Process().memory_info())

Results:

# split_blocks=True, self_destruct=True: peak_wset=9.296.629.760 -> 9,3 GB
# split_blocks=False, self_destruct=True: peak_wset=10430320640 -> 10,4 GB
# split_blocks=False, self_destruct=False: peak_wset=12530679808 -> 12,5 GB

@jorisvandenbossche do you see any disadvantages to use split_blocks=True, as it does seem to make a measurable difference in peak memory usage.

@theroggy
Copy link
Member Author

theroggy commented Aug 17, 2023

There are a lot of options in the arrow -> pandas to_pandas conversion that users might want to tweak. So maybe a more general solution is to provide a way to pass through keywords for to_pandas.

Something like a an extra parameter for read_dataframe called e.g. arrow_to_pandas_kwargs: Dict[str, Any] we can pass to the to_pandas call?

I gave it a try like that in the latest commit.

@theroggy theroggy changed the title Decrease memory usage in read_dataframe with use_arrow=true Add param arrow_to_pandas_kwargs to read_dataframe to pass kwargs to arrow.to_pandas + decrease memory usage Aug 17, 2023
@theroggy theroggy changed the title Add param arrow_to_pandas_kwargs to read_dataframe to pass kwargs to arrow.to_pandas + decrease memory usage Add param arrow_to_pandas_kwargs to read_dataframe + decrease memory usage Aug 17, 2023
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late reply, and thanks for the update!

do you see any disadvantages to use split_blocks=True, as it does seem to make a measurable difference in peak memory usage.

For typical usage, I don't expect much difference, but with many columns there can be a benefit of having consolidated columns in pandas (so for actually benchmarking the impact, you also need to consider potential follow-up operations on the pandas DataFrame ..). Now I personally think we should consider switching the default, but I would prefer to follow pyarrow on this for consistency, and see if we we want to change this on the pyarrow side.

pyogrio/geopandas.py Outdated Show resolved Hide resolved
pyogrio/geopandas.py Outdated Show resolved Hide resolved
@theroggy
Copy link
Member Author

theroggy commented Sep 23, 2023

Sorry for the late reply, and thanks for the update!

do you see any disadvantages to use split_blocks=True, as it does seem to make a measurable difference in peak memory usage.

For typical usage, I don't expect much difference, but with many columns there can be a benefit of having consolidated columns in pandas (so for actually benchmarking the impact, you also need to consider potential follow-up operations on the pandas DataFrame ..). Now I personally think we should consider switching the default, but I would prefer to follow pyarrow on this for consistency, and see if we we want to change this on the pyarrow side.

OK, I removed the overrule of the split_blocks param.

@jorisvandenbossche jorisvandenbossche added this to the 0.7.0 milestone Sep 28, 2023
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good to me!

cc @brendan-ward are you ok with the arrow_to_pandas_kwargs? It's quite long, but explicit

pyogrio/geopandas.py Outdated Show resolved Hide resolved
Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @theroggy !

arrow_to_pandas_kwargs is fine by me.

@theroggy theroggy merged commit dffb502 into geopandas:main Sep 30, 2023
14 checks passed
@theroggy theroggy deleted the Decrease-memory-usage-for-use_arrow=True branch September 30, 2023 06:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Load GeoDataFrame with arrow attribute dtypes?
3 participants