Add param arrow_to_pandas_kwargs to read_dataframe + decrease memory usage #273
Conversation
There are a lot of options in the arrow -> pandas conversion (`to_pandas`) to influence e.g. the way data is returned. For this, the param `arrow_to_pandas_kwargs` is added to `read_dataframe`. According to the arrow documentation, passing `split_blocks=True` and `self_destruct=True` should decrease peak memory usage of `to_pandas` in some cases, so in `read_dataframe` it seems more logical to use those as defaults: https://arrow.apache.org/docs/python/pandas.html#reducing-memory-use-in-table-to-pandas

Related to #262 and resolves #241
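Based on the description above, usage of the new parameter would look roughly like the sketch below. The file path is illustrative, and the assumption is that `arrow_to_pandas_kwargs` only takes effect together with `use_arrow=True`, since that is the code path that goes through `to_pandas`:

```python
import pyogrio

# Read via the Arrow path and forward keyword arguments to
# pyarrow.Table.to_pandas(); the kwargs shown opt into the
# memory-saving behavior discussed in this PR.
df = pyogrio.read_dataframe(
    "data.gpkg",  # illustrative path
    use_arrow=True,
    arrow_to_pandas_kwargs={"split_blocks": True, "self_destruct": True},
)
```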
I did a quick test to see the impact of this change on the peak memory usage of `read_dataframe`. I restarted the python process between each test, to avoid any influence from the order in which I ran the tests. Obviously it is just one specific case, so I'm not sure if it is comparable for other files, but it is better to have one test than no test :-). I used the following script/file, as it was the motive for making the change: the script crashed with memory errors on my laptop, so I ran the test on a real computer.

```python
import psutil
import pyogrio

url = "https://download.geofabrik.de/europe/germany/baden-wuerttemberg-latest.osm.pbf"
pgdf = pyogrio.read_dataframe(url, use_arrow=True, sql="SELECT * FROM multipolygons")
print(psutil.Process().memory_info())
```

Results:
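As background to this measurement, the two options under test map to the following pyarrow-level call. This is a sketch with illustrative table contents, separate from the script above:

```python
import pyarrow as pa

table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# split_blocks=True: create one internal pandas block per column instead
# of consolidating columns of the same dtype into a single 2-D block.
# self_destruct=True: release each Arrow buffer as soon as its column has
# been converted, so the Arrow and pandas copies don't coexist in memory.
df = table.to_pandas(split_blocks=True, self_destruct=True)
del table  # after self_destruct=True, the table must not be used again
```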
@jorisvandenbossche do you see any disadvantages to using `split_blocks=True`, as it does seem to make a measurable difference in peak memory usage?
> Something like an extra parameter for `to_pandas`?

I gave it a try like that in the latest commit.
Sorry for the late reply, and thanks for the update!
> do you see any disadvantages to using `split_blocks=True`, as it does seem to make a measurable difference in peak memory usage?
For typical usage, I don't expect much difference, but with many columns there can be a benefit to having consolidated columns in pandas (so for actually benchmarking the impact, you also need to consider potential follow-up operations on the pandas DataFrame...). Now, I personally think we should consider switching the default, but I would prefer to follow pyarrow on this for consistency, and see if we want to change this on the pyarrow side.
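To illustrate the consolidation trade-off, a small sketch; it inspects `DataFrame._mgr`, an internal pandas attribute, so the exact block counts are implementation-dependent:

```python
import numpy as np
import pyarrow as pa

# 50 float64 columns: a case where block consolidation matters.
table = pa.table({f"c{i}": np.arange(1_000, dtype="float64") for i in range(50)})

consolidated = table.to_pandas()            # columns merged per dtype
split = table.to_pandas(split_blocks=True)  # one block per column

print(consolidated._mgr.nblocks)  # typically 1
print(split._mgr.nblocks)         # typically 50
```

Follow-up operations that touch many columns at once can be faster on the consolidated layout, which is the trade-off mentioned above.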
OK, I removed the overrule of the `split_blocks` default.
Thanks, looks good to me!
cc @brendan-ward are you OK with the name `arrow_to_pandas_kwargs`? It's quite long, but explicit.
Thanks @theroggy! `arrow_to_pandas_kwargs` is fine by me.