Get statistics metadata #2233
Comments
I think what you are looking for is under DeltaTable()._table.dataset_partitions(). We use that to construct the stats for each fragment of the pyarrow dataset.
Oh cool. Thanks for the pointer. That gets me closer. Here's what I'm getting:
import deltalake
t = deltalake.DeltaTable("mytable")
filename, info = t._table.dataset_partitions(t.schema().to_pyarrow())[0]
info
So clearly the stuff I want is in there, but I'm not sure that it's actually programmatically accessible from Python. I guess I could raise this upstream with PyArrow and ask for compute expressions to be more introspectable, but that seems like the wrong path. Any further thoughts, aside from parsing the string repr?
Maybe you are looking for get_add_actions?
Oh cool. Yes, that seems like it likely has the information that I'm looking for. Thank you @sherlockbeard!
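For anyone landing here later, a rough sketch of how the get_add_actions output could be regrouped per file. It assumes that DeltaTable.get_add_actions(flatten=True) returns an Arrow record batch with a to_pydict() method and that the statistics columns are named min.<column>, max.<column> and null_count.<column>; this may differ between deltalake releases. The helper names stats_by_file and load_flat_add_actions are made up for illustration and are not part of deltalake.

```python
from collections import defaultdict
from typing import Any

# Column-name prefixes assumed to carry per-file statistics in the flattened output.
STAT_PREFIXES = ("min", "max", "null_count")


def stats_by_file(flat: dict[str, list[Any]]) -> dict[str, dict[str, dict[str, Any]]]:
    """Regroup flattened add-action columns into {path: {column: {stat: value}}}."""
    result: dict[str, dict[str, dict[str, Any]]] = {}
    for row, path in enumerate(flat["path"]):
        per_column: dict[str, dict[str, Any]] = defaultdict(dict)
        for name, values in flat.items():
            # "min.x" -> stat "min", column "x"; names without a dot are skipped.
            stat, _, column = name.partition(".")
            if stat in STAT_PREFIXES and column:
                per_column[column][stat] = values[row]
        result[path] = dict(per_column)
    return result


def load_flat_add_actions(table_uri: str) -> dict[str, list[Any]]:
    """Hypothetical loader; assumes the returned batch exposes to_pydict()."""
    # Imported lazily so stats_by_file can be used without deltalake installed.
    from deltalake import DeltaTable

    return DeltaTable(table_uri).get_add_actions(flatten=True).to_pydict()
```

With the table from the snippet above, stats_by_file(load_flat_add_actions("mytable")) should give one entry per data file, keyed by its path.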
Is it possible to get the statistics metadata on a per-file basis? In particular, I'm looking for the min/max/null_count for each column for each file. This data is available in the JSON files, but as far as I can tell from looking through the docs and poking around the API, it isn't readily available through the Python API (I'd love to be wrong here, though).
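For reference, a minimal sketch of reading those statistics straight from a commit file in _delta_log, using only the standard library. It relies on the Delta transaction log layout: one JSON action per line, where an add action has a path and an optional stats field that is itself a JSON string with numRecords, minValues, maxValues and nullCount. The function names are hypothetical, and the sketch reads a single commit only; it does not replay the log, so it ignores remove actions and checkpoints.

```python
import json
from pathlib import Path
from typing import Any, Iterable, Iterator


def iter_add_stats(log_lines: Iterable[str]) -> Iterator[tuple[str, dict[str, Any]]]:
    """Yield (path, stats) for every add action found in a commit file's lines."""
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        action = json.loads(line)
        add = action.get("add")
        if add is None:
            continue  # commitInfo, metaData, protocol, remove, ...
        raw = add.get("stats")
        # The stats field is a JSON-encoded string and may be absent.
        yield add["path"], (json.loads(raw) if raw else {})


def read_commit_stats(commit_file: Path) -> dict[str, dict[str, Any]]:
    """Collect the per-file stats recorded in one commit file."""
    with commit_file.open(encoding="utf-8") as handle:
        return dict(iter_add_stats(handle))
```

For example, read_commit_stats(Path("mytable/_delta_log/00000000000000000000.json")) would return the stats written by the first commit, assuming the table lives in a local directory named mytable.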