ak.from_parquet returns empty array when columns are specified #1606

Closed
agoose77 opened this issue Aug 19, 2022 · 4 comments · Fixed by #1619
Labels
bug The problem described is something that must be fixed

Comments

@agoose77
Collaborator

agoose77 commented Aug 19, 2022

Version of Awkward Array

HEAD

Description and code to reproduce

>>> import awkward._v2 as ak
>>> taxi = ak.from_parquet(
...     "https://pivarski-princeton.s3.amazonaws.com/chicago-taxi.parquet",
...     columns=["trip.km"],
... )
>>> taxi.type.show()
7728 * {
    
}
agoose77 added the bug (unverified) and bug labels and removed the bug (unverified) label on Aug 19, 2022.
@agoose77
Collaborator Author

NB: the docs branch will need modifying after this fix to re-enable strict mode for the 10-minutes-... notebook. Right now, it's allowed to fail in order to keep builds working.

agoose77 changed the title from "ak.from_parquet returns empty array when row groups are specified" to "ak.from_parquet returns empty array when columns are specified" on Aug 23, 2022.
@agoose77
Collaborator Author

agoose77 commented Aug 23, 2022

@martindurant / @jpivarski I mistakenly tagged you whilst formulating a question concerning pyarrow details here. Now, however, I'll open a PR and we can discuss things there.

@jpivarski
Member

Does "trip.km" match any columns? (We need to make it easier to use metadata_from_parquet to answer this question, though it's currently possible.)

Oh wait: things have been changing and I'm not up to date on the changes. metadata_from_parquet now returns a dict, rather than a namedtuple. "trip.km" is definitely one of the columns:

>>> import awkward._v2 as ak
>>> ak.metadata_from_parquet(
...     "https://pivarski-princeton.s3.amazonaws.com/chicago-taxi.parquet"
... )["form"].columns()
['trip.sec', 'trip.km', 'trip.begin.lon', 'trip.begin.lat', 'trip.begin.time', 'trip.end.lon',
 'trip.end.lat', 'trip.end.time', 'trip.path.londiff', 'trip.path.latdiff', 'payment.fare', 'payment.tips',
 'payment.total', 'payment.type', 'company']

Getting any column by name no longer works:

>>> ak.from_parquet("https://pivarski-princeton.s3.amazonaws.com/chicago-taxi.parquet", columns=["trip.km"])
<Array [{}, {}, {}, {}, {}, {}, ..., {}, {}, {}, {}, {}, {}] type='7728 * {}'>

which is a regression. (These features are going to need unit tests, which is a little complicated because that means making small sample files.)
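
To make concrete what such a unit test would check, here is a toy model in plain Python of selecting fields by dotted path. It is not Awkward Array's implementation and reads no Parquet file: `project` is a hypothetical helper, and the sample records are made up, with field names borrowed from the column list above. It only illustrates the expectation that selecting "trip.km" keeps that field rather than producing empty records.

```python
def project(record, paths):
    """Toy model: keep only the fields of a nested dict named by dotted paths."""
    out = {}
    for path in paths:
        src, dst = record, out
        keys = path.split(".")
        # Walk down to the parent of the requested field, mirroring the
        # nesting in the output as we go.
        for key in keys[:-1]:
            src = src[key]
            dst = dst.setdefault(key, {})
        dst[keys[-1]] = src[keys[-1]]
    return out


# Made-up sample records (assumption: not taken from the real file).
records = [
    {"trip": {"sec": 600.0, "km": 3.2}, "company": "A"},
    {"trip": {"sec": 900.0, "km": 5.1}, "company": "B"},
]

selected = [project(r, ["trip.km"]) for r in records]

# The regression in this issue corresponds to every record coming back empty.
assert all(rec != {} for rec in selected)
```

A real test would replace the in-memory records with a small sample Parquet file and apply the same kind of assertion to the array's fields or type.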

@martindurant
Contributor

Aside from fixing exactly how the columns are passed, we should presumably warn or raise an error on an attempt to select columns that don't exist.
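
One way such a check could look is sketched below. The helper name `select_columns` and its matching rules (exact name, dotted prefix, or glob pattern via the standard library's `fnmatch`) are assumptions for illustration, not the library's actual behaviour. The column list is the one reported by metadata_from_parquet earlier in this thread.

```python
from fnmatch import fnmatchcase

# Column names as listed earlier in this thread.
AVAILABLE = [
    "trip.sec", "trip.km", "trip.begin.lon", "trip.begin.lat", "trip.begin.time",
    "trip.end.lon", "trip.end.lat", "trip.end.time", "trip.path.londiff",
    "trip.path.latdiff", "payment.fare", "payment.tips", "payment.total",
    "payment.type", "company",
]


def select_columns(requested, available):
    """Hypothetical helper: return the available columns matched by `requested`.

    A requested name matches a column if it equals it, is a dotted prefix of
    it ("trip" matches "trip.km"), or matches it as a glob pattern.
    Raises ValueError if any requested name matches nothing.
    """
    selected = []
    unmatched = []
    for name in requested:
        hits = [
            col for col in available
            if col == name or col.startswith(name + ".") or fnmatchcase(col, name)
        ]
        if not hits:
            unmatched.append(name)
        for col in hits:
            if col not in selected:
                selected.append(col)
    if unmatched:
        raise ValueError(f"no columns match {unmatched!r}; available: {available!r}")
    return selected
```

With this sketch, `select_columns(["trip.km"], AVAILABLE)` returns the single matching column, while a misspelled name such as "trip.miles" raises an error instead of silently producing empty records. Emitting a warning instead would be a one-line change using `warnings.warn`.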

3 participants