leading underscores in partition names results in empty read #42160

Closed
smorken opened this issue Jun 15, 2024 · 2 comments

Comments


smorken commented Jun 15, 2024

Describe the bug, including details regarding any error messages, version, and platform.

python version: Python 3.12.4 (tags/v3.12.4:8e8a4ba, Jun 6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)] on win32

pyarrow version: pyarrow==16.1.0

the following code produced an unexpected empty table result:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table(
    {
        "_year": [2020, 2022, 2021, 2022, 2019, 2021],
        "n_legs": [2, 2, 4, 4, 5, 100],
        "animal": ["Flamingo", "Parrot", "Dog", "Horse", "Brittle stars", "Centipede"],
    }
)

pq.write_to_dataset(table, root_path="dataset_v2", partition_cols=["_year"])
dataset = pq.ParquetDataset("dataset_v2/")
print(dataset.read())

printed result:

pyarrow.Table

----

I'm a little unclear on hive partitioning naming conventions and rules, and I haven't been able to find any documentation that would say leading underscores should not work.

I would have expected an error on the dataset creation, if leading underscores are not supported.

Component(s)

Parquet
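The report above expects an error at write time when a partition column name cannot be read back with default settings. As a rough sketch of such a guard, the hypothetical helpers below (`find_ignored_partition_cols` and `check_partition_cols` are illustrative names, not part of pyarrow) flag partition column names that start with a prefix skipped during dataset discovery. The prefix tuple is an assumption based on the documented default of the `ignore_prefixes` parameter of `ParquetDataset`.

```python
# Assumption: mirrors the documented default of ParquetDataset's
# ignore_prefixes parameter; change it if you pass your own list on read.
DEFAULT_IGNORE_PREFIXES = (".", "_")


def find_ignored_partition_cols(partition_cols, ignore_prefixes=DEFAULT_IGNORE_PREFIXES):
    """Return partition column names whose hive-style directories
    (for example "_year=2020") would start with an ignored prefix."""
    prefixes = tuple(ignore_prefixes)
    return [name for name in partition_cols if name.startswith(prefixes)]


def check_partition_cols(partition_cols, ignore_prefixes=DEFAULT_IGNORE_PREFIXES):
    """Hypothetical pre-write guard: raise instead of writing a dataset
    that a default read would silently skip."""
    bad = find_ignored_partition_cols(partition_cols, ignore_prefixes)
    if bad:
        raise ValueError(
            f"partition columns {bad} start with an ignored prefix "
            f"{tuple(ignore_prefixes)}; rename them or pass ignore_prefixes on read"
        )
```

Calling `check_partition_cols(["_year"])` before `pq.write_to_dataset(...)` would raise, while `check_partition_cols(["year"])` would pass silently.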


tmontes commented Oct 9, 2024

Hey @smorken,

I just ran into a case very similar to the one you're describing here. IIUC, your code will work if you pass the ignore_prefixes=['.'] argument when you create the ParquetDataset for reading.

Instead of...

dataset = pq.ParquetDataset("dataset_v2/")

...use:

dataset = pq.ParquetDataset('dataset_v2/', ignore_prefixes=['.'])

Does it work for you?


smorken commented Oct 10, 2024

this works for me! I had missed that parameter in the ParquetDataset docs. Thanks for the response.

smorken closed this as completed Oct 10, 2024