leading underscores in partition names results in empty read #42160

smorken · 2024-06-15T20:19:12Z

Describe the bug, including details regarding any error messages, version, and platform.

python version: Python 3.12.4 (tags/v3.12.4:8e8a4ba, Jun 6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)] on win32

pyarrow version: pyarrow==16.1.0

the following code produced an unexpected empty table result:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table(
    {
        "_year": [2020, 2022, 2021, 2022, 2019, 2021],
        "n_legs": [2, 2, 4, 4, 5, 100],
        "animal": ["Flamingo", "Parrot", "Dog", "Horse", "Brittle stars", "Centipede"],
    }
)

pq.write_to_dataset(table, root_path="dataset_v2", partition_cols=["_year"])
dataset = pq.ParquetDataset("dataset_v2/")
print(dataset.read())

printed result:

pyarrow.Table

----

I'm a little unclear on hive partitioning naming conventions and rules, and I haven't been able to find any documentation that would say leading underscores should not work.

I would have expected an error on the dataset creation, if leading underscores are not supported.

Component(s)

Parquet

tmontes · 2024-10-09T12:15:15Z

Hey @smorken,

I just bumped into a case very similar to the one you're describing here. IIUC, your code will work if you add the ignore_prefixes=['.'] argument when you create the dataset.

Instead of...

dataset = pq.ParquetDataset("dataset_v2/")

...use:

dataset = pq.ParquetDataset("dataset_v2/", ignore_prefixes=["."])

Does it work for you?

smorken · 2024-10-10T23:09:49Z

this works for me! I had missed that parameter in the ParquetDataset docs. Thanks for the response.
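
For anyone landing here later, this is my understanding of why the fix works: according to the ParquetDataset docs, ignore_prefixes defaults to ['.', '_'] and is matched against path basenames during discovery, so a partition directory such as _year=2020 gets skipped like a hidden file would. The sketch below is only a simplified, stdlib-only model of that filtering, not pyarrow's actual code; the is_ignored helper and the example file names are hypothetical.

```python
from pathlib import PurePosixPath

# Assumed default, taken from the ParquetDataset docs for ignore_prefixes.
DEFAULT_IGNORE_PREFIXES = (".", "_")


def is_ignored(relative_path, ignore_prefixes=DEFAULT_IGNORE_PREFIXES):
    """Return True if any path segment starts with one of the prefixes.

    Hypothetical helper: a simplified model of discovery, not pyarrow code.
    """
    parts = PurePosixPath(relative_path).parts
    return any(part.startswith(tuple(ignore_prefixes)) for part in parts)


# Hypothetical paths, relative to the dataset root.
files = [
    "_year=2020/part-0.parquet",
    "year=2020/part-0.parquet",
    "_year=2020/.part-0.parquet.crc",
]

# With the assumed defaults, everything under "_year=..." is dropped.
default_kept = [f for f in files if not is_ignored(f)]

# With only "." ignored, the "_year=..." data file survives discovery.
dot_only_kept = [f for f in files if not is_ignored(f, ignore_prefixes=["."])]

print(default_kept)
print(dot_only_kept)
```

If that model is right, passing ignore_prefixes=['.'] keeps the underscore-prefixed partition directories visible while still skipping dot-files.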

smorken added the Type: bug label Jun 15, 2024

github-actions bot added the Component: Parquet label Jun 15, 2024

tmontes mentioned this issue Oct 9, 2024

Hive partition columns with leading underscore: No match for FieldRef.Name(_file) #44352

Closed

smorken closed this as completed Oct 10, 2024
