Describe the enhancement requested
I have a dataset hosted on AWS S3 as hive-partitioned Parquet files. The data is written to S3 by a Python job via pandas with snappy compression, and the resulting file names look like the path shown in the error below, e.g. year=2024/month=11/day=25/hour_ending=10/<random string>.snappy.parquet.
For our use case the Python job runs each hour and the new hourly data is written out, but since the most recent 24 hours might have updates, the last 24 hours of partitions are also rewritten each hour. Each time that hour_ending=10 file is rewritten, it gets a new random string as part of the snappy file name.
The problem I am running into is that after I open_dataset("s3://bucket-name/dataset/") in R and then query this dataset, depending on my timing I sometimes hit an error where arrow is looking for a file name that no longer exists (because the Python job updated the most recent 24-hour partitions after I opened the dataset but before the query finished). I get this error:
Error in `compute.arrow_dplyr_query()`:
! IOError: Could not open Parquet input source 'bucket-name/dataset/year=2024/month=11/day=25/hour_ending=10/5d7020e501db412b9e729bb0b5da948b.snappy.parquet': AWS Error NO_SUCH_KEY during GetObject operation: The specified key does not exist.
Would it be possible (advisable?) to add an option which allows reading whatever parquet file is in a partition, regardless of whether the parquet file name has changed mid-query?
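For reference, a minimal sketch of the access pattern that hits this, reconstructed from the description above (the bucket/dataset path is the placeholder from the error message, and the filter is just over the hive partition keys in that path):

```r
library(arrow)
library(dplyr)

# open_dataset() lists the S3 keys under the prefix at this point and keeps
# that file list inside the resulting dataset object
ds <- open_dataset("s3://bucket-name/dataset/")

# If the Python job rewrites one of the last-24-hour partitions between
# open_dataset() and collect(), the scan can still reference the old key
# and fails with the NO_SUCH_KEY error shown above
ds |>
  filter(year == 2024, month == 11, day == 25, hour_ending == 10) |>
  collect()
```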
Component(s)
R
Hi @JakeRuss, thanks for opening the issue! When a dataset is created via open_dataset(), one of the first things that happens is that it scans all possible files and keeps a list of these within the resulting dataset object. To make the change you suggest, we'd have to fundamentally change how datasets work, so it's unlikely to be feasible, sorry!
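For illustration, a small sketch of what that means in practice (using the same placeholder path from the issue):

```r
library(arrow)

ds <- open_dataset("s3://bucket-name/dataset/")

# The S3 keys captured when the dataset was opened; a FileSystemDataset
# exposes these via $files. Files rewritten on S3 afterwards are not
# re-discovered, which is why a later query can end up pointing at a
# key that no longer exists.
head(ds$files)
```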
Happy to help think of possible workarounds though. The first things that come to mind - and you may have already thought of these - would be: modifying the upstream pipeline to not use random filenames, if that's possible; or creating an initial dataset from the files you know to be fixed, creating a smaller dataset from the file(s) which have just been rewritten, and then passing both of these datasets into open_dataset() to make a single dataset - see the sketch below. Might either of those options work for you?
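A rough sketch of that second suggestion, assuming the stable and recently rewritten files can be separated by S3 prefix (the day-level split below is purely illustrative, and note that partition fields which only appear in the trimmed-off part of the path won't be inferred for these sub-datasets):

```r
library(arrow)

# Dataset over partitions older than 24 hours, which are no longer rewritten
ds_stable <- open_dataset("s3://bucket-name/dataset/year=2024/month=11/day=24/")

# Smaller dataset over just the partitions that are still being rewritten;
# recreate this one right before querying so its file list is fresh
ds_recent <- open_dataset("s3://bucket-name/dataset/year=2024/month=11/day=25/")

# Passing a list of Dataset objects to open_dataset() combines them into a
# single dataset that can be queried as one
ds <- open_dataset(list(ds_stable, ds_recent))
```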