[R] Add option to ignore partition file names when querying an arrow open_dataset() ? #44889

Open
JakeRuss opened this issue Dec 1, 2024 · 1 comment

Comments

@JakeRuss
JakeRuss commented Dec 1, 2024

Describe the enhancement requested

I have a dataset hosted on AWS S3 as Hive-partitioned Parquet files. The data is written to S3 by a Python job via pandas with snappy compression, and the resulting file names look like this:

s3://bucket-name/dataset/year=2024/month=11/day=25/hour_ending=10/a0866a30008848e1ba878514954e4d6a.snappy.parquet

For our use case, the Python job runs each hour and writes out new hourly data, but since the most recent 24 hours might have updates, the last 24 hours are also rewritten each hour. In the above example, each time the file for hour ending 10 is written, it gets a new random string as part of its file name.

The problem I am running into is that after I call open_dataset("s3://bucket-name/dataset/") in R and then query this dataset, depending on my timing, I sometimes hit an error where arrow is looking for a file name that no longer exists (because the Python job updated the most recent 24 hourly partitions after I opened the dataset but before the query could finish). I get this error:

Error in `compute.arrow_dplyr_query()`:
! IOError: Could not open Parquet input source 'bucket-name/dataset/year=2024/month=11/day=25/hour_ending=10/5d7020e501db412b9e729bb0b5da948b.snappy.parquet': AWS Error NO_SUCH_KEY during GetObject operation: The specified key does not exist.

Would it be possible (advisable?) to add an option which allows reading whatever parquet file is in a partition, regardless of whether the parquet file name has changed mid-query?

Component(s)

R

@thisisnic
Member

Hi @JakeRuss, thanks for opening the issue! When a dataset is created via open_dataset(), one of the first things that happens is that it scans all possible files and keeps a list of these within the resulting dataset object. To make the change you suggest, we'd have to fundamentally change how datasets work, so it's unlikely to be feasible, sorry!

Happy to help think of possible workarounds though. The first things that come to mind (and you may have already thought of these) are modifying the upstream pipeline to not use random file names if possible, or creating an initial dataset based on the files you know to be fixed, then creating a smaller dataset with the file which has just been renamed, and passing both of these datasets into open_dataset() to make a single dataset. Might either of those options work for you?
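
For the first workaround, a rough sketch of what fixed file names could look like in the upstream Python job. This is only an illustration under assumptions: the helper names `partition_key` and `write_hour`, the basename `data.snappy.parquet`, and the unpadded partition values are all made up here, and writing to an `s3://` path with pandas assumes `s3fs` is installed. The idea is to write each hour directly to one deterministic key instead of letting the partitioned writer generate a random name.

```python
# Hypothetical helpers, not part of pandas or arrow.
PARTITION_COLS = ["year", "month", "day", "hour_ending"]


def partition_key(root, year, month, day, hour_ending,
                  basename="data.snappy.parquet"):
    """Build a deterministic object key for one hourly hive partition."""
    root = root.rstrip("/")
    return (f"{root}/year={year}/month={month}/day={day}"
            f"/hour_ending={hour_ending}/{basename}")


def write_hour(df, root, year, month, day, hour_ending):
    """Write one hour of data to its fixed key, replacing any earlier version.

    `df` is assumed to be a pandas DataFrame holding only that hour's rows.
    """
    key = partition_key(root, year, month, day, hour_ending)
    # The partition values live in the path, so drop them from the file body
    # if they are present as columns.
    body = df.drop(columns=PARTITION_COLS, errors="ignore")
    body.to_parquet(key, compression="snappy", index=False)
    return key
```

With this layout a rewrite of hour ending 10 lands on the same key every time, so a file list captured earlier by open_dataset() would still point at keys that exist, although a query may read either the old or the new contents of that hour.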
