Describe the enhancement requested
I have a dataset hosted on AWS S3 as hive-partitioned Parquet files. The data is written to S3 by a Python job via pandas with snappy compression, and the resulting file names look like the path shown in the error below, e.g. year=2024/month=11/day=25/hour_ending=10/<random string>.snappy.parquet.
For our use case the Python job runs each hour and the new hourly data is written out, but since the most recent 24 hours might have updates, the last 24 hours of partitions are also rewritten each hour. Each time that hour_ending=10 file is rewritten, it gets a new random string as part of the snappy file name.
The problem I am running into is that after I open_dataset("s3://bucket-name/dataset/") in R and then query this dataset, depending on my timing I sometimes hit an error where arrow is looking for a file name that no longer exists (because the Python job updated the most recent 24-hour partitions after I opened the dataset but before the query finished). I get this error:
Error in `compute.arrow_dplyr_query()`:
! IOError: Could not open Parquet input source 'bucket-name/dataset/year=2024/month=11/day=25/hour_ending=10/5d7020e501db412b9e729bb0b5da948b.snappy.parquet': AWS Error NO_SUCH_KEY during GetObject operation: The specified key does not exist.
Would it be possible (advisable?) to add an option which allows reading whatever parquet file is in a partition, regardless of whether the parquet file name has changed mid-query?
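For reference, a minimal sketch of the access pattern that hits this, reconstructed from the description above (the bucket/dataset path is the placeholder from the error message, and the filter is just over the hive partition keys in that path):

```r
library(arrow)
library(dplyr)

# open_dataset() lists the S3 keys under the prefix at this point and keeps
# that file list inside the resulting dataset object
ds <- open_dataset("s3://bucket-name/dataset/")

# If the Python job rewrites one of the last-24-hour partitions between
# open_dataset() and collect(), the scan can still reference the old key
# and fails with the NO_SUCH_KEY error shown above
ds |>
  filter(year == 2024, month == 11, day == 25, hour_ending == 10) |>
  collect()
```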
Component(s)
R
Hi @JakeRuss, thanks for opening the issue! When a dataset is created via open_dataset(), one of the first things that happens is that it scans all possible files and keeps a list of these within the resulting dataset object. To make the change you suggest, we'd have to fundamentally change how datasets work, so it's unlikely to be feasible, sorry!
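For illustration, a small sketch of what that means in practice (using the same placeholder path from the issue):

```r
library(arrow)

ds <- open_dataset("s3://bucket-name/dataset/")

# The S3 keys captured when the dataset was opened; a FileSystemDataset
# exposes these via $files. Files rewritten on S3 afterwards are not
# re-discovered, which is why a later query can end up pointing at a
# key that no longer exists.
head(ds$files)
```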
Happy to help think of possible workarounds though. The first things that come to mind - and you may have already thought of these - would be: modifying the upstream pipeline to not use random filenames, if that's possible; or creating an initial dataset from the files you know to be fixed, creating a smaller dataset from the file(s) which have just been rewritten, and then passing both of these datasets into open_dataset() to make a single dataset - see the sketch below. Might either of those options work for you?
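A rough sketch of that second suggestion, assuming the stable and recently rewritten files can be separated by S3 prefix (the day-level split below is purely illustrative, and note that partition fields which only appear in the trimmed-off part of the path won't be inferred for these sub-datasets):

```r
library(arrow)

# Dataset over partitions older than 24 hours, which are no longer rewritten
ds_stable <- open_dataset("s3://bucket-name/dataset/year=2024/month=11/day=24/")

# Smaller dataset over just the partitions that are still being rewritten;
# recreate this one right before querying so its file list is fresh
ds_recent <- open_dataset("s3://bucket-name/dataset/year=2024/month=11/day=25/")

# Passing a list of Dataset objects to open_dataset() combines them into a
# single dataset that can be queried as one
ds <- open_dataset(list(ds_stable, ds_recent))
```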