[Data] Raise future warning if invalid Parquet extensions #50092
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani this didn't seem to work for me?
"parquet.snappy",
"snappy.parquet",
# Gzip compression
"parquet.gz",
# Brotli compression
"parquet.br",
# Lz4 compression
"parquet.lz4",
# Zstd compression
"parquet.zst",
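To make the discussion of this list concrete, here is a minimal sketch of how a suffix list like the one above might be matched against file paths. The names `PARQUET_SUFFIXES` and `has_parquet_suffix` are hypothetical and are not taken from the PR, and the plain `"parquet"` entry is an assumption, since the excerpt above starts partway through the list.

```python
# Hypothetical list mirroring the suffixes discussed in this thread.
# The plain "parquet" entry is an assumption; the excerpt above does not show it.
PARQUET_SUFFIXES = [
    "parquet",
    "parquet.snappy",
    "snappy.parquet",
    "parquet.gz",
    "parquet.br",
    "parquet.lz4",
    "parquet.zst",
]


def has_parquet_suffix(path: str, suffixes=PARQUET_SUFFIXES) -> bool:
    """Return True if the path ends with '.' followed by one of the suffixes."""
    lowered = path.lower()
    return any(lowered.endswith("." + suffix) for suffix in suffixes)


if __name__ == "__main__":
    for candidate in ["out/part-0.snappy.parquet", "out/part-0.parquet.gz", "out/_SUCCESS"]:
        print(candidate, has_parquet_suffix(candidate))
```

Under this kind of suffix matching, any path ending in `.snappy.parquet` also ends in `.parquet`, so a separate `"snappy.parquet"` entry would be redundant whenever `"parquet"` is in the list.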
Can you help me understand where these are coming from? It should be `.snappy.parquet`, for example, not the other way around.
These are the canonical file extensions for the compression formats that PyArrow supports.
I misread your comment. I've seen both `parquet.snappy` and `snappy.parquet`, so I included both.
How should I change this list?
@richardliaw how are your warnings configured? Ray Data emits the warning when I test it in an interactive session and with the unit test.
Interesting, well I guess in theory the code looks right. I don't have warnings configured, so not sure why it's not showing up.
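For anyone trying to reproduce this, a minimal sketch of checking for a `FutureWarning` independently of the session's warning filters is shown below. `read_files` is a hypothetical stand-in, not Ray Data's `read_parquet`, and the warning message is made up for illustration. One possible explanation (not a confirmed diagnosis for this thread) is that Python's default filter prints a given warning only once per call location, so repeated calls in the same session can look silent.

```python
import warnings


def read_files(paths, file_extensions=None):
    """Hypothetical stand-in for a reader that warns about a future default."""
    if file_extensions is None:
        warnings.warn(
            "The default `file_extensions` may change in a future release.",
            FutureWarning,
            stacklevel=2,
        )
    return list(paths)


def emits_future_warning(func, *args, **kwargs) -> bool:
    """Call func and report whether it raised a FutureWarning."""
    with warnings.catch_warnings(record=True) as caught:
        # "always" bypasses the once-per-location behavior of the default filter.
        warnings.simplefilter("always")
        func(*args, **kwargs)
    return any(issubclass(w.category, FutureWarning) for w in caught)


if __name__ == "__main__":
    print(emits_future_warning(read_files, ["a.parquet"]))
    print(emits_future_warning(read_files, ["a.parquet"], file_extensions=["parquet"]))
```

Recording warnings this way sidesteps whatever filters are active in the interactive session, which makes it easier to tell "not emitted" apart from "emitted but suppressed".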
tests failing
Investigating 👀
…t#50092) People often have non-Parquet files in their datasets (e.g., `_SUCCESS` or stale files). However, the default for `file_extensions` is `None`, so `read_parquet` tries reading the non-Parquet files. To avoid this issue, we'll change the default file extensions to something like `["parquet"]`. This PR adds a warning for that change. --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Why are these changes needed?
People often have non-Parquet files in their datasets (e.g., `_SUCCESS` or stale files). However, the default for `file_extensions` is `None`, so `read_parquet` tries reading the non-Parquet files. To avoid this issue, we'll change the default file extensions to something like `["parquet"]`. This PR adds a warning for that change.
Related issue number
Checks
- … `git commit -s`) in this PR.
- … `scripts/format.sh` to lint the changes in this PR.
- … method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
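As a rough illustration of the motivation in "Why are these changes needed?", the sketch below shows how a `file_extensions` default of `None` lets a marker file such as `_SUCCESS` through, while `["parquet"]` filters it out. `filter_paths` and the example listing are hypothetical and do not reflect Ray Data's actual implementation.

```python
def filter_paths(paths, file_extensions=None):
    """Keep paths matching one of the extensions; None keeps every path."""
    if file_extensions is None:
        return list(paths)
    suffixes = tuple("." + ext.lower() for ext in file_extensions)
    # str.endswith accepts a tuple and matches if any suffix matches.
    return [p for p in paths if p.lower().endswith(suffixes)]


# Hypothetical directory listing that includes a non-Parquet marker file.
listing = ["out/part-0.parquet", "out/part-1.parquet", "out/_SUCCESS"]

if __name__ == "__main__":
    print(filter_paths(listing))
    print(filter_paths(listing, ["parquet"]))
```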