
@bveeramani
Member

@bveeramani bveeramani commented Jan 27, 2025

Why are these changes needed?

People often have non-Parquet files in their datasets (e.g., _SUCCESS or stale files). However, the default for file_extensions is None, so read_parquet tries reading the non-Parquet files. To avoid this issue, we'll change the default file extensions to something like ["parquet"]. This PR adds a warning for that change.
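To make the difference between the two defaults concrete, here is a minimal, standard-library-only sketch of extension filtering. It is not Ray's implementation: the helper name `filter_paths` and the matching rule (a case-insensitive suffix test on `"." + extension`) are assumptions made for illustration.

```python
from typing import List, Optional


def filter_paths(paths: List[str], file_extensions: Optional[List[str]]) -> List[str]:
    """Keep only paths that end with one of the given extensions.

    Assumption: matching is a case-insensitive suffix test on "." + extension.
    `None` disables filtering, which mirrors the current default described above.
    """
    if file_extensions is None:
        return list(paths)
    # Normalize each extension to a lowercase ".ext" suffix.
    suffixes = tuple("." + ext.lower().lstrip(".") for ext in file_extensions)
    return [p for p in paths if p.lower().endswith(suffixes)]


paths = ["data/part-0.parquet", "data/_SUCCESS", "data/part-1.parquet.tmp"]

# Current default: nothing is filtered, so `_SUCCESS` would be read as Parquet.
print(filter_paths(paths, None))

# Proposed default: only files that look like Parquet remain.
print(filter_paths(paths, ["parquet"]))
```

With `None`, the marker file and the stale temporary file are both passed to the reader; with `["parquet"]`, only `data/part-0.parquet` survives.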

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani requested a review from a team as a code owner January 27, 2025 22:52
@richardliaw
Contributor

@bveeramani this didn't seem to work for me?

In [7]: ray.data.read_parquet("hello.doc")
Out[7]: 
Dataset(
   num_rows=150,
   schema={
      sepal.length: double,
      sepal.width: double,
      petal.length: double,
      petal.width: double,
      variety: string
   }
)

Comment on lines +166 to +175
"parquet.snappy",
"snappy.parquet",
# Gzip compression
"parquet.gz",
# Brotli compression
"parquet.br",
# Lz4 compression
"parquet.lz4",
# Zstd compression
"parquet.zst",
Contributor


Can you help me understand where these are coming from? It should be .snappy.parquet, for example, not the other way around.

Member Author

@bveeramani bveeramani Jan 28, 2025


These are the canonical file extensions for the compression formats that PyArrow supports.

I misread your comment at first. I've seen both parquet.snappy and snappy.parquet in practice, so I included both.

How should I change this list?
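One way to reason about the list is to check which candidate extensions each filename would match. The sketch below assumes a plain case-insensitive suffix test on `"." + extension`, which may differ from what Ray Data actually does, and `matching_extensions` is a hypothetical helper. The candidate list is copied from the warning message quoted in this thread.

```python
from typing import List

# Candidate extensions, as listed in the FutureWarning message in this thread.
CANDIDATE_EXTENSIONS = [
    "parquet",
    "parquet.snappy",
    "snappy.parquet",
    "parquet.gz",
    "parquet.br",
    "parquet.lz4",
    "parquet.zst",
]


def matching_extensions(filename: str, extensions: List[str]) -> List[str]:
    """Return every extension the filename ends with (assumed suffix matching)."""
    name = filename.lower()
    return [ext for ext in extensions if name.endswith("." + ext)]


for filename in [
    "part-0.snappy.parquet",
    "part-0.parquet.snappy",
    "part-0.parquet",
    "_SUCCESS",
]:
    print(filename, matching_extensions(filename, CANDIDATE_EXTENSIONS))
```

Under this assumption, any file named `*.snappy.parquet` already matches the plain `parquet` entry, so listing `snappy.parquet` separately is redundant but harmless. Only the entries that end in a compression suffix, such as `parquet.snappy` or `parquet.gz`, widen what is accepted.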

@bveeramani
Member Author

> @bveeramani this didn't seem to work for me?
>
> In [7]: ray.data.read_parquet("hello.doc")
> Out[7]:
> Dataset(
>    num_rows=150,
>    schema={
>       sepal.length: double,
>       sepal.width: double,
>       petal.length: double,
>       petal.width: double,
>       variety: string
>    }
> )

@richardliaw how are your warnings configured? Do you have PYTHONWARNINGS configured or something?

Ray Data emits the warning when I test it in an interactive session and with the unit test:

❯ python -c "import ray; ray.data.read_parquet('iris')"
2025-01-28 09:55:54,620 INFO worker.py:1832 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
Parquet Files Sample 0: 100%|███████████████████████████████████████████████████████████████████████████████████| 1.00/1.00 [00:00<00:00, 4.34 file/s]
/Users/balaji/ray/python/ray/data/_internal/datasource/parquet_datasource.py:760: FutureWarning: The default file_extensions for read_parquet will change from None to ['parquet', 'parquet.snappy', 'snappy.parquet', 'parquet.gz', 'parquet.br', 'parquet.lz4', 'parquet.zst'] after Ray 2.43, and your dataset contains files that don't match the new file_extensions. To maintain backwards compatibility, set file_extensions=None explicitly.
warnings.warn(
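The behavior shown in that output can be sketched roughly as follows. This is a simplified illustration and not the code in parquet_datasource.py: the `_UNSET` sentinel, the function name `resolve_file_extensions`, and the suffix-matching rule are all assumptions. The sentinel is one way to tell "the caller passed nothing" apart from "the caller explicitly passed `None`", which the warning text relies on.

```python
import warnings
from typing import List, Optional

# Hypothetical sentinel meaning "the caller did not pass file_extensions".
_UNSET = object()

# Future default, copied from the warning message above.
FUTURE_DEFAULT = [
    "parquet", "parquet.snappy", "snappy.parquet",
    "parquet.gz", "parquet.br", "parquet.lz4", "parquet.zst",
]


def resolve_file_extensions(paths: List[str], file_extensions=_UNSET) -> Optional[List[str]]:
    """Return the extensions to filter by, warning if the future default would drop files."""
    if file_extensions is not _UNSET:
        # An explicit choice (including an explicit None) is respected silently.
        return file_extensions
    suffixes = tuple("." + ext for ext in FUTURE_DEFAULT)
    if any(not p.lower().endswith(suffixes) for p in paths):
        warnings.warn(
            f"The default file_extensions will change from None to {FUTURE_DEFAULT}, "
            "and your dataset contains files that don't match the new "
            "file_extensions. To keep the current behavior, set "
            "file_extensions=None explicitly.",
            FutureWarning,
            stacklevel=2,
        )
    return None  # current default: no filtering


# Warns, because `_SUCCESS` would be excluded under the future default.
resolve_file_extensions(["iris/part-0.parquet", "iris/_SUCCESS"])

# Does not warn, because the caller opted in to the current behavior.
resolve_file_extensions(["iris/part-0.parquet", "iris/_SUCCESS"], file_extensions=None)
```

Warning only when a file would actually be excluded keeps the message out of the way for datasets that already contain nothing but Parquet files.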

@richardliaw
Contributor

Interesting, well I guess in theory the code looks right. I don't have warnings configured, so not sure why it's not showing up.
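A quick way to rule out warning filters in a given session is to emit a probe `FutureWarning` and see whether it gets through. The sketch below uses only the standard `warnings` module, and `future_warning_is_delivered` is a hypothetical helper name. It checks the active filters (for example, ones installed via `PYTHONWARNINGS` or `warnings.filterwarnings`), not whether Ray Data reached the code path that warns.

```python
import warnings


def future_warning_is_delivered() -> bool:
    """Return True if a FutureWarning raised here passes the active filters."""
    # record=True captures delivered warnings in a list instead of printing them.
    # The currently active filters are left in place, so an "ignore" filter
    # results in an empty list.
    with warnings.catch_warnings(record=True) as caught:
        warnings.warn("probe", FutureWarning)
    return any(issubclass(w.category, FutureWarning) for w in caught)


print(future_warning_is_delivered())
```

If this prints `False`, something in the environment is suppressing `FutureWarning`. If it prints `True`, the filters are not the cause and the missing warning needs a different explanation.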

@richardliaw richardliaw added the data (Ray Data-related issues) and go (add ONLY when ready to merge, run all tests) labels Jan 29, 2025
@richardliaw
Contributor

tests failing

@bveeramani
Member Author

> tests failing

Investigating 👀

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani enabled auto-merge (squash) February 4, 2025 01:53
@bveeramani bveeramani merged commit 91780d1 into master Feb 4, 2025
6 checks passed
@bveeramani bveeramani deleted the parquet-file-extensions branch February 4, 2025 02:03
xsuler pushed a commit to antgroup/ant-ray that referenced this pull request Mar 4, 2025
…t#50092)

People often have non-Parquet files in their datasets (e.g., `_SUCCESS`
or stale files). However, the default for `file_extensions` is `None`,
so `read_parquet` tries reading the non-Parquet files. To avoid this
issue, we'll change the default file extensions to something like
`["parquet"]`. This PR adds a warning for that change.

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
park12sj pushed a commit to park12sj/ray that referenced this pull request Mar 18, 2025
…t#50092)


Labels

community-backlog, data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests)


5 participants