
[Datasets] Cannot Read parquet from multiple directories in aws s3. #24598

Closed
fodrh1201 opened this issue May 9, 2022 · 3 comments · Fixed by #25747
Assignees: clarkzinzow
Labels: data (Ray Data-related issues), enhancement (Request for new feature and/or capability), P2 (Important issue, but not time-critical)

Comments


fodrh1201 commented May 9, 2022

Description

An error occurs when reading Parquet files from multiple directories in AWS S3.

Reading from multiple directories raises an OSError:

ray.data.read_parquet(["s3://bucket/path1", "s3://bucket/path2"])

OSError: Error creating dataset. Could not read schema from 's3://bucket/path1': Path does not exist 's3://bucket/path1'. Is this a 'parquet' file?

Use case

I think it would be nice if read_parquet could read from multiple directories, like read_json and read_csv can.

@fodrh1201 added the enhancement label on May 9, 2022
@jianoaix added the data label on May 11, 2022

jianoaix commented May 11, 2022

If you are blocked, you can work around this by reading each directory separately and unioning the results:
ds1 = ray.data.read_parquet(["s3://bucket/path1"])
ds2 = ray.data.read_parquet(["s3://bucket/path2"])
ds = ds1.union(ds2)
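
For more than two directories, the same pattern generalizes. A minimal sketch (the bucket paths are placeholders; Dataset.union accepts multiple datasets):

import ray

# Read each directory into its own dataset, then union them all.
paths = ["s3://bucket/path1", "s3://bucket/path2", "s3://bucket/path3"]
datasets = [ray.data.read_parquet(p) for p in paths]
ds = datasets[0].union(*datasets[1:])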

@jianoaix added the P2 label on May 11, 2022
@clarkzinzow changed the title from "[Data] Cannot Read parquet from multiple directories in aws s3." to "[Datasets] Cannot Read parquet from multiple directories in aws s3." on May 16, 2022
Joeavaikath commented

This is still an active issue; the documentation states that multiple directories are allowed. Is that only for local paths and not remote ones?

@clarkzinzow self-assigned this on Jun 14, 2022

clarkzinzow commented Jun 14, 2022

@Joeavaikath This unfortunately doesn't work with ray.data.read_parquet(), since it uses Arrow's Dataset abstraction under the hood, which doesn't accept multiple directories as a source: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html
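
To see where the limitation comes from, here is a minimal local sketch of what Arrow accepts (the layout is hypothetical): pyarrow.dataset.dataset() takes a single directory, or a list of file paths, but not a list of directories.

import os
import tempfile

import pyarrow as pa
import pyarrow.dataset as pds
import pyarrow.parquet as pq

# Write one Parquet file in each of two directories.
root = tempfile.mkdtemp()
for name in ("path1", "path2"):
    os.makedirs(os.path.join(root, name))
    pq.write_table(pa.table({"x": [1]}), os.path.join(root, name, "f.parquet"))

# A single directory works; Arrow discovers the files inside it.
pds.dataset(os.path.join(root, "path1"), format="parquet")

# A list source must contain file paths; a directory in the list raises,
# which is why read_parquet() fails on multiple directories.
try:
    pds.dataset([os.path.join(root, "path1"), os.path.join(root, "path2")],
                format="parquet")
except (IsADirectoryError, OSError) as err:
    print(err)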

In addition to the workaround that @jianoaix mentioned, we also have a ray.data.read_parquet_bulk() API, which does not use Arrow's Dataset abstraction and works with multiple directories when the right metadata provider is given:

import ray

# read_parquet_bulk() reads files directly instead of going through
# Arrow's Dataset abstraction; DefaultFileMetadataProvider expands each
# directory into the files it contains.
ds = ray.data.read_parquet_bulk(
    ["s3://ursa-labs-taxi-data/2019/01", "s3://ursa-labs-taxi-data/2019/02"],
    meta_provider=ray.data.datasource.DefaultFileMetadataProvider(),
)

Note that this Parquet reading API will not inline Hive directory-partitioned key-value pairs into the table, so your partition columns must exist in the Parquet files themselves, not just in the directory path.
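
A minimal local sketch of that caveat (the paths are hypothetical; pyarrow's partitioned writer is used for illustration):

import glob

import pyarrow as pa
import pyarrow.parquet as pq

# A Hive-partitioned write produces /tmp/demo/year=2019/<file>.parquet
# and physically drops the "year" column from the written files.
table = pa.table({"year": [2019, 2019], "value": [1, 2]})
pq.write_to_dataset(table, "/tmp/demo", partition_cols=["year"])

# The file on disk now contains only "value". read_parquet() re-derives
# "year" from the directory name; read_parquet_bulk() does not, so
# "year" would be missing from the resulting dataset.
path = glob.glob("/tmp/demo/**/*.parquet", recursive=True)[0]
print(pq.read_table(path).column_names)  # ['value']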

I'll open a PR that fixes the read_parquet() docs to make it clear that multiple directories are not allowed.
