
[Datasets] Cannot Read parquet from multiple directories in aws s3. #24598

Closed
fodrh1201 opened this issue May 9, 2022 · 3 comments · Fixed by #25747
Assignees: clarkzinzow
Labels: data (Ray Data-related issues), enhancement (Request for new feature and/or capability), P2 (Important issue, but not time-critical)

Comments


fodrh1201 commented May 9, 2022

Description

An error occurs when reading Parquet files from multiple directories in AWS S3.

Reading from multiple directories raises an OSError:

ray.data.read_parquet(["s3://bucket/path1", "s3://bucket/path2"])

OSError: Error creating dataset. Could not read schema from 's3://bucket/path1': Path does not exist 's3://bucket/path1'. Is this a 'parquet' file?

Use case

I think it would be nice if read_parquet could read from multiple directories, like read_json and read_csv can.

@fodrh1201 added the enhancement label on May 9, 2022
@jianoaix added the data label on May 11, 2022

jianoaix commented May 11, 2022

If you are blocked, you can work around this by reading each directory separately and unioning the results:
ds1 = ray.data.read_parquet(["s3://bucket/path1"])
ds2 = ray.data.read_parquet(["s3://bucket/path2"])
ds = ds1.union(ds2)
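
For more than two directories, the same pattern generalizes. A minimal sketch (the bucket paths are placeholders; Dataset.union accepts multiple datasets):

import ray

# Read each directory into its own dataset, then union them all.
paths = ["s3://bucket/path1", "s3://bucket/path2", "s3://bucket/path3"]
datasets = [ray.data.read_parquet(p) for p in paths]
ds = datasets[0].union(*datasets[1:])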

@jianoaix added the P2 label on May 11, 2022
@clarkzinzow changed the title from "[Data] Cannot Read parquet from multiple directories in aws s3." to "[Datasets] Cannot Read parquet from multiple directories in aws s3." on May 16, 2022
Joeavaikath commented

This is still an active issue; the documentation states that multiple directories are allowed. Is that only for local paths and not remote ones?

@clarkzinzow self-assigned this on Jun 14, 2022

clarkzinzow commented Jun 14, 2022

@Joeavaikath This unfortunately doesn't work with ray.data.read_parquet(), since it uses Arrow's Dataset abstraction under the hood, which doesn't accept multiple directories as a source: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html
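
To see where the limitation comes from, here is a minimal local sketch of what Arrow accepts (the layout is hypothetical): pyarrow.dataset.dataset() takes a single directory, or a list of file paths, but not a list of directories.

import os
import tempfile

import pyarrow as pa
import pyarrow.dataset as pds
import pyarrow.parquet as pq

# Write one Parquet file in each of two directories.
root = tempfile.mkdtemp()
for name in ("path1", "path2"):
    os.makedirs(os.path.join(root, name))
    pq.write_table(pa.table({"x": [1]}), os.path.join(root, name, "f.parquet"))

# A single directory works; Arrow discovers the files inside it.
pds.dataset(os.path.join(root, "path1"), format="parquet")

# A list source must contain file paths; a directory in the list raises,
# which is why read_parquet() fails on multiple directories.
try:
    pds.dataset([os.path.join(root, "path1"), os.path.join(root, "path2")],
                format="parquet")
except (IsADirectoryError, OSError) as err:
    print(err)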

In addition to the workaround that @jianoaix mentioned, we also have a ray.data.read_parquet_bulk() API, which does not use Arrow's Dataset abstraction and works with multiple directories when the right metadata provider is given:

import ray

# read_parquet_bulk() reads files directly instead of going through
# Arrow's Dataset abstraction; DefaultFileMetadataProvider expands each
# directory into the files it contains.
ds = ray.data.read_parquet_bulk(
    ["s3://ursa-labs-taxi-data/2019/01", "s3://ursa-labs-taxi-data/2019/02"],
    meta_provider=ray.data.datasource.DefaultFileMetadataProvider(),
)

Note that this Parquet reading API will not inline Hive directory-partitioned key-value pairs into the table, so your partition columns must exist in the Parquet files themselves, not just in the directory path.
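
A minimal local sketch of that caveat (the paths are hypothetical; pyarrow's partitioned writer is used for illustration):

import glob

import pyarrow as pa
import pyarrow.parquet as pq

# A Hive-partitioned write produces /tmp/demo/year=2019/<file>.parquet
# and physically drops the "year" column from the written files.
table = pa.table({"year": [2019, 2019], "value": [1, 2]})
pq.write_to_dataset(table, "/tmp/demo", partition_cols=["year"])

# The file on disk now contains only "value". read_parquet() re-derives
# "year" from the directory name; read_parquet_bulk() does not, so
# "year" would be missing from the resulting dataset.
path = glob.glob("/tmp/demo/**/*.parquet", recursive=True)[0]
print(pq.read_table(path).column_names)  # ['value']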

I'll open a PR that fixes the read_parquet() docs to make it clear that multiple directories are not allowed.
