[QUESTION] Register multiple S3 parquet files as one table? #442

Closed
danthegoodman1 opened this issue Jul 30, 2023 · 2 comments · Fixed by #460

Comments

@danthegoodman1

Let's say I have a known list of files in S3, such as s3://bucket/file1.parquet, s3://bucket/file2.parquet, etc.

How can I register them all under the same table so I could do something like select count(*) from parquet_files?

@danthegoodman1 danthegoodman1 changed the title [QUESTION] Register multipl S3 parquet files as one table? [QUESTION] Register multiple S3 parquet files as one table? Jul 30, 2023
@mesejo
Contributor

mesejo commented Aug 21, 2023

Hey! If all the files are in the same bucket, you can register the bucket path itself as a single table:

import os
import datafusion
from datafusion.object_store import AmazonS3

region = "us-east-1"
bucket_name = "yellow-trips"

# Credentials are read from the environment rather than hard-coded.
s3 = AmazonS3(
    bucket_name=bucket_name,
    region=region,
    access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
)

ctx = datafusion.SessionContext()
path = f"s3://{bucket_name}/"
# Make the bucket reachable from this session.
ctx.register_object_store(path, s3)

# Register the Parquet files under the bucket path as one table named "trips".
ctx.register_parquet("trips", path)

df = ctx.sql("select count(passenger_count) from trips")
df.show()

DataFusion version used:

>>> datafusion.__version__
'28.0.0'
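
For a known list of files like the one in the question, the path to register can be derived from the list itself. Below is a minimal sketch, assuming the files live in one bucket. `common_s3_prefix` is a hypothetical helper (not part of DataFusion) that uses only the standard library. It returns the deepest shared prefix, which could then be passed as `path` in the snippet above.

```python
import posixpath
from urllib.parse import urlparse


def common_s3_prefix(urls):
    """Return the deepest s3:// prefix shared by all URLs (hypothetical helper)."""
    if not urls:
        raise ValueError("expected at least one URL")
    parsed = [urlparse(u) for u in urls]
    if any(p.scheme != "s3" for p in parsed):
        raise ValueError("all URLs must use the s3:// scheme")
    buckets = {p.netloc for p in parsed}
    if len(buckets) != 1:
        raise ValueError(f"files span several buckets: {sorted(buckets)}")
    # Compare the directory part of each key, not the file name itself.
    dirs = [posixpath.dirname(p.path) or "/" for p in parsed]
    prefix = posixpath.commonpath(dirs).strip("/")
    bucket = buckets.pop()
    return f"s3://{bucket}/{prefix}/" if prefix else f"s3://{bucket}/"


files = ["s3://bucket/file1.parquet", "s3://bucket/file2.parquet"]
path = common_s3_prefix(files)  # candidate path for register_parquet
```

Note that registering a prefix will most likely pick up every Parquet file under it, not only the listed ones, so this fits best when the prefix contains nothing else.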

@danthegoodman1
Author

@mesejo I guess I can do that with the S3 proxy I've made to trick clients, thanks
