Investigate creating a Command mode or ObjectStore tab #29

Open
matthewmturner opened this issue Mar 21, 2022 · 6 comments
Labels
enhancement New feature or request

Comments

@matthewmturner
Collaborator

This would enable things like using ObjectStore.list_file to see the file names that you can register. It lets you stay in context / in the terminal if you don't know the exact name of the file you are looking to use.

@matthewmturner
Collaborator Author

Alternatively, if this will only have object store info, we can just add an ObjectStore tab that shows the file structure.

@matthewmturner matthewmturner changed the title Investigate creating a Command mode Investigate creating a Command mode or ObjectStore tab Mar 21, 2022
@matthewmturner
Collaborator Author

@jychen7 would you find value in having a tab like ObjectStore added where you could search / list files?

The use case I'm thinking of is having files with timestamps in the name that likely wouldn't be preloaded through .datafusionrc. I think it would be useful to list those files with datafusion-tui so I could then easily use CREATE EXTERNAL TABLE with the appropriate location.
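As a rough illustration of that last step, here is a minimal sketch of turning a listed file path into a DDL statement. The function name `create_external_table_sql` is hypothetical and not part of datafusion-tui; the table name is supplied by the caller, and the file is assumed to be Parquet.

```rust
/// Builds a DataFusion `CREATE EXTERNAL TABLE` statement for a Parquet file.
/// Hypothetical helper: `table_name` and `location` come from the caller,
/// for example from a file picked out of an object store listing.
fn create_external_table_sql(table_name: &str, location: &str) -> String {
    // Double any single quotes so the SQL string literal stays well-formed.
    let escaped = location.replace('\'', "''");
    format!(
        "CREATE EXTERNAL TABLE {} STORED AS PARQUET LOCATION '{}'",
        table_name, escaped
    )
}
```

The resulting string could then be pasted or sent to the SQL editor once the right file has been found in the listing.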

@matthewmturner matthewmturner added the enhancement New feature or request label Mar 23, 2022
@jychen7
Member

jychen7 commented Mar 24, 2022

Given that object stores like S3/GCS don't really have the concept of a "folder", I am afraid list_objects may not provide much benefit, since the data lake can be very large.

Consider a bucket with the following files:

s3://my_bucket/kafka_topic_1/year=x/month=x/day=x/hour=x/timestamp.parquet
s3://my_bucket/kafka_topic_2/year=x/month=x/day=x/hour=x/timestamp.parquet
s3://my_bucket/kafka_topic_3/year=x/month=x/day=x/hour=x/timestamp.parquet

It would be beneficial to know which "tables" can be queried under my_bucket, e.g. "kafka_topic_1" to "kafka_topic_3", so I can use DDL to register them.

But it seems the only way to know which "tables" are there is to scan all files/objects, which doesn't seem worth it.

@matthewmturner
Collaborator Author

While I totally agree, I also think there is value in having the functionality for sizes below the point where it becomes impractical. For buckets that are too large, I also want to check out some of the S3 APIs in more detail; maybe there's something there that would prevent scanning all files.

@jychen7
Member

jychen7 commented Mar 24, 2022

maybe there's something there that would prevent scanning all files

Oh, you are right, it seems the following fits the need. In that case I think it is valuable to show the CommonPrefixes in a bucket:

All other keys contain the delimiter character. Amazon S3 groups these keys and returns a single CommonPrefixes element with the prefix value photos/
https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html
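To make the grouping concrete, here is a small local simulation of how a delimiter-based listing rolls keys up into common prefixes. It makes no network calls and is not the S3 or ObjectStore API; `Listing` and `list_with_delimiter` are hypothetical names, and keys are written without the `s3://my_bucket/` part. In the real service the grouping happens server-side, so the client would only receive one entry per prefix.

```rust
use std::collections::BTreeSet;

/// Result of a delimiter-based listing: keys directly under the prefix
/// and the grouped prefixes one level below it.
#[derive(Debug, PartialEq)]
struct Listing {
    keys: Vec<String>,
    common_prefixes: Vec<String>,
}

/// Local simulation of S3-style grouping over an in-memory key list.
fn list_with_delimiter(all_keys: &[&str], prefix: &str, delimiter: char) -> Listing {
    let mut keys = Vec::new();
    let mut common = BTreeSet::new();
    for &key in all_keys {
        // Skip keys that are outside the requested prefix.
        let Some(rest) = key.strip_prefix(prefix) else { continue };
        match rest.find(delimiter) {
            // Roll everything up to and including the first delimiter
            // after the prefix into a single common prefix.
            Some(pos) => {
                let end = prefix.len() + pos + delimiter.len_utf8();
                common.insert(key[..end].to_string());
            }
            // No delimiter left: the key sits directly under the prefix.
            None => keys.push(key.to_string()),
        }
    }
    Listing {
        keys,
        common_prefixes: common.into_iter().collect(),
    }
}
```

Called with an empty prefix and `'/'` on the three example paths from this thread, the sketch yields `kafka_topic_1/`, `kafka_topic_2/` and `kafka_topic_3/` as common prefixes, which are the "tables" one would want to show in the tab.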

@matthewmturner
Collaborator Author

Yup. Also:


List implementation efficiency

List performance is not substantially affected by the total number of keys in your bucket. It's also not affected by the presence or absence of the prefix, marker, maxkeys, or delimiter arguments.

Iterating through multipage results

As buckets can contain a virtually unlimited number of keys, the complete results of a list query can be extremely large. To manage large result sets, the Amazon S3 API supports pagination to split them into multiple responses. Each list keys response returns a page of up to 1,000 keys with an indicator indicating if the response is truncated. You send a series of list keys requests until you have received all the keys. AWS SDK wrapper libraries provide the same pagination.
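The request loop described there can be sketched as follows. This is only an in-memory stand-in: `Page`, `list_page` and `list_all` are hypothetical names, and the `usize` token is an assumption standing in for whatever continuation marker the real API returns.

```rust
/// One page of a paginated listing.
struct Page {
    keys: Vec<String>,
    /// Where the next request should resume; `None` means the listing
    /// is complete (the response was not truncated).
    next_token: Option<usize>,
}

/// Hypothetical stand-in for a single list request against an object store.
fn list_page(all_keys: &[String], token: Option<usize>, max_keys: usize) -> Page {
    let start = token.unwrap_or(0).min(all_keys.len());
    let end = (start + max_keys).min(all_keys.len());
    Page {
        keys: all_keys[start..end].to_vec(),
        next_token: if end < all_keys.len() { Some(end) } else { None },
    }
}

/// Sends list requests until the response is no longer truncated.
/// Returns every key plus the number of requests that were made.
fn list_all(all_keys: &[String], max_keys: usize) -> (Vec<String>, usize) {
    assert!(max_keys > 0, "max_keys must be positive or the loop cannot advance");
    let mut collected = Vec::new();
    let mut token = None;
    let mut requests = 0;
    loop {
        let page = list_page(all_keys, token, max_keys);
        requests += 1;
        collected.extend(page.keys);
        match page.next_token {
            Some(next) => token = Some(next),
            None => return (collected, requests),
        }
    }
}
```

With a page size of 1,000, listing 2,500 keys takes three requests in this sketch, so a tab that only needs the first page or the common prefixes would not have to walk the whole bucket.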
