-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow creating native tables with different formats #2053
Comments
I don't know why but when I read this I think of an old presentation around good clean code, where the prof is quoting other leaders in industry and one of the quotes is, "a method is something that does one thing well." I think adding additional storage options means adding additional readers and writers and adds a lot of maintenence. |
But the readers and writers for the given formats would already exist? I don't think end users would face any additional maintenance as it'd be an opt-in feature. Since it is opt-in, I see this as a feature for "power users" that are aware of the tradeoffs of using different storage formats. |
Format SelectivityHaving a way to specify data format (as well as some tuning options like chunk/batch size and per-column compression formats, potentially) seems like a great capability. I think I'm vaguely partial to Having said that I'd characterize all of it as "column oriented" and therefore OLAP centric, in all of its form (not just delta, but datafusion itself). Automatic Format Detection
This is definitely a feature that sounds nice, but I think would be a lot of different work. I also can't necessarily imagine the situation/user who knows the schema of the table that they're creating but doesn't know what the table's going to be used for? Also what happens when the table is going to be used for more than one thing? How should people decide? (what do we do in this situation?) Is there a situation where people might be doing both kinds of workloads over the same data set? In a lot of ways, these data format decisions are sort of akin to "deciding which indexes to build for a table" in a OLTP database, and "which fields to index" is a thing which you can do "automatically" (though it's hard to write really good code that does that as well automatically/generically as a human would do?) In a lot of ways the whole history and design of relational databases (and maybe databases themselves) revolve around "providing users with the ability to tell the database engine how the user expects the data to be used." The other sort of challenge with the "automatic view" operation is that it makes adding new/arbitrary data formats harder, as you have to predict the workload, potentially? That seems like a pretty high burden. I'm also aware that when you create a table, we won't, yet, have the information about the usage statistics, and we also don't have a system for collecting (and storing and retaining) that data, so there are a lot of pieces. |
I've been poking around some of this code in the last week, and I was confused for a moment. I think we should definitely call this "engine" [delta, lance] as "format" made me think about CSV/JSON/etc. |
Description
With the upcoming support of lance format I'd like to be able to create native tables that use lance instead of delta
Some dialects use the
USING
keyword. others (clickhouse) useENGINE =
. Probably some other keywords used in other dialects as well.As a potential followup:
I think it would be cool if we could inspect the schema of the table (maybe statistics too) to determine the optimal table representation.
For example, Lance is highly optimized for ml use cases
FixedSizeList, Binary, ...
while Delta is optimized for more traditional OLAP workloads.We could likely infer if the workload is ML, or standard OLAP based off of the schema + statistics. Then we could use different backends (delta or lance) depending on the workload.
The text was updated successfully, but these errors were encountered: