Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(clickhouse): add implementations of create_table, create_view, and create_database #4316

Closed
mharrisb1 opened this issue Aug 5, 2022 · 4 comments · Fixed by #5736
Closed
Labels
clickhouse The ClickHouse backend ddl Issues related to creating or altering data definitions feature Features or general enhancements ux User experience related issues
Milestone

Comments

@mharrisb1
Copy link
Contributor

mharrisb1 commented Aug 5, 2022

All of these are accomplishable through raw_sql method but it would be nice for these to be fully implemented. In addition, it would also be useful to have the inverse methods implemented: drop_table, drop_view, and drop_database.

Proposal

From my experience with Clickhouse, and some experience with implementing these for my own use cases, I've outlined a brief propsal for implementing these methods. Feedback would be greatly appreciated.

Create Table

This one would be the most complicated to implement. There would be two major use cases:

  1. Standard table creation (from Schema object)
  2. Create table from Expr object

Implementation for the first use case would also be used in the second.

For the first use case, a Schema object would be passed in. You would be able to grab the name and type for each field of the table from that object.

For the second use case, an Expr object would be passed in. You would be able to grab the name and type of each field from the Expr schema attribute using the same code as the previous use case.

One thing to note for creating a table from an expression: in my experience it is better to call the CREATE TABLE... DDL statement and then separately call the INSERT ... statement instead of CREATE TABLE ... AS ....

You would also need to check to see if the table already exists. If it exists, there are a few ways to replace it. You could create a temp table and once that is complete drop the target table and rename the temp table. This is how dbt-clickhouse handles it but I've found this can present challenges with the ReplicatedMergeTree table engine. Another method is to drop the target table and then run the DDL and insertion statements. The downside to this is that there is a longer period of time where not table with the identifier will be available for users to query.

Backing up a bit, there is also the question of whether a full replace is desirable to the user. Should the user have an option to abandon the process if the table already exists?

There are also a number of additional options that would need to be available to the user when creating a table:

Option Description Argument Type
Database The database to create the table in. If not provided, then the session database passed in to connect() should be used str
Table name Since not all Expr child objects have a name attribute, this would need to be provided by the user str
Engine Clickhouse table engine str
On Cluster Specify a cluster str
Partition By Values to partition the table by. Should be a tuple of strings. Could potentially use column objects but that might be more difficult for defining partition keys wrapped in a method (i.e. toYYYYMM(some_timestamp)) tuple[str]
Order By Sorting keys tuple[str]
Primary Key Indexes tuple[str]
Settings Key-value pairs of additional table settings dict[str, Any]

Create View

This one would be much easier to implement and Clickhouse provides the standard CREATE OR REPLACE VIEW ... option. This would just be created from an Expr object.

Only the following options would need to be provided to the user:

Option Description Argument Type
Database The database to create the table in. If not provided, then the session database passed in to connect() should be used str
View name Since not all Expr child objects have a name attribute, this would need to be provided by the user str
On Cluster Specify a cluster str

Create Database

This one is likely the lowest priority out of the three. It would also likely be the easiest to implement. The user would need to be presented the following options:

Option Description Argument Type
Database Name Database name str
On Cluster Specify a cluster str
Engine Clickhouse database engine str

IF NOT EXISTS Behavior

It would likely be beneficial to also allow some way to only create the object if it doesn't already exist.

[TODO]

Notes

I'll keep working on this proposal and flesh out some ideas in a project I already have going. I'm pretty sure I've got the codegen down for the create table piece but I'll need to do some more digging to see how it is already handled in Ibis so I'm following the current approaches.

I'm happy to be the one to implement this over a weekend or two but I'd like some feedback before I dive into the code.

@mharrisb1
Copy link
Contributor Author

An alternative to dropping a table if it already exists is to truncate the table if the schema passed in and the table schema match

@cpcloud cpcloud added feature Features or general enhancements ux User experience related issues clickhouse The ClickHouse backend ddl Issues related to creating or altering data definitions labels Aug 5, 2022
@cpcloud cpcloud added this to the 3.2.0 milestone Aug 5, 2022
@cpcloud
Copy link
Member

cpcloud commented Aug 5, 2022

@mharrisb1 This is a really great proposal.

I'll go in reverse order to get the easier bits out of the way first.

Let's dive in.

Preliminaries

It'd be great if each of the CREATE TABLE/VIEW/DATABASE are done as three separate feat(clickhouse) PRs! If you want to include the DROP ... operations in each of those PRs that's fine, and a follow-up afterwards is fine too.

Backend-specific Keyword Arguments

We'll likely have to change the base class create_foo signatures to accept **kwargs so that child backend class create_foo implementations can accept backend-specific arguments. That's fine and should be a backwards compatible change.

APIs

create_database

Agree with your assessment here. Probably easy to implement and likely not a heavily used API so not a high priority. That said, it's your proposal, so feel free to prioritize as you like.

create_view

What you've proposed looks good to me. This in theory would also allow the .sql method to work with ClickHouse, which is pretty nice!

In terms of difficulty and priority, I think you're spot-on there as well: easier than create_table and less often used, but more often used than create_database.

create_table

This is where the fun begins 😅.

Implementation for the first use case would also be used in the second.

Sounds good. IIRC, the way this works in the other create_table implementations is to always create the table first. This happens inside a single method body, so it's not literally calling another implementation, but of course that is up to the implementer.

For the first use case, a Schema object would be passed in. You would be able to grab the name and type for each field of the table from that object.

Yup, sounds right! Incidentally, I've added API docs for schema in #4318. The way to iterate over the pairs of name, type is using the .items() method on Schema objects.

For the second use case, an Expr object would be passed in. You would be able to grab the name and type of each field from the Expr schema attribute using the same code as the previous use case.

Yep! You can get the schema of a TableExpr using its .schema() method.

One thing to note for creating a table from an expression: in my experience it is better to call the CREATE TABLE... DDL statement and then separately call the INSERT ... statement instead of CREATE TABLE ... AS ....

It sounds like you've got a lot of ClickHouse experience, so IMO this is entirely up to you.

You would also need to check to see if the table already exists. If it exists, there are a few ways to replace it. You could create a temp table and once that is complete drop the target table and rename the temp table. This is how dbt-clickhouse handles it but I've found this can present challenges with the ReplicatedMergeTree table engine. Another method is to drop the target table and then run the DDL and insertion statements. The downside to this is that there is a longer period of time where not table with the identifier will be available for users to query.

It might make implementation and review of the PR for create_table easier if we forego replacement since we don't yet have an API to do this using create_table. We're working on making this possible but it'll be a bit before we have something.

Backing up a bit, there is also the question of whether a full replace is desirable to the user. Should the user have an option to abandon the process if the table already exists?

Right now, this is the default behavior: force=False will compile to CREATE TABLE whereas force=True will compile to CREATE TABLE IF NOT EXISTS and the backend will fail if the table already exists. I don't believe we have any backends that both support SQL and have REPLACE behavior with CREATE TABLE by default.

Let me know if there's anything else on which you'd like me to give feedback!

Looking forward to your PRs.

@cpcloud
Copy link
Member

cpcloud commented Aug 15, 2022

@mharrisb1 Thoughts here?

I know @saulpw has been thinking about DDL-in-ibis recently and likely also has some thoughts here.

@mharrisb1
Copy link
Contributor Author

@cpcloud that all sounds good 👍 thanks for the point-by-point feedback

If @saulpw has a general approach to DDL in Ibis I would love to follow their lead. Either using another backend as a reference or guinea-pig some ideas with Clickhouse.

We've currently have an in-house solution for this so there is no way I can convince my team to use company time to implement this in Ibis so I wouldn't count on this being done (at least well) anytime soon.

I think a good first step here could be creating a topic in Discussions about Ibis + DDL to get some community feedback if people are interested

@cpcloud cpcloud modified the milestones: 3.2.0, 4.0.0 Aug 31, 2022
@cpcloud cpcloud removed this from the 4.0.0 milestone Nov 7, 2022
@cpcloud cpcloud added this to the 5.0 milestone Jan 30, 2023
@cpcloud cpcloud changed the title feat: Add implementations for Clickhouse create_table, create_view, and create_database feat(clickhouse): add implementations of create_table, create_view, and create_database Feb 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clickhouse The ClickHouse backend ddl Issues related to creating or altering data definitions feature Features or general enhancements ux User experience related issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants