CSV uploads for ClickHouse Cloud #236
Conversation
(let [sql (create-table!-sql driver table-name column-definitions :primary-key primary-key)]
  (qp.writeback/execute-write-sql! db-id sql)))

(defmethod driver/insert-into! :clickhouse
Rather than use the default sql-jdbc implementation, I've chosen to use the approach recommended here, to allow for large uploads.
LGTM, for what it's worth
(doseq [row values]
  (when (seq row)
    (doseq [[idx v] (map-indexed (fn [x y] [(inc x) y]) row)]
      (condp isa? (type v)
Probably not a bottleneck, but making a protocol like (ps-set! v ps idx) could remove the reflection of type and the O(n) branching of condp.
Yes, how do you imagine this being implemented exactly? The tricky part is that this multimethod doesn't know the type of each column without using reflection, and values can be nil. I'm imagining this:
- for each value, call (ps-set! v ps idx)
- ps-set! looks up the method to use for the idx in a volatile map, e.g. setString etc. If the map contains a method for the idx, it uses it. Otherwise it calculates the method using a similar condp expression, saves it to the map under the key idx, and executes the method.

That way we'll only use reflection once for every column.
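A rough sketch of what the volatile-map idea described above could look like (all names here are illustrative, not the driver's actual code; the condp clauses would cover the full set of column types):

```clojure
;; Sketch only: memoize the chosen setter per column index so the
;; condp/reflection cost is paid once per column, not once per value.
(defn make-ps-set! []
  (let [setters (volatile! {})]
    (fn ps-set! [v ps idx]
      (let [setter (or (get @setters idx)
                       (let [f (condp instance? v
                                 String (fn [ps idx v] (.setString ps idx v))
                                 Long   (fn [ps idx v] (.setLong ps idx v))
                                 ;; fallback: let the JDBC driver coerce
                                 (fn [ps idx v] (.setObject ps idx v)))]
                         (vswap! setters assoc idx f)
                         f))]
        (setter ps idx v)))))
```

One wrinkle with this per-index cache: if the first value seen for a column is nil, the fallback setter gets cached for that column, so the cache key would really need to account for nil values separately.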
Maybe I'm missing something, but I was thinking we'd just implement the protocol directly on each of these java types for v, eschewing any reflection or map look-up. I think you can just use Object for the fallback case.
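A minimal sketch of that direct protocol extension, under the same assumed names (the type coverage here is illustrative; the real list would mirror the existing condp clauses):

```clojure
;; Sketch only: the JVM dispatches on the concrete type of v,
;; so there is no reflection and no per-index cache.
(defprotocol PSSet
  (ps-set! [v ps i]))

(extend-protocol PSSet
  String
  (ps-set! [v ^java.sql.PreparedStatement ps i] (.setString ps (int i) v))
  Long
  (ps-set! [v ^java.sql.PreparedStatement ps i] (.setLong ps (int i) v))
  nil
  (ps-set! [_ ^java.sql.PreparedStatement ps i] (.setObject ps (int i) nil))
  Object
  (ps-set! [v ^java.sql.PreparedStatement ps i] (.setObject ps (int i) v)))
```

Note that extend-protocol accepts nil as a dispatch "type", which handles the nil-values concern raised above without any special-casing at the call site.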
Right, anyway this seems pretty inconsequential compared to all the slow stuff we're doing on the Metabase side. I think I'd prefer to keep it simple and not change this condp for now.
@slvrtrn I have a few questions: Executing this query gives me this error.

(3) If there's no solution to (2), would you be willing to accept this contribution if it only worked for ClickHouse Cloud in the meantime? That's the deployment type we care about most. With this commit I've updated the PR so we only support ClickHouse Cloud.
(1) You can provide the hostname and port via the env variables; IIRC, it won't work correctly unless you add
(2) I am only aware of the mentioned method.

You don't need this in CH Cloud. It's necessary only for on-premise. But yes, these grants are one of the main pain points.

(3) It will work for single-node and ClickHouse Cloud, then. But we need a way to disable it for the on-premise cluster if it's detected. Can we do that dynamically? It's a feature that is enabled or disabled on the driver level, not on the connection level.
(1) Thanks, I can run the tests locally with ClickHouse Cloud. I should have been more specific with my question: Is there a way to run the tests that run with GitHub Actions with ClickHouse Cloud deployments?
(2) Ok, thanks for clarifying.
(3) How can we know whether it's an on-premise cluster vs. an on-premise single node without getting the node_count? You only suggested an approach using
@slvrtrn tagging you because you might have missed my last comment above, where I didn't tag you.

@calherries I have another question, though. Is there a way to disable a particular feature for a specific connection? These are defined in the
(3) Okay, for now I'll support CH Cloud only.
Yes, we generally use

I think this PR is ready to be reviewed now, but we'll need to wait for the changes in this PR to be released in the 49 branch before there is a released Metabase version that is compatible with this.

For this:

What version should these changes go under in the changelog?
I will add this, don't worry. Otherwise, there will be conflicts with #232
Thanks, this is great.
src/metabase/driver/clickhouse.clj
  :quoted true
  :dialect (sql.qp/quote-style driver)))
"ENGINE = MergeTree"
(format "PRIMARY KEY (%s)" (str/join ", " (map quote-name primary-key)))
You can drop this and put the primary-key in the ORDER BY clause instead.
i.e. the table definition should be just:
CREATE TABLE foo
(col1 Type, ...)
ENGINE MergeTree
ORDER BY (primary-key)
See also: https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#order_by
ClickHouse uses the sorting key as a primary key if the primary key is not defined explicitly by the PRIMARY KEY clause.
Nice, I've made that change: Use order by to specify primary key

btw, it doesn't; IIUC, the user that requested this feature uses on-prem.
@calherries, some tests are failing on the CI. There is one more thing.

This is an absolute necessity if we add the Cloud test run, because we need to immediately get the test data back after the table is created; otherwise, we will experience weird flakiness during the runs. This can also happen in real-world usage (albeit more rarely).

I struggled with this because we're using a connection pool and don't have any way to set settings in a SQL query. I've made an attempt here, but I'm concerned about the performance implications of having this set for every query, even if it's only needed for queries on uploaded tables with CH Cloud.

Update: I've implemented a workaround with Avoid setting select_sequential_consistency for on-premise
(defmethod sql-jdbc.conn/connection-details->spec :clickhouse
  [_ details]
  (cond-> (connection-details->spec* details)
    (cloud? details)
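For context, a hedged sketch of how a cond-> like the one above might finish — the property key and value here are assumptions for illustration, not necessarily what the PR actually does:

```clojure
;; Sketch only: assumes the ClickHouse JDBC driver accepts
;; select_sequential_consistency as a connection property, and that
;; cloud? and connection-details->spec* exist as shown in the diff.
(defmethod sql-jdbc.conn/connection-details->spec :clickhouse
  [_ details]
  (cond-> (connection-details->spec* details)
    (cloud? details)
    (assoc :select_sequential_consistency 1)))
```

The effect is that every pooled connection to a Cloud instance carries the setting, while on-premise connections are left untouched.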
From what I understand, select_sequential_consistency just limits which replica you query from, and doesn't affect the performance of the query on that specific replica.
This logic seems like it will break consistency for multi-replica on-prem installations, and needlessly disable it on single-replica ones. It seems best to just enable it unconditionally for now?
For any multi-replica installation, enabling it for all connections seems to wipe out most of the benefits of connecting to a distributed installation, but I don't see any way around this besides reworking metabase to support separate pools for different connection flavors.
I think you're mistaking the behaviour of select_sequential_consistency on CH Cloud (docs here), thinking it's the same as on-premise (the link you shared). We're only enabling select_sequential_consistency on CH Cloud, where it's much less of an issue.
There's more on SharedMergeTree in this blog post.
Unlike ReplicatedMergeTree, SharedMergeTree doesn't require replicas to communicate with each other. Instead, all communication happens through shared storage and clickhouse-keeper. SharedMergeTree implements asynchronous leaderless replication and uses clickhouse-keeper for coordination and metadata storage. This means that metadata doesn’t need to be replicated as your service scales up and down. This leads to faster replication, mutation, merges and scale-up operations. SharedMergeTree allows for hundreds of replicas for each table, making it possible to dynamically scale without shards. A distributed query execution approach is used in ClickHouse Cloud to utilize more compute resources for a query.
If your use case requires consistency guarantees that each server is delivering the same query result, then you can run the SYNC REPLICA system statement, which is a much more lightweight operation with the SharedMergeTree. Instead of syncing data (or metadata with zero-copy replication) between servers, each server just needs to fetch the current version of metadata from Keeper.
And I can't imagine fetching metadata from Keeper can be a bottleneck. I also got this DM from Serge on select_sequential_consistency:

With CH Cloud, there should not be any issues with that, as it uses “shared” merge tree and this is merely a metadata sync before executing the query (this is super fast, as it is a fetch of the meta from zookeeper).
Yes, select_sequential_consistency is correctly enabled for the Cloud type of instance only.

For on-premise clusters (in a follow-up PR), we should just set insert_quorum equal to the node count, so that replica sync won't be necessary: it will effectively be done during the CSV insert, and it won't affect other read-only operations.
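A rough sketch of that follow-up idea (table name, columns, and node count here are purely illustrative, assuming a 3-replica on-premise cluster):

```sql
-- Sketch only: making the insert wait for a write quorum of all
-- replicas means subsequent reads see the uploaded rows without
-- needing select_sequential_consistency on the read path.
INSERT INTO uploaded_csv_table
SETTINGS insert_quorum = 3
VALUES (1, 'first row'), (2, 'second row');
```

The trade-off lands on the write side: the upload blocks until all replicas acknowledge, while ordinary read queries stay untouched.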
Let's bump the MB version to 0.49.11 in .github/workflows/check.yml#L20 and merge if the CI is green.
Summary
Issue tracking CSV uploads: #230
This PR adds support for the uploads driver feature for ClickHouse. The feature is currently only supported for ClickHouse Cloud. It requires Metabase x.49.12 to work correctly.

Limitations:
- SERIAL in PostgreSQL.