Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV uploads for ClickHouse Cloud #236

Merged
merged 11 commits into from
May 17, 2024
Merged

Conversation

calherries
Copy link
Contributor

@calherries calherries commented May 10, 2024

Summary

Issue tracking CSV uploads: #230

This PR adds support for the uploads driver feature for ClickHouse. The feature is currently only supported for ClickHouse Cloud. It requires Metabase x.49.11 x.49.12 to work correctly.

Limitations:

  • OffsetDateTime values in the CSV e.g. "2022-01-01T00:00:00+01" are inserted as strings, while in other officially supported drivers they would be inserted as a type with a time zone. This is blocked by Add fully fledged time zones support #200
  • There is no automatically generated primary key column. This is because ClickHouse doesn't support an auto-incrementing type like SERIAL in PostgreSQL.

Checklist

Delete items not relevant to your PR:

  • Unit and integration tests covering the common scenarios were added
  • A human-readable description of the changes was provided to include in CHANGELOG
  • For significant changes, documentation in https://github.com/ClickHouse/clickhouse-docs was updated with further explanations or tutorials

@CLAassistant
Copy link

CLAassistant commented May 10, 2024

CLA assistant check
All committers have signed the CLA.

@calherries calherries marked this pull request as draft May 10, 2024 19:30
(let [sql (create-table!-sql driver table-name column-definitions :primary-key primary-key)]
(qp.writeback/execute-write-sql! db-id sql)))

(defmethod driver/insert-into! :clickhouse
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than use the default sql-jdbc implementation, I've chosen to use the approach recommended here, to allow for large uploads.

Copy link
Contributor

@crisptrutski crisptrutski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, for what it's worth

(doseq [row values]
(when (seq row)
(doseq [[idx v] (map-indexed (fn [x y] [(inc x) y]) row)]
(condp isa? (type v)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not a bottleneck, but making a protocol like (ps-set! v ps idx) could remove the reflection of type and the O(n) branching of condp.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, how do you imagine this being implemented exactly? The tricky part is this multimethod doesn't know the type of each column without using reflection, and values can be nil. I'm imagining this:

  • for each value, call (ps-set! v ps idx)
  • px-set! looks up the method to use for the idx in a volatile map, e.g. setString etc. If the map contains a method for the idx, it uses it. Otherwise calculate the method using a similar condp expression and save it to the map under the key idx, and execute the method.

That way we'll only use reflection once for every column.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing something, but I was thinking we'd just implement the protocol directly on each of these java types for v, eschewing any reflection or map look-up. I think you can just use Object for the fallback case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, anyway this seems pretty inconsequential compared to all the slow stuff we're doing on the Metabase side. I think I'd prefer to keep it simple and not change this condp for now

src/metabase/driver/clickhouse.clj Show resolved Hide resolved
@calherries
Copy link
Contributor Author

calherries commented May 15, 2024

@slvrtrn I have a few questions:
(1) Is there a way to run tests with ClickHouse Cloud deployments?
(2) Is there an alternative way to determine the number of replicas without using system.macros, as you suggested here? I can create a ClickHouse cloud user without sufficient privileges to select from the system.macros table.

Executing this query

SELECT count(*) AS nodes_count FROM system.clusters c JOIN system.macros m ON c.cluster = m.substitution

gives me this error

> Not enough privileges. To execute this query, it's necessary to have the grant SELECT(substitution) ON system.macros

(3) If there's no solution to (2), would you be willing to accept this contribution if it only worked for ClickHouse Cloud in the meantime? That's the deployment type we care about most. With this commit I've updated the PR so we only support ClickHouse Cloud.

@slvrtrn
Copy link
Collaborator

slvrtrn commented May 15, 2024

(1) — You can provide the hostname and port via the env variables; IIRC, it won't work correctly unless you add wait_end_of_query=1 to the JDBC connection string, cause it will not wait until the tables are created on all replicas, and the driver might immediately call the wrong one.

(2) - I am only aware of the mentioned method

I can create a ClickHouse cloud user without sufficient privileges to select from the system.macros table.

You don't need this in CH Cloud. It's necessary only for on-premise. But yes, these grants are one of the main pain points.

(3) It will work for single-node and ClickHouse Cloud, then. But we need a way to disable it for the on-premise cluster if it's detected. Can we do that dynamically? It's the feature that is enabled or disabled on the driver level, not on the connection level.

@calherries
Copy link
Contributor Author

calherries commented May 15, 2024

(1) Thanks, I can run the tests locally with ClickHouse Cloud. I should have been more specific with my question: Is there a way to run the tests that run with Github Actions with ClickHouse Cloud deployments?

(2) Ok thanks for clarifying

(3) How can we know whether it's an on-premise cluster vs on-premise single node without getting the node_count? You only suggested an approach using node_count in your comment here.

is_cloud = 0, nodes_count = 1 -> on-premise single node

@calherries
Copy link
Contributor Author

@slvrtrn tagging you because you might have missed my last comment above, where I didn't tag you

@slvrtrn
Copy link
Collaborator

slvrtrn commented May 15, 2024

@calherries
(1) - currently, no. We need to add this; I will do that as a follow-up (after both this and #232 are merged).
(3) - this is the best option I am aware of. I will clarify if there is anything else available, but chances are that that's it. The grants issue is annoying, but it's necessary only for on-premise deployments (and not CH Cloud, as you don't need this query there); we can add a doc entry about the required grants (after we add on-prem clusters support as a follow-up PR).

I have another question, though. Is there a way to disable a particular feature for a specific connection? These are defined in the driver/database-supports? override, but a cleaner solution would've been to enable it based on the deployment type (i.e. after we have the connection details). This is useful in this PR atm and also in #232 cause the impersonation (another feature defined via an override) should be enabled only on the most recent CH version.

@calherries calherries marked this pull request as ready for review May 16, 2024 02:00
@calherries calherries marked this pull request as draft May 16, 2024 02:00
@calherries
Copy link
Contributor Author

calherries commented May 16, 2024

(3) Okay, for now I'll support CH Cloud only.

Is there a way to disable a particular feature for a specific connection?

Yes, we generally use driver/database-supports? for this, you can have the result depend on the version or the connection details. See this example with MongoDB. I've also done a similar thing for this PR in the last commit 6b8935b.

I think this PR is ready to be reviewed now, but we'll need to wait for the changes in this PR to be released in the 49 branch before there is a released Metabase version that is compatible with this.

@calherries calherries marked this pull request as ready for review May 16, 2024 04:00
@calherries
Copy link
Contributor Author

For this:

A human-readable description of the changes was provided to include in CHANGELOG

What version should these changes under in the changelog?

@slvrtrn
Copy link
Collaborator

slvrtrn commented May 16, 2024

What version should these changes under in the changelog?

I will add this, don't worry. Otherwise, there will be conflicts with #232

you can have the result depend on the version or the connection details. See this example with MongoDB. I've also done a similar thing for this PR in the last commit 6b8935b.

Thanks, this is great.

:quoted true
:dialect (sql.qp/quote-style driver)))
"ENGINE = MergeTree"
(format "PRIMARY KEY (%s)" (str/join ", " (map quote-name primary-key)))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can drop this and put the primary-key in the ORDER BY clause instead.

i.e. the table definition should be just:

CREATE TABLE foo
(col1 Type, ...)
ENGINE MergeTree
ORDER BY (primary-key)

Copy link
Collaborator

@slvrtrn slvrtrn May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See also: https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#order_by

ClickHouse uses the sorting key as a primary key if the primary key is not defined explicitly by the PRIMARY KEY clause.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I've made that change Use order by to specify primary key

@slvrtrn
Copy link
Collaborator

slvrtrn commented May 16, 2024

btw

Resolves #230

It doesn't; IIUC, the user that requested this feature, uses on-prem.

@calherries calherries changed the title CSV uploads CSV uploads for ClickHouse Cloud May 16, 2024
@slvrtrn
Copy link
Collaborator

slvrtrn commented May 16, 2024

@calherries, some tests are failing on the CI.

There is one more thing.
In my issue comment, I mentioned

To immediately get the data from any replica regardless of its sync status, it’s also better to add select_sequential_consistency=1. This will guarantee that immediately after the insertion, we can query any node in the cluster and get the data back.

This is an absolute necessity if we add the Cloud test run cause we need to immediately get the test data back after the table is created; otherwise, we will experience weird flakiness during the runs. This also can happen in real-world usage (albeit more rarely).

@slvrtrn slvrtrn mentioned this pull request May 16, 2024
3 tasks
@calherries
Copy link
Contributor Author

calherries commented May 16, 2024

@slvrtrn

select_sequential_consistency=1.

I struggled with this, because we're using a connection pool and don't have any way to set settings in a SQL query. I've made an attempt here but I'm concerned about the performance implications of having this set for every query, even if it's only needed for queries on uploaded tables with CH Cloud.
set select_sequential_consistency in connection details

update: I've implemented a workaround with Avoid setting select_sequential_consistency for on-premise

(defmethod sql-jdbc.conn/connection-details->spec :clickhouse
[_ details]
(cond-> (connection-details->spec* details)
(cloud? details)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understand, select_sequential_consistency just limits which replica you query from, and doesn't affect the performance of the query on that specific replica.

This logic seems like it will break consistency for multi-replica on prem installations, and needlessly disable it on single replica ones. It seems best to just to enable it unconditionally for now?

For any multi-replica installation, enabling it for all connections seems to wipe out most of the benefits of connecting to a distributed installation, but I don't see any way around this besides reworking metabase to support separate pools for different connection flavors.

Copy link
Contributor Author

@calherries calherries May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're mistaking the behaviour of select_sequential_consistency on CH Cloud (docs here), thinking it's the same as on-premise (the link you shared). We're only enabling select_sequential_consistency on CH Cloud, where it's much less of an issue.

There's more on SharedMergeTree in this blog post.

Unlike ReplicatedMergeTree, SharedMergeTree doesn't require replicas to communicate with each other. Instead, all communication happens through shared storage and clickhouse-keeper. SharedMergeTree implements asynchronous leaderless replication and uses clickhouse-keeper for coordination and metadata storage. This means that metadata doesn’t need to be replicated as your service scales up and down. This leads to faster replication, mutation, merges and scale-up operations. SharedMergeTree allows for hundreds of replicas for each table, making it possible to dynamically scale without shards. A distributed query execution approach is used in ClickHouse Cloud to utilize more compute resources for a query.

If your use case requires consistency guarantees that each server is delivering the same query result, then you can run the SYNC REPLICA system statement, which is a much more lightweight operation with the SharedMergeTree. Instead of syncing data (or metadata with zero-copy replication) between servers, each server just needs to fetch the current version of metadata from Keeper.

And I can't imagine fetching metadata from Keeper can be a bottleneck. I also got this DM from Serge on select_sequential_consistency:

With CH Cloud, there should not be any issues with that, as it uses “shared” merge tree and this is merely a metadata sync before executing the query (this is super fast, as it is a fetch of the meta from zookeeper).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, select_sequential_consistency is correctly enabled for the Cloud type of instance only.

For on-premise clusters (in a follow-up PR), we should just set insert_quorum equal to the nodes count so that replica sync won't be necessary and will be effectively done during the CSV insert, and it won't affect other read-only operations.

@slvrtrn
Copy link
Collaborator

slvrtrn commented May 17, 2024

Let's bump the MB version to 0.49.11 in .github/workflows/check.yml#L20 and merge if the CI is green.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants