ClickHouse sharding plan #8652
There are multiple parts to this. Here's my thinking right now:

Problem being solved

Also related: https://github.com/PostHog/product-internal/blob/main/requests-for-comments/2022-01-01-highscale-querying.md

Steps

1. Get schemas in sync

PostHog cloud and self-hosted currently have different schemas. (A sketch of the target schema shape follows this list.)

Related PRs:

2. Support sharding in the helm chart

Code-wise this should be relatively straightforward, as clickhouse-operator has options for this. Some known subtasks:

Related PRs:

3. Figure out how to upgrade from the old schema to the new one

This is slightly tricky since table engines are changing, meaning we can't rename data-containing tables.

An annoyance here is the differentiation between async migrations and normal clickhouse migrations. This will probably be an async migration even though it might not need to actually run anything long-running. A related gotcha is

Related PRs:

4. Rebalancing data

We should figure out a mechanism for "rebalancing" data across a cluster. Note that this isn't strictly needed right now, but it is operationally important for users scaling out. This logic can be relatively dumb for now and only work for self-hosted - e.g. we can stop ingestion if needed and even duplicate data temporarily as data is getting moved. (See the sketch after this list.)

5. Documentation, removing
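For concreteness, a minimal sketch of the target schema shape, assuming the `posthog` cluster and database names used later in this issue. The engines and keys mirror the `sharded_events` / `events` rows in the verification output below, but the column list is abbreviated and the ZooKeeper path simplified:

```sql
-- Data-bearing table: one replicated table per shard. Columns shown are
-- illustrative; the real table has many more.
CREATE TABLE sharded_events ON CLUSTER 'posthog'
(
    uuid UUID,
    event String,
    team_id Int64,
    distinct_id String,
    timestamp DateTime64(6, 'UTC'),
    _timestamp DateTime
    -- ... remaining columns elided
)
ENGINE = ReplicatedReplacingMergeTree('/clickhouse/tables/{shard}/posthog.events', '{replica}', _timestamp)
PARTITION BY toYYYYMM(timestamp)
ORDER BY (team_id, toDate(timestamp), event, cityHash64(distinct_id), cityHash64(uuid))
SAMPLE BY cityHash64(distinct_id);

-- Read/write entry point: routes rows by distinct_id so that one person's
-- events all land on a single shard.
CREATE TABLE events ON CLUSTER 'posthog' AS sharded_events
ENGINE = Distributed('posthog', 'posthog', 'sharded_events', sipHash64(distinct_id));
```

For step 4, the "relatively dumb" self-hosted approach could be as simple as re-routing rows through the Distributed table. This is a sketch under the assumptions stated in the step itself: ingestion is stopped first, and data is duplicated temporarily while it moves:

```sql
-- On each existing shard: push local rows back through the Distributed table
-- so they are re-sharded by sipHash64(distinct_id) over the new layout.
INSERT INTO events SELECT * FROM sharded_events;
-- After verifying counts, drop the old local partitions on the source shard,
-- e.g.: ALTER TABLE sharded_events DROP PARTITION 202203;
```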
See this:
So no, kind of not really, due to the design of async migrations and the fact that we need to set up new tables in clickhouse migrations.
Note: After investigating solutions for a while, I'll skip this.

Rationale:
All of the above has finished; some unexpected things have cropped up:
One last issue that needs resolving: clickhouse-operator only syncs some tables onto new nodes. See #8912 for the PostHog-specific solution, but one table we don't replicate that way is the clickhouse migrations one. I'll add a Distributed engine table to our fork of clickhouse-orm.
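For reference, a minimal sketch of that Distributed wrapper, assuming the same `posthog` cluster and database names used elsewhere in this issue; the engine matches the `infi_clickhouse_orm_migrations_distributed` rows in the test output below:

```sql
-- Distributed wrapper over the local migrations table, so migration state is
-- visible from (and writable via) any node. rand() sharding is fine here
-- since the table is tiny and never queried by key.
CREATE TABLE infi_clickhouse_orm_migrations_distributed ON CLUSTER 'posthog'
AS infi_clickhouse_orm_migrations
ENGINE = Distributed('posthog', 'posthog', 'infi_clickhouse_orm_migrations', rand());
```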
Tested this by:
cloud: "local"
image:
repository: macobo/posthog-test
sha: sha256:56d7b853ed4fffa158d869bd4877388252f96e35a628a3cba4407076d7a374a9
clickhouse:
layout:
shardsCount: 2
replicasCount: 2
SELECT
hostName() AS host,
name,
engine_full
FROM clusterAllReplicas('posthog', system, tables)
WHERE name IN ('events', 'person', 'person_distinct_id', 'session_recording_events', 'sharded_events', 'sharded_session_recording_events', 'infi_clickhouse_orm_migrations', 'infi_clickhouse_orm_migrations_distributed')
ORDER BY
name ASC,
host ASC
FORMAT Vertical
Query id: a060b40a-4a2b-4793-87dc-0534b191662d
Row 1:
──────
host: chi-posthog-posthog-0-0-0
name: events
engine_full: Distributed('posthog', 'posthog', 'sharded_events', sipHash64(distinct_id))
Row 2:
──────
host: chi-posthog-posthog-0-0-0
name: events
engine_full:
Row 3:
──────
host: chi-posthog-posthog-0-1-0
name: events
engine_full: Distributed('posthog', 'posthog', 'sharded_events', sipHash64(distinct_id))
Row 4:
──────
host: chi-posthog-posthog-0-1-0
name: events
engine_full:
Row 5:
──────
host: chi-posthog-posthog-1-0-0
name: events
engine_full: Distributed('posthog', 'posthog', 'sharded_events', sipHash64(distinct_id))
Row 6:
──────
host: chi-posthog-posthog-1-0-0
name: events
engine_full:
Row 7:
──────
host: chi-posthog-posthog-1-1-0
name: events
engine_full: Distributed('posthog', 'posthog', 'sharded_events', sipHash64(distinct_id))
Row 8:
──────
host: chi-posthog-posthog-1-1-0
name: events
engine_full:
Row 9:
───────
host: chi-posthog-posthog-0-0-0
name: infi_clickhouse_orm_migrations
engine_full: MergeTree PARTITION BY toYYYYMM(applied) ORDER BY (package_name, module_name) SETTINGS index_granularity = 8192
Row 10:
───────
host: chi-posthog-posthog-0-1-0
name: infi_clickhouse_orm_migrations
engine_full: MergeTree PARTITION BY toYYYYMM(applied) ORDER BY (package_name, module_name) SETTINGS index_granularity = 8192
Row 11:
───────
host: chi-posthog-posthog-1-0-0
name: infi_clickhouse_orm_migrations
engine_full: MergeTree PARTITION BY toYYYYMM(applied) ORDER BY (package_name, module_name) SETTINGS index_granularity = 8192
Row 12:
───────
host: chi-posthog-posthog-1-1-0
name: infi_clickhouse_orm_migrations
engine_full: MergeTree PARTITION BY toYYYYMM(applied) ORDER BY (package_name, module_name) SETTINGS index_granularity = 8192
Row 13:
───────
host: chi-posthog-posthog-0-0-0
name: infi_clickhouse_orm_migrations_distributed
engine_full: Distributed('posthog', 'posthog', 'infi_clickhouse_orm_migrations', rand())
Row 14:
───────
host: chi-posthog-posthog-0-1-0
name: infi_clickhouse_orm_migrations_distributed
engine_full: Distributed('posthog', 'posthog', 'infi_clickhouse_orm_migrations', rand())
Row 15:
───────
host: chi-posthog-posthog-1-0-0
name: infi_clickhouse_orm_migrations_distributed
engine_full: Distributed('posthog', 'posthog', 'infi_clickhouse_orm_migrations', rand())
Row 16:
───────
host: chi-posthog-posthog-1-1-0
name: infi_clickhouse_orm_migrations_distributed
engine_full: Distributed('posthog', 'posthog', 'infi_clickhouse_orm_migrations', rand())
Row 17:
───────
host: chi-posthog-posthog-0-0-0
name: person
engine_full: ReplicatedReplacingMergeTree('/clickhouse/tables/am0004_20220318084324_noshard/posthog.person', '{replica}-{shard}', _timestamp) ORDER BY (team_id, id) SETTINGS index_granularity = 8192
Row 18:
───────
host: chi-posthog-posthog-0-1-0
name: person
engine_full: ReplicatedReplacingMergeTree('/clickhouse/tables/am0004_20220318084324_noshard/posthog.person', '{replica}-{shard}', _timestamp) ORDER BY (team_id, id) SETTINGS index_granularity = 8192
Row 19:
───────
host: chi-posthog-posthog-1-0-0
name: person
engine_full: ReplicatedReplacingMergeTree('/clickhouse/tables/am0004_20220318084324_noshard/posthog.person', '{replica}-{shard}', _timestamp) ORDER BY (team_id, id) SETTINGS index_granularity = 8192
Row 20:
───────
host: chi-posthog-posthog-1-1-0
name: person
engine_full: ReplicatedReplacingMergeTree('/clickhouse/tables/am0004_20220318084324_noshard/posthog.person', '{replica}-{shard}', _timestamp) ORDER BY (team_id, id) SETTINGS index_granularity = 8192
Row 21:
───────
host: chi-posthog-posthog-0-0-0
name: person_distinct_id
engine_full: CollapsingMergeTree(_sign) ORDER BY (team_id, distinct_id, person_id) SETTINGS index_granularity = 8192
Row 22:
───────
host: chi-posthog-posthog-0-1-0
name: person_distinct_id
engine_full: CollapsingMergeTree(_sign) ORDER BY (team_id, distinct_id, person_id) SETTINGS index_granularity = 8192
Row 23:
───────
host: chi-posthog-posthog-1-0-0
name: person_distinct_id
engine_full: CollapsingMergeTree(_sign) ORDER BY (team_id, distinct_id, person_id) SETTINGS index_granularity = 8192
Row 24:
───────
host: chi-posthog-posthog-1-1-0
name: person_distinct_id
engine_full: CollapsingMergeTree(_sign) ORDER BY (team_id, distinct_id, person_id) SETTINGS index_granularity = 8192
Row 25:
───────
host: chi-posthog-posthog-0-0-0
name: session_recording_events
engine_full: Distributed('posthog', 'posthog', 'sharded_session_recording_events', sipHash64(distinct_id))
Row 26:
───────
host: chi-posthog-posthog-0-1-0
name: session_recording_events
engine_full: Distributed('posthog', 'posthog', 'sharded_session_recording_events', sipHash64(distinct_id))
Row 27:
───────
host: chi-posthog-posthog-1-0-0
name: session_recording_events
engine_full: Distributed('posthog', 'posthog', 'sharded_session_recording_events', sipHash64(distinct_id))
Row 28:
───────
host: chi-posthog-posthog-1-1-0
name: session_recording_events
engine_full: Distributed('posthog', 'posthog', 'sharded_session_recording_events', sipHash64(distinct_id))
Row 29:
───────
host: chi-posthog-posthog-0-0-0
name: sharded_events
engine_full: ReplicatedReplacingMergeTree('/clickhouse/tables/am0004_20220318084324_{shard}/posthog.events', '{replica}', _timestamp) PARTITION BY toYYYYMM(timestamp) ORDER BY (team_id, toDate(timestamp), event, cityHash64(distinct_id), cityHash64(uuid)) SAMPLE BY cityHash64(distinct_id) SETTINGS index_granularity = 8192
Row 30:
───────
host: chi-posthog-posthog-0-1-0
name: sharded_events
engine_full: ReplicatedReplacingMergeTree('/clickhouse/tables/am0004_20220318084324_{shard}/posthog.events', '{replica}', _timestamp) PARTITION BY toYYYYMM(timestamp) ORDER BY (team_id, toDate(timestamp), event, cityHash64(distinct_id), cityHash64(uuid)) SAMPLE BY cityHash64(distinct_id) SETTINGS index_granularity = 8192
Row 31:
───────
host: chi-posthog-posthog-1-0-0
name: sharded_events
engine_full: ReplicatedReplacingMergeTree('/clickhouse/tables/am0004_20220318084324_{shard}/posthog.events', '{replica}', _timestamp) PARTITION BY toYYYYMM(timestamp) ORDER BY (team_id, toDate(timestamp), event, cityHash64(distinct_id), cityHash64(uuid)) SAMPLE BY cityHash64(distinct_id) SETTINGS index_granularity = 8192
Row 32:
───────
host: chi-posthog-posthog-1-1-0
name: sharded_events
engine_full: ReplicatedReplacingMergeTree('/clickhouse/tables/am0004_20220318084324_{shard}/posthog.events', '{replica}', _timestamp) PARTITION BY toYYYYMM(timestamp) ORDER BY (team_id, toDate(timestamp), event, cityHash64(distinct_id), cityHash64(uuid)) SAMPLE BY cityHash64(distinct_id) SETTINGS index_granularity = 8192
Row 33:
───────
host: chi-posthog-posthog-0-0-0
name: sharded_session_recording_events
engine_full: ReplicatedReplacingMergeTree('/clickhouse/tables/am0004_20220318084324_{shard}/posthog.session_recording_events', '{replica}', _timestamp) PARTITION BY toYYYYMMDD(timestamp) ORDER BY (team_id, toHour(timestamp), session_id, timestamp, uuid) TTL toDate(created_at) + toIntervalWeek(3) SETTINGS index_granularity = 512
Row 34:
───────
host: chi-posthog-posthog-0-1-0
name: sharded_session_recording_events
engine_full: ReplicatedReplacingMergeTree('/clickhouse/tables/am0004_20220318084324_{shard}/posthog.session_recording_events', '{replica}', _timestamp) PARTITION BY toYYYYMMDD(timestamp) ORDER BY (team_id, toHour(timestamp), session_id, timestamp, uuid) TTL toDate(created_at) + toIntervalWeek(3) SETTINGS index_granularity = 512
Row 35:
───────
host: chi-posthog-posthog-1-0-0
name: sharded_session_recording_events
engine_full: ReplicatedReplacingMergeTree('/clickhouse/tables/am0004_20220318084324_{shard}/posthog.session_recording_events', '{replica}', _timestamp) PARTITION BY toYYYYMMDD(timestamp) ORDER BY (team_id, toHour(timestamp), session_id, timestamp, uuid) TTL toDate(created_at) + toIntervalWeek(3) SETTINGS index_granularity = 512
Row 36:
───────
host: chi-posthog-posthog-1-1-0
name: sharded_session_recording_events
engine_full: ReplicatedReplacingMergeTree('/clickhouse/tables/am0004_20220318084324_{shard}/posthog.session_recording_events', '{replica}', _timestamp) PARTITION BY toYYYYMMDD(timestamp) ORDER BY (team_id, toHour(timestamp), session_id, timestamp, uuid) TTL toDate(created_at) + toIntervalWeek(3) SETTINGS index_granularity = 512
36 rows in set. Elapsed: 0.123 sec.
👋 @macobo can we consider this ✅ ?
Plan for how we are going to shard ClickHouse on CH-Operator based installs and Cloud (and externally hosted CH as well)