enhancement(clickhouse sink): Add ArrowStream format
#24075
base: master
Conversation
Hi @benjamin-awd, thank you for this contribution. It will take a while to review since it's >2.5k LoC. I was wondering if we can split this somehow. Would it make sense to make a generic Arrow codec? Just an idea.
Hey @pront, thanks for taking a look -- I think it'd be nice to have a generic Arrow codec (assuming you mean like I think the current approach is a decent compromise since it's relatively generic. The only requirement is an override at the request-builder level, meaning that any sink can implement it.
So after playing around with Claude Code for a bit (probably burned the equivalent of a few trees), it seems that it is somewhat possible to create a generic Arrow codec, although this requires the creation of a Although of course it will most likely take quite a bit of effort & time to push it through the gate compared to this one
I think it's worth exploring as a follow-up instead. Also, it would be very helpful to create a feature request to record these details and gauge community interest.
Hi @pront, I've pushed the Arrow codec changes to #24124. Regarding the feature request, I think it lies somewhere between adding support for batch codecs and supporting Arrow as a serialization format -- this also opens up some interesting paths where we could write to Arrow’s
Summary
This PR adds the `ArrowStream` format option for the ClickHouse sink. This provides a more efficient binary protocol for ingesting log data into ClickHouse compared to the existing JSON formats, with improved performance at high throughput.
Vector configuration
How did you test this PR?
Tested locally and in a development environment using data at a rate of a few hundred thousand rows per second.
Change Type
Is this a breaking change?
Does this PR include user facing changes?
`no-changelog` label to this PR.
References
Closes #24074
Notes
Internal comparison between formats (pointing Vector at two identical tables, the only difference being format)
WITH
    a_log AS
    (
        SELECT `table`, format, rows, bytes, flush_query_id
        FROM system.asynchronous_insert_log
        WHERE (status = 'Ok')
            AND (`table` IN ('t1', 't2'))
            AND (event_time >= (now() - toIntervalMinute(15)))
    ),
    q_log AS
    (
        SELECT query_id, query_duration_ms
        FROM system.query_log
        PREWHERE (type = 'QueryFinish')
            AND (query_kind = 'AsyncInsertFlush')
            AND (event_time >= (now() - toIntervalMinute(15)))
    )
SELECT
    a.`table`,
    a.format,
    count() AS total_flushes,
    sum(a.rows) AS total_rows_inserted,
    formatReadableSize(sum(a.bytes)) AS total_data_inserted,
    sum(q.query_duration_ms) AS total_flush_time_ms,
    sum(a.rows) / sum(q.query_duration_ms / 1000.) AS avg_rows_per_second,
    concat(formatReadableSize(sum(a.bytes) / sum(q.query_duration_ms / 1000.)), '/s') AS avg_bytes_per_second,
    sum(a.rows) / count() AS avg_rows_per_flush
FROM a_log AS a
INNER JOIN q_log AS q ON a.flush_query_id = q.query_id
GROUP BY a.`table`, a.format
ORDER BY a.`table` ASC, a.format ASC

Query id: d41311c0-cb10-403c-8d4d-7f6cf6cb8f13

Row 1:
──────
table:                jsoneachrow_table
format:               JSONEachRow
total_flushes:        34084
total_rows_inserted:  42829429 -- 42.83 million
total_data_inserted:  65.39 GiB
total_flush_time_ms:  14707745 -- 14.71 million
avg_rows_per_second:  2912.0323339845772
avg_bytes_per_second: 4.55 MiB/s
avg_rows_per_flush:   1256.5845851425888

Row 2:
──────
table:                arrowstream_table
format:               ArrowStream
total_flushes:        35934
total_rows_inserted:  45153872 -- 45.15 million
total_data_inserted:  17.27 GiB
total_flush_time_ms:  3356282 -- 3.36 million
avg_rows_per_second:  13453.539362902164
avg_bytes_per_second: 5.27 MiB/s
avg_rows_per_flush:   1256.5779484610675
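
To make the comparison easier to read, here is a small Python sketch that recomputes the headline ratios from the two result rows. The function names are made up for illustration, and the byte figures come from the rounded `formatReadableSize` output, so the bytes-per-row numbers are approximate. Treat it as a sanity check on the arithmetic, not as a benchmark.

```python
# Recompute the per-format ratios from the query results in the Notes.
# Inputs are copied from the two result rows; helper names are hypothetical.

def rows_per_second(total_rows: int, total_flush_time_ms: int) -> float:
    """Same formula as the query: rows / (flush time in seconds)."""
    return total_rows / (total_flush_time_ms / 1000.0)

def bytes_per_row(total_gib: float, total_rows: int) -> float:
    """Approximate payload size per row, from the rounded GiB total."""
    return total_gib * 1024**3 / total_rows

json_rps = rows_per_second(42_829_429, 14_707_745)   # JSONEachRow
arrow_rps = rows_per_second(45_153_872, 3_356_282)   # ArrowStream
speedup = arrow_rps / json_rps

json_bpr = bytes_per_row(65.39, 42_829_429)
arrow_bpr = bytes_per_row(17.27, 45_153_872)
size_ratio = json_bpr / arrow_bpr

print(f"rows/s  JSONEachRow={json_rps:.1f}  ArrowStream={arrow_rps:.1f}  speedup={speedup:.2f}x")
print(f"bytes/row  JSONEachRow={json_bpr:.0f}  ArrowStream={arrow_bpr:.0f}  ratio={size_ratio:.2f}x")
```

On these numbers, ArrowStream flushes about 4.6x more rows per second of flush time and carries roughly a quarter of the bytes per row, while rows per flush are essentially identical between the two tables.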