@benjamin-awd
Contributor

@benjamin-awd benjamin-awd commented Oct 25, 2025

Summary

This PR adds the ArrowStream format option to the ClickHouse sink. ArrowStream is a binary, columnar format that is more efficient for ingesting log data into ClickHouse than the existing JSON formats, particularly at high throughput.

Vector configuration

  sinks:
    clickhouse:
      type: clickhouse
      endpoint: http://localhost:8123
      database: mydatabase
      table: logs
      format: arrow_stream  # New format option (defaults to JSONEachRow)
      compression: gzip
      auth:
        strategy: basic
        user: default
        password: "${CLICKHOUSE_PASSWORD}"

How did you test this PR?

Tested locally and in a development environment, ingesting data at a rate of a few hundred thousand rows per second.

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Closes #24074

Notes

Internal comparison between the two formats: Vector was pointed at two identical tables, with the insert format as the only difference.

WITH
    a_log AS
    (
        SELECT
            `table`,
            format,
            rows,
            bytes,
            flush_query_id
        FROM system.asynchronous_insert_log
        WHERE (status = 'Ok') AND (`table` IN ('t1', 't2')) AND (event_time >= (now() - toIntervalMinute(15)))
    ),
    q_log AS
    (
        SELECT
            query_id,
            query_duration_ms
        FROM system.query_log
        PREWHERE (type = 'QueryFinish') AND (query_kind = 'AsyncInsertFlush') AND (event_time >= (now() - toIntervalMinute(15)))
    )
SELECT
    a.`table`,
    a.format,
    count() AS total_flushes,
    sum(a.rows) AS total_rows_inserted,
    formatReadableSize(sum(a.bytes)) AS total_data_inserted,
    sum(q.query_duration_ms) AS total_flush_time_ms,
    sum(a.rows) / sum(q.query_duration_ms / 1000.) AS avg_rows_per_second,
    concat(formatReadableSize(sum(a.bytes) / sum(q.query_duration_ms / 1000.)), '/s') AS avg_bytes_per_second,
    sum(a.rows) / count() AS avg_rows_per_flush
FROM a_log AS a
INNER JOIN q_log AS q ON a.flush_query_id = q.query_id
GROUP BY
    a.`table`,
    a.format
ORDER BY
    a.`table` ASC,
    a.format ASC

Query id: d41311c0-cb10-403c-8d4d-7f6cf6cb8f13

Row 1:
──────
table:                jsoneachrow_table
format:               JSONEachRow
total_flushes:        34084
total_rows_inserted:  42829429 -- 42.83 million
total_data_inserted:  65.39 GiB
total_flush_time_ms:  14707745 -- 14.71 million
avg_rows_per_second:  2912.0323339845772
avg_bytes_per_second: 4.55 MiB/s
avg_rows_per_flush:   1256.5845851425888

Row 2:
──────
table:                arrowstream_table
format:               ArrowStream
total_flushes:        35934
total_rows_inserted:  45153872 -- 45.15 million
total_data_inserted:  17.27 GiB
total_flush_time_ms:  3356282 -- 3.36 million
avg_rows_per_second:  13453.539362902164
avg_bytes_per_second: 5.27 MiB/s
avg_rows_per_flush:   1256.5779484610675
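As a rough reading of the two rows above, the ratios between the formats can be worked out directly from the reported figures. The snippet below is only arithmetic over the numbers in the query output; the variable and function names are illustrative, and `bytes` is taken at face value as whatever `system.asynchronous_insert_log` reports for each insert.

```python
# Figures copied from the query output above.
json_rows, json_flush_ms, json_gib = 42_829_429, 14_707_745, 65.39
arrow_rows, arrow_flush_ms, arrow_gib = 45_153_872, 3_356_282, 17.27

GIB = 1024 ** 3


def rows_per_second(rows: int, flush_ms: int) -> float:
    """Rows inserted per second of cumulative flush time."""
    return rows / (flush_ms / 1000.0)


json_rps = rows_per_second(json_rows, json_flush_ms)
arrow_rps = rows_per_second(arrow_rows, arrow_flush_ms)

# How much faster ArrowStream flushes were, per row.
speedup = arrow_rps / json_rps

# Average payload size per row for each format.
json_bytes_per_row = json_gib * GIB / json_rows
arrow_bytes_per_row = arrow_gib * GIB / arrow_rows

print(f"JSONEachRow: {json_rps:.1f} rows/s, {json_bytes_per_row:.0f} bytes/row")
print(f"ArrowStream: {arrow_rps:.1f} rows/s, {arrow_bytes_per_row:.0f} bytes/row")
print(f"rows/s speedup: {speedup:.2f}x")
```

On these numbers, ArrowStream flushes roughly 4.6x more rows per second of flush time and carries roughly a quarter of the bytes per row. The similar `avg_bytes_per_second` values follow from that: fewer bytes are flushed in proportionally less time.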

@benjamin-awd benjamin-awd requested a review from a team as a code owner October 25, 2025 14:15
@github-actions github-actions bot added the domain: sinks Anything related to the Vector's sinks label Oct 25, 2025
@benjamin-awd benjamin-awd requested a review from a team as a code owner October 25, 2025 14:30
@github-actions github-actions bot added the domain: external docs Anything related to Vector's external, public documentation label Oct 25, 2025
@pront
Member

pront commented Oct 27, 2025

Hi @benjamin-awd, thank you for this contribution. It will take a while to review since it's >2.5k LoC. I was wondering if we can split it somehow. Would it make sense to make a generic Arrow codec? Just an idea.

@pront pront self-assigned this Oct 27, 2025
@benjamin-awd
Contributor Author

Hey @pront, thanks for taking a look -- I think it'd be nice to have a generic arrow codec (assuming you mean like lib/codecs/src/encoding/format/arrow.rs) but it's quite tricky because of the batching logic involved. If I'm not wrong, this will require a significant overhaul of Vector's code and I'm not sure if it's something I have bandwidth for at the moment 😕

I think the current approach is a decent compromise since it's relatively generic. The only requirement is an override at the request-builder level, meaning that any sink can implement it.

@benjamin-awd
Contributor Author

So after playing around with Claude Code for a bit (probably burned the equivalent of a few trees), it seems that it is somewhat possible to create a generic Arrow codec, although this requires the creation of a BatchEncoder struct and a BatchSerializer trait. If that's something you're keen to review, I can split it into a separate PR without the ClickHouse changes: https://github.com/vectordotdev/vector/compare/master...benjamin-awd:vector:add-ch-arrow-codec?expand=1

Although of course it will most likely take quite a bit of effort & time to push it through the gate compared to this one
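To illustrate why a columnar format calls for a batch-level interface rather than a per-event one, here is a minimal sketch in Python (not Vector's Rust code). The names `EventSerializer`, `JsonLinesSerializer`, and `ColumnarSerializer` are hypothetical, the `BatchSerializer` protocol only borrows its name from the trait mentioned above without reflecting its real signature, and the JSON column layout is a stand-in for an Arrow record batch, not the actual Arrow IPC format.

```python
import json
from typing import Any, Protocol

Event = dict[str, Any]


class EventSerializer(Protocol):
    """Hypothetical per-event interface: one event in, bytes out."""

    def encode(self, event: Event) -> bytes: ...


class BatchSerializer(Protocol):
    """Hypothetical batch interface: the whole batch is encoded at once."""

    def encode_batch(self, events: list[Event]) -> bytes: ...


class JsonLinesSerializer:
    """Row-oriented: every event can be encoded independently."""

    def encode(self, event: Event) -> bytes:
        return json.dumps(event, sort_keys=True).encode() + b"\n"


class ColumnarSerializer:
    """Column-oriented: needs all rows up front to pivot them into columns."""

    def encode_batch(self, events: list[Event]) -> bytes:
        # Collect every field seen in the batch; sort for deterministic output.
        fields = sorted({key for event in events for key in event})
        # Pivot rows into columns, using None where an event lacks a field.
        columns = {name: [event.get(name) for event in events] for name in fields}
        return json.dumps({"num_rows": len(events), "columns": columns}).encode()


events = [{"host": "a", "status": 200}, {"host": "b", "status": 500}]

# Per-event encoding can be streamed and concatenated.
row_payload = b"".join(JsonLinesSerializer().encode(e) for e in events)

# Columnar encoding only makes sense once the batch is complete.
col_payload = ColumnarSerializer().encode_batch(events)

print(row_payload.decode())
print(col_payload.decode())
```

The per-event serializer can be called as events arrive, while the columnar one has to wait for the full batch, which is the mismatch with a per-event encoding pipeline that the comment above describes.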

@pront
Member

pront commented Oct 29, 2025

> So after playing around with Claude Code for a bit (probably burned the equivalent of a few trees), it seems that it is somewhat possible to create a generic Arrow codec, although this requires the creation of a BatchEncoder struct and a BatchSerializer trait. If that's something you're keen to review, I can split it into a separate PR without the ClickHouse changes: master...benjamin-awd:vector:add-ch-arrow-codec?expand=1 (compare)
>
> Although of course it will most likely take quite a bit of effort & time to push it through the gate compared to this one

I think it's worth exploring as a followup instead. Also, it would be very helpful to create a feature request to record these details and gauge community interest.

@benjamin-awd
Contributor Author

Hi @pront, I've pushed the Arrow codec changes to #24124. Regarding the feature request, I think it lies somewhere between adding support for batch codecs and supporting Arrow as a serialization format. This also opens up some interesting paths: we could write to Arrow’s Variant type and write to Parquet without needing a schema, which would be quite nice.

Labels

domain: external docs (Anything related to Vector's external, public documentation)
domain: sinks (Anything related to the Vector's sinks)

Development

Successfully merging this pull request may close these issues.

Add ArrowStream format to Clickhouse sink

3 participants