
Add support for Databend sink #15727

Closed
everpcpc opened this issue Dec 26, 2022 · 10 comments · Fixed by #15898
Labels
sink: new (A request for a new sink)
type: feature (A value-adding code addition that introduces new functionality)

Comments

@everpcpc
Contributor

everpcpc commented Dec 26, 2022

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

Sink logs to Databend.

Attempted Solutions

The HTTP sink can do the trick, but it's too complicated.

Proposal

We need to make some further improvements for the cloud database, such as using a presigned URL for uploads to save most of the transfer fees.

References

Version

0.26.0

@everpcpc everpcpc added the type: feature A value-adding code addition that introduces new functionality. label Dec 26, 2022
@jszwedko jszwedko added the sink: new A request for a new sink label Dec 27, 2022
@spencergilbert
Contributor

FWIW, Databend seems to suggest using the clickhouse sink - https://databend.rs/doc/integrations/data-tool/vector#configure-vector

We have been told about some performance limitations of that sink as-is, so it's worth investigating what a native sink may look like.

@everpcpc
Contributor Author

everpcpc commented Jan 5, 2023

We intend to do a 3-step batch sink (see the sketch at the end of this comment):

  1. Get a presigned URL for object storage before each batch with the query:
    PRESIGN UPLOAD @stage_name/stage_path;
  2. Format the data into CSV/NDJSON (or another format) and upload it directly to object storage using the presigned URL.
  3. Insert the file uploaded in the previous step using a stage attachment:
    https://databend.rs/doc/sql-commands/dml/dml-insert#insert-with-stage-attachment
    (Maybe this step can be replaced by simply triggering a prebuilt insert pipeline or background worker.)

BTW, a direct insert mode would also be supported, configurable via settings.
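
A rough sketch of the flow, not the final sink code. `run_query` is a hypothetical stand-in for whatever query transport the sink ends up using (Databend's HTTP handler or the Rust driver), and the stage/table names are made up for illustration:

```rust
use anyhow::Result;

// Hypothetical helper: execute `sql` against Databend and return the raw result.
fn run_query(sql: &str) -> Result<String> {
    unimplemented!()
}

fn sink_batch(batch_ndjson: Vec<u8>) -> Result<()> {
    // Step 1: ask Databend for a presigned upload URL for a stage path.
    // (Parsing of the PRESIGN UPLOAD result is elided here.)
    let presigned_url = run_query("PRESIGN UPLOAD @vector_stage/batch-0001.ndjson")?;

    // Step 2: upload the formatted batch straight to object storage through the
    // presigned URL, so the bulk of the bytes never pass through the database.
    reqwest::blocking::Client::new()
        .put(presigned_url.as_str())
        .body(batch_ndjson)
        .send()?
        .error_for_status()?;

    // Step 3: have Databend ingest the staged file, e.g. via the insert with
    // stage attachment API linked above (shown here as a plain COPY INTO only
    // to keep the sketch short).
    run_query("COPY INTO logs FROM @vector_stage/batch-0001.ndjson FILE_FORMAT = (TYPE = NDJSON)")?;
    Ok(())
}
```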

@spencergilbert
Contributor

Haha, I should have checked to see if you were part of @datafuselabs/databend before responding 😆

I'm definitely not familiar enough with the application to have strong opinions on how to implement it - I expect we'd lean on y'alls expertise for what's the most performant/reliable/has-whatever-features-we-need and go that way.

Since object storage would be involved with the 3-step, it could be an opportunity to spike/explore using OpenDAL (#15715)

@everpcpc
Contributor Author

everpcpc commented Jan 6, 2023

Actually, object storage is not directly involved on Vector's side in the 3-step flow.
We get a presigned URL in step 1, generated by OpenDAL inside Databend, which makes step 2 just an HTTP upload (sketched below). That can be much faster than inserting into the database, and it needs no OpenDAL on the Vector side.

ref: https://docs.aws.amazon.com/AmazonS3/latest/userguide/PresignedUrlUploadObject.html
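
A minimal illustration of what step 2 looks like from Vector's side, assuming the PRESIGN result has been parsed into a URL plus the headers it asks the client to send (the struct shape here is an assumption, not Databend's exact output):

```rust
use std::collections::HashMap;

// Assumed shape of a parsed PRESIGN UPLOAD result.
struct PresignedUpload {
    url: String,
    headers: HashMap<String, String>,
}

// Step 2 is just a plain HTTP PUT; no object-storage SDK or OpenDAL
// dependency is needed in Vector itself.
fn upload(presigned: &PresignedUpload, payload: Vec<u8>) -> reqwest::Result<()> {
    let client = reqwest::blocking::Client::new();
    let mut req = client.put(presigned.url.as_str()).body(payload);
    for (name, value) in &presigned.headers {
        // Presigned uploads may require specific headers to be echoed back.
        req = req.header(name.as_str(), value.as_str());
    }
    req.send()?.error_for_status()?;
    Ok(())
}
```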

@spencergilbert
Contributor

Oh, I see - that seems handy!

I saw that databendlabs/databend#9448 included "Datadog Vector integrate with Rust-driver", is this issue/work something Databend is considering contributing?

@everpcpc
Contributor Author

Yes, I'm working on this these days.

@davidhuie-dd
Contributor

Hi @everpcpc! We've been taking a look at this request, and we're wondering if you could provide us with more details about the issues you're facing with Vector that require a new sink. Mainly, it would be great to have some background on what makes the HTTP or Clickhouse workarounds too complicated, as well as some context on the exact bandwidth concerns the existing sinks face. We're happy to extend Vector's surface area to include new projects, but we're also careful about increasing that surface area when workarounds already exist. Sorry about chiming in so late in the game, and thanks in advance! 😸

@everpcpc
Contributor Author

everpcpc commented Feb 12, 2023

Hi @davidhuie-dd,
As a cloud warehouse, we are mostly handling large amounts of data, and the transfer fees can be extremely high with direct inserts into the database. So we take advantage of S3 presigned URLs, which are commonly used by cloud warehouse providers such as Snowflake. Since S3 uploads are free, with the help of presigned URLs we can upload data directly into S3 for the database, with very little network transfer to the database over the public network. Even in a private VPC, this feature helps, since cross-AZ transfer fees can still be incredibly high. Neither the http nor the clickhouse sink can do this today, so a new sink is needed.

Also, with presigned inserts we can ingest with the whole cluster rather than just the single instance that handles the insert statement, which gives much better performance.

Besides, for the planned CSV sink format, neither http nor clickhouse can easily be adapted, since we need to configure the exact sink fields and generate the corresponding insert SQL statement (illustrated below).
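
To illustrate that last point: a native sink knows the configured field list up front, so it can render the CSV rows and the matching column list for the insert statement itself. The table and field names below are made up; this is only a sketch of the idea, not the sink's actual encoder:

```rust
// Sketch only: render a batch as CSV plus the column list for the matching
// INSERT, given the fields configured on the sink. Naive quoting for brevity.
fn render_batch(table: &str, fields: &[&str], rows: &[Vec<String>]) -> (String, String) {
    let csv = rows
        .iter()
        .map(|row| {
            row.iter()
                .map(|v| format!("\"{}\"", v.replace('"', "\"\"")))
                .collect::<Vec<_>>()
                .join(",")
        })
        .collect::<Vec<_>>()
        .join("\n");

    let insert_sql = format!("INSERT INTO {} ({})", table, fields.join(", "));
    (csv, insert_sql)
}

// e.g. render_batch("logs", &["timestamp", "host", "message"], &rows)
```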

@davidhuie-dd
Contributor

@everpcpc For documentation purposes: since ingress bandwidth is free on AWS, this is for saving egress bandwidth costs? It seems like it would help when traffic is between a Databend client and server within the same region, but within different AZs. That would make the bandwidth cost free. Thanks.

@everpcpc
Contributor Author

everpcpc commented Feb 14, 2023

@davidhuie-dd some additional notes:

  1. Free ingress bandwidth only applies to EC2 machines with an external IP address. In a common enterprise setup, people usually have a load balancer and a NAT gateway, which can cost even more than the data transfer itself, both ingress and egress.
  2. Yes, in real production cases, people mostly want their business to be fault tolerant across availability zones, and cross-AZ deployment is recommended by all cloud providers. So there are always logs being written from everywhere.
