
Add support for Databend sink #15727

Closed
everpcpc opened this issue Dec 26, 2022 · 10 comments · Fixed by #15898
Labels
sink: new (A request for a new sink)
type: feature (A value-adding code addition that introduces new functionality)

Comments

@everpcpc
Contributor

everpcpc commented Dec 26, 2022

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

Sink logs to Databend.

Attempted Solutions

The HTTP sink can do the trick, but it's too complicated.

Proposal

We need to make some further improvements for the cloud database, such as using a presigned URL for uploads to save most of the transfer fees.

References

Version

0.26.0

@everpcpc everpcpc added the type: feature A value-adding code addition that introduces new functionality. label Dec 26, 2022
@jszwedko jszwedko added the sink: new A request for a new sink label Dec 27, 2022
@spencergilbert
Contributor

FWIW, Databend seems to suggest using the clickhouse sink - https://databend.rs/doc/integrations/data-tool/vector#configure-vector

We have been told about some performance limitations of that sink as-is, so it's worth investigating what a native sink may look like.

@everpcpc
Contributor Author

everpcpc commented Jan 5, 2023

We intend to do a 3-step batch sink (see the sketch at the end of this comment):

  1. Get a presigned URL for object storage before each batch with the query:
    PRESIGN UPLOAD @stage_name/stage_path;
  2. Format the data into CSV/NDJSON (or another format) and upload it directly to object storage using the presigned URL.
  3. Insert the file uploaded in the previous step using a stage attachment:
    https://databend.rs/doc/sql-commands/dml/dml-insert#insert-with-stage-attachment
    (Maybe this step can be replaced by simply triggering a prebuilt insert pipeline or background worker.)

BTW, a direct insert mode would also be supported, configurable via settings.
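
A rough sketch of the flow, not the final sink code. `run_query` is a hypothetical stand-in for whatever query transport the sink ends up using (Databend's HTTP handler or the Rust driver), and the stage/table names are made up for illustration:

```rust
use anyhow::Result;

// Hypothetical helper: execute `sql` against Databend and return the raw result.
fn run_query(sql: &str) -> Result<String> {
    unimplemented!()
}

fn sink_batch(batch_ndjson: Vec<u8>) -> Result<()> {
    // Step 1: ask Databend for a presigned upload URL for a stage path.
    // (Parsing of the PRESIGN UPLOAD result is elided here.)
    let presigned_url = run_query("PRESIGN UPLOAD @vector_stage/batch-0001.ndjson")?;

    // Step 2: upload the formatted batch straight to object storage through the
    // presigned URL, so the bulk of the bytes never pass through the database.
    reqwest::blocking::Client::new()
        .put(presigned_url.as_str())
        .body(batch_ndjson)
        .send()?
        .error_for_status()?;

    // Step 3: have Databend ingest the staged file, e.g. via the insert with
    // stage attachment API linked above (shown here as a plain COPY INTO only
    // to keep the sketch short).
    run_query("COPY INTO logs FROM @vector_stage/batch-0001.ndjson FILE_FORMAT = (TYPE = NDJSON)")?;
    Ok(())
}
```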

@spencergilbert
Contributor

Haha, I should have checked to see if you were part of @datafuselabs/databend before responding 😆

I'm definitely not familiar enough with the application to have strong opinions on how to implement it - I expect we'd lean on y'alls expertise for what's the most performant/reliable/has-whatever-features-we-need and go that way.

Since object storage would be involved with the 3-step, it could be an opportunity to spike/explore using OpenDAL (#15715)

@everpcpc
Contributor Author

everpcpc commented Jan 6, 2023

Actually, object storage is not directly involved on Vector's side in the 3-step flow.
We get a presigned URL in step 1, generated by OpenDAL inside Databend, which makes step 2 just an HTTP upload (sketched below). That can be much faster than inserting into the database, and it needs no OpenDAL on the Vector side.

ref: https://docs.aws.amazon.com/AmazonS3/latest/userguide/PresignedUrlUploadObject.html
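
A minimal illustration of what step 2 looks like from Vector's side, assuming the PRESIGN result has been parsed into a URL plus the headers it asks the client to send (the struct shape here is an assumption, not Databend's exact output):

```rust
use std::collections::HashMap;

// Assumed shape of a parsed PRESIGN UPLOAD result.
struct PresignedUpload {
    url: String,
    headers: HashMap<String, String>,
}

// Step 2 is just a plain HTTP PUT; no object-storage SDK or OpenDAL
// dependency is needed in Vector itself.
fn upload(presigned: &PresignedUpload, payload: Vec<u8>) -> reqwest::Result<()> {
    let client = reqwest::blocking::Client::new();
    let mut req = client.put(presigned.url.as_str()).body(payload);
    for (name, value) in &presigned.headers {
        // Presigned uploads may require specific headers to be echoed back.
        req = req.header(name.as_str(), value.as_str());
    }
    req.send()?.error_for_status()?;
    Ok(())
}
```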

@spencergilbert
Contributor

Oh, I see - that seems handy!

I saw that databendlabs/databend#9448 included "Datadog Vector integrate with Rust-driver", is this issue/work something Databend is considering contributing?

@everpcpc
Contributor Author

Yes, I'm working on this these days.

@davidhuie-dd
Contributor

Hi @everpcpc! We've been taking a look at this request, and we're wondering if you could provide us with more details about the issues you're facing with Vector that require a new sink. Mainly, it would be great to have some background on what makes the HTTP or Clickhouse workarounds too complicated, as well as some context on the exact bandwidth concerns the existing sinks face. We're happy to extend Vector's surface area to include new projects, but we're also careful about increasing that surface area when workarounds already exist. Sorry about chiming in so late in the game, and thanks in advance! 😸

@everpcpc
Contributor Author

everpcpc commented Feb 12, 2023

Hi @davidhuie-dd,
As a cloud warehouse, we are mostly handling large amounts of data, and the transfer fees can be extremely high with direct inserts into the database. So we take advantage of S3 presigned URLs, which are commonly used by cloud warehouse providers such as Snowflake. Since S3 uploads are free, with the help of presigned URLs we can upload data directly into S3 for the database, with very little network transfer to the database over the public network. Even in a private VPC, this feature helps, since cross-AZ transfer fees can still be incredibly high. Neither the http nor the clickhouse sink can do this today, so a new sink is needed.

Also, with presigned inserts we can ingest with the whole cluster rather than just the single instance that handles the insert statement, which gives much better performance.

Besides, for the planned CSV sink format, neither http nor clickhouse can easily be adapted, since we need to configure the exact sink fields and generate the corresponding insert SQL statement (illustrated below).
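
To illustrate that last point: a native sink knows the configured field list up front, so it can render the CSV rows and the matching column list for the insert statement itself. The table and field names below are made up; this is only a sketch of the idea, not the sink's actual encoder:

```rust
// Sketch only: render a batch as CSV plus the column list for the matching
// INSERT, given the fields configured on the sink. Naive quoting for brevity.
fn render_batch(table: &str, fields: &[&str], rows: &[Vec<String>]) -> (String, String) {
    let csv = rows
        .iter()
        .map(|row| {
            row.iter()
                .map(|v| format!("\"{}\"", v.replace('"', "\"\"")))
                .collect::<Vec<_>>()
                .join(",")
        })
        .collect::<Vec<_>>()
        .join("\n");

    let insert_sql = format!("INSERT INTO {} ({})", table, fields.join(", "));
    (csv, insert_sql)
}

// e.g. render_batch("logs", &["timestamp", "host", "message"], &rows)
```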

@davidhuie-dd
Contributor

@everpcpc For documentation purposes: since ingress bandwidth is free on AWS, this is for saving egress bandwidth costs? It seems like it would help when traffic is between a Databend client and server within the same region, but within different AZs. That would make the bandwidth cost free. Thanks.

@everpcpc
Contributor Author

everpcpc commented Feb 14, 2023

@davidhuie-dd some additional notes:

  1. Free ingress bandwidth only applies to EC2 machines with an external IP address. In a common enterprise setup, people usually have a load balancer and a NAT gateway, which can cost even more than the data transfer itself, both ingress and egress.
  2. Yes, in real production cases, people mostly want their business to be fault tolerant across availability zones, and cross-AZ deployment is recommended by all cloud providers. So there are always logs being written from everywhere.
