
feat(new sink): new postgres sink #22481

Open
Ichmed wants to merge 13 commits into master

Conversation


@Ichmed Ichmed commented Feb 20, 2025

Summary

A zero-copy Postgres sink that requires no new dependencies (it only enables one additional feature on tokio-postgres).
The sink uses a prepared statement to insert the data as native SQL values instead of serializing the data to JSON and deserializing it in the database.

For now the sink can only handle Logs and Traces.

Tests are still missing, but it can be E2E tested using this setup:

```yaml
sources:
  stdin:
    type: stdin

transforms:
  foobar:
    type: remap
    inputs:
      - stdin
    source: |-
      .json_field = del(.)
      .array_field = [true, true, true]
      .id = "some_id"
      .ignored_field = 1324

sinks:
  posti:
    type: postgres
    host: localhost
    port: 5432
    table: jsontest
    inputs:
      - foobar
```

```sql
CREATE TABLE IF NOT EXISTS public.jsontest
(
    id character varying(255) COLLATE pg_catalog."default",
    json_field json,
    array_field boolean[]
)
```
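For illustration, with the table above, the prepared statement the sink generates would look roughly like `INSERT INTO jsontest (...) VALUES ($1, $2, $3)`. A minimal sketch of how such a statement can be built (`build_insert` is a hypothetical helper for illustration, not code from this PR):

```rust
/// Build an INSERT statement with positional parameters ($1, $2, ...)
/// from a table name and a fixed list of column names. The result can
/// be prepared once and reused for every event.
fn build_insert(table: &str, columns: &[&str]) -> String {
    let params: Vec<String> = (1..=columns.len()).map(|i| format!("${i}")).collect();
    format!(
        "INSERT INTO {table} ({}) VALUES ({})",
        columns.join(", "),
        params.join(", ")
    )
}
```

For example, `build_insert("jsontest", &["id", "json_field", "array_field"])` yields `INSERT INTO jsontest (id, json_field, array_field) VALUES ($1, $2, $3)`.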

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

How did you test this PR?

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the "no-changelog" label to this PR.

Checklist

  • Please read our Vector contributor resources.
    • make check-all is a good command to run locally. This check is
      defined here. Some of these
      checks might not be relevant to your PR. For Rust changes, at the very least you should run:
      • cargo fmt --all
      • cargo clippy --workspace --all-targets -- -D warnings
      • cargo nextest run --workspace (alternatively, you can run cargo test --all)
  • If this PR introduces changes to Vector's dependencies (modifies Cargo.lock), please
    run dd-rust-license-tool write to regenerate the license inventory and commit the changes (if any). More details here.

References

@Ichmed Ichmed requested a review from a team as a code owner February 20, 2025 13:18
@bits-bot

bits-bot commented Feb 20, 2025

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the domain: sinks Anything related to the Vector's sinks label Feb 20, 2025
@pront
Member

pront commented Feb 20, 2025

Hi @Ichmed, thank you for this PR! There is an existing PR that introduces a postgres sink and is almost there: #21248

@isbm

isbm commented Feb 20, 2025

@pront We are fully aware of it and have analysed it. 😉 And yet we don't think it is the right approach. Please take a closer look at the code. Features we can add, no worries. But we have enormous amounts of data that we need to move into Postgres and TimescaleDB, and we specifically need it optimised for cloud usage (mem/CPU matters!).

In the worst case you will have two sinks! 😆 Call it a "lightweight PgSink".

@pront
Member

pront commented Feb 20, 2025

> @pront We are fully aware of it and analysed it. 😉 And yet we don't think it is the right way to do. Please take a closer look at the code. Features we can add, no worries.

Sure will do. It will take some time though so please bear with me.

> But we have a really enormous data and we need it in large amounts into Postgres and TimescaleDB. We specifically need that optimised for cloud usage (mem/CPU matters!).

Did you compare both implementations against some benchmarks?

> In a worst case you will have two sinks! 😆 Call it "lightweight PgSink".

Having two sinks doing the same thing is probably not what we want. I do like that #21248 has support for all telemetry data, Vector features such as ACKs and good UX. And most importantly, a lot of testing.

Again, I didn't dive into the differences and I need some time to do so. I wonder, since you looked at the existing PR, could you work on optimizing that one after it lands?

@isbm

isbm commented Feb 20, 2025

> Sure will do. It will take some time though so please bear with me.

Thanks!

> Having two sinks doing the same thing is probably not what we want. I do like that #21248 has support for all telemetry data, Vector features such as ACKs and good UX. And most importantly, a lot of testing.

In our defence, our day-one Chapter 1 is not a half-year Chapter 128 😛. We specifically focused on making it a zero-copy, no-dependencies, generic micro-sink. Adding features is not a problem; ACKs are coming, as they are a necessity.

> Again, I didn't dive into the differences and I need some time to do so. I wonder, since you looked the existing PR, can you work on optimizing that after it lands?

We would definitely support and maintain ours; that's for sure, because it will go into production straight away. Alternatively, it could land in a "contrib" section: more options to choose from is always better. We are interested in bringing more sinks/transforms in the near future.

@Ichmed
Author

Ichmed commented Feb 21, 2025

Hi @pront, with these changes we should have feature parity with the other PR aside from configuration.

Is there a nice way to do benchmarks? I looked at the benches directory but didn't really understand how to apply that to this use case.
AFAIK this implementation should be faster than the other one, since we are simply doing less work: it should have zero allocations per event, it uses a prepared statement, and no deserialization happens on the DB side. So if we are slower in any use case, I would consider that a bug that can be fixed.

@jorgehermo9
Contributor

jorgehermo9 commented Feb 22, 2025

Hi, I would like to give my opinion on this.

> AFAIK this implementation should be faster than the other, since we are simply doing less work

I'm not really sure about this; claiming performance improvements and optimizations without measuring them is a mistake.

I see that you are not batching events, so every ingested event results in a network round trip. I would be surprised if this approach resulted in higher throughput than batching them.
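To illustrate the batching point, a multi-row INSERT lets a whole batch of events share a single round trip. This is only a sketch; `build_batched_insert` is a hypothetical helper, not code from either PR:

```rust
/// Sketch of a batched INSERT: n_rows events travel in one statement,
/// so the whole batch costs one network round trip instead of one per
/// event. Two columns and two rows yield "... VALUES ($1, $2), ($3, $4)".
fn build_batched_insert(table: &str, columns: &[&str], n_rows: usize) -> String {
    let n_cols = columns.len();
    let rows: Vec<String> = (0..n_rows)
        .map(|row| {
            // Positional parameters continue across rows: row 0 uses
            // $1..$n_cols, row 1 uses $(n_cols+1).., and so on.
            let params: Vec<String> = (1..=n_cols)
                .map(|col| format!("${}", row * n_cols + col))
                .collect();
            format!("({})", params.join(", "))
        })
        .collect();
    format!(
        "INSERT INTO {table} ({}) VALUES {}",
        columns.join(", "),
        rows.join(", ")
    )
}
```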

Taking a look at your implementation, I'm not sure it would work in the general case. For example, this prepared statement

```rust
"INSERT INTO {table} ({}) VALUES ({})",
```

formats the columns in a specific order (the order in which the columns are returned by the DB, which is not deterministic since you are not ordering them in the query), and then when inserting the column values

```rust
.map(|k| v.get(k.as_str()).unwrap_or(&Value::Null))
```

you are depending on the BTreeMap ordering (which is alphabetical), but that is a different order (at least right now) from the one in which you formatted the columns in the prepared statement (which is non-deterministic and decided by the DB; it should also come back alphabetically ordered, but you can't guarantee that with the current implementation).
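One way to sidestep this mismatch is to sort the column list once and use that same list both for formatting the statement and for looking up values. The sketch below assumes event fields live in a `BTreeMap`; `ordered_values` is a hypothetical helper, not code from the PR:

```rust
use std::collections::BTreeMap;

/// Sort the column names once and look event values up in that same
/// order. Because the statement formatting and the value lookup now
/// share one explicitly sorted list, the pairing no longer depends on
/// the order in which the DB happened to return the columns.
fn ordered_values<'a>(
    columns: &mut Vec<String>,
    event: &'a BTreeMap<String, String>,
) -> Vec<Option<&'a String>> {
    columns.sort(); // deterministic order, fixed once at sink startup
    columns.iter().map(|c| event.get(c.as_str())).collect()
}
```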

Moreover, as you are loading the table's columns on the sink's startup

```rust
let columns: Vec<_> = client.query("SELECT column_name from INFORMATION_SCHEMA.COLUMNS WHERE table_name = $1 AND table_schema = $2", &[&table, &schema]).await?.into_iter().map(|x| x.get(0)).collect();
```

your implementation does not allow altering tables and inserting new columns while running (which #21248 does); you have to restart the sink so that new columns are taken into account. Also, deleting columns while running would cause all events to fail until Vector is restarted.

Also, I'm not sure your implementation works for composite types (maybe it does, but I'm currently not sure).

The implementations are not feature-equal, so I don't think a performance comparison makes sense in this case anyway (whichever one turned out to be the fastest).

> should have zero allocations per event

Not allocating does not always imply being faster. It generally is faster not to allocate, but it is no guarantee.

> are using a prepared statement

So does #21248. From https://docs.rs/sqlx/latest/sqlx/fn.query.html:

> The connection will transparently prepare and cache the statement, which means it only needs to be parsed once in the connection's lifetime

> more options to choose from is always better

This is also a fallacy. For a new user, not having a single solution is actually worse, as users would struggle to decide which one to use. Moreover, it is a maintenance overhead for maintainers to keep multiple implementations of nearly the same thing.

From my point of view, we should not be talking about what should be faster but actually measuring it. And since I think this is not feature-equal to #21248, I don't know if it makes sense to just choose the fastest.

@jorgehermo9
Contributor

jorgehermo9 commented Feb 22, 2025

Also, you state that

> should have zero allocations per event

but since you use a BytesMut to encode every event's field

```rust
.map(Wrapper);
```

you are doing several allocations per event. Claims of zero allocations should come from validating them with tools like valgrind; stating that no allocations happen based purely on your own code, without taking your dependencies' code into account, is wrong.
