
Component telemetry inaccurate for some components in 0.32.0 #18265

Closed
jgournet opened this issue Aug 16, 2023 · 9 comments · Fixed by #18289
Labels: domain: observability · meta: regression · type: bug
Milestone: Vector 0.32.1

Comments

@jgournet

jgournet commented Aug 16, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Sorry, this is a very light ticket - just describing the issue we're facing:
Since this morning (16/08), we've been getting alerts because the gap between events in and events out keeps growing.
However, logs are still being pushed to our S3 sink, so it looks like the metric is not reporting correctly.

Is anyone else affected who could help provide more info?

Configuration

No response

Version

vector 0.32.0 (x86_64-unknown-linux-musl 1b403e1 2023-08-15 14:56:36.089460954)

Debug Output

No response

Example Data

No response

Additional Context

Reverting to 0.31.0-alpine seems to fix the metric.

References

No response

@jgournet jgournet added the type: bug A code related bug. label Aug 16, 2023
@jszwedko
Member

Hi @jgournet ,

Could you share some graphs? Also your configuration? Did you change anything this morning? I see you are running 0.32.0, which suggests to me that you may have upgraded Vector. Is that the case? If so, is the behavior different with 0.31.0?

Note that I would expect to see events in increasing relative to events out if either of the following is happening (a quick check is sketched after this list):

  • The sink is falling behind
  • Additional batch partitions are being created in parallel
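
A quick way to tell a sink that is genuinely falling behind apart from broken counters is to compare the short-term rates of the two metrics. A rough sketch, assuming Vector's internal metrics are scraped by Prometheus and the S3 sink's component_id is out_s3_default (adjust both to your setup):

# Ratio of the sink's sent-event rate to its received-event rate.
# A value near 1 means the sink is keeping up; a value near 0 while data
# still lands in S3 points at a telemetry bug rather than backpressure.
rate(vector_component_sent_events_total{component_id="out_s3_default"}[5m])
  / rate(vector_component_received_events_total{component_id="out_s3_default"}[5m])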

@jgournet
Author

Hi @jszwedko ,
Thanks for those questions, and sorry for the lack of information - I was hoping the issue would be widespread enough that someone else could chime in with useful details.

In the meantime, here's some info:
Graph:
[screenshot: graph of the received-minus-sent metric per agent]

The Prometheus query for this graph is:

vector_component_received_events_total{component_id="out_s3_default"} - vector_component_sent_events_total{component_id="out_s3_default"}

A few notes:

  • the drops are from us restarting the agents, up until the last massive drop, which is when we set K8s to run the 0.31 image
  • as far as we know, we did not lose any logs and the sink was working properly (checked by running "vector top" on a running agent: the "in" events were at ~500k and climbing fast, while the "out" events were at ~3k and climbing very slowly; all the while, we could see that logs were being delivered properly)
  • we did not change anything at all around this
  • it's hard to see from this graph, but quite a few agents are starting to climb as well (see the per-agent query sketch after these notes)
  • we've never seen anything like this before
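
To see which agents are affected, the same difference can be broken out per agent. A rough sketch, assuming each agent is scraped as its own Prometheus instance (the grouping label may be pod or host depending on your scrape setup):

# Per-agent gap between received and sent events for the S3 sink.
sum by (instance) (
    vector_component_received_events_total{component_id="out_s3_default"}
  - vector_component_sent_events_total{component_id="out_s3_default"}
)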

For the "version" question: we actually use image timberio/vector:latest-alpine ... so we guess that agents that started up recently got auto-upgraded to 0.32, and started showing this issue.

Also, in one of our test environments we currently have 5 Vector agents: 3 are running 0.32 and 2 are still on the old 0.31.
The three 0.32 agents are showing this behavior, so if you have more questions, I can hopefully use those to help out.

@jszwedko
Member

Thanks for the additional details @jgournet . That was enough for me to try to reproduce this. I think I was able to. I'm bisecting down to find the commit that introduced the bug now. I'll flag this to be fixed in 0.32.1. For now, I'd use the 0.31.0 docker image.

@jszwedko jszwedko added domain: observability Anything related to monitoring/observing Vector meta: regression This issue represents a regression labels Aug 16, 2023
@jszwedko jszwedko added this to the Vector 0.32.1 milestone Aug 16, 2023
@jszwedko
Member

jszwedko commented Aug 16, 2023

Bisected down to 0bf6abd

This is the configuration I was testing with:

[sources.source0]
type = "demo_logs"
interval = 0
format = "json"
decoding.codec = "json"

[sinks.sink0]
type = "aws_s3"
inputs = ["source0"]
bucket = "timberio-jesse-test"
key_prefix = "18265/date=%F/"
encoding.codec = "json"
framing.method = "newline_delimited"

[sources.source1]
type = "internal_metrics"

[sinks.sink1]
type = "datadog_metrics"
inputs = ["source1"]
default_api_key = "${DD_API_KEY}"

It appears that component_sent_events_total is very small (sub 10) compared to component_received_events_total (around 50k). I was seeing differences of tens of thousands in my test setup.
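
For anyone reproducing this, a rough way to quantify the discrepancy (or verify a fix) is to compare how much each counter increased over the test window. A sketch assuming the internal metrics are also scraped by Prometheus rather than only shipped to Datadog, and using the sink0 component_id from the config above:

# On an affected build this difference grows into the tens of thousands
# within minutes; on a healthy build it should stay comparatively small.
increase(vector_component_received_events_total{component_id="sink0"}[10m])
  - increase(vector_component_sent_events_total{component_id="sink0"}[10m])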

@jszwedko jszwedko changed the title Sink events in and out metric issue Component telemetry inaccurate for some components in 0.32.0 Aug 16, 2023
@jszwedko jszwedko pinned this issue Aug 16, 2023
@jgournet
Author

Thank you @jszwedko! Quite impressive how you managed to track this down with so little information!
(and yes, we'll use 0.31 in the meantime - although, now that we know these are false alerts, we're not too worried)

@jgournet
Author

@jszwedko :
FYI: I tried 0.32.1 this morning and this issue is still there - not sure if it's intended or not ;)

@jszwedko
Member

jszwedko commented Aug 23, 2023

@jszwedko : FYI: I tried 0.32.1 this morning and this issue is still there - not sure if it's intended or not ;)

Definitely not intended 🙂

Can you share the configuration you are using? I tested again just now and the aws_s3 sink, at least, appears to be reporting the correct metrics with v0.32.1.

@jgournet
Author

Sorry, ignore that: it seems we had some old nodes that did not pull "latest" properly. We'll try again with "Always" as the pullPolicy, but it seems OK after all.

@jszwedko
Member

Sorry, ignore that: it seems we had some old nodes that did not pull "latest" properly. We'll try again with "Always" as the pullPolicy, but it seems OK after all.

Thanks for confirming and for the initial report!
