
Component telemetry inaccurate for some components in 0.32.0 #18265

Closed
jgournet opened this issue Aug 16, 2023 · 9 comments · Fixed by #18289
Labels: domain: observability · meta: regression · type: bug
Milestone: Vector 0.32.1

Comments

@jgournet

jgournet commented Aug 16, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Sorry, this is a very light ticket - just describing the issue we're facing:
Since this morning (16/08), we've been getting alerts because the gap between events in and events out keeps growing.
However, logs are still being pushed to our S3 sink, so it looks like the metric is not reporting correctly.

Is anyone else affected who could help provide more info?

Configuration

No response

Version

vector 0.32.0 (x86_64-unknown-linux-musl 1b403e1 2023-08-15 14:56:36.089460954)

Debug Output

No response

Example Data

No response

Additional Context

Reverting to 0.31.0-alpine seems to fix the metric.

References

No response

@jgournet jgournet added the type: bug A code related bug. label Aug 16, 2023
@jszwedko
Member

Hi @jgournet ,

Could you share some graphs? Also your configuration? Did you change anything this morning? I see you are running 0.32.0, which suggests to me that you may have upgraded Vector. Is that the case? If so, is the behavior different with 0.31.0?

Note that I would expect to see events in increasing relative to events out if either of the following is happening (a quick check is sketched after this list):

  • The sink is falling behind
  • Additional batch partitions are being created in parallel
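
A quick way to tell a sink that is genuinely falling behind apart from broken counters is to compare the short-term rates of the two metrics. A rough sketch, assuming Vector's internal metrics are scraped by Prometheus and the S3 sink's component_id is out_s3_default (adjust both to your setup):

# Ratio of the sink's sent-event rate to its received-event rate.
# A value near 1 means the sink is keeping up; a value near 0 while data
# still lands in S3 points at a telemetry bug rather than backpressure.
rate(vector_component_sent_events_total{component_id="out_s3_default"}[5m])
  / rate(vector_component_received_events_total{component_id="out_s3_default"}[5m])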

@jgournet
Author

Hi @jszwedko ,
Thanks for those questions, and sorry for the lack of information - I was hoping the issue would be widespread enough that someone else could chime in with useful details.

In the meantime, here's some info:
Graph:
[screenshot: graph of the received-minus-sent metric per agent]

The Prometheus query for this graph is:

vector_component_received_events_total{component_id="out_s3_default"} - vector_component_sent_events_total{component_id="out_s3_default"}

A few notes:

  • the drops are from us restarting the agents, up until the last massive drop, which is when we set K8s to run the 0.31 image
  • as far as we know, we did not lose any logs and the sink was working properly (checked by running "vector top" on a running agent: the "in" events were at ~500k and climbing fast, while the "out" events were at ~3k and climbing very slowly; all the while, we could see that logs were being delivered properly)
  • we did not change anything at all around this
  • it's hard to see from this graph, but quite a few agents are starting to climb as well (see the per-agent query sketch after these notes)
  • we've never seen anything like this before
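
To see which agents are affected, the same difference can be broken out per agent. A rough sketch, assuming each agent is scraped as its own Prometheus instance (the grouping label may be pod or host depending on your scrape setup):

# Per-agent gap between received and sent events for the S3 sink.
sum by (instance) (
    vector_component_received_events_total{component_id="out_s3_default"}
  - vector_component_sent_events_total{component_id="out_s3_default"}
)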

For the "version" question: we actually use image timberio/vector:latest-alpine ... so we guess that agents that started up recently got auto-upgraded to 0.32, and started showing this issue.

Also, in one of our test environments we currently have 5 Vector agents: 3 are running 0.32 and 2 are still on the old 0.31.
The three 0.32 agents are showing this behavior, so if you have more questions, I can hopefully use those to help out.

@jszwedko
Member

Thanks for the additional details @jgournet . That was enough for me to try to reproduce this. I think I was able to. I'm bisecting down to find the commit that introduced the bug now. I'll flag this to be fixed in 0.32.1. For now, I'd use the 0.31.0 docker image.

@jszwedko jszwedko added domain: observability Anything related to monitoring/observing Vector meta: regression This issue represents a regression labels Aug 16, 2023
@jszwedko jszwedko added this to the Vector 0.32.1 milestone Aug 16, 2023
@jszwedko
Member

jszwedko commented Aug 16, 2023

Bisected down to 0bf6abd

This is the configuration I was testing with:

[sources.source0]
type = "demo_logs"
interval = 0
format = "json"
decoding.codec = "json"

[sinks.sink0]
type = "aws_s3"
inputs = ["source0"]
bucket = "timberio-jesse-test"
key_prefix = "18265/date=%F/"
encoding.codec = "json"
framing.method = "newline_delimited"

[sources.source1]
type = "internal_metrics"

[sinks.sink1]
type = "datadog_metrics"
inputs = ["source1"]
default_api_key = "${DD_API_KEY}"

It appears that component_sent_events_total is very small (sub 10) compared to component_received_events_total (around 50k). I was seeing differences of tens of thousands in my test setup.
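
For anyone reproducing this, a rough way to quantify the discrepancy (or verify a fix) is to compare how much each counter increased over the test window. A sketch assuming the internal metrics are also scraped by Prometheus rather than only shipped to Datadog, and using the sink0 component_id from the config above:

# On an affected build this difference grows into the tens of thousands
# within minutes; on a healthy build it should stay comparatively small.
increase(vector_component_received_events_total{component_id="sink0"}[10m])
  - increase(vector_component_sent_events_total{component_id="sink0"}[10m])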

@jszwedko jszwedko changed the title Sink events in and out metric issue Component telemetry inaccurate for some components in 0.32.0 Aug 16, 2023
@jszwedko jszwedko pinned this issue Aug 16, 2023
@jgournet
Author

Thank you @jszwedko! Quite impressive how you managed to track this down with so little information!
(and yes, we'll use 0.31 in the meantime - although, now that we know these are false alerts, we're not too worried)

@jgournet
Author

@jszwedko :
FYI: I tried 0.32.1 this morning and this issue is still there - not sure if it's intended or not ;)

@jszwedko
Member

jszwedko commented Aug 23, 2023

@jszwedko : FYI: I tried 0.32.1 this morning and this issue is still there - not sure if it's intended or not ;)

Definitely not intended 🙂

Can you share the configuration you are using? I tested again just now and the aws_s3 sink, at least, appears to be reporting the correct metrics with v0.32.1.

@jgournet
Author

Sorry, ignore that: it seems we had some old nodes that did not pull "latest" properly. We'll try again with "Always" as the pullPolicy, but it seems OK after all.

@jszwedko
Member

Sorry, ignore that: it seems we had some old nodes that did not pull "latest" properly. We'll try again with "Always" as the pullPolicy, but it seems OK after all.

Thanks for confirming and for the initial report!
