re-add the missing prometheus_tsdb_wal_corruptions_total #473
Conversation
@codesome mind having a quick look?
After implementing the new WAL this metric went missing, so I am adding it again. Also added a check for it in a test to make sure it works as expected. Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
a3ac035 to 9e51d56
A bit occupied with KubeCon and AFK. Will take a look in a couple of days.
@@ -480,10 +486,10 @@ func (h *Head) Init(minValidTime int64) error {
		return nil
	}
	level.Warn(h.logger).Log("msg", "encountered WAL error, attempting repair", "err", err)
	h.metrics.walCorruptionsTotal.Inc()
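For context, the walCorruptionsTotal field incremented above is a plain Prometheus counter registered with the rest of the head metrics. A minimal sketch of what re-adding it involves (simplified; the real headMetrics struct in head.go carries many more fields, and the help text here is only illustrative):

package tsdb

import "github.com/prometheus/client_golang/prometheus"

// Sketch: the counter incremented in the diff above, defined and registered
// alongside the other head metrics. Only the re-added field is shown here.
type headMetrics struct {
	walCorruptionsTotal prometheus.Counter
}

func newHeadMetrics(r prometheus.Registerer) *headMetrics {
	m := &headMetrics{
		walCorruptionsTotal: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "prometheus_tsdb_wal_corruptions_total",
			Help: "Total number of WAL corruptions.", // illustrative help text
		}),
	}
	if r != nil {
		r.MustRegister(m.walCorruptionsTotal)
	}
	return m
}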
I believe we detect WAL corruptions when we call loadWAL. So do we also need to increment it for this: https://github.com/prometheus/tsdb/blob/9e51d56e08958f22f55daf26795ee477def7797e/head.go#L471-L473
And also maybe a small test for that?
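As an aside, a rough shape of such a small test might be the following, assuming hypothetical helpers for writing and corrupting a WAL (writeTestWAL, corruptWALSegment, and openTestHead are stand-ins, not actual tsdb helpers):

package tsdb

import (
	"testing"

	"github.com/prometheus/client_golang/prometheus/testutil"
)

// Hypothetical sketch of the kind of small test discussed here: write a WAL,
// damage it on disk, re-open the head, and assert the counter moved.
func TestWALCorruptionMetric(t *testing.T) {
	dir := t.TempDir()

	writeTestWAL(t, dir)      // hypothetical: write a valid WAL with a few records
	corruptWALSegment(t, dir) // hypothetical: damage the last segment on disk

	h := openTestHead(t, dir) // hypothetical: open a Head over the corrupted WAL
	defer h.Close()

	// testutil.ToFloat64 reads the current value of a single counter.
	if got := testutil.ToFloat64(h.metrics.walCorruptionsTotal); got != 1 {
		t.Fatalf("expected 1 WAL corruption to be counted, got %v", got)
	}
}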
When this happens Prometheus will exit, so why would we increment there?
If there is no way to recover this Inc() info after we return, then there is no need to add it here.
Maybe add the metric directly to wal.Repair() so we can know when there is a corruption and whether or not it has been repaired?
I don't think it makes much difference where we place it, but it's not a bad idea.
How would we know if the corruption has been repaired?
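For illustration only, one way to read the wal.Repair() suggestion is to count the detection and the successful repair separately, so both questions can be answered from metrics. This is a purely hypothetical sketch, not the actual Repair implementation, and the metric fields and helper are made up:

// Hypothetical sketch only: the real wal.Repair does not own these counters,
// and repairSegments is a stand-in for the actual truncation logic.
func (w *WAL) Repair(origErr error) error {
	w.corruptionsTotal.Inc() // a corruption was detected, repair is being attempted

	if err := w.repairSegments(origErr); err != nil {
		return err
	}

	w.repairsTotal.Inc() // no error returned, so the corruption was repaired
	return nil
}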
LGTM 👍
@dkalashnik how are you using the prometheus_tsdb_wal_corruptions_total metric?
@krasi-georgiev We are using it as a part of the generic Prometheus dashboard in Grafana, so no specific case.
@dkalashnik how is your alerting defined based on this metric?
@krasi-georgiev We don't have alerts based on that metric, just a panel in the dashboard.
https://github.com/prometheus/prometheus/releases Some of these changes seem to be interesting enough to update:
- [ENHANCEMENT] Query performance improvements. prometheus-junkyard/tsdb#531
- [BUGFIX] Scrape: catch errors when creating HTTP clients #5182. Adds new metrics: prometheus_target_scrape_pools_*
- Deprecating the flag storage.tsdb.retention -> use storage.tsdb.retention.time
- [FEATURE] Add subqueries to PromQL.
- [ENHANCEMENT] Kubernetes SD: Add service external IP and external name to the discovery metadata. #4940
- [ENHANCEMENT] Add metric for number of rule groups loaded. #5090
- [BUGFIX] Make sure the retention period does not overflow. #5112
- [BUGFIX] Make sure the blocks do not get very large. #5112
- [BUGFIX] Do not generate blocks with no samples. prometheus-junkyard/tsdb#374
- [BUGFIX] Reintroduce metric for WAL corruptions. prometheus-junkyard/tsdb#473

Signed-off-by: Mikkel Oscar Lyderik Larsen <mikkel.larsen@zalando.de>
closes #471
After implementing the new WAL this metric went missing, so I am adding it again.
Also added a check for it in a test to make sure it works as expected.
Signed-off-by: Krasi Georgiev kgeorgie@redhat.com