Query: add thanos_store_up metric to StoreSet #900
Conversation
Force-pushed from 7d6a311 to 0062039
Hey there 👋! Love the change! Would it make sense to have that logic in the …
Force-pushed from 0062039 to 619fda7
Hey 👋
By the way, I'm working on a similar subject with the Stores page in the Query UI. What should happen when a store leaves the DNS/file config? Right now it stays on the UI page forever; I'm working on cleaning this up. I guess it should be the same for the metrics? 🤔 Any thoughts @bwplotka?
Happy to hear that!
Maybe let's wait for #910 then? 👍
Yep, I already patched your PR in to test, and it's pretty simple to just deregister it with your logic.
Force-pushed from a399fdc to 295a894
Force-pushed from 295a894 to d472b3a
…o add_store_status_metric
# Conflicts:
#	pkg/query/storeset.go
Force-pushed from d472b3a to ee001d9
LGTM
Guys, I love the idea, but I have to reject this. This PR, in its current form, introduces a cardinality bomb 💣 💣 💣 ^^
All in the comments.
storeNodeStatus := prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "thanos_store_up",
	Help: "Represents the status of each store node.",
}, []string{"node"})
- Bit of inconsistency: you put `node` here. I don't like this; `node` feels like something host related, but we are talking about apps and services. Also, everywhere else the store is passed as `addr`. `addr` is an `IP:port` and it potentially CHANGES on every rollout, independently of pod_name (for k8s) etc. On top of this it is a unique unicode string, so non-deduplicatable in TSDB... this might be a nasty cardinality bomb on top of the regular pod name, instance, etc.
We need something smarter.
to 1. I know the above metric is called `node` as well, but it's.. legacy ;p
@@ -349,6 +357,7 @@ func (s *StoreSet) updateStoreStatus(store *storeRef, err error) {
 	s.storesStatusesMtx.Lock()
 	defer s.storesStatusesMtx.Unlock()
 
+	s.storeNodeStatus.WithLabelValues(store.addr).Set(0)
Sorry, but I think we are making a mistake here. Why are we flipping the gauge on every update? So e.g. if some store API is up 100% of the time, Prometheus can still see it as down, because it scrapes concurrently with this code.
So if a scrape happens while we are in line 367, we expose incorrect information. Please fix it by setting the gauge value here based on the error every time. Also, I'm not sure this is the correct place: the method name suggests we are updating the status (for the UI), NOT the metric. This might be quite confusing for a reader.
I believe setting this metric near the health check is the best solution and the most intuitive. What do you think? (:
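For illustration, a minimal sketch of the "set the gauge based on the error, near the health check" idea. The recordUpness helper and its wiring are assumptions for this sketch rather than the PR's actual code, and it keeps the per-address label only to mirror the diff above; the cardinality concern raised earlier still stands:

```go
package main

import (
	"errors"

	"github.com/prometheus/client_golang/prometheus"
)

// storeUp stands in for the PR's storeNodeStatus GaugeVec (hypothetical name).
var storeUp = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "thanos_store_up",
	Help: "Whether the given store API passed its last health check (1) or not (0).",
}, []string{"addr"})

// recordUpness sets the gauge exactly once per health check, based on the
// error, so a concurrent scrape never sees a transient 0 for a healthy store.
func recordUpness(addr string, err error) {
	if err != nil {
		storeUp.WithLabelValues(addr).Set(0)
		return
	}
	storeUp.WithLabelValues(addr).Set(1)
}

func main() {
	prometheus.MustRegister(storeUp)
	recordUpness("10.0.0.1:10901", nil)                     // healthy store -> 1
	recordUpness("10.0.0.2:10901", errors.New("rpc error")) // failed check  -> 0
}
```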
@@ -421,6 +432,7 @@ func (s *StoreSet) cleanUpStoreStatuses() {
 	for addr, status := range s.storeStatuses {
 		if _, ok := s.stores[addr]; !ok {
 			if now.Sub(status.LastCheck) >= s.unhealthyStoreTimeout {
+				s.storeNodeStatus.DeleteLabelValues(addr)
Not sure, how does Prometheus do it? I think it keeps the thing forever, AFAIK. Again, let's separate the UI from the metric system.
Prometheus's metrics for the Alertmanager are the closest thing to this; it looks like we aren't doing any cleanup there currently, which should be fixed.
TODO: Add issue on AM
@@ -85,6 +85,7 @@ type StoreSet struct {
 	storesStatusesMtx    sync.RWMutex
 	stores               map[string]*storeRef
 	storeNodeConnections prometheus.Gauge
+	storeNodeStatus      *prometheus.GaugeVec
Can we change the name? `node` again sounds like something host related to me.
Proposition:
-	storeNodeStatus *prometheus.GaugeVec
+	storeUpnessMetric *prometheus.GaugeVec
The major problem of … I think the solution here, instead of … BTW, this is the code for Prometheus …
Your alert will look like … then.
Also.. why can't we use …? Thanks for your work @mreichardt95, but I think improving …
We had some offline discussion. We could potentially, in this PR:
If some discovery gives an IP … I think some opinion from @juliusv and @bbrazil would be helpful, as this is a classic discovery + upness + scrape problem. However, we do not scrape here; we only discover, query, and maintain the list of valid store APIs.
The external labels will give you a false impression for the store gateway (no labels), but we want to add a custom label to the Store GW at some point to allow better filtering.
Metric labels should not depend on the other side providing metadata; it's all about what this particular server knows. In this case I'd stick to metrics that aren't per endpoint, such as the total number of stores and how many of those are up. If you need more information you can look at the UI.
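As a sketch of that direction, endpoint-agnostic gauges could look roughly like this; the metric names and the updateStoreGauges helper are illustrative assumptions, not what Thanos ultimately shipped:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Two plain gauges with no per-endpoint labels: cardinality stays bounded no
// matter how many store APIs come and go. Names here are made up for the sketch.
var (
	storeAPIsTotal = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "thanos_querier_store_apis",
		Help: "Number of store APIs the querier currently knows about.",
	})
	storeAPIsUp = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "thanos_querier_store_apis_up",
		Help: "Number of known store APIs that passed the last health check.",
	})
)

// updateStoreGauges would run after each health-check pass over the store set.
func updateStoreGauges(known, healthy int) {
	storeAPIsTotal.Set(float64(known))
	storeAPIsUp.Set(float64(healthy))
}

func main() {
	prometheus.MustRegister(storeAPIsTotal, storeAPIsUp)
	updateStoreGauges(5, 4) // e.g. 5 discovered store APIs, 4 currently healthy
}
```

An alert on upness then compares the two series, with no per-store label churn involved.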
Thanks @brian-brazil, that makes sense. This raises the question: do we need to expose any component type / external labels in this metric? A kind of follow-up question: it's not easy to tell which component introduces latency, so e.g. a client_grpc histogram could have those external labels, or at least the component type, included as well.
Component sounds okay to me cardinality-wise. For the backends you're probably looking at logs or tracing, as I can easily imagine there being hundreds to thousands of stores.
I also wanted to add gRPC request counts & duration buckets in the Thanos Querier per Store API. So it would be gRPC client-side metrics for each Thanos Store backend. IMO it would be super helpful for the debuggability of Thanos, as you could go and tell that this concrete Store API is misbehaving. Should we still consider this, or is it a hard no due to cardinality? We could also consider having …
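If this went ahead, one bounded option would be client-side gRPC metrics labelled by store component and method rather than by address. Everything below (the metric name, the component label, the interceptor wiring, and the unary-only scope) is an assumption sketched for discussion, not the eventual Thanos implementation:

```go
package main

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc"
)

// One series per (component, gRPC method) rather than per endpoint, which keeps
// cardinality bounded even with hundreds or thousands of store backends.
var storeRequestDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "thanos_querier_store_grpc_request_duration_seconds",
	Help:    "Client-side latency of gRPC calls to store APIs.",
	Buckets: prometheus.DefBuckets,
}, []string{"component", "grpc_method"})

// instrument returns a unary client interceptor recording call durations under
// the given component label (e.g. "sidecar", "store", "rule").
func instrument(component string) grpc.UnaryClientInterceptor {
	return func(ctx context.Context, method string, req, reply interface{},
		cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		start := time.Now()
		err := invoker(ctx, method, req, reply, cc, opts...)
		storeRequestDuration.WithLabelValues(component, method).Observe(time.Since(start).Seconds())
		return err
	}
}

func main() {
	prometheus.MustRegister(storeRequestDuration)
	// Hypothetical store API address; every unary call through conn is observed.
	conn, err := grpc.Dial("store.example:10901",
		grpc.WithInsecure(),
		grpc.WithUnaryInterceptor(instrument("store")))
	if err != nil {
		panic(err)
	}
	defer conn.Close()
}
```

A streaming interceptor would also be needed for Series calls, but the labelling question is the same.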
So, yes, I think something with external labels when … The reason for such a design is that everything is embedded in iterators within …
Moving this out of the 0.4.0 plan as it needs more work ):
BTW, I totally forgot to mention, but we added this metric: #1260. Do you think it solves your use cases? (:
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Fixes #880
Changes
Add thanos_up{node="{NODE_IP:PORT}"} 0/1 metric to StoreSet

Verification
Tested with: