
Query: add thanos_store_up metric to StoreSet #900

Closed

Conversation

mreichardt95 (Contributor)

Fixes #880

Changes

  • added thanos_up{node="{NODE_IP:PORT}"} 0/1 metric to StoreSet

Verification

Tested with:

  • static store
  • DNS A discovery
  • file discovery

@adrien-f (Member)

adrien-f commented Mar 9, 2019

Hey there 👋 !

Love the change! Would it make sense to have that logic in the updateStoreStatus function? I'm handling stores disappearing in it in a WIP PR. Also, I'm not sure about the metric name; maybe thanos_store_up?

@mreichardt95 (Contributor, Author)

Hey 👋
Thanks for your input. It sure does look a lot cleaner this way; I totally overlooked that function.
The initial intention was to have an additional type label, but I realised it's all of type store and forgot to change the name 🤦‍♂️.

@mreichardt95 mreichardt95 changed the title Query: add thanos_up metric to StoreSet Query: add thanos_store_up metric to StoreSet Mar 9, 2019
@adrien-f (Member)

adrien-f commented Mar 9, 2019

By the way, I'm working on a similar subject with the Stores page in the Query UI.

What should happen when a store leaves the DNS/file config? Right now it stays on the UI page forever; I'm working on cleaning that up. I guess it should be the same for the metrics? 🤔

Any thoughts, @bwplotka?

@mreichardt95 (Contributor, Author)

Happy to hear that!
Yes, right now the metric stays at 0.0 until Query gets restarted. It makes more sense to properly deregister it as soon as it is no longer in file SD or DNS.
So I guess this depends on your PR and will have to be adjusted afterwards. Should I close this one for now?

@adrien-f (Member)

Maybe let's wait for #910 then? 👍

@mreichardt95 (Contributor, Author)

Yep, I already patched it into your PR to test, and it's pretty simple to deregister it with your logic.

pkg/query/storeset.go (outdated review thread)
…o add_store_status_metric

# Conflicts:
#	pkg/query/storeset.go
@GiedriusS (Member) left a comment


LGTM

@bwplotka (Member) left a comment


Guys, I love the idea, but I have to reject this. This PR, in its current form, introduces a cardinality bomb 💣 💣 💣 ^^

All in the comments.

```go
storeNodeStatus := prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "thanos_store_up",
	Help: "Represents the status of each store node.",
}, []string{"node"})
```
@bwplotka (Member) commented Apr 13, 2019


  1. A bit of an inconsistency: you use node here. I don't like this; node feels like something host-related, but we are talking about apps and services, and everywhere else the store is passed as addr.
  2. addr is an IP:port and it potentially CHANGES on every rollout, independently of pod_name (for k8s) etc. On top of this it is a unique unicode string, so it is non-deduplicatable in TSDB... this might be a nasty cardinality bomb on top of the regular pod name, instance, etc. labels.

We need something smarter.

Member

To 1.: I know the metric above is called node as well, but that's.. legacy ;p

```diff
@@ -349,6 +357,7 @@ func (s *StoreSet) updateStoreStatus(store *storeRef, err error) {
 	s.storesStatusesMtx.Lock()
 	defer s.storesStatusesMtx.Unlock()
 
+	s.storeNodeStatus.WithLabelValues(store.addr).Set(0)
```
Member

Sorry, but I think we are making a mistake here. Why are we flipping the gauge on every update? If some store API is up 100% of the time, Prometheus can still see it as down, because it scrapes concurrently with this code.

So if a scrape happens while we are on line 367, we expose incorrect information. Please fix it by setting the gauge value here based on the error every time. Also, I'm not sure this is the right place; the method name suggests we are updating the status (for the UI), NOT the metric. This might be quite confusing for a reader.

I believe setting this metric near the health check is the best and most intuitive solution. What do you think? (:
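
A minimal sketch of this idea, assuming a hypothetical storeUp gauge and a generic probe callback (neither is the PR's actual code): the metric is written exactly once per health check, from the check's result, so a concurrent scrape never catches an intermediate value.

```go
package query

import (
	"context"

	"github.com/prometheus/client_golang/prometheus"
)

// storeUp is a hypothetical per-store health gauge; the name and label are
// illustrative only.
var storeUp = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "thanos_store_up",
	Help: "1 if the StoreAPI answered the last health check, 0 otherwise.",
}, []string{"addr"})

// checkStore writes the gauge once per health check, derived from the
// probe's error, instead of toggling it inside updateStoreStatus. A scrape
// running concurrently therefore sees either the previous or the new value,
// never a transient flip to 0.
func checkStore(ctx context.Context, addr string, probe func(context.Context) error) error {
	err := probe(ctx)
	up := 1.0
	if err != nil {
		up = 0.0
	}
	storeUp.WithLabelValues(addr).Set(up)
	return err
}
```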

```diff
@@ -421,6 +432,7 @@ func (s *StoreSet) cleanUpStoreStatuses() {
 	for addr, status := range s.storeStatuses {
 		if _, ok := s.stores[addr]; !ok {
 			if now.Sub(status.LastCheck) >= s.unhealthyStoreTimeout {
+				s.storeNodeStatus.DeleteLabelValues(addr)
```
Member

Not sure; how does Prometheus do it? AFAIK it keeps the series forever. Again, let's separate the UI from the metric system.

Contributor

Prometheus's metrics for the Alertmanager are the closest thing to this; it looks like we aren't doing any cleanup there currently, which should be fixed.

Member

TODO: Add issue on AM

```diff
@@ -85,6 +85,7 @@ type StoreSet struct {
 	storesStatusesMtx    sync.RWMutex
 	stores               map[string]*storeRef
 	storeNodeConnections prometheus.Gauge
+	storeNodeStatus      *prometheus.GaugeVec
```
Member

Can we change the name?

Node again sounds host-related to me.

Proposition:

Suggested change:

```diff
-	storeNodeStatus *prometheus.GaugeVec
+	storeUpnessMetric *prometheus.GaugeVec
```

@bwplotka (Member)

The major problem with using addr as the store up metric's label relates to @adrien-f's PR with UI statuses and to the question of HOW to identify StoreAPIs. Currently we do that by IP:port, and that's wrong, as the same "StoreAPI" can have many addresses during its lifetime.

I think the solution here, instead of addr, is essentially a component label (the type of StoreAPI). We already have component types, and ALL components should properly propagate their type in the Info call. If Info is down, and thus the StoreAPI is down, we just don't do anything (the upness of the type this StoreAPI used to be will decrease). This means the upness metric is a gauge of all working StoreAPIs per type, and we could implement it by counting all healthy stores per type.

BTW, this is the code for Prometheus's up: https://github.com/prometheus/prometheus/blob/12708acd1516e0d1756006a54684ccaa76f1aaf2/scrape/scrape.go#L1178
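
A rough sketch of that counting approach; the storesUp gauge, the healthyByComponent snapshot and updateUpMetric are illustrative names and assumptions, not code from the PR:

```go
package query

import "github.com/prometheus/client_golang/prometheus"

// storesUp is a hypothetical gauge: number of healthy StoreAPIs per component type.
var storesUp = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "thanos_store_up",
	Help: "Number of healthy StoreAPI nodes per component type.",
}, []string{"component"})

// updateUpMetric recomputes the gauge from the healthy set gathered during
// the last health-check pass. healthyByComponent maps a component type
// (e.g. "sidecar", "store") to the addresses whose Info call succeeded.
func updateUpMetric(healthyByComponent map[string][]string) {
	storesUp.Reset() // drop component types that disappeared entirely
	for component, addrs := range healthyByComponent {
		storesUp.WithLabelValues(component).Set(float64(len(addrs)))
	}
}
```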

@bwplotka (Member)

Your alert would then look like absent(thanos_store_up{component="sidecar"}) or thanos_store_up{component="sidecar"} < [number you expect].

@bwplotka (Member)

bwplotka commented Apr 13, 2019

Also.. why can't we use thanos_store_node_info for the same logic? It is missing the component info, which we could add, but essentially this is our up, and even better, it doesn't kill cardinality (though it does have a long label value that concatenates all labels).

Thanks for your work @mreichardt95, but I think improving thanos_store_node_info might be the way to go?
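
For reference, a hypothetical shape of thanos_store_node_info with a component label added; the existing metric's actual label set may differ, so treat this purely as a sketch:

```go
package query

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical shape of an info-style metric extended with a component label;
// the real thanos_store_node_info may carry different labels.
var storeNodeInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "thanos_store_node_info",
	Help: "Info-style metric (always 1) for each known StoreAPI node.",
}, []string{"external_labels", "component"})

// recordStoreInfo marks a store as known, keyed by its flattened external
// labels and the component type reported via the Info call.
func recordStoreInfo(externalLabels, component string) {
	storeNodeInfo.WithLabelValues(externalLabels, component).Set(1)
}
```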

@bwplotka (Member)

We had some offline discussion.

We could potentially, in this PR:

  • change thanos_store_node_info to thanos_store_up. However, this might be confusing as we don't have any notion of what should be up. E.g. if some discovery gives IP abc:port and it is not a valid gRPC StoreAPI, we would report up 0... but for what labels, if we have never seen what labels are propagated by this service?

I think some opinion from @juliusv and @bbrazil would be helpful, as this is the classic discovery + upness + scrape problem. However, we do not scrape here; we just discover, query and maintain the list of valid StoreAPIs.

  • leave those external labels as the only label for thanos_store_up
    • alternatively: append the labels directly into the metric with an ext_lname_... prefix for each label name, so thanos_store_node_info{external_labels="cluster='a';env='b'"} would become thanos_store_up{ext_lname_cluster="a", ext_lname_env="b"}, which would potentially help in querying later on (see the sketch after this list).

The external-labels label will give a false impression for store gateway (no labels), but we want to add a custom label to Store GW at some point to allow better filtering.

  • remove the node mention for this metric in variable names. Also move the handling closer to the health-check code.
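
Since client_golang vectors need their label names fixed at construction time, dynamic ext_lname_* labels would most likely need a custom collector emitting const metrics. A purely illustrative sketch of that shape (the snapshot callback and types are assumptions, and external label names are assumed to already be valid Prometheus label names); it would be registered with prometheus.MustRegister:

```go
package query

import "github.com/prometheus/client_golang/prometheus"

// storeSnapshot is a hypothetical view of one store: its external labels and
// whether its last health check succeeded.
type storeSnapshot struct {
	extLabels map[string]string
	up        bool
}

// storeUpCollector emits one thanos_store_up series per store, with the
// store's external labels flattened into ext_lname_-prefixed label names.
type storeUpCollector struct {
	snapshot func() []storeSnapshot // provided by the StoreSet in this sketch
}

// Describe is intentionally empty, making this an "unchecked" collector,
// since the label names vary from store to store.
func (c *storeUpCollector) Describe(ch chan<- *prometheus.Desc) {}

func (c *storeUpCollector) Collect(ch chan<- prometheus.Metric) {
	for _, s := range c.snapshot() {
		names := make([]string, 0, len(s.extLabels))
		values := make([]string, 0, len(s.extLabels))
		for k, v := range s.extLabels {
			names = append(names, "ext_lname_"+k)
			values = append(values, v)
		}
		desc := prometheus.NewDesc(
			"thanos_store_up",
			"Whether the StoreAPI with these external labels is healthy.",
			names, nil,
		)
		up := 0.0
		if s.up {
			up = 1.0
		}
		ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, up, values...)
	}
}
```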

@brian-brazil (Contributor)

> If we have never seen what labels are propagated by this service?

Metric labels should not depend on the other side providing metadata; it's all about what this particular server knows.

In this case I'd stick to metrics that aren't per endpoint, such as the total number of stores and how many of those are up. If you need more information you can look at the UI.
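
In that spirit, a minimal sketch with two plain gauges and no per-endpoint labels at all; metric and function names here are illustrative, not the ones eventually used by Thanos:

```go
package query

import "github.com/prometheus/client_golang/prometheus"

var (
	// storesDiscovered: how many StoreAPI addresses discovery currently yields.
	storesDiscovered = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "thanos_query_stores_discovered",
		Help: "Number of StoreAPI addresses currently known from discovery.",
	})
	// storesHealthy: how many of those passed the last health check.
	storesHealthy = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "thanos_query_stores_healthy",
		Help: "Number of discovered StoreAPIs that passed the last health check.",
	})
)

// recordStoreCounts is called once after each health-check pass; there is no
// per-endpoint label, so cardinality stays constant regardless of store count.
func recordStoreCounts(discovered, healthy int) {
	storesDiscovered.Set(float64(discovered))
	storesHealthy.Set(float64(healthy))
}
```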

@bwplotka (Member)

Thanks @brian-brazil, that makes sense. This raises the question of whether we need to expose any component type / external labels in this metric at all.

A follow-up question: it's not easy to tell which component introduces latency, so e.g. the client gRPC histogram could also include those external labels, or at least the component type.

@brian-brazil (Contributor)

Component sounds okay to me cardinality wise. For backends you're probably looking at logs or tracing, as I can easily imagine there being hundreds to thousands of stores.

@povilasv (Member)

povilasv commented Apr 15, 2019

I also wanted to add gRPC request counts & duration buckets in Thanos Querier per Store API, i.e. gRPC client-side metrics for each Thanos Store backend.

IMO it would be super helpful for the debuggability of Thanos, as you could go and tell that a concrete Store API is misbehaving.

Should we still consider this, or is it a hard no due to cardinality?

We could also consider having an --extensive-metrics option for people running a small number of Thanos Stores.
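
A sketch of what such gated client-side instrumentation could look like, using a plain unary client interceptor and an assumed extensiveMetrics flag; note that the Series call is actually a streaming RPC, so a real implementation would also need a stream interceptor:

```go
package query

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc"
)

// storeRequestDuration is a hypothetical client-side histogram per store backend.
var storeRequestDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "thanos_query_store_grpc_request_duration_seconds",
	Help:    "Client-side duration of gRPC calls to each StoreAPI backend.",
	Buckets: prometheus.DefBuckets,
}, []string{"store", "method"})

// storeClientInterceptor records per-store latency only when extensiveMetrics
// is enabled; otherwise it is a pass-through, keeping cardinality flat.
func storeClientInterceptor(storeAddr string, extensiveMetrics bool) grpc.UnaryClientInterceptor {
	return func(ctx context.Context, method string, req, reply interface{},
		cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		if !extensiveMetrics {
			return invoker(ctx, method, req, reply, cc, opts...)
		}
		start := time.Now()
		err := invoker(ctx, method, req, reply, cc, opts...)
		storeRequestDuration.WithLabelValues(storeAddr, method).Observe(time.Since(start).Seconds())
		return err
	}
}
```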

@bwplotka (Member)

bwplotka commented Apr 15, 2019

> IMO it would be super helpful for the debuggability of Thanos, as you could go and tell that a concrete Store API is misbehaving.
> Should we still consider this, or is it a hard no due to cardinality?
> We could also consider having an --extensive-metrics option for people running a small number of Thanos Stores.

So yes, I think something with external labels when --extensive-metrics is set might be nice. The problem is the nature of the gRPC requests we make. Because we process everything with no buffer on the Querier side, all gRPC client requests are closed ONLY after all of them succeed (including the delayed ones). I experienced this effect with tracing: even though a single component was slow and the rest were fast, tracing indicated equal timings. Let's look closely at this problem when we add such an extension, @povilasv.

The reason for this design is that everything is embedded in iterators within the SeriesSet, in a kind of best-effort manner. I can explain offline.

@bwplotka (Member)

Moving this out of the 0.4.0 plan as it needs more work ):

@bwplotka (Member)

BTW, I totally forgot to mention it, but we added this metric: #1260

Do you think it solves your use cases? (:

@stale

stale bot commented Jan 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 11, 2020
@stale stale bot closed this Jan 18, 2020
Development

Successfully merging this pull request may close these issues.

query: provide up for store discovery
7 participants