
Query: add thanos_store_up metric to StoreSet #900

Closed

Conversation

mreichardt95 (Contributor)

Fixes #880

Changes

  • added thanos_up{node="{NODE_IP:PORT}"} 0/1 metric to StoreSet

Verification

Tested with:

  • static store
  • DNS A discovery
  • file discovery

@adrien-f (Member)

adrien-f commented Mar 9, 2019

Hey there 👋 !

Love the change! Would it make sense to have that logic in the updateStoreStatus function? I'm handling stores disappearing in it in a WIP PR. Also, I'm not sure about the metric name; maybe thanos_store_up?

@mreichardt95 (Contributor, Author)

Hey 👋
Thanks for your input. It sure does look a lot cleaner this way; I totally overlooked that function.
The initial intention was to have an additional type label, but I realised it's all of type store and forgot to change the name 🤦‍♂️.

@mreichardt95 mreichardt95 changed the title Query: add thanos_up metric to StoreSet Query: add thanos_store_up metric to StoreSet Mar 9, 2019
@adrien-f (Member)

adrien-f commented Mar 9, 2019

By the way, I'm working on a similar subject with the Stores page in the Query UI.

What should happen when a store leaves the DNS/file config? Right now it stays on the UI page forever; I'm working on cleaning that up. I guess it should be the same for the metrics? 🤔

Any thoughts, @bwplotka?

@mreichardt95 (Contributor, Author)

Happy to hear that!
Yes, right now the metric stays at 0.0 until Query gets restarted. It makes more sense to properly deregister it as soon as it is no longer in file SD or DNS.
So I guess this depends on your PR and will have to be adjusted afterwards. Should I close this one for now?

@adrien-f (Member)

Maybe let's wait for #910 then? 👍

@mreichardt95 (Contributor, Author)

Yep, I already patched it into your PR to test, and it's pretty simple to deregister it with your logic.

pkg/query/storeset.go (outdated review thread)
…o add_store_status_metric

# Conflicts:
#	pkg/query/storeset.go
@GiedriusS (Member) left a comment


LGTM

@bwplotka (Member) left a comment


Guys, I love the idea, but I have to reject this. This PR, in its current form, introduces a cardinality bomb 💣 💣 💣 ^^

All in the comments.

```go
storeNodeStatus := prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "thanos_store_up",
	Help: "Represents the status of each store node.",
}, []string{"node"})
```
@bwplotka (Member) commented Apr 13, 2019


  1. A bit of an inconsistency: you use node here. I don't like this; node feels like something host-related, but we are talking about apps and services, and everywhere else the store is passed as addr.
  2. addr is an IP:port and it potentially CHANGES on every rollout, independently of pod_name (for k8s) etc. On top of this it is a unique unicode string, so it is non-deduplicatable in TSDB... this might be a nasty cardinality bomb on top of the regular pod name, instance, etc. labels.

We need something smarter.

Member

To 1.: I know the metric above is called node as well, but that's.. legacy ;p

```diff
@@ -349,6 +357,7 @@ func (s *StoreSet) updateStoreStatus(store *storeRef, err error) {
 	s.storesStatusesMtx.Lock()
 	defer s.storesStatusesMtx.Unlock()
 
+	s.storeNodeStatus.WithLabelValues(store.addr).Set(0)
```
Member

Sorry, but I think we are making a mistake here. Why are we flipping the gauge on every update? If some store API is up 100% of the time, Prometheus can still see it as down, because it scrapes concurrently with this code.

So if a scrape happens while we are on line 367, we expose incorrect information. Please fix it by setting the gauge value here based on the error every time. Also, I'm not sure this is the right place; the method name suggests we are updating the status (for the UI), NOT the metric. This might be quite confusing for a reader.

I believe setting this metric near the health check is the best and most intuitive solution. What do you think? (:
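
A minimal sketch of this idea, assuming a hypothetical storeUp gauge and a generic probe callback (neither is the PR's actual code): the metric is written exactly once per health check, from the check's result, so a concurrent scrape never catches an intermediate value.

```go
package query

import (
	"context"

	"github.com/prometheus/client_golang/prometheus"
)

// storeUp is a hypothetical per-store health gauge; the name and label are
// illustrative only.
var storeUp = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "thanos_store_up",
	Help: "1 if the StoreAPI answered the last health check, 0 otherwise.",
}, []string{"addr"})

// checkStore writes the gauge once per health check, derived from the
// probe's error, instead of toggling it inside updateStoreStatus. A scrape
// running concurrently therefore sees either the previous or the new value,
// never a transient flip to 0.
func checkStore(ctx context.Context, addr string, probe func(context.Context) error) error {
	err := probe(ctx)
	up := 1.0
	if err != nil {
		up = 0.0
	}
	storeUp.WithLabelValues(addr).Set(up)
	return err
}
```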

```diff
@@ -421,6 +432,7 @@ func (s *StoreSet) cleanUpStoreStatuses() {
 	for addr, status := range s.storeStatuses {
 		if _, ok := s.stores[addr]; !ok {
 			if now.Sub(status.LastCheck) >= s.unhealthyStoreTimeout {
+				s.storeNodeStatus.DeleteLabelValues(addr)
```
Member

Not sure; how does Prometheus do it? AFAIK it keeps the series forever. Again, let's separate the UI from the metric system.

Contributor

Prometheus's metrics for the Alertmanager are the closest thing to this; it looks like we aren't doing any cleanup there currently, which should be fixed.

Member

TODO: Add issue on AM

```diff
@@ -85,6 +85,7 @@ type StoreSet struct {
 	storesStatusesMtx    sync.RWMutex
 	stores               map[string]*storeRef
 	storeNodeConnections prometheus.Gauge
+	storeNodeStatus      *prometheus.GaugeVec
```
Member

Can we change the name?

Node again sounds host-related to me.

Proposition:

Suggested change:

```diff
-	storeNodeStatus *prometheus.GaugeVec
+	storeUpnessMetric *prometheus.GaugeVec
```

@bwplotka (Member)

The major problem with using addr as the store up metric's label relates to @adrien-f's PR with UI statuses and to the question of HOW to identify StoreAPIs. Currently we do that by IP:port, and that's wrong, as the same "StoreAPI" can have many addresses during its lifetime.

I think the solution here, instead of addr, is essentially a component label (the type of StoreAPI). We already have component types, and ALL components should properly propagate their type in the Info call. If Info is down, and thus the StoreAPI is down, we just don't do anything (the upness of the type this StoreAPI used to be will decrease). This means the upness metric is a gauge of all working StoreAPIs per type, and we could implement it by counting all healthy stores per type.

BTW, this is the code for Prometheus's up: https://github.com/prometheus/prometheus/blob/12708acd1516e0d1756006a54684ccaa76f1aaf2/scrape/scrape.go#L1178
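
A rough sketch of that counting approach; the storesUp gauge, the healthyByComponent snapshot and updateUpMetric are illustrative names and assumptions, not code from the PR:

```go
package query

import "github.com/prometheus/client_golang/prometheus"

// storesUp is a hypothetical gauge: number of healthy StoreAPIs per component type.
var storesUp = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "thanos_store_up",
	Help: "Number of healthy StoreAPI nodes per component type.",
}, []string{"component"})

// updateUpMetric recomputes the gauge from the healthy set gathered during
// the last health-check pass. healthyByComponent maps a component type
// (e.g. "sidecar", "store") to the addresses whose Info call succeeded.
func updateUpMetric(healthyByComponent map[string][]string) {
	storesUp.Reset() // drop component types that disappeared entirely
	for component, addrs := range healthyByComponent {
		storesUp.WithLabelValues(component).Set(float64(len(addrs)))
	}
}
```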

@bwplotka (Member)

Your alert would then look like absent(thanos_store_up{component="sidecar"}) or thanos_store_up{component="sidecar"} < [number you expect].

@bwplotka (Member)

bwplotka commented Apr 13, 2019

Also.. why can't we use thanos_store_node_info for the same logic? It is missing the component info, which we could add, but essentially this is our up, and even better, it doesn't kill cardinality (though it does have a long label value that concatenates all labels).

Thanks for your work @mreichardt95, but I think improving thanos_store_node_info might be the way to go?
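
For reference, a hypothetical shape of thanos_store_node_info with a component label added; the existing metric's actual label set may differ, so treat this purely as a sketch:

```go
package query

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical shape of an info-style metric extended with a component label;
// the real thanos_store_node_info may carry different labels.
var storeNodeInfo = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "thanos_store_node_info",
	Help: "Info-style metric (always 1) for each known StoreAPI node.",
}, []string{"external_labels", "component"})

// recordStoreInfo marks a store as known, keyed by its flattened external
// labels and the component type reported via the Info call.
func recordStoreInfo(externalLabels, component string) {
	storeNodeInfo.WithLabelValues(externalLabels, component).Set(1)
}
```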

@bwplotka (Member)

We had some offline discussion.

We could potentially, in this PR:

  • change thanos_store_node_info to thanos_store_up. However, this might be confusing as we don't have any notion of what should be up. E.g. if some discovery gives IP abc:port and it is not a valid gRPC StoreAPI, we would report up 0... but for what labels, if we have never seen what labels are propagated by this service?

I think some opinion from @juliusv and @bbrazil would be helpful, as this is the classic discovery + upness + scrape problem. However, we do not scrape here; we just discover, query and maintain the list of valid StoreAPIs.

  • leave those external labels as the only label for thanos_store_up
    • alternatively: append the labels directly into the metric with an ext_lname_... prefix for each label name, so thanos_store_node_info{external_labels="cluster='a';env='b'"} would become thanos_store_up{ext_lname_cluster="a", ext_lname_env="b"}, which would potentially help in querying later on (see the sketch after this list).

The external-labels label will give a false impression for store gateway (no labels), but we want to add a custom label to Store GW at some point to allow better filtering.

  • remove the node mention for this metric in variable names. Also move the handling closer to the health-check code.
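
Since client_golang vectors need their label names fixed at construction time, dynamic ext_lname_* labels would most likely need a custom collector emitting const metrics. A purely illustrative sketch of that shape (the snapshot callback and types are assumptions, and external label names are assumed to already be valid Prometheus label names); it would be registered with prometheus.MustRegister:

```go
package query

import "github.com/prometheus/client_golang/prometheus"

// storeSnapshot is a hypothetical view of one store: its external labels and
// whether its last health check succeeded.
type storeSnapshot struct {
	extLabels map[string]string
	up        bool
}

// storeUpCollector emits one thanos_store_up series per store, with the
// store's external labels flattened into ext_lname_-prefixed label names.
type storeUpCollector struct {
	snapshot func() []storeSnapshot // provided by the StoreSet in this sketch
}

// Describe is intentionally empty, making this an "unchecked" collector,
// since the label names vary from store to store.
func (c *storeUpCollector) Describe(ch chan<- *prometheus.Desc) {}

func (c *storeUpCollector) Collect(ch chan<- prometheus.Metric) {
	for _, s := range c.snapshot() {
		names := make([]string, 0, len(s.extLabels))
		values := make([]string, 0, len(s.extLabels))
		for k, v := range s.extLabels {
			names = append(names, "ext_lname_"+k)
			values = append(values, v)
		}
		desc := prometheus.NewDesc(
			"thanos_store_up",
			"Whether the StoreAPI with these external labels is healthy.",
			names, nil,
		)
		up := 0.0
		if s.up {
			up = 1.0
		}
		ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, up, values...)
	}
}
```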

@brian-brazil (Contributor)

> If we have never seen what labels are propagated by this service?

Metric labels should not depend on the other side providing metadata; it's all about what this particular server knows.

In this case I'd stick to metrics that aren't per endpoint, such as the total number of stores and how many of those are up. If you need more information you can look at the UI.
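
In that spirit, a minimal sketch with two plain gauges and no per-endpoint labels at all; metric and function names here are illustrative, not the ones eventually used by Thanos:

```go
package query

import "github.com/prometheus/client_golang/prometheus"

var (
	// storesDiscovered: how many StoreAPI addresses discovery currently yields.
	storesDiscovered = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "thanos_query_stores_discovered",
		Help: "Number of StoreAPI addresses currently known from discovery.",
	})
	// storesHealthy: how many of those passed the last health check.
	storesHealthy = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "thanos_query_stores_healthy",
		Help: "Number of discovered StoreAPIs that passed the last health check.",
	})
)

// recordStoreCounts is called once after each health-check pass; there is no
// per-endpoint label, so cardinality stays constant regardless of store count.
func recordStoreCounts(discovered, healthy int) {
	storesDiscovered.Set(float64(discovered))
	storesHealthy.Set(float64(healthy))
}
```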

@bwplotka (Member)

Thanks @brian-brazil, that makes sense. This raises the question of whether we need to expose any component type / external labels in this metric at all.

A follow-up question: it's not easy to tell which component introduces latency, so e.g. the client gRPC histogram could also include those external labels, or at least the component type.

@brian-brazil (Contributor)

Component sounds okay to me cardinality wise. For backends you're probably looking at logs or tracing, as I can easily imagine there being hundreds to thousands of stores.

@povilasv (Member)

povilasv commented Apr 15, 2019

I also wanted to add gRPC request counts & duration buckets in Thanos Querier per Store API, i.e. gRPC client-side metrics for each Thanos Store backend.

IMO it would be super helpful for the debuggability of Thanos, as you could go and tell that a concrete Store API is misbehaving.

Should we still consider this, or is it a hard no due to cardinality?

We could also consider having an --extensive-metrics option for people running a small number of Thanos Stores.
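
A sketch of what such gated client-side instrumentation could look like, using a plain unary client interceptor and an assumed extensiveMetrics flag; note that the Series call is actually a streaming RPC, so a real implementation would also need a stream interceptor:

```go
package query

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc"
)

// storeRequestDuration is a hypothetical client-side histogram per store backend.
var storeRequestDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "thanos_query_store_grpc_request_duration_seconds",
	Help:    "Client-side duration of gRPC calls to each StoreAPI backend.",
	Buckets: prometheus.DefBuckets,
}, []string{"store", "method"})

// storeClientInterceptor records per-store latency only when extensiveMetrics
// is enabled; otherwise it is a pass-through, keeping cardinality flat.
func storeClientInterceptor(storeAddr string, extensiveMetrics bool) grpc.UnaryClientInterceptor {
	return func(ctx context.Context, method string, req, reply interface{},
		cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		if !extensiveMetrics {
			return invoker(ctx, method, req, reply, cc, opts...)
		}
		start := time.Now()
		err := invoker(ctx, method, req, reply, cc, opts...)
		storeRequestDuration.WithLabelValues(storeAddr, method).Observe(time.Since(start).Seconds())
		return err
	}
}
```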

@bwplotka (Member)

bwplotka commented Apr 15, 2019

> IMO it would be super helpful for the debuggability of Thanos, as you could go and tell that a concrete Store API is misbehaving.
> Should we still consider this, or is it a hard no due to cardinality?
> We could also consider having an --extensive-metrics option for people running a small number of Thanos Stores.

So yes, I think something with external labels when --extensive-metrics is set might be nice. The problem is the nature of the gRPC requests we make. Because we process everything with no buffer on the Querier side, all gRPC client requests are closed ONLY after all of them succeed (including the delayed ones). I experienced this effect with tracing: even though a single component was slow and the rest were fast, tracing indicated equal timings. Let's look closely at this problem when we add such an extension, @povilasv.

The reason for this design is that everything is embedded in iterators within the SeriesSet, in a kind of best-effort manner. I can explain offline.

@bwplotka (Member)

Moving this out of the 0.4.0 plan as it needs more work ):

@bwplotka (Member)

BTW, I totally forgot to mention it, but we added this metric: #1260

Do you think it solves your use cases? (:

@stale

stale bot commented Jan 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 11, 2020
@stale stale bot closed this Jan 18, 2020
Development

Successfully merging this pull request may close these issues.

query: provide up for store discovery
7 participants