-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change metrics naming scheme #776
Conversation
This PR depends on jaegertracing/jaeger-lib#43 Here's how the metric names look like, before and after this change: Metric names for Cassandra on v1.3: https://git.io/vpLiI Metric names for in memory on v1.3: https://git.io/vpLiU One thing to clarify is whether this would break existing setups, specially expVar ones. While the "core" of the change is on the Prometheus integration, the metric names for Cassandra were I'm marking this as a WIP because jaegertracing/jaeger-lib#43 needs to be merged first. |
@simonpasquier, would you be able to review the metric names? Please, see the links in my previous comment. |
Travis is failing because the dependent change is not merged yet:
On my local machine, when I symlink jaeger-lib in
|
serviceNameIndex: casMetrics.NewTable(metricsFactory, "ServiceNameIndex"), | ||
serviceOperationIndex: casMetrics.NewTable(metricsFactory, "ServiceOperationIndex"), | ||
durationIndex: casMetrics.NewTable(metricsFactory, "DurationIndex"), | ||
traces: casMetrics.NewTable(metricsFactory, "traces"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are changing these, should we also switch to tag based metrics? Rather than using the namespace the table name can be just a label.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we leave this decision to a follow-up issue/PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created #778 for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as discussed on the call, I would rather make the change here because this PR is already a breaking change since it renames the metrics, so I would go further and fix them for good by putting table name as a tag instead of appending to the metric name. I commented on #778 how it can be (easily) done.
serviceOperationIndex: casMetrics.NewTable(metricsFactory, "ServiceOperationIndex"), | ||
durationIndex: casMetrics.NewTable(metricsFactory, "DurationIndex"), | ||
traces: casMetrics.NewTable(metricsFactory, "traces"), | ||
tagIndex: casMetrics.NewTable(metricsFactory, "tag_ndex"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/tag_ndex/tag_index
Credit goes to @burmanm :)
There are also these which could use rename and tags:
|
These are the metrics I have that are "by_svc":
Just so that I'm clear, you mean to convert the above into:
Gist updated to include metrics generated on demand when spans are received |
@jpkrohling Can we reduce the number of times 'jaeger' is used in the metric name to 1 :) e.g.
|
@jpkrohling yes, that's what I meant. |
cmd/standalone/main.go
Outdated
@@ -91,7 +91,7 @@ func main() { | |||
} | |||
|
|||
mBldr := new(pMetrics.Builder).InitFromViper(v) | |||
metricsFactory, err := mBldr.CreateMetricsFactory("jaeger-standalone") | |||
metricsFactory, err := mBldr.CreateMetricsFactory("standalone") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if this would be better as jaeger
instead of standalone
- so that the metric name is prefixed by jaeger
. The other jaeger that should ideally be removed is in the part jaeger_spans_by_svc
and jaeger_traces_by_svc
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that was my intention but I pushed without testing first :/
This PR has been updated to include the changes suggested by @mabn, @objectiser and #778 and is ready for review. The related PR for jaeger-lib has also been updated. The Gist has also been updated to show the state of metrics before and after this change when deployed using Cassandra as the backing storage. |
Other possible suggestions:
|
Makes sense, especially to keep the metrics generated by the |
Updated based on the last suggestion from @objectiser. There's still a set of metrics with the name starting as |
I would not recommend wholesale removal of 'jaeger' prefix. The general pattern we used was that all metrics for backend components start with the name of the binary, like
Again, the prefix is the binary name, say BTW, this is why I am not very happy with the change to jaeger-lib, having a separator that is distinct from the words separator is very useful and makes the namespacing obvious, like |
PS metrics emitted by the clients always start with |
I think the latest Gist is already very close to what @yurishkuro is proposing. Here's an overview of the prefixes we have on that Gist. Note that there are four files in that Gist, three with the output from this PR and one with the base from v1.3.
I just checked how the metrics are reported when starting the individual components and there were a few mismatches, fixed in the latest commit. About |
One more thing about the duplication @objectiser mentioned. There are two kinds "received" and "saved". For received right now there are:
The second "jaeger" is the format name. So merge them into just three metrics:
I guess it's a similar story with the corresponding "saved-by-svc" metrics. |
@mabn: I agree about the format, but not sure about the other suggestion. Here's what we would have by applying only the
Your suggestion, as I understand, is to have this:
(perhaps with one more label, While it wouldn't be hard to come up with a Prometheus query to get all @simonpasquier: is there a best practice for cases like this? |
As I will not be able to work on this PR until at least next Monday (Apr 30th) and potentially until May 7th (the week after KubeCon), I just added a couple of tasks to the first comment in this PR, so that we know what else there is to be done here. @mabn, @yurishkuro, @simonpasquier: feel free to edit (if you can) or request a new item to be added to the list. |
Personally I've found it less confusing when an application doesn't expose aggregated metrics that I can get from other metrics with a bit of PromQL. If you only have:
Then getting the equivalent of |
PR updated with the changes suggested by @mabn. I applied the same logic to the "rejected" counter (but couldn't simulate this, to get it showing as metric). The Gist has been updated as well. Here's a short overview:
|
@mabn, @objectiser, @yurishkuro: is there something else you'd like to see in this PR before it gets merged? The only thing left (see issue description) is to get a proper version for |
cmd/collector/app/metrics.go
Outdated
ZipkinFormatType: newCountsBySpanType(serviceMetrics.Namespace(ZipkinFormatType, nil)), | ||
JaegerFormatType: newCountsBySpanType(serviceMetrics.Namespace(JaegerFormatType, nil)), | ||
UnknownFormatType: newCountsBySpanType(serviceMetrics.Namespace(UnknownFormatType, nil)), | ||
ZipkinFormatType: newCountsBySpanType(serviceMetrics.Namespace("", map[string]string{"format": "zipkin"})), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/"zipkin"/ZipkinFormatType?
@@ -33,7 +33,7 @@ type Table struct { | |||
// NewTable takes a metrics scope and creates a table metrics struct | |||
func NewTable(factory metrics.Factory, tableName string) *Table { | |||
t := storageMetrics.WriteMetrics{} | |||
metrics.Init(&t, factory.Namespace(tableName, nil), nil) | |||
metrics.Init(&t, factory.Namespace("", map[string]string{"table": tableName}), nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure about this one, this allows users to aggregate metrics across different c* tables. For attempts and inserts (total c* throughput) it makes sense but allowing users to aggregate latencies across different tables doesn't seem right (workload will be different per table). I guess giving the ability to aggregate doesn't imply people will aggregate and technically people using non-tag based metrics systems could already aggregate in the previous rendition. TLDR; I've convinced myself this is fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TLDR; I've convinced myself this is fine
Glad to hear :D
I think the metrics and tags are fine - but prefer the names without "by_svc" - as this implies the metric is per service, whereas that is just one of the tags. Maybe reuse the previous names (the ones without the service tags, e.g. |
+1. Is anyone against it? |
it's not possible to reuse the same metric name with different set of labels (only different label values) |
It was my understanding that there would only now be one use of that metric name. But now with a service tag. Or are you thinking about queries on the metric server that spans before and after this change? |
I meant that we probably already have |
(moved this comment to the other related issue: #811 (comment)) |
We had, until the commit titled "Change rejected to rejected-by-svc, remove recd metrics". This was removed based on the suggestion by @mabn and part of the task list on the PR description for a couple weeks now :) I just added another commit based on @objectiser's request, but with 'received' instead of 'recd' (as we have also 'rejected') |
If there are no objections, this is now ready to be merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do the gists reflect the final state of the names?
@@ -90,20 +90,21 @@ func main() { | |||
builderOpts := new(builder.CollectorOptions).InitFromViper(v) | |||
|
|||
mBldr := new(pMetrics.Builder).InitFromViper(v) | |||
metricsFactory, err := mBldr.CreateMetricsFactory("jaeger-collector") | |||
baseFactory, err := mBldr.CreateMetricsFactory("jaeger") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we removing the -collector
suffix here? Isn't it better to be able to tell apart which binary emitted the metric? The storage metrics won't be grouped by collector name anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we removing the -collector suffix here?
It's just so that the jaeger
prefix is used for the storage metrics without the component's name. collector
is added later as a namespace to the collector-specific metrics.
Isn't it better to be able to tell apart which binary emitted the metric?
I think the agreement was to have per-component metrics, instead of per-binary, so that metrics generated by the collector via standalone-linux
would be consistent with the metrics generated by collector via collector-linux
:
(before)
jaeger_standalone_jaeger_collector_in_queue_latency_bucket{host="caju",le="0.005"} 0
vs
(after, across all binaries)
jaeger_collector_in_queue_latency_bucket{host="caju",le="0.005"} 0
The storage metrics won't be grouped by collector name anymore.
You mean, jaeger_cassandra_inserts{table="service_names"}
would be used by collector
, query
and standalone
? This would potentially create three different metric names for the same thing: jaeger_standalone_cassandra_inserts
, jaeger_collector_cassandra_inserts
, jaeger_query_cassandra_inserts
. I personally prefer to have one single metrics, but I do see the point of having one per component (not binary). Just not sure how we'd tell the components apart when using the standalone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If knowing the source binary is important, then it could be added as a label instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am really surprised Prometheus community has no guidelines around this. Can't imagine any minimally complex system not running into the same exact issues with multiple namespaces and hierarchies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am really surprised Prometheus community has no guidelines around this. Can't imagine any minimally complex system not running into the same exact issues with multiple namespaces and hierarchies.
The Prometheus scraper can add arbitrary labels, so, this kind of information would arguably be added there, along with some metadata not known to the "microservice", like the Geo, DC, Rack, and so on. Our "binary" name could also be added at scrape time, for cases where it has more value than a consistent set of metrics across all binaries.
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#%3Cstatic_config%3E
How about we merge it with the current state, publish a blog post about this change + how to add arbitrary labels, and ask for community feedback? If we hear that having this as part of the metrics we emit, we add it as a label.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should discuss on Friday. We are accumulating a set of breaking changes for 1.5: metrics, docker commands, and Cassandra schema.
cmd/agent/app/builder.go
Outdated
@@ -117,7 +117,7 @@ func (b *Builder) getMetricsFactory() (metrics.Factory, error) { | |||
if b.metricsFactory != nil { | |||
return b.metricsFactory, nil | |||
} | |||
return b.Metrics.CreateMetricsFactory("jaeger_agent") | |||
return b.Metrics.CreateMetricsFactory("jaeger-agent") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the reasoning here? Maybe this should be b.Metrics.CreateMetricsFactory("jaeger").Namespace("agent",nil)
? That would be more consistent with the way metrics are constructed in the other binaries
May have jumped the gun, Yuri has good points. Will review again post
@yurishkuro any news about this one? |
Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
* master: (38 commits) Preparing release 1.5.0 (jaegertracing#847) Add bounds to memory storage (jaegertracing#845) Add metric for debug traces (jaegertracing#796) Change metrics naming scheme (jaegertracing#776) Bump gocql version (jaegertracing#829) Remove ParentSpanID from domain model (jaegertracing#831) Make gas run quiet (jaegertracing#838) Revert "Make gas run quite" Revert "Install gas from install-ci" Install gas from install-ci Make gas run quite Add 'gas' for security problems scanning (jaegertracing#830) Add ability to adjust static sampling probabilities per operation (jaegertracing#827) Support log-level flag on agent (jaegertracing#828) Remove unused function (jaegertracing#822) Add healthcheck to standalone (jaegertracing#784) Do not use KeyValue fields directly and use KeyValues as decorator only (jaegertracing#810) Add ContaAzul to the adopters list (jaegertracing#806) Add ISSUE_TEMPLATE and PULL_REQUEST_TEMPLATE (jaegertracing#805) Upgrade to go 1.10 (jaegertracing#792) ... # Conflicts: # cmd/agent/app/builder.go # cmd/collector/main.go # cmd/query/main.go # cmd/standalone/main.go
Tasks pending:
jaeger-lib
jaeger_collector_spans_recd
(see Change metrics naming scheme #776 (comment))Signed-off-by: Juraci Paixão Kröhling juraci@kroehling.de