Change metrics naming scheme #776

Merged
merged 1 commit into from
May 24, 2018

Conversation

jpkrohling
Contributor

@jpkrohling jpkrohling commented Apr 19, 2018

Tasks pending:

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>

@jpkrohling
Contributor Author

This PR depends on jaegertracing/jaeger-lib#43

Here's what the metric names look like, before and after this change:

Metric names for Cassandra on v1.3: https://git.io/vpLiI
Metric names for Cassandra on master: https://git.io/vpLiY

Metric names for in memory on v1.3: https://git.io/vpLiU
Metric names for in memory on master: https://git.io/vpLin

One thing to clarify is whether this would break existing setups, especially expVar ones. While the "core" of the change is in the Prometheus integration, the metric names for Cassandra were CamelCase and are now under_score for all metric backends. I think this change is justified, especially because all other metrics already seem to follow the underscore pattern.
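
To make the rename concrete, here is a minimal Go sketch of a CamelCase to under_score conversion. It is only an illustration: the helper name `toSnakeCase` is hypothetical, and the PR may simply hard-code the new table names rather than convert them.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// toSnakeCase is a hypothetical helper (not part of the PR) that converts a
// CamelCase identifier into the under_score form discussed above.
func toSnakeCase(name string) string {
	var b strings.Builder
	runes := []rune(name)
	for i, r := range runes {
		if unicode.IsUpper(r) {
			// Add a separator only when an upper-case rune follows a
			// lower-case rune or a digit, so leading capitals get none.
			if i > 0 && (unicode.IsLower(runes[i-1]) || unicode.IsDigit(runes[i-1])) {
				b.WriteByte('_')
			}
			b.WriteRune(unicode.ToLower(r))
			continue
		}
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	// Table names taken from the Cassandra snippet in this PR.
	for _, n := range []string{"ServiceNameIndex", "ServiceOperationIndex", "DurationIndex", "traces"} {
		fmt.Println(n, "->", toSnakeCase(n))
	}
}
```

Names that are already lower case, such as `traces`, pass through unchanged.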

I'm marking this as a WIP because jaegertracing/jaeger-lib#43 needs to be merged first.

@jpkrohling
Contributor Author

@simonpasquier, would you be able to review the metric names? Please see the links in my previous comment.

@jpkrohling
Contributor Author

Travis is failing because the dependent change is not merged yet:

FAIL	github.com/jaegertracing/jaeger/pkg/metrics	0.023s

On my local machine, when I symlink jaeger-lib in vendor with my local clone, this passes:

ok  	github.com/jaegertracing/jaeger/pkg/metrics	1.019s

serviceNameIndex: casMetrics.NewTable(metricsFactory, "ServiceNameIndex"),
serviceOperationIndex: casMetrics.NewTable(metricsFactory, "ServiceOperationIndex"),
durationIndex: casMetrics.NewTable(metricsFactory, "DurationIndex"),
traces: casMetrics.NewTable(metricsFactory, "traces"),
Member

If we are changing these, should we also switch to tag based metrics? Rather than using the namespace the table name can be just a label.

Contributor Author

How about we leave this decision to a follow-up issue/PR?

Contributor Author

Created #778 for this

Member

as discussed on the call, I would rather make the change here because this PR is already a breaking change since it renames the metrics, so I would go further and fix them for good by putting table name as a tag instead of appending to the metric name. I commented on #778 how it can be (easily) done.

serviceOperationIndex: casMetrics.NewTable(metricsFactory, "ServiceOperationIndex"),
durationIndex: casMetrics.NewTable(metricsFactory, "DurationIndex"),
traces: casMetrics.NewTable(metricsFactory, "traces"),
tagIndex: casMetrics.NewTable(metricsFactory, "tag_ndex"),
Contributor Author

s/tag_ndex/tag_index

Credit goes to @burmanm :)

@mabn

mabn commented Apr 20, 2018

There are also these, which could use a rename and tags:

  • jaeger-collector.spans.saved-by-svc.x
  • jaeger-collector.spans.by-svc.x
  • jaeger-collector.jaeger.traces.by-svc.x

@jpkrohling
Contributor Author

jpkrohling commented Apr 23, 2018

These are the metrics I have that are "by_svc":

jaeger_standalone_jaeger_collector_jaeger_spans_by_svc_customer 10
jaeger_standalone_jaeger_collector_jaeger_traces_by_svc_customer 5
jaeger_standalone_jaeger_collector_spans_saved_by_svc_customer 10
jaeger_standalone_jaeger_collector_traces_saved_by_svc_customer 5

Just so that I'm clear, you mean to convert the above into:

jaeger_standalone_jaeger_collector_jaeger_spans_by_svc{service="customer"} 10
jaeger_standalone_jaeger_collector_jaeger_traces_by_svc{service="customer"} 5
jaeger_standalone_jaeger_collector_spans_saved_by_svc{service="customer"} 10
jaeger_standalone_jaeger_collector_traces_saved_by_svc{service="customer"} 5
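
The conversion above can be sketched in Go as follows. This is a rough illustration only: `splitBySvc` and `formatSample` are hypothetical helpers, not functions from Jaeger or jaeger-lib, and the label rendering uses Go's `%q`, which only approximates the Prometheus escaping rules for unusual characters.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// splitBySvc (hypothetical) splits a flat name ending in "_by_svc_<service>"
// into the base metric name and the service name.
func splitBySvc(flat string) (base, service string, ok bool) {
	const marker = "_by_svc_"
	i := strings.LastIndex(flat, marker)
	if i < 0 {
		return flat, "", false
	}
	return flat[:i+len("_by_svc")], flat[i+len(marker):], true
}

// formatSample (hypothetical) renders name{label="value",...} value with
// labels sorted by key so the output is stable.
func formatSample(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	if len(pairs) == 0 {
		return fmt.Sprintf("%s %g", name, value)
	}
	return fmt.Sprintf("%s{%s} %g", name, strings.Join(pairs, ","), value)
}

func main() {
	flat := "jaeger_standalone_jaeger_collector_jaeger_spans_by_svc_customer"
	if base, svc, ok := splitBySvc(flat); ok {
		fmt.Println(formatSample(base, map[string]string{"service": svc}, 10))
	}
}
```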

Gist updated to include metrics generated on demand when spans are received

@objectiser
Contributor

@jpkrohling Can we reduce the number of times 'jaeger' is used in the metric name to 1 :) e.g.

jaeger_standalone_collector_spans_by_svc{service="customer"} 10

@mabn

mabn commented Apr 23, 2018

@jpkrohling yes, that's what I meant.

@@ -91,7 +91,7 @@ func main() {
}

mBldr := new(pMetrics.Builder).InitFromViper(v)
-metricsFactory, err := mBldr.CreateMetricsFactory("jaeger-standalone")
+metricsFactory, err := mBldr.CreateMetricsFactory("standalone")
Contributor

Wondering if this would be better as jaeger instead of standalone - so that the metric name is prefixed by jaeger. The other jaeger that should ideally be removed is in the part jaeger_spans_by_svc and jaeger_traces_by_svc.

Contributor Author

Yes, that was my intention but I pushed without testing first :/

@jpkrohling
Contributor Author

This PR has been updated to include the changes suggested by @mabn, @objectiser and #778 and is ready for review. The related PR for jaeger-lib has also been updated.

The Gist has also been updated to show the state of metrics before and after this change when deployed using Cassandra as the backing storage.

@objectiser
Contributor

Other possible suggestions:

jaeger_standalone_cassandra... -> jaeger_storage_cassandra...

jaeger_standalone_collector - not sure I understand the reason for 'standalone' - can't it just be jaeger_collector...

jaeger_standalone_jaeger... - these all seem to be client related, so possibly jaeger_client_... rather than having standalone and jaeger twice.

jaeger_standalone_agent_ - just jaeger_agent_

@jpkrohling
Contributor Author

> jaeger_standalone_collector - not sure I understand the reason for 'standalone' - can't it just be jaeger_collector...

Makes sense, especially to keep the metrics generated by the collector binary consistent with the ones from the standalone binary.

@jpkrohling
Contributor Author

Updated based on the last suggestion from @objectiser. There's still a set of metrics with the name starting as jaeger_jaeger_, from the Jaeger client. As the client does not provide a way to override its metrics namespace, jaeger gets duplicated.

@yurishkuro
Member

I would not recommend wholesale removal of 'jaeger' prefix. The general pattern we used was that all metrics for backend components start with the name of the binary, like jaeger_collector. For all-in-one we could actually drop it or make it optional, so that all-in-one will be emitting the same set of metrics as the components running independently (which makes a lot of sense, since it's just "all-in-one" deployment, not an architectural change).

> jaeger_standalone_cassandra... -> jaeger_storage_cassandra...

Again, the prefix is the binary name, say jaeger_collector. We can add storage if desired, it doesn't make a lot of difference, although could be useful if the binary runs multiple storage implementations (we have that internally), and adding storage groups the storage metrics together. We certainly cannot do jaeger_collector_storage_something{storage=cassandra} because different types of storage will have different metrics, so the storage type is better as a namespace.

BTW, this is why I am not very happy with the change to jaeger-lib, having a separator that is distinct from the words separator is very useful and makes the namespacing obvious, like jaeger_collector.storage.cassandra.something (m3db allows dots as separators).

@yurishkuro
Member

PS metrics emitted by the clients always start with jaeger., I would prefer not to mix that namespace with metrics from the binaries.

@jpkrohling
Contributor Author

jpkrohling commented Apr 24, 2018

I think the latest Gist is already very close to what @yurishkuro is proposing. Here's an overview of the prefixes we have on that Gist. Note that there are four files in that Gist, three with the output from this PR and one with the base from v1.3.

  • jaeger_agent_
  • jaeger_cassandra
  • jaeger_client_jaeger_
  • jaeger_collector_

I just checked how the metrics are reported when starting the individual components and there were a few mismatches, fixed in the latest commit.

About jaeger_collector_cassandra vs. jaeger_cassandra, I think it makes more sense to not have collector in the name, as those metrics are collected at the plugin layer, which might be consumed at the same time by more than one component (collector/query). We probably want metrics at that layer to be joined, instead of having, say, jaeger_collector_cassandra_inserts and jaeger_query_cassandra_inserts

@mabn

mabn commented Apr 24, 2018

One more thing about the duplication @objectiser mentioned. There are two kinds of metrics: "received" and "saved". For received, right now there are:

jaeger_collector_jaeger_spans_recd 19
jaeger_collector_jaeger_spans_rejected 0
jaeger_collector_jaeger_spans_by_svc{service="customer"}
jaeger_collector_jaeger_debug_spans_by_svc{service="customer"}
jaeger_collector_jaeger_traces_by_svc{service="query"}

The second "jaeger" is the format name. So merge them into just three metrics:

jaeger_collector_spans_recd{format="jaeger", service="customer", debug=false}
jaeger_collector_spans_rejected{format="jaeger", service="customer", debug=false}
jaeger_collector_traces_recd{ format="jaeger", service="customer"} # maybe debug tag too?

I guess it's a similar story with the corresponding "saved-by-svc" metrics.

@jpkrohling
Contributor Author

@mabn: I agree about the format, but not sure about the other suggestion. Here's what we would have by applying only the format suggestion:

jaeger_collector_spans_recd{format="jaeger"} 36
jaeger_collector_spans_by_svc{format="jaeger",service="customer"} 18
jaeger_collector_spans_by_svc{format="jaeger",service="preference"} 18

Your suggestion, as I understand, is to have this:

jaeger_collector_spans_recd{format="jaeger",service="customer"} 18
jaeger_collector_spans_recd{format="jaeger",service="preference"} 18

(perhaps with one more label, debug).

While it wouldn't be hard to come up with a Prometheus query to get all jaeger_collector_spans_recd{format="jaeger"}, it would probably be more convenient to leave what we have today.

@simonpasquier: is there a best practice for cases like this?

@coveralls

coveralls commented Apr 25, 2018

Coverage Status

Coverage remained the same at 100.0% when pulling a116376 on jpkrohling:732-MetricNames into 968f9fa on jaegertracing:master.

@jpkrohling
Contributor Author

As I will not be able to work on this PR until at least next Monday (Apr 30th) and potentially until May 7th (the week after KubeCon), I just added a couple of tasks to the first comment in this PR, so that we know what else there is to be done here.

@mabn, @yurishkuro, @simonpasquier: feel free to edit (if you can) or request a new item to be added to the list.

@simonpasquier

Personally I've found it less confusing when an application doesn't expose aggregated metrics that I can get from other metrics with a bit of PromQL.

If you only have:

jaeger_collector_spans_by_svc{format="jaeger",service="customer"} 18
jaeger_collector_spans_by_svc{format="jaeger",service="preference"} 18

Then getting the equivalent of jaeger_collector_spans_recd{format="jaeger"} is just sum(jaeger_collector_spans_by_svc{format="jaeger"}).
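
For readers less familiar with PromQL, the aggregation described above behaves roughly like the Go sketch below. The `sample` type and `sumMatching` function are hypothetical and only model the idea of summing every series that matches a label selector; they are not how Prometheus evaluates queries internally.

```go
package main

import "fmt"

// sample is a hypothetical in-memory representation of one labeled series.
type sample struct {
	name   string
	labels map[string]string
	value  float64
}

// sumMatching adds up the values of all samples with the given name whose
// labels contain every key/value pair in match.
func sumMatching(samples []sample, name string, match map[string]string) float64 {
	total := 0.0
	for _, s := range samples {
		if s.name != name {
			continue
		}
		matched := true
		for k, v := range match {
			if s.labels[k] != v {
				matched = false
				break
			}
		}
		if matched {
			total += s.value
		}
	}
	return total
}

func main() {
	// Values taken from the two by_svc series quoted in the comment.
	samples := []sample{
		{"jaeger_collector_spans_by_svc", map[string]string{"format": "jaeger", "service": "customer"}, 18},
		{"jaeger_collector_spans_by_svc", map[string]string{"format": "jaeger", "service": "preference"}, 18},
	}
	fmt.Println(sumMatching(samples, "jaeger_collector_spans_by_svc", map[string]string{"format": "jaeger"}))
}
```

With the two series above, the sum is 36, which matches the aggregated `jaeger_collector_spans_recd{format="jaeger"}` value quoted earlier in the thread.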

@jpkrohling
Contributor Author

PR updated with the changes suggested by @mabn. I applied the same logic to the "rejected" counter (but couldn't simulate a rejection, so it doesn't show up as a metric yet).

The Gist has been updated as well. Here's a short overview:

jaeger_collector_spans_by_svc{format="jaeger",service="customer"} 6
jaeger_collector_spans_by_svc{format="jaeger",service="preference"} 6
jaeger_collector_spans_saved_by_svc{service="customer"} 6
jaeger_collector_spans_saved_by_svc{service="preference"} 6
jaeger_collector_traces_by_svc{format="jaeger",service="customer"} 3
jaeger_collector_traces_saved_by_svc{service="customer"} 3

@jpkrohling
Contributor Author

@mabn, @objectiser, @yurishkuro: is there something else you'd like to see in this PR before it gets merged? The only thing left (see issue description) is to get a proper version for jaeger-lib.

-ZipkinFormatType: newCountsBySpanType(serviceMetrics.Namespace(ZipkinFormatType, nil)),
-JaegerFormatType: newCountsBySpanType(serviceMetrics.Namespace(JaegerFormatType, nil)),
-UnknownFormatType: newCountsBySpanType(serviceMetrics.Namespace(UnknownFormatType, nil)),
+ZipkinFormatType: newCountsBySpanType(serviceMetrics.Namespace("", map[string]string{"format": "zipkin"})),
Contributor

s/"zipkin"/ZipkinFormatType?

@@ -33,7 +33,7 @@ type Table struct {
// NewTable takes a metrics scope and creates a table metrics struct
func NewTable(factory metrics.Factory, tableName string) *Table {
t := storageMetrics.WriteMetrics{}
-metrics.Init(&t, factory.Namespace(tableName, nil), nil)
+metrics.Init(&t, factory.Namespace("", map[string]string{"table": tableName}), nil)
Contributor

not sure about this one, this allows users to aggregate metrics across different c* tables. For attempts and inserts (total c* throughput) it makes sense but allowing users to aggregate latencies across different tables doesn't seem right (workload will be different per table). I guess giving the ability to aggregate doesn't imply people will aggregate and technically people using non-tag based metrics systems could already aggregate in the previous rendition. TLDR; I've convinced myself this is fine

Contributor Author

> TLDR; I've convinced myself this is fine

Glad to hear :D
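
The difference between the two `Namespace` calls in this thread (table name as a name segment versus table name as a tag) can be illustrated with a toy model. The `scope` type below is a hypothetical stand-in written for this example; it mirrors only the `Namespace(name, tags)` call shape visible in the diff and is not jaeger-lib's implementation.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// scope is a toy stand-in for a metrics factory: a name prefix plus tags.
type scope struct {
	prefix string
	tags   map[string]string
}

// Namespace returns a child scope. A non-empty name is appended to the
// prefix with "_"; tags are merged on top of the parent's tags.
func (s scope) Namespace(name string, tags map[string]string) scope {
	merged := make(map[string]string, len(s.tags)+len(tags))
	for k, v := range s.tags {
		merged[k] = v
	}
	for k, v := range tags {
		merged[k] = v
	}
	prefix := s.prefix
	if name != "" {
		if prefix != "" {
			prefix += "_"
		}
		prefix += name
	}
	return scope{prefix: prefix, tags: merged}
}

// Counter renders the series identity a counter would have in this scope.
func (s scope) Counter(name string) string {
	full := name
	if s.prefix != "" {
		full = s.prefix + "_" + name
	}
	if len(s.tags) == 0 {
		return full
	}
	keys := make([]string, 0, len(s.tags))
	for k := range s.tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, s.tags[k]))
	}
	return full + "{" + strings.Join(pairs, ",") + "}"
}

func main() {
	root := scope{prefix: "jaeger_cassandra"}
	// Before: the table name becomes part of the metric name.
	fmt.Println(root.Namespace("service_names", nil).Counter("inserts"))
	// After: the table name becomes a tag on a shared metric name.
	fmt.Println(root.Namespace("", map[string]string{"table": "service_names"}).Counter("inserts"))
}
```

In the tag-based form every table shares one metric name, which is what makes cross-table aggregation possible, including the latency aggregation the reviewer was initially wary of.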

black-adder
black-adder previously approved these changes May 8, 2018
@objectiser
Contributor

I think the metrics and tags are fine - but prefer the names without "by_svc" - as this implies the metric is per service, whereas that is just one of the tags. Maybe reuse the previous names (the ones without the service tags, e.g. jaeger_collector_spans_recd), but now with the tags?

@jpkrohling
Contributor Author

> Maybe reuse the previous names (the ones without the service tags, e.g. jaeger_collector_spans_recd), but now with the tags?

+1. Is anyone against it?

@yurishkuro
Member

it's not possible to reuse the same metric name with a different set of labels (only different label values)

@objectiser
Contributor

It was my understanding that there would only now be one use of that metric name. But now with a service tag.

Or are you thinking about queries on the metric server that span before and after this change?

@yurishkuro
Member

I meant that we probably already have jaeger_collector_spans_recd as a cumulative metric, which has no labels.
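
The constraint raised here (one metric name, one fixed set of label names) can be modeled with a small Go sketch. The `registry` type is hypothetical and written only for illustration; it is not the Prometheus client library's code, just a toy version of the rule described in the comments above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// registry is a toy model that remembers which label names each metric
// name was first registered with.
type registry struct {
	labelNames map[string]string // metric name -> sorted, comma-joined label names
}

func newRegistry() *registry {
	return &registry{labelNames: make(map[string]string)}
}

// register accepts a metric name with its label names and rejects a later
// registration of the same name with a different set of label names.
// Label order does not matter; label values are not involved at all.
func (r *registry) register(name string, labelNames ...string) error {
	sorted := append([]string(nil), labelNames...)
	sort.Strings(sorted)
	key := strings.Join(sorted, ",")
	if prev, ok := r.labelNames[name]; ok && prev != key {
		return fmt.Errorf("metric %q already registered with labels [%s], got [%s]", name, prev, key)
	}
	r.labelNames[name] = key
	return nil
}

func main() {
	r := newRegistry()
	// The cumulative metric with no labels is registered first.
	fmt.Println(r.register("jaeger_collector_spans_recd"))
	// Reusing the name with labels conflicts with the unlabeled registration.
	fmt.Println(r.register("jaeger_collector_spans_recd", "format", "service"))
}
```

This is why reusing `jaeger_collector_spans_recd` with tags only works if the unlabeled cumulative metric of the same name no longer exists.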

@benley
Contributor

benley commented May 8, 2018

(moved this comment to the other related issue: #811 (comment))

@jpkrohling
Contributor Author

> I meant that we probably already have jaeger_collector_spans_recd as a cumulative metric, which has no labels.

We had, until the commit titled "Change rejected to rejected-by-svc, remove recd metrics". This was removed based on the suggestion by @mabn and part of the task list on the PR description for a couple weeks now :)

I just added another commit based on @objectiser's request, but with 'received' instead of 'recd' (as we have also 'rejected')

@jpkrohling jpkrohling changed the title WIP - Change metrics naming scheme Change metrics naming scheme May 11, 2018
@jpkrohling
Contributor Author

jpkrohling commented May 11, 2018

If there are no objections, this is now ready to be merged. The only thing missing is to change the dependency to use a proper version for jaeger-lib, but it can be done in a follow-up commit. (jaeger-lib has been released, this PR has been updated)

Member

@yurishkuro yurishkuro left a comment

do the gists reflect the final state of the names?

@@ -90,20 +90,21 @@ func main() {
builderOpts := new(builder.CollectorOptions).InitFromViper(v)

mBldr := new(pMetrics.Builder).InitFromViper(v)
-metricsFactory, err := mBldr.CreateMetricsFactory("jaeger-collector")
+baseFactory, err := mBldr.CreateMetricsFactory("jaeger")
Member

why are we removing the -collector suffix here? Isn't it better to be able to tell apart which binary emitted the metric? The storage metrics won't be grouped by collector name anymore.

Contributor Author

> why are we removing the -collector suffix here?

It's just so that the jaeger prefix is used for the storage metrics without the component's name. collector is added later as a namespace to the collector-specific metrics.

> Isn't it better to be able to tell apart which binary emitted the metric?

I think the agreement was to have per-component metrics, instead of per-binary, so that metrics generated by the collector via standalone-linux would be consistent with the metrics generated by collector via collector-linux:

(before)
jaeger_standalone_jaeger_collector_in_queue_latency_bucket{host="caju",le="0.005"} 0

vs

(after, across all binaries)
jaeger_collector_in_queue_latency_bucket{host="caju",le="0.005"} 0

> The storage metrics won't be grouped by collector name anymore.

You mean, jaeger_cassandra_inserts{table="service_names"} would be used by collector, query and standalone? This would potentially create three different metric names for the same thing: jaeger_standalone_cassandra_inserts, jaeger_collector_cassandra_inserts, jaeger_query_cassandra_inserts. I personally prefer to have one single metric, but I do see the point of having one per component (not binary). Just not sure how we'd tell the components apart when using the standalone.

Contributor

If knowing the source binary is important, then it could be added as a label instead?

Member

I am really surprised Prometheus community has no guidelines around this. Can't imagine any minimally complex system not running into the same exact issues with multiple namespaces and hierarchies.

Contributor Author

> I am really surprised Prometheus community has no guidelines around this. Can't imagine any minimally complex system not running into the same exact issues with multiple namespaces and hierarchies.

The Prometheus scraper can add arbitrary labels, so, this kind of information would arguably be added there, along with some metadata not known to the "microservice", like the Geo, DC, Rack, and so on. Our "binary" name could also be added at scrape time, for cases where it has more value than a consistent set of metrics across all binaries.

https://prometheus.io/docs/prometheus/latest/configuration/configuration/#%3Cstatic_config%3E

How about we merge it with the current state, publish a blog post about this change + how to add arbitrary labels, and ask for community feedback? If we hear that having this as part of the metrics we emit is valuable, we add it as a label.

Member

Maybe we should discuss on Friday. We are accumulating a set of breaking changes for 1.5: metrics, docker commands, and Cassandra schema.

@@ -117,7 +117,7 @@ func (b *Builder) getMetricsFactory() (metrics.Factory, error) {
if b.metricsFactory != nil {
return b.metricsFactory, nil
}
-return b.Metrics.CreateMetricsFactory("jaeger_agent")
+return b.Metrics.CreateMetricsFactory("jaeger-agent")
Member

what's the reasoning here? Maybe this should be b.Metrics.CreateMetricsFactory("jaeger").Namespace("agent",nil)? That would be more consistent with the way metrics are constructed in the other binaries

@black-adder black-adder dismissed their stale review May 11, 2018 18:06

May have jumped the gun, Yuri has good points. Will review again post

@jpkrohling
Contributor Author

@yurishkuro any news about this one?

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
@yurishkuro yurishkuro merged commit 70dd38f into jaegertracing:master May 24, 2018
@ghost ghost removed the review label May 24, 2018
mabn pushed a commit to mabn/jaeger that referenced this pull request May 28, 2018
* master: (38 commits)
  Preparing release 1.5.0 (jaegertracing#847)
  Add bounds to memory storage (jaegertracing#845)
  Add metric for debug traces (jaegertracing#796)
  Change metrics naming scheme (jaegertracing#776)
  Bump gocql version (jaegertracing#829)
  Remove ParentSpanID from domain model (jaegertracing#831)
  Make gas run quiet (jaegertracing#838)
  Revert "Make gas run quite"
  Revert "Install gas from install-ci"
  Install gas from install-ci
  Make gas run quite
  Add 'gas' for security problems scanning (jaegertracing#830)
  Add ability to adjust static sampling probabilities per operation (jaegertracing#827)
  Support log-level flag on agent (jaegertracing#828)
  Remove unused function (jaegertracing#822)
  Add healthcheck to standalone (jaegertracing#784)
  Do not use KeyValue fields directly and use KeyValues as decorator only (jaegertracing#810)
  Add ContaAzul to the adopters list (jaegertracing#806)
  Add ISSUE_TEMPLATE and PULL_REQUEST_TEMPLATE (jaegertracing#805)
  Upgrade to  go 1.10 (jaegertracing#792)
  ...

# Conflicts:
#	cmd/agent/app/builder.go
#	cmd/collector/main.go
#	cmd/query/main.go
#	cmd/standalone/main.go
@jpkrohling jpkrohling deleted the 732-MetricNames branch July 28, 2021 19:22