
Prometheus metrics are redundant and slow #4644

Open
renbou opened this issue Mar 18, 2022 · 42 comments
Labels
bug 🐞 Something isn't working discussion 💬 The right solution needs to be found help wanted 🆘 Extra attention is needed optimization 📉 Performance or cost improvements

@renbou
Collaborator

renbou commented Mar 18, 2022

While digging into Caddy's source code I've noticed that every (!) route handler is wrapped in a metricsInstrumentedHandler which updates Prometheus metrics during request execution. While it is a great feature and should definitely be enabled by default in Caddy, it currently uses up way too much CPU time, and the metrics it provides are quite redundant.

Since Caddy tries every route handler in order until the request is answered, metric instrumentation is called for each handler, even if it didn't actually take part in resolving the request (which is also quite difficult to define in this context). As a result, handler-specific metrics are constantly updated with unrelated data, so pretty much all of them are meaningless and only usable for tracking server-wide stats.

As an example, here are the metrics for a simple Caddy server with 2 hosts, one of which has only a reverse_proxy handler, while the other has 2 respond handlers, 1 reverse_proxy handler and 1 file_server handler. The metrics were taken after running an HTTP load-testing tool on the endpoints.
prometheus-metrics.txt
As seen in the example, all of the handler-specific metrics are pretty much the same for all handlers, even though in reality only the respond handler was requested.

The handler metrics are even less useful if the web server hosts multiple domains, since requests from all domains get mixed up in the metrics.
However, questionable metrics wouldn't be much of an issue if they were as cheap as server-wide metrics, but they of course aren't, since they get updated multiple times before the request is finally answered.

I ran pprof while generating load with h2load against one of the simple respond handlers, and it turned out that 73% of the time spent in caddyhttp.Server.primaryHandlerChain.ServeHTTP was in the metricsInstrumentedHandler (only 30% of the time was spent in the actual caddyhttp.StaticResponse.ServeHTTP). Here's the profile:
profile.zip
[screenshot of the pprof profile]

I really think that metrics such as these should be server-wide, where they make more sense and are cheaper to collect. https://github.com/nginxinc/nginx-prometheus-exporter could be seen as an example of similar Prometheus metrics commonly used for nginx.

@renbou
Collaborator Author

renbou commented Mar 18, 2022

I've looked through the other issues regarding Prometheus metrics; if host-based metrics are some day implemented, they shouldn't affect performance that much. As for adding support for customizable metrics, that seems like it would be an even bigger problem performance-wise.

@mholt mholt added this to the 2.x milestone Mar 18, 2022
@mholt mholt added help wanted 🆘 Extra attention is needed optimization 📉 Performance or cost improvements discussion 💬 The right solution needs to be found labels Mar 18, 2022
@hairyhenderson
Collaborator

Thanks for filing this @renbou - I'll try to find some time this coming week to dig into it. My gut feeling is that some of the issue is due to the way middleware is handled and nested by Caddy, so it might be a bit more tricky to fully fix.

On the surface this is alarming:

[...] 73% of the time spent in caddyhttp.Server.primaryHandlerChain.ServeHTTP was in the metricsInstrumentedHandler (only 30% of the time was spent in the actual caddyhttp.StaticResponse.ServeHTTP).

However percentages can sometimes be misleading, so it would be good to get some actual numbers... I haven't looked at the profile yet (and I won't this weekend - recovering from jet lag).

Caddy's handling of static responses in particular is very fast, and so it's not actually that surprising that the instrumentation surrounding it would account for more CPU time. As with all performance tuning, the most useful question isn't "how fast can we make it go?" but instead "how fast does it need to be?" In SRE terms that might translate to "we have a latency SLO of 100ms 99% of the time, but we can't reach that reliably if the metrics instrumentation takes 80ms"

@renbou
Collaborator Author

renbou commented Mar 25, 2022

I think I've figured out how this and the other Prometheus issues/requests could be solved. As discussed previously in #4016, a new metrics component would need to be added to the server structure of the http app, where the wanted labels could be described somewhat like this in the Caddyfile:

metrics {
	label host {http.request.host}
	label content_type {http.response.header.Content-Type}
	label static_label "same value for everyone!"
}

The problem which surfaced previously is that metrics would supposedly have to be lost on reloads. To solve this, we can globally store the currently existing prometheus.Collectors without labels (counters, GoCollector, etc.) and prometheus.ObserverVecs (histograms), which are the entities performing the actual metric collection.

When provisioning a server, we'll need to set up a new prometheus.Registry with the required metrics, which are:

  1. Global metrics which have not been removed
  2. New metrics

We also need to add the new metrics to the global ones (for subsequent reloads). This way we aren't losing any metrics which haven't been changed, and the changed metrics will get lost if we change labels anyway, since there isn't a clear way to migrate them.

When removing the old Metrics component during a reload, all we need to do is remove the "deprecated" metrics from the global store. To do so we can store a view on the metrics which were added to the current server's prometheus.Registry, which would then allow us to find the deprecated metrics by "xoring" the old server's metrics and the new global metrics.

To solve another set of issues, including performance and needing information from the matchers (for example the host matcher), we can simply leverage the already-present replacer. Since, as mentioned in the beginning, labels should be configurable with placeholders, we could simply add the necessary values to the request replacer at any point during the request. For example, a host matcher could set the value for "http.server.host", the handler instrumentation (which is what currently hurts performance by running the same metric instrumentation code for each route) could set the value for "http.server.handler", etc. All of the keys accumulated during the request could then be used during metric collection, which would happen once and be one of the first handlers in the middleware chain (thus giving it access to ALL of the gathered placeholders), simply by getting them from the replacer.

This way of gathering metrics will also leave us with actually meaningful metrics about handlers (since the http.server.handler key will be set only once, after the handler which actually gave a response returns) and everything else, since they are gathered once per request instead of multiple times like they are now (which is why the current metrics are pretty useless).
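
To make this concrete, here's a rough sketch of what a single replacer-driven metrics middleware could look like (the http.server.host / http.server.handler placeholder keys and the instrument wrapper are hypothetical, not existing Caddy APIs; caddy.ReplacerCtxKey, Replacer.GetString and the client_golang calls are real):

package metricssketch

import (
	"net/http"
	"time"

	"github.com/caddyserver/caddy/v2"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Observed once per request, with labels read from the replacer afterwards.
var requestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "caddy_http_request_duration_seconds",
	Help:    "Request latency labeled from replacer placeholders.",
	Buckets: prometheus.DefBuckets,
}, []string{"host", "handler"})

// instrument wraps the whole middleware chain exactly once.
func instrument(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r) // matchers/handlers downstream fill the replacer

		repl := r.Context().Value(caddy.ReplacerCtxKey).(*caddy.Replacer)
		host, _ := repl.GetString("http.server.host")       // assumed placeholder
		handler, _ := repl.GetString("http.server.handler") // assumed placeholder
		requestDuration.WithLabelValues(host, handler).Observe(time.Since(start).Seconds())
	})
}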

@francislavoie
Member

francislavoie commented Mar 25, 2022

Worth noting that metrics is already a handler directive to serve the actual metrics exposition endpoint (see https://caddyserver.com/docs/caddyfile/directives/metrics), so unfortunately I think we'll need a different config for it. We could have a labels directive maybe:

labels [<field> <value>] {
	<field> <value>
}

I'm thinking we might want to support some kind of setup where you could say when to evaluate the labels, i.e. if you want to add a label at the start of the request vs at the end (after all the next middlewares have run). Could be done with > and < prefixes on the field, so like >host and <content-type or something to that effect (as in, > for "on the way in" and < for "on the way out"). But idk if this makes sense?

How would your described changes behave with existing metrics collected? My understanding was that the histograms etc may need to be reset on config reload or something if their config changed? I'm not sure I totally follow. @hairyhenderson should definitely chime in here.

@renbou
Collaborator Author

renbou commented Mar 25, 2022

Currently metrics is a handler field in the JSON structure and the Caddyfile, however I'm talking about adding another metrics component to the http.server in the JSON structure (it would be global in the Caddyfile). I'm not sure if this will be OK with how the Caddyfile is parsed though.

As for when the metrics are collected: what I meant was that the metrics handler would now be initialized only once, as the first middleware in the chain, meaning that it'd start collecting metrics just before the request is handled and finish collecting them after, using the replacer afterwards since it's filled with all of the needed placeholders at that point.

Existing metrics are reset only if their labels change, pretty much. If the labels change, then there isn't a proper way to "migrate" metrics from the old label space to the new one, because they act like totally different metrics. For example, when adding a host label, it wouldn't make sense to "continue using the old metric" and just append a new label to it - the results wouldn't make sense (and this also can't be done using the prometheus library).

Actually, since the Caddy metrics to which we'd be adding the possibility of label modification (pretty much the request-specific histogram metrics) all depend on the same set of labels, we'd only need to keep the labels which were used to provision the current server plus the current labels as a global value, instead of having to maintain multiple lists (Collectors, Observers, ...).

@francislavoie
Member

http.servers JSON options would go in the Caddyfile's servers global options: https://caddyserver.com/docs/caddyfile/options#server-options

I'm still not sure I totally follow, but that's probably just my fault for not having looked into it enough to wrap my head around it. But it sounds like more of a plan than we've had before, so that's exciting!

@renbou
Collaborator Author

renbou commented Mar 25, 2022

http.servers JSON options would go in the Caddyfile's servers global options: https://caddyserver.com/docs/caddyfile/options#server-options

Hmm, not all of them do - take http_port and https_port for example. Metrics could also be made global in the Caddyfile because I think it'd make more sense.

This shouldn't be that hard to implement, but I'd like to contribute a few optimization-related changes first, since they should overall clean up the request-handling code a bit. Basically, the way we won't be losing metrics is by using a new prometheus.Registry for each server (which actually collects the metrics and bundles them up for the HTTP endpoint) while reusing the actual metrics between server runs (if the labels for those haven't changed, meaning that critical metrics would never be lost because they don't depend on that set of labels).
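
As a very rough sketch of that reuse (the globalMetrics map, its keying, and collectorFor/provisionRegistry are made up for illustration; only the client_golang calls are real):

package metricssketch

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	globalMu      sync.Mutex
	globalMetrics = map[string]prometheus.Collector{} // keyed by metric name + label set
)

// collectorFor returns the already-existing collector for this name+labels
// (so its accumulated data survives the reload), or creates and remembers it.
func collectorFor(key string, newCollector func() prometheus.Collector) prometheus.Collector {
	globalMu.Lock()
	defer globalMu.Unlock()
	if c, ok := globalMetrics[key]; ok {
		return c
	}
	c := newCollector()
	globalMetrics[key] = c
	return c
}

// provisionRegistry builds a fresh per-server Registry on every (re)load,
// registering reused collectors so nothing is lost unless the labels changed.
func provisionRegistry() *prometheus.Registry {
	reg := prometheus.NewRegistry()
	reg.MustRegister(collectorFor("caddy_http_requests_total{host,handler}", func() prometheus.Collector {
		return prometheus.NewCounterVec(prometheus.CounterOpts{
			Name: "caddy_http_requests_total",
			Help: "Requests, labeled per the current config.",
		}, []string{"host", "handler"})
	}))
	return reg
}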

@francislavoie
Member

Hmm, not all of them do - take http_port and https_port for example

Those two options are on the http app, not on http.servers.

Metrics can also be made global in the Caddyfile because it'd make more sense, I think.

Well, it depends whether it's config per server or config on the http app, and it sounds like you're suggesting per-server. Wouldn't it be possible/make sense to have different label config per server?

@francislavoie
Member

Do you have an idea how #4140 could fit in maybe?

@renbou
Collaborator Author

renbou commented Mar 28, 2022

You're right, those options are on the http app. And the metrics config should definitely be a per-server thing (http.servers in the JSON), but in a Caddyfile we have no way of specifying multiple servers, which is why it should probably be part of the global options.

As for the reverse proxy upstreams - I really do not see the point in adding latency and status code metrics for upstreams, since that is already available through the present metrics. A health status metric can surely be added though, and per-path configuration should be possible with what I've described here: we could add a "metrics" structure to the routes (http.servers.routes[i]) which could be a metric inclusion/exclusion list for the route.

Considering that a lot of metrics might be added over time, we could also make the server-wide metrics config look like this:

metrics {
  labels {
    host {http.server.host}
  }
  include {
    reverse_proxy {
      upstream {http.handlers.reverse_proxy.upstream}
    }
  }
  exclude {
    request_size_bytes
  }
}

which would allow enabling/disabling various metrics globally. Though we'd definitely have to think this design through a bit more, lol, since this doesn't look very nice.

@francislavoie
Member

but in a Caddyfile we have no way of specifying multiple servers, which is why it should probably be a part of the global options

We do though; take another look at how the servers global option works. It maps directly to the servers in the JSON, using the listeners of each server as the pivot: https://caddyserver.com/docs/caddyfile/options#server-options

In case it wasn't clear, I was talking about whether metrics should go inside the servers global option, or whether it should be its own global option (beside servers). I think inside servers is probably best because you get some nice properties from that for free (like servers without any argument applying to all servers regardless of port being used)

Considering the fact that a lot of metrics might be added over the time, we can also make the server-wide metrics config look like this:

Hmm, interesting. Yeah, we'll probably need a proof-of-concept implementation to see how that feels and whether it actually works 👍

Thanks for looking into this!

@renbou
Collaborator Author

renbou commented Mar 28, 2022

Ah, okay. I wonder why some server options, such as automatic HTTPS, are added as global ones in the Caddyfile then...

@francislavoie
Member

Because servers was added later, and some of those don't really make sense to configure per server (we couldn't think of a valid use case where you'd want to configure different auto-HTTPS options per server).

@mholt
Member

mholt commented Mar 29, 2022

@renbou is on to something.

I was just sent a goroutine dump from a customer, and noticed that the metrics middleware is wrapping the response for every middleware that is chained in:

#	0x6a1875	net/http.(*http2serverConn).writeHeaders+0x175								GOROOT/src/net/http/h2_bundle.go:5883
#	0x6a2abb	net/http.(*http2responseWriterState).writeChunk+0x77b							GOROOT/src/net/http/h2_bundle.go:6149
#	0x6a209d	net/http.http2chunkWriter.Write+0x1d									GOROOT/src/net/http/h2_bundle.go:6063
#	0x4b7601	bufio.(*Writer).Flush+0x61										GOROOT/src/bufio/bufio.go:607
#	0x6a2f2e	net/http.(*http2responseWriter).Flush+0x2e								GOROOT/src/net/http/h2_bundle.go:6262
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0xf7ba94	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*ResponseWriterWrapper).Flush+0x34			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/responsewriter.go:51
#	0x12d5a3b	github.com/caddyserver/caddy/v2/modules/caddyhttp/reverseproxy.(*maxLatencyWriter).Write+0x1db		external/com_github_caddyserver_caddy_v2/modules/caddyhttp/reverseproxy/streaming.go:226
#	0x12d56fe	github.com/caddyserver/caddy/v2/modules/caddyhttp/reverseproxy.Handler.copyBuffer+0x31e			external/com_github_caddyserver_caddy_v2/modules/caddyhttp/reverseproxy/streaming.go:187
#	0x12d51fa	github.com/caddyserver/caddy/v2/modules/caddyhttp/reverseproxy.Handler.copyResponse+0x2fa		external/com_github_caddyserver_caddy_v2/modules/caddyhttp/reverseproxy/streaming.go:163
#	0x12d01fb	github.com/caddyserver/caddy/v2/modules/caddyhttp/reverseproxy.(*Handler).reverseProxy+0xfbb		external/com_github_caddyserver_caddy_v2/modules/caddyhttp/reverseproxy/reverseproxy.go:723
#	0x12ce494	github.com/caddyserver/caddy/v2/modules/caddyhttp/reverseproxy.(*Handler).ServeHTTP+0xe94		external/com_github_caddyserver_caddy_v2/modules/caddyhttp/reverseproxy/reverseproxy.go:464
#	0xf7725e	github.com/caddyserver/caddy/v2/modules/caddyhttp.(*metricsInstrumentedHandler).ServeHTTP+0x51e		external/com_github_caddyserver_caddy_v2/modules/caddyhttp/metrics.go:133
#	0xf7d09a	github.com/caddyserver/caddy/v2/modules/caddyhttp.wrapMiddleware.func1.1+0x3a				external/com_github_caddyserver_caddy_v2/modules/caddyhttp/routes.go:272
#	0xf6d82e	github.com/caddyserver/caddy/v2/modules/caddyhttp.HandlerFunc.ServeHTTP+0x2e				external/com_github_caddyserver_caddy_v2/modules/caddyhttp/caddyhttp.go:57

The offending code is routes.go, line 260:

func wrapMiddleware(ctx caddy.Context, mh MiddlewareHandler) Middleware {
	// wrap the middleware with metrics instrumentation
	metricsHandler := newMetricsInstrumentedHandler(caddy.GetModuleName(mh), mh)

This will result in excessive allocations and a relatively serious performance penalty. @hairyhenderson I know you're busy, but do you think we can escalate this? We will need to fix or remove this before the 2.5 release to ensure stability and reliability.

(It's my bad for not catching this in code review, but it was a large change, and I didn't realize the extent to which it was creating new response recorders.)

@mholt mholt added the bug 🐞 Something isn't working label Mar 29, 2022
@renbou
Collaborator Author

renbou commented Mar 29, 2022

@mholt I've described a way to fix this, as well as to add various requested metrics features, in the discussion above, and have begun thinking about how to actually implement it. I was planning to contribute a few other optimization-related PRs first, but if 2.5 is coming up soon (I haven't really seen a deadline), I can start working on this instead (probably closer to the weekend though).

@mholt
Member

mholt commented Mar 29, 2022

I am a little behind on things. Sorry I haven't gotten around to reading your latest posts in detail yet!

I suspect that moving the newMetricsInstrumentedHandler out of the middleware-chaining loop would be a good solution in terms of fixing the redundancy & performance problems. If that breaks the desired metrics functionality, then the initialization of metrics instrumentation may need to be split into two parts, so that the response is only "recorded" (yes I know it's not buffering the response, but still) once.

We don't have a firm deadline for 2.5, but soon-ish would be nice. :) Was kind of my hope to get it out at the beginning of April.

@renbou
Collaborator Author

renbou commented Mar 29, 2022

The proposed solution would instrument the handler chain with the metrics middleware only once, globally, and the handler-specific information would be preserved by wrapping each handler with a middleware which simply sets info about the handler (i.e. the first time a handler returns, its wrapper would set values in the request context signifying the name of the handler).
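
A minimal sketch of such a wrapper, assuming the same hypothetical http.server.handler placeholder as above (caddyhttp.HandlerFunc, caddy.ReplacerCtxKey and the Replacer API are real; nameHandler itself is made up):

package metricssketch

import (
	"net/http"

	"github.com/caddyserver/caddy/v2"
	"github.com/caddyserver/caddy/v2/modules/caddyhttp"
)

// nameHandler records which handler actually answered the request by setting
// a replacer value the first time its wrapped handler returns; the single
// metrics middleware at the front of the chain reads it afterwards.
func nameHandler(name string, next caddyhttp.Handler) caddyhttp.Handler {
	return caddyhttp.HandlerFunc(func(w http.ResponseWriter, r *http.Request) error {
		err := next.ServeHTTP(w, r)
		repl := r.Context().Value(caddy.ReplacerCtxKey).(*caddy.Replacer)
		if _, ok := repl.GetString("http.server.handler"); !ok {
			repl.Set("http.server.handler", name) // first handler to return wins
		}
		return err
	})
}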

@mholt mholt modified the milestones: 2.x, v2.5.1 Apr 25, 2022
@mholt
Member

mholt commented May 6, 2022

This still needs to be fixed soon, but I don't think it will be done by the 2.5.1 release; bumping back to 2.5.2.

@mholt mholt modified the milestones: v2.5.1, v2.5.2 May 6, 2022
@francislavoie francislavoie modified the milestones: v2.5.2, v2.6.0 Jun 29, 2022
@mholt
Member

mholt commented Sep 15, 2022

@renbou Is this still something you'd be interested in working on?

@hairyhenderson Any chance we could get your help on this one? Otherwise the fix may be less than optimal for Prometheus/Grafana users.

@mholt
Member

mholt commented Sep 15, 2022

Here's a simple comparison of serving a static file with and then without metrics:

[benchmark comparison screenshot]

(config is literally just :1234 \n file_server and index2.html is the Caddy website homepage)

Without metrics as they are currently implemented, Caddy can be about 12-15% faster. (For this test, I simply disabled the wrapping of each handler with the metrics handler.)

mholt added a commit that referenced this issue Sep 16, 2022
mholt added a commit that referenced this issue Sep 16, 2022
* caddyhttp: Make metrics opt-in

Related to #4644

* Make configurable in Caddyfile
@ReillyBrogan

FWIW we serve metrics on a different port than we use for proxying/file server, so we hope that that use case can still be supported however this is implemented in the future. I.e., if you have to use the metrics directive to opt in to metrics collection, it should be possible to not require a path as well, so that those metrics can be served from a different matcher with, for example, metrics /metrics.

@francislavoie
Member

francislavoie commented Oct 4, 2022

FWIW we serve metrics out of a different port than we use for proxying/file server, so we hope that that usecase can still be supported with however this is implemented in the future.

It definitely will, we have no plans to change that 👍

The metrics directive is just about serving a metrics endpoint. If we introduce a directive for enabling metrics, it would probably be a different one (though the naming is unfortunate).

@asychev

asychev commented Jan 18, 2023

Any updates on that or maybe some ETA?

@francislavoie
Member

We haven't heard from @renbou in quite some time. I feel like the last we heard of them roughly coincides with the war in Ukraine (their profile mentions Saint-Petersburg, Russia) around March/April 2022. 🤔

@francislavoie
Member

@renbou any chance you'll continue the work on this?

@renbou
Collaborator Author

renbou commented May 4, 2023

Hi, guys! I'm really sorry about my awful response time and the "abandoned" state I've left this issue in. Turns out that working a full-time job and studying for a bachelor's doesn't leave you with much free time (lol). I'd be really glad to take up this issue, as well as some of the other optimizations I've implemented in a different PR, and refactor/complete them.

I think we need to discuss, however, what is actually expected to be done with regards to the metrics. The last time I tried working on this, it wasn't really clear how we want metrics to work and look in the config. Also, I tried to bite off more than I could chew in a single sprint of changes (implement reloadable metrics, configurable metrics, etc.), so I think we need to decide what should be prioritized right now.

The main issue is performance, right? I'll check out the changes which have occurred in the codebase and see what can be optimized. Then we can probably move on to the more feature-related metrics tasks which need doing.

@mholt
Member

mholt commented May 4, 2023

Hi @renbou!!! Great to hear from you!

Hope your work and school are going well. I definitely know how that goes. What are you studying?

Would you like an invite to our developers Slack? That might make some back-and-forth discussion easier 👍 Edit: You're already in our Slack 😅 Want to pop in and say hi? We can discuss things there if you prefer, or just here. Either way.

@Tmpod

Tmpod commented Jul 28, 2023

Hello!
Has there been any progress on this matter? Since you mentioned continuing the discussion over on a Slack channel, I'm afraid nothing has been shared here yet.

@asychev

asychev commented Feb 6, 2024

Is this completely abandoned?

@francislavoie
Member

@asychev The core Caddy team doesn't have any experience with metrics, so we're not equipped to make any progress. We need help.

@asychev

asychev commented Feb 6, 2024

@francislavoie Sad to hear, but clear. Based on conversation above it is not about the metrics really, but middlewares handling architecture, isn't it?

@francislavoie
Member

Per-request metrics are disabled by default now as a mitigation for the majority of users, who don't need metrics.

But yes, the way we're doing per-request metrics currently is probably inefficient, and we need help to resolve it.

@SuperQ

SuperQ commented Jul 31, 2024

Hey all, Prometheus maintainer here. I'm not sure where this has gone wrong, but typical metric instrumentation with Prometheus client_golang should add only nanoseconds and almost no allocs or impact to the request path. This used to be perfectly fine in Caddy, so I don't know where things have gone wrong recently.

I'm not sure what "per-request" metrics means; all requests should be passed to Observe(). There should be zero impact in having metrics enabled by default.

@francislavoie
Member

francislavoie commented Jul 31, 2024

@SuperQ any help would be greatly appreciated.

For context, Caddy's HTTP handling is a middleware chain. When metrics were implemented, it was done by wrapping every handler in the routes with a function that adds metrics instrumentation. You can see that here (inside wrapMiddleware which calls newMetricsInstrumentedHandler per handler):

func (r *Route) ProvisionHandlers(ctx caddy.Context, metrics *Metrics) error {

Since v2.6.0 #5042 we made this wrapping opt-in so that it only happens if the user needs metrics (because not everyone actually scrapes metrics from Caddy, so why waste time doing that if it's not necessary?).

There's a lot of missing functionality in the metrics for Caddy. We have demand for them to be more configurable, include the hostname somehow to separate metrics for different domains, etc. But I and the rest of the core Caddy team don't use metrics ourselves and don't have any experience with this stuff, so we're not equipped to implement improvements.

The problems with metrics in general, as is my understanding:

  • The way metrics are configured is at odds with how Caddy config works. Caddy allows in-memory config changes (zero-downtime, graceful config reloads or changes). Since metrics are stored "globally" (honestly, not sure how they're stored in memory), how can we change config while retaining metrics stats through config changes? Right now metrics data is not lost, but config reloads might cause metrics to need to be reset? I don't know. But that doesn't sound ideal; users would expect metrics (counters, histograms, etc.) to persist, especially if they make frequent config reloads.
  • Cardinality is a problem. Caddy can support infinite domains using On-Demand TLS or wildcard certs, so how can we have per-domain metrics without runaway cardinality? We could make domain tracking optional, but that means having to reconfigure metrics based on the Caddy config, so if the config changes, how do we handle that (see the first point)?

See some of the other open issues for more detail I guess. There's some particularly relevant discussion in #4016. I've had a lot of informal conversations with @hairyhenderson about this on Slack (he implemented the initial version of metrics for us years ago but he no longer has time to maintain it) over the years, but we've never managed to figure out a path forwards.

@SuperQ

SuperQ commented Jul 31, 2024

The way metrics are configured is at odds with how Caddy config works. Caddy allows in-memory config changes (zero-downtime, graceful config reloads or changes). Since metrics are stored "globally" (honestly, not sure how it's stored in memory), how can we change config while retaining metrics stats through config changes? Right now metrics data is not lost, but config reloads might cause metrics to need to be reset? I don't know. But that doesn't sound ideal, users would expect metrics (counters, histograms etc) to persist, especially if they make frequent config reloads.

Metrics are stored in a Registry. There is a default registry, a process-global store, which is used if you do not explicitly create your own Registry.

Persistence is not something Caddy has to worry too much about; this is handled by Prometheus automatically. When Caddy is reloaded, you can choose between a couple of options:

  • Reset the metrics registry and re-init. This is the trivial / easy solution. Unless you're reloading every few seconds, it shouldn't be a problem.
  • Use the Delete() function when a config reload removes something (like a host).

Resetting a counter/histogram to zero at reload shouldn't be necessary.
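
For illustration, a rough sketch of the Delete() option (the metric name, the server label and onConfigReload are made up; the Vec API is real):

package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Counter with a per-server label; name and label are illustrative only.
var requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "caddy_http_requests_total",
	Help: "Requests by server.",
}, []string{"server"})

// onConfigReload would be called by (hypothetical) reload plumbing with the
// servers that disappeared from the new config.
func onConfigReload(removedServers []string) {
	for _, srv := range removedServers {
		// Drops only this server's series; every other series keeps its value.
		requestsTotal.Delete(prometheus.Labels{"server": srv})
	}
}

func main() {
	requestsTotal.WithLabelValues("srv0").Inc()
	requestsTotal.WithLabelValues("srv1").Inc()
	onConfigReload([]string{"srv1"}) // srv0's counter is untouched
}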

Cardinality is a problem. Caddy can support infinite domains using On-Demand TLS or wildcard certs, so how can we have per-domain metrics without having runaway cardinality? We could make domain tracking optional, but if we do that that means having to reconfigure metrics based on the Caddy config, so if the config changes, how do we handle that (see first point).

Cardinality is not a problem, it's simply a thing you have to capacity plan for. A typical Prometheus server can handle tens of millions of metrics. Per-domain metrics may be important depending on the deployment. The linked nginx-vts-exporter defaults to per-virtual-server labels. I suspect many users want to have at least per-server (domain and aliases) and maybe per-domain metrics by default. However, having some metrics or labels be configurable is a reasonable feature request.

Creating your own metrics registry var with prometheus.NewRegistry() and passing it through to modules may be the best option for dynamic reloading.
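
A minimal sketch of that setup (metric name and port are illustrative):

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// A per-config registry instead of the process-global default registry;
	// a reload would build a new one and re-register the collectors it keeps.
	reg := prometheus.NewRegistry()
	reg.MustRegister(collectors.NewGoCollector()) // optional runtime metrics

	requests := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "caddy_http_requests_total", // illustrative name
		Help: "Requests by server.",
	}, []string{"server"})
	reg.MustRegister(requests)

	// Serve only this registry's metrics.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":9180", nil))
}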

You may want to hop on the Prometheus dev community. The CNCF slack #prometheus-dev is the most active. There are folks around who know the details of client_golang better than I do. I'm happy to help here as both a Caddy user/fan and a Prometheus maintainer.

@francislavoie
Member

Some users may want to reload every few seconds, actually. Some use the config API and push config changes very rapidly. Caddy is able to handle that pretty well in many cases.

Caddy is able to support dynamic hostnames (e.g. On-Demand TLS, wildcards) so a Delete() solution doesn't seem like a viable idea.

I have no plans to work on this myself. I'm just a volunteer to the project, and I don't use metrics myself. I have no time to take this on. What we're looking for is someone with metrics experience to help us maintain this. I'm happy to facilitate in any way I can, but I can't reasonably spend time implementing any kind of solution.

@SuperQ

SuperQ commented Jul 31, 2024

If you want to be that dynamic, the Delete() solution is exactly what you want, rather than a registry reset.

The Delete allows you to remove stale dynamic records without resetting the whole registry. This leaves all of the existing entries untouched. Basically, if the API call/reload removes a record from serving, it can trigger a delete of the metrics.

Remember that metrics in a registry are organized by "metric family". The metric family is the top-level name, like caddy_http_request_duration_seconds. The delete would be for a single label, like Delete(prometheus.Labels{"server": "some-server-name"}).

@francislavoie
Member

What I'm saying is that in the On-Demand case, we have no way to know if something is stale. There's no reference to the domain in the config, because TLS-SNI can contain anything, and we may issue a cert for that domain right away. We don't know if a domain will no longer be used, we just have to rely on letting certs expire normally if they stop being used long term. How can we delete something we have no reference to? I don't mean dynamic config, I mean dynamic at runtime.

@SuperQ

SuperQ commented Jul 31, 2024

I get it now, that makes a bit more sense. If you don't remove the key from your database, then there's nothing to delete.

But these label values should continue to exist in metrics anyway, as they are still valid in the runtime. So there is no problem there.

I was mostly thinking about the server label. To my understanding, this is explicitly listed in the config/runtime, so can be positively expired when it is no longer a valid route.

@francislavoie
Member

I was mostly thinking about the server label. To my understanding, this is explicitly listed in the config/runtime, so can be positively expired when it is no longer a valid route.

The server label doesn't really matter. It's not that meaningful. It is auto-generated sequentially from Caddyfile config (e.g. srv0, srv1), but users may override their names to be more useful. Servers map to a specific listener (port number), not to domains. One server might be for listening on :443, another might be for :80.

But these label values should continue to exist in metrics anyway, as they are still valid in the runtime. So there is no problem there.

My concern with that is still about memory usage scaling though. We have people who run Caddy instances with hundreds of thousands of domains. They're definitely constrained on memory because Caddy needs to keep a certain number of certificates decoded in cache (up to a limit, with eviction). Wouldn't it also make metrics HTTP responses huge? That's the trouble with cardinality. So it would definitely need to be opt-in, or have some kind of mechanism for limiting which domains are tracked to keep it in check. Not simple.

@SuperQ

SuperQ commented Jul 31, 2024

It might be nice to be able to name servers rather than just auto-generate. :)

Prometheus metrics memory usage scales mostly linearly with cardinality; it's a simple Go map data structure. I have a simple test binary for generating random cardinality and testing metrics:

package main

import (
	"fmt"
	"log"
	"net/http"
	_ "net/http/pprof"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"github.com/thanhpk/randstr"
)

func main() {
	fmt.Println("Generating metrics")
	metricTest := promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "test_cardinality_total",
			Help: "Test for cardinality",
		},
		[]string{"test_label"},
	)

	for i := 0; i < 100_000; i++ {
		label := randstr.String(64)
		fmt.Printf("Generating metric %d %s\n", i, label)
		metricTest.WithLabelValues(label).Inc()
	}

	http.Handle("/metrics", promhttp.Handler())
	fmt.Println("Listening on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}

Running this test only results in about 27MiB of heap use.

The Prometheus client_golang is reasonably well optimized.

I agree, opt-in for the domain-name label would be a good idea. I don't think we'd want to get fancy with include/exclude filter limiting; that feels like premature optimization. Users who care about per-domain metrics can opt in. Everyone else gets the default per-server metrics.

@francislavoie
Member

It might be nice to be able to name servers rather than just auto-generate. :)

You can: https://caddyserver.com/docs/caddyfile/options#name
