Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add per_host option to metrics handler #6279

Closed
wants to merge 3 commits into from

Conversation

hussam-almarzoq
Copy link
Contributor

This addresses parts of #3784 and #4016. I understand the considerations mentioned by @hairyhenderson regarding the cardinality issue, but I think this is more of a Prometheus issue rather than a Caddy issue. Unlike #4644, no change in Caddy would solve it.

I think adding a clear caution in the docs explaining the implications of enabling this option, much like enabling server metrics, should be enough for the users to understand the tradeoff.

@CLAassistant
Copy link

CLAassistant commented Apr 28, 2024

CLA assistant check
All committers have signed the CLA.

@francislavoie
Copy link
Member

Cardinality is a Caddy problem, because some users serve thousands of domains via On-Demand TLS.

That said, I'll let @hairyhenderson address this, I'm not in touch with metrics stuff.

@hussam-almarzoq
Copy link
Contributor Author

hussam-almarzoq commented Apr 29, 2024

Cardinality is a Caddy problem, because some users serve thousands of domains via On-Demand TLS.

That said, I'll let @hairyhenderson address this, I'm not in touch with metrics stuff.

Cardinality limitation is a Prometheus limitation not a Caddy one. It is a consideration that you would have to deal with in Caddy or in any other system that uses Prometheus. I think Caddy should not make it inherently impossible to do this, but have the option there with a sane default – much like having metrics disabled by default. I would love to hear from @hairyhenderson.

@hairyhenderson
Copy link
Collaborator

I'm OK in general with this feature, as long as it defaults to false and is well-documented.

@hussam-almarzoq do you think you could issue a companion PR for caddyserver/website that we could merge soon after this?

I have a few review comments - will post those inline.

@@ -33,7 +35,7 @@ var httpMetrics = struct {
func initHTTPMetrics() {
const ns, sub = "caddy", "http"

basicLabels := []string{"server", "handler"}
basicLabels := []string{"server", "handler", "host"}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than always setting host even when per_host is false, can you pass through the perHost bool and set it conditionally?

Suggested change
basicLabels := []string{"server", "handler", "host"}
basicLabels := []string{"server", "handler"}
if perHost {
basicLabels = append(basicLabels, "host")
}

(same for httpLabels below on ln62)

@@ -144,7 +152,7 @@ func (h *metricsInstrumentedHandler) ServeHTTP(w http.ResponseWriter, r *http.Re
// probably falling through with an empty handler.
if statusLabels["code"] == "" {
// we still sanitize it, even though it's likely to be 0. A 200 is
// returned on fallthrough so we want to reflect that.
// returned on fallthrough, so we want to reflect that.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is unrelated to this PR - can you revert it? (I don't disagree with it, but it's also entirely unnecessary as the previous sentence was also grammatically correct)

Suggested change
// returned on fallthrough, so we want to reflect that.
// returned on fallthrough so we want to reflect that.

server := serverNameFromContext(r.Context())
labels := prometheus.Labels{"server": server, "handler": h.handler}
labels := prometheus.Labels{"server": server, "handler": h.handler, "host": host}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should conditionally set host:

Suggested change
labels := prometheus.Labels{"server": server, "handler": h.handler, "host": host}
labels := prometheus.Labels{"server": server, "handler": h.handler}
if h.perHost {
labels["host"] = host
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please also add some tests for the case where per_host is set?

@hairyhenderson
Copy link
Collaborator

In chatting with @francislavoie we realized that only adding the host label when the config is set will break reloads. So before we do that we need to support reinitializing metrics on reload.

Until that's done, I think this change is blocked.

@francislavoie francislavoie added feature ⚙️ New feature or request do not merge ⛔ Not ready yet! labels Apr 30, 2024
@francislavoie
Copy link
Member

francislavoie commented Apr 30, 2024

I still disagree with cardinality not being an issue. If a user has this enabled and is serving a wildcard domain where the left-most label is mapped to a username in the app (e.g. <username>.example.com) then there can be an infinite amount of domains.

This means the host label's cardinality is infinite. This means more and more RAM will be consumed until Caddy get's OOM-killed. That's not ok. It could be argued we just need to document "don't enable this if the amount of hosts you have is unbound (e.g. wildcard domains or On-Demand TLS)" but I don't think that's a good idea, some users will not properly read the disclaimer and will enable it and have a server that's vulnerable to being OOM-killed by an attacker.

Also yeah, reloading config is still a problem. Caddy needs to support graceful reloads, so if the config changes we need to update the metrics config. AFAIK this isn't possible right now because of global state, and if we changed it to be non-global state then all metrics will be reset on reload (so all counts/averages/etc are lost on reload). I don't know how we'll solve that.

@hussam-almarzoq
Copy link
Contributor Author

In chatting with @francislavoie we realized that only adding the host label when the config is set will break reloads. So before we do that we need to support reinitializing metrics on reload.

Until that's done, I think this change is blocked.

I think we can avoid the reload issue by having a default value for the label when it is disabled. This way the config can be changed without changing the label count

@hussam-almarzoq
Copy link
Contributor Author

@hairyhenderson any thoughts?

@hairyhenderson
Copy link
Collaborator

I think we can avoid the reload issue by having a default value for the label when it is disabled.

@hussam-almarzoq I'm not excited about that option, because it's now adding a new label to everyone's metrics whether they want it or not.

The key here is:

we need to support reinitializing metrics on reload.

This is necessary for other metrics work as well, not just for this PR.

To respond to what @francislavoie said:

Also yeah, reloading config is still a problem. Caddy needs to support graceful reloads, so if the config changes we need to update the metrics config. AFAIK this isn't possible right now because of global state, and if we changed it to be non-global state then all metrics will be reset on reload (so all counts/averages/etc are lost on reload). I don't know how we'll solve that.

Ultimately I think it's OK for everything to reset on reload. It'll appear as if the process died and was restarted, but that's fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do not merge ⛔ Not ready yet! feature ⚙️ New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support per-host Prometheus metrics
5 participants