Excessive Prometheus label cardinality issue #2502

seremenko-wish · 2021-04-27T05:41:24Z

Describe the bug
Excessive Prometheus label cardinality degrades the server. The MetricsManager middleware creates a new endpoint label value for every unique URL path, which depletes server resources and eventually causes panics (see https://github.com/ory/hydra/blob/master/metrics/prometheus/middleware.go#L30). Even a very short-lived pentest scan can generate hundreds of thousands of new time series, which could cause Hydra, Prometheus, and all related monitoring tools to fail.

A few examples:

hydra_response_time_seconds_count{_shard_id="003",az="useast1a",cluster="test.k8s.local",endpoint="/oauth2/'"/logout",env="prod",instance="10.10.10.10:3333",job="test-auth",k8s_namespace="test-auth",k8s_pod="test-auth-649f5965b8-brqgc",node="ip-10-10-10-10",region="useast1"}
hydra_response_time_seconds_count{_shard_id="003",az="useast1a",cluster="test.k8s.local",endpoint="/oauth2/(nslookup hitflvcefreks16da7.bxss.me||perl -e "gethostbyname('hitflvcefreks16da7.bxss.me')")/logout",env="prod",instance="10.10.10.10:3333",job="test-auth",k8s_namespace="test-auth",k8s_pod="test-auth-649f5965b8-brqgc",node="ip-10-10-10-10",region="useast1"}
hydra_response_time_seconds_count{_shard_id="003",az="useast1a",cluster="test.k8s.local",endpoint="/oauth2/-1 OR 3+427-427-1=0+0+0+1 -- /logout",env="prod",instance="10.10.10.10:3333",job="test-auth",k8s_namespace="test-auth",k8s_pod="test-auth-649f5965b8-brqgc",node="ip-10-10-10-10",region="useast1"}
hydra_response_time_seconds_count{_shard_id="003",az="useast1a",cluster="test.k8s.local",endpoint="/oauth2//%2e%2e%5c%2e%2e%5c%2e%2e%5c%2e%2e%5c%2e%2e%5c%2e%2e%5c%2e%2e%5c%2e%2e%5cetc/passwd",env="prod",instance="10.10.10.10:3333",job="test-auth",k8s_namespace="test-auth",k8s_pod="test-auth-649f5965b8-brqgc",node="ip-10-10-10-10",region="useast1"}
hydra_response_time_seconds_count{_shard_id="003",az="useast1a",cluster="test.k8s.local",endpoint="/oauth2//..%c0%af..%c0%af..%c0%af..%c0%af..%c0%af..%c0%af..%c0%af..%c0%af/etc/passwd",env="prod",instance="10.10.10.10:3333",job="test-auth",k8s_namespace="test-auth",k8s_pod="test-auth-649f5965b8-brqgc",node="ip-10-10-10-10",region="useast1"}
hydra_response_time_seconds_count{_shard_id="003",az="useast1a",cluster="test.k8s.local",endpoint="/oauth2//.../.../.../.../.../.../.../.../etc/passwd",env="prod",instance="10.10.10.10:3333",job="test-auth",k8s_namespace="test-auth",k8s_pod="test-auth-649f5965b8-brqgc",node="ip-10-10-10-10",region="useast1"}
hydra_response_time_seconds_count{_shard_id="003",az="useast1a",cluster="test.k8s.local",endpoint="/oauth2//..\..\..\..\..\..\..\..\etc/passwd",env="prod",instance="10.10.10.10:3333",job="test-auth",k8s_namespace="test-auth",k8s_pod="test-auth-649f5965b8-brqgc",node="ip-10-10-10-10",region="useast1"}
hydra_response_time_seconds_count{_shard_id="003",az="useast1a",cluster="test.k8s.local",endpoint="/oauth2//./WEB-INF/web.xml",env="prod",instance="10.10.10.10:3333",job="test-auth",k8s_namespace="test-auth",k8s_pod="test-auth-649f5965b8-brqgc",node="ip-10-10-10-10",region="useast1"}
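
To give a rough sense of scale, here is a minimal Go sketch of how an unbounded endpoint label multiplies time series. It does not use Hydra's code or the Prometheus client library. The names rawPathLabels, scanPaths, and seriesPerLabelValue are hypothetical, and the figure of 14 series per label value is an assumption taken from the /randomRequest output later in this report (12 buckets plus _sum and _count).

```go
package main

import "fmt"

// seriesPerLabelValue is an assumption based on the reproduction output in
// this report: 12 histogram buckets (including +Inf) plus _sum and _count.
const seriesPerLabelValue = 14

// rawPathLabels simulates a middleware that uses the raw request path as the
// "endpoint" label value. It returns the set of distinct label values seen.
func rawPathLabels(paths []string) map[string]struct{} {
	seen := make(map[string]struct{})
	for _, p := range paths {
		seen[p] = struct{}{}
	}
	return seen
}

// scanPaths generates n distinct probe paths, standing in for a scanner that
// injects a different payload into every request.
func scanPaths(n int) []string {
	paths := make([]string, 0, n)
	for i := 0; i < n; i++ {
		paths = append(paths, fmt.Sprintf("/oauth2/probe-%d/logout", i))
	}
	return paths
}

func main() {
	labels := rawPathLabels(scanPaths(100000))
	fmt.Printf("distinct endpoint values: %d, time series: %d\n",
		len(labels), len(labels)*seriesPerLabelValue)
}
```

Under these assumptions, every distinct path a scanner sends adds another 14 series to a single histogram, and those series are never dropped while the process is running.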

Reproducing the bug

1. Run Hydra locally.
2. Send a request: curl http://localhost:4444/randomRequest
3. Fetch the metrics: curl http://localhost:4445/metrics/prometheus
4. The response from step 3 contains the following metrics:

hydra_response_time_seconds_bucket{buildTime="",endpoint="/randomRequest",hash="",version="",le="0.005"} 1
hydra_response_time_seconds_bucket{buildTime="",endpoint="/randomRequest",hash="",version="",le="0.01"} 1
hydra_response_time_seconds_bucket{buildTime="",endpoint="/randomRequest",hash="",version="",le="0.025"} 1
hydra_response_time_seconds_bucket{buildTime="",endpoint="/randomRequest",hash="",version="",le="0.05"} 1
hydra_response_time_seconds_bucket{buildTime="",endpoint="/randomRequest",hash="",version="",le="0.1"} 1
hydra_response_time_seconds_bucket{buildTime="",endpoint="/randomRequest",hash="",version="",le="0.25"} 1
hydra_response_time_seconds_bucket{buildTime="",endpoint="/randomRequest",hash="",version="",le="0.5"} 1
hydra_response_time_seconds_bucket{buildTime="",endpoint="/randomRequest",hash="",version="",le="1"} 1
hydra_response_time_seconds_bucket{buildTime="",endpoint="/randomRequest",hash="",version="",le="2.5"} 1
hydra_response_time_seconds_bucket{buildTime="",endpoint="/randomRequest",hash="",version="",le="5"} 1
hydra_response_time_seconds_bucket{buildTime="",endpoint="/randomRequest",hash="",version="",le="10"} 1
hydra_response_time_seconds_bucket{buildTime="",endpoint="/randomRequest",hash="",version="",le="+Inf"} 1
hydra_response_time_seconds_sum{buildTime="",endpoint="/randomRequest",hash="",version=""} 9.4364e-05
hydra_response_time_seconds_count{buildTime="",endpoint="/randomRequest",hash="",version=""} 1

Expected behavior
Ideally, only endpoints registered in the HTTP routers should show up as separate label values in Prometheus metrics. All other requests should share a single label value, such as unmatched, instead of each producing its own.

PR with a fix will be submitted soon.

Fixes #2502

seremenko-wish mentioned this issue Apr 27, 2021

fix: Prometheus URL label #2503

Merged

5 tasks

aeneasr added the bug Something is not working. label Apr 29, 2021

aeneasr self-assigned this Apr 29, 2021

aeneasr closed this as completed in #2503 May 19, 2021

aeneasr pushed a commit that referenced this issue May 19, 2021

fix: prometheus URL label (#2503)

f588ec6

Fixes #2502
