[Prometheus] Add new prometheus metrics and metrics endpoint #827
Conversation
Thank you for the proposal. I think there's still some way to go, also because I'm quite unfamiliar with how Prometheus works. Can you recommend a crash course for me, so I can better judge the implementation here?
health/handler.go
Outdated
const (
	HealthStatusPath  = "/health/status"
	MetricsStatusPath = "/health/metrics"
This should probably be renamed to prometheus
metrics/telemetry/middleware.go
Outdated
@@ -100,10 +100,14 @@ func NewMetricsManager(issuerURL string, databaseURL string, l logrus.FieldLogge
	salt: uuid.New(),
	BuildTime: buildTime, BuildVersion: version, BuildHash: hash,
}

go mm.RegisterSegment()
This is a side effect of NewMetricsHandler, which might introduce some issues during testing. Not sure if I like this here.
I'll move it back to the old place, as we decided to separate the Prometheus and telemetry flags.
@@ -106,9 +106,7 @@ func RunHost(c *config.Config) func(cmd *cobra.Command, args []string) {

if ok, _ := cmd.Flags().GetBool("disable-telemetry"); !ok && os.Getenv("DISABLE_TELEMETRY") != "1" {
Not sure if we should disable Prometheus if this is true. The idea of this flag is to allow people to disable sending reports to our servers. Prometheus runs locally and doesn't send anything to us.
ping @dolbik
@arekkas I left this part practically unchanged. Server reports stay as is. Prometheus middleware is always enabled https://github.com/ory/hydra/pull/827/files#diff-34e2a4fce852fbba1d689b617c6d4a94R114
Oh Yeah! Sorry, didn't see that - perfect :)
metrics/prometheus/metrics.go
Outdated
},
[]string{"endpoint"},
),
ResponseTime: prometheus.NewHistogramVec(
How do we decide which Prometheus statistics to use here? Is sts_response_time_seconds standardized somewhere, or chosen arbitrarily?
https://prometheus.io/docs/concepts/metric_types/
HistogramVec is used for request durations, for example.
The counter should be named according to https://prometheus.io/docs/practices/naming/ and we can change it to follow the rules in that doc.
> Name of the counter should be like https://prometheus.io/docs/practices/naming/ and we can change it according to the rules in the doc

Damn, that's what I feared. Naming is subject to the environment you're in, so hardcoding this is not a very good idea. Another issue is which metrics we collect: do we or do we not want to collect CPU metrics? And what about the thousands of other possible metrics?
I feel like implementing Prometheus is like implementing proper analytics/tracking: the data being sent really depends on what you want to learn.
Just as an example, you want to name this secure_token_service. I'd disagree and say to use the name of the product. Someone else might want oauth2_server here, or whatever. It's really hard to define something that is so dependent on the context it's being used in.
I agree with you. A configurable value could be added for naming (for example CLUSTER_NAME, defaulting to hydra). In that case the metric name would be, for example, CLUSTER_NAME_response_time_seconds.
By default the Prometheus client gathers information about:
- Number of goroutines that currently exist.
- A summary of the GC invocation durations
- Number of bytes allocated and still in use
- Total number of bytes allocated, even if freed
- Number of bytes obtained by system. Sum of all system allocations
- Total number of pointer lookups
- Total number of mallocs
- Total number of frees
- Number of heap bytes allocated and still in use
- Number of heap bytes obtained from system
- Number of heap bytes waiting to be used
- Number of heap bytes that are in use
- Total number of heap bytes released to OS
- Number of allocated objects
- Number of bytes in use by the stack allocator
- Number of bytes obtained from system for stack allocator
- Number of bytes in use by mspan structures
- Number of bytes used for mspan structures obtained from system
- Number of bytes in use by mcache structures
- Number of bytes used for mcache structures obtained from system
- Number of bytes used by the profiling bucket hash table
- Number of bytes used for garbage collection system metadata
- Number of bytes used for other system allocations
- Number of heap bytes when next garbage collection will take place
- Number of seconds since 1970 of last garbage collection
- Total user and system CPU time spent in seconds
- Number of open file descriptors
- Maximum number of open file descriptors
- Virtual memory size in bytes
- Resident memory size in bytes
- Start time of the process since unix epoch in seconds
All other metrics have to be added manually. That's why I added two metrics that may be needed by users.
Ah I see - maybe a good idea then would be to just expose the default metrics for now?
In that case the new middleware does not make sense; the new endpoint will just be added under "health".
Should I leave the middleware as a skeleton for the future, or can it be deleted?
I propose to leave an empty middleware for Prometheus, with examples. It gives users a simple way to add custom metrics if needed.
Yeah, I think that makes sense. In the future we could also have the ability to load a plugin for handling the Prometheus middleware, which would make this a no-brainer to use.
metrics/prometheus/middleware.go
Outdated
}

func NewMetricsHandler(l logrus.FieldLogger, version, hash, buildTime string) *PrometheusHandler {
	l.Info("Setting up Prometheus metrics")
Factories should try and avoid side effects such as logging
metrics/prometheus/middleware.go
Outdated
}

func (pmm *PrometheusHandler) ServeHTTP(rw http.ResponseWriter, r *http.Request, next http.HandlerFunc) {
	defer func(start time.Time) {
Not sure we need a defer here? Just put it behind next?
@arekkas Maybe this article can help https://blog.alexellis.io/prometheus-monitoring/
metrics/prometheus/metrics.go
Outdated
"buildTime": buildTime, | ||
}, | ||
} | ||
return pm |
> I propose to leave an empty middleware for Prometheus, with examples. It gives users a simple way to add custom metrics if needed.
Yeah, that makes sense!
Signed-off-by: Dmitry Dolbik <dolbik@gmail.com>
@arekkas Looks like I have finished the code according to the discussion.
@arekkas Are any actions needed from my side?
No, all good, I'm just extremely busy with #836 right now, I'll merge it soon.
Thank you for your contribution!
* health: Adds new prometheus metrics and metrics endpoint (ory#827) Signed-off-by: Dmitry Dolbik <dolbik@gmail.com>
* IDM-410 Prometheus Metrics for the Secure token Service
Based on ory#827 JIRA: WAP-1891
* OData no longer defaulted, using header to cause it
* Simpler value
* Add Prometheus endpoint
Based on ory#827 JIRA: WAP-1891
How can I calculate the number of users that have sessions which aren't expired yet? My goal is to calculate the number of online users.
Signed-off-by: Dmitry Dolbik dolbik@gmail.com
This is a draft implementation to support Prometheus in Hydra (#669). Prometheus collects all the metrics the current telemetry does, and more. The current telemetry and Prometheus work together in this pull request.
Added two new Prometheus metrics:
The resulting Prometheus output:
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
go_goroutines 16
go_memstats_alloc_bytes 3.479416e+06
go_memstats_alloc_bytes_total 3.479416e+06
go_memstats_buck_hash_sys_bytes 1.445704e+06
go_memstats_frees_total 1375
go_memstats_gc_sys_bytes 268288
go_memstats_heap_alloc_bytes 3.479416e+06
go_memstats_heap_idle_bytes 679936
go_memstats_heap_inuse_bytes 4.857856e+06
go_memstats_heap_objects 19294
go_memstats_heap_released_bytes_total 0
go_memstats_heap_sys_bytes 5.537792e+06
go_memstats_last_gc_time_seconds 0
go_memstats_lookups_total 12
go_memstats_mallocs_total 20669
go_memstats_mcache_inuse_bytes 13888
go_memstats_mcache_sys_bytes 16384
go_memstats_mspan_inuse_bytes 59280
go_memstats_mspan_sys_bytes 65536
go_memstats_next_gc_bytes 4.473924e+06
go_memstats_other_sys_bytes 1.028528e+06
go_memstats_stack_inuse_bytes 753664
go_memstats_stack_sys_bytes 753664
go_memstats_sys_bytes 9.115896e+06
sts_requests_total{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master"} 1
sts_requests_total{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="0.005"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="0.01"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="0.025"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="0.05"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="0.1"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="0.25"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="0.5"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="1"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="2.5"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="5"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="10"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="+Inf"} 1
sts_response_time_seconds_sum{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master"} 0.001876666
sts_response_time_seconds_count{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="0.005"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="0.01"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="0.025"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="0.05"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="0.1"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="0.25"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="0.5"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="1"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="2.5"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="5"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="10"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="+Inf"} 1
sts_response_time_seconds_sum{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master"} 0.000721316
sts_response_time_seconds_count{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master"} 1