
[Prometheus] Add new prometheus metrics and metrics endpoint #827

Merged
merged 1 commit into ory:master on May 8, 2018

Conversation

@dolbik (Contributor) commented Apr 26, 2018

Signed-off-by: Dmitry Dolbik <dolbik@gmail.com>

This is a draft implementation to support Prometheus in Hydra (#669). Prometheus collects all the metrics the current telemetry does, and more. In this pull request, the current telemetry and Prometheus work together.
Added two new Prometheus metrics:

  • Secure token service requests served per endpoint
  • Secure token service response time per endpoint

The resulting Prometheus output:
```
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
go_goroutines 16
go_memstats_alloc_bytes 3.479416e+06
go_memstats_alloc_bytes_total 3.479416e+06
go_memstats_buck_hash_sys_bytes 1.445704e+06
go_memstats_frees_total 1375
go_memstats_gc_sys_bytes 268288
go_memstats_heap_alloc_bytes 3.479416e+06
go_memstats_heap_idle_bytes 679936
go_memstats_heap_inuse_bytes 4.857856e+06
go_memstats_heap_objects 19294
go_memstats_heap_released_bytes_total 0
go_memstats_heap_sys_bytes 5.537792e+06
go_memstats_last_gc_time_seconds 0
go_memstats_lookups_total 12
go_memstats_mallocs_total 20669
go_memstats_mcache_inuse_bytes 13888
go_memstats_mcache_sys_bytes 16384
go_memstats_mspan_inuse_bytes 59280
go_memstats_mspan_sys_bytes 65536
go_memstats_next_gc_bytes 4.473924e+06
go_memstats_other_sys_bytes 1.028528e+06
go_memstats_stack_inuse_bytes 753664
go_memstats_stack_sys_bytes 753664
go_memstats_sys_bytes 9.115896e+06
sts_requests_total{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master"} 1
sts_requests_total{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="0.005"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="0.01"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="0.025"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="0.05"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="0.1"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="0.25"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="0.5"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="1"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="2.5"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="5"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="10"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master",le="+Inf"} 1
sts_response_time_seconds_sum{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master"} 0.001876666
sts_response_time_seconds_count{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/metrics",hash="undefined",version="dev-master"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="0.005"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="0.01"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="0.025"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="0.05"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="0.1"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="0.25"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="0.5"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="1"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="2.5"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="5"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="10"} 1
sts_response_time_seconds_bucket{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master",le="+Inf"} 1
sts_response_time_seconds_sum{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master"} 0.000721316
sts_response_time_seconds_count{buildTime="2018-04-26 06:44:37.592314 +0000 UTC",endpoint="/health/status",hash="undefined",version="dev-master"} 1
```
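For reference, here is a minimal sketch (using the Go Prometheus client) of collectors that would produce sts_* samples like those above. The names, constant labels, and buckets follow the output shown; this is an illustration rather than the exact diff.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// NewSTSCollectors builds a per-endpoint request counter and a per-endpoint
// response-time histogram. Build metadata is attached as constant labels,
// matching the sample output above.
func NewSTSCollectors(version, hash, buildTime string) (*prometheus.CounterVec, *prometheus.HistogramVec) {
	constLabels := prometheus.Labels{
		"version":   version,
		"hash":      hash,
		"buildTime": buildTime,
	}

	requests := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name:        "sts_requests_total",
		Help:        "Secure token service requests served per endpoint.",
		ConstLabels: constLabels,
	}, []string{"endpoint"})

	responseTime := prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:        "sts_response_time_seconds",
		Help:        "Secure token service response time per endpoint.",
		Buckets:     prometheus.DefBuckets, // 0.005 ... 10, as in the buckets above
		ConstLabels: constLabels,
	}, []string{"endpoint"})

	prometheus.MustRegister(requests, responseTime)
	return requests, responseTime
}
```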

@aeneasr (Member) left a comment:

Thank you for the proposal. I think there's still some way to go, also because I'm quite unfamiliar with how Prometheus works. Can you recommend a crash course for me, so I can better judge the implementation here?

)

const (
	HealthStatusPath  = "/health/status"
	MetricsStatusPath = "/health/metrics"
Member:

This should probably be renamed to prometheus

@@ -100,10 +100,14 @@ func NewMetricsManager(issuerURL string, databaseURL string, l logrus.FieldLogge
		salt:      uuid.New(),
		BuildTime: buildTime, BuildVersion: version, BuildHash: hash,
	}

	go mm.RegisterSegment()
Member:

This is a side effect of NewMetricsHandler which might introduce some issues during testing. Not sure if I like this here.

Contributor Author:

I'll move it back to the old place, as we decided to separate the Prometheus and telemetry flags.

@@ -106,9 +106,7 @@ func RunHost(c *config.Config) func(cmd *cobra.Command, args []string) {

	if ok, _ := cmd.Flags().GetBool("disable-telemetry"); !ok && os.Getenv("DISABLE_TELEMETRY") != "1" {
Member:

Not sure if we should disable Prometheus if this is true. The idea of this flag is to allow people to disable sending reports to our servers. Prometheus runs locally and doesn't send anything to us.

Member:

ping @dolbik

Contributor Author:

@arekkas I left this part practically unchanged. Server reports stay as-is. The Prometheus middleware is always enabled: https://github.com/ory/hydra/pull/827/files#diff-34e2a4fce852fbba1d689b617c6d4a94R114

Member:

Oh Yeah! Sorry, didn't see that - perfect :)

		},
		[]string{"endpoint"},
	),
	ResponseTime: prometheus.NewHistogramVec(
Member:

How do we decide which prometheus statistics to use here? Is sts_response_time_seconds standardized somewhere, or chosen arbitrarily?

Contributor Author:

https://prometheus.io/docs/concepts/metric_types/

HistogramVec is used for request durations, for example.

The name of the counter should follow https://prometheus.io/docs/practices/naming/ and we can change it according to the rules in that doc.
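As a concrete reading of that naming guide, here is a hedged sketch of how the Go client assembles a histogram's full name from namespace, subsystem, name, and unit. The specific values are only an example, not a decision for this PR.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// newResponseTime shows the name being assembled as namespace_subsystem_name,
// with the base unit (seconds) spelled out in the suffix per the naming guide.
// Full name here: sts_http_response_time_seconds (example values only).
func newResponseTime() *prometheus.HistogramVec {
	return prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Namespace: "sts",                   // product or deployment prefix (debated below)
		Subsystem: "http",                  // hypothetical subsystem, not in the PR
		Name:      "response_time_seconds", // suffix carries the base unit
		Help:      "Response time per endpoint in seconds.",
		Buckets:   prometheus.DefBuckets,
	}, []string{"endpoint"})
}
```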

Member:

> The name of the counter should follow https://prometheus.io/docs/practices/naming/ and we can change it according to the rules in that doc.

Damn, that's what I feared. Naming is subject to the environment you're in, so hardcoding this is not a very good idea. Another issue is which metrics we collect: do we or do we not want to collect CPU metrics, for example? And what about the thousands of other possible metrics?

I feel like implementing prometheus is like implementing proper analytics/tracking - the data being sent is really dependent on what you want to learn.

Member:

Just as an example: you want to name this secure_token_service. I'd disagree and say to use the name of the product. Someone else might want oauth2_server here, or whatever. It's really hard to define something that is so dependent on the context it's being used in.

Contributor Author:

I agree with you. A configurable value could be added for naming (for example CLUSTER_NAME, defaulting to hydra). In that case the metric name would be, for example, CLUSTER_NAME_response_time_seconds.
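A minimal sketch of what that could look like with the Go client, assuming a hypothetical CLUSTER_NAME environment variable (not an existing Hydra option). The value has to form a valid metric-name prefix.

```go
package metrics

import (
	"os"

	"github.com/prometheus/client_golang/prometheus"
)

// newRequestCounter uses a configurable name prefix instead of a hardcoded one.
func newRequestCounter() *prometheus.CounterVec {
	namespace := os.Getenv("CLUSTER_NAME") // hypothetical setting from this thread
	if namespace == "" {
		namespace = "hydra" // proposed default
	}
	return prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: namespace, // e.g. hydra_requests_total
		Name:      "requests_total",
		Help:      "Requests served per endpoint.",
	}, []string{"endpoint"})
}
```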

Contributor Author:

By default the Prometheus client gathers information about:

  • Number of goroutines that currently exist.
  • A summary of the GC invocation durations
  • Number of bytes allocated and still in use
  • Total number of bytes allocated, even if freed
  • Number of bytes obtained by system. Sum of all system allocations
  • Total number of pointer lookups
  • Total number of mallocs
  • Total number of frees
  • Number of heap bytes allocated and still in use
  • Number of heap bytes obtained from system
  • Number of heap bytes waiting to be used
  • Number of heap bytes that are in use
  • Total number of heap bytes released to OS
  • Number of allocated objects
  • Number of bytes in use by the stack allocator
  • Number of bytes obtained from system for stack allocator
  • Number of bytes in use by mspan structures
  • Number of bytes used for mspan structures obtained from system
  • Number of bytes in use by mcache structures
  • Number of bytes used for mcache structures obtained from system
  • Number of bytes used by the profiling bucket hash table
  • Number of bytes used for garbage collection system metadata
  • Number of bytes used for other system allocations
  • Number of heap bytes when next garbage collection will take place
  • Number of seconds since 1970 of last garbage collection
  • Total user and system CPU time spent in seconds
  • Number of open file descriptors
  • Maximum number of open file descriptors
  • Virtual memory size in bytes
  • Resident memory size in bytes
  • Start time of the process since unix epoch in seconds

All other metrics have to be added manually. That's why I added two metrics that may be useful for users.
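For reference, a minimal sketch of how those defaults end up on an endpoint: the Go client's default registry already contains the Go runtime and process collectors, so serving promhttp.Handler() exposes them without any extra registration. The address below is illustrative; the path is the one used in this PR.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// promhttp.Handler() serves prometheus.DefaultGatherer, which already
	// includes the Go runtime and process collectors listed above.
	http.Handle("/health/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":4444", nil))
}
```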

Member:

Ah I see - maybe a good idea then would be to just send the default metrics for now?

Contributor Author:

In that case the new middleware does not make sense; only a new endpoint will be added under "health".

Should I leave the middleware as a skeleton for the future, or can it be deleted?

Contributor Author:

I propose to leave an empty Prometheus middleware with examples. That gives users a simple way to add custom metrics if needed.
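A rough sketch of what such a skeleton could look like, following the negroni-style ServeHTTP signature visible elsewhere in this diff. Everything here is illustrative rather than the PR's final code.

```go
package metrics

import "net/http"

// PrometheusMiddleware is an intentionally empty skeleton: it forwards every
// request unchanged and marks the place where custom metrics can be recorded.
type PrometheusMiddleware struct {
	// Custom collectors (CounterVec, HistogramVec, ...) would live here.
}

func (m *PrometheusMiddleware) ServeHTTP(rw http.ResponseWriter, r *http.Request, next http.HandlerFunc) {
	next(rw, r)

	// Example hook: increment a per-endpoint counter or observe a response
	// time here, e.g. m.Requests.WithLabelValues(r.URL.Path).Inc().
}
```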

Member:

Yeah, I think that makes sense - we could, in the future, also have the ability to load a plugin for handling the Prometheus middleware, which would make this a no-brainer to use.

}

func NewMetricsHandler(l logrus.FieldLogger, version, hash, buildTime string) *PrometheusHandler {
	l.Info("Setting up Prometheus metrics")
Member:

Factories should try to avoid side effects such as logging.

}

func (pmm *PrometheusHandler) ServeHTTP(rw http.ResponseWriter, r *http.Request, next http.HandlerFunc) {
	defer func(start time.Time) {
Member:

Not sure we need a defer here? Just put it behind next?
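For comparison, a hedged sketch of the non-deferred variant being suggested, reusing the ResponseTime histogram and the endpoint label from the snippets above. Only the fields needed for the sketch are shown, and their exact placement is assumed.

```go
package metrics

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

type PrometheusHandler struct {
	ResponseTime *prometheus.HistogramVec // assumed field, mirroring the diff
}

func (pmm *PrometheusHandler) ServeHTTP(rw http.ResponseWriter, r *http.Request, next http.HandlerFunc) {
	start := time.Now()
	next(rw, r)

	// Observing after next() returns measures the same duration as the
	// deferred closure would, as long as next() does not panic.
	pmm.ResponseTime.WithLabelValues(r.URL.Path).Observe(time.Since(start).Seconds())
}
```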

@dolbik (Contributor Author) commented Apr 26, 2018

@arekkas Maybe this article can help https://blog.alexellis.io/prometheus-monitoring/

@dolbik dolbik force-pushed the Prometheus branch 3 times, most recently from 888fd6b to 7c2cc77 on April 26, 2018 at 12:46
"buildTime": buildTime,
},
}
return pm
Contributor Author:

I propose to leave an empty Prometheus middleware with examples. That gives users a simple way to add custom metrics if needed.

Member:

Yeah, that makes sense!

Signed-off-by: Dmitry Dolbik <dolbik@gmail.com>
@dolbik (Contributor Author) commented Apr 27, 2018

@arekkas Looks like I have finished the code according to the discussion.

@dolbik (Contributor Author) commented May 2, 2018

@arekkas Are there any actions needed from my side?

@aeneasr (Member) commented May 2, 2018

No, all good. I'm just extremely busy with #836 right now; I'll merge it soon.

@aeneasr aeneasr merged commit ef94f98 into ory:master May 8, 2018
@aeneasr (Member) commented May 8, 2018

Thank you for your contribution!

mgalagan-sugarcrm pushed a commit to sugarcrm/hydra that referenced this pull request May 14, 2018
* health: Adds new prometheus metrics and metrics endpoint (ory#827)

Signed-off-by: Dmitry Dolbik <dolbik@gmail.com>

* IDM-410 Prometheus Metrics for the Secure token Service
mgloystein pushed a commit to spotxchange/hydra that referenced this pull request May 14, 2018
Based on ory#827

JIRA: WAP-1891
mgloystein added a commit to spotxchange/hydra that referenced this pull request May 14, 2018
* OData no longer defaulted, using header to cause it

* Simpler value

* Add Prometheus endpoint

Based on ory#827

JIRA: WAP-1891
@aeneasr aeneasr mentioned this pull request May 24, 2018
@aeneasr (Member) commented Nov 18, 2018

@dolbik would you mind sharing how you set up Prometheus with Hydra? It would be excellent if you could create a Docker example here and create a PR for it :)

@mobintmu commented May 2, 2023

How can I calculate the number of users whose sessions haven't expired yet? My goal is to count the number of online users. Does the Prometheus feature support counting users with active sessions?
