Monitoring with Prometheus #2068
Looks like a good idea. Having access to this kind of data would be great for us. It would improve things that are currently difficult, like diagnosing out-of-memory errors or tuning the LRU cache size, and give us more of an understanding of where performance bottlenecks might be. This needs feedback from @paulmelnikow and/or @espadrine to take forward, but to throw in my 2c I would be in favour of making the metrics completely public as long as:
What do you think about that issue - are there other things to consider which I have not mentioned?
Good points @chris48s!
I do not see such a threat now.
I will check the impact of reading metrics via `/metrics`.
This sounds great! Thanks so much for your work. System monitoring eases one of our current pain points with the servers. I like how this solution assembles off-the-shelf tools to solve the problem. It's a smart approach which gives us a lot of bang for the buck. These options seem solid, and though I haven't done any kind of comparison shopping, I think this is a great place to start. If it turns out down the line something else suits our needs better, we can change tack. (I have some familiarity with these tools, but haven't used Prometheus before and am definitely not an expert.)
Thanks for offering to host! I think it's awesome for you to host it for the time being. I'd want to make a plan for having it not depend on any one person, though. One idea is Grafana Labs, which offers cloud Grafana + Prometheus. I wonder if they would give us a donation.
I'm inclined to put this behind a secret key and/or limit IPs. Not because we don't want to share the data, but because I don't know much about prom-client so would rather be on the conservative side with the raw endpoint.
Perhaps we could start with the maintainers having access, and then publish a public dashboard once we have a good grip on how the tool works and what exactly we're sharing? One thing to keep in mind is that we have a "no tracking" promise. So we do need to make sure whatever we're monitoring doesn't amount to tracking. I don't think that will be hard, but let's not forget to think about it.
I wrote to them in April 2018, but I didn't get a response. I will forward you an email I wrote to them.
So I will add an IP limit for this resource in my PR.
OK. We can start with this approach. Since the dashboard won't be public at the beginning, we can discuss details on Discord.
👍 I have had this promise in mind from the beginning - I would like to aggregate data about performance.
@chris48s A few weeks ago I wrote that I would check the impact of reading `/metrics`.
If we're going to limit access to a small group, let's ignore that issue for now.
After discussion today, we decided to leave the metrics public.
Metrics are available here: https://metrics.shields.io

What's next?
I would like to start with some really basic usage metrics - the number of invocations over time (a counter) for badges (like the GitHub license badge). Prometheus has labels, so we can use several dimensions to describe metrics:
I will document this
Request counts are on my mind too. How would we do this reliably using Prometheus? The way it's built seems better at tracking current state than activity over time. Is there a pattern we can follow?
The application exposes metrics with the current state, but Prometheus stores time series data - activity over time. I wrote that we can start with a metric showing the number of invocations. It would be even better to start with a metric showing the amount of time it takes to process a badge request.
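(For illustration, here is a minimal prom-client sketch of those two metrics. The metric names, label names, and wiring below are hypothetical placeholders, not Shields' actual code.)

```
// Sketch only: hypothetical metric/label names, not Shields' actual code.
const client = require('prom-client')

// Counter: number of badge invocations, labeled by service.
const requestCounter = new client.Counter({
  name: 'service_requests_total',
  help: 'Total number of badge requests',
  labelNames: ['service'],
})

// Histogram: time spent processing a badge request.
const requestDuration = new client.Histogram({
  name: 'badge_request_duration_seconds',
  help: 'Time spent processing a badge request',
  labelNames: ['service'],
})

async function handleBadgeRequest(serviceName, render) {
  requestCounter.inc({ service: serviceName })
  const end = requestDuration.startTimer({ service: serviceName })
  try {
    return await render()
  } finally {
    end() // records the elapsed time in the histogram
  }
}
```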
Time it takes to process a badge request sounds good! We could even do that by badge type, which seems even more helpful. There are two challenges with using Prometheus to do overall request counts:
The issue prometheus/prometheus#2473 gets at what I meant. There may be a way we can hack it, though it seems like for that kind of analytics there might be a better tool we could consider.
This post talks about using Prometheus' `rate()` function.
Does that make sense to you @platan? I'm game to give that a try!
I agree with you, the `rate()` function should do the job.
The `increase()` function can be used in this case.
Prometheus is not a long-term storage solution (https://dev.to/mhausenblas/revisiting-promcon-2018-panel-on-prometheus-long-term-storage-5f1p). Do we need data from a longer period of time?
Ah, nice, it seems like some combination of `rate()` and `increase()` accomplishes this.
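(For reference, a sketch of the kind of queries in question, assuming a counter named `service_requests_total` as above:)

```
# Per-second request rate, averaged over the last 5 minutes.
# rate() tolerates counter resets, e.g. from server restarts.
rate(service_requests_total[5m])

# Approximate total number of requests over the last 24 hours.
sum(increase(service_requests_total[24h]))
```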
Longer than what? 😀 I think long-term data would be nice, and would make for interesting data to dig through, but probably isn't essential. A month seems like it could be adequate as a starting point.
Perhaps rather than update our legacy analytics for #1848, we could migrate them to Prometheus instead. The other thing we're tracking is which template is being used. I'd also like to track which logos are being used (and what proportion of requests has a logo). One possibility would be to use additional labels on the service-request badge. However, I worry that will take up more memory on the server and bandwidth in the metric requests. If we have each logo and each template, that increases the number of possible in-memory incrementers from one per service to ~10 per service. I'm not too familiar with how prom-client is implemented, but it's safe to assume it keeps all these labeled numbers in memory! Since we don't really care how these metrics correlate with which service, would it be better to create separate metrics for these other two dimensions?
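(A sketch of that separate-metrics idea, with hypothetical names: keeping template and logo in their own counters means the in-memory label combinations grow additively - services + templates + logos - rather than multiplicatively.)

```
// Sketch only: hypothetical metric names. Separate counters avoid
// multiplying the per-service label combinations.
const client = require('prom-client')

const templateCounter = new client.Counter({
  name: 'badge_template_requests_total',
  help: 'Badge requests by template',
  labelNames: ['template'],
})

const logoCounter = new client.Counter({
  name: 'badge_logo_requests_total',
  help: 'Badge requests by logo',
  labelNames: ['logo'],
})

// Incremented independently of the per-service counter, e.g.:
// templateCounter.inc({ template: 'flat-square' })
// logoCounter.inc({ logo: 'github' })
```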
This picks up #2068 by adding per-badge stats as discussed in #966.

It ensures every service has a unique `name` property. By default this comes from the class name, and is overridden in all the various places where the class names are duplicated. (Some of those don't seem that useful, like the various download interval services, though those need to be refactored down into a single service anyway.) Tests enforce the names are unique. These are the names used by the service-test runner, so it's a good idea to make them unique anyway. (It was sort of strange before that you had to specify `nuget` instead of e.g. `resharper`.)

I've added validation to `deprecatedService` and `redirector`, and required that every `route` has a `base`, even if it's an empty string.

The name is used to generate unique metric labels, generating metrics like these:

```
service_requests_total{category="activity",family="eclipse-marketplace",service="eclipse_marketplace_update"} 2
service_requests_total{category="activity",family="npm",service="npm_collaborators"} 3
service_requests_total{category="activity",family="steam",service="steam_file_release_date"} 2
service_requests_total{category="analysis",family="ansible",service="ansible_galaxy_content_quality_score"} 2
service_requests_total{category="analysis",family="cii-best-practices",service="cii_best_practices_service"} 4
service_requests_total{category="analysis",family="cocoapods",service="cocoapods_docs"} 2
service_requests_total{category="analysis",family="codacy",service="codacy_grade"} 3
service_requests_total{category="analysis",family="coverity",service="coverity_scan"} 2
service_requests_total{category="analysis",family="coverity",service="deprecated_coverity_ondemand"} 2
service_requests_total{category="analysis",family="dependabot",service="dependabot_semver_compatibility"} 3
service_requests_total{category="analysis",family="lgtm",service="lgtm_alerts"} 2
service_requests_total{category="analysis",family="lgtm",service="lgtm_grade"} 3
service_requests_total{category="analysis",family="snyk",service="snyk_vulnerability_git_hub"} 4
service_requests_total{category="analysis",family="snyk",service="snyk_vulnerability_npm"} 5
service_requests_total{category="analysis",family="symfony",service="sensiolabs_i_redirector"} 1
service_requests_total{category="analysis",family="symfony",service="symfony_insight_grade"} 1
service_requests_total{category="build",family="appveyor",service="app_veyor_ci"} 3
service_requests_total{category="build",family="appveyor",service="app_veyor_tests"} 6
service_requests_total{category="build",family="azure-devops",service="azure_dev_ops_build"} 6
service_requests_total{category="build",family="azure-devops",service="azure_dev_ops_release"} 5
service_requests_total{category="build",family="azure-devops",service="azure_dev_ops_tests"} 6
service_requests_total{category="build",family="azure-devops",service="vso_build_redirector"} 2
service_requests_total{category="build",family="azure-devops",service="vso_release_redirector"} 1
service_requests_total{category="build",family="bitbucket",service="bitbucket_pipelines"} 5
service_requests_total{category="build",family="circleci",service="circle_ci"} 5
```

This is predicated on being able to use Prometheus's [`rate()`](https://prometheus.io/docs/prometheus/latest/querying/functions/#rate) function to visualize a counter's rate of change, as mentioned at #2068 (comment). Otherwise the stats will be disrupted every time a server restarts.

The metrics only appear on new-style services.
Seems like this can be closed now! We've got the analytics working well via #3093. Let's open a new issue for any follow-on work.
https://github.com/platan/metrics-shields-io-config is an Ansible playbook which can be used to configure monitoring for Shields.io (https://metrics.shields.io) and to create a new instance of monitoring from scratch.
Should we add that info here? https://github.com/badges/shields/blob/master/doc/production-hosting.md#monitoring
I would like to propose a monitoring solution for Shields based on Prometheus and Grafana.
Prometheus is an open-source monitoring and alerting system (overview). This page compares Prometheus with alternatives.
How does Prometheus work? An application exposes its metrics as plain text via HTTP. Prometheus periodically pulls these metrics and lets us display graphs based on them. This differs from Graphite, where the application pushes data to Graphite.
Grafana can be configured to use Prometheus as a data source for graphs.
https://github.com/siimon/prom-client is a good Node.js client we can use. It collects the recommended default metrics and can be used to prepare custom metrics.
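(For example, a minimal Express setup - a sketch only, the actual Shields server wiring may differ:)

```
const express = require('express')
const client = require('prom-client')

// Collect the recommended default Node.js metrics
// (event loop lag, heap usage, GC stats, ...).
client.collectDefaultMetrics()

const app = express()

// Expose all registered metrics in the Prometheus text format.
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType)
  res.send(await client.register.metrics())
})

app.listen(3000)
```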
What do we have now?
What do we have to do?
Should the `/metrics` resource be available for all or limited to some IPs/users?*

* I'm not a monitoring systems expert. I just have some experience with Grafana, Graphite and Prometheus :-).