PaaS-friendly metrics #3946
Comments
Prometheus really seems to discourage this setup, and as far as I can tell there isn't a recipe for a PaaS like Heroku. In the pushgateway readme they state they chose not to implement a TTL for pushed metrics because they consider it an anti-pattern. There is a fork we could consider which has implemented a TTL extension. Alternatively, we could solve this on the calling side. From the pushgateway docs: […]

An option is to have each new dyno set a timeout on startup. After, say, 60 seconds (long enough for the old dyno to shut down), it could delete the previous dyno's metrics from the pushgateway.
If/when we scale down the number of dynos, we'll have stale stats for the unused dynos. For example, if we have web.1, web.2, web.3, and web.4 and then scale down to three dynos, web.4 should be cleared out. So another option would be to do the deleting during a clean shutdown instead of during startup; however, if the server crashes, there will be stale data until the next clean shutdown. Finally, Heroku guarantees dynos will live for at most 24 hours, so there will never be any stale data that is more than one day old. Maybe we could live with stale results that are less than 1 day old.
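A minimal sketch of that shutdown-time cleanup with prom-client (Heroku sends SIGTERM on a clean shutdown). The job name 'shields', the 'instance' grouping label, and the gateway URL are illustrative assumptions, not actual project configuration, and the callback-style Pushgateway API is assumed:

```js
// Sketch only: on a clean shutdown (Heroku sends SIGTERM), delete this
// dyno's metric group from the pushgateway so no stale series linger.
// The job name, grouping label, and gateway URL are illustrative
// assumptions; prom-client's callback-style Pushgateway API is assumed.
const client = require('prom-client')

const gateway = new client.Pushgateway('http://pushgateway.example.com:9091')

process.on('SIGTERM', () => {
  // Deletes every series previously pushed under this job + grouping.
  gateway.delete(
    { jobName: 'shields', groupings: { instance: process.env.DYNO } },
    err => {
      if (err) console.error('Failed to clear pushgateway group:', err)
      process.exit(0)
    }
  )
})
```

This covers the clean-shutdown path only; a crash would still leave stale data until the next clean shutdown, as noted above.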
I didn't realize this. Does this mean that if we start with three dynos and don't scale the number for 24 hours, those three will each be restarted at least once?
Are there any downsides to this, particularly on the Heroku side? For example, would this be viable if we use Heroku's auto-scaling feature?
Sorry, I'm not quite following what you're asking 😀
It's pretty much the same issue / same solution whether we scale manually or automatically. (Though if we scaled manually, we could manually clear out the pushgateway.)
You mentioned that […]
When a dyno has been up for 24 hours, the dyno manager will proactively cycle it. A new dyno is brought up, and then once the new one is online, the old one is shut down. If we continually make deploys or config changes without a 24-hour gap, we'll never see any cycling. There's a bit more explanation about cycling here. Does that make sense? If other systems do this proactive cycling I imagine it works similarly.
I was afraid that the sleeping behavior that applies to Free Dynos (https://www.heroku.com/dynos) was what that meant. Sounds like it's just rehydrating the containers every 24 hours while maintaining the specified capacity throughout that rehydration process, so I think I've got it now, thanks!
Ahh gotcha. No, the Standard-1X dynos definitely do not sleep!
In my opinion we can try pushgateway. In this case we:
- instead of using $DYNO we can take $HEROKU_DYNO_ID (https://devcenter.heroku.com/articles/dyno-metadata#attributes), as in the sketch after this comment

Since we have to push metrics to the pushgateway, I wanted to suggest StatsD + Graphite, but it can be even harder to use than pushgateway (using StatsD with Heroku is described in the "What Is StatsD?" section of https://blog.appoptics.com/three-ways-to-aggregate-metrics-on-heroku/). Alternatively we can pull Shields hosted on Heroku quite often (e.g. every 5 seconds) and we'll see if this gives good results.
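A minimal sketch of what that push could look like with prom-client, assuming the dyno metadata labs feature exposes HEROKU_DYNO_ID. The job name 'shields', the 15-second interval, and the PUSHGATEWAY_URL variable are illustrative assumptions:

```js
// Sketch only: push the default metrics every 15 seconds, grouped by the
// stable HEROKU_DYNO_ID (dyno metadata labs feature) instead of $DYNO.
// Job name, interval, and PUSHGATEWAY_URL are assumptions for illustration.
const client = require('prom-client')

client.collectDefaultMetrics()

const gateway = new client.Pushgateway(process.env.PUSHGATEWAY_URL)

setInterval(() => {
  gateway.pushAdd(
    { jobName: 'shields', groupings: { instance: process.env.HEROKU_DYNO_ID } },
    err => {
      if (err) console.error('Push to pushgateway failed:', err)
    }
  )
}, 15 * 1000)
```

Using the dyno id as the grouping label means each instance overwrites only its own group on the pushgateway, which is what makes the startup/shutdown cleanup discussed earlier workable.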
Thanks for your thoughts on this!

> instead of using $DYNO we can take $HEROKU_DYNO_ID

Is this so metrics from two different instances are never showing up at once? I wonder if this might not be a problem, especially given the 15-second reporting interval and the 5-minute purge. The correct stats will eventually replace the incorrect ones (as the dynos don't overlap for very long), and the stats that aren't present on the new instance will get wiped when the purge happens.

Collecting metrics via logplex seems like it would be very reliable and maybe the most cloud-native, though maybe a bit fiddly to get right. The StatsD approach seems a bit annoying, too.

I started setting up a pushgateway on Heroku here: https://github.com/badges/shields-pushgateway, though I wondered if it would be preferable to run it somewhere that has a persistent filesystem.

Also, any thoughts on how to secure the pushgateway? I read this: https://prometheus.io/docs/operating/security/#pushgateway. We could possibly put nginx in front of the pushgateway and run basic auth in nginx, though I'm not sure whether prom-client's pushgateway support would work with it.
I can easily add pushgateway to metrics.shields.io. There is already nginx there - we can use it to secure the pushgateway. We only have to check if prom-client supports basic auth via https://username:password@host.
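For reference, a minimal sketch of what that nginx front could look like. The server name, TLS file paths, htpasswd path, and upstream port are assumptions, not the actual metrics.shields.io configuration:

```nginx
# Sketch only: require basic auth for everything proxied to the
# pushgateway. Server name, certificate paths, htpasswd path, and
# upstream address are illustrative assumptions.
server {
    listen 443 ssl;
    server_name pushgateway.example.com;

    ssl_certificate     /etc/nginx/tls/pushgateway.crt;  # assumption
    ssl_certificate_key /etc/nginx/tls/pushgateway.key;  # assumption

    location / {
        auth_basic           "pushgateway";
        auth_basic_user_file /etc/nginx/.htpasswd;  # created with htpasswd(1)
        proxy_pass           http://127.0.0.1:9091;
    }
}
```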
prom-client uses Node's http (https://nodejs.org/api/http.html) and https (https://nodejs.org/api/https.html) modules, and we can pass credentials using:

```js
let gateway = new client.Pushgateway('http://username:password@127.0.0.1:9091');
// or
let gateway = new client.Pushgateway('http://127.0.0.1:9091', { auth: 'username:password' });
```
I'm working on PaaS-friendly metrics and I'm going to create a pull request soon.
As discussed at #3874, we're having a serious capacity problem which is having a negative impact on our users. I've proposed an experiment: run Shields for a day on four Heroku dynos and compare the performance. IMO the most useful metric would be onboard response time, and ideally we'd have a way to measure it both before and after.
If the Heroku experiment works well, it would be a convenient way to go forward as the deploy and scaling process is easy and transparent, and they have agreed to sponsor us at a level that I expect will fully cover the cost.
There's a challenge to overcome with using Prometheus on Heroku. In #3874 (comment) I mentioned: […]
I think my first suggestion would be to try to use pushgateway and see how well that works. What do you think about this option?
/cc @platan
(Related to previous effort at #1848)