PaaS-friendly metrics #3946

Closed
paulmelnikow opened this issue Sep 3, 2019 · 13 comments · Fixed by #4874
Labels: blocker (PRs and epics which block other work), operations (Hosting, monitoring, and reliability for the production badge servers)

Comments

@paulmelnikow (Member) commented Sep 3, 2019

As discussed at #3874, we have a serious capacity problem that is having a negative impact on our users. I've proposed an experiment: run Shields for a day on four Heroku dynos and compare the performance. IMO the most useful metric would be onboard response time, and ideally we'd have a way to measure it both before and after.

If the Heroku experiment works well, it would be a convenient way to go forward as the deploy and scaling process is easy and transparent, and they have agreed to sponsor us at a level that I expect will fully cover the cost.

There's a challenge to overcome with using Prometheus on Heroku. In #3874 (comment) I mentioned:

2. Because the individual servers can't be reached externally, metrics will need to be generated on each server and sent to the metrics server. This SO post outlines two options: one using pushgateway, and the other scraping more frequently and including the $DYNO variable in the metrics.

My first suggestion would be to try pushgateway and see how well that works. What do you think about this option?

/cc @platan

(Related to previous effort at #1848)

@paulmelnikow added the operations and blocker labels on Sep 3, 2019
@paulmelnikow (Member Author)

Prometheus really seems to discourage this setup and as far as I can tell doesn't have a recipe for a PaaS like Heroku. In the readme they state they chose not to implement TTL because they consider it an anti-pattern. There is a fork we could consider which has implemented a TTL extension.

Alternatively we could solve this on the calling side. From the pushgateway docs:

The latter point is especially relevant when multiple instances of a job differentiate their metrics in the Pushgateway via an instance label or similar. Metrics for an instance will then remain in the Pushgateway even if the originating instance is renamed or removed. This is because the lifecycle of the Pushgateway as a metrics cache is fundamentally separate from the lifecycle of the processes that push metrics to it. Contrast this to Prometheus's usual pull-style monitoring: when an instance disappears (intentional or not), its metrics will automatically disappear along with it. When using the Pushgateway, this is not the case, and you would now have to delete any stale metrics manually or automate this lifecycle synchronization yourself.

Heroku has a DYNO env var: https://devcenter.heroku.com/articles/dynos#local-environment-variables

An option is to have each new dyno set a timeout on startup. After, say, 60 seconds, long enough for the old dyno to shut down, it could (see the sketch after this list):

  1. Delete what is currently on the push gateway for its own dyno ID.
  2. Start pushing new metrics to the push gateway using its dyno ID.
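
A rough sketch of that startup sequence, assuming prom-client's callback-style Pushgateway API and an illustrative PUSHGATEWAY_URL environment variable (the job name and intervals here are placeholders, not settled choices):

// Sketch: after a startup delay, clear this dyno's old group, then push on an interval.
const client = require('prom-client')

const gateway = new client.Pushgateway(process.env.PUSHGATEWAY_URL)
// Group metrics by dyno so each dyno only ever overwrites its own data.
const params = { jobName: 'shields', groupings: { instance: process.env.DYNO } }

setTimeout(() => {
  // 1. Delete whatever a previous dyno with this ID left behind.
  gateway.delete(params, err => {
    if (err) console.error('pushgateway delete failed', err)
    // 2. Push the default registry every 15 seconds from now on.
    setInterval(() => {
      gateway.pushAdd(params, err => {
        if (err) console.error('pushgateway push failed', err)
      })
    }, 15 * 1000)
  })
}, 60 * 1000) // wait ~60 seconds so the outgoing dyno has shut down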

If/when we scale down the number of dynos, we'll have stale stats for the unused dynos. For example, if we have web.1, web.2, web.3, and web.4 and then scale down to three dynos, web.4 should be cleared out.

So another option would be to do the deleting during a clean shutdown instead of during startup. However if the server crashes, there will be stale data until the next clean shutdown.
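
A minimal sketch of that clean-shutdown variant (same assumptions as above; Heroku sends SIGTERM when it retires a dyno):

const client = require('prom-client')
const gateway = new client.Pushgateway(process.env.PUSHGATEWAY_URL)

// Best effort: clear this dyno's group before the process exits.
process.on('SIGTERM', () => {
  gateway.delete(
    { jobName: 'shields', groupings: { instance: process.env.DYNO } },
    () => process.exit(0)
  )
})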

Finally, Heroku guarantees dynos will live for at most 24 hours, so there will never be any stale data that is more than one day old. Maybe we could live with stale results that are < 1 day old.

@calebcartwright (Member)

Finally, Heroku guarantees dynos will live for at most 24 hours

I didn't realize this. Does this mean that if we start with three dynos and don't scale out the number for 24 hours, that those three will be restarted at least once?

@calebcartwright (Member)

An option is to have each new dyno set a timeout on startup. After say 60 seconds, long enough for the old dyno to shut down

Are there any downsides to this, particularly on the Heroku side? For example would this be viable if we use Heroku's auto scaling feature?

@paulmelnikow (Member Author)

I didn't realize this. Does this mean that if we start with three dynos and don't scale out the number for 24 hours, that those three will be restarted at least once?

Sorry, I'm not quite following what you're asking 😀

An option is to have each new dyno set a timeout on startup. After say 60 seconds, long enough for the old dyno to shut down

Are there any downsides to this, particularly on the Heroku side? For example would this be viable if we use Heroku's auto scaling feature?

It's pretty much the same issue / same solution whether we scale manually or automatically. (Though if we scaled manually we could manually clear out the pushgateway.)

@calebcartwright (Member) commented Sep 3, 2019

Sorry, I'm not quite following what you're asking 😀

You mentioned that dynos will live at most 24 hours, and I'm just trying to understand what that means. I've worked with other PaaS platforms that automatically shut down/recycle their dyno equivalents every ~24 hours, and that statement about at most 24 hours has me wondering whether that's the case with Heroku too.

@paulmelnikow (Member Author)

When a dyno has been up for 24 hours, the dyno manager will proactively cycle it. A new dyno is brought up, and then once the new one is online, the old one is shut down. If we continually make deploys or config changes without a 24-hour gap, we'll never see any cycling. There's a bit more explanation about cycling here. Does that make sense?

If other systems do this proactive cycling I imagine it works similarly.

@calebcartwright (Member)

I was more afraid that it meant the behavior that applies to Free Dynos (https://www.heroku.com/dynos).

Sounds like it's just rehydrating the containers every 24 hours while maintaining the specified capacity throughout that rehydration process, so I think I've got it now, thanks!

@paulmelnikow (Member Author)

Ahh gotcha. No, the Standard-1X dynos definitely do not sleep!

@platan (Member) commented Sep 5, 2019

In my opinion we can try pushgateway. In that case we would:

  • push metrics from the app to pushgateway using https://github.com/siimon/prom-client#pushgateway, e.g. every 15 seconds
  • pull metrics from pushgateway into Prometheus every 15 seconds
  • delete old metrics from pushgateway
    • delete all metrics every 5 minutes - this is simple to implement, but in the worst case it can remove data before Prometheus has pulled it, once every 5 minutes
    • or delete data from closed dynos only - we can get the list of active dynos via the Heroku REST API and remove data for inactive dynos (sketched after this list)
  • use $HEROKU_DYNO_ID instead of $DYNO
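
A rough sketch of that dyno-aware cleanup, assuming node-fetch, illustrative HEROKU_APP / HEROKU_API_TOKEN / PUSHGATEWAY_URL environment variables, and that we keep track elsewhere of which dyno names we have pushed groups for:

// Sketch: drop pushgateway groups for dynos Heroku no longer reports as running.
const fetch = require('node-fetch')

async function cleanUpStaleDynos(pushedDynoNames) {
  // Ask the Heroku Platform API which dynos are currently running.
  const res = await fetch(
    `https://api.heroku.com/apps/${process.env.HEROKU_APP}/dynos`,
    {
      headers: {
        Accept: 'application/vnd.heroku+json; version=3',
        Authorization: `Bearer ${process.env.HEROKU_API_TOKEN}`,
      },
    }
  )
  // (If grouping by $HEROKU_DYNO_ID instead of $DYNO, match on dyno.id here.)
  const active = new Set((await res.json()).map(dyno => dyno.name))

  for (const name of pushedDynoNames) {
    if (!active.has(name)) {
      // Deleting a group removes every metric pushed under that instance label.
      await fetch(
        `${process.env.PUSHGATEWAY_URL}/metrics/job/shields/instance/${name}`,
        { method: 'DELETE' }
      )
    }
  }
}

// e.g. run it on the same cadence as the purge discussed above:
// setInterval(() => cleanUpStaleDynos(trackedDynoNames), 5 * 60 * 1000)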

Since we have to push metrics to pushgateway, I wanted to suggest StatsD + Graphite, but it can be even harder to use than pushgateway (using StatsD with Heroku is described in the "What Is StatsD?" section).

Alternatively, we can scrape Shields hosted on Heroku quite often (e.g. every 5 seconds) and see whether that gives good results.

@paulmelnikow (Member Author)

Thanks for your thoughts on this!

Is this so metrics from two different instances never show up at once? I wonder if this might not be a problem, especially given the 15-second reporting interval and the 5-minute purge: the correct stats will eventually replace the incorrect ones (as the dynos don't overlap for very long), and any stats that aren't present on the new instance will get wiped when the purge happens.

Since we have to push metrics to pushgateway, I wanted to suggest StatsD + Graphite, but it can be even harder to use than pushgateway (using StatsD with Heroku is described in the "What Is StatsD?" section).

Collecting metrics via logplex seems like it would be very reliable and maybe the most cloud-native, though maybe a bit fiddly to get right. The StatsD approach seems a bit annoying, too.

I started setting up a pushgateway on Heroku here: https://github.com/badges/shields-pushgateway, though I wondered if it would be preferable to run it somewhere that has a persistent filesystem.

Also, any thoughts on how to secure the pushgateway? I read this: https://prometheus.io/docs/operating/security/#pushgateway

We could possibly put nginx in front of pushgateway and run basic auth in nginx, though I'm not sure whether prom-client's pushgateway support would work with it.

@platan (Member) commented Sep 6, 2019 via email

@platan (Member) commented Sep 6, 2019

prom-client uses Node's http (https://nodejs.org/api/http.html) and https (https://nodejs.org/api/https.html) modules, and we can pass credentials using:

let gateway = new client.Pushgateway('http://username:password@127.0.0.1:9091');
// or
let gateway = new client.Pushgateway('http://127.0.0.1:9091', {auth: 'username:password'});
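
Either form should also cover the nginx basic auth idea above, since prom-client passes these options straight to Node's http/https request (the auth field is what produces the Authorization header), though it's worth verifying against a protected gateway before relying on it.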

@platan (Member) commented Mar 3, 2020

I'm working on PaaS-friendly metrics and I'm going to create a pull request soon.
