PaaS-friendly metrics #3946

Closed
paulmelnikow opened this issue Sep 3, 2019 · 13 comments · Fixed by #4874
Labels: blocker (PRs and epics which block other work), operations (Hosting, monitoring, and reliability for the production badge servers)

Comments

@paulmelnikow (Member) commented Sep 3, 2019

As discussed at #3874, we have a serious capacity problem that is having a negative impact on our users. I've proposed an experiment: run Shields for a day on four Heroku dynos and compare the performance. IMO the most useful metric would be onboard response time, and ideally we'd have a way to measure it both before and after.

If the Heroku experiment works well, it would be a convenient way to go forward as the deploy and scaling process is easy and transparent, and they have agreed to sponsor us at a level that I expect will fully cover the cost.

There's a challenge to overcome with using Prometheus on Heroku. In #3874 (comment) I mentioned:

2. Because the individual servers can't be reached externally, metrics will need to be generated on each server and sent to the metrics server. This SO post outlines two options: one using pushgateway, and the other scraping more frequently and including the $DYNO variable in the metrics.

My first suggestion would be to try pushgateway and see how well that works. What do you think about this option?

/cc @platan

(Related to previous effort at #1848)

@paulmelnikow added the operations and blocker labels on Sep 3, 2019
@paulmelnikow (Member Author)

Prometheus really seems to discourage this setup and as far as I can tell doesn't have a recipe for a PaaS like Heroku. In the readme they state they chose not to implement TTL because they consider it an anti-pattern. There is a fork we could consider which has implemented a TTL extension.

Alternatively we could solve this on the calling side. From the pushgateway docs:

The latter point is especially relevant when multiple instances of a job differentiate their metrics in the Pushgateway via an instance label or similar. Metrics for an instance will then remain in the Pushgateway even if the originating instance is renamed or removed. This is because the lifecycle of the Pushgateway as a metrics cache is fundamentally separate from the lifecycle of the processes that push metrics to it. Contrast this to Prometheus's usual pull-style monitoring: when an instance disappears (intentional or not), its metrics will automatically disappear along with it. When using the Pushgateway, this is not the case, and you would now have to delete any stale metrics manually or automate this lifecycle synchronization yourself.

Heroku has a DYNO env var: https://devcenter.heroku.com/articles/dynos#local-environment-variables

An option is to have each new dyno set a timeout on startup. After, say, 60 seconds, long enough for the old dyno to shut down, it could (see the sketch after this list):

  1. Delete what is currently on the push gateway for its own dyno ID.
  2. Start pushing new metrics to the push gateway using its dyno ID.
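
A rough sketch of that startup sequence, assuming prom-client's callback-style Pushgateway API and an illustrative PUSHGATEWAY_URL environment variable (the job name and intervals here are placeholders, not settled choices):

// Sketch: after a startup delay, clear this dyno's old group, then push on an interval.
const client = require('prom-client')

const gateway = new client.Pushgateway(process.env.PUSHGATEWAY_URL)
// Group metrics by dyno so each dyno only ever overwrites its own data.
const params = { jobName: 'shields', groupings: { instance: process.env.DYNO } }

setTimeout(() => {
  // 1. Delete whatever a previous dyno with this ID left behind.
  gateway.delete(params, err => {
    if (err) console.error('pushgateway delete failed', err)
    // 2. Push the default registry every 15 seconds from now on.
    setInterval(() => {
      gateway.pushAdd(params, err => {
        if (err) console.error('pushgateway push failed', err)
      })
    }, 15 * 1000)
  })
}, 60 * 1000) // wait ~60 seconds so the outgoing dyno has shut down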

If/when we scale down the number of dynos, we'll have stale stats for the unused dynos. For example, if we have web.1, web.2, web.3, and web.4 and then scale down to three dynos, web.4 should be cleared out.

So another option would be to do the deleting during a clean shutdown instead of during startup. However if the server crashes, there will be stale data until the next clean shutdown.
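
A minimal sketch of that clean-shutdown variant (same assumptions as above; Heroku sends SIGTERM when it retires a dyno):

const client = require('prom-client')
const gateway = new client.Pushgateway(process.env.PUSHGATEWAY_URL)

// Best effort: clear this dyno's group before the process exits.
process.on('SIGTERM', () => {
  gateway.delete(
    { jobName: 'shields', groupings: { instance: process.env.DYNO } },
    () => process.exit(0)
  )
})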

Finally, Heroku guarantees dynos will live for at most 24 hours, so there will never be any stale data that is more than one day old. Maybe we could live with stale results that are < 1 day old.

@calebcartwright (Member)

Finally, Heroku guarantees dynos will live for at most 24 hours

I didn't realize this. Does this mean that if we start with three dynos and don't scale out the number for 24 hours, that those three will be restarted at least once?

@calebcartwright (Member)

An option is to have each new dyno set a timeout on startup. After say 60 seconds, long enough for the old dyno to shut down

Are there any downsides to this, particularly on the Heroku side? For example would this be viable if we use Heroku's auto scaling feature?

@paulmelnikow (Member Author)

I didn't realize this. Does this mean that if we start with three dynos and don't scale out the number for 24 hours, that those three will be restarted at least once?

Sorry, I'm not quite following what you're asking 😀

An option is to have each new dyno set a timeout on startup. After say 60 seconds, long enough for the old dyno to shut down

Are there any downsides to this, particularly on the Heroku side? For example would this be viable if we use Heroku's auto scaling feature?

It's pretty much the same issue / same solution whether we scale manually or automatically. (Though if we scaled manually we could manually clear out the pushgateway.)

@calebcartwright (Member) commented Sep 3, 2019

Sorry, I'm not quite following what you're asking 😀

You mentioned that dynos will live at most 24 hours, and I'm just trying to understand what that means. I've worked with other PaaS platforms that automatically shut down/recycle their dyno equivalents every ~24 hours, and that statement about at most 24 hours has me wondering whether that's the case with Heroku too.

@paulmelnikow (Member Author)

When a dyno has been up for 24 hours, the dyno manager will proactively cycle it. A new dyno is brought up, and then once the new one is online, the old one is shut down. If we continually make deploys or config changes without a 24-hour gap, we'll never see any cycling. There's a bit more explanation about cycling here. Does that make sense?

If other systems do this proactive cycling I imagine it works similarly.

@calebcartwright (Member)

I was more afraid that it meant the behavior that applies to Free Dynos (https://www.heroku.com/dynos).

Sounds like it's just rehydrating the containers every 24 hours while maintaining the specified capacity throughout that rehydration process, so I think I've got it now, thanks!

@paulmelnikow (Member Author)

Ahh gotcha. No, the Standard-1X dynos definitely do not sleep!

@platan (Member) commented Sep 5, 2019

In my opinion we can try pushgateway. In that case we would:

  • push metrics from the app to pushgateway using https://github.com/siimon/prom-client#pushgateway, e.g. every 15 seconds
  • pull metrics from pushgateway into Prometheus every 15 seconds
  • delete old metrics from pushgateway
    • delete all metrics every 5 minutes - this is simple to implement, but in the worst case it can remove data before Prometheus has pulled it, once every 5 minutes
    • or delete data from closed dynos only - we can get the list of active dynos via the Heroku REST API and remove data for inactive dynos (sketched after this list)
  • use $HEROKU_DYNO_ID instead of $DYNO
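
A rough sketch of that dyno-aware cleanup, assuming node-fetch, illustrative HEROKU_APP / HEROKU_API_TOKEN / PUSHGATEWAY_URL environment variables, and that we keep track elsewhere of which dyno names we have pushed groups for:

// Sketch: drop pushgateway groups for dynos Heroku no longer reports as running.
const fetch = require('node-fetch')

async function cleanUpStaleDynos(pushedDynoNames) {
  // Ask the Heroku Platform API which dynos are currently running.
  const res = await fetch(
    `https://api.heroku.com/apps/${process.env.HEROKU_APP}/dynos`,
    {
      headers: {
        Accept: 'application/vnd.heroku+json; version=3',
        Authorization: `Bearer ${process.env.HEROKU_API_TOKEN}`,
      },
    }
  )
  // (If grouping by $HEROKU_DYNO_ID instead of $DYNO, match on dyno.id here.)
  const active = new Set((await res.json()).map(dyno => dyno.name))

  for (const name of pushedDynoNames) {
    if (!active.has(name)) {
      // Deleting a group removes every metric pushed under that instance label.
      await fetch(
        `${process.env.PUSHGATEWAY_URL}/metrics/job/shields/instance/${name}`,
        { method: 'DELETE' }
      )
    }
  }
}

// e.g. run it on the same cadence as the purge discussed above:
// setInterval(() => cleanUpStaleDynos(trackedDynoNames), 5 * 60 * 1000)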

Since we have to push metrics to pushgateway, I wanted to suggest StatsD + Graphite, but it can be even harder to use than pushgateway (using StatsD with Heroku is described in the "What Is StatsD?" section).

Alternatively, we can scrape Shields hosted on Heroku quite often (e.g. every 5 seconds) and see whether that gives good results.

@paulmelnikow (Member Author)

Thanks for your thoughts on this!

Is this so metrics from two different instances never show up at once? I wonder if this might not be a problem, especially given the 15-second reporting interval and the 5-minute purge: the correct stats will eventually replace the incorrect ones (as the dynos don't overlap for very long), and any stats that aren't present on the new instance will get wiped when the purge happens.

Since we have to push metrics to pushgateway, I wanted to suggest StatsD + Graphite, but it can be even harder to use than pushgateway (using StatsD with Heroku is described in the "What Is StatsD?" section).

Collecting metrics via logplex seems like it would be very reliable and maybe the most cloud-native, though maybe a bit fiddly to get right. The StatsD approach seems a bit annoying, too.

I started setting up a pushgateway on Heroku here: https://github.com/badges/shields-pushgateway, though I wondered if it would be preferable to run it somewhere that has a persistent filesystem.

Also, any thoughts on how to secure the pushgateway? I read this: https://prometheus.io/docs/operating/security/#pushgateway

We could possibly put nginx in front of pushgateway and run basic auth in nginx, though I'm not sure whether prom-client's pushgateway support would work with it.

@platan (Member) commented Sep 6, 2019 via email

@platan (Member) commented Sep 6, 2019

prom-client uses Node's http (https://nodejs.org/api/http.html) and https (https://nodejs.org/api/https.html) modules, and we can pass credentials using:

let gateway = new client.Pushgateway('http://username:password@127.0.0.1:9091');
// or
let gateway = new client.Pushgateway('http://127.0.0.1:9091', {auth: 'username:password'});
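
Either form should also cover the nginx basic auth idea above, since prom-client passes these options straight to Node's http/https request (the auth field is what produces the Authorization header), though it's worth verifying against a protected gateway before relying on it.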

@platan (Member) commented Mar 3, 2020

I'm working on PaaS-friendly metrics and I'm going to create a pull request soon.
