Proposal: Move Shields to Heroku and inaugurate a new ops team #4929
Comments
I'm onboard with this 👍 2 questions:
Yea, that's a good question. It'd be nice to work out that number. When I asked for the donation I worked out a high estimate based on some number of Standard-1x dynos. Let me see if I can track it down. I definitely don't want to churn on hosting providers, but I think if cost were to become a deciding factor, we could move pretty easily to a technically similar cloud-based service like Google App Engine, AWS Elastic Beanstalk, or Azure App Service. Something based on Lambda, like Zeit Now, could also work, though it'd require some major rearchitecting to operate efficiently in Lambdas. Also if we had to pay for something like Zeit, given how many requests we can pack onto tiny servers, the cost per request would be higher.
Yea, taking location into account – that's what I meant to write. (Sorry, the way I phrased that was not at all clear!) The hope is that localizing the server handling the request is only a plus. Although we may have to experiment with that a bit, as our performance also depends on the regional performance characteristics of our upstream providers.
I don't think we need to; Heroku seems like the right choice. This is something we can potentially figure out later on too, post-migration. I'd just like to have a rough dollar figure for what it costs to run, even if sponsorships are covering things.
That's great news!
What exactly would this process of moving from OVH to Heroku look like? Do you want to move to Heroku gradually or switch all at once? Does Cloudflare allow controlling the amount of traffic sent to a specific domain (something better than round-robin)? With round-robin we can do it gradually by adding a new domain (e.g.
We have great (live) service tests which can be used to check whether a newly deployed version works properly (using https://github.com/badges/shields/blob/master/core/service-test-runner/cli.js#L16). We do not have to run all of them; maybe one or a few for every service will be enough.
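For illustration, here is a minimal post-deploy smoke check along those lines, written as a plain Node script. This is a sketch, not the actual service-test runner: the `DEPLOY_URL` variable and the sample badge paths are assumptions for illustration only.

```js
// Minimal post-deploy smoke check (a sketch, not the real service-test runner).
// The DEPLOY_URL env var and the sample badge paths are illustrative assumptions.
const https = require('https')

const BASE = process.env.DEPLOY_URL || 'https://img-test.shields.io'
const samplePaths = [
  '/badge/build-passing-brightgreen.svg', // static badge, no upstream call
  '/github/stars/badges/shields.svg', // exercises one upstream integration
]

function check(path) {
  return new Promise((resolve, reject) => {
    https
      .get(`${BASE}${path}`, res => {
        if (res.statusCode !== 200) {
          reject(new Error(`${path} -> HTTP ${res.statusCode}`))
        } else {
          res.resume() // drain the body; we only care about the status code
          resolve(path)
        }
      })
      .on('error', reject)
  })
}

Promise.all(samplePaths.map(check))
  .then(ok => console.log('Smoke check passed:', ok))
  .catch(err => {
    console.error('Smoke check failed:', err.message)
    process.exit(1)
  })
```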
The sooner we can make the transition the better IMO. Let me know what I can do to help.
Has anyone already reached out to @espadrine via mail to make sure he doesn't miss this?
👍
Not knowing for sure how many we'd need, I'd sent them what seems to be an extremely high estimate:
Currently we have $562 of platform credits, so if we decide to go this route, I will ask them to top it off.
Yes, I think gradually would be a good idea.
I think Cloudflare offers a lot of fancy ways to do load balancing; however, the round-robin is already set up and seems fine for the purpose of this transition. We can reconsider how load balancing works when adding the second region. Your suggestion is good: spin up a couple of dynos, add the Heroku app as a fourth server, and see how it all goes before we start removing anything.
Yep! I sent him an email when I posted this. If he responds to me directly I'll report back here.
Before we get too far along, it'd be great to get @espadrine's approval and find out how he wants to be involved in this. Though if we haven't heard from him for a few days, perhaps we should meet to continue the conversation.
@paulmelnikow Curious what you view as the difference in terms of #4878 for this. Is it just that Heroku will be more reliable and the current server system is not? I guess my question is, are we confident that this proposal will fix #4878? Nothing I saw in the proposal really discusses that or gives much insight into the problems mentioned in #4878. Would love to help support this in any way I can tho.
The hope is that Heroku has less downtime than OVH. But more importantly, with Heroku, we can add capacity with two clicks in the UI.
I was wrong about this. Since our round-robin DNS is using A records, we can't just drop Heroku into the rotation, since it needs to be used either as a CNAME or with a reverse proxy in front of it. Given the service is basically unusable for big chunks of our peak hours (#4878), I suggest as an experiment we move all the traffic over to Heroku and see how it works. I sent a second note to @espadrine proposing that. Meanwhile the service is online here: https://img-test.shields.io/badge/build-passing-brightgreen
I have good friends at Heroku who could potentially help us with costs if we asked. Let me know if you'd like me to.
I suggest starting cautiously and moving only a part of the traffic to Heroku. Creating a reverse proxy on a new VPS shouldn't be complicated or expensive.
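To make the suggestion concrete, here is a rough sketch of what such a transition-period reverse proxy could look like in Node, using the http-proxy package. The Heroku app hostname is an assumption, and this is only an illustration of the idea, not a production configuration.

```js
// Sketch of a transition-period reverse proxy on a small VPS.
// The target hostname is an assumed Heroku app name, not the real one.
const http = require('http')
const httpProxy = require('http-proxy')

// changeOrigin rewrites the Host header so Heroku's router can match the app.
const proxy = httpProxy.createProxyServer({
  target: 'https://shields-production.herokuapp.com', // assumed app name
  changeOrigin: true,
})

proxy.on('error', (err, req, res) => {
  // Return a 502 to the client if the upstream Heroku app is unreachable.
  res.writeHead(502, { 'Content-Type': 'text/plain' })
  res.end('Upstream error')
})

http
  .createServer((req, res) => proxy.web(req, res))
  .listen(process.env.PORT || 8080)
```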
@platan I fully disagree with this. Anything you add to the chain adds complexity and is one more thing to maintain and manage. It’s also one more thing that could break. It sounds to me like the goal throughout this entire proposal is to reduce complexity and streamline the process. Adding another reverse proxy or VPS seems directly against the goal of this proposal. It sounds like this project already uses Cloudflare. Why can’t we utilize that more instead of adding another server to manage?
@fishcharlie, if I'm understanding @platan's suggestion correctly, the reverse proxy VPS would only be useful during the transition period, to allow dual running and load balancing part of the traffic between the old servers and the new Heroku deployment. It's complexity we would only have in the short term. Cloudflare alone wouldn't allow that gradual transition; it would have to be clear-cut.
@PyvesB I don't believe this is correct. I'm not sure how this project is utilizing Cloudflare, but Cloudflare does offer a load balancing feature. This would allow you to set it up to manage traffic between multiple servers. This would reduce the complexity of having another layer of infrastructure to set up, manage, and maintain. Why add complexity when a service you are already using provides the functionality you need?
Well, I know very little about the Cloudflare product, but I was basing my previous message on @paulmelnikow's analysis: #4929 (comment)
@PyvesB Ahh got it, missed that somehow. Yeah, I'm still not in support of spinning up another system. Of course I won't be the one maintaining it, so it's not really up to me, and if someone is willing to take over that responsibility that is fine. I guess my point is that anything is better than what is going on right now. Yes, jumping all in on Heroku is a massive leap. But to me, the tradeoffs are worth it since this service is basically unusable for a large part of the day. Spinning up another service right now, when the team is struggling to maintain the current infrastructure, seems like an extremely bad idea. The goal is to isolate the issue and fix it. If the problem persists or gets worse under this new plan, the extra layer will make it harder to isolate the issue. Is it the load balancer that is having problems, or is it Heroku? That becomes a much more difficult question to answer once we've spun up more failure points. The whole goal here should be to reduce failure points, not increase them, even during a transition period. If that means jumping off the deep end and just going for it, I say that is a tradeoff worth making. It just requires the team to move quickly and ensure they have time blocked off and are willing to fix any problems that come up quickly. Sure, there might be downtime, but I think that is inevitable at this point no matter the path forward, and the service is already experiencing major outages and problems. It can't get much worse... I came to this thread via #4878, so to me this is an extremely high-priority and urgent issue. I don't think spending the time spinning up a VPS, managing it, and adding another failure point is worth it. This is an issue that I think needs to be solved quickly, and I don't think adding more infrastructure is worth the time. I know I might be alone in this, but I'd prefer the team to take a few risks here, at the cost of the service being less stable in the short term, to have a higher chance of stability in the long run.
You do have some good points. I would personally be in favour of moving to Heroku directly. As you said, it minimises effort and things can't get much worse.
I wanted to post a brief update on behalf of the ops team. The team, consisting of @chris48s @calebcartwright @PyvesB and @paulmelnikow, met yesterday and decided to move ahead with this experiment. Given Shields has been basically unusable for several hours a day, the team decided to make a wholesale transfer to Heroku rather than set up a proxy so the Heroku dynos could join the existing servers. This went live last night around 2:30 UTC with four dynos. Traffic started flowing to the new dynos in seconds. For the most part it’s been pretty stable. We did run into an issue we did not foresee: we generate a lot of traffic to services which have chosen to allow unlimited traffic to our server IP addresses, and on Heroku the outbound traffic comes from unpredictable IP addresses, so we will need to develop a solution to that. In the meantime we’ve worked around it by proxying the JSON result from the legacy servers. Heroku is designed to have smooth server transitions; however, we’re seeing lots of errors, dropped requests, and memory peaks at the time of server transition. These only last a few seconds (whereas the legacy deploy would disrupt the service for minutes). I am not completely sure, though I think this may be related to scout camp, which binds the port before the server is fully set up. (Heroku is designed to give apps a minute or so to set up, but they have to wait to bind the port until they’re ready.) (Thanks also to folks who have opined in this thread! We took your input into consideration.)
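For context on that last point, here is a minimal sketch of the startup ordering Heroku expects from a Node app: complete the slow asynchronous setup first, and only bind the port once the process can actually serve requests. The function names here are placeholders, not the real Shields/scout camp startup code.

```js
// A minimal sketch of deferring the port bind until setup has finished.
// registerServices() is a placeholder for whatever slow startup work exists.
const http = require('http')

async function registerServices() {
  // e.g. load service classes, compile routes, warm caches (placeholder)
}

async function main() {
  await registerServices() // do the slow work while the old dynos still serve traffic

  const server = http.createServer((req, res) => {
    res.writeHead(200, { 'Content-Type': 'text/plain' })
    res.end('ok')
  })

  // Binding the port is the signal to Heroku's router that this dyno is ready.
  server.listen(process.env.PORT || 3000, () => {
    console.log('Server ready, port bound')
  })
}

main().catch(err => {
  console.error(err)
  process.exit(1)
})
```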
First of all: HUGE thanks and a shoutout to @paulmelnikow @chris48s @calebcartwright and @PyvesB for making this happen!! Great work, everyone, on making this transition. I haven't seen any problems today, and even things that were slow before seem to be loading much faster. So this was 100% a positive move and a huge step in the right direction. Now on to some of the other things to continue to try to help support this process: @paulmelnikow You might have seen this, but in regards to static IPs for outbound traffic, it looks like Heroku offers Fixie as an add-on, which looks to support having all outbound traffic come from one IP address. The only downsides are cost and limits: it looks like they have bandwidth and request-count limits depending on how much you pay. Since @olivierlacan offered, it might be worth seeing if Heroku would be willing to offer any type of benefit, since this is an open source project. This would also likely require communicating with the services that whitelisted your IPs for unlimited traffic and notifying them of the new IP so they can whitelist it. Not sure how many services that includes, but that for sure looks like an option. Resources: I'd say the other option here is to transform the old service into a pure proxy. The downside to this is having to maintain more infrastructure. I haven't heard of too many cases of proxies getting overloaded with traffic, but as this project grows and scales, that could become a difficulty, and it would merely be kicking the can down the road. In terms of the startup flow and binding the port during startup, it looks like that PR and issue #4958 are good steps in that direction, and as far as I can see, it should be a good solution for fixing that issue. Overall this looks like a huge success with a few super minor pitfalls to work through to increase the stability moving forward. Huge shoutout to the entire team here!! Amazing work! 🎉
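As a rough illustration of how a Fixie-style add-on would look from the application side: route upstream requests through the proxy URL the add-on provides, so upstream services see one stable IP instead of an ephemeral dyno address. The FIXIE_URL variable name and the import style depend on the add-on and package versions, so treat this as a hedged sketch rather than a drop-in change.

```js
// Sketch: send upstream calls through a fixed-IP proxy when one is configured.
// FIXIE_URL is the env var this kind of add-on typically sets (an assumption here).
const fetch = require('node-fetch') // node-fetch v2-style API
const HttpsProxyAgent = require('https-proxy-agent') // v5-style default export

const proxyUrl = process.env.FIXIE_URL
const agent = proxyUrl ? new HttpsProxyAgent(proxyUrl) : undefined

async function fetchUpstream(url) {
  // With the agent set, the upstream service sees the proxy's static IP.
  const res = await fetch(url, { agent })
  if (!res.ok) throw new Error(`Upstream returned ${res.status}`)
  return res.json()
}

fetchUpstream('https://api.github.com/repos/badges/shields')
  .then(repo => console.log('stars:', repo.stargazers_count))
  .catch(err => console.error(err.message))
```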
I got a quick message back from @espadrine that he was okay with us moving forward with this, though he noted that a VPS would be more cost-effective. That's totally true. However, I think the benefits of a platform like Heroku outweigh the cost, particularly if we can continue to get services donated from Heroku. Thanks @olivierlacan for the offer to reach out to them! We are operating on a donation from them at the moment, though in a few weeks we need to ask them to renew it for the year to come. I'll let you know when we're ready to do that, so we're coming in from both angles. We appreciate the ops suggestions @fishcharlie! Let's continue the discussion about the proxy services at #4962, which was opened yesterday. The port-binding fix helped a lot but didn't completely solve it; let's continue that at #4958 (comment). We can always use enthusiastic help on this project. Most of us have been at it for a while 😉 so if you're looking for ways to help, please do continue to jump in! If you want to chat about ways to get more involved, feel free to ping me on Discord.
I've gotten a 👍 from @espadrine offline on giving @calebcartwright, @chris48s, and @PyvesB access to the production resources. Hearty congratulations and welcome! |
@paulmelnikow I'd be interested in reading a writeup about your experience with Heroku after a month or two of trying it out - what worked well, and what didn't. Have you considered "serverless" (Azure Functions or AWS Lambda) as well? Maybe Shields isn't built for it (as apps need to be entirely stateless in order to work well with serverless technologies), but it might be worth keeping in mind? I wonder if it's worth considering something like Dokku too - it's similar to Heroku and is Heroku-compatible, but self-hosted, so you'd still get some of the benefits of self-hosting plus some of the benefits of a PaaS platform. Some of the disadvantages still apply (like round-robin DNS not being ideal), but they could be fixable.
It's a cool idea! Thanks for suggesting it.
Yea, I got into a discussion with folks at Zeit / Vercel a year or so back about moving Shields to Now 2.0. Shields is entirely stateless, so that's not a problem. There are other issues, though. Currently the startup time is very high: you have to load the entire project to execute a single badge. That is fixable; however, it's a big refactor. (Not nearly as huge as the service refactor, though!) The bigger consideration is that Shields is extremely CPU- and memory-efficient, so a long-lived process is very cost-efficient. Even if we had to pay full price for Heroku at fixed scale, paying full price for serverless executions would cost more. (Yesterday, for example, we had 11.4 million uncached requests.)
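To sketch what that refactor could look like (purely illustrative - the module layout, route shape, and helper names are assumptions, not the real Shields code): load a badge service lazily the first time its route is hit, so a cold start only pays for the one module it actually needs.

```js
// Sketch of lazy-loading one badge service per route instead of the whole app.
// The ./services/<name> layout and handle() method are illustrative assumptions.
const express = require('express')

const app = express()
const serviceCache = new Map()

function loadService(name) {
  if (!serviceCache.has(name)) {
    // Only this one module (and its dependencies) gets loaded, which keeps a
    // cold start proportional to one badge rather than the entire project.
    // (A real implementation would validate `name` against a known list.)
    serviceCache.set(name, require(`./services/${name}`))
  }
  return serviceCache.get(name)
}

app.get('/:service/*', (req, res, next) => {
  try {
    const service = loadService(req.params.service)
    service.handle(req, res)
  } catch (err) {
    next(err)
  }
})

app.listen(process.env.PORT || 3000)
```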
I don't have personal experience with Dokku, though our legacy setup uses a homegrown post-receive hook which works in a similar way. This approach doesn't provide smooth failover, which might be solvable, and you also need to choose a solution for provisioning / deprovisioning. Millions of developers want Shields to be reliable and fast. If Heroku provides that, and is willing to keep sponsoring us, I'm really happy to use it. Maintaining Shields is a ton of work, and personally I would rather work on making the codebase friendly, engaging contributors, improving the quite dated frontend, and making the service awesome, than on running servers. We'll have to see how it goes!
Sounds good to me! Thanks for continuing to keep things running 😃
The Ops team has unanimously agreed that our Heroku experiment has been a success, and we have decided that we will continue to host Shields in our Heroku environment going forward. As Paul noted above, there were a few minor issues we initially encountered and resolved, and a couple that we'll need to revisit with a longer-term solution (#4962). However, the new Heroku hosting environment has been a big improvement overall, and it has ameliorated the performance and reliability issues we were having with our previous hosting environment. It also provides us with greater agility and allows us to more easily scale dynamically to handle the badge server loads. We'll continue to monitor the new environment and address the outstanding items to ensure Shields.io remains available and performant for the community. Thanks to everyone involved!
Our status page is now targeting the new Heroku hosting environment as well!
@pke - let's move that discussion to #5131 - it's unrelated. I'm also going to close this down now that the migration is done and we're happy with the new environment and processes. Further ops work is covered in separate issues: https://github.com/badges/shields/issues?q=is%3Aissue+is%3Aopen+label%3Aoperations
From time to time I’ve proposed some changes to our hosting, in particular migrating to a PaaS.
Our current hosting setup has a couple of advantages:
But it also has some significant disadvantages:
I’d like to propose moving the production servers to two Heroku clusters, one in the United States and a second in Europe. We’d configure Cloudflare geographic load balancing to direct traffic to the EU servers in preference to the US servers. Initially these would both use Standard-1X dynos, scaled manually, though we can consider adding auto-scaling in the future. The dynos would be sponsored by Heroku, so there would be no cost to the Shields community. We have a platform credit on file, though we’ll probably need to renew it for the upcoming year. We will also save the (very small) amount of money we currently pay to host our existing servers.
From the core maintainer team, Shields will create a four-member ops team. Each member of this group will be responsible for developing an understanding of the whole system, considering the security implications of their actions and our privacy commitment before taking any action, and conferring with the rest of the core team whenever needed. They will be responsible for investigating operations issues when they occur.
Access to the production Heroku apps will be given to the four members of this team. This is in keeping with the principle of “least privilege,” as access to the Heroku apps includes access to all the production secrets.
Deploys will be handled by any of these four active maintainers, after basic checks that the staging site is working. However, it’s possible that in the future other maintainers could be given access to deploy, or that deploys could be done automatically.
I’d propose @chris48s, @calebcartwright, and @PyvesB join me as the four initial members of this group. If @espadrine wants to, he could join as a fifth/backup person. If not, the rest of us could consult with him as needed and give him our sincere thanks for the massive responsibility he’s carried over the years. As part of formalizing #2577, we could revisit the ops team’s membership, and how this responsibility is rotated over time.
There are many advantages to this approach:
It has some disadvantages, too:
While some basic metrics like response times and resource utilization are provided by the Heroku platform, the metrics we’ve been providing with Prometheus are not designed to be PaaS-friendly. We may need to do some work on our own to get Prometheus working with Heroku. (PaaS-friendly metrics #3946) (This issue was solved this weekend. Thanks @platan! 🎉)
Clearing this bottleneck would be a very, very good thing for this project’s health and sustainability.
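For reference, the conventional way to expose Prometheus metrics from a Node process looks roughly like the sketch below, using the prom-client package. This is only the standard pull-based setup, not necessarily what #3946 shipped; a PaaS environment may additionally need a push-based or aggregation approach, since dynos come and go behind a router.

```js
// Minimal sketch of a /metrics endpoint with prom-client (standard pull setup).
const http = require('http')
const client = require('prom-client')

client.collectDefaultMetrics() // process CPU, memory, event-loop lag, etc.

const server = http.createServer(async (req, res) => {
  if (req.url === '/metrics') {
    // In recent prom-client versions, metrics() returns a promise.
    const body = await client.register.metrics()
    res.writeHead(200, { 'Content-Type': client.register.contentType })
    res.end(body)
  } else {
    res.writeHead(404)
    res.end()
  }
})

server.listen(process.env.PORT || 3000)
```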
Thoughts? Concerns? Questions? Especially looking for input from maintainers here, though comments are welcome from the community, too.