Site Partial Outage Summary (~5 hours) #385
Comments
We all make mistakes, and I've certainly been on the other side of deploys where I couldn't figure out what went wrong at all. Thanks so much for putting this post-mortem together so we can discuss this.
Can you elaborate on the "deploy directly" bit?
One of the things you mentioned here is that the site is an App Service, which for our case (a static site with no back-end) is more complicated and less efficient. We've wanted to change this, but never been able to because... well, the authentication/permissions issues you hit!! 😫😄 Generally the guidance I've heard from folks in Azure is to use the storage APIs, but the workflow sounds confusing there. We should figure out what an ideal workflow and hosting situation is.
Today, the deploy process is:
This would switch it so that it's only one thing:
This is much closer to how we do PR builds right now basically.
The idea here is that we can get logs in the CI, as well as see pass/fail without going into Azure. I'm hoping that the CI action will just zip the site up itself and overwrite the currently running app. It's possible we could hit that same bug, but we'd definitely know whether we got our code to Azure correctly. When the webhook was unreliable, you couldn't be sure without logging in.
Ace, yeah, I thought the site was set up this way all along! I was kinda surprised myself. I've set up a CDN on Azure with edge support before (which is what playground v2 uses), and it's likely that this setup should work for the TS site too. I'll need to do a bit of work (it's not quite Netlify/Now) but it should be achievable.
I've sent a PR to the gatsby plugin which would stop the potential issue.
OK, closing this out.
I have the staging deploys now directly deploying to Azure's blob storage as a static website |
Re: #378
A deploy of the TypeScript v2 website, with a somewhat novel redirects structure, meant that subsequent builds of the site were not deployed correctly.
Effectively some files on the site were the beta version of the website, and others were the older version of the website.
This meant many beta pages couldn't link to each other, and some links didn't work at all (like the playground, which relies on a lot of other files).
Timeline
Causes
We want to make sure that no links break in transitioning from v1 to v2. There are a lot of v1 links which aren't in active use but still redirect to existing pages, so the v2 site has a section which looks like this:

`setupRedirects` loops through the objects above it and tells Gatsby that these redirects exist. These redirects are then emitted to the file system as `/Playground/index.html`, which forwards you to `/play` via client-side JavaScript using the plugin `gatsby-plugin-client-side-redirect`.

During a deploy, CI pushes to either the `SITE-PRODUCTION` or `SITE-STAGING` branch and then sends a webhook for Azure to pick up and deploy the static HTML from those branches (similar to how GitHub Pages works).

The deployment script in Azure failed when a file transitioned from a path like `/docs/handbook/writing-declaration-files.html` to instead being a folder with an `index.html` (`/docs/handbook/writing-declaration-files/index.html`).

This meant every file alphabetically before that point in the deploy had successfully migrated, and everything from there on had not, leaving half the site on v1 and half on v2.
Resolution
Effectively we had been slow with setting up access to the Azure portal, which is where we would have been able to see build logs for deploys.
Deploys to the TypeScript site have been unpredictable on the Azure side for quite a while; normally you can send another build down the pipeline and it fixes itself on the next run. This meant that when a bad deploy happened, the first few answers were simply "let's send another build across", which is roughly a 30-minute process (~15m in CI, then ~5–30m in Azure) to verify.

After a few cases where "send another deploy" didn't work and gave baffling results (the v1 index page but the v2 playground), it started to look like getting access to the build logs was going to be the only answer.
@DanielRosenwasser asked for some help from someone who had been helping the TS team set up our Azure portal access (Thanks Antoni) to see if we could speed it up.
Once we had access to the build logs, it became quite obvious what the issue was:
From there the files were deleted via the Azure console, and a triggered redeploy successfully got through, restoring the site to v1.
Post-Mortem
It took us 9 months to get other members of the team access to the Azure portal, which, ironically, was supposed to happen earlier this morning but had to be re-scheduled (given how wild everything is with COVID-19).
I didn't push the deadlines hard enough because we weren't seeing any problems; having portal access to change settings and see build logs is a "nice to have" when you think you're working with static site hosting. It turns out the site isn't running on cloud storage but is an Azure App Service app, which means we own more of the hosting responsibilities than I had anticipated.
To my knowledge, this has been the first downtime since I've started working on the site - that sucks. Sorry folks.
Mitigation
I have a few direct TODOs to stop this happening again:

- Stop generating a `/index.html` redirect when the file is already a `*.html`
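The redirect TODO above could look something like this inside the plugin. This is a hypothetical sketch, not the actual `gatsby-plugin-client-side-redirect` code; `redirectOutputPath` is an invented name:

```typescript
// Hypothetical guard: only emit "<from>/index.html" when the redirect source
// is a clean URL, never when it is already a concrete *.html file.

function redirectOutputPath(from: string): string | undefined {
  if (from.endsWith(".html")) {
    // Turning an existing "/docs/foo.html" file into a folder is exactly
    // the file-to-directory transition that broke the incremental deploy.
    return undefined;
  }
  const clean = from.endsWith("/") ? from.slice(0, -1) : from;
  return `${clean}/index.html`;
}
```

So `/Playground` still gets its `/Playground/index.html` redirect file, while a source that is already an `.html` file is skipped rather than replaced by a directory.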
For the long term: