Move ever-growing *.spec.whatwg.org storage off of the VM disk #107

Open
foolip opened this issue Nov 29, 2019 · 17 comments

@foolip
Member

foolip commented Nov 29, 2019

This week marquee, which hosts all static whatwg.org sites, grew its disk usage past 80% of its 30GB and triggered an alert. I've increased the size to 50GB for now.

The constant increase is because of commit snapshots. We could compress on disk or deduplicate more, but it would still slowly grow, indefinitely. We shouldn't store these files on a fixed-size block device, but in an object store where there is no fixed upper limit.

DigitalOcean Spaces is a solution we could use, by letting nginx forward requests to it.

However, by still having all requests hit nginx we wouldn't be making full use of a solution like this. Spaces has a CDN feature with certificate handling, but it requires control over the DNS and is thus blocked by #75.

@annevk
Member

annevk commented Nov 29, 2019

To clarify, request forwarding is a backend matter and does not involve redirects?

@foolip
Member Author

foolip commented Nov 30, 2019

DigitalOcean Spaces doesn't support serving a website from it directly, but this is tracked in https://ideas.digitalocean.com/ideas/DO-I-318.

The smallest change that would work is to let nginx continue to handle redirects, and to proxy requests that don't redirect to an internal Spaces endpoint. Spaces wouldn't itself ever respond with a redirect, at least not until https://ideas.digitalocean.com/ideas/DO-I-318 is fixed.

For all of the static sites, I think our requirements are:

  • many redirect rules with varying 301/302
  • control over content-type headers beyond what's inferred by file extensions
  • adding a bunch of headers like HSTS

@foolip
Member Author

foolip commented Nov 30, 2019

@annevk
Member

annevk commented Dec 2, 2019

Sorry, to restate my question, will our end-user-visible response URLs remain unchanged?

@foolip
Member Author

foolip commented Dec 13, 2019

Yes, of course, any solution that doesn't give full control of the URL layout I'd just rule out :)

@foolip
Member Author

foolip commented Mar 12, 2020

Numbers in whatwg/meta#161 (comment) suggest that everything would easily fit in a Git repo, but you can't serve a website from a repo so that doesn't solve everything here.

@foolip
Member Author

foolip commented Oct 6, 2020

Hijacking this issue to drop some notes about using a CDN, which isn't the same problem as running out of disk space...

Some numbers based on using goaccess to analyze /var/log/nginx/access.log.{2,3,4}.gz, which seems to be about a day's worth of requests. With all hosts mixed together, we get 872.72 GiB of requests for /. Filtering down to just html.spec.whatwg.org, it's 721.76 GiB. So most of our traffic is serving https://html.spec.whatwg.org/. That's what I would have expected. If we were to use a CDN, we should do it for https://html.spec.whatwg.org/ first and see what that does for us.

I'm not sure about our numbers. I'm pretty sure they're the compressed size, but we're not using 30*872 GiB ~= 26 TiB of transfer per month, more like 4-5 TiB. So this analysis is probably all wrong :)
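The mismatch described above can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in this comment; the 30-day month is an assumption, and the variable names are illustrative.

```python
# Sanity check of the transfer estimate, using the figures quoted above.
GIB_PER_TIB = 1024

daily_gib = 872.72       # all hosts, from the goaccess run
html_spec_gib = 721.76   # html.spec.whatwg.org only
days_per_month = 30      # assumption: a 30-day month

# Projected monthly transfer if the logs really covered one day.
monthly_tib = daily_gib * days_per_month / GIB_PER_TIB

# Share of the measured traffic that goes to html.spec.whatwg.org.
html_share = html_spec_gib / daily_gib

print(f"projected monthly transfer: {monthly_tib:.1f} TiB")  # 25.6 TiB
print(f"html.spec.whatwg.org share: {html_share:.0%}")       # 83%
```

The projection is several times the observed 4-5 TiB, which is consistent with the suspicion that either the logs cover more than a day or the reported sizes are not what is actually transferred.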

@foolip
Member Author

foolip commented Jan 26, 2021

It looks like https://www.digitalocean.com/products/app-platform/ could be something to look into for this. From a cursory view, it seems more like AppEngine, in that it supports Node.js and other languages, static content, and you don't manage the servers yourself.

@foolip
Member Author

foolip commented Mar 25, 2021

I have looked into using DigitalOcean Spaces with nginx in front, using proxy_pass to forward requests. This would allow us to keep all the redirects, which is nice.

The main problem this runs into is that an S3-like storage bucket is just a set of named objects whose names are paths; it's not a file system. The following can't be done in the usual way and need some other solution:

  • redirecting "directories" like /validator to /validator/, but not /faq (exists) to /faq/, and preferably not /doesnotexist to /doesnotexist/
  • serving /validator/ from /validator/index.html in the bucket (without redirecting)
  • file listings, which we currently use fancyindex for

I think that if the first problem could be solved, then the second can be done with a location directive handling anything with a trailing slash, and we could generate static directory listings where we want them.
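To make the first two bullets concrete, here is a minimal sketch of the decision a front end would have to make when the bucket is only a flat set of keys. It is not nginx configuration and not how any existing deployment works; `resolve`, `Decision`, and the example keys are hypothetical, and the 301 status simply mirrors the redirect behavior wanted for /validator.

```python
from typing import NamedTuple, Optional


class Decision(NamedTuple):
    status: int
    key: Optional[str] = None       # bucket object to serve, if any
    location: Optional[str] = None  # redirect target, if any


def resolve(path: str, keys: set[str]) -> Decision:
    """Decide how to answer `path` given the bucket's flat key set."""
    key = path.lstrip("/")
    if path.endswith("/"):
        # "Directory" request: serve its index.html without redirecting.
        index = key + "index.html"
        return Decision(200, key=index) if index in keys else Decision(404)
    if key in keys:
        # A real object such as /faq is served as-is, with no added slash.
        return Decision(200, key=key)
    if key + "/index.html" in keys:
        # Only paths that are known "directories" get the trailing slash.
        return Decision(301, location=path + "/")
    return Decision(404)


# Hypothetical bucket contents for illustration.
keys = {"index.html", "faq", "validator/index.html"}
print(resolve("/validator", keys))     # redirect to /validator/
print(resolve("/validator/", keys))    # serve validator/index.html
print(resolve("/faq", keys))           # serve faq
print(resolve("/doesnotexist", keys))  # 404, no redirect
```

The catch is the `in keys` checks: they need either an extra lookup against the bucket per request or a key manifest produced at deploy time, which is exactly the part that has no usual solution.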

@domenic
Member

domenic commented Mar 25, 2021

It looks like DigitalOcean Spaces is maybe particularly bad at this: S3 has a whole "website hosting mode", see e.g. their docs on index.html files. Whereas https://www.digitalocean.com/community/questions/spaces-set-index-html-as-default-landing-page seems to have seen no activity. Maybe using S3 (which we already do for PR preview) would be the right way to go here?

@foolip
Member Author

foolip commented Mar 25, 2021

Hmm, I hadn't considered just using AWS S3, but that would probably solve most of this. What's not great about it is that we'd depend on both DigitalOcean and S3 being healthy at all times.

What mystifies me is that neither S3 nor Spaces seems to have a way to set a Location header for a specific object, even though both can customize Content-Type and friends. If that were possible, this would be easy enough in Spaces too.

@domenic
Member

domenic commented Mar 26, 2021

S3 has a complicated system: https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-page-redirect.html . It is a bit mystifying why they don't allow something simpler. E.g. the most flexible option, the JSON rules, is capped at 50. And the per-object redirect doesn't seem to let you choose the status codes.

@domenic
Member

domenic commented Mar 26, 2021

Probably a bad idea to diversify even further, but there's also Netlify which has very straightforward _redirects and _headers files. I can't tell if they're really meant to scale in the same way as S3, but they seem serious...

@foolip
Member Author

foolip commented Mar 26, 2021

If we could put objects in the bucket which the nginx front end turns into a redirect to add a slash, then I think we'd be set. (We'd also need to generate file listings but that could be a deploy step, not too hard I think.)
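As a rough illustration of the "generate file listings as a deploy step" idea, the sketch below builds a static listing page from a flat set of bucket keys. The function names, the HTML shape, and the example keys (including the `commit-snapshots/` prefix) are assumptions for illustration, not the current fancyindex output.

```python
import html
from urllib.parse import quote


def listing_entries(keys: set[str], directory: str) -> list[str]:
    """Return the immediate children of `directory` ("" for the root, else "a/b/")."""
    children = set()
    for key in keys:
        if not key.startswith(directory):
            continue
        rest = key[len(directory):]
        if not rest:
            continue
        name, sep, _ = rest.partition("/")
        children.add(name + sep)  # subdirectories keep their trailing slash
    return sorted(children)


def render_listing(keys: set[str], directory: str) -> str:
    """Render a minimal static index page to upload as `directory` + index.html."""
    items = "\n".join(
        f'<li><a href="{html.escape(quote(name), quote=True)}">{html.escape(name)}</a></li>'
        for name in listing_entries(keys, directory)
    )
    title = html.escape("/" + directory)
    return f"<!doctype html>\n<title>Index of {title}</title>\n<ul>\n{items}\n</ul>\n"


# Hypothetical keys for illustration.
keys = {
    "index.html",
    "commit-snapshots/abc/index.html",
    "commit-snapshots/def/index.html",
    "commit-snapshots/readme.txt",
}
print(render_listing(keys, "commit-snapshots/"))
```

A deploy would run this only for the prefixes where a listing is wanted and upload the result alongside the other objects.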

@domenic do you know if S3 when hosting a static web site will redirect "directories" with no trailing slash to add a slash?

One option we could look into is "deprecating" URLs with a trailing slash and writing redirect rules for the ones we currently have. But I don't love having to muck around with our URLs because we're changing the storage solution.

@domenic
Member

domenic commented Mar 29, 2021

Do you know if S3 when hosting a static web site will redirect "directories" with no trailing slash to add a slash?

From https://docs.aws.amazon.com/AmazonS3/latest/userguide/IndexDocumentSupport.html :

For example, the following URL, with a trailing slash, returns the photos/index.html index document.

http://bucket-name.s3-website.Region.amazonaws.com/photos/

However, if you exclude the trailing slash from the preceding URL, Amazon S3 first looks for an object photos in the bucket. If the photos object is not found, it searches for an index document, photos/index.html. If that document is found, Amazon S3 returns a 302 Found message and points to the photos/ key. For subsequent requests to photos/, Amazon S3 returns photos/index.html. If the index document is not found, Amazon S3 returns an error.

So, it sounds like it will 302 redirect them. That appears to be similar to what we have today (e.g. https://whatwg.org/validator currently 301 redirects to https://whatwg.org/validator/.)

@foolip foolip mentioned this issue Sep 1, 2021
19 tasks
@foolip
Member Author

foolip commented Sep 6, 2021

@foolip
Member Author

foolip commented Feb 16, 2024

I won't be able to make time for WHATWG infra work this year, so here's a brain dump.

The /var/www/html.spec.whatwg.org/ directory on marquee is 29 GB, and that's the biggest problem in any migration. As a Git repository it's 6 GB, so that rules out any solution of the shape "put everything in Git and deploy on every commit". That's unfortunate, because there are many options for that.

A solution would take the shape of a storage bucket that deploys write into, and a frontend/CDN that just serves from that bucket. The hard part of that is preserving all of our redirects, and I've seen no storage buckets which have built-in redirect support that's expressive enough. (S3 has some stuff, not enough.) We would need something like https://developers.cloudflare.com/rules/url-forwarding/bulk-redirects/reference/csv-file-format/ I think.
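A sketch of what keeping the redirects as data could look like: a list of rules with per-rule 301/302 status, serialized to a simple CSV. The three-column layout here (source, target, status) is an assumption loosely inspired by the bulk-redirect idea, not a faithful reproduction of the linked format, and the second rule is made up; only the /validator redirect comes from this thread.

```python
import csv
import io

# Hypothetical rule list; only the /validator rule is taken from the thread.
RULES = [
    ("https://whatwg.org/validator", "https://whatwg.org/validator/", 301),
    ("https://whatwg.org/old-page", "https://whatwg.org/new-page", 302),
]


def rules_to_csv(rules) -> str:
    """Serialize (source, target, status) rules to CSV, one rule per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for source, target, status in rules:
        if status not in (301, 302):
            raise ValueError(f"unexpected status {status} for {source}")
        writer.writerow([source, target, status])
    return buf.getvalue()


print(rules_to_csv(RULES), end="")
```

Keeping the rules in one reviewable file like this would also make it possible to test a migration by replaying every source URL against the new frontend and comparing status and Location.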

This problem ought to be easy for someone who has experience maintaining large websites and migrating between hosting... if they were meticulous about preserving redirects.

That's all.
