[working] webinstall.dev was down due to the recent GitHub API outage #962

Open
D3f0 opened this issue Feb 6, 2025 · 2 comments

Comments


D3f0 commented Feb 6, 2025

curl -vv https://webinstall.dev

* ALPN: curl offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/ssl/cert.pem
*  CApath: none
* (304) (IN), TLS handshake, Server hello (2):
* (304) (IN), TLS handshake, Unknown (8):
* (304) (IN), TLS handshake, Certificate (11):
* (304) (IN), TLS handshake, CERT verify (15):
* (304) (IN), TLS handshake, Finished (20):
* (304) (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / AEAD-CHACHA20-POLY1305-SHA256 / [blank] / UNDEF
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=webinstall.dev
*  start date: Jan 13 16:43:45 2025 GMT
*  expire date: Apr 13 16:43:44 2025 GMT
*  subjectAltName: host "webinstall.dev" matched cert's "webinstall.dev"
*  issuer: C=US; O=Let's Encrypt; CN=E6
*  SSL certificate verify ok.
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://webinstall.dev/
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: webinstall.dev]
* [HTTP/2] [1] [:path: /]
* [HTTP/2] [1] [user-agent: curl/8.7.1]
* [HTTP/2] [1] [accept: */*]
> GET / HTTP/2
> Host: webinstall.dev
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/2 502
< server: Caddy
< content-length: 0
< date: Thu, 06 Feb 2025 12:40:40 GMT
<

Member

coolaj86 commented Feb 6, 2025

Confirmed. We're having a repeat of #874.

I'm investigating.

coolaj86 changed the title from "webinstall.dev down (502)" to "[fixed] webinstall.dev down (502) (GitHub Release APIs have failed)" on Feb 6, 2025
Member

coolaj86 commented Feb 6, 2025

Confirmed that this was due to some incomplete error handling on our part, which was triggered by an internal problem with GitHub's Release API:

First we were getting this:

<p><strong>We couldn't respond to your request in time.</strong></p>
<p>Sorry about that. Please try refreshing and contact us if the problem persists.</p>
<div id="suggestions">
  <a href="https://github.com/contact">Contact Support</a> &mdash;
  <a href="https://www.githubstatus.com">GitHub Status</a> &mdash;
  <a href="https://twitter.com/githubstatus">@githubstatus</a>
</div>

And afterwards simply:

Server Error

Admittedly, this should just return an error for the packages whose metadata relies on the GitHub Releases API - which is most of them (but not Node or Zig or Go or Rust or many other popular installers) - not take down the entire service.

However, there's a background task, which isn't part of the API start, that does not have the same error handling as the API routes. When it hits an error, the error bubbles up to an async task and then fails as an uncaught error.

Error: fetch was not ok
    at Fetcher.throwIfNotOk (/home/app/srv/webinstall.dev/installers/_common/fetcher.js:36:15)
    ...
    at async /home/app/srv/webinstall.dev/installers/_webi/builds-cacher.js:401:18

This was just an oversight in the design. When the restart occurs, the background task starts immediately and runs its first random update, which is 90% likely to use the GitHub Release API. That immediately causes the failure, which triggers another restart, and eventually the restart rate limit.

Mitigation

As an immediate fix, I've simply removed the restart limit.

I will also work on the real fix today, which is to make sure the background update task is wrapped with an error handler that simply logs the error and returns the best data from cache.
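As a rough illustration of that kind of wrapper (not the actual webinstall.dev code), here is a minimal sketch. The names `safeUpdate`, `updateFn`, and `cache` are hypothetical, and the cache is assumed to be a plain `Map` keyed by package name:

```javascript
// Hypothetical names throughout: `safeUpdate`, `updateFn`, and `cache` are
// illustrative only and are not taken from builds-cacher.js.

/**
 * Runs one background update. On failure, logs the error and falls back to
 * whatever is already cached, so the rejection never escapes the task.
 */
async function safeUpdate(name, updateFn, cache, log = console.error) {
  try {
    const fresh = await updateFn(name);
    cache.set(name, fresh);
    return fresh;
  } catch (err) {
    const reason = err && err.message ? err.message : String(err);
    log(`[background update] ${name} failed: ${reason}`);
    // Best available data: the last good value, or null if none exists yet.
    return cache.has(name) ? cache.get(name) : null;
  }
}
```

With a wrapper like this, a failing upstream fetch (for example one that throws "fetch was not ok") resolves to the previously cached entry instead of surfacing as an uncaught error in the background task.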

One concern I have is that repeatedly hitting the GitHub Release API in an error condition may trigger rate-limiting, which could cause updates to stop until a cool-off period has passed. However, I doubt that GitHub would count internal errors against the API limits.
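If repeated requests during an outage did turn out to be a problem, one common way to reduce them is to space out retries with a capped exponential backoff. This is only a sketch of that general technique, not something the project is known to use; `nextDelayMs` and its default values are hypothetical:

```javascript
// Hypothetical helper, not part of webinstall.dev. Returns how long to wait
// (in milliseconds) before the next attempt, given the number of
// consecutive failed updates so far.
function nextDelayMs(failures, baseMs = 1000, maxMs = 15 * 60 * 1000) {
  if (failures <= 0) return 0; // no failures yet: no extra delay
  // Double the delay for each additional failure, capped at maxMs.
  const delay = baseMs * 2 ** (failures - 1);
  return Math.min(delay, maxMs);
}
```

The failure counter would be reset to zero after any successful update, so normal operation is unaffected.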

coolaj86 changed the title from "[fixed] webinstall.dev down (502) (GitHub Release APIs have failed)" to "[working] webinstall.dev was down due to the recent GitHub API outage" on Feb 6, 2025