
Github badges are intermittently inaccessible #1245

Closed
paulmelnikow opened this issue Nov 2, 2017 · 24 comments
Labels
bug (Bugs in badges and the frontend), operations (Hosting, monitoring, and reliability for the production badge servers)

Comments

@paulmelnikow
Member

paulmelnikow commented Nov 2, 2017

I'm not sure whether this is due to one of the recent changes…

#1142 #1117 #1195 #1186 #1118

…or simply #1119, which is a bug that causes a token to erroneously be considered exhausted once it's used for a search request.

People can't add new tokens either (#1243), exacerbating this slightly, but that will be fixed in #1038.

The first report was roughly 16 hours after deploy.

@paulmelnikow added the operations (Hosting, monitoring, and reliability for the production badge servers) label Nov 2, 2017
@paulmelnikow
Member Author

cc @espadrine

@paulmelnikow
Member Author

paulmelnikow commented Nov 2, 2017

The badges are working again, and I think our "main" rate limit just reset:

core:
  remaining: 12489 of 12500
  reset: in an hour
search:
  remaining: 30 of 30
  reset: in a minute
graphql:
  remaining: 5000 of 5000
  reset: in an hour

Generated with https://github.com/paulmelnikow/github-limited
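
For anyone reproducing this check locally, the same numbers come from GitHub's /rate_limit endpoint. A minimal sketch in Node (GITHUB_TOKEN is an assumed environment variable; without it the response reflects the anonymous quota):

```js
// Minimal sketch: query GitHub's /rate_limit endpoint directly.
// GITHUB_TOKEN is an assumed environment variable; without it the
// response reflects the much smaller anonymous quota.
const https = require('https');

const headers = { 'User-Agent': 'rate-limit-check' };
if (process.env.GITHUB_TOKEN) {
  headers.Authorization = `token ${process.env.GITHUB_TOKEN}`;
}

https.get({ hostname: 'api.github.com', path: '/rate_limit', headers }, res => {
  let body = '';
  res.on('data', chunk => { body += chunk; });
  res.on('end', () => {
    const { resources } = JSON.parse(body);
    ['core', 'search', 'graphql'].forEach(name => {
      const r = resources[name];
      if (!r) return;
      const resetAt = new Date(r.reset * 1000).toISOString();
      console.log(`${name}: ${r.remaining} of ${r.limit}, resets at ${resetAt}`);
    });
  });
}).on('error', err => console.error(err));
```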

@paulmelnikow
Member Author

Now intermittently broken, though plenty of rate limit left.

core:
  remaining: 12479 of 12500
  reset: in 18 minutes
search:
  remaining: 30 of 30
  reset: in a minute
graphql:
  remaining: 5000 of 5000
  reset: in an hour

Here's an example: https://img.shields.io/github/tag/expressjs/express.svg

@paulmelnikow changed the title from "Github badges are all inaccessible" to "Github badges are intermittently inaccessible" Nov 2, 2017
@paulmelnikow
Member Author

I sent this to @espadrine about an hour ago:

The Github badges are failing intermittently. Are you seeing crashes on the server?

It might be an old bug related to our handling of quotas for the Github search API, though the timing of the incident makes me suspect recent changes. I've made some recent changes to the github auth, but nothing jumps out at me from reading them.

It's difficult to debug without server access. I'm thinking I should add an endpoint to get all the user tokens, or else hashed user tokens with stats. That way I could troubleshoot a bit better locally.

Is there currently any backup of the user tokens, apart from the other servers?

I do have some new github token code to fix the search API quota issue, though it's a rewrite and I'd like to test it more first. Before merging I also want to add some optional trace logging we can turn on in cases like this.

I feel like I need deploy access, logs, and a way to restore the token file in order to deploy that with confidence that I can find and fix whatever might be wrong with it.

Any thoughts on what could be causing the ssh issue?

I like getting to the bottom of things and want to fix this, but my options are limited.

@paulmelnikow
Member Author

paulmelnikow commented Nov 3, 2017

I set up a status page:

https://status.shields-server.com/

It runs a static badge, the Github license badge, and the npm license badge, and for each one checks for some of the expected text.
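
For reference, each check is essentially "fetch the badge, look for a string". A rough sketch of that kind of probe in Node; the URLs and expected strings below are illustrative, not the exact ones the status page uses:

```js
// Illustrative probe, not the status page's actual configuration:
// fetch a badge SVG and verify it contains the text we expect.
const https = require('https');

function checkBadge(url, expectedText) {
  https.get(url, res => {
    let body = '';
    res.on('data', chunk => { body += chunk; });
    res.on('end', () => {
      const up = res.statusCode === 200 && body.includes(expectedText);
      console.log(`${url}: ${up ? 'up' : 'DOWN'}`);
    });
  }).on('error', () => console.log(`${url}: DOWN`));
}

// Hypothetical checks mirroring the static, GitHub license, and npm license badges.
checkBadge('https://img.shields.io/badge/status-up-brightgreen.svg', 'status');
checkBadge('https://img.shields.io/github/license/expressjs/express.svg', 'license');
checkBadge('https://img.shields.io/npm/l/express.svg', 'license');
```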

I'm happy to cover the cost for a couple of months ($5.50), but it might be good to migrate to something else soon.

When I created shields-server.com, I set up CNAMEs for s0.shields-server.com, s1.shields-server.com, and s3.shields-server.com, though it'd be better to make these subdomains of shields.io and dump the extra domain.

@RedSparr0w
Member

Nice page, it should help give some insight into what's going wrong.
The GitHub license badge seems to be failing a fair amount (~20% currently).
Are s1, s2, s3 running different code, or are they all the same?

@paulmelnikow
Member Author

paulmelnikow commented Nov 3, 2017

Yea, thanks, it should help. The code on the three servers should be the same.

There are interesting patterns in the downtime:

https://status.shields-server.com/779605524
https://status.shields-server.com/779605526
https://status.shields-server.com/779605529

The three servers had correlated downtime around 15:30 (that’s NY time). One of them also had downtime an hour earlier, around 14:30. Two had downtime around 13:33 / 13:43.

The duration of the downtime varies from server to server. For example, s0 was down from 15:28 to 15:49, s1 from 15:30 to 15:36, and s2 from 15:32 to 15:48.

Correlated downtime suggests there is some shared state, pointing to rate limit exhaustion as a factor. Downtime about an hour apart might correlate with rate limit resets.

The skew in recovery time might be explained by caching, though there might be other explanations too.
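
One way to test the reset hypothesis would be to log GitHub's X-RateLimit-* response headers on each upstream call and line the timestamps up against the downtime windows. A sketch (the function takes a Node http response object; this isn't existing Shields code):

```js
// Sketch: log GitHub's rate-limit headers from an upstream response so
// outages can be correlated with quota resets. X-RateLimit-Remaining and
// X-RateLimit-Reset are standard GitHub API response headers.
function logRateLimit(res) {
  const remaining = res.headers['x-ratelimit-remaining'];
  const reset = res.headers['x-ratelimit-reset'];
  if (remaining !== undefined && reset !== undefined) {
    const resetAt = new Date(Number(reset) * 1000).toISOString();
    console.log(`rate limit: ${remaining} remaining, resets at ${resetAt}`);
  }
}
```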

@RedSparr0w
Member

Yeah, it's quite strange that the downtimes are so similar. Would setting a very low max-age help with possible caching issues?

@paulmelnikow
Member Author

As far as I can tell, maxAge only affects cache headers – and potentially the behavior of the client – though not the behavior of the Shields server. I wouldn't think UptimeRobot did any caching. It wouldn't really make sense for a monitoring service. So I don't think setting maxAge would have any effect.
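
In other words, maxAge is essentially a response-header concern. A sketch of the behavior I mean, in Express-style terms (illustrative only, not the actual Shields code):

```js
// Sketch, not the actual Shields code: a maxAge parameter only changes
// the Cache-Control header sent to the client. The server still handles
// every request it receives, so a lower max-age wouldn't change what an
// external monitor like UptimeRobot sees.
function setCacheControl(req, res) {
  const maxAge = parseInt(req.query.maxAge, 10);
  if (Number.isInteger(maxAge) && maxAge > 0) {
    res.setHeader('Cache-Control', `max-age=${maxAge}`);
  } else {
    res.setHeader('Cache-Control', 'no-cache');
  }
}
```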

@paulmelnikow
Member Author

I just wanted to clarify that the caching I think might be involved is the Shields internal vendor cache in lib/request-handler.js.
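
To illustrate what I mean, here's a deliberately simplified sketch of an internal vendor cache (not the real lib/request-handler.js logic): entries are reused until they expire, so a cached upstream failure could keep a badge looking broken for a few minutes after the upstream recovers, which might explain the skewed recovery times.

```js
// Deliberately simplified sketch of an internal vendor cache (not the
// real lib/request-handler.js implementation). Entries are reused until
// they expire, so a cached upstream failure can keep a badge "down" for
// a while after the upstream has actually recovered.
const cache = new Map();
const TTL_MS = 5 * 60 * 1000; // assumed TTL, for illustration only

async function cachedVendorRequest(url, fetchUpstream) {
  const hit = cache.get(url);
  if (hit && Date.now() - hit.time < TTL_MS) {
    return hit.value; // served from cache, even if upstream state has changed
  }
  const value = await fetchUpstream(url);
  cache.set(url, { value, time: Date.now() });
  return value;
}
```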

Interesting that we're still seeing hourly downtime, though less correlated between servers. I wonder if it's related to hours of uptime.

[screenshots: per-server uptime graphs, 2017-11-05 ~12:28 pm]

@paulmelnikow
Member Author

[screenshots: per-server uptime graphs, 2017-11-06 ~11:02 am]

@RedSparr0w
Member

Still seems to be failing ~20% of the time.
Any clues yet as to what the problem could be?
It still seems they generally go down/come back up within 5-20 minutes of each other.

@espadrine
Member

espadrine commented Nov 9, 2017

Three things.

First, s1 is European (located at Gravelines, IIRC), while s0 and s2 are in Canada (Montreal?). Most of our vendors (typically GitHub) have US servers. It is inevitable that crossing the Atlantic yields a poorer SLA. On the plus side, it is the least infuriating SSH session for me, and Europeans enjoy a faster static badge thanks to it.

Second, the worldwide load looks like this.
[image: graph of worldwide load by local time]

(Local time probably means UTC? Hard to tell. It's 10:40am here in France.)
In which case, we can call the two low points "Pacific daytime" and… "Chinese lunch break"?

I can't recall what the third thing was, but maybe it was related to describing exactly the shape of the failures? Like, is it failing once every ten during the high-load hour?

@GBH

GBH commented Nov 9, 2017

Just bumping to say I've been experiencing non-loading badges for several days now. Every other refresh I get "Invalid upstream response (521)" from githubusercontent.com.

@jaydenseric

I've been seeing a lot of this the last few days:

[screenshot, 2017-11-10 ~12:51 pm]

@paulmelnikow
Member Author

Indeed, this has happened with a good chunk of requests over the last few days.

https://status.shields-server.com/

Things have been much worse over the last 22 hours because of #1263, unrelated service-provider downtime that took out one of our servers.

@paulmelnikow
Member Author

s1 is European (located at Gravelines, IIRC), while s0 and s2 are in Canada (Montreal?).
Most of our vendors (typically, GitHub) have US servers.

Good to know. That explains why the stats for s1 are sometimes slightly worse.

@paulmelnikow
Member Author

To re-summarize:

  1. @espadrine, who has limited time on this project, is the only sysadmin.
  2. He's working on giving me access.
  3. Doing so is complicated because the hosting account (and maybe the servers too) are shared with other services he runs.
  4. I like getting to the bottom of things and want to fix this, but my options are limited.

I just emailed this plan:

To solve #1119, I rewrote the GitHub auth logic; it's in an unmerged PR. I found other minor bugs along the way: a logic error in the token sorting, a missing callback.

I’d like to deploy that new code, but it’s a big change, and I don’t feel comfortable doing it without some way to back up and restore the tokens, plus deploy and logs access, or else a deploy window when you’re around.

Here’s what I’ll do:

  • Add some debug output and/or debug API to the current github-auth code
  • Self-review, again, the new github-auth PR
  • Add debug output and/or debug API to the new github-auth PR

Could I ask you to:

  • Check how many tokens we have in production
  • Deploy latest so we can start collecting additional tokens (it’ll help a little, I think)
  • Sort out the logging
  • Debug the ssh issue

@manuel-rubio

@paulmelnikow I was checking the links:

https://img.shields.io/codecov/c/github/bragful/ephp.svg
https://img.shields.io/travis/bragful/ephp/master.svg

They are working, but they take too long to load (around 15 seconds). GitHub retrieves these kinds of images through a proxy, so the error the browser receives is a 504 (Gateway Timeout).

Have you checked how many requests your system is receiving to generate the badges? If I can help you with something just let me know.

@paulmelnikow
Member Author

@manuel-rubio Yea, that's unfortunate. See #1263.

@paulmelnikow added the bug (Bugs in badges and the frontend) label Nov 10, 2017
@paulmelnikow
Member Author

paulmelnikow commented Nov 10, 2017

While working on

  • Add some debug output and/or debug API to the current github-auth code

I found the issue. It's a dumb thing I introduced in #1118. Fixed in #1266.

AFAICT production has been running on the anonymous quota. I'm shocked this has been working as well as it has. Admittedly, not that well, though I'd have expected what we have to work for only the first few seconds of every hour.

Either the server is using a different github secret from the one I expect, or, as likely, the Shields IPs do indeed have special treatment from GitHub.

I'm still eager to get the new code shipped, as it has a lot more tests. And of course #1263 remains an issue.
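
To put numbers on the anonymous-quota point above: unauthenticated GitHub API requests are limited to 60 per hour per IP, while requests sent with a token get 5,000 per hour on the core resource. The difference is just whether an Authorization header is attached; a sketch (not the actual github-auth code):

```js
// Sketch, not the actual github-auth code: the only difference between
// the anonymous quota (60 requests/hour per IP) and the authenticated
// quota (5000 requests/hour per token) is the Authorization header.
const https = require('https');

function githubGet(path, token, callback) {
  const headers = { 'User-Agent': 'shields-sketch' };
  if (token) {
    headers.Authorization = `token ${token}`; // authenticated quota
  }
  https.get({ hostname: 'api.github.com', path, headers }, res => {
    let body = '';
    res.on('data', chunk => { body += chunk; });
    res.on('end', () => callback(null, res, body));
  }).on('error', err => callback(err));
}
```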

@paulmelnikow
Member Author

Opened #1267 with an auth debug endpoint + logging.
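
Without speaking for exactly what #1267 exposes, the general idea of a token debug endpoint is to report per-token stats keyed by a hash rather than the raw token, so nothing sensitive leaks. A sketch, with a hypothetical token shape:

```js
// Sketch of the general idea (not necessarily what #1267 implements):
// expose per-token usage stats keyed by a SHA-256 hash of the token,
// so the raw tokens never leave the server.
const crypto = require('crypto');

function tokenDebugInfo(tokens) {
  // tokens: array of { value, usesRemaining } -- hypothetical shape
  return tokens.map(t => ({
    sha: crypto.createHash('sha256').update(t.value).digest('hex').slice(0, 12),
    usesRemaining: t.usesRemaining,
  }));
}
```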

@paulmelnikow
Member Author

If I can help you with something just let me know.

I didn't really answer this question @manuel-rubio!

There are four ways you can help:

  1. Review my changes. I have to self-review my code, which is hardly ideal. For what it's worth, the PR that caused this regression was open for four weeks, which was plenty of opportunity. If a team of five people could review a couple of PRs per week, my changes could easily get 2–3 reviews apiece. Not only would this reduce bugs; over time it has the wonderful side effect of making the code more readable and therefore more approachable.
  2. Perform first reviews of simple changes, like badge additions.
  3. Monitor issues and the chat room, and help other people who have questions about contributing to Shields, or using it for their projects. Dig into the code as needed. This is the easiest way to create time among the people who have the most context on this project.
  4. Contribute GitHub tokens and $. I honestly don't know much about our current financial state, though I would love to have the flexibility to use third-party monitoring and logging services, not to mention choose hosting that makes scaling and shared administration easy. We're setting up an OpenCollective since Gratipay is shutting down.

@paulmelnikow
Member Author

The fix is deployed. Status looks good:

[screenshot: status page, 2017-11-11 ~2:22 pm]
