
Github badges are intermittently inaccessible #1245

Closed
paulmelnikow opened this issue Nov 2, 2017 · 24 comments
Labels
bug (Bugs in badges and the frontend), operations (Hosting, monitoring, and reliability for the production badge servers)

Comments

@paulmelnikow
Member

paulmelnikow commented Nov 2, 2017

I'm not sure whether this is due to one of the recent changes…

#1142 #1117 #1195 #1186 #1118

…or simply #1119, which is a bug that causes a token to erroneously be considered exhausted once it's used for a search request.

People can't add new tokens either (#1243), exacerbating this slightly, but that will be fixed in #1038.

The first report was roughly 16 hours after deploy.

@paulmelnikow added the operations (Hosting, monitoring, and reliability for the production badge servers) label Nov 2, 2017
@paulmelnikow
Member Author

cc @espadrine

@paulmelnikow
Member Author

paulmelnikow commented Nov 2, 2017

The badges are working again, and I think our "main" rate limit just reset:

core:
  remaining: 12489 of 12500
  reset: in an hour
search:
  remaining: 30 of 30
  reset: in a minute
graphql:
  remaining: 5000 of 5000
  reset: in an hour

Generated with https://github.com/paulmelnikow/github-limited
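
For anyone reproducing this check locally, the same numbers come from GitHub's /rate_limit endpoint. A minimal sketch in Node (GITHUB_TOKEN is an assumed environment variable; without it the response reflects the anonymous quota):

```js
// Minimal sketch: query GitHub's /rate_limit endpoint directly.
// GITHUB_TOKEN is an assumed environment variable; without it the
// response reflects the much smaller anonymous quota.
const https = require('https');

const headers = { 'User-Agent': 'rate-limit-check' };
if (process.env.GITHUB_TOKEN) {
  headers.Authorization = `token ${process.env.GITHUB_TOKEN}`;
}

https.get({ hostname: 'api.github.com', path: '/rate_limit', headers }, res => {
  let body = '';
  res.on('data', chunk => { body += chunk; });
  res.on('end', () => {
    const { resources } = JSON.parse(body);
    ['core', 'search', 'graphql'].forEach(name => {
      const r = resources[name];
      if (!r) return;
      const resetAt = new Date(r.reset * 1000).toISOString();
      console.log(`${name}: ${r.remaining} of ${r.limit}, resets at ${resetAt}`);
    });
  });
}).on('error', err => console.error(err));
```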

@paulmelnikow
Member Author

Now intermittently broken, though plenty of rate limit left.

core:
  remaining: 12479 of 12500
  reset: in 18 minutes
search:
  remaining: 30 of 30
  reset: in a minute
graphql:
  remaining: 5000 of 5000
  reset: in an hour

Here's an example: https://img.shields.io/github/tag/expressjs/express.svg

@paulmelnikow changed the title from "Github badges are all inaccessible" to "Github badges are intermittently inaccessible" Nov 2, 2017
@paulmelnikow
Member Author

I sent this to @espadrine about an hour ago:

The Github badges are failing intermittently. Are you seeing crashes on the server?

It might be an old bug related to our handling of quotas for the Github search API, though the timing of the incident makes me suspect recent changes. I've made some recent changes to the github auth, but nothing jumps out at me from reading them.

It's difficult to debug without server access. I'm thinking I should add an endpoint to get all the user tokens, or else hashed user tokens with stats. That way I could troubleshoot a bit better locally.

Is there currently any backup of the user tokens, apart from the other servers?

I do have some new github token code to fix the search API quota issue, though it's a rewrite and I'd like to test it more first. Before merging I also want to add some optional trace logging we can turn on in cases like this.

I feel like I need deploy access, logs, and a way to restore the token file in order to deploy that with confidence that I can find and fix whatever might be wrong with it.

Any thoughts on what could be causing the ssh issue?

I like getting to the bottom of things and want to fix this, but my options are limited.

@paulmelnikow
Member Author

paulmelnikow commented Nov 3, 2017

I set up a status page:

https://status.shields-server.com/

It runs a static badge, the Github license badge, and the npm license badge, and for each one checks for some of the expected text.
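
For reference, each check is essentially "fetch the badge, look for a string". A rough sketch of that kind of probe in Node; the URLs and expected strings below are illustrative, not the exact ones the status page uses:

```js
// Illustrative probe, not the status page's actual configuration:
// fetch a badge SVG and verify it contains the text we expect.
const https = require('https');

function checkBadge(url, expectedText) {
  https.get(url, res => {
    let body = '';
    res.on('data', chunk => { body += chunk; });
    res.on('end', () => {
      const up = res.statusCode === 200 && body.includes(expectedText);
      console.log(`${url}: ${up ? 'up' : 'DOWN'}`);
    });
  }).on('error', () => console.log(`${url}: DOWN`));
}

// Hypothetical checks mirroring the static, GitHub license, and npm license badges.
checkBadge('https://img.shields.io/badge/status-up-brightgreen.svg', 'status');
checkBadge('https://img.shields.io/github/license/expressjs/express.svg', 'license');
checkBadge('https://img.shields.io/npm/l/express.svg', 'license');
```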

I'm happy to cover the cost for a couple of months ($5.50), but it might be good to migrate to something else soon.

When I created shields-server.com, I set up CNAMEs for s0.shields-server.com, s1.shields-server.com, and s3.shields-server.com, though it'd be better to make these subdomains of shields.io and dump the extra domain.

@RedSparr0w
Member

Nice page, it should help give some insight into what's going wrong.
The GitHub license badge seems to be failing a fair amount (~20% currently).
Are s1, s2, s3 running different code, or are they all the same?

@paulmelnikow
Member Author

paulmelnikow commented Nov 3, 2017

Yea, thanks, it should help. The code on the three servers should be the same.

There are interesting patterns in the downtime:

https://status.shields-server.com/779605524
https://status.shields-server.com/779605526
https://status.shields-server.com/779605529

The three servers had correlated downtime around 15:30 (that’s NY time). One of them also had downtime an hour earlier, around 14:30. Two had downtime around 13:33 / 13:43.

The duration of the downtime varies from server to server. For example, s0 was down from 15:28 to 15:49, s1 from 15:30 to 15:36, and s2 from 15:32 to 15:48.

Correlated downtime suggests there is some shared state, pointing to rate limit exhaustion as a factor. Downtime about an hour apart might correlate with rate limit resets.

The skew in recovery time might be explained by caching, though there might be other explanations too.
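
One way to test the reset hypothesis would be to log GitHub's X-RateLimit-* response headers on each upstream call and line the timestamps up against the downtime windows. A sketch (the function takes a Node http response object; this isn't existing Shields code):

```js
// Sketch: log GitHub's rate-limit headers from an upstream response so
// outages can be correlated with quota resets. X-RateLimit-Remaining and
// X-RateLimit-Reset are standard GitHub API response headers.
function logRateLimit(res) {
  const remaining = res.headers['x-ratelimit-remaining'];
  const reset = res.headers['x-ratelimit-reset'];
  if (remaining !== undefined && reset !== undefined) {
    const resetAt = new Date(Number(reset) * 1000).toISOString();
    console.log(`rate limit: ${remaining} remaining, resets at ${resetAt}`);
  }
}
```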

@RedSparr0w
Member

Yeah, it's quite strange that the downtimes are so similar. Would setting a very low max-age help with possible caching issues?

@paulmelnikow
Member Author

As far as I can tell, maxAge only affects cache headers – and potentially the behavior of the client – though not the behavior of the Shields server. I wouldn't think UptimeRobot did any caching. It wouldn't really make sense for a monitoring service. So I don't think setting maxAge would have any effect.
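
In other words, maxAge is essentially a response-header concern. A sketch of the behavior I mean, in Express-style terms (illustrative only, not the actual Shields code):

```js
// Sketch, not the actual Shields code: a maxAge parameter only changes
// the Cache-Control header sent to the client. The server still handles
// every request it receives, so a lower max-age wouldn't change what an
// external monitor like UptimeRobot sees.
function setCacheControl(req, res) {
  const maxAge = parseInt(req.query.maxAge, 10);
  if (Number.isInteger(maxAge) && maxAge > 0) {
    res.setHeader('Cache-Control', `max-age=${maxAge}`);
  } else {
    res.setHeader('Cache-Control', 'no-cache');
  }
}
```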

@paulmelnikow
Member Author

I just wanted to clarify that the caching I think might be involved is the Shields internal vendor cache in lib/request-handler.js.
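
To illustrate what I mean, here's a deliberately simplified sketch of an internal vendor cache (not the real lib/request-handler.js logic): entries are reused until they expire, so a cached upstream failure could keep a badge looking broken for a few minutes after the upstream recovers, which might explain the skewed recovery times.

```js
// Deliberately simplified sketch of an internal vendor cache (not the
// real lib/request-handler.js implementation). Entries are reused until
// they expire, so a cached upstream failure can keep a badge "down" for
// a while after the upstream has actually recovered.
const cache = new Map();
const TTL_MS = 5 * 60 * 1000; // assumed TTL, for illustration only

async function cachedVendorRequest(url, fetchUpstream) {
  const hit = cache.get(url);
  if (hit && Date.now() - hit.time < TTL_MS) {
    return hit.value; // served from cache, even if upstream state has changed
  }
  const value = await fetchUpstream(url);
  cache.set(url, { value, time: Date.now() });
  return value;
}
```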

Interesting that we're still seeing hourly downtime, though less correlated between servers. I wonder if it's related to hours of uptime.

[screenshots: per-server uptime graphs, 2017-11-05 ~12:28 pm]

@paulmelnikow
Member Author

[screenshots: per-server uptime graphs, 2017-11-06 ~11:02 am]

@RedSparr0w
Member

Still seems to be failing ~20% of the time.
Any clues yet as to what the problem could be?
It still seems they generally go down/come back up within 5-20 minutes of each other.

@espadrine
Member

espadrine commented Nov 9, 2017

Three things.

First, s1 is European (located at Gravelines, IIRC), while s0 and s2 are in Canada (Montreal?). Most of our vendors (typically GitHub) have US servers. It is inevitable that crossing the Atlantic yields a poorer SLA. On the plus side, it is the least infuriating SSH session for me, and Europeans enjoy a faster static badge thanks to it.

Second, the worldwide load looks like this.
[image: graph of worldwide load by local time]

(Local time probably means UTC? Hard to tell. It's 10:40am here in France.)
In which case, we can call the two low points "Pacific daytime" and… "Chinese lunch break"?

I can't recall what the third thing was, but maybe it was related to describing exactly the shape of the failures? Like, is it failing once every ten during the high-load hour?

@GBH

GBH commented Nov 9, 2017

Just bumping to say I've been experiencing non-loading badges for several days now. Every other refresh I get "Invalid upstream response (521)" from githubusercontent.com.

@jaydenseric

I've been seeing a lot of this the last few days:

[screenshot, 2017-11-10 ~12:51 pm]

@paulmelnikow
Member Author

Indeed, this has happened with a good chunk of requests over the last few days.

https://status.shields-server.com/

Things have been much worse over the last 22 hours because of #1263, unrelated service-provider downtime that took out one of our servers.

@paulmelnikow
Member Author

s1 is European (located at Gravelines, IIRC), while s0 and s2 are in Canada (Montreal?).
Most of our vendors (typically, GitHub) have US servers.

Good to know. That explains why the stats for s1 are sometimes slightly worse.

@paulmelnikow
Member Author

To re-summarize:

  1. @espadrine, who has limited time on this project, is the only sysadmin.
  2. He's working on giving me access.
  3. Doing so is complicated because the hosting account (and maybe the servers too) are shared with other services he runs.
  4. I like getting to the bottom of things and want to fix this, but my options are limited.

I just emailed this plan:

To solve #1119, I rewrote the GitHub auth logic; it's in an unmerged PR. I found other minor bugs along the way: a logic error in the token sorting, a missing callback.

I’d like to deploy that new code, but it’s a big change, and I don’t feel comfortable doing it without some way to back up and restore the tokens, plus deploy and logs access, or else a deploy window when you’re around.

Here’s what I’ll do:

  • Add some debug output and/or debug API to the current github-auth code
  • Self-review, again, the new github-auth PR
  • Add debug output and/or debug API to the new github-auth PR

Could I ask you to:

  • Check how many tokens we have in production
  • Deploy latest so we can start collecting additional tokens (it’ll help a little, I think)
  • Sort out the logging
  • Debug the ssh issue

@manuel-rubio

@paulmelnikow I was checking the links:

https://img.shields.io/codecov/c/github/bragful/ephp.svg
https://img.shields.io/travis/bragful/ephp/master.svg

They are working, but they take too long to load (around 15 seconds). GitHub retrieves these kinds of images through a proxy, so the error the browser receives is a 504 (Gateway Timeout).

Have you checked how many requests your system is receiving to generate the badges? If I can help you with something just let me know.

@paulmelnikow
Member Author

@manuel-rubio Yea, that's unfortunate. See #1263.

@paulmelnikow added the bug (Bugs in badges and the frontend) label Nov 10, 2017
@paulmelnikow
Member Author

paulmelnikow commented Nov 10, 2017

While working on

  • Add some debug output and/or debug API to the current github-auth code

I found the issue. It's a dumb thing I introduced in #1118. Fixed in #1266.

AFAICT production has been running on the anonymous quota. I'm shocked this has been working as well as it has. Admittedly, not that well, though I'd have expected what we have to work for only the first few seconds of every hour.

Either the server is using a different github secret from the one I expect, or, as likely, the Shields IPs do indeed have special treatment from GitHub.

I'm still eager to get the new code shipped, as it has a lot more tests. And of course #1263 remains an issue.
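
To put numbers on the anonymous-quota point above: unauthenticated GitHub API requests are limited to 60 per hour per IP, while requests sent with a token get 5,000 per hour on the core resource. The difference is just whether an Authorization header is attached; a sketch (not the actual github-auth code):

```js
// Sketch, not the actual github-auth code: the only difference between
// the anonymous quota (60 requests/hour per IP) and the authenticated
// quota (5000 requests/hour per token) is the Authorization header.
const https = require('https');

function githubGet(path, token, callback) {
  const headers = { 'User-Agent': 'shields-sketch' };
  if (token) {
    headers.Authorization = `token ${token}`; // authenticated quota
  }
  https.get({ hostname: 'api.github.com', path, headers }, res => {
    let body = '';
    res.on('data', chunk => { body += chunk; });
    res.on('end', () => callback(null, res, body));
  }).on('error', err => callback(err));
}
```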

@paulmelnikow
Member Author

Opened #1267 with an auth debug endpoint + logging.
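
Without speaking for exactly what #1267 exposes, the general idea of a token debug endpoint is to report per-token stats keyed by a hash rather than the raw token, so nothing sensitive leaks. A sketch, with a hypothetical token shape:

```js
// Sketch of the general idea (not necessarily what #1267 implements):
// expose per-token usage stats keyed by a SHA-256 hash of the token,
// so the raw tokens never leave the server.
const crypto = require('crypto');

function tokenDebugInfo(tokens) {
  // tokens: array of { value, usesRemaining } -- hypothetical shape
  return tokens.map(t => ({
    sha: crypto.createHash('sha256').update(t.value).digest('hex').slice(0, 12),
    usesRemaining: t.usesRemaining,
  }));
}
```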

@paulmelnikow
Member Author

If I can help you with something just let me know.

I didn't really answer this question @manuel-rubio!

There are four ways you can help:

  1. Review my changes. I have to self-review my code, which is hardly ideal. For what it's worth, the PR that caused this regression was open for four weeks, which was plenty of opportunity. If a team of five people could review a couple of PRs per week, my changes could easily get 2–3 reviews apiece. Not only would this reduce bugs; over time it has the wonderful side effect of making the code more readable and therefore more approachable.
  2. Perform first reviews of simple changes, like badge additions.
  3. Monitor issues and the chat room, and help other people who have questions about contributing to Shields, or using it for their projects. Dig into the code as needed. This is the easiest way to create time among the people who have the most context on this project.
  4. Contribute GitHub tokens and $. I honestly don't know much about our current financial state, though I would love to have the flexibility to use third-party monitoring and logging services, not to mention choose hosting that makes scaling and shared administration easy. We're setting up an OpenCollective since Gratipay is shutting down.

@paulmelnikow
Member Author

The fix is deployed. Status looks good:

[screenshot: status page, 2017-11-11 ~2:22 pm]
