
Badge Images Often Fail To Load In Github README #1568

Closed · Undistraction opened this issue Mar 12, 2018 · 33 comments
Labels
operations Hosting, monitoring, and reliability for the production badge servers

Comments

@Undistraction

Undistraction commented Mar 12, 2018

I've noticed that at least 50% of the time, one or more badges in the READMEs of my various GitHub projects fail to display. I'm on a very fast connection (~100 Mbps).

In the error console:

Failed to load resource: the server responded with a status of 504 (Gateway Timeout)

The URLs are not the ones added to the badges in the README; they point to some kind of GitHub cache:

URL from badge: https://img.shields.io/npm/v/blain.svg
URL of error: https://camo.githubusercontent.com/fa71495d8e006d53927660ed22594c3e7097c5a6/68747470733a2f2f696d672e736869656c64732e696f2f6e706d2f762f626c61696e2e737667
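
For anyone puzzled by the second URL: GitHub rewrites external images in READMEs to go through its Camo proxy, and the long final path segment is just the original badge URL, hex-encoded. A quick Node.js sketch, using the URL from this report, recovers the original address:

```js
// A quick sketch (Node.js) showing the Camo URL is the badge URL in disguise:
// the final path segment is the original URL, hex-encoded (the first segment
// appears to be a signature added by GitHub).
const camoUrl =
  'https://camo.githubusercontent.com/fa71495d8e006d53927660ed22594c3e7097c5a6/68747470733a2f2f696d672e736869656c64732e696f2f6e706d2f762f626c61696e2e737667';

const hexPart = camoUrl.split('/').pop();                 // last path segment
const original = Buffer.from(hexPart, 'hex').toString();  // decode hex -> text

console.log(original); // https://img.shields.io/npm/v/blain.svg
```

So the failing request is Camo trying (and failing) to fetch the Shields badge in time, not a broken link in the README.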

  • Multiple refreshes usually get one or all of the badges to load correctly.
  • Tested in Chrome, Safari and Firefox
  • I have seen this a lot recently on other projects' READMEs

Example Repos

@Undistraction Undistraction changed the title Badges Often Fail To Display Correctly Badge Images Often Fail To Load In Github README Mar 12, 2018
@paulmelnikow
Copy link
Member

Hi, thanks for raising this issue. I've observed this behavior too; I'm sure many people can corroborate.

If you look at https://status.shields-server.com/ and click on one server at a time, you'll see that response times sometimes spike. It's not about the speed of your connection; it's some combination of our servers' capacity and the upstream services being slow or rate limiting us. GitHub serves images through a proxy, and the 504 Gateway Timeout means the Shields server has taken too long to respond to the proxy, so the proxy has given up.
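
To make the failure mode concrete, here is a minimal sketch, not Camo's actual code, of a proxy that gives its upstream a fixed deadline and answers 504 Gateway Timeout when that deadline passes. The 4000 ms figure matches the Camo behaviour measured later in this thread, the upstream URL is the badge from this report, and Node 18+ is assumed for the global fetch:

```js
// Not Camo's actual code: an illustrative proxy that gives its upstream a fixed
// deadline and answers 504 Gateway Timeout when that deadline passes, which is
// exactly what the browser console above shows.
const http = require('http');

const UPSTREAM = 'https://img.shields.io/npm/v/blain.svg'; // badge from this report
const DEADLINE_MS = 4000;                                   // roughly Camo's limit

http.createServer(async (req, res) => {
  try {
    const upstream = await fetch(UPSTREAM, { signal: AbortSignal.timeout(DEADLINE_MS) });
    res.writeHead(upstream.status, {
      'Content-Type': upstream.headers.get('content-type') || 'image/svg+xml',
    });
    res.end(Buffer.from(await upstream.arrayBuffer()));
  } catch {
    // The upstream was too slow; the proxy gives up and reports a gateway timeout.
    res.writeHead(504);
    res.end('Gateway Timeout');
  }
}).listen(8080);
```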

I would love to put work into making Shields more reliable. I think the fix is to add server capacity and, given that we're not going to make upstream rate limiting go away, to be much more aggressive with caching through several means:

  • Excess server capacity and/or elastic scaling so we can handle traffic spikes (currently there are three virtual machines, period)
  • Caching pieces of data from API responses (not just computed badge text, as we do now)
  • Basing cache priority on frequency, not just recency (see the sketch after this list)
  • Sharing cache data between servers (not cache per server as now)
  • Bigger caches (requires more memory than our virtual machines have)
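
To illustrate the caching bullets above (this is emphatically not the current Shields implementation), here is a tiny in-memory cache sketch that stores upstream API data and evicts by hit count rather than recency, so heavily requested badges stay warm:

```js
// Illustrative only, not the Shields implementation: a tiny in-memory cache that
// evicts the *least frequently* requested entry when full, rather than the least
// recently used one, so badges that are hit constantly stay cached.
class FrequencyCache {
  constructor(maxEntries) {
    this.maxEntries = maxEntries;
    this.entries = new Map(); // key -> { value, hits }
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    entry.hits += 1;
    return entry.value;
  }

  set(key, value) {
    const existing = this.entries.get(key);
    if (existing) {
      existing.value = value;
      return;
    }
    if (this.entries.size >= this.maxEntries) {
      // Evict whichever entry has been requested the fewest times.
      let coldestKey;
      let coldestHits = Infinity;
      for (const [k, e] of this.entries) {
        if (e.hits < coldestHits) {
          coldestHits = e.hits;
          coldestKey = k;
        }
      }
      this.entries.delete(coldestKey);
    }
    this.entries.set(key, { value, hits: 0 });
  }
}

// e.g. cache a piece of upstream API data, not just the rendered badge text:
const cache = new FrequencyCache(10000);
cache.set('npm/v/blain', { version: '1.2.3' }); // hypothetical payload
console.log(cache.get('npm/v/blain'));
```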

Our server budget is extremely limited, and frankly we need a significantly larger budget to consider any of these options.

We ask developers who know and love Shields to please make a one-time $10 donation. If you've already given, please ask your developer friends to do the same, or solicit big donations from big projects / companies who use Shields.

https://opencollective.com/shields

Also open to promotion ideas, ideas that don't take money, and in general discussing further!

@paulmelnikow paulmelnikow added the operations Hosting, monitoring, and reliability for the production badge servers label Mar 15, 2018
@pixelass

In my case it's more like

  • In 90% of all cases at least one badge is not loaded.
  • In 50% of all cases at least 2 badges are not loaded.

[screenshot: browser console, 2018-03-22 at 15:46]

Failed to load resource: the server responded with a status of 504 (Gateway Timeout)
Failed to load resource: the server responded with a status of 504 (Gateway Timeout)
Failed to load resource: the server responded with a status of 504 (Gateway Timeout)
Failed to load resource: the server responded with a status of 504 (Gateway Timeout)
Failed to load resource: the server responded with a status of 504 (Gateway Timeout)

@g105b
Contributor

g105b commented Mar 23, 2018

Is this simply a case of not having enough server capacity? If so, would you mind letting us know the specifics of what server is being used, where it is located, and any details regarding bandwidth?

@RedSparr0w
Member

RedSparr0w commented Mar 26, 2018

Just a quick test of the GitHub timeouts, using badge images that delay their response by 1, 2, 3, 3.9, 3.95, and 4 seconds:

[delayed badge images, one per delay]

Edit:
Removed the 5 and 6 second delays as they're not needed;
4 seconds seems to always time out,
3.95 seconds looks to be okay.
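
For anyone who wants to reproduce this kind of measurement, one possible setup is a throwaway Node.js server that delays its response by a configurable number of seconds before returning a badge-sized SVG; the port and query parameter below are invented for the sketch.

```js
// A throwaway server (port and query parameter invented for this sketch) that
// waits a configurable number of seconds before returning a badge-sized SVG.
// Embed e.g. http://<your-host>:3000/?delay=3.9 in a README and see whether
// Camo still renders it.
const http = require('http');

http.createServer((req, res) => {
  const { searchParams } = new URL(req.url, 'http://localhost');
  const delaySeconds = parseFloat(searchParams.get('delay') || '0');

  setTimeout(() => {
    res.writeHead(200, { 'Content-Type': 'image/svg+xml' });
    res.end(
      `<svg xmlns="http://www.w3.org/2000/svg" width="90" height="20">` +
        `<text x="4" y="14">${delaySeconds}s delay</text></svg>`
    );
  }, delaySeconds * 1000);
}).listen(3000);
```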

@paulmelnikow
Member

Is this simply a case of not having enough server capacity? If so, would you mind letting us know the specifics of what server is being used, where it is located, and any details regarding bandwidth?

@g105b Server capacity, yes, combined with more aggressive caching. See my comment above: #1568 (comment)

There are three servers, single-core VPSs with 2 GB of RAM: VPS SSD 1 from OVH. One is in Gravelines, France, and I believe the other two are in Quebec, Canada.

@RedSparr0w Thanks for those tests!

To everyone following this issue, if you know and love Shields, please make a one-time $10 donation if you haven't already, and ask your friends to do the same! https://opencollective.com/shields

@RedSparr0w
Member

RedSparr0w commented Apr 14, 2018

I've noticed a trend over the past few days: server response times around 7am-10am & 1pm-3pm (UTC) are a lot higher than usual.
I suspect these are the times when most of the badges are failing (due to GitHub timing out after 4 seconds).
[chart: average response times by hour]
@espadrine Is there anything in the logs that would suggest a much higher amount of traffic from any particular sources during those times?

@RedSparr0w
Member

RedSparr0w commented Apr 19, 2018

I've been tracking how often the badges have a response time of over 4 seconds here, and it still seems consistent with the above.

Between 7am-10am & 1pm-3pm, response times are a lot higher than normal, causing the images to time out when loading on GitHub:
[chart: response times over 4 seconds, by hour]
During the weekend, response times were pretty good:
[chart: weekend response times]
On Monday and Tuesday, response times were above 4 seconds for almost all of the peak hours:
[chart: Monday/Tuesday response times]
Note: times are UTC
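
For reference, this kind of tracking can be approximated with a small probe that polls a badge once a minute and flags responses slower than Camo's ~4 second window. Node 18+ is assumed for the global fetch, and the badge URL is just an example:

```js
// A rough version of this kind of tracking: poll one badge every minute and flag
// responses slower than Camo's ~4 second window.
const badgeUrl = 'https://img.shields.io/npm/v/npm.svg';
const CAMO_LIMIT_MS = 4000;

setInterval(async () => {
  const started = Date.now();
  try {
    await fetch(badgeUrl);
    const ms = Date.now() - started;
    console.log(`${new Date().toISOString()} ${ms} ms ${ms > CAMO_LIMIT_MS ? 'TIMEOUT RISK' : 'ok'}`);
  } catch (err) {
    console.log(`${new Date().toISOString()} request failed: ${err.message}`);
  }
}, 60 * 1000);
```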

gtalarico added a commit to gtalarico/pipenv-pipes that referenced this issue May 7, 2018
@joshenders

I work in the CDN/proxy space and can validate that @pelson's response is the correct approach. Adding server capacity for what is essentially a misconfigured HTTP response is not an efficient use of donation money.

@RedSparr0w
Member

@joshenders There is work going on with headers in #1725, which has recently been merged; #1806 is the next step to enabling it, and will hopefully get this issue fixed 🤞

@paulmelnikow
Member

@joshenders If you have a chance to read the discussion in #1725, please do!

@paulmelnikow
Member

The recent work to set longer cache headers has just gone live. I will be curious to see how much that helps.

It is very likely we also have a capacity issue, owing to ~10% growth over the last several months. I have proposed moving to Zeit Now to fix the capacity issue and solve our sysadmin bottleneck at the same time. That proposal is blocked awaiting a response from @espadrine, who owns the servers and load balancer.
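
For readers wondering what "longer cache headers" means in practice, the idea is to tell Camo and browsers that they may reuse a badge they have already fetched instead of hitting the Shields servers on every README view. The values below are illustrative only, not the exact headers Shields ships (see #1725):

```js
// Illustrative values only (the exact headers Shields ships are in #1725): a
// Cache-Control header like this lets Camo and browsers reuse a badge they have
// already fetched instead of requesting it again on every page view.
const http = require('http');

http.createServer((req, res) => {
  res.writeHead(200, {
    'Content-Type': 'image/svg+xml',
    // Hypothetical: cache for five minutes in browsers (max-age) and in shared
    // proxies such as Camo (s-maxage).
    'Cache-Control': 'public, max-age=300, s-maxage=300',
  });
  res.end('<svg xmlns="http://www.w3.org/2000/svg" width="1" height="1"/>');
}).listen(8080);
```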

@paulmelnikow
Member

paulmelnikow commented Aug 7, 2018

I’m glad to say addressing the cache headers (#1723) has had a huge effect. Today’s peak traffic is being handled like weekend traffic, with 99% of requests coming in underneath the 4 second camo timeout. The only broken badges I’m seeing today are not ours. 😁

That gives us a little time to sort out our hosting. We're still relatively slow on a number of badges, particularly the static badges, which should be instant.

@RedSparr0w
Member

RedSparr0w commented Aug 7, 2018

Uptimes are definitely getting better:
[chart: snapshot of the last 24 hours]
[chart: average response time over the last 24 hours]

@paulmelnikow
Member

Another weekday over 99%. 👍😌

If this problem recurs, or there are any other follow-on proposals, let’s open a new issue.

@nobody5050

Still having issues with this on several READMEs.

@calebcartwright
Member

calebcartwright commented Mar 30, 2021

Going to close and lock this issue as it's long been resolved but has a reasonably high potential to elicit follow-on comments.

For anyone else that stumbles upon this one...

This 3+ year old issue (as of the time of this post) originally reflected the fact that the Shields project was experiencing a lot of growth that overwhelmed its minimal runtime environment back then, and the overloaded Shields servers were often unable to serve the requested badges within the window enforced by GitHub/Camo. That in turn would result in timeouts and badges not being rendered on GitHub readme pages.

This has long since been resolved with various runtime improvements and caching mechanisms, and today Shields serves more than 750 million badges per month without issue. It is of course still possible to see a badge that fails to render on GitHub from time to time, but that isn't related to the widespread, persistent failures that prompted this issue.

If anyone has questions/reports/etc. about badges not rendering, please open a new issue and/or ping us on Discord with all the relevant details, including screenshots and the badges/badge types.

Please also note that the GitHub/Camo imposed time limits for rendering images are still in place, so it's not entirely uncommon to see rendering challenges with certain badges like the Dynamic and/or Endpoint badges, particularly if those endpoints are running on a platform that periodically shuts them down (like the Heroku free tier). This can happen because there is a rather tight time window for the entire badge request/response flow to complete, and after receiving a badge request the Shields servers almost always have to first fetch data from some upstream endpoint which does not always provide the needed data quickly enough.
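
If you run your own Endpoint badge backend, the practical takeaway is to answer badge requests from data you already have rather than doing slow work inside the request. A minimal sketch, with field names following the documented endpoint schema and everything else (port, refresh interval) invented for the example:

```js
// Field names follow the documented endpoint-badge schema; the port and refresh
// interval are invented for this sketch. The point: answer badge requests from
// data you already have instead of doing slow work inside the request.
const http = require('http');

let cached = { schemaVersion: 1, label: 'build', message: 'passing', color: 'brightgreen' };

// Refresh the slow-to-compute data out of band so badge requests never wait on it.
setInterval(() => {
  // e.g. query your CI system or database here, then assign the result to `cached`
}, 60 * 1000);

http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify(cached));
}).listen(3000);
```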

@badges badges locked as resolved and limited conversation to collaborators Mar 30, 2021