
Broken images on some plugin pages #156

Open
simonw opened this issue Jan 16, 2024 · 16 comments
Labels
bug Something isn't working

Comments

simonw commented Jan 16, 2024

https://datasette.io/plugins/datasette-atom

(screenshot: CleanShot 2024-01-16 at 08 37 55@2x)

simonw added the bug label on Jan 16, 2024

simonw commented Jan 16, 2024

Possibly relevant:

Those talk about timeouts though, which I think isn't the same thing as this signature problem.

simonw commented Jan 16, 2024

I'm going to force a re-build of the README for datasette-atom to see if I can fix that one as a one-off. Here's the logic that's causing it not to fetch a fresh copy:

# Fetch README for any repos that have changed since last time
repos_to_fetch_readme_for = []
for row in db["datasette_repos"].rows:
    if (
        row["latest_commit"] != previous_hashes.get(row["nameWithOwner"])
        or force_fetch_readmes
    ):
        repos_to_fetch_readme_for.append(row["nameWithOwner"])

simonw added a commit that referenced this issue Jan 16, 2024

simonw commented Jan 16, 2024

That did fix it:

https://datasette.io/plugins/datasette-atom

(screenshot: CleanShot 2024-01-16 at 09 03 09@2x)

So I think the fix may be to change the logic about fetching README files so it fetches if the repo has been updated OR if the README was last fetched more than X days ago. Not sure what value to use for X.
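A minimal sketch of that combined check, reusing the db and previous_hashes names from the snippet above and assuming a hypothetical readme_fetched_at column (stored as an ISO 8601 UTC timestamp) plus a placeholder 30-day cutoff for X:

from datetime import datetime, timedelta, timezone

README_MAX_AGE = timedelta(days=30)  # placeholder value for "X days"

repos_to_fetch_readme_for = []
for row in db["datasette_repos"].rows:
    changed = row["latest_commit"] != previous_hashes.get(row["nameWithOwner"])
    fetched_at = row.get("readme_fetched_at")  # hypothetical column, ISO 8601 UTC
    stale = fetched_at is None or (
        datetime.now(timezone.utc)
        - datetime.fromisoformat(fetched_at).replace(tzinfo=timezone.utc)
    ) > README_MAX_AGE
    if changed or stale:
        repos_to_fetch_readme_for.append(row["nameWithOwner"])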

simonw commented Jan 16, 2024

Found some relevant documentation.

https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/about-anonymized-urls

To host your images, GitHub uses the open-source project Camo. Camo generates an anonymous URL proxy for each file which hides your browser details and related information from other users. The URL starts https://<subdomain>.githubusercontent.com/, with different subdomains depending on how you uploaded the image.

https://github.blog/2014-01-28-proxying-user-images/

A while back, we started proxying all non-https images to avoid mixed-content warnings using a custom node server called camo. We're making a small change today and proxying HTTPS images as well.

Proxying these images will help protect your privacy: your browser information won't be leaked to other third party services.

https://github.com/atmos/camo was archived in April 2021.

simonw commented Jan 16, 2024

Another clue:

curl -i 'https://camo.githubusercontent.com/42a02d342f39e3dc05195df216b9f89fa84de19d94547b4c68a02ec12f0eaf2d/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f762f72656c656173652f73696d6f6e772f6461746173657474652d61746f6d3f696e636c7564655f70726572656c6561736573266c6162656c3d6368616e67656c6f67'
HTTP/2 403 
cache-control: no-cache, no-store, private, must-revalidate
content-security-policy: default-src 'none'; img-src data:; style-src 'unsafe-inline'
content-type: text/plain; charset=utf-8
server: github-camo (2d97ca31)
strict-transport-security: max-age=31536000; includeSubDomains
x-content-type-options: nosniff
x-frame-options: deny
x-xss-protection: 1; mode=block
x-github-request-id: F714:6D3F:387FC9:493449:65A6B820
accept-ranges: bytes
date: Tue, 16 Jan 2024 17:08:53 GMT
via: 1.1 varnish
x-served-by: cache-pao-kpao1770040-PAO
x-cache: MISS
x-cache-hits: 0
x-timer: S1705424933.064491,VS0,VE117
x-fastly-request-id: db6d3e2546b29d5a0aa986245ef0816997e820aa
timing-allow-origin: https://github.com
content-length: 14

Bad Signature

Note the server: github-camo (2d97ca31) header.
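As an aside, the second path segment of a camo URL is just the hex-encoded target URL, so it can be decoded to confirm which image the failing proxy URL points at (a quick sketch, not part of the build script):

from urllib.parse import urlparse

camo_url = (
    "https://camo.githubusercontent.com/42a02d342f39e3dc05195df216b9f89fa84de19d94547b4c68a02ec12f0eaf2d/"
    "68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f762f72656c656173652f73696d6f6e772f"
    "6461746173657474652d61746f6d3f696e636c7564655f70726572656c6561736573266c6162656c3d6368616e67656c6f67"
)
# Path is /<signature>/<hex-encoded target URL>
signature, encoded = urlparse(camo_url).path.lstrip("/").split("/")
print(bytes.fromhex(encoded).decode("utf-8"))
# https://img.shields.io/github/v/release/simonw/datasette-atom?include_prereleases&label=changelog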

simonw commented Jan 16, 2024

I think GitHub upgraded from that Node camo to this alternative written in Go at some point, because I found this in the source code for the Go one: https://github.com/cactus/go-camo/blob/11a57d9fcc47ad5f3de37b733a03f023edf4007c/pkg/camo/proxy.go#L102-L106

	sURL, ok := encoding.DecodeURL(p.config.HMACKey, sigHash, encodedURL)
	if !ok {
		http.Error(w, "Bad Signature", http.StatusForbidden)
		return
	}

simonw commented Jan 16, 2024

Having seen that code, my hunch is that GitHub changed their HMACKey at some point which broke my previously cached images.
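For reference, this is roughly how camo-style URLs are generated: the first path segment is an HMAC of the target URL under a server-side secret, so rotating that secret invalidates every previously generated URL. A toy sketch of the scheme (HMAC-SHA1 as in the original camo project; GitHub's exact parameters may differ):

import hashlib
import hmac

def camo_style_url(hmac_key: bytes, target_url: str, host: str = "camo.example.com") -> str:
    # Signature is an HMAC of the target URL; the URL itself is hex-encoded
    digest = hmac.new(hmac_key, target_url.encode("utf-8"), hashlib.sha1).hexdigest()
    return "https://{}/{}/{}".format(host, digest, target_url.encode("utf-8").hex())

url = "https://img.shields.io/github/v/release/simonw/datasette-atom"
# Same target URL, different key -> different signature, so URLs cached under
# the old key fail verification ("Bad Signature") after the key changes.
print(camo_style_url(b"old-key", url) == camo_style_url(b"new-key", url))  # False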

In the DB at the moment I store both the raw markdown and the rendered HTML: https://datasette.io/content/repos/209091256

(screenshot: CleanShot 2024-01-16 at 09 41 57@2x)

That shows that using GitHub's Markdown API doesn't produce the camo.githubusercontent.com proxied URLs.

In my code I'm calling out to github-to-sqlite, which does this:

https://github.com/dogsheep/github-to-sqlite/blob/eaef8ffd3f46be6c26062237ed88b4c2202a1c44/github_to_sqlite/utils.py#L785-L796

def fetch_readme(token, full_name, html=False):
    headers = make_headers(token)
    if html:
        headers["accept"] = "application/vnd.github.VERSION.html"
    url = "https://api.github.com/repos/{}/readme".format(full_name)
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return None
    if html:
        return rewrite_readme_html(response.text)
    else:
        return base64.b64decode(response.json()["content"]).decode("utf-8")

Switching to rendering through the Markdown API would fix this by removing camo.githubusercontent.com entirely from how https://datasette.io/ works.

simonw commented Jan 16, 2024

if repos_to_fetch_readme_for:
    print("Fetching README for {}".format(repos_to_fetch_readme_for))
    github_to_sqlite_repos.callback(
        db_filename,
        usernames=[],
        auth="auth.json",
        repo=repos_to_fetch_readme_for,
        load=None,
        readme=True,
        readme_html=True,
    )

I could flip readme_html to False and then have a separate Markdown rendering step.

simonw commented Jan 16, 2024

Probably easiest to cut github-to-sqlite out entirely here and instead do a fetch to https://docs.github.com/en/rest/repos/contents?apiVersion=2022-11-28#get-a-repository-readme to get the raw markdown, then send that to the https://api.github.com/markdown endpoint to render it. https://til.simonwillison.net/markdown/github-markdown-api

simonw commented Jan 16, 2024

Pieces I need to fix this:

curl 'https://api.github.com/repos/simonw/datasette/readme' -H 'Accept: application/vnd.github.raw'

To get back the raw Markdown.

import httpx

body = """
# Example

This is example Markdown
"""

response = httpx.post(
    "https://api.github.com/markdown",
    json={
        # mode=gfm would expand #13 issue links, provided you pass
        # context=simonw/datasette too
        "mode": "markdown",
        "text": body,
    },
    headers=headers,  # assumed to include an Authorization: Bearer <token> header
)
if response.status_code == 200:
    markdown_as_html = response.text

simonw commented Jan 16, 2024

Looking again at this code:

    if repos_to_fetch_readme_for:
        print("Fetching README for {}".format(repos_to_fetch_readme_for))
        github_to_sqlite_repos.callback(
            db_filename,
            usernames=[],
            auth="auth.json",
            repo=repos_to_fetch_readme_for,
            load=None,
            readme=True,
            readme_html=True,
        )

Is it JUST fetching the READMEs, or is it also populating the database with other important information?

If it's pulling other key information too, then I should leave that in there but rename repos_to_fetch_readme_for to repos_to_fetch_details_for, set readme=False and readme_html=False, and then roll my own separate code to fetch and store the READMEs.
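A sketch of what that separate README step could look like, stitching together the two API calls from the earlier comments (the function name is a placeholder and headers is assumed to carry a GitHub token):

import httpx

def fetch_and_render_readme(full_name, headers):
    # Fetch the raw README Markdown directly, avoiding the HTML endpoint
    # that produces camo.githubusercontent.com image URLs
    readme_response = httpx.get(
        "https://api.github.com/repos/{}/readme".format(full_name),
        headers=dict(headers, accept="application/vnd.github.raw"),
    )
    if readme_response.status_code != 200:
        return None, None
    markdown = readme_response.text
    # Render that Markdown to HTML with the Markdown API
    html_response = httpx.post(
        "https://api.github.com/markdown",
        json={"mode": "markdown", "text": markdown},
        headers=headers,
    )
    html = html_response.text if html_response.status_code == 200 else None
    return markdown, html

The results could then be written back with db["repos"].update(repo_id, {"readme": markdown, "readme_html": html}, alter=True), the same way _repo_readme does.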

simonw commented Jan 17, 2024

Here's the implementation of repos(): https://github.com/dogsheep/github-to-sqlite/blob/eaef8ffd3f46be6c26062237ed88b4c2202a1c44/github_to_sqlite/cli.py#L281-L312

def repos(db_path, usernames, auth, repo, load, readme, readme_html):
    "Save repos owned by the specified (or authenticated) username or organization"
    db = sqlite_utils.Database(db_path)
    token = load_token(auth)
    if load:
        for loaded_repo in json.load(open(load)):
            utils.save_repo(db, loaded_repo)
    else:
        if repo:
            # Just these repos
            for full_name in repo:
                repo_id = utils.save_repo(db, utils.fetch_repo(full_name, token))
                _repo_readme(db, token, repo_id, full_name, readme, readme_html)
        else:
            if not usernames:
                usernames = [None]
            for username in usernames:
                for repo in utils.fetch_all_repos(username, token):
                    repo_id = utils.save_repo(db, repo)
                    _repo_readme(
                        db, token, repo_id, repo["full_name"], readme, readme_html
                    )
    utils.ensure_db_shape(db)


def _repo_readme(db, token, repo_id, full_name, readme, readme_html):
    if readme:
        readme = utils.fetch_readme(token, full_name)
        db["repos"].update(repo_id, {"readme": readme}, alter=True)
    if readme_html:
        readme_html = utils.fetch_readme(token, full_name, html=True)
        db["repos"].update(repo_id, {"readme_html": readme_html}, alter=True)

I'm going to assume there are other reasons to call it and rename the variable to repos_to_fetch.

simonw commented Jan 17, 2024

I'm also going to drop that --force-fetch-readmes flag because I don't know what I'd use it for.

simonw commented Jan 17, 2024

Last question: how best to upgrade the existing records?

I'm worried about rate limits, so I don't want to force-update READMEs for everything.

I think I'll add code which, every time the build script runs, picks the three oldest repos which have camo.githubusercontent.com somewhere in their readme_html (but not in their readme) and adds them to the repos_to_fetch pile.
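A rough sketch of that selection step, assuming "oldest" means least recently pushed and that the repos table keeps a pushed_at column from the API:

# Pick up to three repos whose cached HTML still contains camo URLs but whose
# raw Markdown does not, and queue them for a forced re-fetch
rows = db.execute(
    """
    select full_name from repos
    where readme_html like '%camo.githubusercontent.com%'
      and readme not like '%camo.githubusercontent.com%'
    order by pushed_at
    limit 3
    """
).fetchall()
repos_to_fetch.extend(row[0] for row in rows)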

Then in a few days' time everything should be fixed.

simonw commented Jan 17, 2024

Those build logs included:

2024-01-17T01:53:25.7033961Z Fixing HTML for ['simonw/csvs-to-sqlite', 'simonw/datasette-cluster-map', 'simonw/datasette-leaflet-geojson']
2024-01-17T01:53:25.7035207Z Fetching repo details for ['simonw/datasette-edit-templates', 'simonw/csvs-to-sqlite', 'simonw/datasette-cluster-map', 'simonw/datasette-leaflet-geojson']
2024-01-17T01:53:25.7036136Z Fetching README for simonw/datasette-edit-templates
2024-01-17T01:53:25.7036592Z Fetching README for simonw/csvs-to-sqlite
2024-01-17T01:53:25.7037182Z Fetching README for simonw/datasette-cluster-map
2024-01-17T01:53:25.7037686Z Fetching README for simonw/datasette-leaflet-geojson

But... while these pages have non-broken images they still seem to be proxied through camo.githubusercontent.com:
