
Broken images on some plugin pages #156

Open
simonw opened this issue Jan 16, 2024 · 16 comments
Labels
bug Something isn't working

Comments

simonw commented Jan 16, 2024

https://datasette.io/plugins/datasette-atom

(screenshot: CleanShot 2024-01-16 at 08 37 55@2x)

simonw added the bug label on Jan 16, 2024

simonw commented Jan 16, 2024

Possibly relevant:

Those talk about timeouts though, which I think isn't the same thing as this signature problem.

simonw commented Jan 16, 2024

I'm going to force a re-build of the README for datasette-atom to see if I can fix that one as a one-off. Here's the logic that's causing it not to fetch a fresh copy:

# Fetch README for any repos that have changed since last time
repos_to_fetch_readme_for = []
for row in db["datasette_repos"].rows:
    if (
        row["latest_commit"] != previous_hashes.get(row["nameWithOwner"])
        or force_fetch_readmes
    ):
        repos_to_fetch_readme_for.append(row["nameWithOwner"])

simonw added a commit that referenced this issue Jan 16, 2024

simonw commented Jan 16, 2024

That did fix it:

https://datasette.io/plugins/datasette-atom

(screenshot: CleanShot 2024-01-16 at 09 03 09@2x)

So I think the fix may be to change the logic about fetching README files so it fetches if the repo has been updated OR if the README was last fetched more than X days ago. Not sure what value to use for X.
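A minimal sketch of that combined check, reusing the db and previous_hashes names from the snippet above and assuming a hypothetical readme_fetched_at column (stored as an ISO 8601 UTC timestamp) plus a placeholder 30-day cutoff for X:

from datetime import datetime, timedelta, timezone

README_MAX_AGE = timedelta(days=30)  # placeholder value for "X days"

repos_to_fetch_readme_for = []
for row in db["datasette_repos"].rows:
    changed = row["latest_commit"] != previous_hashes.get(row["nameWithOwner"])
    fetched_at = row.get("readme_fetched_at")  # hypothetical column, ISO 8601 UTC
    stale = fetched_at is None or (
        datetime.now(timezone.utc)
        - datetime.fromisoformat(fetched_at).replace(tzinfo=timezone.utc)
    ) > README_MAX_AGE
    if changed or stale:
        repos_to_fetch_readme_for.append(row["nameWithOwner"])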

simonw commented Jan 16, 2024

Found some relevant documentation.

https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/about-anonymized-urls

To host your images, GitHub uses the open-source project Camo. Camo generates an anonymous URL proxy for each file which hides your browser details and related information from other users. The URL starts https://<subdomain>.githubusercontent.com/, with different subdomains depending on how you uploaded the image.

https://github.blog/2014-01-28-proxying-user-images/

A while back, we started proxying all non-https images to avoid mixed-content warnings using a custom node server called camo. We're making a small change today and proxying HTTPS images as well.

Proxying these images will help protect your privacy: your browser information won't be leaked to other third party services.

https://github.com/atmos/camo was archived in April 2021.

simonw commented Jan 16, 2024

Another clue:

curl -i 'https://camo.githubusercontent.com/42a02d342f39e3dc05195df216b9f89fa84de19d94547b4c68a02ec12f0eaf2d/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f762f72656c656173652f73696d6f6e772f6461746173657474652d61746f6d3f696e636c7564655f70726572656c6561736573266c6162656c3d6368616e67656c6f67'
HTTP/2 403 
cache-control: no-cache, no-store, private, must-revalidate
content-security-policy: default-src 'none'; img-src data:; style-src 'unsafe-inline'
content-type: text/plain; charset=utf-8
server: github-camo (2d97ca31)
strict-transport-security: max-age=31536000; includeSubDomains
x-content-type-options: nosniff
x-frame-options: deny
x-xss-protection: 1; mode=block
x-github-request-id: F714:6D3F:387FC9:493449:65A6B820
accept-ranges: bytes
date: Tue, 16 Jan 2024 17:08:53 GMT
via: 1.1 varnish
x-served-by: cache-pao-kpao1770040-PAO
x-cache: MISS
x-cache-hits: 0
x-timer: S1705424933.064491,VS0,VE117
x-fastly-request-id: db6d3e2546b29d5a0aa986245ef0816997e820aa
timing-allow-origin: https://github.com
content-length: 14

Bad Signature

Note the server: github-camo (2d97ca31) header.
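As an aside, the second path segment of a camo URL is just the hex-encoded target URL, so it can be decoded to confirm which image the failing proxy URL points at (a quick sketch, not part of the build script):

from urllib.parse import urlparse

camo_url = (
    "https://camo.githubusercontent.com/42a02d342f39e3dc05195df216b9f89fa84de19d94547b4c68a02ec12f0eaf2d/"
    "68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f762f72656c656173652f73696d6f6e772f"
    "6461746173657474652d61746f6d3f696e636c7564655f70726572656c6561736573266c6162656c3d6368616e67656c6f67"
)
# Path is /<signature>/<hex-encoded target URL>
signature, encoded = urlparse(camo_url).path.lstrip("/").split("/")
print(bytes.fromhex(encoded).decode("utf-8"))
# https://img.shields.io/github/v/release/simonw/datasette-atom?include_prereleases&label=changelog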

simonw commented Jan 16, 2024

I think GitHub upgraded from that Node camo to this alternative written in Go at some point, because I found this in the source code for the Go one: https://github.com/cactus/go-camo/blob/11a57d9fcc47ad5f3de37b733a03f023edf4007c/pkg/camo/proxy.go#L102-L106

	sURL, ok := encoding.DecodeURL(p.config.HMACKey, sigHash, encodedURL)
	if !ok {
		http.Error(w, "Bad Signature", http.StatusForbidden)
		return
	}

simonw commented Jan 16, 2024

Having seen that code, my hunch is that GitHub changed their HMACKey at some point which broke my previously cached images.
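For reference, this is roughly how camo-style URLs are generated: the first path segment is an HMAC of the target URL under a server-side secret, so rotating that secret invalidates every previously generated URL. A toy sketch of the scheme (HMAC-SHA1 as in the original camo project; GitHub's exact parameters may differ):

import hashlib
import hmac

def camo_style_url(hmac_key: bytes, target_url: str, host: str = "camo.example.com") -> str:
    # Signature is an HMAC of the target URL; the URL itself is hex-encoded
    digest = hmac.new(hmac_key, target_url.encode("utf-8"), hashlib.sha1).hexdigest()
    return "https://{}/{}/{}".format(host, digest, target_url.encode("utf-8").hex())

url = "https://img.shields.io/github/v/release/simonw/datasette-atom"
# Same target URL, different key -> different signature, so URLs cached under
# the old key fail verification ("Bad Signature") after the key changes.
print(camo_style_url(b"old-key", url) == camo_style_url(b"new-key", url))  # False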

In the DB at the moment I store both the raw markdown and the rendered HTML: https://datasette.io/content/repos/209091256

(screenshot: CleanShot 2024-01-16 at 09 41 57@2x)

That shows that using GitHub's Markdown API doesn't produce the camo.githubusercontent.com proxied URLs.

In my code I'm calling out to github-to-sqlite, which does this:

https://github.com/dogsheep/github-to-sqlite/blob/eaef8ffd3f46be6c26062237ed88b4c2202a1c44/github_to_sqlite/utils.py#L785-L796

def fetch_readme(token, full_name, html=False):
    headers = make_headers(token)
    if html:
        headers["accept"] = "application/vnd.github.VERSION.html"
    url = "https://api.github.com/repos/{}/readme".format(full_name)
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return None
    if html:
        return rewrite_readme_html(response.text)
    else:
        return base64.b64decode(response.json()["content"]).decode("utf-8")

Switching to rendering through the Markdown API would fix this by removing camo.githubusercontent.com entirely from how https://datasette.io/ works.

simonw commented Jan 16, 2024

if repos_to_fetch_readme_for:
    print("Fetching README for {}".format(repos_to_fetch_readme_for))
    github_to_sqlite_repos.callback(
        db_filename,
        usernames=[],
        auth="auth.json",
        repo=repos_to_fetch_readme_for,
        load=None,
        readme=True,
        readme_html=True,
    )

I could flip readme_html to False and then have a separate Markdown rendering step.

simonw commented Jan 16, 2024

Probably easiest to cut github-to-sqlite out entirely here and instead do a fetch to https://docs.github.com/en/rest/repos/contents?apiVersion=2022-11-28#get-a-repository-readme to get the raw markdown, then send that to the https://api.github.com/markdown endpoint to render it. https://til.simonwillison.net/markdown/github-markdown-api

simonw commented Jan 16, 2024

Pieces I need to fix this:

curl 'https://api.github.com/repos/simonw/datasette/readme' -H 'Accept: application/vnd.github.raw'

To get back the raw Markdown.

import httpx

body = """
# Example

This is example Markdown
"""

response = httpx.post(
    "https://api.github.com/markdown",
    json={
        # mode=gfm would expand #13 issue links, provided you pass
        # context=simonw/datasette too
        "mode": "markdown",
        "text": body,
    },
    headers=headers,  # assumed to include an Authorization: Bearer <token> header
)
if response.status_code == 200:
    markdown_as_html = response.text

simonw commented Jan 16, 2024

Looking again at this code:

    if repos_to_fetch_readme_for:
        print("Fetching README for {}".format(repos_to_fetch_readme_for))
        github_to_sqlite_repos.callback(
            db_filename,
            usernames=[],
            auth="auth.json",
            repo=repos_to_fetch_readme_for,
            load=None,
            readme=True,
            readme_html=True,
        )

Is it JUST fetching the READMEs, or is it also populating the database with other important information?

If it's pulling other key information too, then I should leave that in there but rename repos_to_fetch_readme_for to repos_to_fetch_details_for, set readme=False and readme_html=False, and then roll my own separate code to fetch and store the READMEs.
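A sketch of what that separate README step could look like, stitching together the two API calls from the earlier comments (the function name is a placeholder and headers is assumed to carry a GitHub token):

import httpx

def fetch_and_render_readme(full_name, headers):
    # Fetch the raw README Markdown directly, avoiding the HTML endpoint
    # that produces camo.githubusercontent.com image URLs
    readme_response = httpx.get(
        "https://api.github.com/repos/{}/readme".format(full_name),
        headers=dict(headers, accept="application/vnd.github.raw"),
    )
    if readme_response.status_code != 200:
        return None, None
    markdown = readme_response.text
    # Render that Markdown to HTML with the Markdown API
    html_response = httpx.post(
        "https://api.github.com/markdown",
        json={"mode": "markdown", "text": markdown},
        headers=headers,
    )
    html = html_response.text if html_response.status_code == 200 else None
    return markdown, html

The results could then be written back with db["repos"].update(repo_id, {"readme": markdown, "readme_html": html}, alter=True), the same way _repo_readme does.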

simonw commented Jan 17, 2024

Here's the implementation of repos(): https://github.com/dogsheep/github-to-sqlite/blob/eaef8ffd3f46be6c26062237ed88b4c2202a1c44/github_to_sqlite/cli.py#L281-L312

def repos(db_path, usernames, auth, repo, load, readme, readme_html):
    "Save repos owned by the specified (or authenticated) username or organization"
    db = sqlite_utils.Database(db_path)
    token = load_token(auth)
    if load:
        for loaded_repo in json.load(open(load)):
            utils.save_repo(db, loaded_repo)
    else:
        if repo:
            # Just these repos
            for full_name in repo:
                repo_id = utils.save_repo(db, utils.fetch_repo(full_name, token))
                _repo_readme(db, token, repo_id, full_name, readme, readme_html)
        else:
            if not usernames:
                usernames = [None]
            for username in usernames:
                for repo in utils.fetch_all_repos(username, token):
                    repo_id = utils.save_repo(db, repo)
                    _repo_readme(
                        db, token, repo_id, repo["full_name"], readme, readme_html
                    )
    utils.ensure_db_shape(db)


def _repo_readme(db, token, repo_id, full_name, readme, readme_html):
    if readme:
        readme = utils.fetch_readme(token, full_name)
        db["repos"].update(repo_id, {"readme": readme}, alter=True)
    if readme_html:
        readme_html = utils.fetch_readme(token, full_name, html=True)
        db["repos"].update(repo_id, {"readme_html": readme_html}, alter=True)

I'm going to assume there are other reasons to call it and rename the variable to repos_to_fetch.

simonw commented Jan 17, 2024

I'm also going to drop that --force-fetch-readmes flag because I don't know what I'd use it for.

simonw commented Jan 17, 2024

Last question: how best to upgrade the existing records?

I'm worried about rate limits, so I don't want to force-update READMEs for everything.

I think I'll add code which, every time the build script runs, picks the three oldest repos which have camo.githubusercontent.com somewhere in their readme_html (but not in their readme) and adds them to the repos_to_fetch pile.
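A rough sketch of that selection step, assuming "oldest" means least recently pushed and that the repos table keeps a pushed_at column from the API:

# Pick up to three repos whose cached HTML still contains camo URLs but whose
# raw Markdown does not, and queue them for a forced re-fetch
rows = db.execute(
    """
    select full_name from repos
    where readme_html like '%camo.githubusercontent.com%'
      and readme not like '%camo.githubusercontent.com%'
    order by pushed_at
    limit 3
    """
).fetchall()
repos_to_fetch.extend(row[0] for row in rows)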

Then in a few days' time everything should be fixed.

simonw commented Jan 17, 2024

Those build logs included:

2024-01-17T01:53:25.7033961Z Fixing HTML for ['simonw/csvs-to-sqlite', 'simonw/datasette-cluster-map', 'simonw/datasette-leaflet-geojson']
2024-01-17T01:53:25.7035207Z Fetching repo details for ['simonw/datasette-edit-templates', 'simonw/csvs-to-sqlite', 'simonw/datasette-cluster-map', 'simonw/datasette-leaflet-geojson']
2024-01-17T01:53:25.7036136Z Fetching README for simonw/datasette-edit-templates
2024-01-17T01:53:25.7036592Z Fetching README for simonw/csvs-to-sqlite
2024-01-17T01:53:25.7037182Z Fetching README for simonw/datasette-cluster-map
2024-01-17T01:53:25.7037686Z Fetching README for simonw/datasette-leaflet-geojson

But... while these pages have non-broken images they still seem to be proxied through camo.githubusercontent.com:
