Set up link checker #4

Eric-Arellano · 2023-09-06T19:27:35Z

We should make sure that internal links are valid for the state of the docs in the PR, i.e. HEAD. That is, if we reorganize HTML pages, that's fine as long as we update the right links.

External links should be valid no matter what.

Tasks

Give feedback

No tasks being tracked yet.

Options

Eric-Arellano · 2023-10-04T22:19:33Z

Some thoughts before I implement this:

Don't copy over closed source checker

I was hoping to simply duplicate what we do in closed source, which starts up a server, uses a custom web crawler to find all links, and then checks if they're valid.

This won't work well because starting up a server is non-trivial in this repository, since the infrastructure does not live here. Instead, we start up the server via Docker, which can be intimidating for docs contributors and also can be very slow. We want to keep the check easy and fairly fast to run locally.

Use static analysis

Instead of starting a web server and crawling it, statically analyze the MDX and Jupyter notebook files to extract out their links. For internal links, simply check that the file exists. For external links, make a request.

Our Jupyter notebooks use Markdown syntax, so we can use the same Markdown parser between MDX and Jupyter. We'd use a Markdown parsing library for this.

For internal links, simply check that the file exists.

That is advantageous because it's much faster and avoids the flakiness of network connections.

Custom tooling

I wanted to use https://github.com/tcort/markdown-link-check, a popular tool to check links for Markdown. But it has two issues:

Can't handle the Jupyter notebooks
Makes live requests even for internal links, which is much slower

So, instead, we'd use a custom script, similar to how we do it in closed source. But with a focus on maintainability and reusing libraries for e.g. parsing Markdown.

Future improvement: check anchor links

We want to make sure that anchor links work correctly to take you to the right part of the page. That is a big benefit of Sphinx and was one of our concerns when migrating from Sphinx to MDX.

You can't determine that an anchor is valid via a normal HTTP request. But that's okay, because we only prioritize checking that internal links are using anchors correctly.

So, we will first statically map every file to its anchors, and then check that mapping for internal links.

Eric-Arellano · 2023-10-12T19:24:34Z

I talked it over with the original author of the link checker in closed source. We think the static analysis approach is reasonable given the context of this open source project. We'll still have the full link checker from closed source.

I recommend we split the implementation up into stages:

Validate internal links
(maybe) Improve internal links to check for anchors
Validate external links

The priority is internal links. It's fine to do external links in a later follow up.

Eric-Arellano · 2023-10-18T15:20:58Z

Frank is implementing this first part of internal link validation in #173. That leaves the two follow up improvements:

Improve internal links to check for anchors
Validate external links

Some thoughts on this.

External links

Checking external links can be slow since we have to make an HTTP request. The majority of our links are internal, also, so it's lower priority to check external links every time.

We don't want to check external links by default because it's important that npm run check is fast when run locally so that our content writers are encouraged to run it frequently. So, instead we should have a command line argument --external to say to also check external links, like npm run check:links -- --external. We'll want to enable that in CI. We use yargs for CLI args - look in the file for examples.

For the actual checking, you can use fetch. It's important that you de-duplicate all the URLs beforehand to reduce the number of requests. Here's a link checker that I've written before: https://github.com/ParkingReformNetwork/reform-map/blob/f6823ea05c65ff54e6f60153a4581448001c21e7/scripts/brokenLinks.js. Note that you have to set a user agent in the header, and that I used a for loop intentionally to reduce the risk of rate limiting.

Anchor checks

#173 already has a set of all the valid files. We want to improve this mapping to include anchor information. Note that anchors are only for docs/, not for public/images.

I'm not sure the best data structure to store this, such as a Set<string> where each entry is either a full anchor path like my_file#anchor or my_image.png, or something more complex.nested like keeping track of my_file having 5 anchors. I imagine the former is simpler.

The hard part here is needing to extract out the valid anchors from our MDX and Jupyter files. I believe the packages we have in package.json dependencies can help with this, like rehype or remark. But I haven't used them closely.

Bonus credit: "Did you mean"?

If it isn't too hard to implement, it would be really neat to implement a "Did you mean?" functionality. In the past, I've seen people use this algorithm: https://en.wikipedia.org/wiki/Levenshtein_distance.

Set a time box for this, along with a code complexity limit. This feature is "nice to have" but not essential. It's not worth spending a ton of time or adding extremely complex code to add it.

frankharkins · 2023-10-19T11:01:14Z

The hard part here is needing to extract out the valid anchors from our MDX and Jupyter files.

Actually, markdown-link-extractor also extracts all the anchors from the markdown files too. So you should be able to do

markdownLinkExtractor(source).anchors

Maybe we should replace the list of filepath strings with a list of objects

interface File {
  path: string
  links: Link[]
  anchors: string[]
}

Then you could do something like

linkedFile = files.find(file => file.path === link.value)
if linkedFile.anchors.includes(link.anchor) {
   ...

This PR adds a link-checking script to verify internal links work. It should be easy to extend this to check external links, and anchors within internal links. All comments welcome. ### Details Uses `markdown-link-extractor`, which is used by `markdown-link-check` (one of the proposed tools). *** First part of #4

Eric-Arellano · 2023-11-08T18:49:17Z

I split this out into the more granular #305 and #306.

The core link checker has been added and works great thanks to @arnaucasau and @frankharkins. Thanks!

### Summary This PR adds a suggestion of a valid link for every broken link that the link checker finds. This feature only works with internal links and anchors. To calculate the best replacement for a broken link, the function `didYouMean` in `scripts/lib/LinkChecker.ts` uses the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) between the broken link and all the files we have in docs, as it was originally suggested in #4. The package used to implement the Levenshtein distance calculation is [fast-levenshtein](https://www.npmjs.com/package/fast-levenshtein).

This PR adds a link-checking script to verify internal links work. It should be easy to extend this to check external links, and anchors within internal links. All comments welcome. ### Details Uses `markdown-link-extractor`, which is used by `markdown-link-check` (one of the proposed tools). *** First part of #4

### Summary This PR adds a suggestion of a valid link for every broken link that the link checker finds. This feature only works with internal links and anchors. To calculate the best replacement for a broken link, the function `didYouMean` in `scripts/lib/LinkChecker.ts` uses the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) between the broken link and all the files we have in docs, as it was originally suggested in #4. The package used to implement the Levenshtein distance calculation is [fast-levenshtein](https://www.npmjs.com/package/fast-levenshtein).

Eric-Arellano added the infra 🏗️ label Sep 6, 2023

Eric-Arellano added this to the Infra MVP milestone Sep 12, 2023

github-project-automation bot added this to Docs Planning Sep 12, 2023

Eric-Arellano modified the milestones: Infra MVP, Infra MVP fast-follow Oct 2, 2023

Eric-Arellano modified the milestones: Infra MVP fast-follow, mid-oct release Oct 5, 2023

frankharkins mentioned this issue Oct 16, 2023

Add script to check links #173

Merged

Eric-Arellano modified the milestones: mid-oct release, dec release Oct 17, 2023

Eric-Arellano assigned frankharkins and arnaucasau Oct 18, 2023

javabster moved this to In Progress in Docs Planning Nov 6, 2023

This was referenced Nov 8, 2023

Link checker: check external links #305

Closed

Link checker: check internal anchors #306

Closed

Eric-Arellano closed this as completed Nov 8, 2023

github-project-automation bot moved this from In Progress to Done in Docs Planning Nov 8, 2023

arnaucasau mentioned this issue Nov 14, 2023

Link checker: did you mean feature #348

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Set up link checker #4

Set up link checker #4

Eric-Arellano commented Sep 6, 2023 •

edited

Loading

Tasks

Eric-Arellano commented Oct 4, 2023 •

edited

Loading

Eric-Arellano commented Oct 12, 2023

Eric-Arellano commented Oct 18, 2023 •

edited

Loading

frankharkins commented Oct 19, 2023

Eric-Arellano commented Nov 8, 2023

Set up link checker #4

Set up link checker #4

Comments

Eric-Arellano commented Sep 6, 2023 • edited Loading

Tasks

Eric-Arellano commented Oct 4, 2023 • edited Loading

Don't copy over closed source checker

Use static analysis

Custom tooling

Future improvement: check anchor links

Eric-Arellano commented Oct 12, 2023

Eric-Arellano commented Oct 18, 2023 • edited Loading

External links

Anchor checks

Bonus credit: "Did you mean"?

frankharkins commented Oct 19, 2023

Eric-Arellano commented Nov 8, 2023

Eric-Arellano commented Sep 6, 2023 •

edited

Loading

Eric-Arellano commented Oct 4, 2023 •

edited

Loading

Eric-Arellano commented Oct 18, 2023 •

edited

Loading