Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set up link checker #4

Closed
Eric-Arellano opened this issue Sep 6, 2023 · 5 comments
Closed

Set up link checker #4

Eric-Arellano opened this issue Sep 6, 2023 · 5 comments

Comments

@Eric-Arellano
Copy link
Collaborator

Eric-Arellano commented Sep 6, 2023

We should make sure that internal links are valid for the state of the docs in the PR, i.e. HEAD. That is, if we reorganize HTML pages, that's fine as long as we update the right links.

External links should be valid no matter what.

Tasks

Preview Give feedback
No tasks being tracked yet.
@Eric-Arellano Eric-Arellano added this to the Infra MVP milestone Sep 12, 2023
@Eric-Arellano Eric-Arellano modified the milestones: Infra MVP, Infra MVP fast-follow Oct 2, 2023
@Eric-Arellano
Copy link
Collaborator Author

Eric-Arellano commented Oct 4, 2023

Some thoughts before I implement this:

Don't copy over closed source checker

I was hoping to simply duplicate what we do in closed source, which starts up a server, uses a custom web crawler to find all links, and then checks if they're valid.

This won't work well because starting up a server is non-trivial in this repository, since the infrastructure does not live here. Instead, we start up the server via Docker, which can be intimidating for docs contributors and also can be very slow. We want to keep the check easy and fairly fast to run locally.

Use static analysis

Instead of starting a web server and crawling it, statically analyze the MDX and Jupyter notebook files to extract out their links. For internal links, simply check that the file exists. For external links, make a request.

Our Jupyter notebooks use Markdown syntax, so we can use the same Markdown parser between MDX and Jupyter. We'd use a Markdown parsing library for this.

For internal links, simply check that the file exists.

That is advantageous because it's much faster and avoids the flakiness of network connections.

Custom tooling

I wanted to use https://github.com/tcort/markdown-link-check, a popular tool to check links for Markdown. But it has two issues:

  1. Can't handle the Jupyter notebooks
  2. Makes live requests even for internal links, which is much slower

So, instead, we'd use a custom script, similar to how we do it in closed source. But with a focus on maintainability and reusing libraries for e.g. parsing Markdown.

Future improvement: check anchor links

We want to make sure that anchor links work correctly to take you to the right part of the page. That is a big benefit of Sphinx and was one of our concerns when migrating from Sphinx to MDX.

You can't determine that an anchor is valid via a normal HTTP request. But that's okay, because we only prioritize checking that internal links are using anchors correctly.

So, we will first statically map every file to its anchors, and then check that mapping for internal links.

@Eric-Arellano Eric-Arellano modified the milestones: Infra MVP fast-follow, mid-oct release Oct 5, 2023
@Eric-Arellano
Copy link
Collaborator Author

I talked it over with the original author of the link checker in closed source. We think the static analysis approach is reasonable given the context of this open source project. We'll still have the full link checker from closed source.

I recommend we split the implementation up into stages:

  1. Validate internal links
  2. (maybe) Improve internal links to check for anchors
  3. Validate external links

The priority is internal links. It's fine to do external links in a later follow up.

@Eric-Arellano
Copy link
Collaborator Author

Eric-Arellano commented Oct 18, 2023

Frank is implementing this first part of internal link validation in #173. That leaves the two follow up improvements:

  1. Improve internal links to check for anchors
  2. Validate external links

Some thoughts on this.

External links

Checking external links can be slow since we have to make an HTTP request. The majority of our links are internal, also, so it's lower priority to check external links every time.

We don't want to check external links by default because it's important that npm run check is fast when run locally so that our content writers are encouraged to run it frequently. So, instead we should have a command line argument --external to say to also check external links, like npm run check:links -- --external. We'll want to enable that in CI. We use yargs for CLI args - look in the file for examples.

For the actual checking, you can use fetch. It's important that you de-duplicate all the URLs beforehand to reduce the number of requests. Here's a link checker that I've written before: https://github.com/ParkingReformNetwork/reform-map/blob/f6823ea05c65ff54e6f60153a4581448001c21e7/scripts/brokenLinks.js. Note that you have to set a user agent in the header, and that I used a for loop intentionally to reduce the risk of rate limiting.

Anchor checks

#173 already has a set of all the valid files. We want to improve this mapping to include anchor information. Note that anchors are only for docs/, not for public/images.

I'm not sure the best data structure to store this, such as a Set<string> where each entry is either a full anchor path like my_file#anchor or my_image.png, or something more complex.nested like keeping track of my_file having 5 anchors. I imagine the former is simpler.

The hard part here is needing to extract out the valid anchors from our MDX and Jupyter files. I believe the packages we have in package.json dependencies can help with this, like rehype or remark. But I haven't used them closely.

Bonus credit: "Did you mean"?

If it isn't too hard to implement, it would be really neat to implement a "Did you mean?" functionality. In the past, I've seen people use this algorithm: https://en.wikipedia.org/wiki/Levenshtein_distance.

Set a time box for this, along with a code complexity limit. This feature is "nice to have" but not essential. It's not worth spending a ton of time or adding extremely complex code to add it.

@frankharkins
Copy link
Member

The hard part here is needing to extract out the valid anchors from our MDX and Jupyter files.

Actually, markdown-link-extractor also extracts all the anchors from the markdown files too. So you should be able to do

markdownLinkExtractor(source).anchors

Maybe we should replace the list of filepath strings with a list of objects

interface File {
  path: string
  links: Link[]
  anchors: string[]
}

Then you could do something like

linkedFile = files.find(file => file.path === link.value)
if linkedFile.anchors.includes(link.anchor) {
   ...

Eric-Arellano pushed a commit that referenced this issue Oct 19, 2023
This PR adds a link-checking script to verify internal links work. It
should be easy to extend this to check external links, and anchors
within internal links. All comments welcome.

### Details

Uses `markdown-link-extractor`, which is used by `markdown-link-check`
(one of the proposed tools).

***

First part of #4
@javabster javabster moved this to In Progress in Docs Planning Nov 6, 2023
@Eric-Arellano
Copy link
Collaborator Author

I split this out into the more granular #305 and #306.

The core link checker has been added and works great thanks to @arnaucasau and @frankharkins. Thanks!

@github-project-automation github-project-automation bot moved this from In Progress to Done in Docs Planning Nov 8, 2023
arnaucasau added a commit that referenced this issue Nov 16, 2023
### Summary

This PR adds a suggestion of a valid link for every broken link that the
link checker finds. This feature only works with internal links and
anchors.

To calculate the best replacement for a broken link, the function
`didYouMean` in `scripts/lib/LinkChecker.ts` uses the [Levenshtein
distance](https://en.wikipedia.org/wiki/Levenshtein_distance) between
the broken link and all the files we have in docs, as it was originally
suggested in #4.

The package used to implement the Levenshtein distance calculation is
[fast-levenshtein](https://www.npmjs.com/package/fast-levenshtein).
frankharkins added a commit that referenced this issue Jul 22, 2024
This PR adds a link-checking script to verify internal links work. It
should be easy to extend this to check external links, and anchors
within internal links. All comments welcome.

### Details

Uses `markdown-link-extractor`, which is used by `markdown-link-check`
(one of the proposed tools).

***

First part of #4
frankharkins pushed a commit that referenced this issue Jul 22, 2024
### Summary

This PR adds a suggestion of a valid link for every broken link that the
link checker finds. This feature only works with internal links and
anchors.

To calculate the best replacement for a broken link, the function
`didYouMean` in `scripts/lib/LinkChecker.ts` uses the [Levenshtein
distance](https://en.wikipedia.org/wiki/Levenshtein_distance) between
the broken link and all the files we have in docs, as it was originally
suggested in #4.

The package used to implement the Levenshtein distance calculation is
[fast-levenshtein](https://www.npmjs.com/package/fast-levenshtein).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

No branches or pull requests

3 participants