Set up link checker #4
Some thoughts before I implement this:

### Don't copy over the closed source checker

I was hoping to simply duplicate what we do in closed source, which starts up a server, uses a custom web crawler to find all links, and then checks if they're valid. This won't work well because starting up a server is non-trivial in this repository, since the infrastructure does not live here. Instead, we start up the server via Docker, which can be intimidating for docs contributors and can also be very slow. We want to keep the check easy and fairly fast to run locally.

### Use static analysis

Instead of starting a web server and crawling it, statically analyze the MDX and Jupyter notebook files to extract their links. For internal links, simply check that the file exists. For external links, make a request. Our Jupyter notebooks use Markdown syntax, so we can use the same Markdown parser for both MDX and Jupyter. We'd use a Markdown parsing library for this.

This is advantageous because it's much faster and avoids the flakiness of network connections.

### Custom tooling

I wanted to use https://github.com/tcort/markdown-link-check, a popular tool to check links for Markdown. But it has two issues:

So, instead, we'd use a custom script, similar to how we do it in closed source, but with a focus on maintainability and on reusing libraries for e.g. parsing Markdown.

### Future improvement: check anchor links

We want to make sure that anchor links work correctly to take you to the right part of the page. That is a big benefit of Sphinx and was one of our concerns when migrating from Sphinx to MDX. You can't determine that an anchor is valid via a normal HTTP request. But that's okay, because we only prioritize checking that internal links are using anchors correctly. So, we will first statically map every file to its anchors, and then check that mapping for internal links.
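A minimal sketch of what the static check could look like, assuming `markdown-link-extractor` (in the version that returns `{ links, anchors }`) is the parsing library and that internal links resolve directly to files under a docs root; the function name and path handling here are illustrative only:

```ts
// Illustrative sketch only: statically extract links from one Markdown/MDX
// source string and report internal links whose target file does not exist.
import { existsSync } from "fs";
import { join } from "path";
import markdownLinkExtractor from "markdown-link-extractor";

function findBrokenInternalLinks(source: string, docsRoot: string): string[] {
  // markdown-link-extractor returns the raw link targets found in the Markdown.
  const { links } = markdownLinkExtractor(source);
  const internalLinks = links.filter((link: string) => !link.startsWith("http"));
  // For internal links, "valid" simply means the referenced file exists on disk.
  return internalLinks.filter((link: string) => !existsSync(join(docsRoot, link)));
}
```

External links would instead need an HTTP request, which is why they are slower and handled separately.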
I talked it over with the original author of the link checker in closed source. We think the static analysis approach is reasonable given the context of this open source project. We'll still have the full link checker from closed source. I recommend we split the implementation up into stages:
The priority is internal links. It's fine to do external links in a later follow-up.
Frank is implementing this first part of internal link validation in #173. That leaves the two follow-up improvements:
Some thoughts on this.

### External links

Checking external links can be slow since we have to make an HTTP request. Also, the majority of our links are internal, so it's lower priority to check external links every time. We don't want to check external links by default because it's important that

For the actual checking, you can use

### Anchor checks

#173 already has a set of all the valid files. We want to improve this mapping to include anchor information. Note that anchors are only for

I'm not sure of the best data structure to store this, such as a

The hard part here is needing to extract the valid anchors from our MDX and Jupyter files. I believe the packages we have in

### Bonus credit: "Did you mean"?

If it isn't too hard to implement, it would be really neat to implement a "Did you mean?" functionality. In the past, I've seen people use this algorithm: https://en.wikipedia.org/wiki/Levenshtein_distance.

Set a time box for this, along with a code complexity limit. This feature is "nice to have" but not essential. It's not worth spending a ton of time or adding extremely complex code to add it.
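Purely as an illustration of the anchor mapping discussed above (the type and function names are hypothetical), the data could be as simple as a map from file path to the set of anchors that file defines:

```ts
// Hypothetical sketch: map each file path to the set of anchors it defines.
type AnchorMap = Map<string, Set<string>>;

function isValidInternalLink(
  anchorMap: AnchorMap,
  filePath: string,
  anchor?: string,
): boolean {
  const anchors = anchorMap.get(filePath);
  if (anchors === undefined) return false; // the linked file does not exist
  if (anchor === undefined) return true;   // a plain link only needs the file
  return anchors.has(anchor);              // the anchor must exist in that file
}
```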
Actually, `markdownLinkExtractor(source).anchors` already gives us the anchors.

Maybe we should replace the list of filepath strings with a list of objects:

```ts
interface File {
  path: string;
  links: Link[];
  anchors: string[];
}
```

Then you could do something like:

```ts
const linkedFile = files.find((file) => file.path === link.value);
if (linkedFile?.anchors.includes(link.anchor)) {
  ...
}
```
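One nice property of this shape is that each file only needs to be read and parsed once: the same pass that collects a file's outgoing links can record the anchors it defines, and validating an internal link afterwards is just a lookup.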
This PR adds a link-checking script to verify internal links work. It should be easy to extend this to check external links, and anchors within internal links. All comments welcome.

### Details

Uses `markdown-link-extractor`, which is used by `markdown-link-check` (one of the proposed tools).

***

First part of #4
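For the Jupyter side, one way this can work (a sketch under the assumption that we parse the notebook JSON directly, not necessarily what the script does) is to concatenate the notebook's Markdown cells and feed them through the same extractor used for MDX:

```ts
// Hypothetical sketch: pull the Markdown out of an .ipynb file so the same
// link extraction used for .mdx files can be reused.
import { readFileSync } from "fs";
import markdownLinkExtractor from "markdown-link-extractor";

function extractNotebookLinks(notebookPath: string): string[] {
  const notebook = JSON.parse(readFileSync(notebookPath, "utf-8"));
  // nbformat stores each cell's source as an array of lines (or a string).
  const markdown = notebook.cells
    .filter((cell: { cell_type: string }) => cell.cell_type === "markdown")
    .map((cell: { source: string | string[] }) =>
      Array.isArray(cell.source) ? cell.source.join("") : cell.source,
    )
    .join("\n\n");
  const { links } = markdownLinkExtractor(markdown);
  return links;
}
```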
I split this out into the more granular #305 and #306. The core link checker has been added and works great, thanks to @arnaucasau and @frankharkins. Thanks!
### Summary

This PR adds a suggestion of a valid link for every broken link that the link checker finds. This feature only works with internal links and anchors. To calculate the best replacement for a broken link, the function `didYouMean` in `scripts/lib/LinkChecker.ts` uses the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) between the broken link and all the files we have in docs, as originally suggested in #4. The package used to implement the Levenshtein distance calculation is [fast-levenshtein](https://www.npmjs.com/package/fast-levenshtein).
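As an illustration of the approach (not the exact code in `scripts/lib/LinkChecker.ts`), a `didYouMean`-style helper built on `fast-levenshtein` could look like this; the distance threshold is an arbitrary choice for the sketch:

```ts
// Hypothetical sketch: suggest the existing file path closest to a broken link,
// using fast-levenshtein's get() to measure edit distance.
import levenshtein from "fast-levenshtein";

function didYouMean(brokenLink: string, existingPaths: string[]): string | undefined {
  let best: string | undefined;
  let bestDistance = Infinity;
  for (const candidate of existingPaths) {
    const distance = levenshtein.get(brokenLink, candidate);
    if (distance < bestDistance) {
      bestDistance = distance;
      best = candidate;
    }
  }
  // Only suggest something reasonably close to the broken link.
  return bestDistance <= brokenLink.length / 2 ? best : undefined;
}
```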
We should make sure that internal links are valid for the state of the docs in the PR, i.e. HEAD. That is, if we reorganize HTML pages, that's fine as long as we update the right links.
External links should be valid no matter what.
Tasks