
Support checking local file links #21

Closed
untitaker opened this issue Oct 24, 2020 · 34 comments · Fixed by #262
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@untitaker
Collaborator

#15 implemented relative URLs; however, a simple test shows that it is not working:

$ echo '<a href="./foo.html">Broken</a>' > index.html
$ GITHUB_TOKEN= lychee index.html -pv

📝Summary
-------------------
🔍Found: 0
👻Excluded: 0
✅Successful: 0
🚫Errors: 0

Expected to find a 404 because file foo.html does not exist.

@mre
Member

mre commented Oct 25, 2020

You need to pass a base-url parameter like so:
GITHUB_TOKEN= lychee index.html -pv --base-url "http://example.com"

Would that cover your use-case?
We could add support for checking local files as well if that helps.

@untitaker
Collaborator Author

Ah, so it is always going to do network I/O? OK, that clarifies that. I just assumed it would do local file checking because the input is already a local file (i.e. work closer to how liche works). I think I'll just run a local HTTP server for CI like I did with the others and pass a URL.

As for use cases: I am only doing benchmark comparisons across a bunch of link checkers. My use case is checking the internal links of the HTML generated by a static site generator as part of CI, but I wrote my own link checker for that, whose small feature set, incredibly, seems to be the exact inverse of lychee's. It just shows that "link checker" is an overloaded term.

Two pieces of feedback, but keep in mind that those don't come from a "real user":

  • Right now GITHUB_TOKEN is a required env var, but the readme does not call that out. I don't have GitHub links to check, though, so I think it should be optional.
  • It would be good to document --base-url in the readme as required for relative URLs (or perhaps warn in-app when there are relative links that could not be resolved because --base-url is not set).

@mre
Member

mre commented Oct 25, 2020

Good points!

Right now GITHUB_TOKEN is a required env var, but the readme does not call that out. I don't have GitHub links to check, though, so I think it should be optional.

This is tackled by @pawroman in #22. 🎉

It would be good to document --base-url in the readme as required for relative URLs (or perhaps warn in-app when there are relative links that could not be resolved because --base-url is not set).

True. Come to think of it, checking relative URLs when a file name is given might make sense as the default. Wonder what @whizsid and @pawroman think about it.

@untitaker
Collaborator Author

Liche does this thing where it aborts link checking entirely when it encounters a host-relative URL while parsing a local file and base-url is not set. I have to write this for anything to work:

liche -r deploy/ --document-root deploy/ -x https?://*

which is kind of silly, but it does allow those arguments to diverge, and still lets liche resolve /foo/bar hrefs (host-relative but path-absolute) correctly:

liche -r deploy/specific/sub/page.html --document-root deploy/ -x https?://

I think a hybrid where --document-root defaults to the parent path of the file given would be good, so I can write both:

lychee deploy/index.html
lychee deploy/specific/sub/page.html --base-url deploy/

(The -x prevents liche from checking external links. I personally want this because, above all, the check should not be flaky, as it runs in CI. It's another kind of use case that I find most link checkers don't really cater to well, or make ergonomic.)
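
A minimal sketch of that defaulting rule (the helper name is hypothetical; nothing like this exists in lychee yet):

use std::path::{Path, PathBuf};

/// If no --base-url/--document-root was given, fall back to the
/// parent directory of the input file.
fn effective_document_root(input_file: &Path, explicit_root: Option<PathBuf>) -> PathBuf {
    explicit_root.unwrap_or_else(|| {
        input_file
            .parent()
            .map(Path::to_path_buf)
            .unwrap_or_else(|| PathBuf::from("."))
    })
}

// lychee deploy/index.html                                -> root = "deploy"
// lychee deploy/specific/sub/page.html --base-url deploy/ -> root = "deploy/"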

@untitaker
Collaborator Author

Also, I just realized that lychee does not support recursive traversal of links at all, or am I wrong? I had to combine find, xargs, and lychee in godforsaken ways, but obviously that does not perform very well...

@mre
Copy link
Member

mre commented Oct 25, 2020

Nope, recursion is still on the todo list (see readme). My idea was to use channels for handling in-flight checks; do you use the same in hyperlink?
I do like your other ideas.

@untitaker
Collaborator Author

untitaker commented Oct 25, 2020

hyperlink can only crawl directories and resolve relative/internal links (no HTTP support at all). As such, it walks the entire directory tree first using walkdir/jwalk, then parallelizes parsing of all found files with Rayon. No graph traversal over hrefs, just directory walking.

I honestly don't know enough about post-futures-0.1 async, but I suppose you could do something with the mpsc channel? You'd have a single consumer that is in charge of spawning and deduplicating fetch tasks, and each fetch task gets a producer.
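
A rough sketch of that shape with tokio's mpsc channel (all names are hypothetical, and a real implementation would also need error handling and politeness limits):

use std::collections::HashSet;
use tokio::sync::mpsc;

// Hypothetical: fetch `url` and return the links found in the body.
async fn fetch_and_extract(url: String) -> Vec<String> {
    let _ = url;
    Vec::new()
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel::<Vec<String>>();
    let mut seen: HashSet<String> = HashSet::new();
    let mut in_flight = 0usize;

    // Dedup first, then spawn a fetch task that gets its own producer handle.
    let mut enqueue = |url: String, in_flight: &mut usize| {
        if seen.insert(url.clone()) {
            *in_flight += 1;
            let tx = tx.clone();
            tokio::spawn(async move {
                // Each task reports the links it discovered back to the consumer.
                let _ = tx.send(fetch_and_extract(url).await);
            });
        }
    };

    enqueue("https://example.com/".to_string(), &mut in_flight);

    // Single consumer: owns the dedup set and decides what to spawn next.
    while in_flight > 0 {
        if let Some(links) = rx.recv().await {
            in_flight -= 1;
            for link in links {
                enqueue(link, &mut in_flight);
            }
        }
    }
}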

@mre
Member

mre commented Oct 25, 2020

I honestly don't know enough about post-futures-0.1 async, but I suppose you could do something with the mpsc channel? You'd have a single consumer that is in charge of spawning and deduplicating fetch tasks, and each fetch task gets a producer.

Precisely. Would love a comparison between async and rayon for that use case. My guess is that perf would be similar, while async would use slightly fewer resources when dealing with many tasks. Then again, threads are super optimized on modern hardware as well, so there might be no difference after all.

@untitaker
Collaborator Author

untitaker commented Oct 25, 2020

I did see that liche, which uses goroutines, is slightly faster than hyperlink on folders with >1 million files, but because liche spawned over 200 threads on my Mac (while hyperlink spawned 12), I also assumed it effectively used a thread pool rather than actual async syscalls for file I/O.

I picked basic threading over tokio because I thought that outside of networking (I didn't care about HTTP at all) it would not make a difference, and also because it definitely seems I can reuse buffers much more efficiently.

I just googled some stuff, and apparently there are async syscalls for reading files on macOS.

Anyway, it would be a massive pain to support external links in hyperlink so I probably painted myself into a corner already.

@untitaker
Collaborator Author

Precisely. Would love a comparison between async and rayon for that use case.

I am planning to do semi-serious benchmarks across a bunch of link checkers. I haven't figured out how yet.

Right now my approach to informal testing is to set up a random link checker, disable most of its features so that the feature set roughly matches the few things hyperlink can do, then compare them on https://github.com/getsentry/sentry-docs and some synthetic stuff. Unfortunately there are two problems with this:

  • I am disabling so many features of those other link checkers that I am kind of comparing apples with oranges. Lychee in particular looks like it would generally just serve very different use cases.
  • For link checkers that only support HTTP, I need to make sure the checker is not bottlenecked on the local HTTP server that serves up what hyperlink would access directly. Figuring that out is not easy without distorting results further. Also, again, apples vs. oranges.

@whizsid
Contributor

whizsid commented Oct 28, 2020

It would be good if we could provide a file checker, because most users run link checkers in CI/CD actions. Their checks will fail whenever they add a new relative URL, because that URL does not exist on the deployed site yet. I would like to work on this.

@mre
Member

mre commented Oct 28, 2020

Sure @whizsid, go for it.

We could check if a file exists when base-url is not set. Additionally, I'm wondering if we should only check for a file if the relative link also came from a file. My reasoning is that it could cause some false positives if a file exists that just happens to have the same name as a relative link on a website. It might be an edge case, though, and make things unnecessarily complex...
Would love to get some opinions on this before we start.

@mre
Member

mre commented Nov 25, 2020

@whizsid are you still working on this? Let me know if you need any support.

@MichaIng
Member

MichaIng commented Dec 19, 2020

The now deprecated liche had the most reasonable implementation for "local" links, IMO:

  • Relative local links (not starting with a slash) simply check for local files relative to the originating document, just as it would work on the "published" document, assuming that all local targets are present whenever the originating document is. This is true for 99% of cases, and otherwise it makes sense for the developer to use full URLs for targets that are not part of the website/document repository being checked.
  • Absolute local links (leading slash) require the base dir option to define the document root, which can be an external URL as well as a local directory path. So I was able to run liche -d build -r build/docs to check all local links, where the website is built into the build/docs directory and will finally be reached at https://example.org/docs.
  • Of course it makes sense to only check file existence in the case of relative local URLs, or when a directory path is given for the base dir/URL.

The ability to check for local files is quite important: even if the website you want to test is available via HTTP(S), its internal structure may have changed, and one needs to be able to check whether all internal relative links are intact.
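
A sketch of that resolution rule as code (a hypothetical helper, not liche's or lychee's actual implementation):

use std::path::{Path, PathBuf};

/// Map a local href found in `source_file` to the file that must exist,
/// using `document_root` only for absolute (leading-slash) links.
fn resolve_local_link(source_file: &Path, document_root: &Path, href: &str) -> PathBuf {
    if let Some(absolute) = href.strip_prefix('/') {
        // "/docs/install.html" -> <document_root>/docs/install.html
        document_root.join(absolute)
    } else {
        // "./foo.html" or "foo.html" -> relative to the originating document
        source_file
            .parent()
            .unwrap_or_else(|| Path::new("."))
            .join(href)
    }
}

// resolve_local_link("build/docs/index.html", "build", "/docs/install.html")
//     == PathBuf::from("build/docs/install.html")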

@mre mre added enhancement New feature or request help wanted Extra attention is needed labels Feb 15, 2021
@mre mre changed the title Lychee does not find relative URLs Support checking local file links Mar 29, 2021
MichaIng added a commit to MichaIng/DietPi-Docs that referenced this issue Jun 9, 2021
... since it does not support local file checking for internal links yet: lycheeverse/lychee#21
@MichaIng
Member

MichaIng commented Jun 11, 2021

We finally migrated from liche to lychee, and I have to say, the performance is awesome in comparison. ~280 links on our website are checked in 2 seconds (according to the GitHub action), while liche took minutes, regularly failed with a timeout (I guess VM-side), and was recently killed by the VM due to OOM, if I read the error code correctly. Not sure what it did; probably it really did download the linked OS images into memory, or something...

For internal links, we start the Nginx web server, which is pre-installed on GitHub Actions VMs, and use the loopback address as base URL. On our main website this works fine, as all HTML files are in the repository root. But when doing this for our documentation (MkDocs-based), it became complicated:

  • lychee applies the base-url as is to all URLs/links checked, without respecting the directory structure.
  • Hence, the base-url is always only correct for a single directory.
  • Since MkDocs puts every HTML file into its own directory, every file needs to be checked with its own lychee call. This means duplicate links across HTML files are checked multiple times, and every checked file gets its own report/summary, without showing the file name (currently) and regardless of whether an error was present, which makes finding the actual errors inconvenient, etc.

While logic could be implemented to apply base-url to relative internal links (without a leading slash) based on the directory of the HTML file relative to the working directory where the script was called, I think it is not worth it, considering that simply checking for the existence of the file is the much preferred solution. Then it is clear that relative links need to be checked against files relative to the actual file location; for absolute internal links, it makes sense to allow giving a local directory or base URL instead, so it does not matter from where lychee was called.

See this as a bump of this topic, as I think it is a major missing feature when using lychee to check website repositories 🙂.

@mre
Member

mre commented Jun 11, 2021

Want to prioritize this when I get the chance.

@polarathene
Contributor

polarathene commented Jun 12, 2021

I think it is not worth it, considering that simply checking for the existence of the file is the much preferred solution.

In cases where you link with a #fragment (anchor) to a heading to jump/scroll to on the page, you'd ideally want to verify that it is valid too. That isn't a supported feature yet, but it may eventually be relevant to consider.

I know you chimed in on that issue yourself, but mentioning it here for others in this issue thread.

@MichaIng
Member

MichaIng commented Jun 12, 2021

Not sure how it is related to local file checking, but checking scroll fragments against matching IDs within the input file is actually a doable case for #185 that I didn't think about.

@polarathene
Contributor

Not sure how it is related to local file checking

Markdown documents for documentation generators like mkdocs? These can have relative filepath links (or just a local heading reference on the same document) via markdown syntax, which is later parsed and converted into anchor links in HTML output.

@MichaIng
Member

MichaIng commented Jun 12, 2021

I got it, but this request is covered by #185, while here it's only about the ability to check internal links via local file existence, not about URL fragments at all 😉. Fragments in Markdown files are difficult, since there is no native Markdown syntax to assign an element ID. There are extensions for this, some applied manually, some automatically, so checking this would require complex logic, and the result would be accordingly error-prone. When using MkDocs, I generally recommend checking the built HTML documents instead of the Markdown files, so there is a consistent syntax that does not depend on the local implementation and extensions, as Markdown does.

@mre mre mentioned this issue Jun 16, 2021
@mre
Member

mre commented Sep 4, 2021

@untitaker can you help me with testing #262?
Of course, if anyone else has some time, feedback is very welcome.

@untitaker
Collaborator Author

untitaker commented Sep 6, 2021

Ok, everything I encountered:

  • The progress bar is really slowing things down. By disabling it using a pipe, I get down from 3 minutes to 32 seconds on sentry-docs, which is roughly on par with liche. hyperlink still does that site in 2 seconds.
  • Besides the progress bar, there's some room for optimization: I see lychee maxes out all cores traversing the filesystem upfront, then does not manage to saturate the CPU at all anymore once it begins to show the progress bar and advance it.
  • The problem with anchors not being stripped occurs for me as well.

other than that lgtm

full command I ran in sentry-docs after following their setup:

yarn
yarn build
cd public/
time ~/projects/lychee/target/release/lychee --offline -b . '**/*.html' |& tee log

@mre
Member

mre commented Sep 6, 2021

The progress bar is really slowing things down. By disabling it using a pipe, I get down from 3 minutes to 32 seconds on sentry-docs, which is roughly on par with liche. hyperlink still does that site in 2 seconds.

Wow. There are similar reports here: console-rs/indicatif#170
Good job on making hyperlink so fast! 😃
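
For reference, one common mitigation is throttling the draw rate; a sketch assuming indicatif 0.16, where ProgressBar::set_draw_delta is available (newer versions switched to rate-limited draw targets):

use indicatif::ProgressBar;

fn main() {
    let total_links: u64 = 10_000;
    let bar = ProgressBar::new(total_links);
    // Redraw at most every 1% of progress instead of on every inc(),
    // so the terminal doesn't become the bottleneck.
    bar.set_draw_delta(total_links / 100);
    for _ in 0..total_links {
        bar.inc(1);
    }
    bar.finish();
}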

I see lychee maxes out all cores traversing the filesystem upfront, then does not manage to saturate the CPU at all anymore once it begins to show the progress bar and advance it

This is true. The problem is the current collector implementation. It pushes all file contents into a channel before checking. That's extremely wasteful. I think just using streams would be the better alternative. This should be tackled once basic file support gets merged. @lebensterben, we talked about that before. I could use your help on this. 😅

The problem with anchors not being stripped occurs for me as well.

Yup, that was an oversight. It should be fixed now.

@mre
Member

mre commented Sep 6, 2021

@untitaker, I tried to reproduce your sentry use case:

time lychee --offline -b . '**/*.html' --no-progress --verbose                                                              
Error: Too many open files (os error 24)

This is clearly not working and we'll need to move to a lazy link collection approach.
It won't happen on the first iteration, but there is obviously some refactoring work to do before file support is suitable for bigger sites. 😞

@MichaIng
Member

MichaIng commented Sep 7, 2021

@mre
This is not related to the local files support itself, is it?

@mre
Member

mre commented Sep 7, 2021

True.
In the following snippet we return a Result<Vec<InputContent>>, which is fine when dealing with a few files; but when glob_contents is called with a pattern like **/*.html on a big directory, it runs out of file handles because it eagerly reads all those files.

pub async fn get_contents(
    &self,
    file_type_hint: Option<FileType>,
    skip_missing: bool,
) -> Result<Vec<InputContent>> {
    match *self {
        // TODO: should skip_missing also affect URLs?
        Input::RemoteUrl(ref url) => Ok(vec![Self::url_contents(url).await?]),
        Input::FsGlob {
            ref pattern,
            ignore_case,
        } => Ok(Self::glob_contents(pattern, ignore_case).await?),
        Input::FsPath(ref path) => {
            let content = Self::path_content(path).await;
            match content {
                Ok(input_content) => Ok(vec![input_content]),
                Err(_) if skip_missing => Ok(vec![]),
                Err(e) => Err(e),
            }
        }
        Input::Stdin => Ok(vec![Self::stdin_content(file_type_hint).await?]),
        Input::String(ref s) => Ok(vec![Self::string_content(s, file_type_hint)]),
    }
}

It indirectly affects local file support when there are a lot of files to handle (as in the sentry example).

The better approach is to lazily evaluate each file, e.g. by using a stream. I will play around with streams a bit to see if I can build something fast and efficient. If anyone wants to help with that, feel free to create a test repo or a pull request, or we can even pair program on this together (UTC+2 timezone).
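
For illustration, a rough sketch of that lazy direction with futures' buffer_unordered (the type names are hypothetical stand-ins for lychee's real ones):

use std::path::PathBuf;
use futures::stream::{self, StreamExt};

// Hypothetical stand-in for lychee's InputContent.
struct InputContent(String);

async fn read_input(path: PathBuf) -> std::io::Result<InputContent> {
    Ok(InputContent(tokio::fs::read_to_string(path).await?))
}

async fn check_all(paths: Vec<PathBuf>) {
    // Keep at most 64 reads in flight instead of eagerly opening every file.
    let mut contents = stream::iter(paths).map(read_input).buffer_unordered(64);

    while let Some(result) = contents.next().await {
        match result {
            // Hypothetical next step: extract and check links from the content.
            Ok(InputContent(body)) => drop(body),
            Err(err) => eprintln!("skipping unreadable file: {err}"),
        }
    }
}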

@mre mre closed this as completed in #262 Sep 9, 2021
@lebensterben
Member

One solution is to make this synchronous. tokio is able to spawn synchronous tasks within an asynchronous context.

If you want, you can use crossbeam/rayon to start a dedicated thread pool with a limited number of OS threads.
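
A minimal sketch of the spawn_blocking route, which moves a blocking read onto tokio's dedicated blocking-thread pool:

use std::path::PathBuf;
use tokio::task;

async fn read_file_blocking(path: PathBuf) -> std::io::Result<String> {
    // std::fs::read_to_string blocks, so run it on the blocking pool
    // rather than on a core async worker thread.
    task::spawn_blocking(move || std::fs::read_to_string(path))
        .await
        .expect("blocking task panicked")
}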

@lebensterben
Member

see
https://docs.rs/tokio/1.11.0/tokio/#cpu-bound-tasks-and-blocking-code

Local disk I/O is not CPU-bound. We'd expect most files to be processed in under 10s (the default timeout before tokio steals a thread), so using async here isn't really helpful either.

And supposing tokio really does steal a thread, that only adds the unnecessary cost of context switching.

@lebensterben
Member

ripgrep uses crossbeam + num_cpus.
In particular, its "worker" is synchronous in nature, even when multi-threading is on.

https://github.com/BurntSushi/ripgrep/blob/df83b8b44426b3f2179abe632eb183e8c8270524/crates/core/search.rs#L310

It's relatively cheap for us to factor out the network I/O from the other parts, and use crossbeam and tokio, respectively, for local and network I/O.

Roughly speaking:

  • CLI args and/or config file parsing is synchronous and single-threaded in nature.
  • Walking directories according to a glob pattern and grabbing URLs from files is synchronous but can benefit from parallelism (assuming it's not running on a crappy hard disk).
  • The "spawner" of the worker pool is the boundary between sync and async; this is actually the part we should spend the most effort on (see the sketch below).
  • Workers are async, but there's not much room for improvement there; after all, once an HTTP request is sent and before we receive anything from the server, it's out of our control.
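
A rough sketch of that boundary (names and channel capacity are illustrative only): synchronous producers feed a bounded tokio channel via blocking_send, and the async side only ever deals with network-bound work:

use tokio::sync::mpsc;

fn main() {
    // Bounded channel as the sync/async boundary.
    let (tx, mut rx) = mpsc::channel::<String>(1024);

    // Sync side: walk files and extract URLs on plain OS threads
    // (a rayon pool would work the same way). blocking_send is the
    // sync-context entry point into the channel.
    let producer = std::thread::spawn(move || {
        for url in ["https://example.com/a", "https://example.com/b"] {
            tx.blocking_send(url.to_string()).expect("receiver gone");
        }
    });

    // Async side: the runtime only sees ready-made URLs and spends
    // its time on network I/O.
    let runtime = tokio::runtime::Runtime::new().expect("failed to start runtime");
    runtime.block_on(async move {
        while let Some(url) = rx.recv().await {
            tokio::spawn(async move {
                // Hypothetical network check for `url`.
                println!("checking {url}");
            });
        }
    });

    producer.join().expect("producer thread panicked");
}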

@mre
Member

mre commented Oct 7, 2021

For some reason I only see your comments now. What you say all makes sense and we should go forward with that.
As you mentioned, we can take a lesson from ripgrep's book. Since ripgrep and lychee and many others seem to have the same requirements for fast filesystem traversal, I wonder if there should be a separate crate, which can wrap that functionality behind a nice, performant interface. But yeah, rayon sounds like a good idea and spawn_blocking sounds like a great combination.

@untitaker
Collaborator Author

untitaker commented Oct 7, 2021

jwalk already implements walkdir on top of rayon; I've found it to be on par with custom rayon or ripgrep. That said, if you want to honor gitignore etc., maybe you want to look at ripgrep. I just didn't find their code particularly reusable (relatively speaking).
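
For illustration, that jwalk-then-rayon pattern might look like this (a simplified sketch, not hyperlink's actual code):

use std::ffi::OsStr;
use jwalk::WalkDir;
use rayon::prelude::*;

fn main() {
    // jwalk walks the directory tree in parallel on a rayon thread pool.
    let html_files: Vec<_> = WalkDir::new("public")
        .into_iter()
        .filter_map(Result::ok)
        .filter(|entry| entry.path().extension() == Some(OsStr::new("html")))
        .map(|entry| entry.path())
        .collect();

    // Then parse the collected files in parallel with rayon.
    html_files.par_iter().for_each(|path| {
        match std::fs::read_to_string(path) {
            // Hypothetical next step: extract hrefs from the content here.
            Ok(content) => drop(content),
            Err(err) => eprintln!("failed to read {}: {err}", path.display()),
        }
    });
}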

@mre
Member

mre commented Oct 7, 2021

jwalk looks great. We should use that if the input is a directory. Would be nice to add glob support to it, but it might be out of scope for the crate. 🤷
ripgrep probably covers more edge cases, but as you said it's not very reusable as a library, so jwalk is probably our best option.

@mre
Member

mre commented Oct 10, 2021

Added jwalk to #330.
