
Support checking local file links #21

Closed
untitaker opened this issue Oct 24, 2020 · 34 comments · Fixed by #262
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@untitaker
Collaborator

#15 implemented relative URLs; however, a simple test shows that it is not working:

$ echo '<a href="./foo.html">Broken</a>' > index.html
$ GITHUB_TOKEN= lychee index.html -pv

📝Summary
-------------------
🔍Found: 0
👻Excluded: 0
✅Successful: 0
🚫Errors: 0

Expected to find a 404 because file foo.html does not exist.

@mre
Member

mre commented Oct 25, 2020

You need to pass a base-url parameter like so:
GITHUB_TOKEN= lychee index.html -pv --base-url "http://example.com"

Would that cover your use-case?
We could add support for checking local files as well if that helps.

@untitaker
Collaborator Author

Ah, so it is always going to do network I/O? OK, that clarifies that. I just assumed it would do local file checking because the input is already a local file (i.e. work closer to how liche works). I think I'll just run a local HTTP server for CI like I did with the others and pass a URL.

As for use cases: I am only doing benchmark comparisons across a bunch of link checkers. My use case is checking the internal links of the HTML generated by a static site generator as part of CI, but I wrote my own link checker for that, whose small feature set, incredibly, seems to be the exact inverse of lychee's. It just shows that "link checker" is an overloaded term.

Two pieces of feedback, but keep in mind that those don't come from a "real user":

  • Right now GITHUB_TOKEN is a required env var, but the readme does not call that out. I don't have GitHub links to check, though, so I think it should be optional.
  • It would be good to document --base-url in the readme as required for relative URLs (or perhaps warn in-app when there are relative links that could not be resolved because --base-url is not set).

@mre
Member

mre commented Oct 25, 2020

Good points!

Right now GITHUB_TOKEN is a required env var, but the readme does not call that out. I don't have GitHub links to check, though, so I think it should be optional.

This is tackled by @pawroman in #22. 🎉

It would be good to document --base-url in the readme as required for relative URLs (or perhaps warn in-app when there are relative links that could not be resolved because --base-url is not set).

True. Come to think of it, checking relative URLs when a file name is given might make sense as the default. Wonder what @whizsid and @pawroman think about it.

@untitaker
Collaborator Author

Liche does this thing where it aborts link checking entirely when it encounters a host-relative URL while parsing a local file and base-url is not set. I have to write this for anything to work:

liche -r deploy/ --document-root deploy/ -x https?://*

which is kind of silly, but it does allow those arguments to diverge, and still lets liche resolve /foo/bar hrefs (host-relative but path-absolute) correctly:

liche -r deploy/specific/sub/page.html --document-root deploy/ -x https?://

I think a hybrid where --document-root defaults to the parent path of the file given would be good, so I can write both:

lychee deploy/index.html
lychee deploy/specific/sub/page.html --base-url deploy/

(The -x prevents liche from checking external links. I personally want this because, above all, the check should not be flaky, as it runs in CI. It's another kind of use case that I find most link checkers don't really cater to well, or make ergonomic.)
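
A minimal sketch of that defaulting rule (the helper name is hypothetical; nothing like this exists in lychee yet):

use std::path::{Path, PathBuf};

/// If no --base-url/--document-root was given, fall back to the
/// parent directory of the input file.
fn effective_document_root(input_file: &Path, explicit_root: Option<PathBuf>) -> PathBuf {
    explicit_root.unwrap_or_else(|| {
        input_file
            .parent()
            .map(Path::to_path_buf)
            .unwrap_or_else(|| PathBuf::from("."))
    })
}

// lychee deploy/index.html                                -> root = "deploy"
// lychee deploy/specific/sub/page.html --base-url deploy/ -> root = "deploy/"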

@untitaker
Collaborator Author

Also, I just realized that lychee does not support recursive traversal of links at all, or am I wrong? I had to combine find, xargs, and lychee in godforsaken ways, but obviously that does not perform very well...

@mre
Copy link
Member

mre commented Oct 25, 2020

Nope, recursion is still on the todo list (see readme). My idea was to use channels for handling in-flight checks; do you use the same in hyperlink?
I do like your other ideas.

@untitaker
Collaborator Author

untitaker commented Oct 25, 2020

hyperlink can only crawl directories and resolve relative/internal links (no HTTP support at all). As such, it walks the entire directory tree first using walkdir/jwalk, then parallelizes parsing of all found files with Rayon. No graph traversal over hrefs, just directory walking.

I honestly don't know enough about post-futures-0.1 async, but I suppose you could do something with the mpsc channel? You'd have a single consumer that is in charge of spawning and deduplicating fetch tasks, and each fetch task gets a producer.
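
A rough sketch of that shape with tokio's mpsc channel (all names are hypothetical, and a real implementation would also need error handling and politeness limits):

use std::collections::HashSet;
use tokio::sync::mpsc;

// Hypothetical: fetch `url` and return the links found in the body.
async fn fetch_and_extract(url: String) -> Vec<String> {
    let _ = url;
    Vec::new()
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel::<Vec<String>>();
    let mut seen: HashSet<String> = HashSet::new();
    let mut in_flight = 0usize;

    // Dedup first, then spawn a fetch task that gets its own producer handle.
    let mut enqueue = |url: String, in_flight: &mut usize| {
        if seen.insert(url.clone()) {
            *in_flight += 1;
            let tx = tx.clone();
            tokio::spawn(async move {
                // Each task reports the links it discovered back to the consumer.
                let _ = tx.send(fetch_and_extract(url).await);
            });
        }
    };

    enqueue("https://example.com/".to_string(), &mut in_flight);

    // Single consumer: owns the dedup set and decides what to spawn next.
    while in_flight > 0 {
        if let Some(links) = rx.recv().await {
            in_flight -= 1;
            for link in links {
                enqueue(link, &mut in_flight);
            }
        }
    }
}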

@mre
Member

mre commented Oct 25, 2020

I honestly don't know enough about post-futures-0.1 async, but I suppose you could do something with the mpsc channel? You'd have a single consumer that is in charge of spawning and deduplicating fetch tasks, and each fetch task gets a producer.

Precisely. Would love a comparison between async and rayon for that use case. My guess is that perf would be similar, while async would use slightly fewer resources when dealing with many tasks. Then again, threads are super optimized on modern hardware as well, so there might be no difference after all.

@untitaker
Collaborator Author

untitaker commented Oct 25, 2020

I did see that liche, which uses goroutines, is slightly faster than hyperlink on folders with >1 million files, but because liche spawned over 200 threads on my Mac (while hyperlink spawned 12), I also assumed it effectively used a thread pool rather than actual async syscalls for file I/O.

I picked basic threading over tokio because I thought that outside of networking (I didn't care about HTTP at all) it would not make a difference, and also because it definitely seems I can reuse buffers much more efficiently.

I just googled some stuff, and apparently there are async syscalls for reading files on macOS.

Anyway, it would be a massive pain to support external links in hyperlink so I probably painted myself into a corner already.

@untitaker
Collaborator Author

Precisely. Would love a comparison between async and rayon for that use case.

I am planning to do semi-serious benchmarks across a bunch of link checkers. I haven't figured out how yet.

Right now my approach to informal testing is to set up a random link checker, disable most of its features so that the feature set roughly matches the few things hyperlink can do, then compare them on https://github.com/getsentry/sentry-docs and some synthetic stuff. Unfortunately there are two problems with this:

  • I am disabling so many features of those other link checkers that I am kind of comparing apples with oranges. Lychee in particular looks like it would generally just serve very different use cases.
  • For link checkers that only support HTTP, I need to make sure the checker is not bottlenecked on the local HTTP server that serves up what hyperlink would access directly. Figuring that out is not easy without distorting results further. Also, again, apples vs. oranges.

@whizsid
Contributor

whizsid commented Oct 28, 2020

It would be good if we could provide a file checker, because most users run link checkers in CI/CD actions. Their checks will fail whenever they add a new relative URL, because that URL does not exist on the deployed site yet. I would like to work on this.

@mre
Member

mre commented Oct 28, 2020

Sure @whizsid, go for it.

We could check if a file exists when base-url is not set. Additionally, I'm wondering if we should only check for a file if the relative link also came from a file. My reasoning is that it could cause some false positives if a file exists that just happens to have the same name as a relative link on a website. It might be an edge case, though, and make things unnecessarily complex...
Would love to get some opinions on this before we start.

@mre
Member

mre commented Nov 25, 2020

@whizsid are you still working on this? Let me know if you need any support.

@MichaIng
Member

MichaIng commented Dec 19, 2020

The now deprecated liche had the most reasonable implementation for "local" links, IMO:

  • Relative local links (not starting with a slash) simply check for local files relative to the originating document, just as it would work on the "published" document, assuming that all local targets are present whenever the originating document is. This is true for 99% of cases, and otherwise it makes sense for the developer to use full URLs for targets that are not part of the website/document repository being checked.
  • Absolute local links (leading slash) require the base dir option to define the document root, which can be an external URL as well as a local directory path. So I was able to run liche -d build -r build/docs to check all local links, where the website is built into the build/docs directory and will finally be reached at https://example.org/docs.
  • Of course it makes sense to only check file existence in the case of relative local URLs, or when a directory path is given for the base dir/URL.

The ability to check for local files is quite important: even if the website you want to test is available via HTTP(S), its internal structure may have changed, and one needs to be able to check whether all internal relative links are intact.
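
A sketch of that resolution rule as code (a hypothetical helper, not liche's or lychee's actual implementation):

use std::path::{Path, PathBuf};

/// Map a local href found in `source_file` to the file that must exist,
/// using `document_root` only for absolute (leading-slash) links.
fn resolve_local_link(source_file: &Path, document_root: &Path, href: &str) -> PathBuf {
    if let Some(absolute) = href.strip_prefix('/') {
        // "/docs/install.html" -> <document_root>/docs/install.html
        document_root.join(absolute)
    } else {
        // "./foo.html" or "foo.html" -> relative to the originating document
        source_file
            .parent()
            .unwrap_or_else(|| Path::new("."))
            .join(href)
    }
}

// resolve_local_link("build/docs/index.html", "build", "/docs/install.html")
//     == PathBuf::from("build/docs/install.html")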

@mre mre added enhancement New feature or request help wanted Extra attention is needed labels Feb 15, 2021
@mre mre changed the title Lychee does not find relative URLs Support checking local file links Mar 29, 2021
MichaIng added a commit to MichaIng/DietPi-Docs that referenced this issue Jun 9, 2021
... since it does not support local file checking for internal links yet: lycheeverse/lychee#21
@MichaIng
Member

MichaIng commented Jun 11, 2021

We finally migrated from liche to lychee, and I have to say, the performance is awesome in comparison. ~280 links on our website are checked in 2 seconds (according to the GitHub action), while liche took minutes, regularly failed with a timeout (I guess VM-side), and was recently killed by the VM due to OOM, if I read the error code correctly. Not sure what it did; probably it really did download the linked OS images into memory, or something...

For internal links, we start the Nginx web server, which is pre-installed on GitHub Actions VMs, and use the loopback address as base URL. On our main website this works fine, as all HTML files are in the repository root. But when doing this for our documentation (MkDocs-based), it became complicated:

  • lychee applies the base-url as is to all URLs/links checked, without respecting the directory structure.
  • Hence, the base-url is always only correct for a single directory.
  • Since MkDocs puts every HTML file into its own directory, every file needs to be checked with its own lychee call. This means duplicate links across HTML files are checked multiple times, and every checked file gets its own report/summary, without showing the file name (currently) and regardless of whether an error was present, which makes finding the actual errors inconvenient, etc.

While logic could be implemented to apply base-url to relative internal links (without a leading slash) based on the directory of the HTML file relative to the working directory where the script was called, I think it is not worth it, considering that simply checking for the existence of the file is the much preferred solution. Then it is clear that relative links need to be checked against files relative to the actual file location; for absolute internal links, it makes sense to allow giving a local directory or base URL instead, so it does not matter from where lychee was called.

See this as a bump of this topic, as I think it is a major missing feature when using lychee to check website repositories 🙂.

@mre
Member

mre commented Jun 11, 2021

Want to prioritize this when I get the chance.

@polarathene
Contributor

polarathene commented Jun 12, 2021

I think it is not worth it, considering that simply checking for the existence of the file is the much preferred solution.

In cases where you link with a #fragment (anchor) to a heading to jump/scroll to on the page, you'd ideally want to verify that it is valid too. That isn't a supported feature yet, but it may eventually be relevant to consider.

I know you chimed in on that issue yourself, but mentioning it here for others in this issue thread.

@MichaIng
Member

MichaIng commented Jun 12, 2021

Not sure how it is related to local file checking, but checking scroll fragments against matching IDs within the input file is actually a doable case for #185 that I didn't think about.

@polarathene
Contributor

Not sure how it is related to local file checking

Markdown documents for documentation generators like mkdocs? These can have relative filepath links (or just a local heading reference on the same document) via markdown syntax, which is later parsed and converted into anchor links in HTML output.

@MichaIng
Member

MichaIng commented Jun 12, 2021

I got it, but this request is covered by #185, while here it's only about the ability to check internal links via local file existence, not about URL fragments at all 😉. Fragments in Markdown files are difficult, since there is no native Markdown syntax to assign an element ID. There are extensions for this, some applied manually, some automatically, so checking this would require complex logic, and the result would be accordingly error-prone. When using MkDocs, I generally recommend checking the built HTML documents instead of the Markdown files, so there is a consistent syntax that does not depend on the local implementation and extensions, as Markdown does.

@mre mre mentioned this issue Jun 16, 2021
@mre
Member

mre commented Sep 4, 2021

@untitaker can you help me with testing #262?
Of course, if anyone else has some time, feedback is very welcome.

@untitaker
Collaborator Author

untitaker commented Sep 6, 2021

Ok, everything I encountered:

  • The progress bar is really slowing things down. By disabling it using a pipe, I get down from 3 minutes to 32 seconds on sentry-docs, which is roughly on par with liche. hyperlink still does that site in 2 seconds.
  • Besides the progress bar, there's some room for optimization: I see lychee maxes out all cores traversing the filesystem upfront, then does not manage to saturate the CPU at all anymore once it begins to show the progress bar and advance it.
  • The problem with anchors not being stripped occurs for me as well.

other than that lgtm

full command I ran in sentry-docs after following their setup:

yarn
yarn build
cd public/
time ~/projects/lychee/target/release/lychee --offline -b . '**/*.html' |& tee log

@mre
Member

mre commented Sep 6, 2021

The progress bar is really slowing things down. By disabling it using a pipe, I get down from 3 minutes to 32 seconds on sentry-docs, which is roughly on par with liche. hyperlink still does that site in 2 seconds.

Wow. There are similar reports here: console-rs/indicatif#170
Good job on making hyperlink so fast! 😃
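
For reference, one common mitigation is throttling the draw rate; a sketch assuming indicatif 0.16, where ProgressBar::set_draw_delta is available (newer versions switched to rate-limited draw targets):

use indicatif::ProgressBar;

fn main() {
    let total_links: u64 = 10_000;
    let bar = ProgressBar::new(total_links);
    // Redraw at most every 1% of progress instead of on every inc(),
    // so the terminal doesn't become the bottleneck.
    bar.set_draw_delta(total_links / 100);
    for _ in 0..total_links {
        bar.inc(1);
    }
    bar.finish();
}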

I see lychee maxes out all cores traversing the filesystem upfront, then does not manage to saturate the CPU at all anymore once it begins to show the progress bar and advance it

This is true. The problem is the current collector implementation. It pushes all file contents into a channel before checking. That's extremely wasteful. I think just using streams would be the better alternative. This should be tackled once basic file support gets merged. @lebensterben, we talked about that before. I could use your help on this. 😅

The problem with anchors not being stripped occurs for me as well.

Yup, that was an oversight. It should be fixed now.

@mre
Member

mre commented Sep 6, 2021

@untitaker, I tried to reproduce your sentry use case:

time lychee --offline -b . '**/*.html' --no-progress --verbose                                                              
Error: Too many open files (os error 24)

This is clearly not working and we'll need to move to a lazy link collection approach.
It won't happen on the first iteration, but there is obviously some refactoring work to do before file support is suitable for bigger sites. 😞

@MichaIng
Member

MichaIng commented Sep 7, 2021

@mre
This is not related to the local files support itself, is it?

@mre
Member

mre commented Sep 7, 2021

True.
In the following snippet we return a Result<Vec<InputContent>>, which is fine when dealing with a few files; but when glob_contents is called with a pattern like **/*.html on a big directory, it runs out of file handles because it eagerly reads all those files.

pub async fn get_contents(
    &self,
    file_type_hint: Option<FileType>,
    skip_missing: bool,
) -> Result<Vec<InputContent>> {
    match *self {
        // TODO: should skip_missing also affect URLs?
        Input::RemoteUrl(ref url) => Ok(vec![Self::url_contents(url).await?]),
        Input::FsGlob {
            ref pattern,
            ignore_case,
        } => Ok(Self::glob_contents(pattern, ignore_case).await?),
        Input::FsPath(ref path) => {
            let content = Self::path_content(path).await;
            match content {
                Ok(input_content) => Ok(vec![input_content]),
                Err(_) if skip_missing => Ok(vec![]),
                Err(e) => Err(e),
            }
        }
        Input::Stdin => Ok(vec![Self::stdin_content(file_type_hint).await?]),
        Input::String(ref s) => Ok(vec![Self::string_content(s, file_type_hint)]),
    }
}

It indirectly affects local file support when there are a lot of files to handle (as in the sentry example).

The better approach is to lazily evaluate each file, e.g. by using a stream. I will play around with streams a bit to see if I can build something fast and efficient. If anyone wants to help with that, feel free to create a test repo or a pull request, or we can even pair program on this together (UTC+2 timezone).
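
For illustration, a rough sketch of that lazy direction with futures' buffer_unordered (the type names are hypothetical stand-ins for lychee's real ones):

use std::path::PathBuf;
use futures::stream::{self, StreamExt};

// Hypothetical stand-in for lychee's InputContent.
struct InputContent(String);

async fn read_input(path: PathBuf) -> std::io::Result<InputContent> {
    Ok(InputContent(tokio::fs::read_to_string(path).await?))
}

async fn check_all(paths: Vec<PathBuf>) {
    // Keep at most 64 reads in flight instead of eagerly opening every file.
    let mut contents = stream::iter(paths).map(read_input).buffer_unordered(64);

    while let Some(result) = contents.next().await {
        match result {
            // Hypothetical next step: extract and check links from the content.
            Ok(InputContent(body)) => drop(body),
            Err(err) => eprintln!("skipping unreadable file: {err}"),
        }
    }
}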

@mre mre closed this as completed in #262 Sep 9, 2021
@lebensterben
Member

One solution is to make this synchronous. tokio is able to spawn synchronous tasks within an asynchronous context.

If you want, you can use crossbeam/rayon to start a dedicated thread pool with a limited number of OS threads.
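
A minimal sketch of the spawn_blocking route, which moves a blocking read onto tokio's dedicated blocking-thread pool:

use std::path::PathBuf;
use tokio::task;

async fn read_file_blocking(path: PathBuf) -> std::io::Result<String> {
    // std::fs::read_to_string blocks, so run it on the blocking pool
    // rather than on a core async worker thread.
    task::spawn_blocking(move || std::fs::read_to_string(path))
        .await
        .expect("blocking task panicked")
}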

@lebensterben
Member

see
https://docs.rs/tokio/1.11.0/tokio/#cpu-bound-tasks-and-blocking-code

Local disk I/O is not CPU-bound. We'd expect most files to be processed in under 10s (the default timeout before tokio steals a thread), so using async here isn't really helpful either.

And supposing tokio really does steal a thread, that only adds the unnecessary cost of context switching.

@lebensterben
Member

ripgrep uses crossbeam + num_cpus.
In particular, its "worker" is synchronous in nature, even when multi-threading is on.

https://github.com/BurntSushi/ripgrep/blob/df83b8b44426b3f2179abe632eb183e8c8270524/crates/core/search.rs#L310

It's relatively cheap for us to factor out the network I/O from the other parts, and use crossbeam and tokio, respectively, for local and network I/O.

Roughly speaking:

  • CLI args and/or config file parsing is synchronous and single-threaded in nature.
  • Walking directories according to a glob pattern and grabbing URLs from files is synchronous but can benefit from parallelism (assuming it's not running on a crappy hard disk).
  • The "spawner" of the worker pool is the boundary between sync and async; this is actually the part we should spend the most effort on (see the sketch below).
  • Workers are async, but there's not much room for improvement there; after all, once an HTTP request is sent and before we receive anything from the server, it's out of our control.
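
A rough sketch of that boundary (names and channel capacity are illustrative only): synchronous producers feed a bounded tokio channel via blocking_send, and the async side only ever deals with network-bound work:

use tokio::sync::mpsc;

fn main() {
    // Bounded channel as the sync/async boundary.
    let (tx, mut rx) = mpsc::channel::<String>(1024);

    // Sync side: walk files and extract URLs on plain OS threads
    // (a rayon pool would work the same way). blocking_send is the
    // sync-context entry point into the channel.
    let producer = std::thread::spawn(move || {
        for url in ["https://example.com/a", "https://example.com/b"] {
            tx.blocking_send(url.to_string()).expect("receiver gone");
        }
    });

    // Async side: the runtime only sees ready-made URLs and spends
    // its time on network I/O.
    let runtime = tokio::runtime::Runtime::new().expect("failed to start runtime");
    runtime.block_on(async move {
        while let Some(url) = rx.recv().await {
            tokio::spawn(async move {
                // Hypothetical network check for `url`.
                println!("checking {url}");
            });
        }
    });

    producer.join().expect("producer thread panicked");
}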

@mre
Member

mre commented Oct 7, 2021

For some reason I only see your comments now. What you say all makes sense and we should go forward with that.
As you mentioned, we can take a lesson from ripgrep's book. Since ripgrep and lychee and many others seem to have the same requirements for fast filesystem traversal, I wonder if there should be a separate crate, which can wrap that functionality behind a nice, performant interface. But yeah, rayon sounds like a good idea and spawn_blocking sounds like a great combination.

@untitaker
Collaborator Author

untitaker commented Oct 7, 2021

jwalk already implements walkdir on top of rayon; I've found it to be on par with custom rayon or ripgrep. That said, if you want to honor gitignore etc., maybe you want to look at ripgrep. I just didn't find their code particularly reusable (relatively speaking).
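
For illustration, that jwalk-then-rayon pattern might look like this (a simplified sketch, not hyperlink's actual code):

use std::ffi::OsStr;
use jwalk::WalkDir;
use rayon::prelude::*;

fn main() {
    // jwalk walks the directory tree in parallel on a rayon thread pool.
    let html_files: Vec<_> = WalkDir::new("public")
        .into_iter()
        .filter_map(Result::ok)
        .filter(|entry| entry.path().extension() == Some(OsStr::new("html")))
        .map(|entry| entry.path())
        .collect();

    // Then parse the collected files in parallel with rayon.
    html_files.par_iter().for_each(|path| {
        match std::fs::read_to_string(path) {
            // Hypothetical next step: extract hrefs from the content here.
            Ok(content) => drop(content),
            Err(err) => eprintln!("failed to read {}: {err}", path.display()),
        }
    });
}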

@mre
Member

mre commented Oct 7, 2021

jwalk looks great. We should use that if the input is a directory. Would be nice to add glob support to it, but it might be out of scope for the crate. 🤷
ripgrep probably covers more edge cases, but as you said it's not very reusable as a library, so jwalk is probably our best option.

@mre
Member

mre commented Oct 10, 2021

Added jwalk to #330.
