Support checking local file links #21
You need to pass a base URL. Would that cover your use-case? |
Ah, so it is always going to do network I/O? Ok, that clarifies that then. I just assumed it would do local file checking because the input is already a local file (i.e. work closer to how liche works). I think I'll just run a local HTTP server for CI like I did with the others then and pass a URL.

As for use cases: I am only doing benchmark comparisons across a bunch of link checkers. My use case is checking internal links of the HTML generated by a static site generator, as part of CI, but I wrote my own link checker for that, whose small feature set, incredibly, seems to be the exact inverse of lychee's. Just shows that "link checker" is an overloaded term. Two pieces of feedback, but keep in mind that these don't come from a "real user":
|
Good points!
This is tackled by @pawroman in #22. 🎉
True. Come to think of it, checking relative URLs when a file name is given might be the sensible default. Wonder what @whizsid and @pawroman think about it. |
Liche does this thing where it aborts link checking entirely when it encounters a host-relative URL while parsing a local file and base-url is not set. I have to write this for anything to work:
which is kind of silly, but it does allow those arguments to diverge, and still lets liche resolve
I think a hybrid where --document-root defaults to the parent path of the file given would be good, so I can write both:
(The |
Also I just realized that lychee does not support recursive traversal of links at all, or am I wrong? Had to combine |
Nope, recursion is still on the todo list (see the README). My idea was to use channels for handling in-flight checks; do you use the same approach in hyperlink? |
hyperlink can only crawl directories and resolve relative/internal links (no HTTP support at all). As such, it walks through the entire directory first using walkdir/jwalk, then parallelizes parsing of all found files with Rayon. No graph traversal over hrefs, just directory walking. I honestly don't know enough about post-futures-0.1 async, but I suppose you could do something with the mpsc channel? So you have a single consumer that is in charge of spawning new fetch tasks and deduplicating them, and each fetch task gets a producer. |
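A minimal sketch of that consumer/producer pattern with tokio's mpsc (my own illustration, not lychee's or hyperlink's code; fetch_and_extract is a hypothetical stub):

```rust
use std::collections::HashSet;
use tokio::sync::mpsc;

enum Msg {
    Found(String), // a newly discovered link
    Done,          // a fetch task finished
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel::<Msg>();
    // Seed the queue with a starting URL.
    let _ = tx.send(Msg::Found("https://example.com/".into()));

    let mut seen = HashSet::new();
    let mut in_flight: usize = 0;

    // Single consumer: owns the dedup set and decides which fetches to spawn.
    while let Some(msg) = rx.recv().await {
        match msg {
            Msg::Found(url) if seen.insert(url.clone()) => {
                in_flight += 1;
                let tx = tx.clone(); // every fetch task gets a producer
                tokio::spawn(async move {
                    for link in fetch_and_extract(&url).await {
                        let _ = tx.send(Msg::Found(link));
                    }
                    let _ = tx.send(Msg::Done);
                });
            }
            Msg::Found(_) => {} // already checked or queued, skip
            Msg::Done => in_flight -= 1,
        }
        if in_flight == 0 {
            break; // queue drained and no checks outstanding
        }
    }
}

// Stub: a real implementation would fetch the document and parse hrefs.
async fn fetch_and_extract(_url: &str) -> Vec<String> {
    Vec::new()
}
```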
Precisely. Would love a comparison between async and rayon for that use-case. My guess is that perf would be similar while async would use slightly less resources when dealing with many tasks. Then again threads are super optimized as well on modern hardware so there might be no difference after all. |
I did see that liche, which uses goroutines, is slightly faster than hyperlink on folders with >1 million files, but because liche spawned over 200 threads on my Mac (while hyperlink spawned 12) I also assumed it effectively used a thread pool, not actual async syscalls for file I/O. I picked basic threading over tokio because I thought that outside of networking (I didn't care about HTTP at all) it would not make a difference, and also because it definitely seems I can reuse buffers much more efficiently. I just googled some stuff and apparently there are async syscalls for reading files on macOS. Anyway, it would be a massive pain to support external links in hyperlink, so I probably painted myself into a corner already. |
I am planning to do semi-serious benchmarks across a bunch of link checkers. I haven't figured out how yet. Right now my approach to informal testing was to set up a random link checker, disable most of its features so that the feature set roughly matches the few things hyperlink can do, then compare them on https://github.com/getsentry/sentry-docs and some synthetic stuff. Unfortunately there are two problems with this:
|
It would be good if we could provide a file checker, because most users run link checkers in CI/CD actions. The check will fail when they add a new relative URL, because that relative URL does not yet exist on the deployed site at the time. I would like to work on this. |
Sure @whizsid, go for it. We could check if a file exists if |
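A hedged sketch of what such a check could look like (hypothetical helper, not lychee's actual API; the fragment/query stripping and base-directory handling are assumptions):

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: resolve a relative href against a base directory
/// and report whether the target file exists on disk.
/// Host-relative links ("/foo") would need a separate document root instead.
fn local_link_exists(base_dir: &Path, href: &str) -> bool {
    // Strip a fragment ("#section") or query string before resolving.
    let path_part = href.split(&['#', '?'][..]).next().unwrap_or(href);
    let target: PathBuf = base_dir.join(path_part);
    target.exists()
}

fn main() {
    let base = Path::new("./public");
    for href in ["index.html", "docs/install.html#setup", "missing.html"] {
        println!("{href}: {}", local_link_exists(base, href));
    }
}
```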
@whizsid are you still working on this? Let me know if you need any support. |
The now deprecated liche had the most reasonable implementation for "local" links, IMO:
The ability to check for local files is quite important: even if the website that you want to test is available via HTTP(S), its internal structure may have changed, and one needs to be able to check whether all internal relative links are intact. |
... since it does not support local file checking for internal links yet: lycheeverse/lychee#21
We finally migrated to lychee. For internal links, we start the Nginx webserver that is pre-installed on GitHub Actions VMs and use the loopback address as the base URL. On our main website this works fine, as all HTML files are in the repository root. But when doing this for our documentation (MkDocs-based), it became complicated:
While custom logic could be implemented to work around this, see this as a bump of this topic, as I think it is a major missing feature when using lychee. |
Want to prioritize this when I get the chance. |
In some cases where you link with a I know you chimed in on that issue yourself, but mentioning it here for others in this issue thread. |
Not sure how it is related to local file checking, but checking scroll fragments and matching IDs within the input file is actually a doable case for #185 that I didn't think about. |
Markdown documents for documentation generators like mkdocs? These can have relative filepath links (or just a local heading reference on the same document) via markdown syntax, which is later parsed and converted into anchor links in HTML output. |
I got it, but this request is covered by #185, while here it's only about the ability to check internal links via local file existence, not at all about the URL fragment 😉. Fragments in Markdown files are difficult, since there is no native Markdown syntax to assign an element ID. There are extensions for this, some applied manually, some automatically, so checking it would require complex logic and the result would accordingly be error-prone. I generally recommend checking the built HTML documents instead of the Markdown files when using MkDocs, so there is a consistent syntax that does not depend on the local implementation and extensions, unlike Markdown. |
@untitaker can you help me with testing #262? Of course, if anyone else has some time, feedback is very welcome. |
Will test in detail next week, hopefully. I left a comment on the PR; it seems to me lychee-on-fs is still built for different use cases than what I need.
|
Ok, everything I encountered:
Other than that, LGTM. Full command I ran in sentry-docs after following their setup:
|
Wow. There are similar reports here: console-rs/indicatif#170
This is true. The problem is the current collector implementation. It pushes all file contents into a channel before checking. That's extremely wasteful. I think just using streams would be the better alternative. This should be tackled once basic file support gets merged. @lebensterben we talked about that before. I could use your help on this. 😅
Yup, that was an oversight. It should be fixed now. |
@untitaker, tried to reproduce your sentry use-case:
This is clearly not working and we'll need to move to a lazy link collection approach. |
@mre |
True. See lychee/lychee-lib/src/collector.rs, lines 122 to 145 (commit 8fd34a7).
It indirectly affects local file support if there are a lot of files to handle (as in the sentry example). The better approach is to lazily evaluate each file, e.g. by using a stream. I will play around with streams a bit to see if I can build something fast and efficient. If anyone wants to help me with that, feel free to create a test repo or a pull request, or we can even pair program on this together (UTC+2 timezone). |
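A rough sketch of the lazy, stream-based collection (my illustration using the futures and tokio crates; extract_links is a hypothetical stub, not the real collector API):

```rust
use futures::stream::{self, StreamExt};
use tokio::fs;

#[tokio::main]
async fn main() {
    // In the real collector these would come from the CLI / directory walk.
    let paths = vec!["a.html".to_string(), "b.html".to_string()];

    // Lazily read and parse each file, at most 8 at a time, instead of
    // buffering all file contents into a channel before checking starts.
    let links: Vec<String> = stream::iter(paths)
        .map(|path| async move {
            match fs::read_to_string(&path).await {
                Ok(content) => extract_links(&content),
                Err(_) => Vec::new(),
            }
        })
        .buffer_unordered(8)
        .flat_map(|found| stream::iter(found))
        .collect()
        .await;

    println!("found {} links", links.len());
}

// Stub: the real implementation would parse HTML/Markdown for hrefs.
fn extract_links(_content: &str) -> Vec<String> {
    Vec::new()
}
```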
One solution is to make this synchronous. Tokio is able to spawn synchronous tasks within an asynchronous context. If you want, you can use crossbeam/rayon to start a dedicated thread pool with a limited number of OS threads. |
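For reference, a minimal example of offloading blocking file IO from async code with tokio's spawn_blocking (just an illustration of the idea, not a proposed patch; a dedicated rayon/crossbeam pool variant would look similar):

```rust
use tokio::task;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Plain synchronous disk IO, moved off the async worker threads.
    let content =
        task::spawn_blocking(|| std::fs::read_to_string("index.html")).await??;
    println!("read {} bytes", content.len());
    Ok(())
}
```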
See, local disk IO is not CPU bound. We'd expect most files to be processed in under 10s (the default timeout before tokio steals a thread), so using async here isn't really helpful either. And if tokio really does steal a thread, that only adds the unnecessary cost of context switching. |
ripgrep uses crossbeam + num_cpus. It's relatively cheap for us to factor out the network IO from the other parts and use tokio for network IO and crossbeam for local IO. Roughly speaking,
|
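Roughly, that split could look like this (a sketch with hypothetical stubs extract_remote_urls and check_url; rayon handles local file IO on its own thread pool, tokio handles the network side):

```rust
use rayon::prelude::*;
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let files = vec!["a.html".to_string(), "b.html".to_string()];
    let (tx, mut rx) = mpsc::unbounded_channel::<String>();

    // Local IO: read and parse files on rayon's pool (plain blocking IO),
    // hosted on a separate thread so it never blocks the tokio runtime.
    std::thread::spawn(move || {
        files.par_iter().for_each_with(tx, |tx, path| {
            if let Ok(content) = std::fs::read_to_string(path) {
                for url in extract_remote_urls(&content) {
                    let _ = tx.send(url);
                }
            }
        });
    });

    // Network IO: check URLs concurrently on the tokio runtime.
    let mut handles = Vec::new();
    while let Some(url) = rx.recv().await {
        handles.push(tokio::spawn(async move { check_url(&url).await }));
    }
    for handle in handles {
        let _ = handle.await;
    }
}

// Stub: parse hrefs out of the document and keep only http(s) ones.
fn extract_remote_urls(_content: &str) -> Vec<String> {
    Vec::new()
}

// Stub: a real checker would issue an HTTP request here.
async fn check_url(_url: &str) -> bool {
    true
}
```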
For some reason I only see your comments now. What you say all makes sense and we should go forward with that. |
jwalk already implements walkdir on top of rayon; I've found it to be on par with custom rayon or ripgrep. That said, if you want to honor gitignore etc., maybe you want to look at ripgrep. I just didn't find their code particularly reusable (relatively speaking). |
jwalk looks great. We should use that if the input is a directory. Would be nice to add glob support to it, but it might be out of scope for the crate. 🤷 |
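A tiny example of what a parallel directory walk with jwalk could look like (illustrative only; the root path and extension filter are assumptions):

```rust
use jwalk::WalkDir;

fn main() {
    // Walk the directory in parallel and print every HTML file found.
    for entry in WalkDir::new("./public").into_iter().flatten() {
        let path = entry.path();
        if path.extension().map_or(false, |ext| ext == "html") {
            println!("{}", path.display());
        }
    }
}
```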
Added jwalk to #330. |
#15 implemented relative URLs; however, a simple test shows that it is not working:
Expected to find a 404 because file foo.html does not exist.