Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for local files #262

Merged
merged 48 commits into from
Sep 9, 2021
Merged

Add support for local files #262

merged 48 commits into from
Sep 9, 2021

Conversation

mre
Copy link
Member

@mre mre commented Jun 16, 2021

Fixes #21

TODO:

  • Add support for base-url (lychee deploy/specific/sub/page.html --base-url deploy/) for absolute local links.
    Absolute local links (leading slash) require the base dir option to define the document root, which can be an external URL as well as a local directory path. So I was able to do liche -d build -r build/docs to check all local links, where the website is built into the build/docs directory and will be reached at https://example.org/docs finally. (see comment)
  • Consider --document-root or --base defaults to the parent path of the file
  • Add flag to prevent network requests (similar to how -x prevents liche from checking external links)

README.md Show resolved Hide resolved
@mre
Copy link
Member Author

mre commented Jun 27, 2021

Question: Should lychee check the existence of an index.html if a link points to a folder? E.g. if we find a link to /about, would we just check if the folder exists or if there is an actual HTML file that could be loaded?

I'm thinking we shouldn't check for the index.html, because the specifics heavily depend on the server configuration; e.g. Apache or NGINX might accept an index.htm or an index.php as well or show a directory listing if none of these exist. This behavior can be overwritten, so we can't be sure unless we check the server config. That is a whole other rabbit hole, which we should probably avoid. It might be a caveat that we should document somewhere maybe. Thoughts?

@MichaIng
Copy link
Member

MichaIng commented Jun 27, 2021

I'm thinking we shouldn't check for the index

I agree for the same reasons. When the directory exists, then in 99% of cases something is served. Theoretically an --index option could be added to define accepted index files, but I cannot imagine that this would be much used.

@mre mre marked this pull request as ready for review September 3, 2021 00:04
path::{Path, PathBuf},
};

use glob::glob_with;
Copy link
Member Author

@mre mre Sep 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: All of the removed type definitions below got moved into the the types module.


impl<P: AsRef<Path>> From<P> for FileType {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't removed, just moved to types/file.rs

@mre
Copy link
Member Author

mre commented Sep 3, 2021

@lebensterben @MichaIng @pawroman:
This is ready for review from my side. Nevermind the publish check issues. That's because of changes to the lychee-lib, which aren't published yet. We hopefully find a solution for that in #305.

Apart from that, the tests are passing and I hope I covered all cases (let me know if I didn't!). Also if you can spare some time I'd be super thankful if you could try it on a local project like a static website for instance. Something like this should work:

cargo install --git https://github.com/lycheeverse/lychee --branch local-files
# Test local file links only
lychee --offline --base public public/index.html
# Test all the links
lychee --base public public/index.html

The base is needed if you have any non-relative links (e.g. /about).
This could be a little smarter by assuming the path to the root directory as base if it is not set (e.g. public in this case).
However I wasn't sure what to do in case of multiple inputs. We could use the base each input, but that would require to track the origin of a link throughout the run and I didn't want to add that functionality right away (YAGNI).

@mre
Copy link
Member Author

mre commented Sep 4, 2021

Nevermind the publish check issues. That's because of changes to the lychee-lib, which aren't published yet. We hopefully find a solution for that in #305.

Well @dblock fixed that and the build is green now. Thanks! 😊

@untitaker
Copy link
Collaborator

untitaker commented Sep 4, 2021

Question: Should lychee check the existence of an index.html if a link points to a folder?

I think I disagree with the decision made here. I would suggest that you pick one or two static site hosters, such as gh pages and netlify, and document that your idea of a valid link matches those service's idea. Particularly gh pages offers no configuration as to what is considered an index page which imo already establishes strong conventions that most static site generators try to follow.

I think php files are a red herring because you'd have to execute them to find broken links, so for those usecases lychee-on-local-files is already a bad fit. I'd focus on pure static sites for the local fs usecases.

@MichaIng
Copy link
Member

MichaIng commented Sep 4, 2021

@untitaker
But all webservers support automatic directory indexes, so when a local link points to a directory, how do you know whether an actually contained index.html is required or whether it's aimed to be served as directory index? To not cause a lot of false positives by doing false assumptions via additional logic, you'd need to as well provide an option to define one or more valid index files, or whether required at all, which makes it more and more complicated. Mid- or long-term I'd agree to have such optionally, but would by default always succeed when the directory itself exist and the option is not specified.

For PHP files on the other hand, I don't see why it makes a difference, as the convention to use index.php is even stronger compared to index.html/xhtml/xml/htm/ what what not that exist for static pages 😉. So no need to "execute" PHP (out of scope anyway) but simply check for the index.php index file. But again, IMHO this would be up to the user to define it as valid index via mentioned option. On dynamic websites, internal rewrites are more common, e.g. to have beautified URLs without index.php in the middle, which makes it harder to do everything right OOTB, but lychee has convenient options, now also with the exclude file, to handle such cases.

For now however, I think it is pretty great, @mre many thanks for your hard work the last days, I'll implement it into our workflows and show you a before/after.

@MichaIng
Copy link
Member

MichaIng commented Sep 4, 2021

Test on main website

lychee v0.7.1

Run ./lychee -X head -b 'http://127.0.0.1' --github-token '***' '**/*.md' '**/*.html'
📝 Summary
---------------------
🔍 Total..........299
✅ Successful.....299
⏳ Timeouts.........0
🔀 Redirected.......0
👻 Excluded.........0
🚫 Errors...........0

lychee v0.7.2 (local-files branch)

Run lychee -X head -b '.' --github-token '***' '**/*.md' '**/*.html'
📝 Summary
---------------------
🔍 Total..........637
✅ Successful.....625
⏳ Timeouts.........0
🔀 Redirected.......0
👻 Excluded.........0
🚫 Errors..........12

Errors in dietpi-software.html
✗ file:///home/runner/work/DietPi-Website/DietPi-Website/%23gettingstarted (Invalid file URI: file:///home/runner/work/DietPi-Website/DietPi-Website/%23gettingstarted)
✗ file:///home/runner/work/DietPi-Website/DietPi-Website/%23download (Invalid file URI: file:///home/runner/work/DietPi-Website/DietPi-Website/%23download)
✗ file:///home/runner/work/DietPi-Website/DietPi-Website/%23features (Invalid file URI: file:///home/runner/work/DietPi-Website/DietPi-Website/%23features)
✗ file:///home/runner/work/DietPi-Website/DietPi-Website/%23testimonials (Invalid file URI: file:///home/runner/work/DietPi-Website/DietPi-Website/%23testimonials)

Errors in contribute.html
✗ file:///home/runner/work/DietPi-Website/DietPi-Website/%23testimonials (Invalid file URI: file:///home/runner/work/DietPi-Website/DietPi-Website/%23testimonials)
✗ file:///home/runner/work/DietPi-Website/DietPi-Website/%23download (Invalid file URI: file:///home/runner/work/DietPi-Website/DietPi-Website/%23download)
✗ file:///home/runner/work/DietPi-Website/DietPi-Website/%23features (Invalid file URI: file:///home/runner/work/DietPi-Website/DietPi-Website/%23features)
✗ file:///home/runner/work/DietPi-Website/DietPi-Website/%23gettingstarted (Invalid file URI: file:///home/runner/work/DietPi-Website/DietPi-Website/%23gettingstarted)

Errors in stats.html
✗ file:///home/runner/work/DietPi-Website/DietPi-Website/%23gettingstarted (Invalid file URI: file:///home/runner/work/DietPi-Website/DietPi-Website/%23gettingstarted)
✗ file:///home/runner/work/DietPi-Website/DietPi-Website/%23features (Invalid file URI: file:///home/runner/work/DietPi-Website/DietPi-Website/%23features)
✗ file:///home/runner/work/DietPi-Website/DietPi-Website/%23download (Invalid file URI: file:///home/runner/work/DietPi-Website/DietPi-Website/%23download)
✗ file:///home/runner/work/DietPi-Website/DietPi-Website/%23testimonials (Invalid file URI: file:///home/runner/work/DietPi-Website/DietPi-Website/%23testimonials)

It finds more than doubled amount of links, and, looking at the failures, I guess it interprets page anchors, like /#gettingstarted, to link to the id="gettingstarted" element on the main index page, as part of the path, which then becomes invalid of course, and it explains the large amount of additional links, for every unique page anchor one. Not sure if the URL-encoding is done before or after doing the check, but in both cases, it shouldn't be done. The URLs need to be taken literally in every case: If an URL-encoding is missing, then this is an error in the checked HTML/source and needs to be fixed, isn't it?

@mre
Copy link
Member Author

mre commented Sep 4, 2021

So if I understand correctly it should strip off those anchors right? 😅

@MichaIng
Copy link
Member

MichaIng commented Sep 4, 2021

So if I understand correctly it should strip off those anchors right? 😅

Theoretically, if the URL-encoding done somewhere would be removed, the "fragment" (now I remember the correct word for this 😄) may be ignored, stripped and handled as one URL already, as this is true for external URLs with current lychee releases. But if indeed local URLs are handled differently in the first place, so that not (only) the URL-encoding is the issue, then yes.


Pure fragment links are ignored here already, and this seems to work fine: https://github.com/lycheeverse/lychee/pull/262/files#diff-9974c1c6274d302fb56650636debf746561e468f46fccc30dec65f4345abe268R30-R50


I'm unable to find where the URL-encoding actually happens. The first function where things are started to get treated differently seems to be extract_links. URL-encoding cannot have happened first, else there would have been a lot more errors on external links. Also there are pure fragment/anchor links in the checked files, which are obviously detected correctly by is_anchor and silently ignored, hence the # is still literal there. For local links we then have:

  • create_uri_from_path
  • remove_get_params (shouldn't have an effect here since the above failed URLs do not contain a query string)
  • resolve in this case with is_absolute => to_path_buf => clean
  • from_file_path ... ah, I guess here it is done, as usually a file path is not expected to have a fragment which shall be treated as fragment for the URL, so I guess this implies URL encoding (while it is not explicitly documented.

@mre
Copy link
Member Author

mre commented Sep 4, 2021

I would suggest that you pick one or two static site hosters, such as gh pages and netlify, and document that your idea of a valid link matches those service's idea. Particularly gh pages offers no configuration as to what is considered an index page which imo already establishes strong conventions that most static site generators try to follow.

Yeah that's a tricky one. My fear is that whatever standard we'll pick will lead to weird inconsistencies down the road. I agree that Github Pages is sort of a quasi-standard at the moment, but I wouldn't bet on any current solution for the future. I'm just trying to be conservative with my guarantees here so that there are no big surprises.

If anything, we should think about making this configurable. We could add a param like --index_page index.html, which would accept a regex and would make a index files mandatory for each directory. This way users could change the behavior to match their own static website layout.

@untitaker
Copy link
Collaborator

I think it's more dangerous and less conservative to over-accept links as valid rather than under-accept. If you choose to opt for over-accepting to avoid floods of errors by default, I fear that the experience over time will be that newer versions lychee will find less and less 404s in its default mode as support for more usecases is added

For PHP files on the other hand, I don't see why it makes a difference, as the convention to use index.php is even stronger compared to index.html/xhtml/xml/htm/ what what not that exist for static pages wink. So no need to "execute" PHP (out of scope anyway) but simply check for the index.php index file.

Yes, you can validate a link to index.php without parsing the PHP file. But I would argue that in such a hypothetical usecase lychee is already a bad fit anyway because one would likely want to check the entire website for broken links, not just individual files that lychee happens to be able to parse (i.e. they will likely opt for HTTP-based link checking). My broader argument here is that lychee-on-fs is only realistically usable on fully static sites so why even think about PHP?

@untitaker
Copy link
Collaborator

bug: If I run lychee --offline -b . '**/*.html' on sentry-docs, I get:

Error: Failed to read file: `page-data/404.html`, reason: Is a directory (os error 21)

The reason for this is that the glob matches a directory named 404.html (don't ask me what that is, it's some random stuff Gatsby generated), and that cannot be read as a file. Would argue that the glob should never match dirs, but not sure what arguments I'm passing.

So I resorted to:

~/projects/lychee/target/release/lychee --offline -b . $(rg --files -thtml)

There's some overhead imposed by rg --files, and then it takes 3-4 minutes to run. I then hit the same problem as @MichaIng with anchors not resolving to filepaths.

@MichaIng
Copy link
Member

MichaIng commented Sep 5, 2021

Yes, you can validate a link to index.php without parsing the PHP file. But I would argue that in such a hypothetical usecase lychee is already a bad fit anyway because one would likely want to check the entire website for broken links

Yes that is true, and I think it is out of scope to hassle with this. For a dynamic website one can spin up a webserver, as well in CI, and then use lychee to check the served website. #165 will make this even easier.

The reason for this is that the glob matches a directory named 404.html

Lol, what the hell creates a directory named like that 😄. However, I think directories should be skipped silently and not throw an error, at least when being matched by a glob instead of given as explicit input (if it is sufficiently easy to implement skipping for globs only).

mre added a commit that referenced this pull request Sep 5, 2021
Directories can still have a suffix which looks like
a file extension like `foo.html`. This can lead to
unexpected behavior with glob patterns like
`**/*.html`. Therefore filter these out.
#262 (comment)
E.g. `web%20site` becomes `web site`.
That's because Url::from_file_path will encode the full URL in the end.
This behavior cannot be configured.
See #262 (comment)
@mre
Copy link
Member Author

mre commented Sep 8, 2021

Another nice catch. This should be resolved now as well.
I had to look into how the url crate handles encoding. Turns out they encode the URL automatically and there is no way to disable that functionality. Since splitting the URL manually and encoding the individual parts would sort of defeat the point of using the url crate in the first place, I decided to do the simple thing and percent_decode the path section before passing it to Url::from_file_path. Added a test for that but let me know if that doesn't make any sense.
Other than that, +1?

@MichaIng
Copy link
Member

MichaIng commented Sep 9, 2021

Hmm, but now when the base or root path contains by chance a literal URL coding character (series), it will be decoded as well, leading to a false negative 🤔. For regular three character percent coding, this case is very unlikely, so IMO could be ignored. But there is this nasty case of +, coding to space (same as %20), which has a higher chance to exist in a file path. To be 100% failsafe, there is no way around splitting base/root path and link path, encoding only the base/root path once and expecting the link path to be encoded already, else double-encode or double-decode is theoretically possible. But if the + is not decoded by percent_decode_str, it is IMHO good enough. Sadly the documentation does not state that detail. I'll test.

I just remember this case from a session token, passed via query string, decoded by a buggy webserver option, decoded again by PHP when accessing via $_GET, leading to a token with spaces, where literal + were expected, which caused logout issues from a web application for years until we tracked it down to the unexpected webserver behaviour 😄.

@mre
Copy link
Member Author

mre commented Sep 9, 2021

I've added a test for that case:

#[test]
fn test_create_uri_from_path() {
    let result = create_uri_from_path(&PathBuf::from("/README.md"), &None, "test+encoding").unwrap();
    assert_eq!(result.as_str(), "file:///test+encoding");
}

Is that what you meant?

@MichaIng
Copy link
Member

MichaIng commented Sep 9, 2021

EDIT: Sorry in the first place, as the blow tests and thoughts caused me headache 😄. TL;DR: Skip to the bottom.


I just realised that indeed only the path within the link is decoded, not the full path, so the theoretical case of a literal percent code in the root/base path shouldn't cause an issue. Made some tests:

~# ls web+site
 README.md 'web site'

~# cat web+site/README.md
<a href="/web site/">
<a href="/web%20site">

~# lychee -vb 'web+site' 'web+site/README.md'
✔ file:///root/web+site/web%20site [200 OK]

~/web+site# lychee -vb .
✔ file:///root/web+site/web%20site [200 OK]
  • So in both cases, with + in the given base path or in the root path, it is taken literally and hence succeeds the check 👍.
  • I actually would have expected that the + itself is URL encoded by from_file_path to %2B, like the space in the sub directory correctly is to %20 🤔. EDIT: All correct, see further below!
  • While double decoding of the link path is now not an issue anymore, un-encoded characters in the link do succeed. The %20 is decoded into space, so the two links are equal, then from_file_path encodes both to %20. Most browsers convert encode spaces (and probably other special characters) automatically, but not all clients, so strictly seen this is a false positive:
    # curl -f 'https://dietpi.com/test%20dir/'
    # curl -f 'https://dietpi.com/test dir/'
    curl: (22) The requested URL returned error: 400
    

With a real %-code:

~# lychee -vb 'web%20site' 'web%20site/README.md'
✔ file:///root/web%2520site/web%20site [200 OK]

~/web%20site# lychee -vb .
✔ file:///root/web%2520site/web%20site [200 OK]
  • So the literal % character, as part of either the base path or the root path, is correctly skipped for percent_decode_str but then encoded to %25 by from_file_path 👍.

Another one:

~/web+site# cat index.html
<a href="/web%2Bsite/">

~/web+site# ls
index.html  web+site

~/web+site# lychee -vb . 'index.html'
✔ file:///root/web+site/web+site [200 OK]
  • So the %2B is decoded correctly to +, but then not encoded anymore by from_file_path. Makes sense as it was the same with the + in the base path. Looks like at least these functions do neither see the + character as something that needs to be decoded (to space), nor as something that needs to be URL-encoded. I just tested it in the browser, and while it automatically changes a space in the address bar into %20, it leaves the + untouched. Further checking on this: https://stackoverflow.com/a/1006074/16145737
    So if I understand it correctly, the + has indeed officially no special meaning in URL paths, but only in the query string part, interpreted as space by PHP, and otherwise by some RFC violating behaviour of .NET or so. Since we strip the query string, it is hence perfectly fine that the + is neither decoded, nor encoded but just left untouched.

Last test:

~/web+site# ls
index.html  web%20site

~/web+site# cat index.html
<a href="/web%20site/">

~/web+site# lychee -vb . 'index.html'
✗ file:///root/web+site/web%20site (Invalid file URI: file:///root/web+site/web%20site)

~/web+site# cat index.html
<a href="/web%2520site/">
~/web+site# lychee -vb . 'index.html'

  • In this case, where the link path contains a literally meant URL code, it is required to have it encoded in the link.
  • This is now inconsistent with the fact that a space or other special characters do not need to be encoded in the link.

So as a result

the only left issue is that special characters, which are not co-incidentally part of an URL coding, do not need to be URL encoded in the input file/link. While most browsers handle un-encoded spaces and probably other characters gracefully, it is at least bad practice and fails definitely curl and some other command line tools and libraries. If there is co-incidentally a literally meant URL code in the link path, it is decoded by percent_decode_str and hence in this case needs to be URL encoded in the input file/link, which is an inconsistency, though one that becomes apparent in a highly unlikely case.

IMHO it is good enough to go like that. Probably a note could be left in the code, so that in case of a refactoring, probably this left issue and inconsistency can be resolved. As said, for this the base and root path elements need to be URL-encoded, while the input link path element needs to be appended untouched, so that it is correctly expected to be URL encoded already.

The full URL checks also do not enforce URL encoding but do it automatically, when missing. I personally would change this as well, but it makes the local file check again consistent 😄:

# cat index.html
<a href="https://dietpi.com/web site/">
# lychee -v index.html
✔ https://dietpi.com/web%20site/ [200 OK]

@mre
Copy link
Member Author

mre commented Sep 9, 2021

Good point. I added a TODO so that we don't forget.

@mre mre merged commit 9b5fc39 into master Sep 9, 2021
@mre mre deleted the local-files branch September 9, 2021 17:50
@mre
Copy link
Member Author

mre commented Sep 9, 2021

Merged! 🎉
Thanks for your help @MichaIng and @untitaker.

@untitaker
Copy link
Collaborator

@MichaIng re discrepancy in link count: will file follow-up ticket if I find anything, in principle anchor checking is disabled in hyperlink by default

@untitaker
Copy link
Collaborator

Great work, sorry I didn't really catch up with the last couple messages at all! I'll do some more functional testing at a later point but I think yall got it covered...

@mre
Copy link
Member Author

mre commented Sep 9, 2021

It wouldn't have worked without your inspiration, so thanks for that. I learned a lot.

@untitaker
Copy link
Collaborator

@mre

Had a look at hyperlink yesterday and I'm impressed by the readability and clarity of the code without sacrificing performance. Good job there! Also if you find anything regarding the missing links I'd be super interested.

Thanks! There's def a lack of code comments and a lot of premature optimization, though the latter is sort of the selling point... But yeah it's been interesting with regard to performance for sure

MichaIng added a commit to MichaIng/DietPi-Docs that referenced this pull request Oct 8, 2021
MichaIng added a commit to MichaIng/DietPi-Website that referenced this pull request Oct 8, 2021
MichaIng added a commit to MichaIng/DietPi-Docs that referenced this pull request Oct 8, 2021
MichaIng added a commit to MichaIng/DietPi-Docs that referenced this pull request Oct 10, 2021
MichaIng added a commit to MichaIng/DietPi-Website that referenced this pull request Oct 10, 2021
MichaIng added a commit to MichaIng/DietPi-Docs that referenced this pull request Oct 14, 2021
MichaIng added a commit to MichaIng/DietPi-Docs that referenced this pull request Oct 27, 2021
MichaIng added a commit to MichaIng/DietPi-Website that referenced this pull request Oct 28, 2021
MichaIng added a commit to MichaIng/DietPi-Website that referenced this pull request Oct 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support checking local file links
4 participants