Add support for local files #262
Question: Should lychee check the existence of an I'm thinking we shouldn't check for the |
I agree for the same reasons. When the directory exists, then in 99% of cases something is served. Theoretically an |
        path::{Path, PathBuf},
    };

    use glob::glob_with;
Note: All of the removed type definitions below got moved into the types module.
    impl<P: AsRef<Path>> From<P> for FileType {
Wasn't removed, just moved to types/file.rs
@lebensterben @MichaIng @pawroman: Apart from that, the tests are passing and I hope I covered all cases (let me know if I didn't!). Also if you can spare some time I'd be super thankful if you could try it on a local project like a static website for instance. Something like this should work:
The |
I think I disagree with the decision made here. I would suggest that you pick one or two static site hosters, such as gh pages and netlify, and document that your idea of a valid link matches those services' idea. Particularly gh pages offers no configuration as to what is considered an index page, which imo already establishes strong conventions that most static site generators try to follow. I think php files are a red herring because you'd have to execute them to find broken links, so for those usecases lychee-on-local-files is already a bad fit. I'd focus on pure static sites for the local fs usecases.
@untitaker For PHP files on the other hand, I don't see why it makes a difference, as the convention to use For now however, I think it is pretty great, @mre many thanks for your hard work the last days, I'll implement it into our workflows and show you a before/after. |
Test on main website
lychee v0.7.1
lychee v0.7.2 (local-files branch)
It finds more than double the number of links, and, looking at the failures, I guess it interprets page anchors, like
So if I understand correctly, it should strip off those anchors, right? 😅
Theoretically, if the URL-encoding done somewhere were removed, the "fragment" (now I remember the correct word for this 😄) might already be ignored, stripped and handled as one URL, as this is true for external URLs with current lychee releases. But if local URLs are indeed handled differently in the first place, so that the URL-encoding is not the (only) issue, then yes. Pure fragment links are ignored here already, and this seems to work fine: https://github.com/lycheeverse/lychee/pull/262/files#diff-9974c1c6274d302fb56650636debf746561e468f46fccc30dec65f4345abe268R30-R50 I'm unable to find where the URL-encoding actually happens. The first function where things start to be treated differently seems to be
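For illustration, here is a minimal sketch of splitting the fragment off a local link before the path is looked up on disk. The helper name `split_fragment` is hypothetical and not taken from the lychee code base; it only assumes that the split happens before any URL-encoding is applied, so the `#` is still a literal character.

```rust
/// Hypothetical helper (not lychee's API): split a link into the part that
/// names a file and the optional fragment after the first `#`.
fn split_fragment(link: &str) -> (&str, Option<&str>) {
    match link.split_once('#') {
        // Everything before the first `#` is the path to check on disk.
        Some((path, fragment)) => (path, Some(fragment)),
        // No `#` present: the whole link is the path.
        None => (link, None),
    }
}
```

With this, a pure fragment link such as `#top` yields an empty path, which a caller could treat as "same document" and skip.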
Yeah, that's a tricky one. My fear is that whatever standard we pick will lead to weird inconsistencies down the road. I agree that GitHub Pages is sort of a quasi-standard at the moment, but I wouldn't bet on any current solution for the future. I'm just trying to be conservative with my guarantees here so that there are no big surprises. If anything, we should think about making this configurable. We could add a param like
I think it's more dangerous and less conservative to over-accept links as valid rather than under-accept. If you choose to opt for over-accepting to avoid floods of errors by default, I fear that the experience over time will be that newer versions of lychee will find fewer and fewer 404s in its default mode as support for more usecases is added.
Yes, you can validate a link to index.php without parsing the PHP file. But I would argue that in such a hypothetical usecase lychee is already a bad fit anyway because one would likely want to check the entire website for broken links, not just individual files that lychee happens to be able to parse (i.e. they will likely opt for HTTP-based link checking). My broader argument here is that lychee-on-fs is only realistically usable on fully static sites so why even think about PHP? |
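To make the two positions concrete, the following sketch contrasts "any existing directory is a valid target" with "a directory is only valid if it contains an index file". The names `DirPolicy` and `local_target_exists` are hypothetical, and the hard-coded `index.html` is an assumption borrowed from the GitHub Pages convention mentioned above, not something lychee defines.

```rust
use std::path::Path;

/// Hypothetical policy switch, not an existing lychee option.
#[derive(Clone, Copy, Debug, PartialEq)]
enum DirPolicy {
    /// Over-accepting: an existing directory counts as a valid target.
    AnyDirectory,
    /// Under-accepting: the directory must contain `index.html` (assumed name).
    RequireIndex,
}

/// Decide whether a resolved local link target should count as existing.
fn local_target_exists(target: &Path, policy: DirPolicy) -> bool {
    if target.is_file() {
        return true;
    }
    if !target.is_dir() {
        return false;
    }
    match policy {
        DirPolicy::AnyDirectory => true,
        DirPolicy::RequireIndex => target.join("index.html").is_file(),
    }
}
```

The stricter variant reports more failures on sites whose server is configured with a different index file, which is the trade-off debated in this thread.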
bug: If I run
The reason for this is that the glob matches a directory named 404.html (don't ask me what that is, it's some random stuff Gatsby generated), and that cannot be read as a file. Would argue that the glob should never match dirs, but not sure what arguments I'm passing. So I resorted to:
There's some overhead imposed by |
Yes, that is true, and I think it is out of scope to bother with this. For a dynamic website one can spin up a web server, in CI as well, and then use
Lol, what the hell creates a directory named like that 😄. However, I think directories should be skipped silently and not throw an error, at least when being matched by a glob instead of given as explicit input (if it is sufficiently easy to implement skipping for globs only). |
Directories can still have a suffix which looks like a file extension like `foo.html`. This can lead to unexpected behavior with glob patterns like `**/*.html`. Therefore filter these out. #262 (comment)
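A minimal sketch of that filtering step, using only the standard library. It assumes the glob expansion has already produced a list of candidate paths; `keep_files` is a hypothetical name, and the actual change in this PR may be structured differently.

```rust
use std::path::PathBuf;

/// Hypothetical helper: drop every candidate that is not a regular file,
/// so a directory named e.g. `404.html` is skipped silently instead of
/// failing when it is opened for reading.
fn keep_files(candidates: impl IntoIterator<Item = PathBuf>) -> Vec<PathBuf> {
    candidates
        .into_iter()
        .filter(|path| path.is_file())
        .collect()
}
```

Note that `Path::is_file` follows symlinks and returns `false` for paths that cannot be accessed, so broken symlinks are skipped as well.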
E.g. `web%20site` becomes `web site`. That's because Url::from_file_path will encode the full URL in the end. This behavior cannot be configured. See #262 (comment)
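The decoding step can be sketched with the standard library alone. This is an illustration under assumptions, not the code from this PR (which may rely on a crate instead): `percent_decode` and `hex_value` are hypothetical names, malformed sequences are left untouched, and `+` is deliberately not treated as a space.

```rust
/// Value of a single ASCII hex digit, or `None` if the byte is not one.
fn hex_value(byte: u8) -> Option<u8> {
    match byte {
        b'0'..=b'9' => Some(byte - b'0'),
        b'a'..=b'f' => Some(byte - b'a' + 10),
        b'A'..=b'F' => Some(byte - b'A' + 10),
        _ => None,
    }
}

/// Hypothetical helper: decode `%XX` sequences in the path part of a link.
/// Incomplete or invalid sequences such as `%zz` or a trailing `%` are kept
/// literally; invalid UTF-8 after decoding is replaced lossily.
fn percent_decode(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            if let (Some(hi), Some(lo)) = (hex_value(bytes[i + 1]), hex_value(bytes[i + 2])) {
                out.push(hi * 16 + lo);
                i += 3;
                continue;
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    String::from_utf8_lossy(&out).into_owned()
}
```

Decoding only the link's path, and not the base or root directory, keeps a literal `%20` in a directory name on disk from being altered.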
Another nice catch. This should be resolved now as well. |
Hmm, but now when the base or root path happens to contain a literal percent-encoded character (sequence), it will be decoded as well, leading to a false negative 🤔. For regular three-character percent-encoding, this case is very unlikely, so IMO it could be ignored. But there is this nasty case of I just remember this case from a session token, passed via query string, decoded by a buggy webserver option, decoded again by PHP when accessing via
I've added a test for that case:

    #[test]
    fn test_create_uri_from_path() {
        let result = create_uri_from_path(&PathBuf::from("/README.md"), &None, "test+encoding").unwrap();
        assert_eq!(result.as_str(), "file:///test+encoding");
    }

Is that what you meant?
EDIT: Sorry in the first place, as the tests and thoughts below gave me a headache 😄. TL;DR: Skip to the bottom. I just realised that indeed only the path within the link is decoded, not the full path, so the theoretical case of a literal percent code in the root/base path shouldn't cause an issue. Made some tests:
With a real %-code:
Another one:
Last test:
So as a result, the only remaining issue is that special characters which do not coincidentally form part of a URL encoding do not need to be URL-encoded in the input file/link. While most browsers handle unencoded spaces and probably other characters gracefully, it is at least bad practice and fails definitely IMHO it is good enough to go like that. A note could be left in the code, so that in case of a refactoring this remaining issue and inconsistency can be resolved. As said, for this the base and root path elements need to be URL-encoded, while the input link path element needs to be appended untouched, so that it is correctly expected to be URL-encoded already. The full URL checks also do not enforce URL encoding but apply it automatically when missing. I personally would change this as well, but it makes the local file check consistent again 😄:
Good point. I added a TODO so that we don't forget.
Merged! 🎉
@MichaIng re discrepancy in link count: will file follow-up ticket if I find anything, in principle anchor checking is disabled in hyperlink by default
Great work, sorry I didn't really catch up with the last couple messages at all! I'll do some more functional testing at a later point but I think yall got it covered...
It wouldn't have worked without your inspiration, so thanks for that. I learned a lot.
Thanks! There's def a lack of code comments and a lot of premature optimization, though the latter is sort of the selling point... But yeah it's been interesting with regard to performance for sure |
Fixes #21
TODO:
lychee deploy/specific/sub/page.html --base-url deploy/
) for absolute local links.
Absolute local links (leading slash) require the base dir option to define the document root, which can be an external URL as well as a local directory path. So I was able to do liche -d build -r build/docs to check all local links, where the website is built into the build/docs directory and will be reached at https://example.org/docs finally. (see comment)
--document-root or --base defaults to the parent path of the file
-x prevents liche from checking external links)
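The resolution rule described in this TODO list can be sketched as follows. `resolve_local_link` is a hypothetical helper, not the function added by this PR, and it assumes the base is a local directory (the external-URL case is left out): relative links resolve against the directory of the file that contains them, and absolute links (leading slash) resolve against the base directory if one was given.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: map a link found in `source` to a path on disk.
/// Returns `None` when an absolute link cannot be resolved because no base
/// directory (document root) was supplied, or when `source` has no parent.
fn resolve_local_link(source: &Path, link: &str, base_dir: Option<&Path>) -> Option<PathBuf> {
    if link.starts_with('/') {
        // Strip leading slashes so `join` appends instead of replacing the base.
        // Protocol-relative links such as `//host/x` are not handled here.
        let relative = link.trim_start_matches('/');
        base_dir.map(|base| base.join(relative))
    } else {
        source.parent().map(|dir| dir.join(link))
    }
}
```

For the example command above, a link `/css/site.css` inside `deploy/specific/sub/page.html` with base `deploy/` would resolve to `deploy/css/site.css`, while `other.html` would resolve to `deploy/specific/sub/other.html`.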