Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always succeed empty fragment checks #1599

Open
MichaIng opened this issue Dec 25, 2024 · 11 comments · May be fixed by #1609
Open

Always succeed empty fragment checks #1599

MichaIng opened this issue Dec 25, 2024 · 11 comments · May be fixed by #1609
Labels
enhancement New feature or request

Comments

@MichaIng
Copy link
Member

MichaIng commented Dec 25, 2024

Currently we see errors like

file:///home/runner/work/repo/path/to/stats.html# | Failed: Cannot find fragment

while the empty fragment # is well known to scroll to the top of the page, just like #top, while the latter can be overwritten, if there is an HTML element with id="top". I suggest to succeed the fragment check right away if any of those two special fragments is found.

Sadly I could not find a good source in RFCs or similar. Maybe it is just browsers implementing this outside of specs. The empty fragment is very widely used, the #top however I found just now, while searching for a related source. It works however well in all browsers I checked: https://stackoverflow.com/a/77560206

@thiru-appitap
Copy link

Please apply this patch and confirm if this resolves the issue
top_fragment.patch

FYI: the patch also updates the fixtures/fragments/file.html with the empty fragment anchor and a #top anchor (included for testing)

@mre mre added the enhancement New feature or request label Jan 3, 2025
@MichaIng
Copy link
Member Author

MichaIng commented Jan 6, 2025

@thiru-appitap
Somehow it does not work here. Does it succeed the added tests in your case? cargo test currently runs into OOM errors here 😅.

root@VM-Bookworm:~/DietPi-Docs# grep '"top"' /root/lychee/lychee-lib/src/utils/fragment_checker.rs
                let contains_fragment = fragment.is_empty() && file_frags.contains("top")
root@VM-Bookworm:~/DietPi-Docs# /root/lychee/target/debug/lychee --include-fragments -vvv test.md
     [200] file:///root/DietPi-Docs/build/docs/index.html#getting-started
   [ERROR] file:///root/DietPi-Docs/build/docs/index.html#nonexistent | Cannot find fragment
   [ERROR] file:///root/DietPi-Docs/build/docs/index.html# | Cannot find fragment
   [ERROR] file:///root/DietPi-Docs/build/docs/index.html#top | Cannot find fragment

Issues found in 1 input. Find details below.

[test.md]:
   [ERROR] file:///root/DietPi-Docs/build/docs/index.html#nonexistent | Cannot find fragment
   [ERROR] file:///root/DietPi-Docs/build/docs/index.html#top | Cannot find fragment
   [ERROR] file:///root/DietPi-Docs/build/docs/index.html# | Cannot find fragment

EDIT: The # and #top fragments needs to be checked explicitly here: https://github.com/lycheeverse/lychee/blob/master/lychee-bin/tests/cli.rs#L1810
The links in that file are not currently checked, AFAIK, and they would fail anyway, since relative file path checks fail in any case, which includes fragment-only URLs: #1574

From your patch, the #top fragment is also not included correctly. Currently, the logic says to allow an empty fragment if the file contains a top ID, if I understand it correctly? A #top fragment should just succeed in any case, it does not need to exist in the file, but is always interpreted by browser to scroll to top, if the ID does not exist, else scrolls to the element with the ID. And as there are no parenthesis, not sure whether then && takes procedure over the following || or so. Let me try to fix it.

EDIT2: Okay now I have:

                let contains_fragment = fragment.is_empty()
                    || fragment.contains("top")
                    || file_frags.contains(fragment)
                    || file_frags.contains(&fragment_decoded as &str);

With this, the #top fragment check succeeds, but the # still fails. so is_empty() does not apply here for some reason 🤔. Does fragment contain the # itself?

EDIT3: fragment does not contain the #. Another problem is that contains("top") of course applies to "abctop123", which must not be true. I tried:

                let contains_fragment = fragment == ""
                    || fragment == "top"
                    || file_frags.contains(fragment)
                    || file_frags.contains(&fragment_decoded as &str);

which weirdly sometimes succeeds with one of "" and "top", sometimes fails with both, but never succeeds with both, same when adding some to_string() and alike. I clearly do not understand enough about Rust types and pointers 😅. Searching around, I also tried:

                let contains_fragment = match fragment {
                        "" | "top" => true,
                        _ => false
                    }
                    || file_frags.contains(fragment)
                    || file_frags.contains(&fragment_decoded as &str);

which again consequently fails for both. However, I think the logic I aim for is clear, and I hope someone can implement this correctly, based on the type of fragment. I did not want to start using additional crates for this.

@mre
Copy link
Member

mre commented Jan 6, 2025

Can you open a pull request with your last attempt and the changes from @thiru-appitap? Then I can take a look and perhaps spot the bug.

@MichaIng MichaIng linked a pull request Jan 8, 2025 that will close this issue
@MichaIng
Copy link
Member Author

MichaIng commented Jan 8, 2025

PR up: #1609
Tests included.

@mre
Copy link
Member

mre commented Jan 9, 2025

Thanks for the PR. The race-condition you describe is certainly odd.

@MichaIng
Copy link
Member Author

MichaIng commented Jan 9, 2025

Just to be sure: any cache includes the fragment, right?

@mre
Copy link
Member

mre commented Jan 9, 2025

The fragment cache is here: https://github.com/lycheeverse/lychee/blob/master/lychee-lib/src/utils/fragment_checker.rs
The cache is stored in a HashMap with the URL as the key and a HashSet of fragments as the value.

The normal request caching logic is here:

async fn handle(
client: &Client,
cache: Arc<Cache>,
cache_exclude_status: HashSet<u16>,
request: Request,
accept: HashSet<u16>,
) -> Response {
let uri = request.uri.clone();
if let Some(v) = cache.get(&uri) {
// Found a cached request
// Overwrite cache status in case the URI is excluded in the
// current run
let status = if client.is_excluded(&uri) {
Status::Excluded
} else {
// Can't impl `Status::from(v.value().status)` here because the
// `accepted` status codes might have changed from the previous run
// and they may have an impact on the interpretation of the status
// code.
Status::from_cache_status(v.value().status, &accept)
};
return Response::new(uri.clone(), status, request.source);
}
// Request was not cached; run a normal check
let response = check_url(client, request).await;
// - Never cache filesystem access as it is fast already so caching has no
// benefit.
// - Skip caching unsupported URLs as they might be supported in a
// future run.
// - Skip caching excluded links; they might not be excluded in the next run.
// - Skip caching links for which the status code has been explicitly excluded from the cache.
let status = response.status();
if ignore_cache(&uri, status, &cache_exclude_status) {
return response;
}
cache.insert(uri, status.into());
response
}

The full URI (including the fragment) is cached.

@thiru-appitap
Copy link

thiru-appitap commented Jan 12, 2025

@MichaIng, @mre I was away and so have missed this conversation so far - my comments below:

  1. yes - the simplified logic check (EDIT: 2) makes more sense.
  2. The empty fragment (#) check is failing because:
    • in the cache-hit scenario, empty fragment check is missing (fragment.is_empty()
      (extractor never includes the empty string for "#" into the fragments list and so an explicit empty fragment check is required)

Made this below code change and verified it is working (file1.html & file1.md changes are noted below)

       let is_emtpty_or_top_fragment = fragment.is_empty() || fragment.eq_ignore_ascii_case("top");
        match self.cache.lock().await.entry(url_without_frag) {
            Entry::Vacant(entry) => {
                let content = fs::read_to_string(path).await?;
                let file_frags = extractor(&content);
                let contains_fragment = is_emtpty_or_top_fragment 
                    || file_frags.contains(fragment)
                    || file_frags.contains(&fragment_decoded as &str);
                log::debug!("contains_fragment ({fragment}): {contains_fragment}");
                entry.insert(file_frags);
                Ok(contains_fragment)
            }
            Entry::Occupied(entry) => Ok(is_emtpty_or_top_fragment
                || entry.get().contains(fragment)
                || entry.get().contains(&fragment_decoded as &str)),
        }
  1. Testing:
    a). made below changes into file.html directly and tested with it to confirm the fix is working
  <body>
    <p id="top">Top of the page</p>
    <section id="in-the-beginning" style="height: 100vh;">
      <p id="Upper-ÄÖö">
        <div id="tangent%3A-kustomize"></div>
        To start
        <a href="file1.md#fragment-1">
          let's run away.
        </a>
      </p>
    </section>
    <section>
      <p id="a-word">Word</p>
      <a href="#in-the-beginning">back we go</a><br>
      <a href="#in-THE-begiNNing">back we go upper does not work</a><br>
      <a href="#tangent%3A-kustomize">id with percent encoding</a><br>
      <a href="#Upper-ÄÖö">back to Upper-ÄÖö</a><br>
      <a href="#Upper-%C3%84%C3%96%C3%B6">back to öüä encoded</a><br>
      <a href="#in-the-end">doesn't exist</a><br>
      <a href="#">To the top through empty fragment</a><br>
      <a href="#top">Back to the top</a>
      <a href="#embeddedtopabc">To top (embedded in between) - does not work</a>
      <a href="#endingtop">To top (ending) - does not work</a>
    </section>
  </body>
b). added an empty heading (as below) to file1.md and confirmed working of file1.md# & file1.md#top (browser honours both of these)
#  

BTW, the cargo test had succeeded in my case and i didn't encounter the race condition - have run few tests and still couldn't hit the issue and will keep trying.

thiru-appitap pushed a commit to thiru-appitap/lychee that referenced this issue Jan 12, 2025
@mre
Copy link
Member

mre commented Jan 12, 2025

If you don't mind, could you open a pull request with your changes similar to how @MichaIng did? It would make review and testing much easier.

@MichaIng
Copy link
Member Author

@thiru-appitap
Great find with the missing empty/top fragment handling, when a cache entry is present. This explains the race condition/inconsistent results in the same turn, since checks run in parallel, and sometimes the cache key might have been created already, sometimes not.

Why did you add a top ID and an empty heading to the test files? Those fragments are supposed to work without any of the IDs. Hence in my tests I added checks for those fragments, but did explicitly not add related headings or IDs, otherwise false negatives are possible.

MichaIng added a commit to MichaIng/lychee that referenced this issue Jan 13, 2025
The empty "#" and "#top" fragments are always valid without related HTML element. Browser will scroll to the top of the page. Hence lychee must not fail on those.

Credits go to @thiru-appitap for initial attempt and helping to find missing parts of the implementation.

Solves: lycheeverse#1599

Signed-off-by: MichaIng <micha@dietpi.com>
MichaIng added a commit to MichaIng/lychee that referenced this issue Jan 13, 2025
The empty "#" and "#top" fragments are always valid without related HTML element. Browser will scroll to the top of the page. Hence lychee must not fail on those.

Credits go to @thiru-appitap for initial attempt and helping to find missing parts of the implementation.

Solves: lycheeverse#1599

Signed-off-by: MichaIng <micha@dietpi.com>
MichaIng added a commit to MichaIng/lychee that referenced this issue Jan 13, 2025
The empty "#" and "#top" fragments are always valid without related HTML element. Browser will scroll to the top of the page. Hence lychee must not fail on those.

Credits go to @thiru-appitap for initial attempt and helping to find missing parts of the implementation.

Solves: lycheeverse#1599

Signed-off-by: MichaIng <micha@dietpi.com>
MichaIng added a commit to MichaIng/lychee that referenced this issue Jan 13, 2025
The empty "#" and "#top" fragments are always valid without related HTML element. Browser will scroll to the top of the page. Hence lychee must not fail on those.

Credits go to @thiru-appitap for initial attempt and helping to find missing parts of the implementation.

Solves: lycheeverse#1599

Signed-off-by: MichaIng <micha@dietpi.com>
MichaIng added a commit to MichaIng/lychee that referenced this issue Jan 13, 2025
The empty "#" and "#top" fragments are always valid without related HTML element. Browser will scroll to the top of the page. Hence lychee must not fail on those.

Credits go to @thiru-appitap for initial attempt and helping to find missing parts of the implementation.

Solves: lycheeverse#1599

Signed-off-by: MichaIng <micha@dietpi.com>
MichaIng added a commit to MichaIng/lychee that referenced this issue Jan 13, 2025
The empty "#" and "#top" fragments are always valid without related HTML element. Browser will scroll to the top of the page. Hence lychee must not fail on those.

Credits go to @thiru-appitap for initial attempt and helping to find missing parts of the implementation.

Solves: lycheeverse#1599

Signed-off-by: MichaIng <micha@dietpi.com>
MichaIng added a commit to MichaIng/lychee that referenced this issue Jan 13, 2025
The empty "#" and "#top" fragments are always valid without related HTML element. Browser will scroll to the top of the page. Hence lychee must not fail on those.

Credits go to @thiru-appitap for initial attempt and helping to find missing parts of the implementation.

Solves: lycheeverse#1599

Signed-off-by: MichaIng <micha@dietpi.com>
MichaIng added a commit to MichaIng/lychee that referenced this issue Jan 13, 2025
The empty "#" and "#top" fragments are always valid without related HTML element. Browser will scroll to the top of the page. Hence lychee must not fail on those.

Credits go to @thiru-appitap for initial attempt and helping to find missing parts of the implementation.

Solves: lycheeverse#1599

Signed-off-by: MichaIng <micha@dietpi.com>
MichaIng added a commit to MichaIng/lychee that referenced this issue Jan 13, 2025
The empty "#" and "#top" fragments are always valid without related HTML element. Browser will scroll to the top of the page. Hence lychee must not fail on those.

Credits go to @thiru-appitap for initial attempt and helping to find missing parts of the implementation.

Solves: lycheeverse#1599

Signed-off-by: MichaIng <micha@dietpi.com>
MichaIng added a commit to MichaIng/lychee that referenced this issue Jan 13, 2025
The empty "#" and "#top" fragments are always valid without related HTML element. Browser will scroll to the top of the page. Hence lychee must not fail on those.

Credits go to @thiru-appitap for initial attempt and helping to find missing parts of the implementation.

Solves: lycheeverse#1599

Signed-off-by: MichaIng <micha@dietpi.com>
MichaIng added a commit to MichaIng/lychee that referenced this issue Jan 13, 2025
The empty "#" and "#top" fragments are always valid without related HTML element. Browser will scroll to the top of the page. Hence lychee must not fail on those.

Credits go to @thiru-appitap for initial attempt and helping to find missing parts of the implementation.

Solves: lycheeverse#1599

Signed-off-by: MichaIng <micha@dietpi.com>
@thiru-appitap
Copy link

@thiru-appitap Great find with the missing empty/top fragment handling, when a cache entry is present. This explains the race condition/inconsistent results in the same turn, since checks run in parallel, and sometimes the cache key might have been created already, sometimes not.

Why did you add a top ID and an empty heading to the test files? Those fragments are supposed to work without any of the IDs. Hence in my tests I added checks for those fragments, but did explicitly not add related headings or IDs, otherwise false negatives are possible.

@MichaIng
You are correct - I should've removed those entries (top ID and empty heading) from the test files (and doing so in my fork :-)) - thanks for pointing it out!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants