
Recursion Support (closes #55, #78) #165

Closed
wants to merge 27 commits into from

Conversation

mre
Member

@mre mre commented Feb 28, 2021

⚠️ This is on hold as we started implementing a stream-based approach in #330, which might supersede this branch soon. Recursion support will be added once this gets merged. We can probably rebase it on top of master then and make adjustments in this PR (or create new PR). Closing this for now.


This is a super basic implementation of recursion as discussed in #78.
It doesn't use an extractor pool as discussed in #55. I think it's not needed at this point.
The current implementation just checks the responses, filters out websites, and pushes them back into the request queue.

I deliberately avoided adding recursion support to the client, as it would break separation of concerns. I also didn't want to add a separate recursion handler to the library, as I wouldn't expect any gains from that. Recursion handling is surprisingly tricky, because there are quite a few design decisions to make. (Recurse indefinitely or stop at a predefined recursion depth? Recurse on all domains or just the set of input domains?)
So most people using the library would have to make these decisions based on their use case. At least the current recursion depth is part of the request/response struct, so it won't have to be wrapped in another struct.
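As a rough sketch of the approach described above (all names here are hypothetical and simplified; the real lychee types differ), checking a request, extracting links, and pushing them back into the queue with an incremented recursion level might look like:

```rust
use std::collections::{HashSet, VecDeque};

// Hypothetical, stripped-down request type; lychee's real struct has more fields.
struct Request {
    uri: String,
    recursion_level: usize,
}

/// Pop a request, record it as checked, and push extracted links back
/// into the queue with an incremented recursion level.
fn check_recursively(
    roots: Vec<String>,
    max_depth: usize,
    extract: impl Fn(&str) -> Vec<String>,
) -> Vec<String> {
    let mut queue: VecDeque<Request> = roots
        .into_iter()
        .map(|uri| Request { uri, recursion_level: 0 })
        .collect();
    let mut seen = HashSet::new();
    let mut checked = Vec::new();

    while let Some(req) = queue.pop_front() {
        if !seen.insert(req.uri.clone()) {
            continue; // already checked this URI
        }
        checked.push(req.uri.clone());
        if req.recursion_level >= max_depth {
            continue; // stop descending at the depth limit
        }
        for link in extract(&req.uri) {
            queue.push_back(Request {
                uri: link,
                recursion_level: req.recursion_level + 1,
            });
        }
    }
    checked
}

fn main() {
    // Toy "extractor": page "a" links to "b" and "c"; "b" links back to "a".
    let extract = |uri: &str| match uri {
        "a" => vec!["b".to_string(), "c".to_string()],
        "b" => vec!["a".to_string()],
        _ => vec![],
    };
    let checked = check_recursively(vec!["a".to_string()], 2, extract);
    println!("{:?}", checked); // "a" is not re-checked on the cycle back
}
```

The `seen` set is what keeps the cycle `a -> b -> a` from recursing forever, which relates to the "find out why recursion does not terminate" TODO below.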

TODO:

  • Cache requests
  • Filter out domains that were not in the set of inputs
  • Check recursion level
  • Find out why recursion does not terminate
  • Cleanup
  • Integration test?

@joesan
Contributor

joesan commented Mar 1, 2021

Could you let me know what this task is about? The title does give me a hint, but I fail to understand the context.

@mre mre changed the title Recursion Support Recursion Support (closes #78) Mar 1, 2021
@mre
Member Author

mre commented Mar 1, 2021

Sorry, forgot to link the issue. Here it is: #78.

@mre mre requested a review from pawroman March 2, 2021 12:26
@mre
Member Author

mre commented Mar 2, 2021

From my side this should be good to go.
@pawroman, @joesan any final comments before merging this?

@mre mre changed the title Recursion Support (closes #78) Recursion Support (closes #55, #78) Mar 2, 2021
Member

@pawroman pawroman left a comment


Nice work! Great to see this implemented.

I have left a few code comments, and one general comment: I think we should add at least one simple test for the recursive link checking.

@@ -174,6 +175,8 @@ OPTIONS:
-b, --base-url <base-url> Base URL to check relative URLs
--basic-auth <basic-auth> Basic authentication support. E.g. `username:password`
-c, --config <config-file> Configuration file to use [default: ./lychee.toml]
--depth <depth> Stop link checking beyond this maximum recursion depth. (Recommended for
large inputs.)
Member


I believe we should mention that depth less than zero also means infinite.


/// Stop link checking beyond this maximum recursion depth. (Recommended for large inputs.)
#[structopt(long)]
pub depth: Option<usize>,
Member


I'd consider allowing this to be signed (e.g. isize), so that users could explicitly express the need for infinite recursion with numbers less than zero (e.g. -1).
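A minimal sketch of that suggestion (illustrative names only, not lychee's actual CLI handling): with a signed depth, any negative value expresses "no limit".

```rust
// Hypothetical helper: decide whether a link at `level` is still within
// the configured maximum depth. A negative depth (e.g. -1) means infinite.
fn within_depth(max_depth: isize, level: usize) -> bool {
    if max_depth < 0 {
        return true; // negative depth: recurse indefinitely
    }
    level <= max_depth as usize
}

fn main() {
    assert!(within_depth(-1, 1_000_000)); // -1 => no limit
    assert!(within_depth(3, 3));          // at the limit: still checked
    assert!(!within_depth(3, 4));         // beyond the limit: skipped
    println!("ok");
}
```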

    },
    Response {
        uri: website("http://example.org/redirect"),
        status: Status::Redirected(http::StatusCode::PERMANENT_REDIRECT),
        source: Input::Stdin,
        recursion_level: 0,
Member


For convenience and to cover the "most common case", it might be worth considering adding an alternative constructor for Response which would set recursion_level to 0.

Or even have ::new constructor which would have this behavior (recursion = 0) and another one named something like ::with_recursion which would allow setting recursion level.
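A sketch of the suggested constructors on a stripped-down Response (illustrative only; the real struct carries uri, status, and source as well):

```rust
// Hypothetical, minimal Response to illustrate the constructor pair.
#[derive(Debug, PartialEq)]
struct Response {
    uri: String,
    recursion_level: usize,
}

impl Response {
    /// Covers the most common case: a top-level response at recursion level 0.
    fn new(uri: String) -> Self {
        Self::with_recursion(uri, 0)
    }

    /// Explicitly set the recursion level.
    fn with_recursion(uri: String, recursion_level: usize) -> Self {
        Response { uri, recursion_level }
    }
}

fn main() {
    let top = Response::new("http://example.org".to_string());
    assert_eq!(top.recursion_level, 0);
    let nested = Response::with_recursion("http://example.org/a".to_string(), 2);
    assert_eq!(nested.recursion_level, 2);
    println!("ok");
}
```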

* }
* ```
*/
//!
Member


Great to make this a doc-comment!

But isn't a blank line here creating a newline in the rendered docs? Just curious.

Member Author


Didn't check yet, but I assumed cargo would be smart enough to trim empty lines at the beginning and the end.

if !response.status.is_success() {
    return Ok(0);
}
if cache.contains(response.uri.as_str()) {
Member


Nitpick: I'm a bit skeptical of the name cache here. We're not really caching, we're merely "checking if this URI was checked before, and should therefore be skipped".

Member Author


How about seen?
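A `seen` set would also let the contains-then-insert pair collapse into one call, since `HashSet::insert` already returns whether the value was new. A sketch with a hypothetical helper:

```rust
use std::collections::HashSet;

// Hypothetical helper: returns true if the URI has not been seen before
// (and records it), false if it was already checked and should be skipped.
// `HashSet::insert` returns false when the value was already present.
fn should_check(seen: &mut HashSet<String>, uri: &str) -> bool {
    seen.insert(uri.to_string())
}

fn main() {
    let mut seen = HashSet::new();
    assert!(should_check(&mut seen, "http://example.org"));  // first time: check it
    assert!(!should_check(&mut seen, "http://example.org")); // duplicate: skip
    println!("ok");
}
```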

if cache.contains(response.uri.as_str()) {
    return Ok(0);
}
cache.insert(response.uri.to_string());
Member


Since we're allowing infinite recursion, this data structure can really take up the entire memory given a sufficient number of links. Depending on the expected maximum number of links checked in recursive mode, we might consider a bloom filter here:

  • HashSet is probably fine if the number of links checked is in the order of millions or fewer
  • Bloom filter might be a good idea if the number of links checked is in the order of tens of millions or more

Super-rough calculations below. (Note that I'm just calculating the size of this one structure; there's probably going to be more memory overhead from various bits of the program.)

Assumptions:

  • 64 bit machine, e.g. 8 bytes per pointer
  • 64 bytes per link on average
  • 1 million links

Calculations:

So, a HashSet with 1 million unique links will take up roughly 64 + 24 + 1 = ~89 MB (~85 MiB): 64 MB of link text, 24 MB of `String` headers (pointer, length, capacity at 8 bytes each), plus about 1 MB of hash-table overhead.

For comparison, a Bloom filter with a ~1e-7 false-positive probability is expected to take up only a few megabytes (on the order of 4 MB for 1 million links).

So, to be honest, HashSet should be absolutely fine for most users. But we should make a remark in the README that memory usage in recursive mode with tens of millions of links might cause RAM usage issues.
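For a back-of-the-envelope check of the Bloom filter sizing, the standard formula for the optimal filter size is m = -n · ln(p) / (ln 2)² bits for n elements at false-positive probability p (this is just the textbook formula, not lychee code):

```rust
// Optimal Bloom filter size in bytes for `n` elements at
// false-positive probability `p`: m = -n * ln(p) / (ln 2)^2 bits.
fn bloom_filter_bytes(n: f64, p: f64) -> f64 {
    let m_bits = -n * p.ln() / (2f64.ln().powi(2));
    m_bits / 8.0
}

fn main() {
    // 1 million links at p = 1e-7: about 4.2 MB, versus ~89 MB for the HashSet.
    let bytes = bloom_filter_bytes(1_000_000.0, 1e-7);
    println!("{:.1} MB", bytes / 1e6);
}
```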


Hi, will this be merged soon? And will the Docker image be updated with the change, so the GitHub Action uses the latest version?

Member Author


@pawroman: Good idea with the bloom filter. I'll add that as soon as #208 is merged.
@lenisha: Hopefully yes. I don't like the current implementation anymore (see comment below), but I think it's not gonna be a blocker either way.

@mre mre mentioned this pull request Mar 24, 2021
@mre
Member Author

mre commented Apr 12, 2021

TBH I'm not super happy with the current implementation anymore, as I count the links in the queue and then close the channel after all links have been checked. I think it can lead to subtle bugs.
There must be a better way. Does anyone have an idea?

@lebensterben
Member

We can have a dedicated Collector type that collects links.

  • First a channel for Input is created. The Collector is the receiver end and we initially send some Input to the channel.
  • While Collector is still receiving new Input, it sends Request. (If the maximum recursion level is not reached, etc.)
  • While ClientPool is still receiving new Request, it spawns Clients to validate the Request.
  • Once Client finished validation, it sends the Response, which contains Input.
  • The Input in the Response is fed back to the channel for Collector to receive.
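The feedback loop above could be sketched synchronously with std's mpsc channel (all names illustrative, and the real design would be async with a ClientPool spawning Clients; here "checking" is just a stubbed extractor):

```rust
use std::collections::HashSet;
use std::sync::mpsc;

// Single-threaded sketch of the proposed feedback loop: a channel carries
// (uri, recursion_level) inputs; the "collector" receives them, checks each
// one, and feeds newly extracted links back into the same channel.
fn collect(
    roots: Vec<(String, usize)>,
    max_depth: usize,
    extract: impl Fn(&str) -> Vec<String>,
) -> Vec<String> {
    let (tx, rx) = mpsc::channel();
    for root in roots {
        tx.send(root).unwrap();
    }
    let mut seen = HashSet::new();
    let mut checked = Vec::new();
    // Because everything here is synchronous, an empty channel means we're done.
    while let Ok((uri, level)) = rx.try_recv() {
        if !seen.insert(uri.clone()) {
            continue;
        }
        checked.push(uri.clone());
        if level < max_depth {
            for link in extract(&uri) {
                tx.send((link, level + 1)).unwrap(); // feed back into the channel
            }
        }
    }
    checked
}

fn main() {
    let extract = |uri: &str| match uri {
        "root" => vec!["a".to_string(), "b".to_string()],
        "a" => vec!["root".to_string()],
        _ => vec![],
    };
    let checked = collect(vec![("root".to_string(), 0)], 2, extract);
    println!("{:?}", checked);
}
```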

let mut curr = 0;

while curr < total_requests {
    curr += 1;
Member Author


@lebensterben, I'm not too happy about the above three lines. Is there a better way to handle the request queue compared to counting the number of total requests like I did?


Member


The tokio-stream crate has a wrapper type for the Receiver end of the channel.
That should work without counting the number of requests.
It's like the Future trait: you can poll it.
It's also like the Iterator trait: next returns None when it's over.
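The same termination principle can be shown with std's synchronous channel: iterating a Receiver yields items until every Sender is dropped, with no request counting. ReceiverStream from tokio-stream is the async analogue, where next().await returns None once the channel is closed and drained.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel();
    let producer = thread::spawn(move || {
        for i in 0..3 {
            tx.send(i).unwrap();
        } // `tx` is dropped here, closing the channel
    });
    // Iterating a Receiver ends when the channel is closed and drained,
    // without tracking how many messages were sent.
    let received: Vec<i32> = rx.iter().collect();
    producer.join().unwrap();
    println!("{:?}", received);
}
```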

Member Author


Ah that's great. Sounds like what we need.

@mre
Member Author

mre commented Apr 13, 2021

Yup, that's correct.
More generally though, is there a better way to handle the request queue? (I added a comment to the code in question.)

@mre
Member Author

mre commented Sep 16, 2021

Will put this on hold once again as we started implementing a stream-based approach in #330, which might supersede this branch soon. Sorry to everyone waiting on recursion support to land, but I'd like to get this right instead of merging a buggy solution prematurely.

@mre mre added the blocked label Sep 16, 2021
@mre mre closed this Dec 1, 2021