
Recursion Support (closes #55, #78) #165

Closed
wants to merge 27 commits into from

Conversation

mre
Member

@mre mre commented Feb 28, 2021

⚠️ This is on hold as we started implementing a stream-based approach in #330, which might supersede this branch soon. Recursion support will be added once this gets merged. We can probably rebase it on top of master then and make adjustments in this PR (or create new PR). Closing this for now.


This is a super basic implementation of recursion as discussed in #78.
It doesn't use an extractor pool as discussed in #55. I think it's not needed at this point.
The current implementation just checks the responses, filters out websites, and pushes them back into the request queue.

I deliberately avoided adding recursion support to the client, as it would break separation of concerns. I also didn't want to add a separate recursion handler to the library, as I wouldn't expect any gains from that. Recursion handling is surprisingly tricky, because there are quite a few design decisions to make. (Recurse indefinitely or stop at a predefined recursion depth? Recurse on all domains or just the set of input domains?)
So most people using the library would have to make these decisions based on their use case. At least the current recursion depth is part of the request/response struct, so it won't have to be wrapped in another struct.
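As a rough sketch of the approach described above (all names here are hypothetical and simplified; the real lychee types differ), checking a request, extracting links, and pushing them back into the queue with an incremented recursion level might look like:

```rust
use std::collections::{HashSet, VecDeque};

// Hypothetical, stripped-down request type; lychee's real struct has more fields.
struct Request {
    uri: String,
    recursion_level: usize,
}

/// Pop a request, record it as checked, and push extracted links back
/// into the queue with an incremented recursion level.
fn check_recursively(
    roots: Vec<String>,
    max_depth: usize,
    extract: impl Fn(&str) -> Vec<String>,
) -> Vec<String> {
    let mut queue: VecDeque<Request> = roots
        .into_iter()
        .map(|uri| Request { uri, recursion_level: 0 })
        .collect();
    let mut seen = HashSet::new();
    let mut checked = Vec::new();

    while let Some(req) = queue.pop_front() {
        if !seen.insert(req.uri.clone()) {
            continue; // already checked this URI
        }
        checked.push(req.uri.clone());
        if req.recursion_level >= max_depth {
            continue; // stop descending at the depth limit
        }
        for link in extract(&req.uri) {
            queue.push_back(Request {
                uri: link,
                recursion_level: req.recursion_level + 1,
            });
        }
    }
    checked
}

fn main() {
    // Toy "extractor": page "a" links to "b" and "c"; "b" links back to "a".
    let extract = |uri: &str| match uri {
        "a" => vec!["b".to_string(), "c".to_string()],
        "b" => vec!["a".to_string()],
        _ => vec![],
    };
    let checked = check_recursively(vec!["a".to_string()], 2, extract);
    println!("{:?}", checked); // "a" is not re-checked on the cycle back
}
```

The `seen` set is what keeps the cycle `a -> b -> a` from recursing forever, which relates to the "find out why recursion does not terminate" TODO below.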

TODO:

  • Cache requests
  • Filter out domains that were not in the set of inputs
  • Check recursion level
  • Find out why recursion does not terminate
  • Cleanup
  • Integration test?

@joesan
Contributor

joesan commented Mar 1, 2021

Could you let me know what this task is about? The title does give me a hint, but I fail to understand the context.

@mre mre changed the title Recursion Support Recursion Support (closes #78) Mar 1, 2021
@mre
Member Author

mre commented Mar 1, 2021

Sorry, forgot to link the issue. Here it is: #78.

@mre mre requested a review from pawroman March 2, 2021 12:26
@mre
Member Author

mre commented Mar 2, 2021

From my side this should be good to go.
@pawroman, @joesan any final comments before merging this?

@mre mre changed the title Recursion Support (closes #78) Recursion Support (closes #55, #78) Mar 2, 2021
Member

@pawroman pawroman left a comment


Nice work! Great to see this implemented.

I have left a few code comments, and one general comment: I think we should add at least one simple test for the recursive link checking.

@@ -174,6 +175,8 @@ OPTIONS:
-b, --base-url <base-url> Base URL to check relative URLs
--basic-auth <basic-auth> Basic authentication support. E.g. `username:password`
-c, --config <config-file> Configuration file to use [default: ./lychee.toml]
--depth <depth> Stop link checking beyond this maximum recursion depth. (Recommended for
large inputs.)
Member


I believe we should mention that depth less than zero also means infinite.


/// Stop link checking beyond this maximum recursion depth. (Recommended for large inputs.)
#[structopt(long)]
pub depth: Option<usize>,
Member


I'd consider allowing this to be signed (e.g. isize), so that users could explicitly express the need for infinite recursion with numbers less than zero (e.g. -1).
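A minimal sketch of that suggestion (illustrative names only, not lychee's actual CLI handling): with a signed depth, any negative value expresses "no limit".

```rust
// Hypothetical helper: decide whether a link at `level` is still within
// the configured maximum depth. A negative depth (e.g. -1) means infinite.
fn within_depth(max_depth: isize, level: usize) -> bool {
    if max_depth < 0 {
        return true; // negative depth: recurse indefinitely
    }
    level <= max_depth as usize
}

fn main() {
    assert!(within_depth(-1, 1_000_000)); // -1 => no limit
    assert!(within_depth(3, 3));          // at the limit: still checked
    assert!(!within_depth(3, 4));         // beyond the limit: skipped
    println!("ok");
}
```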

    },
    Response {
        uri: website("http://example.org/redirect"),
        status: Status::Redirected(http::StatusCode::PERMANENT_REDIRECT),
        source: Input::Stdin,
        recursion_level: 0,
Member


For convenience and to cover the "most common case", it might be worth considering adding an alternative constructor for Response which would set recursion_level to 0.

Or even have ::new constructor which would have this behavior (recursion = 0) and another one named something like ::with_recursion which would allow setting recursion level.
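A sketch of the suggested constructors on a stripped-down Response (illustrative only; the real struct carries uri, status, and source as well):

```rust
// Hypothetical, minimal Response to illustrate the constructor pair.
#[derive(Debug, PartialEq)]
struct Response {
    uri: String,
    recursion_level: usize,
}

impl Response {
    /// Covers the most common case: a top-level response at recursion level 0.
    fn new(uri: String) -> Self {
        Self::with_recursion(uri, 0)
    }

    /// Explicitly set the recursion level.
    fn with_recursion(uri: String, recursion_level: usize) -> Self {
        Response { uri, recursion_level }
    }
}

fn main() {
    let top = Response::new("http://example.org".to_string());
    assert_eq!(top.recursion_level, 0);
    let nested = Response::with_recursion("http://example.org/a".to_string(), 2);
    assert_eq!(nested.recursion_level, 2);
    println!("ok");
}
```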

* }
* ```
*/
//!
Member


Great to make this a doc-comment!

But isn't a blank line here creating a newline in the rendered docs? Just curious.

Member Author


Didn't check yet, but I assumed cargo would be smart enough to trim empty lines at the beginning and the end.

if !response.status.is_success() {
    return Ok(0);
}
if cache.contains(response.uri.as_str()) {
Member


Nitpick: I'm a bit skeptical of the name cache here. We're not really caching, we're merely "checking if this URI was checked before, and should therefore be skipped".

Member Author


How about seen?
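A `seen` set would also let the contains-then-insert pair collapse into one call, since `HashSet::insert` already returns whether the value was new. A sketch with a hypothetical helper:

```rust
use std::collections::HashSet;

// Hypothetical helper: returns true if the URI has not been seen before
// (and records it), false if it was already checked and should be skipped.
// `HashSet::insert` returns false when the value was already present.
fn should_check(seen: &mut HashSet<String>, uri: &str) -> bool {
    seen.insert(uri.to_string())
}

fn main() {
    let mut seen = HashSet::new();
    assert!(should_check(&mut seen, "http://example.org"));  // first time: check it
    assert!(!should_check(&mut seen, "http://example.org")); // duplicate: skip
    println!("ok");
}
```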

if cache.contains(response.uri.as_str()) {
    return Ok(0);
}
cache.insert(response.uri.to_string());
Member


Since we're allowing infinite recursion, this data structure can really take up the entire memory given a sufficient number of links. Depending on the expected maximum number of links checked in recursive mode, we might consider a bloom filter here:

  • HashSet is probably fine if the number of links checked is in the order of millions or fewer
  • Bloom filter might be a good idea if the number of links checked is in the order of tens of millions or more

Super-rough calculations below. (Note that I'm just calculating the size of this one structure; there's probably going to be more memory overhead from various bits of the program.)

Assumptions:

  • 64 bit machine, e.g. 8 bytes per pointer
  • 64 bytes per link on average
  • 1 million links

Calculations:

So, a HashSet with 1 million unique links will take up roughly 64 + 24 + 1 = ~89 MB (~85 MiB): 64 MB of link text, 24 MB of `String` headers (pointer, length, capacity at 8 bytes each), plus about 1 MB of hash-table overhead.

For comparison, a Bloom filter with a ~1e-7 false-positive probability is expected to take up only a few megabytes (on the order of 4 MB for 1 million links).

So, to be honest, HashSet should be absolutely fine for most users. But we should make a remark in the README that memory usage in recursive mode with tens of millions of links might cause RAM usage issues.
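For a back-of-the-envelope check of the Bloom filter sizing, the standard formula for the optimal filter size is m = -n · ln(p) / (ln 2)² bits for n elements at false-positive probability p (this is just the textbook formula, not lychee code):

```rust
// Optimal Bloom filter size in bytes for `n` elements at
// false-positive probability `p`: m = -n * ln(p) / (ln 2)^2 bits.
fn bloom_filter_bytes(n: f64, p: f64) -> f64 {
    let m_bits = -n * p.ln() / (2f64.ln().powi(2));
    m_bits / 8.0
}

fn main() {
    // 1 million links at p = 1e-7: about 4.2 MB, versus ~89 MB for the HashSet.
    let bytes = bloom_filter_bytes(1_000_000.0, 1e-7);
    println!("{:.1} MB", bytes / 1e6);
}
```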


Hi, will this be merged soon? And will the Docker image be updated with the change, so the GitHub Action uses the latest version?

Member Author


@pawroman: Good idea with the bloom filter. I'll add that as soon as #208 is merged.
@lenisha: Hopefully yes. I don't like the current implementation anymore (see comment below), but I think it's not gonna be a blocker either way.

@mre mre mentioned this pull request Mar 24, 2021
@mre
Member Author

mre commented Apr 12, 2021

TBH I'm not super happy with the current implementation anymore, as I count the links in the queue and then close the channel after all links have been checked. I think it can lead to subtle bugs.
There must be a better way. Does anyone have an idea?

@lebensterben
Member

We can have a dedicated Collector type that collects links.

  • First a channel for Input is created. The Collector is the receiver end and we initially send some Input to the channel.
  • While Collector is still receiving new Input, it sends Request. (If the maximum recursion level is not reached, etc.)
  • While ClientPool is still receiving new Request, it spawns Clients to validate the Request.
  • Once Client finished validation, it sends the Response, which contains Input.
  • The Input in the Response is fed back to the channel for Collector to receive.
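The feedback loop above could be sketched synchronously with std's mpsc channel (all names illustrative, and the real design would be async with a ClientPool spawning Clients; here "checking" is just a stubbed extractor):

```rust
use std::collections::HashSet;
use std::sync::mpsc;

// Single-threaded sketch of the proposed feedback loop: a channel carries
// (uri, recursion_level) inputs; the "collector" receives them, checks each
// one, and feeds newly extracted links back into the same channel.
fn collect(
    roots: Vec<(String, usize)>,
    max_depth: usize,
    extract: impl Fn(&str) -> Vec<String>,
) -> Vec<String> {
    let (tx, rx) = mpsc::channel();
    for root in roots {
        tx.send(root).unwrap();
    }
    let mut seen = HashSet::new();
    let mut checked = Vec::new();
    // Because everything here is synchronous, an empty channel means we're done.
    while let Ok((uri, level)) = rx.try_recv() {
        if !seen.insert(uri.clone()) {
            continue;
        }
        checked.push(uri.clone());
        if level < max_depth {
            for link in extract(&uri) {
                tx.send((link, level + 1)).unwrap(); // feed back into the channel
            }
        }
    }
    checked
}

fn main() {
    let extract = |uri: &str| match uri {
        "root" => vec!["a".to_string(), "b".to_string()],
        "a" => vec!["root".to_string()],
        _ => vec![],
    };
    let checked = collect(vec![("root".to_string(), 0)], 2, extract);
    println!("{:?}", checked);
}
```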

let mut curr = 0;

while curr < total_requests {
    curr += 1;
Member Author


@lebensterben, I'm not too happy about the above three lines. Is there a better way to handle the request queue compared to counting the number of total requests like I did?


Member


The tokio-stream crate has a wrapper type for the Receiver end of the channel.
That should work without counting the number of requests.
It's like the Future trait: you can poll it.
It's also like the Iterator trait: next returns None when it's over.
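The same termination principle can be shown with std's synchronous channel: iterating a Receiver yields items until every Sender is dropped, with no request counting. ReceiverStream from tokio-stream is the async analogue, where next().await returns None once the channel is closed and drained.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel();
    let producer = thread::spawn(move || {
        for i in 0..3 {
            tx.send(i).unwrap();
        } // `tx` is dropped here, closing the channel
    });
    // Iterating a Receiver ends when the channel is closed and drained,
    // without tracking how many messages were sent.
    let received: Vec<i32> = rx.iter().collect();
    producer.join().unwrap();
    println!("{:?}", received);
}
```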

Member Author


Ah that's great. Sounds like what we need.

@mre
Member Author

mre commented Apr 13, 2021

Yup, that's correct.
More generally though, is there a better way to handle the request queue? (I added a comment to the code in question.)

@mre
Member Author

mre commented Sep 16, 2021

Will put this on hold once again as we started implementing a stream-based approach in #330, which might supersede this branch soon. Sorry to everyone waiting on recursion support to land, but I'd like to get this right instead of merging a buggy solution prematurely.

@mre mre added the blocked label Sep 16, 2021
@mre mre closed this Dec 1, 2021