
Introduce extractor pool #55

Closed
mre opened this issue Dec 1, 2020 · 6 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@mre
Member

mre commented Dec 1, 2020

Introduction

As of now we send each URI we'd like to check to a client pool in main.rs. Code is here:

lychee/src/main.rs

Lines 128 to 135 in d2e349c

tokio::spawn(async move {
    for link in links {
        if let Some(pb) = &bar {
            pb.set_message(&link.to_string());
        };
        send_req.send(link).await.unwrap();
    }
});

This is not ideal for a few reasons:

  • All links get extracted on startup. This is a slow process that can take up to a few seconds for long link lists.
    It's not necessary to block the client during this step, though, as we could lazy-load the links on demand from the inputs.
  • There is no clear separation of concerns between main and the link extraction. Ideally the responsibilities should be split up to make testing and refactoring easier.

We already use a channel for sending the links to check to the client pool. We could use the same abstraction for extracting the links, too, in the form of an extractor pool.
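A minimal sketch of that abstraction (the Input, Uri, and extract_links names here are placeholders for illustration, not existing lychee types): the extractor pool reads inputs from one channel and streams every extracted link into the channel the client pool already consumes.

use tokio::sync::mpsc;

// Placeholder types: an input to read (file, URL, stdin, ...) and a link to check.
struct Input(String);
struct Uri(String);

// Placeholder for the actual extraction logic (Markdown/HTML parsing).
async fn extract_links(input: &Input) -> Vec<Uri> {
    vec![Uri(format!("https://example.com/{}", input.0))]
}

// The extractor pool: consume inputs from one channel and stream each
// extracted link into the channel that the client pool already reads from.
async fn extractor_pool(mut inputs: mpsc::Receiver<Input>, links: mpsc::Sender<Uri>) {
    while let Some(input) = inputs.recv().await {
        for link in extract_links(&input).await {
            // If the client pool has shut down, stop extracting.
            if links.send(link).await.is_err() {
                return;
            }
        }
    }
}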

In the future this would allow implementing some advanced features in an extensible way:

  • Recursively check links: Push newly discovered websites into the input channel of the extractor pool
  • Skip duplicate URLs: Filter input links with a HashSet or even a Bloom filter (for constant memory usage) that is maintained by the extractor pool before sending them to the client pool (see the sketch after this list).
  • Request throttling: Group requests per website and apply some throttling so as not to overload individual servers.
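A rough sketch of the duplicate-skipping idea from the list above, assuming the same channel setup; a real implementation could swap the HashSet for a Bloom filter to bound memory usage.

use std::collections::HashSet;
use tokio::sync::mpsc;

// Forward only URIs that haven't been seen before to the client pool.
async fn forward_unique(mut extracted: mpsc::Receiver<String>, to_client_pool: mpsc::Sender<String>) {
    let mut seen: HashSet<String> = HashSet::new();
    while let Some(uri) = extracted.recv().await {
        // `insert` returns false if the URI was already in the set.
        if seen.insert(uri.clone()) {
            if to_client_pool.send(uri).await.is_err() {
                return; // client pool has shut down
            }
        }
    }
}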

How to contribute

  1. Create an extractor pool similar to our client pool
  2. Spawn the pool inside main on startup, pass the channel to the pool and start processing the inputs.

(The other end of the channel is already passed to the client pool.)
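A rough sketch of how the two steps could fit together, using simplified String placeholders and made-up pool functions; the real wiring in main.rs will look different.

use tokio::sync::mpsc;

// Simplified stand-in for the existing client pool: check links as they arrive.
async fn client_pool(mut links: mpsc::Receiver<String>) {
    while let Some(link) = links.recv().await {
        println!("checking {link}"); // real code would issue the request here
    }
}

// Simplified stand-in for the extractor pool from step 1.
async fn extractor_pool(mut inputs: mpsc::Receiver<String>, links: mpsc::Sender<String>) {
    while let Some(input) = inputs.recv().await {
        // Real code would parse the input and extract all links from it.
        let _ = links.send(format!("https://example.com/{input}")).await;
    }
}

#[tokio::main]
async fn main() {
    let (input_tx, input_rx) = mpsc::channel(32); // inputs -> extractor pool
    let (link_tx, link_rx) = mpsc::channel(32);   // extractor pool -> client pool

    // Spawn both pools on startup (step 2); they process items as they arrive.
    let extractor = tokio::spawn(extractor_pool(input_rx, link_tx));
    let clients = tokio::spawn(client_pool(link_rx));

    // Feed the inputs (files, URLs, stdin, ...) into the extractor pool.
    for input in ["README.md", "docs/index.html"] {
        input_tx.send(input.to_string()).await.unwrap();
    }
    drop(input_tx); // closing this channel lets both pools finish

    extractor.await.unwrap();
    clients.await.unwrap();
}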

@mre mre added enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed labels Dec 1, 2020
@mre
Member Author

mre commented Dec 1, 2020

Update: @pawroman is working on this as part of #35.

@pawroman
Member

pawroman commented Dec 1, 2020

Update: @pawroman is working on this as part of #35.

It's a first step in this, but doesn't cover the entire scope.

@mre
Member Author

mre commented Dec 1, 2020

Quoting your comment from the PR so we don't forget:

The next step (some time in the future) could be to dispatch each link for checking as soon as it's extracted, making the program much faster for large amounts of inputs and links

@TimoFreiberg
Contributor

I'm currently taking a look at this, any ideas on how to best handle this spot?

let bar = ProgressBar::new(links.len() as u64)

Right now it seems to me like initializing the bar length before progress starts is impossible if progress is supposed to start with the first Request :/

@pawroman
Member

@TimoFreiberg nice, thanks for looking into this!

I think that at first the progressbar could be just an "unbounded spinner" indicating link extraction (as per indicatif documentation) and could then be converted into a "progress bar" once some links were buffered up to be checked.

Once the progress bar is rendered, it seems to be possible to set its length using set_length or even perhaps more conveniently: inc_length. I think that the length should converge quite quickly to the desired length. Perhaps the fact that "it's still growing" could be indicated somehow.

I know it's a lot of hand-waving, but I hope it makes sense 😄

Just one problem I might anticipate with this is that the number of calls to update the progress bar (which is internally held behind an Arc<RwLock>) might slow down the main thread a bit. But we should only worry about that once we have a working prototype.
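A rough sketch of the spinner-to-bar idea with the indicatif crate (exact API details vary between indicatif versions, and the batch sizes here are made up): start with an unbounded spinner, switch to a progress bar once the first links arrive, and grow its length with inc_length as extraction continues.

use indicatif::{ProgressBar, ProgressStyle};

fn main() {
    // Start as an unbounded spinner while links are still being extracted.
    let bar = ProgressBar::new_spinner();
    bar.set_message("Extracting links...");

    let mut total: u64 = 0;
    for batch in [3u64, 5, 2] {
        // Pretend `batch` more links were just extracted from some input.
        if total == 0 {
            // First links arrived: switch the spinner into a progress bar.
            bar.set_style(ProgressStyle::default_bar());
            bar.set_length(batch);
            bar.set_message("Checking links...");
        } else {
            // The total is still growing; extend the bar's length on the fly.
            bar.inc_length(batch);
        }
        total += batch;
    }

    // The client pool would advance the bar as each link check completes.
    for _ in 0..total {
        bar.inc(1);
    }
    bar.finish_with_message("Done");
}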

@mre
Member Author

mre commented Dec 3, 2021

This can be closed now that #330 has been merged. Ongoing work is happening in #414.
The progress bar was fixed by @TimoFreiberg. Thanks for that.

@mre mre closed this as completed Dec 3, 2021