
Introduce extractor pool #55

Closed
mre opened this issue Dec 1, 2020 · 6 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@mre
Member

mre commented Dec 1, 2020

Introduction

As of now we send each URI we'd like to check to a client pool in main.rs. Code is here:

lychee/src/main.rs

Lines 128 to 135 in d2e349c

tokio::spawn(async move {
    for link in links {
        if let Some(pb) = &bar {
            pb.set_message(&link.to_string());
        };
        send_req.send(link).await.unwrap();
    }
});

This is not ideal for a few reasons:

  • All links get extracted on startup. This is a slow process that can take up to a few seconds for long link lists.
    It's not necessary to block the client during this step, though, as we could lazy-load the links on demand from the inputs.
  • There is no clear separation of concerns between main and the link extraction. Ideally the responsibilities should be split up to make testing and refactoring easier.

We already use a channel for sending the links to check to the client pool. We could use the same abstraction for extracting the links, too, in the form of an extractor pool.
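A minimal sketch of that abstraction (the Input, Uri, and extract_links names here are placeholders for illustration, not existing lychee types): the extractor pool reads inputs from one channel and streams every extracted link into the channel the client pool already consumes.

use tokio::sync::mpsc;

// Placeholder types: an input to read (file, URL, stdin, ...) and a link to check.
struct Input(String);
struct Uri(String);

// Placeholder for the actual extraction logic (Markdown/HTML parsing).
async fn extract_links(input: &Input) -> Vec<Uri> {
    vec![Uri(format!("https://example.com/{}", input.0))]
}

// The extractor pool: consume inputs from one channel and stream each
// extracted link into the channel that the client pool already reads from.
async fn extractor_pool(mut inputs: mpsc::Receiver<Input>, links: mpsc::Sender<Uri>) {
    while let Some(input) = inputs.recv().await {
        for link in extract_links(&input).await {
            // If the client pool has shut down, stop extracting.
            if links.send(link).await.is_err() {
                return;
            }
        }
    }
}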

In the future this would allow implementing some advanced features in an extensible way:

  • Recursively check links: Push newly discovered websites into the input channel of the extractor pool
  • Skip duplicate URLs: Filter input links with a HashSet or even a Bloom filter (for constant memory usage) that is maintained by the extractor pool before sending them to the client pool (see the sketch after this list).
  • Request throttling: Group requests per website and apply some throttling so as not to overload individual servers.
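A rough sketch of the duplicate-skipping idea from the list above, assuming the same channel setup; a real implementation could swap the HashSet for a Bloom filter to bound memory usage.

use std::collections::HashSet;
use tokio::sync::mpsc;

// Forward only URIs that haven't been seen before to the client pool.
async fn forward_unique(mut extracted: mpsc::Receiver<String>, to_client_pool: mpsc::Sender<String>) {
    let mut seen: HashSet<String> = HashSet::new();
    while let Some(uri) = extracted.recv().await {
        // `insert` returns false if the URI was already in the set.
        if seen.insert(uri.clone()) {
            if to_client_pool.send(uri).await.is_err() {
                return; // client pool has shut down
            }
        }
    }
}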

How to contribute

  1. Create an extractor pool similar to our client pool
  2. Spawn the pool inside main on startup, pass the channel to the pool and start processing the inputs.

(The other end of the channel is already passed to the client pool.)
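A rough sketch of how the two steps could fit together, using simplified String placeholders and made-up pool functions; the real wiring in main.rs will look different.

use tokio::sync::mpsc;

// Simplified stand-in for the existing client pool: check links as they arrive.
async fn client_pool(mut links: mpsc::Receiver<String>) {
    while let Some(link) = links.recv().await {
        println!("checking {link}"); // real code would issue the request here
    }
}

// Simplified stand-in for the extractor pool from step 1.
async fn extractor_pool(mut inputs: mpsc::Receiver<String>, links: mpsc::Sender<String>) {
    while let Some(input) = inputs.recv().await {
        // Real code would parse the input and extract all links from it.
        let _ = links.send(format!("https://example.com/{input}")).await;
    }
}

#[tokio::main]
async fn main() {
    let (input_tx, input_rx) = mpsc::channel(32); // inputs -> extractor pool
    let (link_tx, link_rx) = mpsc::channel(32);   // extractor pool -> client pool

    // Spawn both pools on startup (step 2); they process items as they arrive.
    let extractor = tokio::spawn(extractor_pool(input_rx, link_tx));
    let clients = tokio::spawn(client_pool(link_rx));

    // Feed the inputs (files, URLs, stdin, ...) into the extractor pool.
    for input in ["README.md", "docs/index.html"] {
        input_tx.send(input.to_string()).await.unwrap();
    }
    drop(input_tx); // closing this channel lets both pools finish

    extractor.await.unwrap();
    clients.await.unwrap();
}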

@mre mre added enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed labels Dec 1, 2020
@mre
Member Author

mre commented Dec 1, 2020

Update: @pawroman is working on this as part of #35.

@pawroman
Member

pawroman commented Dec 1, 2020

Update: @pawroman is working on this as part of #35.

It's a first step in this, but doesn't cover the entire scope.

@mre
Member Author

mre commented Dec 1, 2020

Quoting your comment from the PR so we don't forget:

The next step (some time in the future) could be to dispatch each link for checking as soon as it's extracted, making the program much faster for large amounts of inputs and links

@TimoFreiberg
Contributor

I'm currently taking a look at this, any ideas on how to best handle this spot?

let bar = ProgressBar::new(links.len() as u64)

Right now it seems to me like initializing the bar length before progress starts is impossible if progress is supposed to start with the first Request :/

@pawroman
Member

@TimoFreiberg nice, thanks for looking into this!

I think that at first the progressbar could be just an "unbounded spinner" indicating link extraction (as per indicatif documentation) and could then be converted into a "progress bar" once some links were buffered up to be checked.

Once the progress bar is rendered, it seems to be possible to set its length using set_length or even perhaps more conveniently: inc_length. I think that the length should converge quite quickly to the desired length. Perhaps the fact that "it's still growing" could be indicated somehow.

I know it's a lot of hand-waving, but I hope it makes sense 😄

Just one problem I might anticipate with this is that the number of calls to update the progress bar (which is internally held behind an Arc<RwLock>) might slow down the main thread a bit. But we should only worry about that once we have a working prototype.
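A rough sketch of the spinner-to-bar idea with the indicatif crate (exact API details vary between indicatif versions, and the batch sizes here are made up): start with an unbounded spinner, switch to a progress bar once the first links arrive, and grow its length with inc_length as extraction continues.

use indicatif::{ProgressBar, ProgressStyle};

fn main() {
    // Start as an unbounded spinner while links are still being extracted.
    let bar = ProgressBar::new_spinner();
    bar.set_message("Extracting links...");

    let mut total: u64 = 0;
    for batch in [3u64, 5, 2] {
        // Pretend `batch` more links were just extracted from some input.
        if total == 0 {
            // First links arrived: switch the spinner into a progress bar.
            bar.set_style(ProgressStyle::default_bar());
            bar.set_length(batch);
            bar.set_message("Checking links...");
        } else {
            // The total is still growing; extend the bar's length on the fly.
            bar.inc_length(batch);
        }
        total += batch;
    }

    // The client pool would advance the bar as each link check completes.
    for _ in 0..total {
        bar.inc(1);
    }
    bar.finish_with_message("Done");
}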

@mre
Member Author

mre commented Dec 3, 2021

This can be closed now that #330 has been merged. Ongoing work is happening in #414.
The progress bar was fixed by @TimoFreiberg. Thanks for that.

@mre mre closed this as completed Dec 3, 2021