
Filter links before downloading / adding to the queue #175

Merged · 9 commits into Skallwar:master · Apr 24, 2022

Conversation

@raphCode (Contributor)

This commit speeds up scraping in scenarios where pages have a high branch factor, that is, many links, the majority of which are excluded by the --exclude / --include rules.
It also improves memory usage in these scenarios, since excluded links are never stored.
Network traffic is reduced as well, by not downloading these links in the first place.

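The core idea as a minimal sketch, assuming regex-set based --include / --exclude rules (the names here are illustrative, not suckit's actual API): links failing the rules are dropped before they ever reach the queue, so they cost neither memory nor traffic.

```rust
use regex::RegexSet;

// Keep a link only if it matches an --include rule (when any are given)
// and matches no --exclude rule. Hypothetical helper, for illustration.
fn should_enqueue(url: &str, include: &RegexSet, exclude: &RegexSet) -> bool {
    let included = include.is_empty() || include.is_match(url);
    included && !exclude.is_match(url)
}

// Filter at enqueue time instead of download time: excluded links are
// dropped here, so they take no queue memory and cause no traffic later.
fn enqueue_links(links: &[String], include: &RegexSet, exclude: &RegexSet, queue: &mut Vec<String>) {
    for link in links {
        if should_enqueue(link, include, exclude) {
            queue.push(link.clone());
        }
    }
}
```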
codecov bot commented Mar 10, 2022

Codecov Report

Merging #175 (a14abdc) into master (84276b9) will decrease coverage by 0.27%.
The diff coverage is 45.45%.

❗ Current head a14abdc differs from pull request most recent head 0c0bc83. Consider uploading reports for the commit 0c0bc83 to get more accurate results


```diff
@@            Coverage Diff             @@
##           master     #175      +/-   ##
==========================================
- Coverage   64.73%   64.45%   -0.28%
==========================================
  Files          17       17
  Lines         621      633      +12
==========================================
+ Hits          402      408       +6
- Misses        219      225       +6
```

| Impacted Files | Coverage Δ |
| --- | --- |
| src/args.rs | 0.00% <ø> (ø) |
| src/scraper.rs | 24.74% <45.45%> (+1.66%) ⬆️ |

This allows fine-grained control over whether a page is visited (that is, its links analyzed) and whether it is saved to disk.
Decoupling the download filter from the visit filter means the complete website can still be explored while only some files are downloaded. To speed up scraping, irrelevant links can simply be excluded from visiting.
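A sketch of how the decoupled checks might fit together (hypothetical names, stubbed I/O; the real logic lives in src/scraper.rs): the visit filter gates fetching and link extraction, while the download filter independently gates writing to disk.

```rust
use regex::RegexSet;

// Hypothetical filter pair, one instance for the visit rules and one for
// the download rules.
struct Filter {
    include: RegexSet,
    exclude: RegexSet,
}

impl Filter {
    fn accepts(&self, url: &str) -> bool {
        let included = self.include.is_empty() || self.include.is_match(url);
        included && !self.exclude.is_match(url)
    }
}

// Stubs so the sketch is self-contained; the real scraper does HTTP and I/O.
fn fetch(_url: &str) -> String { String::new() }
fn extract_links(_body: &str) -> Vec<String> { Vec::new() }
fn save_to_disk(_url: &str, _body: &str) {}

// Only visited pages are fetched and have their links followed; of those,
// only pages passing the download filter are also written to disk.
fn process(url: &str, visit: &Filter, download: &Filter, queue: &mut Vec<String>) {
    if !visit.accepts(url) {
        return; // never fetched: no traffic, no memory spent on its links
    }
    let body = fetch(url);
    queue.extend(extract_links(&body));
    if download.accepts(url) {
        save_to_disk(url, &body);
    }
}
```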

raphCode commented Mar 15, 2022

I thought about it a bit more, and what we want is separate regex filters for downloading and visiting:

  • specify what to download while still exploring and finding all links
  • exclude any irrelevant links (faster crawling, smaller memory footprint, etc.)

To put it in other words:

  • just using the proposed --in/exclude-download arguments reproduces the traditional behavior from before this PR: visit all pages, download whatever matches.
  • using --in/exclude-download with --visit-filter-is-download-filter behaves like the initial commit of this PR: pages that won't be downloaded are not visited.
  • anything in between can be specified by combining --in/exclude-download with --in/exclude-visit; of course, only visited pages can be downloaded. (See the example invocation below.)
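For instance, combining the two filter families might look like this (flag names as proposed in this PR; the merged CLI may spell them differently):

```sh
# Explore the whole site to discover links, but only save PDF files,
# and never descend into the irrelevant, huge /forum/ subtree.
suckit https://example.org \
    --include-download '\.pdf$' \
    --exclude-visit '/forum/'
```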

If this is merged, I advise using a squash commit to hide the intermediate development.

@Skallwar (Owner) left a comment


Thank you very much. This is a good idea. Please add some tests for this and we will merge it.

@raphCode (Contributor, Author)

I always fail to think of useful tests.
How would you check that some links are not visited?

I could just duplicate the existing tests in filters.rs and make them use the visit filters instead of the download filters, relying on the side effect that unvisited links are not downloaded either.
But I think this does not capture the distinction intended by those options.

@Skallwar (Owner)

> How would you check that some links are not visited?

You could put a valid link (one that should be downloaded) into a page that should not be visited, and then check whether that link was downloaded.
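A self-contained sketch of that test idea, using an in-memory site and a substring filter in place of suckit's real test server and regex filters (all names here are illustrative):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Toy crawler: a URL is "downloaded" only if it is visited, and links on
// unvisited pages are never discovered. `visit_exclude` is a plain
// substring here; the real filters are regexes.
fn crawl(site: &HashMap<&str, Vec<&str>>, start: &str, visit_exclude: &str) -> HashSet<String> {
    let mut downloaded = HashSet::new();
    let mut queue = VecDeque::from([start.to_string()]);
    while let Some(url) = queue.pop_front() {
        if url.contains(visit_exclude) || !downloaded.insert(url.clone()) {
            continue; // excluded from visiting, or already processed
        }
        for link in site.get(url.as_str()).into_iter().flatten() {
            queue.push_back(link.to_string());
        }
    }
    downloaded
}

#[test]
fn excluded_page_links_are_not_downloaded() {
    // /hidden/ is excluded from visiting, but it links to a page that the
    // download filter would otherwise accept.
    let site = HashMap::from([
        ("/", vec!["/hidden/"]),
        ("/hidden/", vec!["/hidden/wanted.html"]),
        ("/hidden/wanted.html", vec![]),
    ]);
    let downloaded = crawl(&site, "/", "/hidden/");
    assert!(downloaded.contains("/"));
    assert!(!downloaded.contains("/hidden/wanted.html"));
}
```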

Commit: clean up the test output directory even when a test fails; otherwise a failed test leaves the directory populated, which fails future test runs.
@raphCode (Contributor, Author)

Basically, I just duplicated the existing tests and adjusted them for the visit filter regexes.

@telugu-boy

Any update on this? It's almost a bug without this feature at this point; --exclude and --include are not doing what they are supposed to 😢

@Skallwar (Owner)

Sorry for the delay, I will review this tomorrow

@CohenArthur (Collaborator) left a comment


Sorry for the delay! Looks good to me!

@Skallwar (Owner) left a comment


Thank you very much! Sorry once again for the delay.

@Skallwar (Owner)

Hmm, the CI is not running.

@Skallwar (Owner)

We are in this case... https://github.community/t/missing-approve-and-run-button/200572
I will merge this in the meantime; hopefully GitHub will fix this.

@Skallwar merged commit 97e3a16 into Skallwar:master on Apr 24, 2022
@Skallwar (Owner)

The CI works perfectly on master, all good 🚀 🔥
