
Filter links before downloading / adding to the queue #175

Merged · 9 commits into Skallwar:master · Apr 24, 2022

Conversation

@raphCode (Contributor)

This commit speeds up scraping in scenarios where pages have a high branch factor, that is, many links, the majority of which are excluded by the --exclude / --include rules.
It also improves memory usage in these scenarios, since excluded links are never stored.
Network traffic is reduced as well, by not downloading these links in the first place.

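The core idea as a minimal sketch, assuming regex-set based --include / --exclude rules (the names here are illustrative, not suckit's actual API): links failing the rules are dropped before they ever reach the queue, so they cost neither memory nor traffic.

```rust
use regex::RegexSet;

// Keep a link only if it matches an --include rule (when any are given)
// and matches no --exclude rule. Hypothetical helper, for illustration.
fn should_enqueue(url: &str, include: &RegexSet, exclude: &RegexSet) -> bool {
    let included = include.is_empty() || include.is_match(url);
    included && !exclude.is_match(url)
}

// Filter at enqueue time instead of download time: excluded links are
// dropped here, so they take no queue memory and cause no traffic later.
fn enqueue_links(links: &[String], include: &RegexSet, exclude: &RegexSet, queue: &mut Vec<String>) {
    for link in links {
        if should_enqueue(link, include, exclude) {
            queue.push(link.clone());
        }
    }
}
```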
codecov bot commented Mar 10, 2022

Codecov Report

Merging #175 (a14abdc) into master (84276b9) will decrease coverage by 0.27%.
The diff coverage is 45.45%.

❗ Current head a14abdc differs from pull request most recent head 0c0bc83. Consider uploading reports for the commit 0c0bc83 to get more accurate results


```diff
@@            Coverage Diff             @@
##           master     #175      +/-   ##
==========================================
- Coverage   64.73%   64.45%   -0.28%
==========================================
  Files          17       17
  Lines         621      633      +12
==========================================
+ Hits          402      408       +6
- Misses        219      225       +6
```

| Impacted Files | Coverage Δ |
| --- | --- |
| src/args.rs | 0.00% <ø> (ø) |
| src/scraper.rs | 24.74% <45.45%> (+1.66%) ⬆️ |

This allows fine-grained control over whether a page is visited (that is, its links analyzed) and whether it is saved to disk.
Decoupling the download filter from the visit filter means the complete website can still be explored while only some files are downloaded. To speed up scraping, irrelevant links can simply be excluded from visiting.
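A sketch of how the decoupled checks might fit together (hypothetical names, stubbed I/O; the real logic lives in src/scraper.rs): the visit filter gates fetching and link extraction, while the download filter independently gates writing to disk.

```rust
use regex::RegexSet;

// Hypothetical filter pair, one instance for the visit rules and one for
// the download rules.
struct Filter {
    include: RegexSet,
    exclude: RegexSet,
}

impl Filter {
    fn accepts(&self, url: &str) -> bool {
        let included = self.include.is_empty() || self.include.is_match(url);
        included && !self.exclude.is_match(url)
    }
}

// Stubs so the sketch is self-contained; the real scraper does HTTP and I/O.
fn fetch(_url: &str) -> String { String::new() }
fn extract_links(_body: &str) -> Vec<String> { Vec::new() }
fn save_to_disk(_url: &str, _body: &str) {}

// Only visited pages are fetched and have their links followed; of those,
// only pages passing the download filter are also written to disk.
fn process(url: &str, visit: &Filter, download: &Filter, queue: &mut Vec<String>) {
    if !visit.accepts(url) {
        return; // never fetched: no traffic, no memory spent on its links
    }
    let body = fetch(url);
    queue.extend(extract_links(&body));
    if download.accepts(url) {
        save_to_disk(url, &body);
    }
}
```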

raphCode commented Mar 15, 2022

I thought about it a bit more, and what we want is separate regex filters for downloading and visiting:

  • specify what to download while still exploring and finding all links
  • exclude any irrelevant links (faster crawling, smaller memory footprint, etc.)

To put it in other words:

  • just using the proposed --in/exclude-download arguments reproduces the traditional behavior from before this PR: visit all pages, download whatever matches.
  • using --in/exclude-download with --visit-filter-is-download-filter behaves like the initial commit of this PR: pages that won't be downloaded are not visited.
  • anything in between can be specified by combining --in/exclude-download with --in/exclude-visit; of course, only visited pages can be downloaded. (See the example invocation below.)
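For instance, combining the two filter families might look like this (flag names as proposed in this PR; the merged CLI may spell them differently):

```sh
# Explore the whole site to discover links, but only save PDF files,
# and never descend into the irrelevant, huge /forum/ subtree.
suckit https://example.org \
    --include-download '\.pdf$' \
    --exclude-visit '/forum/'
```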

If this is merged, I advise using a squash commit to hide the intermediate development.

@Skallwar (Owner) left a comment


Thank you very much. This is a good idea. Please add some tests for this and we will merge it.

@raphCode (Contributor, Author)

I always fail to think of useful tests.
How would you check that some links are not visited?

I could just duplicate the existing tests in filters.rs and make them use the visit filters instead of the download filters, relying on the side effect that unvisited links are not downloaded either.
But I think this does not capture the distinction intended by those options.

@Skallwar (Owner)

> How would you check that some links are not visited?

You could put a valid link (one that should be downloaded) into a page that should not be visited, and then check whether that link was downloaded.
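A self-contained sketch of that test idea, using an in-memory site and a substring filter in place of suckit's real test server and regex filters (all names here are illustrative):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Toy crawler: a URL is "downloaded" only if it is visited, and links on
// unvisited pages are never discovered. `visit_exclude` is a plain
// substring here; the real filters are regexes.
fn crawl(site: &HashMap<&str, Vec<&str>>, start: &str, visit_exclude: &str) -> HashSet<String> {
    let mut downloaded = HashSet::new();
    let mut queue = VecDeque::from([start.to_string()]);
    while let Some(url) = queue.pop_front() {
        if url.contains(visit_exclude) || !downloaded.insert(url.clone()) {
            continue; // excluded from visiting, or already processed
        }
        for link in site.get(url.as_str()).into_iter().flatten() {
            queue.push_back(link.to_string());
        }
    }
    downloaded
}

#[test]
fn excluded_page_links_are_not_downloaded() {
    // /hidden/ is excluded from visiting, but it links to a page that the
    // download filter would otherwise accept.
    let site = HashMap::from([
        ("/", vec!["/hidden/"]),
        ("/hidden/", vec!["/hidden/wanted.html"]),
        ("/hidden/wanted.html", vec![]),
    ]);
    let downloaded = crawl(&site, "/", "/hidden/");
    assert!(downloaded.contains("/"));
    assert!(!downloaded.contains("/hidden/wanted.html"));
}
```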

Commit: clean up the test output directory even when a test fails; otherwise a failed test leaves the directory populated, which fails future test runs.
@raphCode (Contributor, Author)

Basically, I just duplicated the existing tests and adjusted them for the visit filter regexes.

@telugu-boy

Any update on this? It's almost a bug without this feature at this point; --exclude and --include are not doing what they are supposed to 😢

@Skallwar (Owner)

Sorry for the delay, I will review this tomorrow

@CohenArthur (Collaborator) left a comment


Sorry for the delay! Looks good to me!

@Skallwar (Owner) left a comment


Thank you very much! Sorry once again for the delay.

@Skallwar (Owner)

Hmm, the CI is not running.

@Skallwar (Owner)

We are in this case... https://github.community/t/missing-approve-and-run-button/200572
I will merge this in the meantime; hopefully GitHub will fix this.

@Skallwar merged commit 97e3a16 into Skallwar:master on Apr 24, 2022
@Skallwar (Owner)

The CI works perfectly on master, all good 🚀 🔥
