Filter links before downloading / adding to the queue #175
Conversation
This commit speeds up scraping in scenarios where pages have a high branch factor, that is, many links, the majority of which are excluded by the --exclude / --include rules. It also improves memory usage in these scenarios, since excluded links are no longer stored, and reduces network traffic, since they are never downloaded in the first place.
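As a rough sketch of the idea in Rust (the `Filters` type and `enqueue_links` function here are illustrative assumptions, not the project's actual API), links are tested against the include/exclude regexes before they are ever queued:

```rust
use regex::Regex;
use std::collections::VecDeque;

struct Filters {
    include: Option<Regex>,
    exclude: Option<Regex>,
}

impl Filters {
    /// Keep a URL only if it matches the include rule (when present)
    /// and does not match the exclude rule (when present).
    fn keeps(&self, url: &str) -> bool {
        let included = self.include.as_ref().map_or(true, |re| re.is_match(url));
        let excluded = self.exclude.as_ref().map_or(false, |re| re.is_match(url));
        included && !excluded
    }
}

/// Filter links *before* they enter the queue: a rejected link is never
/// stored and never downloaded, which saves memory and bandwidth.
fn enqueue_links(queue: &mut VecDeque<String>, links: Vec<String>, filters: &Filters) {
    for link in links {
        if filters.keeps(&link) {
            queue.push_back(link);
        }
    }
}
```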
Codecov Report
@@            Coverage Diff             @@
##           master     #175      +/-   ##
==========================================
- Coverage   64.73%   64.45%   -0.28%
==========================================
  Files          17       17
  Lines         621      633      +12
==========================================
+ Hits          402      408       +6
- Misses        219      225       +6
This allows fine-grained control over whether a page is visited (that is, its links analyzed) and whether it is saved to disk. Decoupling the download and visit filters means the complete website can still be explored while only some files are downloaded. To speed up scraping, irrelevant links can simply be excluded from visiting.
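One plausible reading of that decoupling, again as a hedged Rust sketch (reusing the hypothetical `Filters` type from the sketch above; `fetch`, `extract_links`, and `save_to_disk` are stubs standing in for the real networking and disk code):

```rust
use std::collections::VecDeque;

fn fetch(_url: &str) -> String { String::new() }            // stub: HTTP GET
fn extract_links(_body: &str) -> Vec<String> { Vec::new() } // stub: HTML parsing
fn save_to_disk(_url: &str, _body: &str) {}                 // stub: file write

fn process(url: &str, visit: &Filters, download: &Filters, queue: &mut VecDeque<String>) {
    // The visit filter gates exploration: a rejected page is never fetched,
    // so its links are never discovered. This check costs no network traffic.
    if !visit.keeps(url) {
        return;
    }
    let body = fetch(url);
    for link in extract_links(&body) {
        queue.push_back(link);
    }
    // The download filter only gates persistence: the whole site can still
    // be explored while just the matching pages are written to disk.
    if download.keeps(url) {
        save_to_disk(url, &body);
    }
}
```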
I thought about it a bit more, and what we want is separate regex filters for downloading and visiting:
To put it in other words:
If this is merged, I advise using a squash commit to hide the intermediate development.
Thank you very much. This is a good idea. Please add some tests for this and we will merge it.
I always struggle to think of useful tests. I could just duplicate the existing tests in …
You could put a valid link (one that should be downloaded) in a page that should not be visited, and check whether the valid link has been downloaded or not.
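A self-contained sketch of that test idea, with an in-memory "site" standing in for a real server (the `crawl` helper and all names are hypothetical, and only the visit-exclude side is modelled):

```rust
use regex::Regex;
use std::collections::{HashMap, HashSet, VecDeque};

/// Toy crawler: follows links breadth-first, skipping pages whose URL
/// matches the visit-exclude regex. "Downloading" is just recording the URL.
fn crawl(
    site: &HashMap<&str, Vec<&str>>, // url -> links found on that page
    start: &str,
    visit_exclude: &Regex,
) -> HashSet<String> {
    let mut downloaded = HashSet::new();
    let mut queue = VecDeque::from([start.to_string()]);
    while let Some(url) = queue.pop_front() {
        // Skip excluded pages, and pages that were already processed.
        if visit_exclude.is_match(&url) || !downloaded.insert(url.clone()) {
            continue;
        }
        for &link in site.get(url.as_str()).into_iter().flatten() {
            queue.push_back(link.to_string());
        }
    }
    downloaded
}

#[test]
fn link_in_unvisited_page_is_not_downloaded() {
    let mut site = HashMap::new();
    site.insert("/", vec!["/skipped.html"]);
    // The only path to /wanted.txt goes through the excluded page.
    site.insert("/skipped.html", vec!["/wanted.txt"]);
    site.insert("/wanted.txt", vec![]);

    let downloaded = crawl(&site, "/", &Regex::new("skipped").unwrap());

    // /skipped.html was never visited, so the valid link inside it was
    // never discovered, hence never downloaded.
    assert!(!downloaded.contains("/wanted.txt"));
}
```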
Otherwise a failed test leaves the directory populated, which breaks future test runs.
Basically I just duplicated the existing tests and adjusted them for the visit filter regex.
Any update on this? It's almost a bug without the feature at this point: --exclude and --include are not doing what they are supposed to 😢
Sorry for the delay, I will review this tomorrow
Sorry for the delay! Looks good to me!
Thank you very much! Sorry once again for the delay
Hmm, the CI is not running
We are in this case... https://github.community/t/missing-approve-and-run-button/200572
The CI works perfectly on master, all good 🚀 🔥