No link extraction on URI not successfully downloaded #161
During a test crawl I noticed a bunch of alerts like the following:
Turns out they are caused by the ExtractorHTML.shouldExtract() method trying to peek at the start of the content stream. At first I thought it was just down to this being a DNS entry, but I quickly realized that these all related to non-successful DNS lookups.
Which got me thinking, why are we even trying to extract links from non-successful CrawlURIs?
So this PR adds a check to ContentExtractor.shouldProcess() to immediately skip any URIs that were not successfully downloaded. This avoids the errors and saves a little otherwise-wasted effort.
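For context, a minimal sketch of the kind of guard described here, assuming Heritrix's CrawlURI.getFetchStatus() convention that non-positive status codes mean the fetch failed or never happened; the exact condition and placement in the actual patch may differ:

```java
// Hypothetical sketch, not the actual patch: bail out of content
// extraction early when the URI was never successfully downloaded.
@Override
protected boolean shouldProcess(CrawlURI uri) {
    // Heritrix fetch status codes: positive values indicate success
    // (an HTTP status or a DNS-success marker); zero or negative
    // values mark failures such as unresolvable DNS or connect errors.
    if (uri.getFetchStatus() <= 0) {
        return false; // nothing was downloaded, so there are no links to extract
    }
    // Otherwise defer to the extractor-specific check
    // (e.g. ExtractorHTML.shouldExtract(), which peeks at the content stream).
    return shouldExtract(uri);
}
```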