Dynamic crawlers with `RequestQueue` often enqueue URLs that never get processed because of the `maxRequestsPerCrawl` limit. This causes unnecessary RQ writes, which can be expensive, both computationally and financially in the case of RQ cloud providers.
Calls to `enqueueLinks` or `addRequests` on the crawler instance could become no-ops as soon as the related `RequestQueue`'s length reaches `maxRequestsPerCrawl`.
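A minimal sketch of what that short-circuit could look like; the helper and its parameters are hypothetical illustrations, not actual crawlee internals:

```ts
// Hypothetical guard sketching the proposed behavior; not real crawlee code.
async function addRequestsGuarded(
    options: { maxRequestsPerCrawl?: number },
    queue: { addRequests(reqs: { url: string }[]): Promise<void> },
    queueLength: number,
    requests: { url: string }[],
): Promise<void> {
    const limit = options.maxRequestsPerCrawl;
    // Skip the RQ write entirely once the queue already holds enough
    // requests to satisfy the crawl limit.
    if (limit !== undefined && queueLength >= limit) return;
    // Trim the batch so the queue never grows past the limit.
    const room = limit === undefined ? requests.length : limit - queueLength;
    await queue.addRequests(requests.slice(0, room));
}
```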
Possible issues & considerations
This might be a breaking change for users who read the RQ after the enqueuing crawler stops at the limit.
This would only work for crawler (helper) methods; `RQ.addRequests` must still work as before, since `maxRequestsPerCrawl` is a crawler option. See the sketch below.
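For illustration, the two call sites using crawlee's public API; the behavior described in the comments is the proposal, not current crawlee behavior:

```ts
import { CheerioCrawler, RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open();
const crawler = new CheerioCrawler({
    requestQueue,
    maxRequestsPerCrawl: 20,
    async requestHandler({ request }) {
        console.log(`Processed ${request.url}`);
    },
});

// Crawler helper: under the proposal, a noop once the queue holds 20 requests.
await crawler.addRequests([{ url: 'https://example.com/a' }]);

// Direct RQ write: unaffected, since maxRequestsPerCrawl is a crawler option.
await requestQueue.addRequests([{ url: 'https://example.com/b' }]);
```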
Another possible problem is that if there's a high failure rate, you could get way fewer than `maxRequestsPerCrawl` results if you cut off the request queue too early.
AFAIK that's expected with `maxRequestsPerCrawl` - if e.g. `maxRequestsPerCrawl: 20`, only 20 `Request` objects will be processed (and possibly retried up to `maxRequestRetries` times on errors), regardless of their success / failure state.
If I understand the current codebase correctly, the requests beyond `maxRequestsPerCrawl` in the RQ will never be touched.
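A minimal config sketch of those semantics, using crawlee's documented options (the numbers are illustrative):

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // At most 20 Request objects are processed, successful or not.
    maxRequestsPerCrawl: 20,
    // Each failing request is retried up to 3 times before counting as failed.
    maxRequestRetries: 3,
    async requestHandler({ request }) {
        console.log(`Handling ${request.url}`);
    },
});
```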