
Don't enqueue over maxRequestsPerCrawl #2728

Open
barjin opened this issue Oct 29, 2024 · 2 comments
Labels
t-tooling Issues with this label are in the ownership of the tooling team.
@barjin (Contributor) commented Oct 29, 2024

Dynamic crawlers with RequestQueue often enqueue URLs that never get processed because of the maxRequestsPerCrawl limit. This causes unnecessary RQ writes, which can be expensive, both computationally and financially in the case of RQ cloud providers.

The calls to enqueueLinks or addRequests on the crawler instance could become no-ops as soon as the related RequestQueue's length reaches the maxRequestsPerCrawl limit.

Possible issues & considerations

  • This might be a breaking change for users who read the RQ after the enqueuing crawler stops at the limit.
  • This would only work for crawler (helper) methods; RQ.addRequests must still work as before (maxRequestsPerCrawl is a crawler option).
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Oct 29, 2024
@janbuchar (Contributor)

Another possible problem is that if there's a high failure rate, you could get far fewer than maxRequestsPerCrawl results if you cut off the request queue too early.

@barjin (Contributor, Author) commented Oct 29, 2024

Afaik that's expected with maxRequestsPerCrawl - if e.g. maxRequestsPerCrawl: 20, only 20 Request objects will be processed (and possibly retried on errors maxRequestRetries times), regardless of the success / failure state.

If I understand the current codebase correctly, the requests beyond maxRequestsPerCrawl in the RQ will never be touched anyway.
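The semantics described above can be illustrated with a tiny self-contained simulation (a hypothetical helper, not Crawlee code): only the first maxRequestsPerCrawl distinct requests are processed, retries of a failing request do not consume extra slots, and everything beyond the limit is never touched.

```typescript
// Simplified model of maxRequestsPerCrawl + maxRequestRetries semantics.
function simulateCrawl(
  urls: string[],
  maxRequestsPerCrawl: number,
  maxRequestRetries: number,
  alwaysFails: Set<string>,
): string[] {
  const processed: string[] = [];
  // Requests past the limit are never touched.
  for (const url of urls.slice(0, maxRequestsPerCrawl)) {
    let attempts = 0;
    // A failing request is retried up to maxRequestRetries times...
    while (attempts <= maxRequestRetries) {
      attempts++;
      if (!alwaysFails.has(url)) break; // success
    }
    // ...but it counts as processed regardless of success / failure.
    processed.push(url);
  }
  return processed;
}
```

For example, with three enqueued URLs, a limit of 2, and the first URL always failing, exactly the first two URLs count toward the limit and the third is never reached.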

@B4nan B4nan added this to the 4.0 milestone Nov 4, 2024