Avoid running the same spider twice at the same time #228
Comments
@rolele see #140. Can you explain your use case in more detail? Why do you crawl each part of the site in a different run? Why can't the jobs run together? Are you worried about total traffic?
Thanks @Digenis for your answer. I do not have any pipeline besides pushing the whole page to Kafka. It is not about parallel requests per ip/domain, because I set that to 1 on my spider and I add a delay of 0.5 s per request, so I do not think I am being very aggressive. I am basically crawling e-commerce sites that have categories that change once every few months, categories such as "special sales" that change every week, and categories that change somewhere in between. Because I do not want to hit the website too much, I would prefer if scrapyd had an option so that I can schedule all my jobs in the queue but scrapyd runs them sequentially. Also, I do not want to waste time crawling the unchanged categories while I could use that computing power to crawl other websites whose categories have changed. I was looking at the per-project limit and it is not what I wish scrapyd had; what I want is a jobs-per-spider limit.
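(For reference, the per-spider throttling described above looks roughly like this in Scrapy; a minimal sketch only, where the spider name, the `category` argument, and the URLs are placeholders rather than anything from this thread.)

```python
# Sketch (assumed names) of the throttling described above:
# one concurrent request per domain and a 0.5 s download delay.
import scrapy


class CategorySpider(scrapy.Spider):
    name = "category"  # placeholder spider name

    custom_settings = {
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request to the site at a time
        "DOWNLOAD_DELAY": 0.5,                # wait 0.5 s between requests
    }

    def __init__(self, category=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # crawl only one part of the site per run, selected by the `category` argument
        self.start_urls = [f"https://www.example.com/{category}"]

    def parse(self, response):
        # hand the whole page downstream (e.g. to a Kafka item pipeline)
        yield {"url": response.url, "body": response.text}
```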
This is also currently a concern for me. I have the same case as @rolele: I wish to run multiple jobs on a single spider in parallel. My current solution is setting the …
@Dean-Christian-Armada, @rolele, …
Duplicates #153 (which is also a bit more general).
My use case is that I parametrized my spider to crawl part of the website.
The problem is that there is no option to prevent scrapyd from starting those crawls in parallel.
I would like those crawls to run sequentially.
Is this possible? I saw the max_proc_per_cpu option, which limits the number of crawl processes but does not provide the functionality I would like.
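For context, scrapyd's global limits live in scrapyd.conf (`max_proc`, `max_proc_per_cpu`), and parametrized runs are queued through the schedule.json endpoint. The sketch below (project name, spider name, and categories are placeholders) illustrates why `max_proc = 1` is only a blunt workaround: it serializes every job on the instance, not just runs of one spider.

```ini
# scrapyd.conf -- sketch; caps the whole instance, not a single spider
[scrapyd]
max_proc         = 1   # at most one crawl process at a time
max_proc_per_cpu = 4   # only used when max_proc is 0
```

```python
# Sketch: queue several parametrized runs of the same spider on a local scrapyd.
# With max_proc = 1 above they execute one after another, but so does every
# other job on this scrapyd instance.
import requests

SCRAPYD = "http://localhost:6800"  # default scrapyd address

for category in ["special-sales", "shoes", "books"]:  # placeholder categories
    response = requests.post(
        f"{SCRAPYD}/schedule.json",
        data={
            "project": "myproject",  # placeholder project name
            "spider": "category",    # placeholder spider name
            "category": category,    # forwarded to the spider as an argument
        },
    )
    print(response.json())  # e.g. {"status": "ok", "jobid": "..."}
```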