Avoid having two instances of the same spider running at the same time #228

Closed
rolele opened this issue Apr 12, 2017 · 5 comments

rolele commented Apr 12, 2017

My use case is that I parametrized my spider to crawl only part of the website.

  • For instance, one crawl handles one category and another crawl handles a different category, using the same spider.

The problem is that there is no option to prevent scrapyd from starting those crawls in parallel.
I would like those crawls to run sequentially.
Is this possible? I saw the max_proc_per_cpu option, which limits how many crawls run at once, but it does not provide the functionality I would like.
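For context, the process limits scrapyd exposes live in its scrapyd.conf; a minimal sketch with illustrative values (these cap overall parallelism, but cannot serialise two runs of the same spider specifically):

```ini
[scrapyd]
# Hard cap on concurrent Scrapy processes (0 means "derive from the CPU count").
max_proc = 0
# Processes started per CPU -- the max_proc_per_cpu option mentioned above.
max_proc_per_cpu = 4
```

Setting max_proc = 1 would force jobs to run one at a time, but globally, across all projects and spiders.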

Digenis (Member) commented Apr 12, 2017

@rolele, see #140 (this is a per-project limit).

Can you explain your use case in more detail?

Why do you crawl each part of the site in a different run?
Are there dependencies, like foreign keys in a database, that you need to fulfil first?
See if https://github.com/rolando/scrapy-inline-requests can help
(scrapy may integrate it one day, scrapy/scrapy#1144)

Why can't the jobs run together?
Are there locks in the pipeline?

Are you worried about total traffic exceeding the configured download delay, or about parallel requests per IP/domain?
See #221
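For reference, the download delay and per-domain/per-IP concurrency asked about here are ordinary Scrapy settings; an illustrative settings.py fragment (values are examples only):

```python
# settings.py -- example values only; tune them per site.
DOWNLOAD_DELAY = 0.5                 # seconds to wait between requests to the same slot
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # at most one in-flight request per domain
# CONCURRENT_REQUESTS_PER_IP = 1     # use this instead when several domains share an IP
```

Note that these limits are enforced per Scrapy process, so two jobs of the same spider running in parallel each get their own budget.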

rolele (Author) commented Apr 12, 2017

Thanks @Digenis for your answer.

I do not have any pipeline besides pushing the whole page to Kafka.

It is not about parallel requests per IP/domain, because I set that to 1 on my spider and add a delay of 0.5 s per request, so I don't think I am being very aggressive.
I do not want to crawl the whole website every time; I want to detect the categories that change and crawl them more often than the ones that rarely change. I will basically schedule some cron jobs that curl the schedule.json endpoint.
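The cron-driven scheduling described above talks to scrapyd's schedule.json endpoint; a small sketch in Python (project, spider and argument names are hypothetical, and the same request can be made with curl):

```python
import requests

# Hypothetical deployment details -- adjust to your own scrapyd instance.
SCRAPYD = "http://localhost:6800"

def schedule(category):
    # POST /schedule.json queues one job; extra form fields are passed to the
    # spider as arguments (here, which category this run should crawl).
    resp = requests.post(
        f"{SCRAPYD}/schedule.json",
        data={"project": "shop", "spider": "shop", "category": category},
    )
    resp.raise_for_status()
    return resp.json()["jobid"]

if __name__ == "__main__":
    print(schedule("special-sales"))
```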

I am basically crawling e-commerce sites that have categories that change once every few months, categories such as "special sales" that change every week, and categories that change at some rate in between.

Now, because I do not want to hit the website too much, I would prefer that scrapyd had an option letting me schedule all my jobs in the queue and have it run them sequentially.

Also, I do not want to waste time crawling the unchanged categories when I could use that computing power to crawl other websites whose categories have changed.

I was looking at the per-project limit and it is not what I wish scrapyd had. It is a limit on jobs per spider.

Dean-Christian-Armada commented Nov 7, 2017

This is also currently a concern for me. I'm having the same case as @rolele: I wish to run multiple jobs of a single spider in parallel.

My current solution is setting poll_interval to 1.0 seconds, but it would be better if parallelism could be implemented.

Digenis (Member) commented Nov 8, 2017

@Dean-Christian-Armada,
Your issue is different. You want to run more spiders in parallel.
See #187 and #173. Also know that you can have a sub-second polling interval.
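(A sketch of the polling setting being discussed, in scrapyd.conf; the value is illustrative:)

```ini
[scrapyd]
# How often the queues are polled for pending jobs, in seconds.
# Fractional values work, so sub-second polling is possible.
poll_interval = 0.5
```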

@rolele,
Unfortunately, scrapyd is still a long way from this feature.
You can, however, solve your use case in the spider itself.
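A minimal sketch of what solving it in the spider could look like: one parametrized spider takes all the categories for a site as a single job, so Scrapy's own per-domain throttling spaces the requests out instead of two scrapyd jobs competing. The spider name, URL pattern and settings below are assumptions, not the project's actual code.

```python
import scrapy


class CategorySpider(scrapy.Spider):
    # Hypothetical spider/site names -- for illustration only.
    name = "shop"
    custom_settings = {
        "DOWNLOAD_DELAY": 0.5,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    }

    def __init__(self, categories="", *args, **kwargs):
        # Scheduled e.g. via schedule.json with categories=special-sales,shoes
        super().__init__(*args, **kwargs)
        self.categories = [c for c in categories.split(",") if c]

    def start_requests(self):
        for category in self.categories:
            # All categories go through one Scrapy process, so the per-domain
            # delay and concurrency limit above apply across the whole run.
            yield scrapy.Request(
                f"https://example.com/{category}",
                callback=self.parse_category,
                cb_kwargs={"category": category},
            )

    def parse_category(self, response, category):
        # Hand the raw page downstream (e.g. to the Kafka pipeline mentioned above).
        yield {"category": category, "url": response.url, "body": response.text}
```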

jpmckinney (Contributor) commented

Duplicates #153 (which is also a bit more general).
