
check if a spider exists before schedule it (with sqlite cache) #17

Merged 5 commits on Jul 10, 2014

Conversation

xaqq
Contributor

@xaqq xaqq commented May 6, 2013

I pulled #8 into my local repo and added SQLite caching, as pablohoffman suggested. I ran some performance tests using ApacheBench, and it's OK.
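As a rough illustration of the approach described above, here is a minimal sketch of a SQLite-backed key/value cache for spider lists. This is a hypothetical reconstruction, not Scrapyd's actual `UtilsCache`/`JsonSqliteDict` implementation; the class and table names are assumptions.

```python
import json
import sqlite3


class SqliteSpiderCache:
    """Hypothetical sketch: cache JSON-serializable values (e.g. a
    project's spider list) in a SQLite table keyed by a string."""

    def __init__(self, database=":memory:", table="spider_cache"):
        self.conn = sqlite3.connect(database)
        self.table = table
        self.conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} (key TEXT PRIMARY KEY, value BLOB)"
        )

    def __setitem__(self, key, value):
        # Serialize the value to JSON and upsert it.
        self.conn.execute(
            f"INSERT OR REPLACE INTO {self.table} (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self.conn.commit()

    def __getitem__(self, key):
        row = self.conn.execute(
            f"SELECT value FROM {self.table} WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return json.loads(row[0])

    def __delitem__(self, key):
        # Invalidate a cached entry, e.g. after a deploy changes the spiders.
        self.conn.execute(f"DELETE FROM {self.table} WHERE key = ?", (key,))
        self.conn.commit()
```

With an in-memory `:memory:` database the cache lives only as long as the process, which (as noted later in this thread) makes SQLite optional; a file path would make it persistent.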

@@ -7,6 +8,28 @@
from scrapy.utils.python import stringify_dict, unicode_to_str
from scrapyd.config import Config

class UtilsCache:
Contributor

Can you add a class level doc to explain what it does?

@jayzeng
Contributor

jayzeng commented Jul 4, 2014

looks good to me, @pablohoffman thoughts?

@jayzeng
Contributor

jayzeng commented Jul 10, 2014

I will go ahead and merge this pull request.

jayzeng added a commit that referenced this pull request Jul 10, 2014
check if a spider exists before schedule it (with sqlite cache)
@jayzeng jayzeng merged commit b9a38f6 into scrapy:master Jul 10, 2014
@pablohoffman
Member

OK, but we should probably move this cache so it stores its data in whatever database Scrapyd ends up using for persisting data.

jpmckinney added a commit that referenced this pull request Jul 22, 2024
UtilsCache.__init__ calls JsonSqliteDict(table="utils_cache_manager"), which uses ":memory:" as the database.

A comment in #17 suggests persisting this cache. However, there is no contract that egg storage must only be modified
by Scrapyd. (For example, users can happily store eggs in the egg directory before deploying Scrapyd.)

Without persistence, there is really no reason to use SQLite. We can therefore use a simpler approach.

This changes the get_spider_list function into a SpiderList class:

- Require the runner argument
- Remove the pythonpath argument (unused)
- Remove the config argument (see next commit)
- Use get(), set() and delete() methods, instead of having to invalidate the cache with calls to UtilsCache
- Evict only the specified version and the default version on delversion.json, instead of all versions
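The simpler, non-SQLite approach outlined above could look something like the following sketch: a plain in-memory dict keyed by (project, version), with explicit `get()`/`set()`/`delete()` methods. The class name matches the commit message, but the method signatures and eviction details here are assumptions, not Scrapyd's actual code.

```python
class SpiderList:
    """Hypothetical sketch of an in-memory spider-list cache, keyed by
    (project, version). version=None denotes the default (latest) version."""

    def __init__(self):
        self.cache = {}

    def set(self, project, spiders, version=None):
        self.cache[(project, version)] = spiders

    def get(self, project, version=None):
        # Returns None on a cache miss.
        return self.cache.get((project, version))

    def delete(self, project, version=None):
        # On delversion.json, evict only the specified version and the
        # default entry (whose contents may have changed), not all versions.
        self.cache.pop((project, version), None)
        self.cache.pop((project, None), None)
```

Since the cache was never persisted anyway (it used a `:memory:` database), a dict gives the same semantics without the SQLite dependency.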
4 participants