Scrapy-sqlite is a tool that lets you feed and queue URLs from SQLite using Scrapy spiders.
Inspired by and modeled after scrapy-rabbitmq, which was in turn inspired by and modeled after scrapy-redis.
By default, these classes use the same table for the Scheduler, Dupefilter, Queue and Httpcache.
Using pip, type in your command-line prompt:

```
pip install scrapy-sqlite
```

Or clone the repo and, inside the scrapy-sqlite directory, type:

```
python setup.py install
```
Step 1: In your Scrapy project settings, add the following values:

```python
# Enables scheduling and storing the requests queue in SQLite.
SCHEDULER = "scrapy_sqlite.scheduler.Scheduler"

# Don't clean up SQLite queues; allows pausing/resuming crawls.
SCHEDULER_PERSIST = True

# Path to the SQLite db file; with this pattern a spider named "imdb"
# creates the file imdb.sqlite3.
SQLITE_DATABASE = '%(spider)s.sqlite3'

# Store scraped items in SQLite for post-processing.
ITEM_PIPELINES = {
    'scrapy_sqlite.pipelines.SQLitePipeline': 100,
}

# Cache storage class.
HTTPCACHE_STORAGE = 'scrapy_sqlite.httpcache.SQLiteCacheStorage'

# Enables gzip compression of cached responses. Note: if a response was
# already received compressed (a common and recommended webserver
# feature), it gets compressed a second time.
HTTPCACHE_GZIP = True
```
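The cache storage backend above only takes effect when Scrapy's built-in HTTP cache is switched on, so you will likely also want:

```python
# Enable Scrapy's HttpCacheMiddleware so the SQLite cache storage is used.
HTTPCACHE_ENABLED = True
```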
Step 2: Add SQLiteMixin to your spider:

```python
from scrapy.spiders import CrawlSpider

from scrapy_sqlite.spiders import SQLiteMixin


class MultiDomainSpider(SQLiteMixin, CrawlSpider):
    name = 'multidomain'

    def parse(self, response):
        # parse all the things
        pass
```
Step 3: Run the spider using the scrapy client:

```
scrapy runspider multidomain_spider.py
```
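Because SCHEDULER_PERSIST is enabled in the settings above, pending requests survive in the spider's .sqlite3 file, so a crawl can be paused and resumed by running the same command again:

```
scrapy runspider multidomain_spider.py   # Ctrl-C to pause the crawl
scrapy runspider multidomain_spider.py   # run again to resume the queue
```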
Pushing URLs into SQLite is still a TODO; the script below is an unfinished stub that only opens and closes the database:

```python
#!/usr/bin/env python
# TODO: reprogram to sqlite

import sqlite3

# Stub: pushing URLs is not implemented yet.
connection = sqlite3.connect('db.sqlite3')
connection.close()
```
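Until that port lands, here is a purely hypothetical sketch of what pushing a URL might look like. The `queue` table and its `url` column are assumptions made for illustration, not the actual scrapy-sqlite schema; inspect the tables the scheduler creates in your .sqlite3 file before adapting it:

```python
#!/usr/bin/env python
# Hypothetical sketch: 'queue' and its 'url' column are ASSUMED names,
# not the real scrapy-sqlite schema.
import sqlite3

# Matches SQLITE_DATABASE = '%(spider)s.sqlite3' for the example spider.
connection = sqlite3.connect('multidomain.sqlite3')
cursor = connection.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS queue (url TEXT)")
cursor.execute("INSERT INTO queue (url) VALUES (?)", ("http://example.com/",))
connection.commit()
connection.close()
```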
See the changelog for release details.

| Version | Release Date |
|---------|--------------|
| 0.1.0   | 2017-05-01   |
Copyright (c) 2017 Filip Hanes - Released under The MIT License.