Crochet-based blocking API for Scrapy.
This module provides function helpers to run Scrapy in a blocking fashion. See the scrapydo-overview.ipynb notebook for a quick overview of this module.
Using pip:
pip install scrapydo
The function scrapydo.setup must be called once to initialize the reactor.
Example:
import scrapydo
scrapydo.setup()
scrapydo.default_settings.update({
    'LOG_LEVEL': 'DEBUG',
    'CLOSESPIDER_PAGECOUNT': 10,
})
# Enable logging display
import logging
logging.basicConfig(level=logging.DEBUG)
# Fetch a single URL.
response = scrapydo.fetch("http://example.com")
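# `response` is an ordinary scrapy Response, so the usual selector API
# is available, e.g. response.css('title::text').extract_first().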
# Crawl a URL with the given callback.
import scrapy

def parse_page(response):
    yield {
        'title': response.css('title').extract(),
        'url': response.url,
    }
    for href in response.css('a::attr(href)'):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=parse_page)

items = scrapydo.crawl('http://example.com', parse_page)
# Run an existing spider class.
spider_args = {'foo': 'bar'}
items = scrapydo.run_spider(MySpider, **spider_args)
Available functions:

scrapydo.setup()
- Initializes the reactor.

scrapydo.fetch(url, spider_cls=DefaultSpider, capture_items=True, return_crawler=False, settings=None, timeout=DEFAULT_TIMEOUT)
- Fetches a URL and returns the response.

scrapydo.crawl(url, callback, spider_cls=DefaultSpider, capture_items=True, return_crawler=False, settings=None, timeout=DEFAULT_TIMEOUT)
- Crawls a URL with the given callback and returns the scraped items.

scrapydo.run_spider(spider_cls, capture_items=True, return_crawler=False, settings=None, timeout=DEFAULT_TIMEOUT, **kwargs)
- Runs a spider and returns the scraped items; see the sketch after this list.

highlight(code, lexer='html', formatter='html', output_wrapper=None)
- Highlights the given code using Pygments. This function is suitable for use in an IPython notebook; see the example after this list.
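For run_spider, the settings parameter overrides settings per call and any extra **kwargs become spider arguments. A minimal sketch, assuming only the signature listed above; the spider class and its tag argument are hypothetical:

import scrapy
import scrapydo

scrapydo.setup()

class QuotesSpider(scrapy.Spider):
    # Hypothetical spider, used only for illustration.
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com']

    def __init__(self, tag=None, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)
        self.tag = tag  # received via run_spider's **kwargs

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }

# Per-call settings override plus a spider argument.
items = scrapydo.run_spider(
    QuotesSpider,
    settings={'CLOSESPIDER_PAGECOUNT': 1},
    tag='inspirational',
)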
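Similarly, highlight can render a fetched page's markup with syntax coloring in an IPython notebook. A short sketch, assuming the helper is importable as scrapydo.highlight like the other functions; only its signature is taken from the list above:

import scrapydo

scrapydo.setup()

response = scrapydo.fetch("http://example.com")
# Defaults apply: lexer='html', formatter='html'.
scrapydo.highlight(response.text[:500])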