second-spider

A simple concurrent web spider for Python, built on gevent.

Features

  1. Concurrency built on gevent
  2. Highly configurable crawl strategy:
  • Maximum crawl depth
  • Maximum total number of URLs
  • Maximum number of concurrent HTTP requests, to avoid overloading (DoS-ing) the target
  • Request headers and cookies
  • Same-host strategy
  • Same-domain strategy
  • Maximum running time
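
To make the strategy options above concrete, here is a minimal sketch of how such limits could be checked before a URL is queued. It is not second-spider's actual API: `CrawlPolicy`, its fields, and `allows` are hypothetical names chosen for illustration, and the same-domain rule is a deliberate simplification that strips a leading `www.` from the root host (a real crawler would more likely consult the Public Suffix List). It assumes Python 3.7+ and uses only the standard library.

```python
import time
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class CrawlPolicy:
    """Hypothetical container for crawl limits (not second-spider's real API)."""
    root_url: str
    max_depth: int = 3          # links followed away from the root
    max_urls: int = 1000        # total URLs the crawl may visit
    max_seconds: float = 60.0   # wall-clock budget for the whole crawl
    same_host: bool = False     # only follow URLs on exactly the root host
    same_domain: bool = True    # only follow the root host and its subdomains
    started_at: float = field(default_factory=time.monotonic)

    def _host(self, url):
        # hostname is None for relative URLs; treat that as an empty host.
        return (urlsplit(url).hostname or "").lower()

    def _base_domain(self):
        # Simplification (assumption): drop a leading "www." from the root host.
        host = self._host(self.root_url)
        return host[4:] if host.startswith("www.") else host

    def allows(self, url, depth, seen_count):
        """Return True if `url` at `depth` may still be queued."""
        if depth > self.max_depth:
            return False
        if seen_count >= self.max_urls:
            return False
        if time.monotonic() - self.started_at > self.max_seconds:
            return False
        host = self._host(url)
        if self.same_host and host != self._host(self.root_url):
            return False
        if self.same_domain:
            base = self._base_domain()
            if host != base and not host.endswith("." + base):
                return False
        return True
```

With a policy rooted at `http://www.sina.com.cn`, a link to `http://news.sina.com.cn/` would pass the same-domain check but fail the stricter same-host check. The concurrency limit is not modelled here; with gevent it is commonly enforced by spawning requests through a fixed-size `gevent.pool.Pool`.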

Dependencies

Test

python spider.py -v

Example

import logging

from spider import Spider

# Show DEBUG-level log output, with a timestamp and level on each line.
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s %(levelname)s %(message)s')

spider = Spider()
spider.setRootUrl("http://www.sina.com.cn")  # URL the crawl starts from
spider.run()

TODO

  • Support distributed crawling: replace gevent.Queue with a Redis-backed queue
  • Make the storage backend configurable
  • Support Ajax-loaded URLs (via WebKit, etc.)

LICENSE

Copyright © 2013 by kenshin

Released under the MIT license: rem.mit-license.org