"because crawlers are awesome"
##Balerion is a very basic crawler that does the following:
- Take a given input url, fetch all the links on that page using regex matching, and add them to a repository of links.
- Take a random url from the existing repository and repeat step 1 recursively.
- Stop when all the links have been fetched or the preset max-fetch limit is reached.
##Working:

    # python crawler.py [root-url] [external-allowed] [redirect-allowed] [max-limit]
    # 1 => allowed | 0 => not allowed
    python crawler.py http://rohitjangid.com 1 1 100

    # test suite
    python test.py
Basic algorithm (a rough code sketch follows the list):

1. Take the input seed url.
2. If the seed url is valid, push it into the queue with priority 0 and proceed.
3. While the queue is not empty and the max-limit is not reached, pop the next url:
  - 3.a Read the page and extract the http links.
  - 3.b Normalize them and store them in a priority queue that keeps only unique links with the help of a hashtable. Internal links get priority 1, external links get priority 2.
  - 3.c Repeat.
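The loop in step 3 roughly translates to the sketch below (Python 3 for brevity, even though the project itself targets Python 2.7; `extract_links` and its regex are illustrative placeholders, not the actual code in crawler.py):

```python
import heapq
import re
from urllib.parse import urljoin, urlsplit
from urllib.request import urlopen

LINK_RE = re.compile(r'href=["\'](.*?)["\']')

def extract_links(url):
    """Fetch a page and pull href values out with a regex (illustrative only)."""
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    except Exception:
        return []
    return [urljoin(url, href) for href in LINK_RE.findall(html)]

def crawl(seed_url, max_limit=100):
    seen = {seed_url}          # hashtable of urls already queued
    queue = [(0, seed_url)]    # priority queue of (priority, url); the seed gets priority 0
    root_netloc = urlsplit(seed_url).netloc
    fetched = 0

    while queue and fetched < max_limit:
        _, url = heapq.heappop(queue)          # pop the next url
        fetched += 1
        for link in extract_links(url):        # 3.a read the page, extract links
            if link in seen:                   # 3.b keep unique links only
                continue
            seen.add(link)
            # internal links get priority 1, external links priority 2
            priority = 1 if urlsplit(link).netloc == root_netloc else 2
            heapq.heappush(queue, (priority, link))
    return seen
```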
##pylint score 10/10 for crawler.py
###credits:
- blog.mischel.com
- theanti9-pycrawler
- wrttnwrd-cmcrawler
- block8437-python-spyder
- oocities
- bbrodriges-pholcidae
###Day 1:
- Used the urllib module to open any http link.
- Read the response and converted it into a utf-8 string to allow regex search.
- Stored the found links in a file (sketched below).
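The Day 1 version boils down to something like this (Python 3 shown here; the regex and the `links.txt` file name are illustrative guesses, not the exact code):

```python
import re
from urllib.request import urlopen

response = urlopen("http://rohitjangid.com")
html = response.read().decode("utf-8", errors="replace")    # bytes -> utf-8 string for regex search

links = re.findall(r'href=["\'](http[^"\']+)["\']', html)   # naive regex link extraction

with open("links.txt", "w") as out:                          # store the found links in a file
    out.write("\n".join(links))
```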
todos for tomorrow:
####fix errors:
- Only accept links of the form 'http://rohitjangid.com' and not 'rohitjangid.com'
- Store relative (non-http) links found on the website with the base url prefixed.
- Add a very basic test suite.
####things to study:
- Scrapy's architecture, and whether using XPath is required or not.
- A better way to deal with encodings.
###major update:
- Maintained a queue of remaining links to be processed.
- Used a hashtable to keep track of old links, adding links to the queue only if they are new.
- Used BeautifulSoup for HTML processing.
- Added filters checking the content-type and the http-header scheme (see the sketch after this list).
- Downgraded the Python version to 2.7.4.
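The scheme and content-type filters plus the BeautifulSoup parsing probably look something like this sketch (Python 3 syntax; the function name is made up, not taken from crawler.py):

```python
from urllib.parse import urlsplit
from urllib.request import urlopen

from bs4 import BeautifulSoup   # pip install beautifulsoup4

def fetch_html_links(url):
    # scheme filter: only follow http(s) urls
    if urlsplit(url).scheme not in ("http", "https"):
        return []
    response = urlopen(url, timeout=10)
    # content-type filter: only parse pages that are actually html
    if "text/html" not in response.headers.get("Content-Type", ""):
        return []
    soup = BeautifulSoup(response.read(), "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]
```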
###next:
- robots.txt check (a possible sketch follows this list).
- switch for internal and external links.
- explore multiple connections or parallel links to improve performance
- logging support added. [ yawning ... ]
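The robots.txt check could take this shape with the standard library's robot parser (urllib.robotparser in Python 3, the robotparser module in Python 2.7; the "Balerion" user agent is just an assumption):

```python
from urllib.parse import urljoin, urlsplit
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url, user_agent="Balerion"):
    parts = urlsplit(url)
    robots_url = urljoin("{0}://{1}".format(parts.scheme, parts.netloc), "/robots.txt")
    parser = RobotFileParser(robots_url)
    parser.read()                           # fetch and parse robots.txt
    return parser.can_fetch(user_agent, url)
```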
Trying to solve the url-uniqueness problem: it seems to be the most important issue for a crawler.
Standard url format: scheme://netloc/path;parameters?query#fragment.
Possible strategy: url normalization followed by prioritization. Urls in the same domain should be traversed earlier than external ones.
Normalization plus the priority queue enable the internal-links-only feature and minimize infinite-loop situations (a normalization sketch follows below).
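One possible normalize() for the scheme://netloc/path;parameters?query#fragment format above (just a sketch of the idea, not necessarily what crawler.py does):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # drop default ports so http://example.com:80/ and http://example.com/ collapse together
    if (scheme, parts.port) in (("http", 80), ("https", 443)):
        netloc = parts.hostname
    path = parts.path or "/"
    # drop the fragment: it never changes the document that gets fetched
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

e.g. `normalize("HTTP://Example.COM:80/index.html#top")` comes out as `http://example.com/index.html`, so both spellings hash to the same entry in the seen-links table.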
laptop out of reach. :-/ couldn't do anything.