"because crawlers are awesome"
##Balerion is a very basic crawler that does the following:
- Take a given input url, fetch all the links on that page using regex matching, and add them to a repository of links.
- Take a random url from the existing repository and repeat step 1 recursively.
- Stop when all the links have been fetched or the preset max-fetch limit is reached.
##Working:

    # python crawler.py [root-url] [external-allowed] [redirect-allowed] [max-limit]
    # 1 => allowed | 0 => not allowed
    python crawler.py http://rohitjangid.com 1 1 100

    # test suite
    python test.py
Basic algorithm (a rough code sketch follows the list):

1. Take the input seed url.
2. If the seed url is valid, push it into the queue with priority 0 and proceed.
3. While the queue is not empty and the max-limit is not reached, pop the next url:
  - 3.a Read the page and extract the http links.
  - 3.b Normalize them and store them in a priority queue that keeps only unique links with the help of a hashtable. Internal links get priority 1, external links get priority 2.
  - 3.c Repeat.
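The loop in step 3 roughly translates to the sketch below (Python 3 for brevity, even though the project itself targets Python 2.7; `extract_links` and its regex are illustrative placeholders, not the actual code in crawler.py):

```python
import heapq
import re
from urllib.parse import urljoin, urlsplit
from urllib.request import urlopen

LINK_RE = re.compile(r'href=["\'](.*?)["\']')

def extract_links(url):
    """Fetch a page and pull href values out with a regex (illustrative only)."""
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    except Exception:
        return []
    return [urljoin(url, href) for href in LINK_RE.findall(html)]

def crawl(seed_url, max_limit=100):
    seen = {seed_url}          # hashtable of urls already queued
    queue = [(0, seed_url)]    # priority queue of (priority, url); the seed gets priority 0
    root_netloc = urlsplit(seed_url).netloc
    fetched = 0

    while queue and fetched < max_limit:
        _, url = heapq.heappop(queue)          # pop the next url
        fetched += 1
        for link in extract_links(url):        # 3.a read the page, extract links
            if link in seen:                   # 3.b keep unique links only
                continue
            seen.add(link)
            # internal links get priority 1, external links priority 2
            priority = 1 if urlsplit(link).netloc == root_netloc else 2
            heapq.heappush(queue, (priority, link))
    return seen
```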
##pylint score 10/10 for crawler.py
###credits:
- blog.mischel.com
- theanti9-pycrawler
- wrttnwrd-cmcrawler
- block8437-python-spyder
- oocities
- bbrodriges-pholcidae
###Day 1:
- Used the urllib module to open any http link.
- Read the response and converted it into a utf-8 string to allow regex search.
- Stored the found links in a file (sketched below).
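The Day 1 version boils down to something like this (Python 3 shown here; the regex and the `links.txt` file name are illustrative guesses, not the exact code):

```python
import re
from urllib.request import urlopen

response = urlopen("http://rohitjangid.com")
html = response.read().decode("utf-8", errors="replace")    # bytes -> utf-8 string for regex search

links = re.findall(r'href=["\'](http[^"\']+)["\']', html)   # naive regex link extraction

with open("links.txt", "w") as out:                          # store the found links in a file
    out.write("\n".join(links))
```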
todos for tomorrow:
####fix errors:
- Only accept links of the form 'http://rohitjangid.com' and not 'rohitjangid.com'
- Store relative (non-http) links found on the website with the base url prefixed.
- Add a very basic test suite.
####things to study:
- Scrapy's architecture, and whether using XPath is required or not.
- A better way to deal with encodings.
###major update:
- Maintained a queue of remaining links to be processed.
- Used a hashtable to keep track of old links, adding links to the queue only if they are new.
- Used BeautifulSoup for HTML processing.
- Added filters checking the content-type and the http-header scheme (see the sketch after this list).
- Downgraded the Python version to 2.7.4.
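The scheme and content-type filters plus the BeautifulSoup parsing probably look something like this sketch (Python 3 syntax; the function name is made up, not taken from crawler.py):

```python
from urllib.parse import urlsplit
from urllib.request import urlopen

from bs4 import BeautifulSoup   # pip install beautifulsoup4

def fetch_html_links(url):
    # scheme filter: only follow http(s) urls
    if urlsplit(url).scheme not in ("http", "https"):
        return []
    response = urlopen(url, timeout=10)
    # content-type filter: only parse pages that are actually html
    if "text/html" not in response.headers.get("Content-Type", ""):
        return []
    soup = BeautifulSoup(response.read(), "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]
```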
###next:
- robots.txt check (a possible sketch follows this list).
- switch for internal and external links.
- explore multiple connections or parallel links to improve performance
- logging support added. [ yawning ... ]
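The robots.txt check could take this shape with the standard library's robot parser (urllib.robotparser in Python 3, the robotparser module in Python 2.7; the "Balerion" user agent is just an assumption):

```python
from urllib.parse import urljoin, urlsplit
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url, user_agent="Balerion"):
    parts = urlsplit(url)
    robots_url = urljoin("{0}://{1}".format(parts.scheme, parts.netloc), "/robots.txt")
    parser = RobotFileParser(robots_url)
    parser.read()                           # fetch and parse robots.txt
    return parser.can_fetch(user_agent, url)
```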
Trying to solve the url-uniqueness problem: it seems to be the most important issue for a crawler.
Standard url format: scheme://netloc/path;parameters?query#fragment.
Possible strategy: url normalization followed by prioritization. Urls in the same domain should be traversed earlier than external ones.
Normalization plus the priority queue enable the internal-links-only feature and minimize infinite-loop situations (a normalization sketch follows below).
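One possible normalize() for the scheme://netloc/path;parameters?query#fragment format above (just a sketch of the idea, not necessarily what crawler.py does):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # drop default ports so http://example.com:80/ and http://example.com/ collapse together
    if (scheme, parts.port) in (("http", 80), ("https", 443)):
        netloc = parts.hostname
    path = parts.path or "/"
    # drop the fragment: it never changes the document that gets fetched
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

e.g. `normalize("HTTP://Example.COM:80/index.html#top")` comes out as `http://example.com/index.html`, so both spellings hash to the same entry in the seen-links table.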
laptop out of reach. :-/ couldn't do anything.