# creepy-crawler

Ruby web crawler that takes a URL as input and produces a sitemap using a neo4j graph database - nothing creepy about it.


## Installation

#### Clone

```sh
git clone https://github.com/udryan10/creepy-crawler.git && cd creepy-crawler
```

#### Install Required Gems

```sh
bundle install
```

#### Install graph database

```sh
rake neo4j:install
```

#### Start graph database

```sh
rake neo4j:start
```

#### Requirements

1. Gems listed in the Gemfile
2. Ruby 1.9+
3. neo4j
4. Oracle JDK 7 (for the neo4j graph database)
5. lsof (for the neo4j graph database)
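A quick preflight check for the requirements above might look like the following. This is an illustrative sketch, not part of the project; it assumes a Unix-like system where `java` and `lsof` are expected on the `PATH`:

```ruby
# Illustrative sketch: sanity-check the creepy-crawler prerequisites.
require 'rubygems'

if Gem::Version.new(RUBY_VERSION) < Gem::Version.new("1.9")
  abort "Ruby 1.9+ required"
end

unless system("java", "-version", :out => File::NULL, :err => File::NULL)
  abort "java not found (needed by neo4j)"
end

unless system("which", "lsof", :out => File::NULL)
  abort "lsof not found (needed by neo4j)"
end

puts "Environment looks OK"
```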

## Usage

### Code

#### Require

```ruby
require './creepy-crawler'
```

#### Start a crawl

```ruby
Creepycrawler.crawl("http://example.com")
```

#### Limit number of pages to crawl

```ruby
Creepycrawler.crawl("http://example.com", :max_page_crawl => 500)
```

#### Extract some (potentially) useful statistics

```ruby
crawler = Creepycrawler.crawl("http://example.com", :max_page_crawl => 500)

# list of broken links
puts crawler.broken_links

# list of sites that were visited
puts crawler.visited_queue

# count of crawled pages
puts crawler.page_crawl_count
```

#### Options

```ruby
DEFAULT_OPTIONS = {
  # whether to print crawling information
  :verbose => true,
  # whether to obey robots.txt
  :obey_robots => true,
  # maximum number of pages to crawl; a value of nil will attempt to crawl all pages
  :max_page_crawl => nil,
  # whether pages should be written to the database. Likely only used for testing,
  # but may be useful if you only want the broken_links data
  :graph_to_neo4j => true
}
```
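Any of these defaults can be overridden per crawl. A minimal sketch combining the pieces above (the option keys and reader methods come from this README; the specific values are illustrative):

```ruby
require './creepy-crawler'

crawler = Creepycrawler.crawl(
  "http://example.com",
  :verbose        => false, # suppress per-page crawl output
  :max_page_crawl => 100,   # stop after 100 pages
  :graph_to_neo4j => false  # skip database writes; only collect statistics
)

puts "Crawled #{crawler.page_crawl_count} pages"
puts "Broken links: #{crawler.broken_links.inspect}"
```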

#### Examples

Examples are located in the `examples/` directory.

### Command line

```sh
# Crawl site
ruby creepy-crawler.rb --site "http://google.com"

# Get command options
ruby creepy-crawler.rb --help
```

Note: If behind a proxy, export your proxy environment variables:

```sh
export http_proxy=<proxy_host>; export https_proxy=<proxy_host>
```

### Docker

For testing, I have included the ability to run the environment and a crawl inside a Docker container.

## Output

creepy-crawler uses the neo4j graph database to store and display the sitemap.

### Web interface

neo4j has a web interface for viewing and interacting with the graph data. When running locally, visit http://localhost:7474/webadmin/ and:

1. Click the Data Browser tab
2. Enter a query to search for nodes (the following will search all nodes):

        START root=node(*) RETURN root

3. Click into a node
4. Click "switch view mode" at the top right to view a graphical map

Note: to have the map display URL names instead of node numbers, you must create a style.

### REST interface

neo4j also has a full REST API for programmatic access to the data.
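For instance, the query from the web-interface steps above can also be issued over HTTP. A minimal sketch using Ruby's standard library, assuming a neo4j 1.x/2.x server (the era whose legacy `/db/data/cypher` endpoint matches the `START ...` Cypher syntax used above) listening on localhost:7474:

```ruby
require 'net/http'
require 'json'
require 'uri'

# POST a Cypher query to neo4j's legacy REST endpoint.
uri = URI("http://localhost:7474/db/data/cypher")

request = Net::HTTP::Post.new(uri.request_uri)
request["Content-Type"] = "application/json"
request["Accept"]       = "application/json"
request.body = { :query => "START root=node(*) RETURN root" }.to_json

response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
result   = JSON.parse(response.body)

# The endpoint returns {"columns" => [...], "data" => [[...], ...]}.
puts "Returned #{result['data'].length} rows"
```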

### Example Output Map

[Output map screenshot]

## TODO

1. Convert to a gem
2. Multi-threading to increase crawl performance (see the sketch below)
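Purely as an illustration of the second item: none of this exists in the codebase yet, and true multi-threading of a single crawl would need thread-safe internals, but one way to experiment with concurrency today is to run independent crawls of several sites from a pool of worker threads:

```ruby
# Illustrative sketch only: crawl several sites concurrently.
require './creepy-crawler'
require 'thread'

sites = ["http://example.com", "http://example.org"]
queue = Queue.new
sites.each { |site| queue << site }

# Two workers, each pulling whole sites off the shared queue.
workers = Array.new(2) do
  Thread.new do
    loop do
      begin
        site = queue.pop(true) # non-blocking; raises ThreadError when empty
      rescue ThreadError
        break
      end
      Creepycrawler.crawl(site, :max_page_crawl => 50)
    end
  end
end

workers.each(&:join)
```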
