Crawler and content extractor for building a full text index of a website's contents. Uses Ferret for indexing.

jkraemer/rdig

RDig

RDig provides an HTTP crawler and content extraction utilities to help you build a site search for web sites or intranets. Internally, Ferret is used for the full text indexing. After creating a config file for your site, you can build the index with a single call to rdig.

RDig depends on Ferret (>= 0.10.0) and, for parsing HTML, on either Hpricot (>= 0.4) or the RubyfulSoup library (>= 1.0.4). As I know of no way to specify such an OR dependency in a gem specification, the gem depends on Hpricot. If this is a problem for you, install the gem with --force and manually run +gem install rubyful_soup+.

basic usage

Index creation

  • create a config file based on the template in doc/examples

  • to create an index:

    rdig -c CONFIGFILE
  • to run a query against the index (just to try it out)

    rdig -c CONFIGFILE -q 'your query'

    this will dump the first 10 search results to STDOUT

Handle search in your application:

require 'rdig'
require 'rdig_config'   # load your config file here
search_results = RDig.searcher.search(query)

See RDig::Search::Searcher for more information.

usage in rails

  • add to config/environment.rb :

    require 'rdig'
    require 'rdig_config'
    
  • place rdig_config.rb into config/ directory.

  • build index:

    rdig -c config/rdig_config.rb
  • in your controller that handles the search form:

    search_results = RDig.searcher.search(params[:query])
    @results = search_results[:list]
    @hitcount = search_results[:hitcount]
    

search result paging

Use the :first_doc and :num_docs options to implement paging through search results. (:num_docs is 10 by default, so without these options only the first 10 results are retrieved.)
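
As a rough illustration of how these two options could be derived from a page number, here is a minimal sketch. The helpers paging_options and page_count are hypothetical and not part of RDig. The sketch also assumes that :first_doc is a zero-based offset and that the options hash is passed as the second argument to search; check RDig::Search::Searcher to confirm both.

```ruby
# Hypothetical helper (not part of RDig): turns a 1-based page number into
# the :first_doc / :num_docs options described above.
# Assumption: :first_doc is a zero-based offset into the result list.
def paging_options(page, per_page = 10)
  page = [page.to_i, 1].max   # treat missing or invalid page values as page 1
  { :first_doc => (page - 1) * per_page, :num_docs => per_page }
end

# Hypothetical helper: number of pages needed to show all hits.
def page_count(hitcount, per_page = 10)
  (hitcount.to_f / per_page).ceil
end

# Assumed usage in a controller, with the options hash as second argument:
#   search_results = RDig.searcher.search(params[:query],
#                                         paging_options(params[:page]))
#   @results   = search_results[:list]
#   @hitcount  = search_results[:hitcount]
#   @pages     = page_count(@hitcount)
```

Keeping the arithmetic in a small helper means the controller only has to pass the raw page parameter along, and a missing or malformed value falls back to the first page.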

sample configuration

Taken from doc/examples/config.rb. The tag_selector properties are called with a BeautifulSoup instance as parameter. See the RubyfulSoup site for more information about this cool library. You can also have a look at the html_content_extractor unit test.

:include:doc/examples/config.rb
