WebSearch Parsers Configuration

g4jc edited this page Jul 10, 2014 · 1 revision

This is a proposal for configuring:

  • engines that get scraped by the websearch plugin
  • parsers that need a declarative configuration
  • configuration of each parser
The configuration uses the XML format. The advantage of this syntax is that a schema can be used to validate the configuration at Seeks startup. Translators from a curly-bracket syntax to XML may be provided later.

The goal of a parser is to identify a set of snippets in the scraped data, each snippet being a structure with the following facets:

  • title
  • summary
  • URL
  • URL to cached version of the page, if available
  • type, if available (forum, ...)
  • date, if available
  • language, if available
For image snippets:
  • image URL
  • image URL in engine's cache
There may be other data to scrape that applies to a whole set of snippets, such as:
  • related queries
  • related documents
  • categories
More facets may occur in certain use-cases (news, code search, ...).
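
The facets listed above can be pictured as a small data structure. The following is only a sketch in Python, not part of the proposal: the class and field names (`Snippet`, `ResultPage`, `snippet_type`, and so on) are assumptions chosen for illustration, and every "if available" facet is modelled as an optional field.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Snippet:
    """One search result; facets marked 'if available' default to None."""
    title: str
    summary: str
    url: str
    cached_url: Optional[str] = None
    snippet_type: Optional[str] = None       # e.g. "forum"
    date: Optional[str] = None
    language: Optional[str] = None
    image_url: Optional[str] = None          # image snippets only
    cached_image_url: Optional[str] = None   # image snippets only


@dataclass
class ResultPage:
    """A set of snippets plus the data that applies to the whole set."""
    snippets: List[Snippet] = field(default_factory=list)
    related_queries: List[str] = field(default_factory=list)
    related_documents: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)


# Hypothetical example values.
page = ResultPage(
    snippets=[Snippet(title="Seeks", summary="A websearch proxy",
                      url="http://example.org/")],
    related_queries=["seeks project"],
)
```

Use-case specific facets (news, code search, ...) could be added as further optional fields without changing the rest.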

Search Engine Configuration

A search engine is defined by:

  • A URL template (should conform to OpenSearch URL patterns)
  • If the HTTP protocol is used, the HTTP method used to query (POST | GET)
  • A type of websearch: page, image, video, tweet
  • A parser, as defined further in this document
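
To make the four items above concrete, here is a minimal sketch of what an engine definition might look like and how it could be read. The proposal does not fix a schema, so the element and attribute names (`engine`, `url`, `template`, `parser`), the example host, and the helper functions are all assumptions. Only the `{searchTerms}` and `{startIndex}` placeholders come from the OpenSearch URL template syntax.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote_plus

# Hypothetical element and attribute names; the proposal does not define them.
ENGINE_XML = """
<engine name="example" type="page" parser="example-xpath">
  <url method="GET"
       template="http://search.example.org/?q={searchTerms}&amp;start={startIndex}"/>
</engine>
"""


def load_engine(xml_text):
    """Read one engine definition into a plain dict."""
    root = ET.fromstring(xml_text)
    url = root.find("url")
    return {
        "name": root.get("name"),
        "type": root.get("type"),          # page, image, video, tweet
        "parser": root.get("parser"),      # name of a parser defined elsewhere
        "method": url.get("method", "GET"),
        "template": url.get("template"),
    }


def build_url(template, search_terms, start_index=1):
    """Substitute the OpenSearch-style placeholders used in the template."""
    return (template
            .replace("{searchTerms}", quote_plus(search_terms))
            .replace("{startIndex}", str(start_index)))


engine = load_engine(ENGINE_XML)
query_url = build_url(engine["template"], "free software")
```

Referring to the parser by name keeps the engine definition short and lets several engines share one parser.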

Parsers configuration

  • Parser name, which should be unique
  • Parser type: sax, xpath, css selector, json, ...
  • Parameters, which depend on the parser type

sax parsers

  • class, the class implementing the sax handlers

xpath parsers

  • namespace declarations
  • snippet: xpath returning the set of snippets contained in the page; further xpath expressions shall be expressed relative to the nodes returned by this expression
  • url: xpath returning a string containing the url facet of the snippet
  • cache: xpath returning the cached version of the resource
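
The relationship between the snippet expression and the relative facet expressions can be sketched as follows. This is an illustration under stated assumptions, not the Seeks implementation: the configuration keys mirror the list above, but the sample page, the class names, and the helpers are made up. Python's standard `xml.etree.ElementTree` supports only a subset of XPath and cannot return string values, so each facet is written here as a (relative path, attribute) pair; a full XPath engine would allow a single expression per facet.

```python
import xml.etree.ElementTree as ET

# Hypothetical parser configuration: one entry per facet.
PARSER_CONFIG = {
    "snippet": ".//li[@class='result']",       # selects the snippet nodes
    "url": ("a[@class='link']", "href"),       # relative to a snippet node
    "cache": ("a[@class='cached']", "href"),   # relative to a snippet node
}

# Made-up, well-formed result page used only for this sketch.
PAGE = """
<html><body><ul>
  <li class="result"><a class="link" href="http://example.org/a">A</a>
      <a class="cached" href="http://cache.example.org/a">Cached</a></li>
  <li class="result"><a class="link" href="http://example.org/b">B</a></li>
</ul></body></html>
"""


def facet(node, spec):
    """Evaluate a facet path relative to a snippet node; None if absent."""
    path, attribute = spec
    match = node.find(path)
    return match.get(attribute) if match is not None else None


def parse(page_text, config):
    """Return one dict per snippet node found by the snippet expression."""
    root = ET.fromstring(page_text)
    return [
        {"url": facet(node, config["url"]),
         "cache": facet(node, config["cache"])}
        for node in root.findall(config["snippet"])
    ]


results = parse(PAGE, PARSER_CONFIG)
```

The second snippet has no cached link, so its cache facet comes back as `None`, matching the "if available" wording used for optional facets.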

For web