WebSearch Parsers Configuration

This is a proposal for configuring:

engines that get scraped by the websearch plugin
parsers that need a declarative configuration
configuration of each parser

The configuration uses the XML format. The advantage of this syntax is that a schema can be used to validate the configuration at Seeks startup. Translators from a curly-bracket syntax to XML may be provided at a later stage.

The goal of a parser is to identify a set of snippets in the scraped data, each snippet being a structure with the following facets:

title
summary
URL
URL to cached version of the page, if available
type, if available (forum, ...)
date, if available
language, if available

For image snippets:

image URL
image URL in engine's cache

There may be other data to scrape that applies to a whole set of snippets, such as:

related queries
related documents
categories

More facets may occur in certain use-cases (news, code search, ...).
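
The facets listed above could be modelled roughly as follows. This is only a sketch in Python for illustration: the class and field names (`Snippet`, `ImageSnippet`, `ResultPage`, `cached_url`, ...) are assumptions and are not part of the proposal.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Snippet:
    """One search result; facets marked "if available" default to None."""
    title: str
    summary: str
    url: str
    cached_url: Optional[str] = None
    type: Optional[str] = None        # e.g. "forum"
    date: Optional[str] = None
    language: Optional[str] = None


@dataclass
class ImageSnippet(Snippet):
    """Extra facets carried by image results (hypothetical names)."""
    image_url: Optional[str] = None
    cached_image_url: Optional[str] = None


@dataclass
class ResultPage:
    """Data that applies to a whole set of snippets."""
    snippets: List[Snippet] = field(default_factory=list)
    related_queries: List[str] = field(default_factory=list)
    related_documents: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)
```

Use-case specific facets (news, code search, ...) would then be added as further subclasses or optional fields.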

Table of Contents

Search Engine Configuration
Parsers configuration
sax parsers
xpath parsers

Search Engine Configuration

A search engine is defined by:

A URL template (should conform to OpenSearch URL patterns)
If the HTTP protocol is used, the HTTP method used to query (POST | GET)
A type of websearch: page, image, video, tweet
A parser, as defined later in this document
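
As an illustration of the four items above, here is a minimal Python sketch that reads one engine definition and expands its URL template. The XML element and attribute names (`engine`, `url`, `method`, `parser`) and the function names are assumptions made for this example. The proposal does not fix a schema yet. Only the `{searchTerms}` and `{startPage?}` placeholders come from the OpenSearch URL template syntax.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import quote_plus

# Hypothetical configuration fragment; names are assumptions, not a fixed schema.
ENGINE_XML = """
<engine name="example" type="page" method="GET" parser="example-xpath">
  <url>http://search.example.org/?q={searchTerms}&amp;p={startPage?}</url>
</engine>
"""

VALID_TYPES = {"page", "image", "video", "tweet"}
VALID_METHODS = {"GET", "POST"}


def load_engine(xml_text):
    """Parse one <engine> element and check the fields listed above."""
    node = ET.fromstring(xml_text)
    engine = {
        "name": node.get("name"),
        "type": node.get("type"),
        "method": node.get("method", "GET"),
        "parser": node.get("parser"),
        "url_template": (node.findtext("url") or "").strip(),
    }
    if engine["type"] not in VALID_TYPES:
        raise ValueError("unknown websearch type: %r" % engine["type"])
    if engine["method"] not in VALID_METHODS:
        raise ValueError("unknown HTTP method: %r" % engine["method"])
    if not engine["url_template"] or not engine["parser"]:
        raise ValueError("engine needs a URL template and a parser")
    return engine


def expand_template(template, **params):
    """Fill {name} placeholders; optional {name?} ones may be left empty."""
    def replace(match):
        name, optional = match.group(1), match.group(2)
        if name in params:
            return quote_plus(str(params[name]))
        if optional:
            return ""
        raise KeyError(name)
    return re.sub(r"\{(\w+)(\?)?\}", replace, template)


engine = load_engine(ENGINE_XML)
query_url = expand_template(engine["url_template"], searchTerms="seeks project")
```

A real implementation would rely on schema validation at startup rather than the hand-written checks used here.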

Parsers configuration

Parser name, which should be unique
Parser type: sax, xpath, css selector, json, ...
Parameters, which depend on the parser type
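
One possible way to load such parser declarations and enforce name uniqueness is sketched below in Python. The `parsers`, `parser` and `param` elements are hypothetical, chosen only to show the name, type and parameters triple.

```python
import xml.etree.ElementTree as ET

# Hypothetical declarations; element and attribute names are assumptions.
PARSERS_XML = """
<parsers>
  <parser name="example-sax" type="sax">
    <param name="class">ExampleSaxHandler</param>
  </parser>
  <parser name="example-xpath" type="xpath">
    <param name="snippet">.//li</param>
  </parser>
</parsers>
"""


def load_parsers(xml_text):
    """Return {parser name: {"type": ..., "params": {...}}}, rejecting duplicates."""
    registry = {}
    for node in ET.fromstring(xml_text).findall("parser"):
        name = node.get("name")
        if name in registry:
            raise ValueError("duplicate parser name: %r" % name)
        registry[name] = {
            "type": node.get("type"),
            # Parameters are kept as free-form strings; their meaning
            # depends on the parser type.
            "params": {p.get("name"): (p.text or "").strip()
                       for p in node.findall("param")},
        }
    return registry


registry = load_parsers(PARSERS_XML)
```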

sax parsers

class, the class implementing the sax handlers
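
To show what such a handler class might look like, here is a small sketch using Python's `xml.sax` module. It is an illustration of the SAX callback style only, not the Seeks implementation. The class name `ExampleSaxHandler` and the assumption that results are `<a class="result">` links are invented for the example.

```python
import xml.sax


class ExampleSaxHandler(xml.sax.ContentHandler):
    """Collects the url and title facets from <a class="result"> elements."""

    def __init__(self):
        super().__init__()
        self.snippets = []
        self._current = None  # snippet being built, None outside a result link

    def startElement(self, name, attrs):
        if name == "a" and attrs.get("class") == "result":
            self._current = {"url": attrs.get("href"), "title": ""}

    def characters(self, content):
        # Text may arrive in several chunks, so it is accumulated.
        if self._current is not None:
            self._current["title"] += content

    def endElement(self, name):
        if name == "a" and self._current is not None:
            self.snippets.append(self._current)
            self._current = None


# Well-formed sample page; real engine output may need an HTML-tolerant parser.
PAGE = b"""<html><body>
<a class="result" href="http://example.org/1">First result</a>
<a href="http://example.org/ads">Not a result</a>
<a class="result" href="http://example.org/2">Second result</a>
</body></html>"""

handler = ExampleSaxHandler()
xml.sax.parseString(PAGE, handler)
```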

xpath parsers

namespace declarations
snippet: xpath returning the set of snippets contained in the page; further xpath expressions shall be expressed relative to the nodes returned by this expression
url: xpath returning a string containing the url facet of the snippet
cache: xpath returning the cached version of the resource
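
The relative evaluation described above can be sketched in Python as follows. The expressions in `CONFIG` and the sample page are invented for illustration. Note that the standard library's `xml.etree.ElementTree` supports only a subset of XPath and cannot return attribute values directly, so the hypothetical helper `eval_string` handles a trailing `/@name` step by hand. A full XPath engine would not need this workaround.

```python
import xml.etree.ElementTree as ET

# Hypothetical parser parameters; the expressions are illustrative only.
CONFIG = {
    "snippet": ".//li[@class='result']",
    "url": "a/@href",
    "cache": "span[@class='cache']/a/@href",
}

PAGE = """<html><body><ul>
<li class="result"><a href="http://example.org/1">One</a>
  <span class="cache"><a href="http://cache.example.org/1">Cached</a></span></li>
<li class="result"><a href="http://example.org/2">Two</a></li>
</ul></body></html>"""


def eval_string(node, expr):
    """Evaluate expr relative to node; a trailing /@name selects an attribute."""
    path, sep, attr = expr.rpartition("/@")
    target = node.find(path) if sep else node.find(expr)
    if target is None:
        return None  # facet not available for this snippet
    return target.get(attr) if sep else (target.text or "")


def parse(page, config):
    """Apply the snippet expression, then the facet expressions on each node."""
    root = ET.fromstring(page)
    results = []
    for node in root.findall(config["snippet"]):
        results.append({
            "url": eval_string(node, config["url"]),
            "cache": eval_string(node, config["cache"]),
        })
    return results


results = parse(PAGE, CONFIG)
```

The second result has no cache link, so its `cache` facet is `None`. Namespace declarations could be passed through the `namespaces` argument that `find` and `findall` accept.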
