WebSearch Parsers Configuration
g4jc edited this page Jul 10, 2014
This is a proposal for configuring:
- engines that get scraped by the websearch plugin
- parsers that need a declarative configuration
- configuration of each parser
The goal of a parser is to identify a set of snippets in the scraped data, each snippet being a structure with the following facets:
- title
- summary
- URL
- URL to cached version of the page, if available
- type, if available (forum, ...)
- date, if available
- language, if available
- image URL
- image URL in engine's cache
- related queries
- related documents
- categories
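The facet list above could be modelled as a simple record. The following is a minimal sketch, not the plugin's actual data model: the class name `Snippet` and all field names are assumptions chosen to mirror the facets, and treating title, summary and URL as required is also an assumption (the proposal only marks some facets as "if available").

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Snippet:
    """One search result. Field names are illustrative, not taken from the plugin."""
    title: str
    summary: str
    url: str
    cached_url: Optional[str] = None        # URL to the cached version of the page
    type: Optional[str] = None              # e.g. "forum"
    date: Optional[str] = None
    language: Optional[str] = None
    image_url: Optional[str] = None
    cached_image_url: Optional[str] = None  # image URL in the engine's cache
    related_queries: List[str] = field(default_factory=list)
    related_documents: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)


# A snippet carrying only the facets assumed to be mandatory.
example = Snippet(title="Example", summary="A sample result", url="http://example.org/")
```

Optional facets default to `None` or an empty list, so a parser that cannot extract them can simply leave them out.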
A search engine is defined by:
- A URL template (should conform to OpenSearch URL patterns)
- If the HTTP protocol is used, the HTTP method used to query (POST | GET)
- A type of websearch: page, image, video, tweet
- A parser, as defined further in this document
- Parser name, should be unique
- Parser type: sax, xpath, css selector, json, ...
- Parameters, which depend on the parser type
- class, the class implementing the sax handlers
- namespace declarations
- snippet: xpath returning the set of snippets contained in the page; further xpath expressions shall be expressed relative to the nodes returned by this expression
- url: xpath returning a string containing the url facet of the snippet
- cache: xpath returning the cached version of the resource
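To make the engine definition and the xpath parameters more concrete, here is a minimal sketch of how such a configuration might look and be applied. Everything in it is an assumption for illustration: the dictionary keys, the `example.org` URLs and namespace, the helper names `build_url` and `parse`, and the sample response. It uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath, so the `url` and `cache` expressions here select child elements and read their text; string-valued expressions such as attribute selection would need a full XPath engine.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Hypothetical engine definition; key names are assumptions, not a fixed schema.
ENGINE = {
    "url_template": "http://search.example.org/?q={searchTerms}",  # OpenSearch-style
    "method": "GET",
    "type": "page",
    "parser": {
        "name": "example-xpath",
        "type": "xpath",
        "parameters": {
            "namespaces": {"ex": "http://example.org/ns"},
            "snippet": ".//ex:result",  # selects the snippet nodes
            "url": "ex:link",           # evaluated relative to each snippet node
            "cache": "ex:cached",       # evaluated relative to each snippet node
        },
    },
}


def build_url(engine, terms):
    """Fill the {searchTerms} placeholder of the URL template."""
    return engine["url_template"].replace("{searchTerms}", quote(terms))


def parse(engine, document):
    """Apply the xpath parameters to an XML document and return one dict per snippet."""
    params = engine["parser"]["parameters"]
    ns = params["namespaces"]
    root = ET.fromstring(document)
    results = []
    for node in root.findall(params["snippet"], ns):
        results.append({
            "url": node.findtext(params["url"], default=None, namespaces=ns),
            "cache": node.findtext(params["cache"], default=None, namespaces=ns),
        })
    return results


# Made-up response document used only to exercise the sketch.
SAMPLE = """<results xmlns="http://example.org/ns">
  <result><link>http://a.example/</link><cached>http://cache.example/a</cached></result>
  <result><link>http://b.example/</link></result>
</results>"""

query_url = build_url(ENGINE, "free software")
snippets = parse(ENGINE, SAMPLE)
```

The second sample result has no cached copy, so its `cache` value is `None`, matching the "if available" wording used for optional facets.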
For web