Scraper Formats

Albert Schimpf edited this page Feb 1, 2021 · 1 revision

Scraper Specification Formats

There are currently three core parsers implemented for Scraper specifications.

Json and Yaml

The jf/json parser and the yf/yaml/yml parser accept the same specification structure; only the syntax differs. A json specification can be converted to its yml representation, and vice versa, with external tools without problems.

For simplicity, we only cover the yml format.

A full yml specification:

name: String
entry: String # default: "start"
globalNodeConfigurations: # default: {}
  String: String          # key: Static String or Regex enclosed in '/ /' (e.g. "/ech.*/") 
                          # value: value to be used in matching nodes
imports:        # default: {}
  String:       # key: full path to imported taskflow (e.g. 'child.yf'). Should be the same format.
                # value: not used
graphs:         # mandatory map of String -> List of NodeSpec
  String: [NodeSpec]

Where a NodeSpec is a key-value map of an implemented node (see Node Documentation).
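As a rough illustration, a filled-in specification might look like the sketch below. The node keys (`type`, `log`) and the `Echo` node are assumptions made for this example and are not confirmed by this page; consult the Node Documentation for the actual keys. Whether the parser expects an empty string or an empty value for an import is likewise an assumption.

```yaml
# Hypothetical example: node keys and node type are illustrative only.
name: hello
entry: start            # optional, "start" is the default
imports:
  child.yf: ""          # value is not used; only the path (key) matters
graphs:
  start:                # graph name -> list of NodeSpec
    - type: Echo        # assumed node type, see Node Documentation
      log: "hello world"
```

The `globalNodeConfigurations` key is left out here, so it falls back to its default of `{}`.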

Simple Yaml

For one-off taskflows, a simpler specification parser is provided, which assumes that no imports are needed. The format is as follows:

String: [NodeSpec]

Where a NodeSpec is a key-value map of an implemented node (see Node Documentation). In other words, the whole file is a map whose keys are graph names and whose values are lists of nodes. The name is, by convention, the filename without the file ending.
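A minimal sketch of such a file, assuming the same hypothetical `Echo` node with `type` and `log` keys (neither is confirmed by this page), and assuming the default entry graph `start` also applies to this format:

```yaml
# Hypothetical example: a single graph named "start" with one node.
start:
  - type: Echo          # assumed node type, see Node Documentation
    log: "hello world"
```

If this were saved as a file named `hello` plus the appropriate file ending, the specification name would be `hello` by the convention described above.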

Custom Parser

A custom parser should implement the ScraperSpecificationParser interface. It should declare the file endings it accepts and transform every valid specification in the custom format into a valid internal ScrapeSpecifications. See the three core parsers for reference.