
Glossary

Hika van den Hoven edited this page May 17, 2017 · 27 revisions

DataTreeGrab

Glossary

accept-header
autoclose-tags
caller_id
current_date
current_ordinal
child_index
data_def
data-format
DATAnode
DATAtree
date-range-splitter
date-sequence
date-splitter
datetimestring
default-item-count
empty-values
enclose-with-html-tag
encoding
init_def
item-range-splitter
key_def
key-node
link_def
link-value
month-names
name-value
node_def
NULLnode
path_def
.print_searchtree
relative-weekdays
root-node
severity
.show_result
start_node
str-list-splitter
text_replace
time-splitter
time-type
timezone
unquote_html
URL_def
url
url-data
url-date-format
url-date-multiplier
url-date-type
url-header
url-type
url-weekdays
value_def
value-filters
warngoal
weekdays

accept-header

An optional string value in the root of the data_def part of the data_def language URL extension. It can hold the accept string to use with a URL and is returned by the DataTreeGrab.get_url() function.

autoclose-tags

An optional list value in the root of the data_def used by DataTreeShell when initializing an HTMLtree. This is a list of HTML tags that might neither be self-closing nor have a closing tag.

caller_id

The numerical ID of an HTMLtree, JSONtree or DataTreeShell as given on creation and used by the warning framework. Any filter created with caller_id 0 is generic.
If you choose to use a queue object as your warngoal, the caller_id is returned together with the message.

child_index

Within a DATAtree every DATAnode can have zero or more child-nodes. They are indexed from zero up. This index is called the child_index and can be used to select a child-node. In a JSON dict this value has no real meaning, because the order is randomized in the intermediate Python dict stage. In the future we will add an adapted JSON-decoder that bypasses this stage and decodes straight into a DATAtree.

current_date

This is determined on initialization of any HTMLtree, JSONtree or DataTreeShell object but can also be set using their set_current_date() function. You can pass it any datetime.datetime, datetime.date or ordinal value to set it to a date other than today. In addition to the datetime.date object, the ordinal is stored in current_ordinal.
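
The relation between a date and its ordinal can be illustrated with the standard library alone. The helper `to_date_and_ordinal` below is hypothetical and not part of DataTreeGrab; it is only a sketch of how the three accepted value types might be normalized.

```python
import datetime

def to_date_and_ordinal(value=None):
    """Hypothetical helper: normalize a value to a (date, ordinal) pair."""
    if value is None:
        date = datetime.date.today()
    elif isinstance(value, datetime.datetime):
        # datetime is a subclass of date, so it has to be tested first
        date = value.date()
    elif isinstance(value, datetime.date):
        date = value
    elif isinstance(value, int):
        date = datetime.date.fromordinal(value)
    else:
        raise TypeError("unsupported date value: %r" % (value,))
    return date, date.toordinal()

current_date, current_ordinal = to_date_and_ordinal(
    datetime.datetime(2017, 5, 17, 12, 0))
```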

current_ordinal

See current_date

data_def

A JSON or JSON-oriented data file describing an HTML or JSON site and how to extract the desired data from it through a DATAtree. It consists of several path_defs and node_defs defining how and where to locate the data. It can also contain several URL_defs to define the URL, and link_defs for post-processing the found data.

data-format

An optional string value in the root of the data_def part of the data_def language URL extension holding a web data format string like "text/html" or "application/json".

DATAnode

A class object containing the associated data from a point in a DATAtree class object. It is aware of its parents, children and siblings. It represents either an HTML-tag with associated attributes and text values, a JSON-dict, a JSON-list or a JSON-value. See DATAnode, HTMLnode, JSONnode and NULLnode for a class description.

DATAtree

A class object containing either an HTML or a JSON datapage in the form of a tree of DATAnodes, where every node is aware of its surrounding nodes, making it possible to walk the tree in search of the desired data. This allows for a unified way to extract data without any need for the receiving application to be aware of where or how the data is extracted. Any change in the source data only requires adapting the data_def, whose latest version could if needed be retrieved from, for instance, a web location. See DATAtree, HTMLtree and JSONtree for a class description.

date-range-splitter

An optional string value in the root of the data_def part of the data_def language URL extension. It is used by function 14 to hold the character to separate the two dates in a date range. It defaults to "~".

date-sequence

An optional list of characters in the root of the data_def extracted into the date_sequence parameter of the DATAtree. It represents the order of the items after splitting a date value on the date-splitter character. Only "m", "d" and "y" are used. Any other value means that this part of the date value, for instance a weekday, is ignored. It defaults to ["y","m","d"]. It can also be added to a node_def to overrule the default ad hoc.

date-splitter

An optional string value in the root of the data_def extracted into the date_splitter parameter of the DATAtree. It represents the character to use for splitting a date value into its components. It defaults to "-". It can also be added to a node_def to overrule the default ad hoc.
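
How date-splitter and date-sequence work together can be sketched as follows. The function `split_date` is a hypothetical illustration and not DataTreeGrab's own code; it assumes purely numerical date parts.

```python
import datetime

def split_date(text, splitter="-", sequence=("y", "m", "d")):
    """Hypothetical helper: split a date value and order its parts."""
    fields = {}
    for code, part in zip(sequence, text.split(splitter)):
        if code in ("y", "m", "d"):
            fields[code] = int(part)
        # any other code (for instance a weekday) is ignored
    return datetime.date(fields["y"], fields["m"], fields["d"])

# With the documented defaults
default_date = split_date("2017-05-17")
# With an ad-hoc override that skips a leading weekday
custom_date = split_date("Wed/17/05/2017", splitter="/",
                         sequence=["w", "d", "m", "y"])
```

Both calls produce the same datetime.date for 17 May 2017.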

datetimestring

An optional string value in the root of the data_def extracted into the datetimestring parameter of the DATAtree. It can be used to recognize a datetime value using the datetime.datetime.strptime() function. It defaults to u"%Y-%m-%d %H:%M:%S". It can also be added to a node_def to overrule the default ad hoc.
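
As a brief illustration, the documented default format string parses a value like this with the standard library (the sample value is made up):

```python
import datetime

DATETIMESTRING = "%Y-%m-%d %H:%M:%S"  # the documented default

# A made-up sample value in the default format
value = datetime.datetime.strptime("2017-05-17 20:30:00", DATETIMESTRING)
```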

default-item-count

An optional integer value in the root of the data_def part of the data_def language URL extension. It is used by function 4 to hold the default length of a range. It defaults to 0.

empty-values

An optional list value in the root of the data_def used by the Link extension to the data_def language.

enclose-with-html-tag

An optional boolean in the root of the data_def used by DataTreeShell. If set to True, an HTML page will be enclosed in an <html> and a </html> tag before it is passed to HTMLtree. This is needed if the page does not consist of a single tree.

encoding

An optional string value in the root of the data_def part of the data_def language URL extension holding the character encoding (for instance "utf-8") used by the web page returned by the URL.

init_def

A path_def within a data_def at ["data"]["init-path"] defining how to locate the start_node to start any search.

item-range-splitter

An optional string value in the root of the data_def part of the data_def language URL extension. It is used by function 4 to hold the character to separate the two values in a range. It defaults to "-".

key_def

A path_def within a data_def at ["data"]["key-path"] or ["data"]["iter"]["key-path"] defining the path relative to the start_node to find a set of key-nodes, from which further data is extracted through the associated value_defs.

key-node

A node within a DATAtree found through a key_def. It is the anchor point to a data-set as defined by a set of value_defs.

link_def

A dict in ["values"] within the data_def following the Link extension to the data_def language defining a data value.

link-value

A value extracted and saved during a DATAtree search using the "link" keyword to be used further down the search to identify the next DATAnode.

month-names

An optional list of 13 string values in the root of the data_def extracted into the month_names parameter of the DATAtree. It represents the names of the months, with the first being a dummy value as month 0 does not exist. It is used to recognize non-numerical month values in a date value and has no default.
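
The role of the dummy first entry can be shown with a small sketch. The list contents and the `month_number` helper are illustrative assumptions, not DataTreeGrab code; the point is that the list index equals the month number.

```python
# 13 entries: index 0 is a dummy so that index == month number
MONTH_NAMES = ["", "january", "february", "march", "april", "may", "june",
               "july", "august", "september", "october", "november",
               "december"]

def month_number(text, month_names=MONTH_NAMES):
    """Hypothetical helper: translate a month value to its number."""
    text = text.strip().lower()
    if text.isdigit():
        return int(text)
    # start searching at index 1 to skip the dummy entry;
    # raises ValueError for an unknown name
    return month_names.index(text, 1)
```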

name-value

A value extracted and saved during a DATAtree search using the "name" keyword. The final value will then be returned in a dict with this value as the key.

node_def

A dict mostly as part of a path_def within a data_def defining zero or more nodes relative to the current node within a DATAtree. It also can contain statements to define data to retrieve either as an end value in a key_def or a value_def or to store for later reference as a link-value.

NULLnode

A class object used as a special return value on a DATAnode search request. A return value of None can mean that the node was found but returned None on the final value request, whereas NULLnode means there is no node to request a value from or to reference for further searches. It only contains a value attribute with a value of None. It is returned when looking for further data values in reference to a key-node invalidates that key-node. At present this is only possible through a "member-off" statement.

path_def

A set of node_defs within a data_def defining the path to a node. The last node_def normally will also contain details on what data to extract from that node. It exists in three flavours:

.print_searchtree

A boolean debug attribute of DATAtree, HTMLtree, JSONtree and DataTreeShell. If you set it to True the DATAtree will be printed to the file object set through the output startup attribute. This defaults to sys.stdout.

relative-weekdays

An optional dict in the root of the data_def. For instance: "relative-weekdays":{"yesterday":-1,"today":0,"tomorrow":1}. The numbers are replaced with ordinals pointing to the right date and the values in weekdays are added pointing to today up to a week in the future. The resulting dict is placed in the relative_weekdays parameter of the DATAtree. It can be used to translate a day to a date value. It has no default.
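
The construction described above might look roughly like the sketch below. The function `build_relative_weekdays` is hypothetical, and it assumes that "today up to a week in the future" means the seven days starting today, so that every weekday name maps to its next occurrence.

```python
import datetime

def build_relative_weekdays(relative, weekdays, current_date):
    """Hypothetical sketch: map day names to date ordinals."""
    today = current_date.toordinal()
    # replace the relative offsets with ordinals
    result = {name: today + offset for name, offset in relative.items()}
    # weekdays starts with Monday, matching date.weekday() (Monday == 0)
    for offset in range(7):
        ordinal = today + offset
        name = weekdays[datetime.date.fromordinal(ordinal).weekday()]
        result[name] = ordinal
    return result

relative_weekdays = build_relative_weekdays(
    {"yesterday": -1, "today": 0, "tomorrow": 1},
    ["mon", "tue", "wed", "thu", "fri", "sat", "sun"],
    datetime.date(2017, 5, 17))  # a Wednesday
```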

root-node

The node at the root of a DATAtree.

severity

All warnings issued have a severity ID:

  • 1 meaning probably fatal. These are mostly dtDataWarnings, but also a missing or invalid part of your data_def will be seen as severity 1
  • 2 meaning the warning probably was caused by an error in your data_def
  • 4 meaning something occurred, but it very probably was not serious and caused by variations in your data page.

If you choose to use a queue object as your warngoal, the severity is returned together with the message.

.show_result

A boolean debug attribute of DATAtree, HTMLtree, JSONtree and DataTreeShell. If you set it to True every step in the search process will be printed to the file object set through the output startup attribute. This defaults to sys.stdout.

start_node

The node from which all searches start, as defined by the init_def in a data_def.

str-list-splitter

An optional dict in the root of the data_def extracted into the datetimestring parameter of the DATAtree. It can be used to recognize a datetime value using the datetime.datetime.strptime() function. It defaults to u"%Y-%m-%d %H:%M:%S". It also can be added to a node_def to ad-hoc overrule the default.

text_replace

An optional keyword in the root of an HTML data_def used by DataTreeShell. It is a list of regex pairs to use in the re.sub() command to identify and replace potential errors or problems in the HTML page prior to importing it into the HTMLtree. For instance, on a certain page the layout word "theme-text" is sometimes added after a class statement, making it difficult to use that class for parsing the tree. ["class=\"([-a-z]*?) theme-text\"", "class=\"\\g<1>\""] will remove that word.
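
The effect of the pair above can be reproduced with re.sub() directly. The `apply_text_replace` helper and the sample HTML are illustrative assumptions; the pattern and replacement are the ones from the example, written as they appear after JSON decoding.

```python
import re

# The regex pair from the example, after JSON decoding
text_replace = [
    [r'class="([-a-z]*?) theme-text"', r'class="\g<1>"'],
]

def apply_text_replace(page, pairs):
    """Hypothetical helper: apply every (pattern, replacement) pair."""
    for pattern, replacement in pairs:
        page = re.sub(pattern, replacement, page)
    return page

html = '<div class="title theme-text">News</div>'
cleaned = apply_text_replace(html, text_replace)
# cleaned == '<div class="title">News</div>'
```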

time-splitter

An optional string value in the root of the data_def extracted into the time_splitter parameter of the DATAtree. It represents the character to use for splitting a time value into its components. It defaults to ":". It can also be added to a node_def to overrule the default ad hoc.

time-type

An optional list in the root of the data_def extracted into the time_type parameter of the DATAtree. The first item should be either 24 or 12, indicating a 24-hour or a 12-hour time system. If 12 is specified, it can be followed by the string values indicating "am" and "pm", which default to those two strings. The list defaults to [24]. It can also be added to a node_def to overrule the default ad hoc.
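
A minimal sketch of how time-splitter and time-type could be combined is given below. The `parse_time` function is hypothetical and only illustrates the 12-hour conversion; DataTreeGrab's own handling may differ.

```python
import datetime

def parse_time(text, time_splitter=":", time_type=(24,)):
    """Hypothetical helper: parse a time value into a datetime.time."""
    text = text.strip().lower()
    ampm = None
    if time_type[0] == 12:
        # optional custom markers follow the 12, defaulting to "am"/"pm"
        am = (time_type[1] if len(time_type) > 1 else "am").lower()
        pm = (time_type[2] if len(time_type) > 2 else "pm").lower()
        if text.endswith(pm):
            ampm, text = "pm", text[:-len(pm)]
        elif text.endswith(am):
            ampm, text = "am", text[:-len(am)]
    parts = [int(p) for p in text.strip().split(time_splitter)]
    hour = parts[0]
    minute = parts[1] if len(parts) > 1 else 0
    second = parts[2] if len(parts) > 2 else 0
    if ampm == "pm" and hour < 12:
        hour += 12
    elif ampm == "am" and hour == 12:
        hour = 0
    return datetime.time(hour, minute, second)
```

For example, `parse_time("8:30 pm", time_type=[12])` and `parse_time("20:30")` both yield 20:30.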

timezone

An optional string in the root of the data_def like "Europe/Amsterdam" indicating a timezone. The corresponding datetime.tzinfo object is extracted into the timezone parameter of the DATAtree. It defaults to "UTC".
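
The documented defaults of the date and time keys in this glossary can be gathered in one place, as in the sketch below. The `ROOT_DEFAULTS` dict and the `root_options` helper are hypothetical; they only illustrate how keys in the root of a data_def overrule the defaults and are not how DataTreeGrab stores them.

```python
import json

# Defaults as documented in this glossary
ROOT_DEFAULTS = {
    "date-sequence": ["y", "m", "d"],
    "date-splitter": "-",
    "datetimestring": "%Y-%m-%d %H:%M:%S",
    "time-splitter": ":",
    "time-type": [24],
    "timezone": "UTC",
}

def root_options(data_def):
    """Hypothetical helper: merge root keys over the defaults."""
    options = dict(ROOT_DEFAULTS)
    for key in ROOT_DEFAULTS:
        if key in data_def:
            options[key] = data_def[key]
    return options

# A made-up data_def fragment overriding two of the keys
data_def = json.loads('{"date-splitter": "/", "timezone": "Europe/Amsterdam"}')
options = root_options(data_def)
```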

unquote_html

An optional keyword in the root of an HTML data_def used by DataTreeShell. It is a list of regexes to identify (attribute) text locations containing illegal ", < and > characters, which are replaced with the correct codes &quot;, &lt; and &gt;. For instance: "<h6 class=\"title[-\\s\\w]*?\">\\n(.*?)\\n(?:span class=\"broadcaster\">.*?</span>)?.*?/h6>" will look for an "h6" tag with attribute class="title" (with an optional extra layout parameter and an optional contained "span" tag) and parses the contained text. Only the groups (sections enclosed in parentheses) are parsed. If you need a group that, like in the above example, should not be parsed, start it with ?:.

url

An optional list or string value in the root of the data_def part of the data_def language URL extension. When it is a list, each item can be either a literal string value or a URL_def whose return value is concatenated with the strings to form a complete URL string.
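
The concatenation can be sketched as follows. Both `build_url` and the `resolve_url_def` callback are hypothetical: how a URL_def is actually evaluated is not covered here, so the sketch leaves that to the caller, and the dict used as a URL_def is a placeholder rather than real URL_def syntax.

```python
def build_url(url, resolve_url_def):
    """Hypothetical helper: join literal strings and resolved URL_defs."""
    if isinstance(url, str):
        return url
    parts = []
    for item in url:
        if isinstance(item, str):
            parts.append(item)
        else:
            # anything that is not a string is treated as a URL_def
            parts.append(str(resolve_url_def(item)))
    return "".join(parts)

full_url = build_url(
    ["http://example.com/guide/", {"placeholder": "date"}, ".json"],
    lambda url_def: "2017-05-17")  # stand-in for real URL_def evaluation
```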

url-data

An optional dict in the root of the data_def part of the data_def language URL extension. It holds the post-data items to return when calling the DataTreeGrab.get_url() function. Every dict key can hold either a literal string value or a URL_def.

url-date-format

An optional string value in the root of the data_def part of the data_def language URL extension. It is used by functions 11 and 14 to hold a format string used when url-date-type is set to 0. It defaults to None, in which case the relative offset to today is returned.

url-date-multiplier

An optional integer value in the root of the data_def part of the data_def language URL extension. It is used by functions 11 and 14 to hold a multiplication factor for the timestamp returned when url-date-type is set to 1. It defaults to 1.

url-date-type

An optional integer value in the root of the data_def part of the data_def language URL extension. It is used by functions 11 and 14 to define the date format to be used. It defaults to 0.

URL_def

A dict following the URL extension to the data_def language defining a data value. This is a simple implementation of the Link extension to the data_def language and is used to define part of a URL, a URL header or URL post-data.

url-header

An optional dict in the root of the data_def part of the data_def language URL extension. It holds the header items to return when calling the DataTreeGrab.get_url() function. Every dict key can hold either a literal string value or a URL_def.

url-type

An integer value in the root of the data_def part of the data_def language URL extension. Not yet used.

url-weekdays

An optional list in the root of the data_def part of the data_def language URL extension. It is used by functions 11 and 14 to hold a list of the 7 weekdays starting with Monday to return when url-date-type is set to 2. If it is empty the weeknumber is returned. It can also hold a different numbering. It has no default.

value_def

A path_def at ["data"]["values"] or ["data"]["iter"]["values"] defining the data to extract for every key-node.
If you want to select independent data-values from the DATAtree through the .find_data_value() function, the same syntax is used.
As of version 1.3.3 you can also use ["data"]["values2"] or ["data"]["iter"]["values2"] for a second instance with the incompatible "node" keyword, leaving the old path_def in place for older DataTreeGrab instances.

value-filters

An optional dict in the root of the data_def extracted into the value_filters parameter of the DATAtree. It can be used through a "member-off" statement to filter out invalid key-nodes. It has no default.

warngoal

The file or queue object to receive all warnings. With a queue object the warning is put there as a three part tuple:

weekdays

An optional list of 7 integer or string values in the root of the data_def extracted into the weekdays parameter of the DATAtree. It represents the names of the days of the week starting with Monday. It is used in constructing the relative_weekdays attribute.
