Draft
Motto: https://xkcd.com/927/
- Changes
- Preface
- Basic assumptions
- New tabular format
- Utilities (filters)
- Integration with existing tools
- Other formats and implementations
- Clarify the meta prog: field - it should be a single-word identifier, not a path to the binary. The former seems more usable; for example, when converting to XML a full path doesn't act as an identifier very well.
- Some thoughts about header naming rules
- After some discussions and a few readings I have changed my opinion about XML a little.
- Added a note to error/warning messages - err(3) is part of libc
Discussions and attempts to define new output formats for typical Unix tools like route, netstat, ps or ls, suitable for integration with external tools, come up periodically. There is also NetBSD's own project.
- The required number of changes - both in-code and user-visible ones - should be minimal.
- The output format itself should be:
- directly readable by users (i.e. without additional tools)
- parsable by existing tools (cut, awk, grep)
- Existing tools (df, netstat) should retain backward compatibility.
- The whole mechanism should be modular and open to future improvements.
- The following operations should be easy to implement in new and existing (awk, perl) tools:
- filtering data according to different rules (greater than, smaller than, equal to...)
- selecting subset of data (for example "only mount point and capacity from df output")
- re-ordering data
- converting output data to JSON/YAML/XML/...
In this document, programs that consume, parse and display the new output format according to some criteria are called 'filters'.
After examining a few data formats (JSON, YAML) and already available projects (libxo in FreeBSD), I realized that only one format fulfills all the requirements: simple, well-defined, tabular plain text.
In my opinion, a solution based on tabular plain text offers the best balance between readability, usability and the amount of implementation work, while still being easy to parse. There are better formats for each individual aspect - but not for all of them simultaneously.
Converting tabular data to another format, like XML, is relatively easy and straightforward.
The current status is a mess. Typical output is separated by spaces or tabs, and usually it is also column-aligned. Sometimes columns are separated by commas (dkctl), sometimes by an equals sign (sysctl, mixerctl). We can read data from any column with awk or cut, but we have to remember to exclude the column name (in every case), part of a column name, or the separator between column names and data (cpuctl).
For example, if we need the source IP from netstat output: we have to skip part of the column name ('Local Address' is two words) and the '(including servers)' banner line. Then we have to strip the last dot-separated component (the port). And so on...
Example:
localhost$ netstat -an -f inet
Active Internet connections (including servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 1.2.3.221.22 1.2.3.91.37542 ESTABLISHED
tcp 0 0 1.2.3.221.22 1.2.3.91.37276 ESTABLISHED
tcp 0 0 *.22 *.* LISTEN
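To illustrate, a hedged sketch (my own, not part of any proposal) of what extracting just the local IPv4 addresses from the output above takes today with awk - it has to know that there are exactly two non-data lines on top and that the port is glued on as a fifth dot-separated component:

localhost$ netstat -an -f inet | awk 'NR > 2 { n = split($4, a, "."); if (n == 5) print a[1] "." a[2] "." a[3] "." a[4] }'
1.2.3.221
1.2.3.221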
Some tools provide rudimentary filters and column selectors (ps), but this is not standard.
After several iterations and tests performed on innocent humans (shame on me), for the sake of better readability, I've simplified my initial thoughts even further: in the current form there are only three types of lines (along with a few subtypes):
$ arp -n < command invocation
# meta prog:arp ver:2 < meta data
# there are three known addresses below < not-structured message
# < 'start-of-headers' marker
Host IPv4 MAC Interface < headers
- 1.2.3.2 00:11:32:cc:aa:bb wm0 < tabular data
- 1.2.3.1 aa:aa:bb:20:1a:cb wm0
- 1.2.3.91 44:8a:5b:aa:aa:aa wm0
#err arp Something gone wrong! < error message
Lines that start with a hash (#), optionally followed by a space and a string. The rule is simple: "if it begins with # then it doesn't contain tabular data".
In my opinion, tools shouldn't hide meta-data from users - output can be coloured for better visibility, but the user should see the same data in 'raw' and 'filtered' output.
# meta <key>:<value> <key>:<value> <key>:<value>
Meta-lines should be formatted in a form that is pleasant to the eye and may be split into multiple lines. Meta-lines shouldn't be overused because that can create visual chaos.
At this moment I can see some use-cases for meta fields:
- prog:<identifier>: a single-word program identifier (not the full path to the binary - see Changes above). It may be used by a filter to adjust behaviour or to format data according to additional criteria, based on the program name or version. Example: conversion to XML.
- version:<version string>: program version. May be useful in the future or when some tools (for example GNU ones) have similar names.
- fmt:<format string>: additional information about columns (sizes, data types). For example fmt:'%ip4 %20s %i %-' should be interpreted as "treat the first column as an IPv4 address, the second as a string of at most 20 characters, the third as an integer; the type of the fourth should be determined in another way". It is a somewhat redundant feature because the column format may also be determined by:
- well-known names, hard-coded into tools (Interface, MAC, Route, IPv4)
- column names, program names and - possibly - program versions taken from meta-fields and described in an optional configuration database (like the termcap one)
- guessing ("if it is all-numeric then convert it to a standard integer"), although this requires some caution
Example:
# meta prog:df
# meta fmt:'%s %s %s %s"
#
A single hash (#). It marks the NEXT line as the header line. That - MANDATORY - line has an additional purpose: it provides better visual separation between "machine" and regular, tabular data, and was explicitly pointed out by test subjects as an "important difference".
# <message>
A line that starts with a hash and a space (# ). This is a place for additional, unstructured messages.
Example:
# total count: 30
# Active Internet connections (including servers)
#err <progname> <message text>
#warn <progname> <message text>
Starts with #err or #warn, followed by the program name and the message text.
Word of warning: the implementation requires changes in libc because of err(3). Therefore this part should be considered future work.
Example:
#err df Invalid moon phase detected.
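A hedged sketch of what a filter could do with such lines, e.g. forward them to stderr and pass everything else through unchanged (assuming an awk, or a system /dev/stderr device, that supports this redirection - gawk and modern BSD awks do):

$ df | awk '/^#(err|warn) / { print > "/dev/stderr"; next } { print }'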
Defines column names and allows filters to determine the kind of data in a particular column.
In principle, column names should be limited to characters that don't conflict with shell commands and filter operands (see below). So typical alphanumerics, underscores, dashes and percent signs should be sufficient.
In this model, internationalization mechanisms (i.e. translating column names to localized ones) may be moved to the filter level, which will make integration with tools easier (no more LANG=C in every script).
All headers MUST fit into single line.
Headers should be treated by filters as case-insensitive. It is more convenient to write where local_ipv4=1.2.3.4 than Local_IPv4=... BUT XML element names are case-sensitive, and we should keep conversions to other formats in mind, as well as consistent general behaviour ("why are column names in filter expressions case-insensitive but XML element names not?").
Headers shouldn't be separated from data by additional separator lines like -------- (as in cpuctl output). This is redundant and may be added by filters.
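To make the XML concern more concrete, here is a hedged sketch of my own (not a proposed tool) that turns the new arp output into trivial XML, reusing the header names directly as element names - which is exactly why the naming rules quoted below matter:

localhost$ arp -n | awk '
    BEGIN       { print "<arp>" }
    /^# *$/     { in_hdr = 1; next }                      # bare "#": the header line follows
    /^#/        { next }                                  # meta and message lines
    in_hdr == 1 { for (i = 1; i <= NF; i++) name[i] = $i
                  in_hdr = 2; next }
    in_hdr == 2 { print "  <entry>"
                  for (i = 1; i <= NF; i++)
                      printf "    <%s>%s</%s>\n", name[i], $i, name[i]
                  print "  </entry>" }
    END         { print "</arp>" }'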
In general, the XML naming rules as well as "XML best naming practices" may give us a good starting point:
XML elements must follow these naming rules:
- Element names are case-sensitive
- Element names must start with a letter or underscore
- Element names cannot start with the letters xml (or XML, or Xml, etc)
- Element names can contain letters, digits, hyphens, underscores, and periods
- Element names cannot contain spaces
Any name can be used, no words are reserved (except xml).
Best Naming Practices
- Create descriptive names, like this: <person>, <firstname>, <lastname>.
- Create short and simple names, like this: <book_title> not like this: <the_title_of_the_book>.
- Avoid "-". If you name something "first-name", some software may think you want to subtract "name" from "first".
- Avoid ".". If you name something "first.name", some software may think that "name" is a property of the object "first".
- Avoid ":". Colons are reserved for namespaces (more later).
- Non-English letters like éòá are perfectly legal in XML, but watch out for problems if your software doesn't support them.
Data lines are column-formatted, with columns separated by tabs and spaces.
We should provide basic guidelines for data formatting, i.e.:
- allow/disallow suffixes like KiB, kb, M etc. (as far as I can see there is no problem with suffixes)
- allow/disallow using the percent sign (%) to denote percentage values. I think it is redundant and should be forbidden - column names give us enough information about the kind of data - but it is easy to parse, so I don't have a fixed opinion in this case.
- using quotation marks when string data may contain tabs or spaces? This is the trickiest part (see the illustration after this list).
- unknown/not applicable/missing values: there are several possibilities, like -, ? or _. There should be only one such value - but the choice also depends on the selected variant of filter syntax (see below). Some forms may conflict with current shells or the getopt library (? or -).
- "ANY IP addr" should be denoted as '0.0.0.0', not by an asterisk (also due to shell interactions)
Some additional examples (with colored variants included) will be available on an external page.
- Current arp format

localhost$ arp -an
? (1.2.3.2) at 00:11:32:cc:aa:bb on wm0
? (1.2.3.1) at aa:aa:bb:20:1a:cb on wm0
? (1.2.3.91) at 44:8a:5b:aa:aa:aa on wm0
- New arp format

localhost$ arp -an
# meta prog:arp
#
Host IPv4 MAC Interface
- 1.2.3.2 00:11:32:cc:aa:bb wm0
- 1.2.3.1 aa:aa:bb:20:1a:cb wm0
- 1.2.3.91 44:8a:5b:aa:aa:aa wm0
- New df format

$ df
# meta prog:df ver:8.25-gnu
#
Filesystem 1024-blocks Used Available Capacity Mounted
udev 1837416 0 1837416 0 /dev
tmpfs 371560 5956 365604 2 /run
- df output, colored by filter (older example):
- New netstat format

localhost$ netstat -an -f inet6
# meta prog:netstat
# Active Internet6 connections (including servers)
#
Proto Recv-Q Send-Q Local_Addr Local_port Foreign_Addr Foreign_Port State
tcp6 0 0 0.0.0.0 22 0.0.0.0 - LISTEN
...
I'm not sure about the 'Proto' column. Maybe it should be 'tcp' with an extra 'Family' column instead?
Choose a format or filter logic for the following non-tabular data:
- ifconfig
- df -G
- ping
There already are many: awk, sed, perl, python, cut, grep... But an improved output format will make room for some new utilities that may be useful for users and system administrators.
During my tests I wrote a kind of "technology demonstrator": a simple tool called sel that helped me test and verify various use cases. sel is available from its github repo.
It is inspired by some PowerShell features, but without all the hassle and cruft. For example, in PS we are able to do something like this:
Y:\> Get-Process | Select VM,Id,Handles,Name | select -last 5
VM Id Handles Name
-- -- ------- ----
59392000 7655 147 WmiPrvSE
44191744 9811 165 WmiPrvSE
165384192 3122 226 Gogo
45375488 6433 24 GogoHelper
56348672 5244 205 WUDFHost
Word of warning: some samples were created with GNU tools on a Linux host.
Example: "show columns number 1,2,3 and one named 'Mounted' but only when 'Filesystem' is equal to 'tmpfs' and column 'Used' is greater than 10".
$ df | sel 1,2,3,mounted where Filesystem -eq tmpfs -a Used -gt 10
Filesystem 1024-blocks Used Mounted
tmpfs 371560 5956 /run
tmpfs 1857800 56916 /dev/shm
tmpfs 371560 36 /run/user/3000
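Roughly the same selection stays possible with plain awk, which is the point of keeping the format tabular. A hedged sketch on the same data (column names are matched case-sensitively here, unlike in sel):

$ df | awk '
    /^# *$/  { want_hdr = 1; next }                       # bare "#": the header line follows
    /^#/     { next }                                     # meta and message lines
    want_hdr { for (i = 1; i <= NF; i++) col[$i] = i      # map column names to positions
               want_hdr = 0; next }
    $(col["Filesystem"]) == "tmpfs" && $(col["Used"]) > 10 {
        print $1, $2, $3, $(col["Mounted"])
    }'
tmpfs 371560 5956 /run
tmpfs 1857800 56916 /dev/shm
tmpfs 371560 36 /run/user/3000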
"Get single value"
$ df | sel -q Used where Mounted -eq /opt
123456
Raw output (discussed below):
$ df
# meta prog:df ver:8.25-gnu
#
Filesystem 1024-blocks Used Available Capacity Mounted
udev 1837416 0 1837416 0 /dev
tmpfs 371560 5956 365604 2 /run
...
Possible features - command combining?
( netstat -f inet6 ; arp -n ) | \
sel netstat.source_ipv4, \
arp.mac \
where "netstat.source_ipv4 -eq arp.ipv4 -a netstat.state -eq 'ESTABLISHED'"
Word of warning: this particular area has not been well tested and I'm not sure that this is the best possible solution.
- The output format may be selected by two environment variables, a general one and a per-program one, for example (see the sketch after this list):

OUTPUT_FORMAT=classic - forces the current behaviour
OUTPUT_FORMAT=cof - selects the 'new' format
OUTPUT_FORMAT_df=cof - the same, but only for 'df'
- An optional automatic filter may be selected in the same way:

OUTPUT_FILTER=/usr/bin/sel - selects the filter for all commands
OUTPUT_FILTER_df= - disables the filter for a single command
- Passing parameters to the integrated filter: the filter's options should be easy to use and to pass directly to $OUTPUT_FILTER. Possible options:

a. df -/ '[filter here]' (is it a valid option for BSD getopt?)
b. df --of '[filter options here]'
- The filter may be called in a pipe-to-child scheme (untested).
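A hedged sketch of how this could look from a user's point of view - the variable names, the 'cof' value and the --of option are only the proposals above, nothing is implemented:

$ OUTPUT_FORMAT=cof df                      # new tabular output, no filter attached
$ export OUTPUT_FORMAT=cof OUTPUT_FILTER=/usr/bin/sel
$ df --of 'Mounted where Used -gt 10'       # df hands the --of argument to $OUTPUT_FILTER
$ OUTPUT_FILTER_df= df                      # filter disabled for df only, plain 'cof' output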
There have been many attempts and interesting ways to use XML as a data exchange format for UNIX tools.
- xml-coreutils - although I'm still a bit skeptical about XML, the linked essay "Is the Unix shell ready for XML?" is definitely worth reading.
Pros
- familiar to users
- widespread
- many, simple parsers available
- the JSON format, when used consistently, can pass several pieces of structured information to consumers: the number of selected elements, implementation details and so on
Cons
- not convenient for standard shell tools or humans, after all:
$ df --libxo='json,pretty,no-top'
"storage-system-information": {
  "filesystem": [
    {
      "name": "/dev/gpt/rootfs",
      "total-blocks": 20307196,
      "used-blocks": 1449604,
      "available-blocks": 17233020,
      "used-percent": 8,
      "mounted-on": "/"
    },
    {
      "name": "devfs",
      "total-blocks": 1,
      "used-blocks": 1,
      "available-blocks": 0,
      "used-percent": 100,
      "mounted-on": "/dev"
    }
  ]
}
BUT there are interesting tools
- the example from the manual looks nice, but more complicated ones don't
- JSON isn't stream-oriented (but there is Line Delimited JSON). Unfortunately, LDJSON isn't readable.
- creating proper JSON requires us to "open" and "close" sections, see the libxo examples. The whole process is tedious and libxo even has an additional mode as a workaround (see 'DTRT Mode')
I was fascinated by YAML but - finally - I gave up.
- http://yaml.org/
- https://en.wikipedia.org/wiki/YAML
- https://arp242.net/weblog/yaml_probably_not_so_great_after_all.html
Pros
- also well-known
- better suited for stream data than JSON (but: how about LDJSON?)
- probably easier to create than JSON (we don't need a closing bracket to mark the end of an array)
Cons
- overcomplicated specification and parsers
- not so readable after all (I tried some variants but they were unreadable without syntax highlighting, and even then the effect wasn't good).
"Simpler than YAML" but - unfortunately - not compatible with my assumptions
It is not a format but an existing implementation, integrated into the base system (FreeBSD), that provides XML, JSON and HTML output.
Pros
- From a console user's point of view I don't see any, sorry: the output is suitable for machine parsing but inconvenient to read (see the previous examples in the JSON section).
Cons
- I have fundamental objections against integrating such complicated functionality as a JSON/XML/HTML formatter into tools (even via a shared library)
- I don't like the idea of embedding additional code, responsible for output formatting, into particular tools. Integration with libxo requires things like "open container", "open section" and so on.
- Integration with libxo requires a large amount of work.