Piotr Meyer edited this page Oct 29, 2017 · 37 revisions

Common Output Format for unix-like tools

Motto: https://xkcd.com/927/


Changes

29-10-2017

  • Clarify the meta prog: field - it should be a single-word identifier, not a path to the binary. The former seems more usable: for example, when converting to XML, a full path doesn't act as an identifier very well.

26-10-2017

22-10-2017

  • Add note to error/warning messages - err(3) is a part of libc

Preface

Discussions and attempts to define new output formats for typical Unix tools like route, netstat, ps or ls, suitable for integration with external tools, come up periodically. There is a NetBSD-specific project too.

Basic assumptions

  1. Required number of changes - both in-code and user-visible ones - should be minimal.
  2. Output format itself should be:
    • directly (i.e. without additional tools) readable for users
    • parsable by existing tools (cut, awk, grep)
  3. Existing tools (df, netstat) should retain backward-compatibility.
  4. The whole mechanism should be modular and open to future improvements.
  5. The following operations should be easy to implement in new and existing (awk, perl) tools:
    • filtering data according to different rules (greater than, smaller than, equal to...)
    • selecting subset of data (for example "only mount point and capacity from df output")
    • re-ordering data
    • converting output data to JSON/YAML/XML/...

In this document, programs that consume, parse and display the new output format according to some criteria are called 'filters'.

After examining a few data formats (JSON, YAML) and already-available projects (libxo in FreeBSD), I realized that only one format fulfills all the requirements: simple, well-defined, tabular plain text.

In my opinion a solution based on tabular plain text offers the best balance between readability, usability and the amount of implementation work, while still being easy to parse. There are better formats for each individual aspect - but not for all of them simultaneously.

Converting tabular data to other formats, like XML, is relatively easy and straightforward.

Output format - current status

The current status is a mess. Typical output is separated by spaces or tabs, and is usually also column-aligned. Sometimes columns are separated by commas (dkctl), sometimes by an equals sign (sysctl, mixerctl). We can read data from any column with awk or cut, but we must remember to exclude the column name (in every case), part of a column name, or a separator between the column names and the data (cpuctl).

For example, if we need a source IP from netstat, we have to skip part of a column name ('Local') and the word '(including' in the output. Then we must strip the last dot-separated component (the port). And so on...

Example:

localhost$ netstat -an -f inet
Active Internet connections (including servers)
Proto Recv-Q Send-Q  Local Address          Foreign Address        State
tcp        0      0  1.2.3.221.22           1.2.3.91.37542         ESTABLISHED
tcp        0      0  1.2.3.221.22           1.2.3.91.37276         ESTABLISHED
tcp        0      0  *.22                   *.*                    LISTEN
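The amount of special-casing can be shown with a short, self-contained sketch (the netstat sample above is inlined with printf instead of running the real command):

```shell
# Extract source IPs from the current netstat output: skip the two
# header lines, take the 'Local Address' column, then strip the
# trailing ".port" component.
src=$(printf '%s\n' \
  'Active Internet connections (including servers)' \
  'Proto Recv-Q Send-Q  Local Address          Foreign Address        State' \
  'tcp        0      0  1.2.3.221.22           1.2.3.91.37542         ESTABLISHED' |
  awk 'NR > 2 { sub(/\.[0-9]+$/, "", $4); print $4 }')
echo "$src"   # 1.2.3.221
```

Every tool needs its own variant of this dance: a different number of header lines to skip, a different column, a different suffix to strip.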

Some tools provide rudimentary filters and column selectors (ps), but this isn't a standard.

New tabular format

After several iterations and tests made on innocent humans (shame on me), I've simplified my initial ideas even further for the sake of readability: in the current form there are only three types of lines (along with a few subtypes):

$ arp -n                                           < command invocation
# meta prog:arp ver:2                              < meta data 
# there are three known addresses below            < not-structured message
#                                                  < 'start-of-headers' marker
Host  IPv4        MAC                Interface     < headers
-     1.2.3.2     00:11:32:cc:aa:bb  wm0           < tabular data
-     1.2.3.1     aa:aa:bb:20:1a:cb  wm0
-     1.2.3.91    44:8a:5b:aa:aa:aa  wm0
#err arp Something gone wrong!                     < error message
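With this format a generic recipe works for every tool: drop the # lines, then optionally skip one header line. A self-contained sketch using the arp sample above (data inlined with printf):

```shell
# Drop all '#' lines, skip the single header line, print the IPv4 column.
ips=$(printf '%s\n' \
  '# meta prog:arp ver:2' \
  '# there are three known addresses below' \
  '#' \
  'Host  IPv4        MAC                Interface' \
  '-     1.2.3.2     00:11:32:cc:aa:bb  wm0' \
  '-     1.2.3.1     aa:aa:bb:20:1a:cb  wm0' |
  grep -v '^#' | awk 'NR > 1 { print $2 }')
echo "$ips"
```

The same two-stage pipeline works unchanged for df, netstat and so on; only the column number differs.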

Special lines

Lines that start with a hash: #, optionally followed by a space and a string. The rule is simple: "if it begins with #, then it doesn't contain tabular data".

In my opinion tools shouldn't hide meta-data from users - output can be coloured to provide better visibility, but the user should see the same data in 'raw' and 'filtered' output.

Meta-data

# meta <key>:<value> <key>:<value> <key>:<value> 

Meta-lines should be formatted in a form that is pleasant to the eye and may be split into multiple lines. Meta-lines shouldn't be overused, because that can create visual chaos.

At this moment I can see some use-cases for meta fields:

  • prog:<identifier>: a single-word program identifier, not a full path to the binary. It may be used by a filter to adjust its behavior or to format data according to additional criteria based on the program name or version. Example: conversion to XML.

  • ver:<version string>: the program version. May be useful in the future, or when some tools (for example GNU ones) have similar names.

  • fmt:<format string>: additional information about columns (sizes, data types). For example, fmt:'%ip4 %20s %i %-' should be interpreted as "treat the first column as an IPv4 address, the second as a string of at most 20 characters, the third as an integer; the type of the fourth should be determined some other way".

    It is a somewhat redundant feature, because the column format may also be determined by:

    • well-known names, hard-coded into tools (Interface, MAC, Route, IPv4)
    • column names, program names and - possibly - program versions taken from meta fields and described in an optional configuration database (similar to the termcap one)
    • guessing ("if a column is all-numeric, treat it as a standard integer"), although this requires some caution

Example:

# meta prog:df
# meta fmt:'%s %s %s %s'
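A filter could split such meta lines into key/value pairs like this (a sketch only; splitting on the first colon keeps values that themselves contain colons, such as MAC addresses, intact):

```shell
# Turn '# meta key:value ...' lines into key=value pairs,
# splitting each word on its first colon.
meta=$(printf '%s\n' \
  '# meta prog:df ver:8.25-gnu' \
  '#' \
  'Filesystem 1024-blocks Used' |
  awk '$1 == "#" && $2 == "meta" {
      for (i = 3; i <= NF; i++) {
          sep = index($i, ":")
          printf "%s=%s\n", substr($i, 1, sep - 1), substr($i, sep + 1)
      }
  }')
echo "$meta"
```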

Headers indicator

#

A single hash: #. It marks the NEXT line as a header line.

That line is MANDATORY and has an additional purpose: it provides better visual separation between "machine" and regular, tabular data, and was explicitly pointed out by test subjects as an "important difference".

Message

# <message>

A line that starts with a hash and a space: # . This is the place for additional, unstructured messages. Example:

# total count: 30
# Active Internet connections (including servers) 

Error and warning messages (optional)

#err <progname> <message text>
#warn <progname> <message text>

These lines start with #err or #warn, followed by the program name and the message text.

A word of warning: implementation requires changes in libc, because err(3) is part of libc. Therefore this part should be considered future work.

Example:

#err df Invalid moon phase detected. 

Headers

Headers define the column names and allow filters to determine the kind of data in a particular column.

In principle, column names should be limited to characters that don't conflict with shell commands and filter operands (see below). Typical alphanumerics, underscores, dashes and percent signs should be sufficient.

In this model, internationalization mechanisms (i.e. translating column names to localized ones) may be moved to the filter level, which will make integration with other tools easier (no more LANG=C in every script).

All headers MUST fit into single line.

Headers should be treated by filters as case-insensitive. It is more convenient to write where local_ipv4=1.2.3.4 than Local_IPv4=...

BUT

XML elements are case-sensitive, and we should keep conversions to other formats in mind, as well as consistent general behaviour ("why are column names in filter expressions case-insensitive, but XML element names are not?").

Headers shouldn't be separated from data by an additional horizontal rule like -------- (as in cpuctl output). This is redundant and may be added by filters.
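The case-insensitive header matching proposed above is cheap to implement in a filter; a sketch in awk (sample data inlined):

```shell
# Find the column whose header equals the requested name, ignoring
# case, then print that column from every following data row.
vals=$(printf '%s\n' \
  'Filesystem 1024-blocks Used Mounted' \
  'udev 1837416 0 /dev' \
  'tmpfs 371560 5956 /run' |
  awk -v want=mounted '
      NR == 1 { for (i = 1; i <= NF; i++)
                    if (tolower($i) == tolower(want)) col = i
                next }
      col { print $col }')
echo "$vals"
```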

Proposed header format

In general, the XML naming rules, as well as "XML best naming practices", give us a good starting point:

XML elements must follow these naming rules:

  • Element names are case-sensitive
  • Element names must start with a letter or underscore
  • Element names cannot start with the letters xml (or XML, or Xml, etc)
  • Element names can contain letters, digits, hyphens, underscores, and periods
  • Element names cannot contain spaces

Any name can be used, no words are reserved (except xml).

Best Naming Practices

  • Create descriptive names.
  • Create short and simple names, like this: <book_title> not like this: <the_title_of_the_book>.
  • Avoid "-". If you name something "first-name", some software may think you want to subtract "name" from "first".
  • Avoid ".". If you name something "first.name", some software may think that "name" is a property of the object "first".
  • Avoid ":". Colons are reserved for namespaces (more later).
  • Non-English letters like éòá are perfectly legal in XML, but watch out for problems if your software doesn't support them.

Data

Column-formatted, separated by tabs and spaces.

We should provide basic guidelines for data formatting, i.e.:

  1. allow/disallow suffixes like KiB, kb, M etc. (as far as I can see there is no problem with suffixes)

  2. allow/disallow using the percent sign % to denote percentage values. It is redundant, I think, and should probably be forbidden

    • column names give us enough information about the kind of data, but % is also easy to parse, so I don't have a fixed opinion in this case.
  3. using quotation marks when string data may contain tabs or spaces? This is the trickiest part.

  4. unknown/not applicable/missing values: there are several possibilities, like: - or ? or _.

    There should be only one such value - but the choice also depends on the selected variant of filter syntax (see below). Some forms may conflict with current shells or the getopt library (? or -).

  5. "ANY IP addr" should be denoted as '0.0.0.0', not by an asterisk (also because of shell interactions)

Examples

Some additional examples (with colored variants included) will be available on an external page.

  1. Current arp format

    localhost$ arp -an
    ? (1.2.3.2) at 00:11:32:cc:aa:bb on wm0
    ? (1.2.3.1) at aa:aa:bb:20:1a:cb on wm0
    ? (1.2.3.91) at 44:8a:5b:aa:aa:aa on wm0
    
  2. New arp format

    localhost$ arp -an
    # meta prog:arp 
    #
    Host  IPv4        MAC                Interface
    -     1.2.3.2     00:11:32:cc:aa:bb  wm0
    -     1.2.3.1     aa:aa:bb:20:1a:cb  wm0
    -     1.2.3.91    44:8a:5b:aa:aa:aa  wm0
    
  3. New df format

    $ df
    # meta prog:df ver:8.25-gnu
    #
    Filesystem    1024-blocks      Used Available Capacity Mounted
    udev              1837416         0   1837416        0 /dev
    tmpfs              371560      5956    365604        2 /run
    
  4. df output, colored by filter (older example):

    df-in-color

  5. New netstat format

    localhost$ netstat -an -f inet6
    # meta prog:netstat
    # Active Internet6 connections (including servers)
    #
    Proto Recv-Q Send-Q  Local_Addr Local_port Foreign_Addr Foreign_Port  State
    tcp6       0      0    0.0.0.0          22      0.0.0.0            -  LISTEN
    ...
    

    I'm not sure about the 'Proto' column. Maybe it should be just 'tcp' with an
    extra 'Family' column instead?

Open questions

Choose a format or filter logic for the following non-tabular data:

  • ifconfig
  • df -G
  • ping

Utilities (filters)

There are already many: awk, sed, perl, python, cut, grep... But an improved output format will make room for new utilities that may be useful to users and system administrators.

During my tests I wrote a kind of "technology demonstrator": a simple tool called sel that helped me test and verify various use cases. sel is available from a github repo.

It is inspired by some PowerShell features, but without all the hassle and cruft. For example, in PS we are able to do something like this:

Y:\> Get-Process | Select VM,Id,Handles,Name | select -last 5

        VM          Id     Handles Name
        --          --     ------- ----
  59392000        7655         147 WmiPrvSE
  44191744        9811         165 WmiPrvSE
 165384192        3122         226 Gogo
  45375488        6433          24 GogoHelper
  56348672        5244         205 WUDFHost

A word of warning: some samples were created with GNU tools on a Linux host.

Example: "show columns number 1, 2, 3 and the column named 'Mounted', but only when 'Filesystem' is equal to 'tmpfs' and column 'Used' is greater than 10".

$ df | sel 1,2,3,mounted where Filesystem -eq tmpfs -a Used -gt 10
Filesystem 1024-blocks   Used  Mounted        
tmpfs           371560   5956  /run          
tmpfs          1857800  56916  /dev/shm      
tmpfs           371560     36  /run/user/3000 

"Get single value"

$ df | sel -q Used where Mounted -eq /opt
123456

Raw output (discussed below):

$ df
# meta prog:df ver:8.25-gnu
#
Filesystem    1024-blocks      Used Available Capacity Mounted
udev              1837416         0   1837416        0 /dev
tmpfs              371560      5956    365604        2 /run
...
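For illustration, the "get single value" query can be approximated in plain awk against raw output of this shape - this is only a sketch of the idea, not the real sel implementation:

```shell
# Approximates: df | sel -q Used where Mounted -eq /run
# Skip '#' lines, build a case-insensitive header->column map from the
# first remaining line, then print 'Used' where 'Mounted' matches.
used=$(printf '%s\n' \
  '# meta prog:df ver:8.25-gnu' \
  '#' \
  'Filesystem 1024-blocks Used Available Capacity Mounted' \
  'udev 1837416 0 1837416 0 /dev' \
  'tmpfs 371560 5956 365604 2 /run' |
  awk '/^#/ { next }
       !hdr { for (i = 1; i <= NF; i++) col[tolower($i)] = i
              hdr = 1; next }
       $col["mounted"] == "/run" { print $col["used"] }')
echo "$used"   # 5956
```

A dedicated filter mostly packages this header-map-plus-predicate pattern behind a friendlier syntax.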

Possible features - command combining?

( netstat -f inet6 ; arp -n ) | \
     sel netstat.source_ipv4, \
         arp.mac \
  where "netstat.source_ipv4 -eq arp.ipv4 -a netstat.state -eq 'ESTABLISHED'"

Integration with existing tools

A word of warning: this particular area has not been well-tested, and I'm not sure that this is the best possible solution.

  1. The output format may be selected by two environment variables: a general one and a per-program one, for example:

    OUTPUT_FORMAT=classic  - forces current behavior
    OUTPUT_FORMAT=cof      - select 'new' format
    OUTPUT_FORMAT_df=cof   - the same but only for 'df' 
    
  2. An optional automatic filter may be selected in the same way:

    OUTPUT_FILTER    = /usr/bin/sel   - select filter for all commands
    OUTPUT_FILTER_df =                - disable filter for single command
    
  3. Passing parameters to an integrated filter: the filter's options should be easy to use and to pass directly to $OUTPUT_FILTER.

    Possible options:

    a. df -/ '[filter here]' (is it a valid option for bsd getopt?)

    b. df --of '[filter options here]'

  4. The filter may be called in a pipe-to-child scheme (untested).
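The precedence between the two variables from point 1 is not spelled out above; assuming the per-program variable wins over the general one, the lookup could be sketched as:

```shell
# Assumed precedence: OUTPUT_FORMAT_<prog> overrides OUTPUT_FORMAT,
# which in turn defaults to 'classic'.
OUTPUT_FORMAT=cof
OUTPUT_FORMAT_df=classic

prog=df
eval "fmt=\${OUTPUT_FORMAT_${prog}:-\${OUTPUT_FORMAT:-classic}}"
echo "$fmt"   # classic - the per-program setting wins
```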

Other formats and implementations

XML

There have been many attempts and interesting ways to use XML as a data exchange format for UNIX tools.

  • xml-coreutils - although I'm still a bit skeptical about XML, the linked essay "Is the Unix shell ready for XML?" is definitely worth reading.

JSON

Pros

  • familiar to users
  • widespread
  • many, simple parsers available
  • JSON, when used consistently, can pass multiple pieces of structured information to consumers: the number of selected elements, implementation details and so on

Cons

  • not convenient for standard shell tools nor for humans, after all:

     $ df --libxo='json,pretty,no-top'
     "storage-system-information": {
       "filesystem": [
         {
           "name": "/dev/gpt/rootfs",
           "total-blocks": 20307196,
           "used-blocks": 1449604,
           "available-blocks": 17233020,
           "used-percent": 8,
           "mounted-on": "/"
         },
         {
           "name": "devfs",
           "total-blocks": 1,
           "used-blocks": 1,
           "available-blocks": 0,
           "used-percent": 100,
           "mounted-on": "/dev"
         }
       ]
     }
    

    BUT there are interesting tools

  • the example from the manual looks nice, but more complicated ones don't

  • JSON isn't stream-oriented (but there is Line_Delimited_JSON). Unfortunately, LDJSON isn't readable.

  • creating proper JSON requires us to "open" and "close" sections, see the libxo examples. The whole process is tedious, and libxo even has an additional mode as a workaround (see 'DTRT Mode')

YAML

I was fascinated by YAML but - finally - I gave up.

Pros

  • also well-known
  • better suited for streamed data than JSON (but what about LDJSON?)
  • probably easier to create than JSON (we don't need a closing bracket to mark end-of-array)

Cons

  • overcomplicated specification and parsers
  • not so readable after all (I tried some variants, but they were unreadable without syntax highlighting, and even then the effect wasn't good).

TOML

"Simpler than YAML" but - unfortunately - not compatible with my assumptions.

FreeBSD libxo

It is not a format but an existing implementation, integrated into the base system (FreeBSD), that provides XML, JSON and HTML output.

Pros

  • From a console user's point of view I don't see any, sorry: the output is suitable for machine parsing but inconvenient to read (see the earlier examples in the JSON section).

Cons