Skip to content

hobeika/orgaggregate

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

44 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Aggregate Values in a Table

Aggregating a table is creating a new table by computing sums, averages, and so on, out of material from the first table.

This is the second version, with:

  • more flexibility for the formulas,
  • modifiers and formatters with the same meanings as in the spreadsheet,
  • increased performance on large datasets.

Examples

A very simple example. We have a table of activities and quantities (whatever they are) over several days.

#+NAME: original
| Day       | Color | Level | Quantity |
|-----------+-------+-------+----------|
| Monday    | Red   |    30 |       11 |
| Monday    | Blue  |    25 |        3 |
| Thuesday  | Red   |    51 |       12 |
| Thuesday  | Red   |    45 |       15 |
| Thuesday  | Blue  |    33 |       18 |
| Wednesday | Red   |    27 |       23 |
| Wednesday | Blue  |    12 |       16 |
| Wednesday | Blue  |    15 |       15 |
| Turdsday  | Red   |    39 |       24 |
| Turdsday  | Red   |    41 |       29 |
| Turdsday  | Red   |    49 |       30 |
| Friday    | Blue  |     7 |        5 |
| Friday    | Blue  |     6 |        8 |
| Friday    | Blue  |    11 |        9 |

We want to aggregate this table for each day (because several rows exist for each day). First we give a name to the table through the #+NAME: or #+TBLNAME: tags, just above the table. Then we create a dynamic block to receive the aggregation:

#+BEGIN: aggregate :table original :cols "Day vmean(Level) vsum(Quantity)"
#+END

Now typing C-c C-c in the dynamic block computes the aggregation:

#+BEGIN: aggregate :table original :cols "Day vmean(Level) vsum(Quantity)"
| Day       | mean(Level) | sum(Quantity) |
|-----------+-------------+---------------|
| Monday    |        27.5 |            14 |
| Thuesday  |          43 |            45 |
| Wednesday |          18 |            54 |
| Turdsday  |          43 |            83 |
| Friday    |           8 |            22 |
#+END

The source table is not change in any way.

To get this result, we specified columns in this way, after the :cols parameter:

  • Day : we got the same column as in the source table, except entries are not duplicated. Here Day acts as a key grouping colum. We may specify as many key columns as we want just by naming them. We get only one aggregated row for each different combination of values of key grouping columns.
  • vmean(Level) : this instructs Org to compute the average of values found in the Level column, grouped by the same Day.
  • vsum(Quantity): Org computes the sum of values found in the Quantity column, one sum for each Day.

SQL equivalent

If you are familiar with SQL, you would get a similar result with the GROUP BY statement:

select Day, mean(Level), sum(Quantity)
from original
group by Day;

R equivalent

If you are familiar with the R statistical package, you would get a similar result with factor and aggregate functions:

original <- the table as a data.frame
day_factor <- factor(original$Day)
aggregate (original$Level   , list(Day=day_factor), mean)
aggregate (original$Quantity, list(Day=day_factor), sum )

Key-binding & Wizard

Type C-c C-x i to launch a wizard for creating new dynamic blocks. Then answer aggregate for the type of block, and follow the instructions. (There are several other dynamic blocks that can be built this way: columnview, clocktable, propview, invoice, transpose, and any future block).

This is given by the file org-insert-dblock.el, which extends the original C-c C-x i key-binding to any dynamic block. If for any reason you do not want to load this file, you can still create dynamic aggregate blocks by typing M-x org-dblock-write:aggregate.

Other examples

Maybe we are just interested in the sum of Quantities, regardless of Days. We just type:

#+BEGIN: aggregate :table "original" :cols "vsum(Quantity)"
| vsum(Quantity) |
|----------------|
|            218 |
#+END

Or we may want to count the number of rows for each combination of Day and Color:

#+BEGIN: aggregate :table "original" :cols "count() Day Color"
| count() | Day       | Color |
|---------+-----------+-------|
|       1 | Monday    | Red   |
|       1 | Monday    | Blue  |
|       2 | Thuesday  | Red   |
|       1 | Thuesday  | Blue  |
|       1 | Wednesday | Red   |
|       2 | Wednesday | Blue  |
|       3 | Turdsday  | Red   |
|       3 | Friday    | Blue  |
#+END

If we want to get measurments for Colors rather than Days, we type:

#+BEGIN: aggregate :table "original" :cols "Color vmean(Level) vsum(Quantity)"
| Color |  vmean(Level) | vsum(Quantity) |
|-------+---------------+----------------|
| Red   | 40.2857142857 |            144 |
| Blue  | 15.5714285714 |             74 |
#+END

The :cols parameter

The :cols parameter lists the columns of the resulting table. It contains in any order, grouping key columns and aggregation formulas.

The names of the columns in the original table may be:

  • the names as they appear in the header of the source table,
  • or $1, $2, $3 and so on (as in spreadsheet formulas),
  • additionaly, the special column hline is used to group parts of the source table separated by horizontal lines.

The :cols parameter may be a string or a list of strings. Examples:

:cols "Day vmean(Level);f3 vsum(Quantity);f2"
:cols ("Day" "vmean(Level);f3" "vsum(Quantity);f2")

If a single string is used, it is split by spaces. Thus, a given formula, including its semicolon and modifiers, must not contain any space. If spaces are required within a formula, then use the paranthesis list.

Grouping specifications in :cols

Grouping is done on columns of the source table acting as key columns. Just name the key columns.

Additionally, the hline specification means that rows between two horizontal lines should be grouped.

Key columns and hline are used to group rows of the source table with uniq combinations of those columns.

hline example

Here is a source table containing 3 blocks separated by horizontal lines:

#+NAME: originalhl
| Color | Level | Quantity |
|-------+-------+----------|
| Red   |    30 |       11 |
| Blue  |    25 |        3 |
| Red   |    51 |       12 |
| Red   |    45 |       15 |
| Blue  |    33 |       18 |
|-------+-------+----------|
| Red   |    27 |       23 |
| Blue  |    12 |       16 |
| Blue  |    15 |       15 |
| Red   |    39 |       24 |
| Red   |    41 |       29 |
|-------+-------+----------|
| Red   |    49 |       30 |
| Blue  |     7 |        5 |
| Blue  |     6 |        8 |
| Blue  |    11 |        9 |

And here is the aggregation by those 3 blocks:

#+BEGIN: aggregate :table originalhl :cols "hline vmean(Level) vsum(Quantity)"
| hline | vmean(Level) | vsum(Quantity) |
|-------+--------------+----------------|
|     0 |         36.8 |             59 |
|     1 |         26.8 |            107 |
|     2 |        18.25 |             52 |
#+END:

If we want additionnal details with the Color column, we just name it:

#+BEGIN: aggregate :table originalhl :cols "hline Color vmean(Level) vsum(Quantity)"
| hline | Color |  vmean(Level) | vsum(Quantity) |
|-------+-------+---------------+----------------|
|     0 | Red   |            42 |             38 |
|     0 | Blue  |            29 |             21 |
|     1 | Red   | 35.6666666667 |             76 |
|     1 | Blue  |          13.5 |             31 |
|     2 | Red   |            49 |             30 |
|     2 | Blue  |             8 |             22 |
#+END:

Aggregation formulas in :cols

Aggregation formulas are applied for each of those groupings, on the specified columns.

We saw examples with sum, mean, count aggregations. There are many other aggregations. They are based on functions provided by Calc:

  • count() or vcount()
    • in Calc: `u #' (`calc-vector-count') [`vcount'])
    • gives the number of elements in the group being aggregated; this function may or may not take a column parameter; with a parameter, empty cells are not counted (except with the E modifier)..
  • sum or vsum
    • in Calc: `u +' (`calc-vector-sum') [`vsum']
    • computes the sum of elements being aggregated
  • max or vmax
    • in Calc: `u X' (`calc-vector-max') [`vmax']
    • gives the largest of the elements being aggregated
  • min or vmin
    • in Calc: `u N' (`calc-vector-min') [`vmin']
    • gives the smallest of the elements being aggregated
  • mean or vmean
    • in Calc: `u M' (`calc-vector-mean') [`vmean']
    • computes the average (arithmetic mean) of elements being aggregated
  • meane or vmeane
    • in Calc: `I u M' (`calc-vector-mean-error') [`vmeane']
    • computes the average (as mean) along with the estimated error of elements being aggregated
  • median or vmedian
    • in Calc: `H u M' (`calc-vector-median') [`vmedian']
    • computes the median of elements being aggregated, by taken the middle element after sorting them
  • hmean or vhmean
    • in Calc: `H I u M' (`calc-vector-harmonic-mean') [`vhmean']
    • computes the harmonic mean of elements being aggregated
  • gmean or vgmean
    • in Calc: `u G' (`calc-vector-geometric-mean') [`vgmean']
    • computes the geometric mean of elements being aggregated
  • sdev or vsdev
    • in Calc: `u S' (`calc-vector-sdev') [`vsdev']
    • computes the standard deviation of elements being aggregated
  • psdev or vpsdev
    • in Calc: `I u S' (`calc-vector-pop-sdev') [`vpsdev']
    • computes the population standard deviation (divide by N instead of N-1)
  • pvar or vvar
    • in Calc: `H u S' (`calc-vector-variance') [`vvar']
    • computes the variance of elements being aggregated
  • pcov or vpcov
    • in Calc: `I u C' (`calc-vector-pop-covariance') [`vpcov']
    • computes the population covariance of elements being aggregated from two columns (divides by N)
  • cov or vcov
    • in Calc: `u C' (`calc-vector-covariance') [`vcov']
    • computes the sample covariance of elements being aggregated from two columns (divides by N-1)
  • corr or vcorr
    • in Calc: `H u C' (`calc-vector-correlation') [`vcorr']
    • computes the linear correlation coefficient of elements being aggregated in two columns
  • prod or vprod
    • in Calc: `u *' (`calc-vector-product') [`vprod']
    • computes the product of elements being aggregated
  • list or (X)
    • gives the list of elements being aggregated, without aggregation

The aggregation functions may be written with or without a leading v. sum and vsum are equivalent. The v form should be prefered, as it is the one used in the Org table spreadsheet, and in Calc. The non-v names may be dropped in the future.

The :cond filtering

This parameter is optional. If present, it specifies a lisp expression which tells whether or not a row should be kept. When the expression evaluates to nil, the row is discarded.

Examples of useful expressions includes:

  • :cond (equal Color "Red")
    • to keep only rows where Color is Red
  • :cond (> (string-to-number Quantity) 19)
    • to keep only rows for which Quantity is more than 19
    • note the call to string-to-number; without this call, Quantity would be used as a string
  • :cond (> (* (string-to-number Level) 2.5) (string-to-number Quantity))
    • to keep only rows for which 2.5*Level > Quantity

Pull & Push

Two modes are available: pull & push.

In the pull mode, we use so called “dynamic blocks”. The resulting table knows how to build itself. Example:

We have a source table which is unaware that it will be derived in an aggregated table:

#+NAME: source1
| Day       | Color | Level | Quantity |
|-----------+-------+-------+----------|
| Monday    | Red   |    30 |       11 |
| Monday    | Blue  |    25 |        3 |
| Thuesday  | Red   |    51 |       12 |
| Thuesday  | Red   |    45 |       15 |
| Thuesday  | Blue  |    33 |       18 |
| Wednesday | Red   |    27 |       23 |
| Wednesday | Blue  |    12 |       16 |
| Wednesday | Blue  |    15 |       15 |
| Turdsday  | Red   |    39 |       24 |
| Turdsday  | Red   |    41 |       29 |
| Turdsday  | Red   |    49 |       30 |
| Friday    | Blue  |     7 |        5 |
| Friday    | Blue  |     6 |        8 |
| Friday    | Blue  |    11 |        9 |

We create somewhere else a dynamic block which carries the specification of the aggregation:

#+BEGIN: aggregate :table "source1" :cols "Day vmean(Level) vsum(Quantity)"
| Day       | vmean(Level) | vsum(Quantity) |
|-----------+--------------+----------------|
| Monday    |         27.5 |             14 |
| Thuesday  |           43 |             45 |
| Wednesday |           18 |             54 |
| Turdsday  |           43 |             83 |
| Friday    |            8 |             22 |
#+END

Typing C-c C-c in the dynamic block recomputes it freshly.

In push mode, the source table drives the creation of derived tables. We specify the wanted results in #+ORGTBL: SEND directives (as many as desired):

#+ORGTBL: SEND derived1 orgtbl-to-aggregated-table :cols "vmean(Level) vsum(Quantity)"
#+ORGTBL: SEND derived2 orgtbl-to-aggregated-table :cols "Day vmean(Level) vsum(Quantity)"
| Day       | Color | Level | Quantity |
|-----------+-------+-------+----------|
| Monday    | Red   |    30 |       11 |
| Monday    | Blue  |    25 |        3 |
| Thuesday  | Red   |    51 |       12 |
| Thuesday  | Red   |    45 |       15 |
| Thuesday  | Blue  |    33 |       18 |
| Wednesday | Red   |    27 |       23 |
| Wednesday | Blue  |    12 |       16 |
| Wednesday | Blue  |    15 |       15 |
| Turdsday  | Red   |    39 |       24 |
| Turdsday  | Red   |    41 |       29 |
| Turdsday  | Red   |    49 |       30 |
| Friday    | Blue  |     7 |        5 |
| Friday    | Blue  |     6 |        8 |
| Friday    | Blue  |    11 |        9 |

We must create the receiving blocks somewhere else in the same file:

#+BEGIN RECEIVE ORGTBL derived1
#+END RECEIVE ORGTBL derived1
#+BEGIN RECEIVE ORGTBL derived2
#+END RECEIVE ORGTBL derived2

Then we come back to the source table and type C-c C-c with the cursor on the 1st pipe of the table, to refresh the derived tables:

#+BEGIN RECEIVE ORGTBL derived1
|  vmean(Level) | vsum(Quantity) |
|---------------+----------------|
| 27.9285714286 |            218 |
#+END RECEIVE ORGTBL derived1
#+BEGIN RECEIVE ORGTBL derived2
| Day       | vmean(Level) | vsum(Quantity) |
|-----------+--------------+----------------|
| Monday    |         27.5 |             14 |
| Thuesday  |           43 |             45 |
| Wednesday |           18 |             54 |
| Turdsday  |           43 |             83 |
| Friday    |            8 |             22 |
#+END RECEIVE ORGTBL derived2

Pull & push modes use the same engine in the background. Thus, using either is just a matter of convenience.

Symbolic computation

The computations are based on Calc, which is a symbolic calculator. Thus, symbolic computations are built-in. Example:

This is the source table:

#+NAME: symtable
| Day       | Color |  Level | Quantity |
|-----------+-------+--------+----------|
| Monday    | Red   |   30+x |     11+a |
| Monday    | Blue  | 25+3*x |        3 |
| Thuesday  | Red   | 51+2*x |       12 |
| Thuesday  | Red   |   45-x |       15 |
| Thuesday  | Blue  |     33 |       18 |
| Wednesday | Red   |     27 |       23 |
| Wednesday | Blue  |   12+x |       16 |
| Wednesday | Blue  |     15 |   15-6*a |
| Turdsday  | Red   |     39 |   24-5*a |
| Turdsday  | Red   |     41 |       29 |
| Turdsday  | Red   |   49+x |   30+9*a |
| Friday    | Blue  |      7 |      5+a |
| Friday    | Blue  |      6 |        8 |
| Friday    | Blue  |     11 |        9 |

And here is the aggregated, symbolic result:

#+BEGIN: aggregate :table "symtable" :cols "Day vmean(Level) vsum(Quantity)"
| Day       | vmean(Level)          | vsum(Quantity) |
|-----------+-----------------------+----------------|
| Monday    | 2. x + 27.5           | a + 14         |
| Thuesday  | 0.333333333334 x + 43 | 45             |
| Wednesday | x / 3 + 18            | 54 - 6 a       |
| Turdsday  | x / 3 + 43.           | 4 a + 83       |
| Friday    | 8                     | a + 22         |
#+END

Symbolic calculations are correctly performed on x and a, which are symbolic (as opposed to numeric) expressions.

Note that if there are empty cells in the input, they will be changed to nan not a number, and the whole aggregation will yield nan. This is probably not the expected result. The N modifier (see paragraph “modifiers and formatters”) won’t help, because even though it will replace empty cells with zero, it will do the same for anything which does not look like a number. The best is to just avoid empty cells when dealing with symbolic calculations.

Correlation of two columns

Some aggregations work on two columns (rather than one column for vsum(), vmean()). Those aggregations are vcov(,), vpcov(,), vcorr(,).

  • vcorr(,) computes the linear correlation between two columns.
  • vcov(,) and vpcov(,) compute the covariance of two columns.

Example. We create a table where column y is a noisy version of column x:

#+TBLNAME: noisydata
| bin   |  x |       y |
|-------+----+---------|
| small |  1 |  10.454 |
| small |  2 |  21.856 |
| small |  3 |  30.678 |
| small |  4 |  41.392 |
| small |  5 |  51.554 |
| large |  6 |  61.824 |
| large |  7 |  71.538 |
| large |  8 |  80.476 |
| large |  9 |  90.066 |
| large | 10 | 101.070 |
| large | 11 | 111.748 |
| large | 12 | 121.084 |
#+TBLFM: $3=$2*10+random(1000)/500;%.3f
#+BEGIN: aggregate :table noisydata :cols "bin vcorr(x,y) vcov(x,y) vpcov(x,y)"
| bin   |     vcorr(x,y) |     vcov(x,y) |    vpcov(x,y) |
|-------+----------------+---------------+---------------|
| small | 0.999459736649 |        25.434 |       20.3472 |
| large | 0.999542438688 | 46.4656666667 | 39.8277142857 |
#+END

We see that the correlation between x and y is very close to 1, meaning that both columns are correlated. Indeed they are, as the y is computed from x with the formula y = 10*x + noise_between_0_and_2.

Dates

Some (limited) aggregation is possible on dates.

Example. Here is a source table containing dates:

#+tblname: datetable
| Date                    |
|-------------------------|
| [2013-12-22 dim. 09:01] |
| [2013-11-23 sam. 13:04] |
| [2011-09-24 sam. 13:54] |
| [2013-09-25 mer. 03:54] |
| [2014-02-26 mer. 16:11] |
| [2014-01-18 sam. 03:51] |
| [2013-12-25 mer. 00:00] |
| [2012-12-25 mar. 00:00] |
#+BEGIN: aggregate :table datetable :cols "vmin(Date) vmax(Date) vmean(Date)"
| vmin(Date)                 | vmax(Date)                 |   vmean(Date) |
|----------------------------+----------------------------+---------------|
| <2011-09-24 sat. 13:54:00> | <2014-02-26 wed. 16:11:00> | 735074.937066 |
#+END:

Empty and malformed input cells

The input table may contain malformed mathematical text. For instance, a cell containing 5+ is malformed, because an expression is missing after the + symbol. In this case, the value will be replaced by error(2, '"Expected a number") which will appear in the aggregated table, signaling the problem.

An input cell may be empty. In this case, it may be ignored or converted to zero, depending on modifier flags E and N.

The empty cells treatment

  • makes no difference for vsum and count.
  • may result in zero for prod,
  • change vmean result,
  • change vmin and vmax, a possibly empty list of values resulting in inf or -inf

Some aggregation functions operate on two columns. If the two columns have empty values at different locations, then they should be interpreted as zero with the NE modifier, othervise the result will be inconsistent.

Expressions that can be specified

Virtually any Calc formula can be specified as an aggregation formula.

Single column name (as they appear in the header of the source table, or in the form of $1, $2, …, or the virtual column hline) are key columns. Everything else is given to Calc, to be computed as an aggregation.

For instance:

(3)                        ;; a constant
vmean(2*X+1)               ;; aggregate a complex expression
exp(vmean(map(log,N)))     ;; the exponential average
vsum((X-vmean(X))^2)       ;; X-vmean(X) centers the sample on zero

Arguably, the first expression is useless, but legal. The aggregation can apply to a computed list of values. The result of an aggregation can be further processed in a formula. An aggreagation can even apply to an expression containing another aggregation.

In an expression, if a variable has the name of a column, then it is replaced by a Calc vector containing values from this column.

The special expression (C) (a column name within parenthesis) yields a list of values to be aggregated from this column, except they are not aggregated. Note that parenthesis are required, otherwise, C would act as a key grouping column.

Modifiers and formatters

An expression may optionally be followed by modifiers and formatters, after a semicolon. Examples:

vsum(X);p20    ;; increase Calc internal precision to 20 digits
vsum(X);f3     ;; output the result with 3 digits after the decimal dot
vsum(X);%.3f   ;; output the result with 3 digits after the decimal dot

The modifiers and formatters are fully compatible with those of the Org spreadsheet.

Post-aggregation spreadsheet formulas

Additional columns can be specified for the resulting table. With a previous example, adding a :formula parameter, we specify a new column $4 which uses aggregated columns. It is translated into a usual #+TBLFM: spreadsheet line.

#+BEGIN: aggregate :table original :cols "Day vmean(Level) vsum(Quantity)" :formula "$4=$2*$3"
| Day       | vmean(Level) | vsum(Quantity) |      |
|-----------+--------------+----------------+------|
| Monday    |         27.5 |             14 | 385. |
| Thuesday  |           43 |             45 | 1935 |
| Wednesday |           18 |             54 |  972 |
| Turdsday  |           43 |             83 | 3569 |
| Friday    |            8 |             22 |  176 |
#+TBLFM: $4=$2*$3
#+END:

Moreover, if a #+TBLFM: was already there, it survives aggregation re-computations.

This happens in pull mode only.

This feature may be used to change the headers. Usually, the aggregated table headers are the aggragation formulas. To change that, enter a cell formula, which will end up in the #+TBLFM: tag. Example:

#+BEGIN: aggregate :table original :cols "vsum(Quantity)" :formula "@1$1=SQuantity"
| SQuantity |
|-----------|
|       218 |
#+TBLFM: @1$1=SQuantity
#+END:

Chaining

The result of an aggregation may become the source of further processing. To do that, just add a #+NAME: or #+TBLNAME: line just above the aggragated table. Here is an example of a double aggregation:

#+NAME: squantity
#+BEGIN: aggregate :table original :cols "Day vsum(Quantity)"
| Day       | SQuantity |
|-----------+-----------|
| Monday    |        14 |
| Thuesday  |        45 |
| Wednesday |        54 |
| Turdsday  |        83 |
| Friday    |        22 |
#+TBLFM: @1$2=SQuantity
#+END:

#+BEGIN: aggregate :table "squantity" :cols "vsum(SQuantity)"
| vsum(SQuantity) |
|-----------------|
|             218 |
#+END:

Note the spreadsheet cell formula @1$2=SQuantity, which changes the column heading from it default vsum(Quantity) to SQuantity. This new heading will survive any refresh.

Installation

Emacs package on Melpa: add the following lignes to your .emacs file, and reload it.

(add-to-list 'package-archives '("melpa" . "http://melpa.milkbox.net/packages/") t)
(package-initialize)

Then browse the list of available packages and install orgtbl-aggregate

M-x package-list-packages

Alternatively, you can download the lisp files, and load them:

(load-file "orgtbl-aggregate.el")
(load-file "org-insert-dblock.el") ;; optional, extends C-c C-c i

Author, contributors

Authors

  • Thierry Banel, tbanelwebmin at free dot fr, first version, implementations.
  • Michael Brand, Calc unleached, #+TBLFM survival, empty input cells, formatters.

Contributors

  • Eric Abrahamsen, non-ascii column names

Changes

  • Wizard now correctly ask for columns with $1, $2... names when table header is missing
  • Handle tables begining with hlines
  • Handle non-ascii column names
  • :formula parameter and #+TBLFM survival
  • Empty cells are ignored.
  • Empty output upon too small input set
  • Fix ordering of output values
  • Aggregations formulas may now be arbitrary expressions
  • Table headers (and the lack of) are better handled
  • Modifiers and formatters can now be specified as in the spreadsheet
  • Aggregation function names can optionally have a leading v, like sum & vsum
  • Increased performance on large data sets
  • Tables can be named with #+NAME: besides #+TBLNAME:
  • Document Melpa installation

About

Aggregates tables in Org mode

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Emacs Lisp 100.0%