Aggregating a table is creating a new table by computing sums, averages, and so on, out of material from the first table.
This is the second version, with:
- more flexibility for the formulas,
- modifiers and formatters with the same meanings as in the spreadsheet,
- increased performance on large datasets.
A very simple example. We have a table of activities and quantities (whatever they are) over several days.
#+NAME: original | Day | Color | Level | Quantity | |-----------+-------+-------+----------| | Monday | Red | 30 | 11 | | Monday | Blue | 25 | 3 | | Thuesday | Red | 51 | 12 | | Thuesday | Red | 45 | 15 | | Thuesday | Blue | 33 | 18 | | Wednesday | Red | 27 | 23 | | Wednesday | Blue | 12 | 16 | | Wednesday | Blue | 15 | 15 | | Turdsday | Red | 39 | 24 | | Turdsday | Red | 41 | 29 | | Turdsday | Red | 49 | 30 | | Friday | Blue | 7 | 5 | | Friday | Blue | 6 | 8 | | Friday | Blue | 11 | 9 |
We want to aggregate this table for each day (because several rows
exist for each day). First we give a name to the table through the
#+NAME:
or #+TBLNAME:
tags, just above the table. Then we create a
dynamic block to receive the aggregation:
#+BEGIN: aggregate :table original :cols "Day vmean(Level) vsum(Quantity)" #+END
Now typing C-c C-c
in the dynamic block computes the aggregation:
#+BEGIN: aggregate :table original :cols "Day vmean(Level) vsum(Quantity)" | Day | mean(Level) | sum(Quantity) | |-----------+-------------+---------------| | Monday | 27.5 | 14 | | Thuesday | 43 | 45 | | Wednesday | 18 | 54 | | Turdsday | 43 | 83 | | Friday | 8 | 22 | #+END
The source table is not change in any way.
To get this result, we specified columns in this way, after the
:cols
parameter:
Day
: we got the same column as in the source table, except entries are not duplicated. HereDay
acts as a key grouping colum. We may specify as many key columns as we want just by naming them. We get only one aggregated row for each different combination of values of key grouping columns.vmean(Level)
: this instructs Org to compute the average of values found in theLevel
column, grouped by the sameDay
.vsum(Quantity)
: Org computes the sum of values found in theQuantity
column, one sum for eachDay
.
If you are familiar with SQL, you would get a similar result with the
GROUP BY
statement:
select Day, mean(Level), sum(Quantity)
from original
group by Day;
If you are familiar with the R statistical package, you would get a
similar result with factor
and aggregate
functions:
original <- the table as a data.frame
day_factor <- factor(original$Day)
aggregate (original$Level , list(Day=day_factor), mean)
aggregate (original$Quantity, list(Day=day_factor), sum )
Type C-c C-x i
to launch a wizard for creating new dynamic blocks.
Then answer aggregate
for the type of block, and follow the instructions.
(There are several other dynamic blocks that can be built this way:
columnview
, clocktable
, propview
, invoice
, transpose
, and any
future block).
This is given by the file org-insert-dblock.el
, which extends
the original C-c C-x i
key-binding to any dynamic block.
If for any reason you do not want to load this file, you can still
create dynamic aggregate blocks by typing
M-x org-dblock-write:aggregate
.
Maybe we are just interested in the sum of Quantities
, regardless of
Days
. We just type:
#+BEGIN: aggregate :table "original" :cols "vsum(Quantity)" | vsum(Quantity) | |----------------| | 218 | #+END
Or we may want to count the number of rows for each combination of
Day
and Color
:
#+BEGIN: aggregate :table "original" :cols "count() Day Color" | count() | Day | Color | |---------+-----------+-------| | 1 | Monday | Red | | 1 | Monday | Blue | | 2 | Thuesday | Red | | 1 | Thuesday | Blue | | 1 | Wednesday | Red | | 2 | Wednesday | Blue | | 3 | Turdsday | Red | | 3 | Friday | Blue | #+END
If we want to get measurments for Colors
rather than Days
, we
type:
#+BEGIN: aggregate :table "original" :cols "Color vmean(Level) vsum(Quantity)" | Color | vmean(Level) | vsum(Quantity) | |-------+---------------+----------------| | Red | 40.2857142857 | 144 | | Blue | 15.5714285714 | 74 | #+END
The :cols
parameter lists the columns of the resulting table. It
contains in any order, grouping key columns and aggregation formulas.
The names of the columns in the original table may be:
- the names as they appear in the header of the source table,
- or $1, $2, $3 and so on (as in spreadsheet formulas),
- additionaly, the special column
hline
is used to group parts of the source table separated by horizontal lines.
The :cols
parameter may be a string or a list of strings. Examples:
:cols "Day vmean(Level);f3 vsum(Quantity);f2" :cols ("Day" "vmean(Level);f3" "vsum(Quantity);f2")
If a single string is used, it is split by spaces. Thus, a given formula, including its semicolon and modifiers, must not contain any space. If spaces are required within a formula, then use the paranthesis list.
Grouping is done on columns of the source table acting as key columns. Just name the key columns.
Additionally, the hline
specification means that rows between two
horizontal lines should be grouped.
Key columns and hline
are used to group rows of the source
table with uniq combinations of those columns.
Here is a source table containing 3 blocks separated by horizontal lines:
#+NAME: originalhl | Color | Level | Quantity | |-------+-------+----------| | Red | 30 | 11 | | Blue | 25 | 3 | | Red | 51 | 12 | | Red | 45 | 15 | | Blue | 33 | 18 | |-------+-------+----------| | Red | 27 | 23 | | Blue | 12 | 16 | | Blue | 15 | 15 | | Red | 39 | 24 | | Red | 41 | 29 | |-------+-------+----------| | Red | 49 | 30 | | Blue | 7 | 5 | | Blue | 6 | 8 | | Blue | 11 | 9 |
And here is the aggregation by those 3 blocks:
#+BEGIN: aggregate :table originalhl :cols "hline vmean(Level) vsum(Quantity)" | hline | vmean(Level) | vsum(Quantity) | |-------+--------------+----------------| | 0 | 36.8 | 59 | | 1 | 26.8 | 107 | | 2 | 18.25 | 52 | #+END:
If we want additionnal details with the Color
column, we just name it:
#+BEGIN: aggregate :table originalhl :cols "hline Color vmean(Level) vsum(Quantity)" | hline | Color | vmean(Level) | vsum(Quantity) | |-------+-------+---------------+----------------| | 0 | Red | 42 | 38 | | 0 | Blue | 29 | 21 | | 1 | Red | 35.6666666667 | 76 | | 1 | Blue | 13.5 | 31 | | 2 | Red | 49 | 30 | | 2 | Blue | 8 | 22 | #+END:
Aggregation formulas are applied for each of those groupings, on the specified columns.
We saw examples with sum
, mean
, count
aggregations. There are
many other aggregations. They are based on functions provided by Calc:
count()
orvcount()
- in Calc:
`u #' (`calc-vector-count') [`vcount'])
- gives the number of elements in the group being aggregated;
this function may or may not take a column parameter;
with a parameter, empty cells are not counted
(except with the
E
modifier)..
- in Calc:
sum
orvsum
- in Calc:
`u +' (`calc-vector-sum') [`vsum']
- computes the sum of elements being aggregated
- in Calc:
max
orvmax
- in Calc:
`u X' (`calc-vector-max') [`vmax']
- gives the largest of the elements being aggregated
- in Calc:
min
orvmin
- in Calc:
`u N' (`calc-vector-min') [`vmin']
- gives the smallest of the elements being aggregated
- in Calc:
mean
orvmean
- in Calc:
`u M' (`calc-vector-mean') [`vmean']
- computes the average (arithmetic mean) of elements being aggregated
- in Calc:
meane
orvmeane
- in Calc:
`I u M' (`calc-vector-mean-error') [`vmeane']
- computes the average (as mean) along with the estimated error of elements being aggregated
- in Calc:
median
orvmedian
- in Calc:
`H u M' (`calc-vector-median') [`vmedian']
- computes the median of elements being aggregated, by taken the middle element after sorting them
- in Calc:
hmean
orvhmean
- in Calc:
`H I u M' (`calc-vector-harmonic-mean') [`vhmean']
- computes the harmonic mean of elements being aggregated
- in Calc:
gmean
orvgmean
- in Calc:
`u G' (`calc-vector-geometric-mean') [`vgmean']
- computes the geometric mean of elements being aggregated
- in Calc:
sdev
orvsdev
- in Calc:
`u S' (`calc-vector-sdev') [`vsdev']
- computes the standard deviation of elements being aggregated
- in Calc:
psdev
orvpsdev
- in Calc:
`I u S' (`calc-vector-pop-sdev') [`vpsdev']
- computes the population standard deviation (divide by N instead of N-1)
- in Calc:
pvar
orvvar
- in Calc:
`H u S' (`calc-vector-variance') [`vvar']
- computes the variance of elements being aggregated
- in Calc:
pcov
orvpcov
- in Calc:
`I u C' (`calc-vector-pop-covariance') [`vpcov']
- computes the population covariance of elements being aggregated from two columns (divides by N)
- in Calc:
cov
orvcov
- in Calc:
`u C' (`calc-vector-covariance') [`vcov']
- computes the sample covariance of elements being aggregated from two columns (divides by N-1)
- in Calc:
corr
orvcorr
- in Calc:
`H u C' (`calc-vector-correlation') [`vcorr']
- computes the linear correlation coefficient of elements being aggregated in two columns
- in Calc:
prod
orvprod
- in Calc:
`u *' (`calc-vector-product') [`vprod']
- computes the product of elements being aggregated
- in Calc:
list
or(X)
- gives the list of elements being aggregated, without aggregation
The aggregation functions may be written with or without a leading
v
. sum
and vsum
are equivalent. The v
form should be
prefered, as it is the one used in the Org table spreadsheet, and in
Calc. The non-v names may be dropped in the future.
This parameter is optional. If present, it specifies a lisp expression which tells whether or not a row should be kept. When the expression evaluates to nil, the row is discarded.
Examples of useful expressions includes:
:cond (equal Color "Red")
- to keep only rows where
Color
isRed
- to keep only rows where
:cond (> (string-to-number Quantity) 19)
- to keep only rows for which
Quantity
is more than19
- note the call to
string-to-number
; without this call,Quantity
would be used as a string
- to keep only rows for which
:cond (> (* (string-to-number Level) 2.5) (string-to-number Quantity))
- to keep only rows for which
2.5*Level > Quantity
- to keep only rows for which
Two modes are available: pull & push.
In the pull mode, we use so called “dynamic blocks”. The resulting table knows how to build itself. Example:
We have a source table which is unaware that it will be derived in an aggregated table:
#+NAME: source1 | Day | Color | Level | Quantity | |-----------+-------+-------+----------| | Monday | Red | 30 | 11 | | Monday | Blue | 25 | 3 | | Thuesday | Red | 51 | 12 | | Thuesday | Red | 45 | 15 | | Thuesday | Blue | 33 | 18 | | Wednesday | Red | 27 | 23 | | Wednesday | Blue | 12 | 16 | | Wednesday | Blue | 15 | 15 | | Turdsday | Red | 39 | 24 | | Turdsday | Red | 41 | 29 | | Turdsday | Red | 49 | 30 | | Friday | Blue | 7 | 5 | | Friday | Blue | 6 | 8 | | Friday | Blue | 11 | 9 |
We create somewhere else a dynamic block which carries the specification of the aggregation:
#+BEGIN: aggregate :table "source1" :cols "Day vmean(Level) vsum(Quantity)" | Day | vmean(Level) | vsum(Quantity) | |-----------+--------------+----------------| | Monday | 27.5 | 14 | | Thuesday | 43 | 45 | | Wednesday | 18 | 54 | | Turdsday | 43 | 83 | | Friday | 8 | 22 | #+END
Typing C-c C-c
in the dynamic block recomputes it freshly.
In push mode, the source table drives the creation of derived
tables. We specify the wanted results in #+ORGTBL: SEND
directives
(as many as desired):
#+ORGTBL: SEND derived1 orgtbl-to-aggregated-table :cols "vmean(Level) vsum(Quantity)" #+ORGTBL: SEND derived2 orgtbl-to-aggregated-table :cols "Day vmean(Level) vsum(Quantity)" | Day | Color | Level | Quantity | |-----------+-------+-------+----------| | Monday | Red | 30 | 11 | | Monday | Blue | 25 | 3 | | Thuesday | Red | 51 | 12 | | Thuesday | Red | 45 | 15 | | Thuesday | Blue | 33 | 18 | | Wednesday | Red | 27 | 23 | | Wednesday | Blue | 12 | 16 | | Wednesday | Blue | 15 | 15 | | Turdsday | Red | 39 | 24 | | Turdsday | Red | 41 | 29 | | Turdsday | Red | 49 | 30 | | Friday | Blue | 7 | 5 | | Friday | Blue | 6 | 8 | | Friday | Blue | 11 | 9 |
We must create the receiving blocks somewhere else in the same file:
#+BEGIN RECEIVE ORGTBL derived1 #+END RECEIVE ORGTBL derived1
#+BEGIN RECEIVE ORGTBL derived2 #+END RECEIVE ORGTBL derived2
Then we come back to the source table and type C-c C-c
with the
cursor on the 1st pipe of the table, to refresh the derived tables:
#+BEGIN RECEIVE ORGTBL derived1 | vmean(Level) | vsum(Quantity) | |---------------+----------------| | 27.9285714286 | 218 | #+END RECEIVE ORGTBL derived1
#+BEGIN RECEIVE ORGTBL derived2 | Day | vmean(Level) | vsum(Quantity) | |-----------+--------------+----------------| | Monday | 27.5 | 14 | | Thuesday | 43 | 45 | | Wednesday | 18 | 54 | | Turdsday | 43 | 83 | | Friday | 8 | 22 | #+END RECEIVE ORGTBL derived2
Pull & push modes use the same engine in the background. Thus, using either is just a matter of convenience.
The computations are based on Calc, which is a symbolic calculator. Thus, symbolic computations are built-in. Example:
This is the source table:
#+NAME: symtable | Day | Color | Level | Quantity | |-----------+-------+--------+----------| | Monday | Red | 30+x | 11+a | | Monday | Blue | 25+3*x | 3 | | Thuesday | Red | 51+2*x | 12 | | Thuesday | Red | 45-x | 15 | | Thuesday | Blue | 33 | 18 | | Wednesday | Red | 27 | 23 | | Wednesday | Blue | 12+x | 16 | | Wednesday | Blue | 15 | 15-6*a | | Turdsday | Red | 39 | 24-5*a | | Turdsday | Red | 41 | 29 | | Turdsday | Red | 49+x | 30+9*a | | Friday | Blue | 7 | 5+a | | Friday | Blue | 6 | 8 | | Friday | Blue | 11 | 9 |
And here is the aggregated, symbolic result:
#+BEGIN: aggregate :table "symtable" :cols "Day vmean(Level) vsum(Quantity)" | Day | vmean(Level) | vsum(Quantity) | |-----------+-----------------------+----------------| | Monday | 2. x + 27.5 | a + 14 | | Thuesday | 0.333333333334 x + 43 | 45 | | Wednesday | x / 3 + 18 | 54 - 6 a | | Turdsday | x / 3 + 43. | 4 a + 83 | | Friday | 8 | a + 22 | #+END
Symbolic calculations are correctly performed on x
and a
, which
are symbolic (as opposed to numeric) expressions.
Note that if there are empty cells in the input, they will be changed to
nan
not a number, and the whole aggregation will yield nan
. This is
probably not the expected result. The N
modifier (see paragraph “modifiers
and formatters”) won’t help, because even though it will replace empty cells
with zero, it will do the same for anything which does not look like a
number. The best is to just avoid empty cells when dealing with symbolic
calculations.
Some aggregations work on two columns (rather than one column for
vsum()
, vmean()
).
Those aggregations are vcov(,)
, vpcov(,)
, vcorr(,)
.
vcorr(,)
computes the linear correlation between two columns.vcov(,)
andvpcov(,)
compute the covariance of two columns.
Example. We create a table where column y
is a noisy version of
column x
:
#+TBLNAME: noisydata | bin | x | y | |-------+----+---------| | small | 1 | 10.454 | | small | 2 | 21.856 | | small | 3 | 30.678 | | small | 4 | 41.392 | | small | 5 | 51.554 | | large | 6 | 61.824 | | large | 7 | 71.538 | | large | 8 | 80.476 | | large | 9 | 90.066 | | large | 10 | 101.070 | | large | 11 | 111.748 | | large | 12 | 121.084 | #+TBLFM: $3=$2*10+random(1000)/500;%.3f
#+BEGIN: aggregate :table noisydata :cols "bin vcorr(x,y) vcov(x,y) vpcov(x,y)" | bin | vcorr(x,y) | vcov(x,y) | vpcov(x,y) | |-------+----------------+---------------+---------------| | small | 0.999459736649 | 25.434 | 20.3472 | | large | 0.999542438688 | 46.4656666667 | 39.8277142857 | #+END
We see that the correlation between x
and y
is very close to 1
,
meaning that both columns are correlated. Indeed they are, as the y
is computed from x
with the formula y = 10*x +
noise_between_0_and_2
.
Some (limited) aggregation is possible on dates.
Example. Here is a source table containing dates:
#+tblname: datetable | Date | |-------------------------| | [2013-12-22 dim. 09:01] | | [2013-11-23 sam. 13:04] | | [2011-09-24 sam. 13:54] | | [2013-09-25 mer. 03:54] | | [2014-02-26 mer. 16:11] | | [2014-01-18 sam. 03:51] | | [2013-12-25 mer. 00:00] | | [2012-12-25 mar. 00:00] |
#+BEGIN: aggregate :table datetable :cols "vmin(Date) vmax(Date) vmean(Date)" | vmin(Date) | vmax(Date) | vmean(Date) | |----------------------------+----------------------------+---------------| | <2011-09-24 sat. 13:54:00> | <2014-02-26 wed. 16:11:00> | 735074.937066 | #+END:
The input table may contain malformed mathematical text. For
instance, a cell containing 5+
is malformed, because an expression
is missing after the +
symbol. In this case, the value will be
replaced by error(2, '"Expected a number")
which will appear in the
aggregated table, signaling the problem.
An input cell may be empty. In this case, it may be ignored or
converted to zero, depending on modifier flags E
and N
.
The empty cells treatment
- makes no difference for
vsum
andcount
. - may result in zero for
prod
, - change
vmean
result, - change
vmin
andvmax
, a possibly empty list of values resulting ininf
or-inf
Some aggregation functions operate on two columns. If the two columns
have empty values at different locations, then they should be
interpreted as zero with the NE
modifier, othervise the result will
be inconsistent.
Virtually any Calc formula can be specified as an aggregation formula.
Single column name (as they appear in the header of the source table,
or in the form of $1
, $2
, …, or the virtual column hline
) are
key columns. Everything else is given to Calc, to be computed as an
aggregation.
For instance:
(3) ;; a constant vmean(2*X+1) ;; aggregate a complex expression exp(vmean(map(log,N))) ;; the exponential average vsum((X-vmean(X))^2) ;; X-vmean(X) centers the sample on zero
Arguably, the first expression is useless, but legal. The aggregation can apply to a computed list of values. The result of an aggregation can be further processed in a formula. An aggreagation can even apply to an expression containing another aggregation.
In an expression, if a variable has the name of a column, then it is replaced by a Calc vector containing values from this column.
The special expression (C)
(a column name within parenthesis)
yields a list of values to be aggregated from this column, except they
are not aggregated. Note that parenthesis are required, otherwise, C
would act as a key grouping column.
An expression may optionally be followed by modifiers and formatters, after a semicolon. Examples:
vsum(X);p20 ;; increase Calc internal precision to 20 digits vsum(X);f3 ;; output the result with 3 digits after the decimal dot vsum(X);%.3f ;; output the result with 3 digits after the decimal dot
The modifiers and formatters are fully compatible with those of the Org spreadsheet.
Additional columns can be specified for the resulting table. With a
previous example, adding a :formula
parameter, we specify a new
column $4
which uses aggregated columns. It is translated into a
usual #+TBLFM:
spreadsheet line.
#+BEGIN: aggregate :table original :cols "Day vmean(Level) vsum(Quantity)" :formula "$4=$2*$3" | Day | vmean(Level) | vsum(Quantity) | | |-----------+--------------+----------------+------| | Monday | 27.5 | 14 | 385. | | Thuesday | 43 | 45 | 1935 | | Wednesday | 18 | 54 | 972 | | Turdsday | 43 | 83 | 3569 | | Friday | 8 | 22 | 176 | #+TBLFM: $4=$2*$3 #+END:
Moreover, if a #+TBLFM:
was already there, it survives aggregation re-computations.
This happens in pull mode only.
This feature may be used to change the headers. Usually, the aggregated
table headers are the aggragation formulas. To change that, enter a cell
formula, which will end up in the #+TBLFM:
tag. Example:
#+BEGIN: aggregate :table original :cols "vsum(Quantity)" :formula "@1$1=SQuantity" | SQuantity | |-----------| | 218 | #+TBLFM: @1$1=SQuantity #+END:
The result of an aggregation may become the source of further
processing. To do that, just add a #+NAME:
or #+TBLNAME:
line
just above the aggragated table. Here is an example of a double
aggregation:
#+NAME: squantity #+BEGIN: aggregate :table original :cols "Day vsum(Quantity)" | Day | SQuantity | |-----------+-----------| | Monday | 14 | | Thuesday | 45 | | Wednesday | 54 | | Turdsday | 83 | | Friday | 22 | #+TBLFM: @1$2=SQuantity #+END: #+BEGIN: aggregate :table "squantity" :cols "vsum(SQuantity)" | vsum(SQuantity) | |-----------------| | 218 | #+END:
Note the spreadsheet cell formula @1$2=SQuantity
, which changes the
column heading from it default vsum(Quantity)
to SQuantity
. This
new heading will survive any refresh.
Emacs package on Melpa: add the following lignes to your .emacs
file, and reload it.
(add-to-list 'package-archives '("melpa" . "http://melpa.milkbox.net/packages/") t) (package-initialize)
Then browse the list of available packages and install orgtbl-aggregate
M-x package-list-packages
Alternatively, you can download the lisp files, and load them:
(load-file "orgtbl-aggregate.el") (load-file "org-insert-dblock.el") ;; optional, extends C-c C-c i
Authors
- Thierry Banel, tbanelwebmin at free dot fr, first version, implementations.
- Michael Brand, Calc unleached,
#+TBLFM
survival, empty input cells, formatters.
Contributors
- Eric Abrahamsen, non-ascii column names
- Wizard now correctly ask for columns with
$1, $2...
names when table header is missing - Handle tables begining with hlines
- Handle non-ascii column names
:formula
parameter and#+TBLFM
survival- Empty cells are ignored.
- Empty output upon too small input set
- Fix ordering of output values
- Aggregations formulas may now be arbitrary expressions
- Table headers (and the lack of) are better handled
- Modifiers and formatters can now be specified as in the spreadsheet
- Aggregation function names can optionally have a leading
v
, likesum
&vsum
- Increased performance on large data sets
- Tables can be named with
#+NAME:
besides#+TBLNAME:
- Document Melpa installation