Skip to content

Commit

Permalink
Documentation for dimension stripping.
Browse files Browse the repository at this point in the history
  • Loading branch information
jeromekelleher committed Aug 22, 2019
1 parent 59ceea2 commit c8cb6d6
Show file tree
Hide file tree
Showing 3 changed files with 317 additions and 18 deletions.
53 changes: 52 additions & 1 deletion docs/examples.py
Original file line number Diff line number Diff line change
Expand Up @@ -175,7 +175,58 @@ def missing_data():
tree = ts.first()
tree.draw("_static/missing_data1.svg")


def stats():
ts = msprime.simulate(
10**4, Ne=10**4, recombination_rate=1e-8, mutation_rate=1e-8, length=10**7,
random_seed=42)
print("num_trees = ", ts.num_trees, ", num_sites = ", ts.num_sites, sep="")

x = ts.diversity()
print("Average diversity per unit sequence length = {:.3G}".format(x))

windows = np.linspace(0, ts.sequence_length, num=5)
x = ts.diversity(windows=windows)
print(windows)
print(x)

A = ts.samples()[:100]
x = ts.diversity(sample_sets=A)
print(x)

B = ts.samples()[100:200]
C = ts.samples()[200:300]
x = ts.diversity(sample_sets=[A, B, C])
print(x)

x = ts.diversity(sample_sets=[A, B, C], windows=windows)
print("shape = ", x.shape)
print(x)

A = ts.samples()[:100]
B = ts.samples()[:100]
x = ts.divergence([A, B])
print(x)

x = ts.divergence([A, B], windows=windows)
print(x)

x = ts.divergence([A, B, C], indexes=[(0, 1), (0, 2)])
print(x)

x = ts.divergence([A, B, C], indexes=(0, 1))
print(x)

x = ts.divergence([A, B, C], indexes=[(0, 1)])
print(x)

x = ts.divergence([A, B, C], indexes=[(0, 1), (0, 2)], windows=windows)
print(x)

# moving_along_tree_sequence()
# parsimony()
# allele_frequency_spectra()
missing_data()
# missing_data()

stats()

78 changes: 62 additions & 16 deletions docs/stats.rst
Original file line number Diff line number Diff line change
Expand Up @@ -14,9 +14,8 @@ and of the underlying trees (whose definition does not depend on the genome sequ
Furthermore, these statistics have a common interface to easily compute
(a) averaged statistics in windows along the genome,
and (b) summary statistics of many combinations of sets of samples simultaneously.
All methods return numpy arrays,
whose rows correspond to the windows along the genome,
and whose remaining dimensions are determined by the statistic.
All methods return numpy arrays whose dimensions are
determined by the parameters (see :ref:`sec_general_stats_output_dimensions`).

.. warning:: :ref:`sec_data_model_missing_data` is not currently
handled correctly by site statistics defined here, as we always
Expand Down Expand Up @@ -70,12 +69,6 @@ e.g., the sites of the SNPs.
Windowing
*********

By default, statistics

``windows = None``
This is the default, and equivalent to passing ``windows = [0.0, ts.sequence_length]``.
The output will still be a two-dimensional array, but with only one row.

Each statistic has an argument, ``windows``,
which defines a collection of contiguous windows along the genome.
If ``windows`` is a list of ``n+1`` increasing numbers between 0 and the ``sequence_length``,
Expand All @@ -95,6 +88,14 @@ obtaining ``((b - a) * S[0] + (c - b) * S[1]) / (c - a)``.

There are some shortcuts to other useful options:

``windows = None``
This is the default and computes statistics in single window over the whole
sequence. As the first returned array contains only a single
value, we drop this dimension as described in the :ref:`output dimensions
<sec_general_stats_output_dimensions>` section. **Note:** if you really do
want to have an array with a single value as the result, please use
``windows = [0.0, ts.sequence_length]``.

``windows = "trees"``
This says that you want statistics computed separately on the portion of the genome
spanned by each tree, so is equivalent to passing ``windows = ts.breakpoints()``.
Expand Down Expand Up @@ -191,7 +192,9 @@ Here are some additional special cases:
If the statistic takes ``k`` inputs for ``k > 1``,
and there are exactly ``k`` lists in ``sample_sets``,
then this will compute just one statistic, and is equivalent to passing
``indexes = [(0, 1, ..., k-1)]``.
``indexes = (0, 1, ..., k-1)``. Note that this also drops the last
dimension of the output, as described in the :ref:`sec_general_stats_output_dimensions`
section.
If there are not exactly ``k`` sample sets, this will throw an error.

``k=1`` does not allow ``indexes``:
Expand All @@ -203,16 +206,16 @@ Here are some additional special cases:
were that allowed.)


.. _sec_general_stats_output:
.. _sec_general_stats_output_format:

******
Output
******
*************
Output format
*************

Each of the statistics methods returns a ``numpy`` ndarray.
Suppose that the output is name ``out``.
In all cases, the number of rows of the output is equal to the number of windows,
so that ``out.shape[0]`` is equal to ``len(windows) - 1``
If ``windows`` has been specified, the number of rows of the output is equal to the
number of windows, so that ``out.shape[0]`` is equal to ``len(windows) - 1``
and ``out[i]`` is an array of statistics describing the portion of the tree sequence
from ``windows[i]`` to ``windows[i + 1]`` (including the left but not the right endpoint).

Expand All @@ -229,6 +232,9 @@ from ``windows[i]`` to ``windows[i + 1]`` (including the left but not the right

The final dimension of the arrays in other cases is specified by the method.

Note, however, that empty dimensions can optionally be dropped,
as described in the :ref:`sec_general_stats_output_dimensions` section.

A note about **default values** and **division by zero**:
Under the hood, statistics computation fills in zeros everywhere, then updates these
(since statistics are all **additive**, this makes sense).
Expand All @@ -243,6 +249,46 @@ For ``branch`` statistics, any windows with **no branches** will similarly remai
That said, you should **not** rely on the specific behavior of whether ``0`` or ``nan`` is returned
for "empty" cases like these: it is subject to change.

.. _sec_general_stats_output_dimensions:

*****************
Output dimensions
*****************

In the general case, tskit outputs two dimensional (or three dimensional, in the case of node
stats) numpy arrays, as described in the :ref:`sec_general_stats_output_format` section.
The first dimension corresponds to the window along the genome
such that for some result array ``x``, ``x[j]`` contains information about the jth window.
The last dimension corresponds to the statistics being computed, so that ``x[j, k]`` is the
value of the kth statistic in the jth window (in the two dimensional case). This is
a powerful and general interface, but in many cases we will not use this full generality
and the extra dimensions in the numpy arrays are inconvenient.

Tskit optionally removes empty dimensions from the output arrays following a few
simple rules.

1. If ``windows`` is None we are computing over the single window covering the
full sequence. We therefore drop the first dimension of the array.

2. In one-way stats, if the ``sample_sets`` argument is a 1D array we interpret
this as specifying a single sample set (and therefore a single statistic), and
drop the last dimension of the output array. If ``sample_sets`` is None
(the default), we use the sample set ``ts.samples()``, invoking
this rule (we therefore drop the last dimension by default).

3. In k-way stats, if the ``indexes`` argument is a 1D array of length k
we intepret this as specifying a single statistic and drop the last
dimension of the array. If ``indexes`` is None (the default) and
there are k sample sets, we compute the statistic from these sample sets
and drop the last dimension.

Rules 2 and 3 can be summarised by "the dimensions of the input determines
the dimensions of the output". Note that dropping these dimensions is
**optional**: it is always possible to keep the full dimensions of the
output arrays.

Please see the :ref:`tutorial <sec_tutorial_stats>` for examples of the
various output dimension options.

********************
Available statistics
Expand Down
Loading

0 comments on commit c8cb6d6

Please sign in to comment.