diff --git a/docs/getting_started/static/output_13_0.png b/docs/getting_started/static/output_13_0.png new file mode 100644 index 00000000..d24cee8a Binary files /dev/null and b/docs/getting_started/static/output_13_0.png differ diff --git a/docs/getting_started/static/output_16_0.png b/docs/getting_started/static/output_16_0.png new file mode 100644 index 00000000..ed5e5182 Binary files /dev/null and b/docs/getting_started/static/output_16_0.png differ diff --git a/docs/getting_started/static/output_19_0.png b/docs/getting_started/static/output_19_0.png new file mode 100644 index 00000000..701b300c Binary files /dev/null and b/docs/getting_started/static/output_19_0.png differ diff --git a/docs/getting_started/static/output_22_0.png b/docs/getting_started/static/output_22_0.png new file mode 100644 index 00000000..0b2ac5fb Binary files /dev/null and b/docs/getting_started/static/output_22_0.png differ diff --git a/docs/getting_started/static/output_27_0.png b/docs/getting_started/static/output_27_0.png new file mode 100644 index 00000000..02ef235b Binary files /dev/null and b/docs/getting_started/static/output_27_0.png differ diff --git a/docs/getting_started/static/output_30_0.png b/docs/getting_started/static/output_30_0.png new file mode 100644 index 00000000..6e1a5f8d Binary files /dev/null and b/docs/getting_started/static/output_30_0.png differ diff --git a/docs/getting_started/static/output_33_0.png b/docs/getting_started/static/output_33_0.png new file mode 100644 index 00000000..39228816 Binary files /dev/null and b/docs/getting_started/static/output_33_0.png differ diff --git a/docs/getting_started/static/output_36_0.png b/docs/getting_started/static/output_36_0.png new file mode 100644 index 00000000..678bb654 Binary files /dev/null and b/docs/getting_started/static/output_36_0.png differ diff --git a/docs/getting_started/static/output_6_0.png b/docs/getting_started/static/output_6_0.png new file mode 100644 index 00000000..4e6ec6db Binary files /dev/null and b/docs/getting_started/static/output_6_0.png differ diff --git a/docs/getting_started/static/output_9_0.png b/docs/getting_started/static/output_9_0.png new file mode 100644 index 00000000..6eee5a50 Binary files /dev/null and b/docs/getting_started/static/output_9_0.png differ diff --git a/docs/getting_started/theory b/docs/getting_started/theory new file mode 100644 index 00000000..e69de29b diff --git a/docs/getting_started/theory.rst b/docs/getting_started/theory.rst new file mode 100644 index 00000000..6141a80d --- /dev/null +++ b/docs/getting_started/theory.rst @@ -0,0 +1,1095 @@ +Theory of CSEP Tests +==================== + +This page describes the theory of each of the forecast tests +included in pyCSEP along with working code examples. You will find +information on the goals of each test, the theory behind the tests, how +the tests are applied in practice, and how forecasts are ‘scored’ given +the test results. Also, we include the code required to run in each test +and a description of how to interpret the test results. + +.. 
code:: ipython3 + + import csep + from csep.core import ( + regions, + catalog_evaluations, + poisson_evaluations as poisson + ) + from csep.utils import ( + datasets, + time_utils, + comcat, + plots, + readers + ) + + # Filters matplotlib warnings + import warnings + warnings.filterwarnings('ignore') + +Grid-based Forecast Tests +------------------------- + +These tests are designed for grid-based forecasts (e.g., Schorlemmer et +al., 2007), where expected rates are provided in discrete Poisson +space-magnitude cells covering the region of interest. The region +:math:`\boldsymbol{R}` is then the product of the spatial rate +:math:`\boldsymbol{S}` and the binned magnitude rate +:math:`\boldsymbol{M}`, + +.. math:: \boldsymbol{R} = \boldsymbol{M} \times \boldsymbol{S}. + +A forecast :math:`\boldsymbol{\Lambda}` can be fully specified as the +expected number of events (or rate) in each space-magnitude bin +(:math:`m_i, s_j`) covering the region :math:`\boldsymbol{R}` and +therefore can be written as + +.. math:: \boldsymbol{\Lambda} = \{ \lambda_{m_i, s_j}| m_i \in \boldsymbol{M}, s_j \in \boldsymbol{S} \}, + +where :math:`\lambda_{m_i, s_j}` is the expected rate of events in +magnitude bin :math:`m_i` and spatial bin :math:`s_j`. The observed +catalogue of events :math:`\boldsymbol{\Omega}` we use to evaluate the +forecast is similarly discretised into the same space-magnitude bins, +and can be described as + +.. math:: \boldsymbol{\Omega} = \{ \omega_{m_i, s_j}| m_i \in \boldsymbol{M}, s_j \in \boldsymbol{S} \}, + +where :math:`\omega_{m_i, s_j}` is the observed number of +events in spatial cell :math:`s_j` and magnitude bin :math:`m_i`. The +magnitude bins are specified in the forecast: typically these are in 0.1 +increments and this is the case in the examples we use here. These +examples use the Helmstetter et al (2007) smoothed seismicity forecast +(including aftershocks), testing over a 5 year period between 2010 and +2015. + +.. code:: ipython3 + + # Set up experiment parameters + start_date = time_utils.strptime_to_utc_datetime('2010-01-01 00:00:00.0') + end_date = time_utils.strptime_to_utc_datetime('2015-01-01 00:00:00.0') + + # Loads from the PyCSEP package + helmstetter = csep.load_gridded_forecast( + datasets.helmstetter_aftershock_fname, + start_date=start_date, + end_date=end_date, + name='helmstetter_aftershock' + ) + + # Set up evaluation catalog + catalog = csep.query_comcat(helmstetter.start_time, helmstetter.end_time, + min_magnitude=helmstetter.min_magnitude) + + # Filter evaluation catalog + catalog = catalog.filter_spatial(helmstetter.region) + + # Add seed for reproducibility in simulations + seed = 123456 + + # Number of simulations for Poisson consistency tests + nsim = 100000 + + +.. parsed-literal:: + + Fetched ComCat catalog in 5.9399449825286865 seconds. + + Downloaded catalog from ComCat with following parameters + Start Date: 2010-01-10 00:27:39.320000+00:00 + End Date: 2014-08-24 10:20:44.070000+00:00 + Min Latitude: 31.9788333 and Max Latitude: 41.1431667 + Min Longitude: -125.3308333 and Max Longitude: -115.0481667 + Min Magnitude: 4.96 + Found 24 events in the ComCat catalog. + + +Consistency tests +~~~~~~~~~~~~~~~~~ + +The consistency tests evaluate the consistency of a forecast against +observed earthquakes. These tests were developed across a range of +experiments and publications (Schorlemmer et al, 2007; Zechar et al +2010; Werner et al, 2011a). 
The consistency tests are based on the
likelihood of observing the catalogue (actual recorded events) given the
forecast. Since the space-magnitude bins are assumed to be independent,
the joint-likelihood of observing the events in each individual bin
given the specified forecast can be written as

.. math:: Pr(\omega_1 | \lambda_1) Pr(\omega_2 | \lambda_2)...Pr(\omega_n | \lambda_n) = \prod_{m_i , s_j \in \boldsymbol{R}} f_{m_i, s_j}(\omega(m_i, s_j)),

where :math:`f_{m_i, s_j}` specifies the probability distribution in
each space-magnitude bin. We prefer to use the joint log-likelihood in
order to sum log-likelihoods rather than multiply the likelihoods. The
joint log-likelihood can be written as:

.. math:: L(\boldsymbol{\Omega} | \boldsymbol{\Lambda}) = \sum_{m_i , s_j \in \boldsymbol{R}} \log\big(f_{m_i, s_j}(\omega(m_i, s_j))\big).

The log-likelihood of the observations, :math:`\boldsymbol{\Omega}`, given
the forecast :math:`\boldsymbol{\Lambda}` is therefore the sum over all
space-magnitude bins of the log probabilities in the individual cells of
the forecast. Grid-based forecasts are specified by the expected number of
events in each discrete space-magnitude bin. From the maximum entropy
principle, we assign a Poisson distribution in each bin. In this case,
the probability of an event occurring is independent of the time since
the last event, and events occur at a rate :math:`\lambda`. The
Poissonian joint log-likelihood can be written as

.. math:: L(\boldsymbol{\Omega} | \boldsymbol{\Lambda}) = \sum_{m_i , s_j \in \boldsymbol{R}} -\lambda(m_i, s_j) + \omega(m_i, s_j)\log(\lambda(m_i, s_j)) - \log(\omega(m_i, s_j)!),

where :math:`\lambda(m_i, s_j)` and :math:`\omega(m_i, s_j)` are the
expected counts from the forecast and observed counts in cell
:math:`m_i, s_j` respectively. We can calculate the likelihood directly
given the forecast and discretised observations.

Forecast uncertainty

A simulation-based approach is used to account for uncertainty in the
forecast. We simulate realizations of catalogs that are consistent with
the forecast to obtain distributions of scores. In the pyCSEP package,
as in the original CSEP tests, simulation is carried out using the
cumulative probability density of the forecast obtained by ordering the
rates in each bin. We shall call :math:`F_{m_is_j}` the cumulative
probability density in cell :math:`(m_i, s_j)`. The simulation approach
then works as follows:

- For each event to be simulated, draw a random number :math:`z` from a
  uniform distribution between 0 and 1
- Assign this event to a space-magnitude bin through the inverse
  cumulative density distribution at this point,
  :math:`F^{-1}_{m_i, s_j}(z)`
- Iterate over all simulated events to generate a catalog containing
  :math:`N_{sim}` events consistent with the forecast

For each of these tests, we can plot the distribution of likelihoods
computed from these simulated catalogs relative to the observations
using the ``plots.plot_poisson_consistency_test`` function. We also
calculate a quantile score that diagnoses how consistent a particular
forecast is with the observations. The number of simulations can be
supplied to the Poisson consistency test functions using the
``num_simulations`` argument: for best results we suggest 100,000
simulations to ensure convergence.

Scoring the tests

Through simulation (as described above), we obtain a set of simulated
catalogs :math:`\{\hat{\boldsymbol{\Omega}}\}`. Each catalogue can be
written as

.. math:: \hat{\boldsymbol{\Omega}}_x =\{ \hat{\omega}_x(m_i, s_j)|(m_i, s_j) \in \boldsymbol{R}\},

where :math:`\hat{\omega}_x(m_i, s_j)` is the number of
simulated earthquakes in cell :math:`(m_i, s_j)` of (simulated) catalog
:math:`x` that is consistent with the forecast :math:`\Lambda`. We then
compute the joint log-likelihood for each simulated catalogue,
:math:`\hat{L}_x = L(\hat{\Omega}_x|\Lambda)`. The joint log-likelihood
for each simulated catalogue given the forecast gives us a set of
log-likelihoods :math:`\{\hat{\boldsymbol{L}}\}` that represents the
range of log-likelihoods consistent with the forecast. We then compare
our simulated log-likelihoods with the observed log-likelihood
:math:`L_{obs} = L(\boldsymbol{\Omega}|\boldsymbol{\Lambda})` using a
quantile score.

The quantile score is defined by the fraction of simulated joint
log-likelihoods less than or equal to the observed likelihood:

.. math:: \gamma = \frac{ |\{ \hat{L}_x | \hat{L}_x \le L_{obs}\} |}{|\{ \hat{\boldsymbol{L}} \}|}

Whether a forecast can be said to pass an evaluation depends on the
significance level chosen for the testing process. The quantile score
explicitly tells us something about the significance of the result: the
observation is consistent with the forecast with :math:`100(1-\gamma)\%`
confidence (Zechar, 2011). Low :math:`\gamma` values demonstrate that
the observed likelihood score is lower than the scores of most of the
simulated catalogs. The consistency tests, excluding the N-test, are
considered to be one-sided tests: values which are too small are ruled
inconsistent with the forecast, but very large values may not
necessarily be inconsistent with the forecast, and additional testing
should be used to further clarify this (Schorlemmer et al, 2007).

Different CSEP experiments have used different sensitivity values.
Schorlemmer et al (2010b) consider :math:`\gamma < 0.05` while the
implementation in the Italian CSEP testing experiment uses
:math:`\gamma < 0.01` (Taroni et al, 2018). However, the consistency
tests are most useful as diagnostic tools, where the quantile score
assesses the level of consistency between the observations and the
forecast. Temporal variations in seismicity make it difficult to
formally reject a model from a consistency test over a single
evaluation period.

Likelihood-test (L-test)
^^^^^^^^^^^^^^^^^^^^^^^^

Aim: Evaluate the likelihood of observed events given the provided
forecast - this includes the rate, spatial distribution and magnitude
components of the forecast.

Method: The L-test is one of the original forecast tests described in
Schorlemmer et al, 2007. The likelihood of the observation given the
model is described by a Poisson likelihood function in each cell, and
the total joint likelihood is described by the product over all bins, or
equivalently the sum of the log-likelihoods (see above, or Zechar 2011
for more details).

Note: The likelihood scores are dominated by the rate-component of the
forecast. This causes issues in scoring forecasts where the expected
number of events is different from the observed number of events. We
suggest using the N-test and CL-test (both described below) to
independently score the rate component and the space-magnitude
component of the forecast. This behavior can be observed by comparing
the CL-test and N-test results with the L-test results in this notebook.
Since the
forecast overpredicts the rate of events during this testing period, the
L-test provides a passing score even though the space-magnitude and rate
components perform poorly during this evaluation period.

pyCSEP implementation

pyCSEP uses the forecast and catalog and returns the test distribution,
observed statistic and quantile score, which can be accessed from the
``likelihood_test_result`` object. We can pass this directly to the
plotting function, specifying that the test should be one-sided.

.. code:: ipython3

    likelihood_test_result = poisson.likelihood_test(
        helmstetter,
        catalog,
        seed=seed,
        num_simulations=nsim
    )
    ax = plots.plot_poisson_consistency_test(
        likelihood_test_result,
        one_sided_lower=True,
        plot_args={'title': r'$\mathcal{L}-\mathrm{test}$', 'xlabel': 'Log-likelihood'}
    )

.. image:: output_6_0.png

pyCSEP plots the resulting :math:`95\%` range of likelihoods returned by
the simulation with the black bar by default. The observed likelihood
score is shown by a green square where the forecast passes the test and
a red circle where the observed likelihood is outside the likelihood
distribution.

CL-test
^^^^^^^

Aim: The original likelihood test described above gives a result that
combines the spatial, magnitude and number components of a forecast. The
conditional likelihood or CL-test was developed to test the spatial and
magnitude performance of a forecast without the influence of the number
of events (Werner et al. 2011a, 2011b). By conditioning the test
distribution on the observed number of events we eliminate the
dependency on the forecasted number of events, as described above.

Method: The CL-test is computed in the same way as the L-test, but with
the number of events normalised to the observed catalog :math:`N_{obs}`
during the simulation stage. The quantile score is then calculated
similarly such that

.. math:: \gamma_{CL} = \frac{ |\{ \hat{CL}_x | \hat{CL}_x \le CL_{obs}\} |}{|\{ \hat{\boldsymbol{CL}} \}|}.

Implementation in pyCSEP

.. code:: ipython3

    cond_likelihood_test_result = poisson.conditional_likelihood_test(
        helmstetter,
        catalog,
        seed=seed,
        num_simulations=nsim
    )
    ax = plots.plot_poisson_consistency_test(
        cond_likelihood_test_result,
        one_sided_lower=True,
        plot_args={'title': r'$CL-\mathrm{test}$', 'xlabel': 'conditional log-likelihood'}
    )

.. image:: output_9_0.png

Again, the :math:`95\%` confidence range of likelihoods is shown by the
black bar, and the symbol reflects the observed conditional-likelihood
score. In this case, the observed conditional-likelihood is shown with
the red circle, which falls outside the range of likelihoods simulated
from the forecast. To understand why the L- and CL-tests give different
results, consider the results of the N-test and S-test in the following
sections.

N-test
^^^^^^

Aim: The number or N-test is the most conceptually simple test of a
forecast: to test whether the number of observed events is consistent
with that of the forecast.

Method: The original N-test was introduced by Schorlemmer et al (2007)
and modified by Zechar et al (2010). The observed number of events is
given by

.. math:: N_{obs} = \sum_{m_i, s_j \in R} \omega(m_i, s_j).

Using the simulations described above, the expected number of events is
calculated by summing the simulated number of events over all grid cells

.. math:: \hat{N_x} = \sum_{m_i, s_j \in R} \hat{\omega}_x(m_i, s_j),

where :math:`\hat{\omega}_x(m_i, s_j)` is the simulated number of events
in catalog :math:`x` in spatial cell :math:`s_j` and magnitude cell
:math:`m_i`, generating a set of simulated event counts
:math:`\{ \hat{N} \}`. We can then calculate the probability of
i) observing at least :math:`N_{obs}` events and ii) observing at most
:math:`N_{obs}` events. These probabilities can be written as:

.. math:: \delta_1 = \frac{ |\{ \hat{N_x} | \hat{N_x} \ge N_{obs}\} |}{|\{ \hat{N} \}|}

and

.. math:: \delta_2 = \frac{ |\{ \hat{N_x} | \hat{N_x} \le N_{obs}\} |}{|\{ \hat{N} \}|}

If a forecast is Poisson, the number of events in the forecast follows a
Poisson distribution with expectation
:math:`N_{fore} = \sum_{m_i, s_j \in R} \lambda(m_i, s_j)`. The
cumulative distribution is then a Poisson cumulative distribution:

.. math:: F(x|N_{fore}) = \exp(-N_{fore}) \sum^{x}_{i=0} \frac{(N_{fore})^i}{i!}

which can be used directly without the need for simulations. The N-test
quantile scores are then

.. math:: \delta_1 = 1 - F((N_{obs}-1)|N_{fore}),

and

.. math:: \delta_2 = F(N_{obs}|N_{fore}).

The original N-test considered only :math:`\delta_2` and its complement
:math:`1-\delta_2`, which effectively tested the probability of at most
:math:`N_{obs}` events and more than :math:`N_{obs}` events. Very small
or very large values (< 0.025 or > 0.975 respectively) were considered
to be inconsistent with the forecast in Schorlemmer et al (2010).
However, the approach above aims to test something subtly different,
that is, at least :math:`N_{obs}` events and at most :math:`N_{obs}`
events. Zechar et al (2010a) recommend testing both :math:`\delta_1` and
:math:`\delta_2` with an effective significance of half the required
significance level, so for a required significance level of 0.05, a
forecast is consistent if both :math:`\delta_1` and :math:`\delta_2` are
greater than 0.025. A very small :math:`\delta_1` suggests the forecast
rate is too low, while a very small :math:`\delta_2` suggests a rate
which is too high to be consistent with observations.

Implementation in pyCSEP

pyCSEP uses the Zechar et al (2010) version of the N-test and the
cumulative Poisson approach to estimate the range of expected events
from the forecasts, so does not implement a simulation in this case. The
upper and lower bounds for the test are determined from the cumulative
Poisson distribution. ``number_test_result.quantile`` will return both
:math:`\delta_1` and :math:`\delta_2` values.

.. code:: ipython3

    number_test_result = poisson.number_test(helmstetter, catalog)
    ax = plots.plot_poisson_consistency_test(
        number_test_result,
        plot_args={'xlabel':'Number of events'}
    )

.. image:: output_13_0.png

In this case, the black bar shows the :math:`95\%` interval for the
number of events in the forecast. The actual observed number of events
is shown by the green box: the forecast generally expects more events
than are observed in practice, but the observed number falls just within
the lower limits of what is expected, so the forecast (just!) passes the
N-test.
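
As a rough illustration of the cumulative-Poisson calculation underlying
the N-test, the sketch below computes :math:`\delta_1` and
:math:`\delta_2` directly with ``scipy``. The forecast rate and observed
count are made-up values for illustration only, not numbers taken from
the experiment above; ``poisson.number_test`` performs this calculation
for you.

.. code:: python

    from scipy.stats import poisson as poisson_dist

    # Illustrative values only (not taken from the Helmstetter experiment)
    n_fore = 30.0  # total forecast rate: sum of expected counts over all bins
    n_obs = 24     # observed number of events

    # Probability of observing at least n_obs events
    delta_1 = 1.0 - poisson_dist.cdf(n_obs - 1, n_fore)
    # Probability of observing at most n_obs events
    delta_2 = poisson_dist.cdf(n_obs, n_fore)

    # With a 0.05 significance level, the forecast passes the N-test if both
    # quantiles exceed 0.025 (half the required significance level)
    passes_n_test = (delta_1 > 0.025) and (delta_2 > 0.025)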

M-test
^^^^^^

Aim: Establish consistency (or lack thereof) of observed event
magnitudes with forecast magnitudes.

Method: The M-test is first described in Zechar et al. (2010) and aims
to isolate the magnitude component of a forecast. To do this, we sum
over the spatial bins and normalise so that the total number of forecast
events matches the observed number.

.. math:: \boldsymbol{\Omega}^m = \{\omega^{m}(m_i)| m_i \in \boldsymbol{M}\},

where

.. math:: \omega^m(m_i) = \sum_{s_j \in \boldsymbol{S}} \omega(m_i, s_j),

and

.. math:: \boldsymbol{\Lambda}^m = \{ \lambda^m(m_i)| m_i \in \boldsymbol{M} \},

where

.. math:: \lambda^m(m_i) = \frac{N_{obs}}{N_{fore}}\sum_{s_j \in \boldsymbol{S}} \lambda(m_i, s_j).

Then we compute the joint log-likelihood as we did for the L-test:

.. math:: M = L(\boldsymbol{\Omega}^m | \boldsymbol{\Lambda}^m)

We then wish to compare this with the distribution of simulated
log-likelihoods, this time keeping the number of events fixed to
:math:`N_{obs}`. Then for each simulated catalogue,
:math:`\hat{M}_x = L(\hat{\boldsymbol{\Omega}}^m | \boldsymbol{\Lambda}^m)`.

Quantile score: The final test statistic is again the fraction of
simulated log-likelihoods less than or equal to the observed
log-likelihood:

.. math:: \kappa = \frac{ |\{ \hat{M_x} | \hat{M_x} \le M\} |}{|\{ \hat{M} \}|}

and the observed magnitudes are inconsistent with the forecast if
:math:`\kappa` is less than the significance level.

pyCSEP implementation

.. code:: ipython3

    mag_test_result = poisson.magnitude_test(
        helmstetter,
        catalog,
        seed=seed,
        num_simulations=nsim
    )
    ax = plots.plot_poisson_consistency_test(
        mag_test_result,
        one_sided_lower=True,
        plot_args={'xlabel':'Normalized likelihood'}
    )

.. image:: output_16_0.png

In this example, the forecast passes the M-test, demonstrating that the
magnitude distribution in the forecast is consistent with observed
events. This is shown by the green square marking the joint
log-likelihood for the observed events.

S-test
^^^^^^

Aim: The spatial or S-test aims to establish consistency (or lack
thereof) of observed event locations with a forecast. It is originally
defined in Zechar et al (2010).

Method: Similar to the M-test, but in this case we sum over all
magnitude bins.

.. math:: \boldsymbol{\Omega}^s = \{\omega^s(s_j)| s_j \in \boldsymbol{S}\},

where

.. math:: \omega^s(s_j) = \sum_{m_i \in \boldsymbol{M}} \omega(m_i, s_j),

and

.. math:: \boldsymbol{\Lambda}^s = \{ \lambda^s(s_j)| s_j \in \boldsymbol{S} \},

where

.. math:: \lambda^s(s_j) = \frac{N_{obs}}{N_{fore}}\sum_{m_i \in \boldsymbol{M}} \lambda(m_i, s_j).

Then we compute the joint log-likelihood as we did for the L-test or the
M-test:

.. math:: S = L(\boldsymbol{\Omega}^s | \boldsymbol{\Lambda}^s)

We then wish to compare this with the distribution of simulated
log-likelihoods, this time keeping the number of events fixed to
:math:`N_{obs}`. Then for each simulated catalogue,
:math:`\hat{S}_x = L(\hat{\boldsymbol{\Omega}}^s | \boldsymbol{\Lambda}^s)`.

The final test statistic is again the fraction of simulated
log-likelihoods less than or equal to the observed log-likelihood:

.. math:: \zeta = \frac{ |\{ \hat{S_x} | \hat{S_x} \le S\} |}{|\{ \hat{S} \}|}

and again the distinction between a forecast passing or failing the test
depends on our significance level.

pyCSEP implementation

The S-test is again a one-sided test, so we specify this when plotting
the result.
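
To make the normalisation step concrete, here is a minimal sketch of how
the magnitude and spatial rates could be aggregated and rescaled by
:math:`N_{obs}/N_{fore}`. The rate array below is a small made-up grid
rather than the Helmstetter forecast; pyCSEP carries this step out
internally in the ``poisson.magnitude_test`` and ``poisson.spatial_test``
functions used in this notebook.

.. code:: python

    import numpy as np

    # Hypothetical forecast rates on a (magnitude bins x spatial cells) grid
    rates = np.array([[0.020, 0.010, 0.005],
                      [0.008, 0.004, 0.002]])
    n_obs = 24  # illustrative observed event count

    n_fore = rates.sum()
    # M-test rates: sum over spatial cells, rescaled to the observed total
    lambda_m = (n_obs / n_fore) * rates.sum(axis=1)
    # S-test rates: sum over magnitude bins, rescaled to the observed total
    lambda_s = (n_obs / n_fore) * rates.sum(axis=0)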

.. code:: ipython3

    spatial_test_result = poisson.spatial_test(
        helmstetter,
        catalog,
        seed=seed,
        num_simulations=nsim
    )
    ax = plots.plot_poisson_consistency_test(
        spatial_test_result,
        one_sided_lower=True,
        plot_args={'xlabel':'normalized spatial likelihood'}
    )

.. image:: output_19_0.png

The Helmstetter model fails the S-test as the observed spatial
likelihood falls in the tail of the simulated likelihood distribution.
Again, this is shown by a coloured symbol which highlights whether the
forecast model passes or fails the test.

Forecast comparison tests
~~~~~~~~~~~~~~~~~~~~~~~~~

The consistency tests above check whether a forecast is consistent with
observations, but do not provide a straightforward way to compare two
different forecasts. Several suggested approaches focus on the
information gain of one forecast relative to another (Harte and
Vere-Jones 2005, Imoto and Hurukawa, 2006, Imoto and Rhoades, 2010,
Rhoades et al 2011). The T-test and W-test implementations for
earthquake forecast comparison are first described in Rhoades et
al. (2011).

The information gain per earthquake (IGPE) of model A compared to model
B is defined by :math:`I_{N}(A, B) = R/N`, where :math:`R` is the
rate-corrected log-likelihood ratio of models A and B given by

.. math:: R = \sum_{i=1}^{N}\big(\log\lambda_A(k_i) - \log \lambda_B(k_i)\big) - \big(\hat{N}_A - \hat{N}_B\big)

If we set :math:`X_i=\log\lambda_A(k_i)` and
:math:`Y_i=\log\lambda_B(k_i)` then we can define the information gain
per earthquake (IGPE) as

.. math:: I_N(A, B) = \frac{1}{N}\sum^N_{i=1}\big(X_i - Y_i\big) - \frac{\hat{N}_A - \hat{N}_B}{N}

If :math:`I_N(A, B)` differs significantly from 0, the model with the
lower likelihood can be rejected in favour of the other.

t-test

If :math:`X_i - Y_i` are independent and come from the same normal
population with mean :math:`\mu` then we can use the classic paired
t-test to evaluate the null hypothesis that
:math:`\mu = (\hat{N}_A - \hat{N}_B)/N` against the alternative
hypothesis :math:`\mu \ne (\hat{N}_A - \hat{N}_B)/N`. To implement this,
we let :math:`s^2` denote the sample variance of :math:`(X_i - Y_i)`,
such that

.. math:: s^2 = \frac{1}{N-1}\sum^N_{i=1}\big(X_i - Y_i\big)^2 - \frac{1}{N^2 - N}\bigg(\sum^N_{i=1}\big(X_i - Y_i\big)\bigg)^2

Under the null hypothesis
:math:`T = I_N(A, B)\big/\big(s/\sqrt{N}\big)` has a t-distribution with
:math:`N-1` degrees of freedom and the null hypothesis can be rejected
if :math:`|T|` exceeds a critical value of the :math:`t_{N-1}`
distribution. The confidence intervals for
:math:`\mu - (\hat{N}_A - \hat{N}_B)/N` can then be constructed with the
form :math:`I_N(A,B) \pm ts/\sqrt{N}`, where :math:`t` is the
appropriate quantile of the :math:`t_{N-1}` distribution.

W-test

An alternative to the t-test is the Wilcoxon signed-rank test or W-test.
This is a non-parametric alternative to the t-test which can be used if
we do not feel the assumption of normally distributed differences in
:math:`X_i - Y_i` is valid. This assumption might be particularly poor
when we have small sample sizes. The W-test instead depends on the
(weaker) assumption that :math:`X_i - Y_i` is symmetric and tests
whether the median of :math:`X_i - Y_i` is equal to
:math:`(\hat{N}_A - \hat{N}_B)/N`. The W-test is less powerful than the
t-test for normally distributed differences and cannot reject the null
hypothesis (with :math:`95\%` confidence) for very small sample sizes
(:math:`N \leq 5`).

The t-test becomes more accurate as :math:`N \rightarrow \infty` due to
the central limit theorem and therefore the t-test is considered
dependable for large :math:`N`. Where :math:`N` is small, a model might
only be considered more informative if both the t- and W-test results
agree.
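
Before turning to the pyCSEP functions, the following sketch shows how
the IGPE and the paired-t statistic defined above could be computed from
per-event log-likelihood values. The arrays below are synthetic
stand-ins for :math:`X_i`, :math:`Y_i`, :math:`\hat{N}_A` and
:math:`\hat{N}_B`, not output from the models compared in this notebook.

.. code:: python

    import numpy as np
    from scipy.stats import t as t_dist

    # Synthetic per-event log rate densities for two hypothetical models A and B
    rng = np.random.default_rng(123)
    x = rng.normal(-4.0, 0.5, size=24)      # X_i = log lambda_A at event i
    y = x + rng.normal(-0.1, 0.3, size=24)  # Y_i = log lambda_B at event i
    n_hat_a, n_hat_b = 26.0, 22.0           # expected event counts of A and B
    n = x.size

    # Information gain per earthquake
    igpe = np.mean(x - y) - (n_hat_a - n_hat_b) / n

    # Paired t statistic with N-1 degrees of freedom and its two-sided p-value
    s = np.std(x - y, ddof=1)               # sample standard deviation of X_i - Y_i
    t_stat = igpe / (s / np.sqrt(n))
    p_value = 2.0 * t_dist.sf(abs(t_stat), df=n - 1)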

Implementation in pyCSEP

The t-test and W-test are implemented in pyCSEP as below.

.. code:: ipython3

    helmstetter_ms = csep.load_gridded_forecast(
        datasets.helmstetter_mainshock_fname,
        name="Helmstetter Mainshock"
    )

    t_test = poisson.paired_t_test(helmstetter, helmstetter_ms, catalog)
    w_test = poisson.w_test(helmstetter, helmstetter_ms, catalog)
    comp_args = {'title': 'Paired T-test Result',
                 'ylabel': 'Information gain',
                 'xlabel': '',
                 'xticklabels_rotation': 0,
                 'figsize': (6,4)}

    ax = plots.plot_comparison_test([t_test], [w_test], plot_args=comp_args)

.. image:: output_22_0.png

The first argument to the ``paired_t_test`` function is taken as model A
and the second as our baseline model, or model B. When plotting the
result, the horizontal dashed line indicates the performance of model B
and the vertical bar shows the confidence bounds on the information gain
:math:`I_N(A, B)` of model A relative to model B. In this case, the
model with aftershocks performs statistically worse than the benchmark
model. We note that this comparison is used for demonstration purposes
only.

Catalog-based forecast tests
----------------------------

Catalog-based forecast tests evaluate forecasts using simulated outputs
in the form of synthetic earthquake catalogs, removing the need for the
Poisson approximation and simulation procedure used with grid-based
forecasts. We know that observed seismicity is overdispersed with
respect to a Poissonian model due to spatio-temporal clustering, and
overdispersed models are more likely to be rejected by the original
Poisson-based CSEP tests (Werner et al, 2011a). This modification of the
testing framework allows for a broader range of forecast models. The
distribution of realizations is then compared with observations, as in
the grid-based case. These tests were developed by Savran et al (2020),
who applied them to test forecasts following the 2019 Ridgecrest
earthquake in Southern California.

In the following text, we show how catalog-based forecasts are defined.
Again we begin by defining a region :math:`\boldsymbol{R}` as a function
of some magnitude range :math:`\boldsymbol{M}`, spatial domain
:math:`\boldsymbol{S}` and time period :math:`\boldsymbol{T}`,

.. math:: \boldsymbol{R} = \boldsymbol{M} \times \boldsymbol{S} \times \boldsymbol{T}.

An earthquake :math:`e` can be described by a magnitude :math:`m_i` at
some location :math:`s_j` and time :math:`t_k`. A catalog is simply a
collection of earthquakes, thus the observed catalog can be written as

.. math:: \Omega = \{ e_n | n= 1...N_{obs}; e_n \in \boldsymbol{R} \},

and a forecast is then specified as a collection of synthetic catalogs
containing events :math:`\hat{e}_{nj}` in domain :math:`\boldsymbol{R}`,
as

.. math:: \boldsymbol{\Lambda} = \{ \Lambda_j | j = 1...J \}, \qquad \Lambda_j = \{\hat{e}_{nj} | n = 1... N_j; \hat{e}_{nj} \in \boldsymbol{R} \}.

That is, a forecast consists of :math:`J` simulated catalogs, each
containing :math:`N_j` events described in time, space and magnitude,
such that :math:`\hat{e}_{nj}` describes the :math:`n`\ th synthetic
event in the :math:`j`\ th synthetic catalog :math:`\Lambda_j`.

When using simulated forecasts in pyCSEP, we must first explicitly
define the forecast region by specifying the spatial domain and
magnitude range as below. In effect, these are filters applied to the
forecast and observations to retain only the events in
:math:`\boldsymbol{R}`. The examples in this section are catalog-based
forecast simulations for the Landers earthquake and aftershock sequence
generated using UCERF3-ETAS (Field et al, 2017).

.. code:: ipython3

    # Define the start and end times of the forecasts
    start_time = time_utils.strptime_to_utc_datetime("1992-06-28 11:57:35.0")
    end_time = time_utils.strptime_to_utc_datetime("1992-07-28 11:57:35.0")

    # Magnitude bins properties
    min_mw = 4.95
    max_mw = 8.95
    dmw = 0.1

    # Create space and magnitude regions.
    magnitudes = regions.magnitude_bins(min_mw, max_mw, dmw)
    region = regions.california_relm_region()
    space_magnitude_region = regions.create_space_magnitude_region(
        region,
        magnitudes
    )

    # Load forecast
    forecast = csep.load_catalog_forecast(
        datasets.ucerf3_ascii_format_landers_fname,
        start_time = start_time,
        end_time = end_time,
        region = space_magnitude_region,
        apply_filters = True
    )

    # Compute expected rates
    forecast.filters = [
        f'origin_time >= {forecast.start_epoch}',
        f'origin_time < {forecast.end_epoch}'
    ]
    _ = forecast.get_expected_rates(verbose=False)

    # Obtain Comcat catalog and filter to region
    comcat_catalog = csep.query_comcat(
        start_time,
        end_time,
        min_magnitude=forecast.min_magnitude
    )

    # Filter observed catalog using the same region as the forecast
    comcat_catalog = comcat_catalog.filter_spatial(forecast.region)

.. parsed-literal::

    Fetched ComCat catalog in 0.31937098503112793 seconds.

    Downloaded catalog from ComCat with following parameters
    Start Date: 1992-06-28 12:00:45+00:00
    End Date: 1992-07-24 18:14:36.250000+00:00
    Min Latitude: 33.901 and Max Latitude: 36.705
    Min Longitude: -118.067 and Max Longitude: -116.285
    Min Magnitude: 4.95
    Found 19 events in the ComCat catalog.

Number Test
~~~~~~~~~~~

Aim: As above, the number test aims to evaluate whether the number of
observed events is consistent with the forecast.

Method: The observed statistic in this case is given by
:math:`N_{obs} = |\Omega|`, which is simply the number of events in the
observed catalog. To build the test distribution from the forecast, we
count the number of events in each simulated catalog:

.. math:: N_{j} = |\Lambda_j|; j = 1...J

As in the gridded test above, we can then evaluate the probabilities of
at least and at most :math:`N_{obs}` events, in this case using the
empirical cumulative distribution function :math:`F_N`:

.. math:: \delta_1 = P(N_j \geq N_{obs}) = 1 - F_N(N_{obs}-1)

and

.. math:: \delta_2 = P(N_j \leq N_{obs}) = F_N(N_{obs})
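
Because the test distribution here is just the set of event counts from
the simulated catalogs, the two quantiles reduce to an empirical CDF
evaluation, as in the sketch below. The counts are invented for
illustration; ``catalog_evaluations.number_test`` computes these values
from the actual forecast.

.. code:: python

    import numpy as np

    # Hypothetical event counts from each simulated catalog in a forecast
    n_sim = np.array([12, 18, 25, 31, 22, 17, 28, 40, 15, 21])
    n_obs = 19  # illustrative observed count

    # Empirical probabilities of at least / at most n_obs events
    delta_1 = np.mean(n_sim >= n_obs)
    delta_2 = np.mean(n_sim <= n_obs)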

Implementation in pyCSEP

.. code:: ipython3

    number_test_result = catalog_evaluations.number_test(
        forecast,
        comcat_catalog,
        verbose=False
    )
    ax = number_test_result.plot()

.. image:: output_27_0.png

Plotting the number test result of a simulated catalog forecast displays
a histogram of the numbers of events :math:`\hat{N}_j` in each simulated
catalog :math:`j`, which make up the test distribution. The test
statistic is shown by the dashed line - in this case it is the number of
observed events in the catalog, :math:`N_{obs}`.

Magnitude Test
~~~~~~~~~~~~~~

Aim: The magnitude test aims to test the consistency of the observed
frequency-magnitude distribution with that in the simulated catalogs
that make up the forecast.

Method: The catalog-based magnitude test is implemented quite
differently from the grid-based equivalent. We first define the union
catalog :math:`\Lambda_U` as the union of all simulated catalogs in the
forecast. Formally:

.. math:: \Lambda_U = \lambda_1 \cup \lambda_2 \cup ... \cup \lambda_J,

so that the union catalog contains all events across all simulated
catalogs for a total of :math:`N_U = \sum_{j=1}^{J} |\lambda_j|` events.
We then compute the following histograms, discretised to the magnitude
range and magnitude step size specified earlier for pyCSEP:

1. the histogram of the union catalog magnitudes :math:`\Lambda_U^{(m)}`
2. histograms of the magnitudes in each of the individual simulated
   catalogs :math:`\lambda_j^{(m)}`
3. the histogram of the observed catalog magnitudes :math:`\Omega^{(m)}`

The histograms are normalized so that the total number of events across
all bins is equal to the observed number. The observed statistic is then
calculated as the sum of squared logarithmic residuals between the union
histogram, normalised to the observed number of events, and the observed
magnitude histogram. This statistic is related to the Cramér-von Mises
statistic.

.. math:: d_{obs}= \sum_{k}\Bigg(\log\Bigg[\frac{N_{obs}}{N_U} \Lambda_U^{(m)}(k) + 1\Bigg]- \log\Big[\Omega^{(m)}(k) + 1\Big]\Bigg)^2

where :math:`\Lambda_U^{(m)}(k)` and :math:`\Omega^{(m)}(k)` represent
the count in the :math:`k`\ th bin of the magnitude-frequency
distribution in the union and observed catalogs respectively. We add
unity to each bin to avoid :math:`\log(0)`. We then build the test
distribution from the catalogs in :math:`\boldsymbol{\Lambda}`:

.. math:: D_j = \sum_{k}\Bigg(\log\Bigg[\frac{N_{obs}}{N_U} \Lambda_U^{(m)}(k) + 1\Bigg]- \log\Bigg[\frac{N_{obs}}{N_j}\lambda_j^{(m)}(k) + 1\Bigg]\Bigg)^2; j= 1...J

where :math:`\lambda_j^{(m)}(k)` represents the count in the
:math:`k`\ th bin of the magnitude-frequency distribution of the
:math:`j`\ th catalog.

The quantile score can then be calculated using the empirical CDF such
that

.. math:: \gamma_m = F_D(d_{obs})= P(D_j \leq d_{obs})

Implementation in pyCSEP

Hopefully you now see why it was necessary to specify our magnitude
range explicitly when we set up the catalog-type testing - we need to
make sure the magnitudes are properly discretised for the model we want
to test.

.. code:: ipython3

    magnitude_test_result = catalog_evaluations.magnitude_test(
        forecast,
        comcat_catalog,
        verbose=False
    )
    ax = magnitude_test_result.plot(plot_args={'xy': (0.6,0.7)})

.. image:: output_30_0.png

The histogram shows the resulting test distribution with :math:`D_j`
calculated for each simulated catalog as described in the method above.
The test statistic :math:`\omega = d_{obs}` is shown with the dashed
vertical line. The quantile score for this forecast is
:math:`\gamma = 0.29`.
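
As a rough sketch of the statistic defined above, the snippet below
evaluates :math:`d_{obs}` from two toy magnitude histograms. The bin
counts are invented, and the test distribution :math:`D_j` would be
built in the same way from each simulated catalog's histogram;
``catalog_evaluations.magnitude_test`` handles all of this internally.

.. code:: python

    import numpy as np

    # Toy magnitude histograms on a shared set of bins (counts per bin)
    union_hist = np.array([4000.0, 2200.0, 1100.0, 500.0, 150.0])  # union catalog
    obs_hist = np.array([9.0, 5.0, 3.0, 1.0, 1.0])                 # observed catalog
    n_union = union_hist.sum()
    n_obs = obs_hist.sum()

    # Observed statistic: squared residuals between the log of the union
    # histogram (scaled to the observed event count) and the log of the
    # observed histogram, with unity added to each bin to avoid log(0)
    d_obs = np.sum((np.log(n_obs / n_union * union_hist + 1.0)
                    - np.log(obs_hist + 1.0)) ** 2)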

Pseudo-likelihood test
~~~~~~~~~~~~~~~~~~~~~~

Aim: The pseudo-likelihood test aims to evaluate the likelihood of a
forecast given an observed catalog.

Method: The pseudo-likelihood test has similar aims to the grid-based
likelihood test above, but its implementation differs in a few
significant ways. Firstly, it does not compute an actual likelihood
(hence the name pseudo-likelihood), and instead of aggregating over
cells as in the grid-based case, the pseudo-likelihood test aggregates
over target event likelihood scores (so a likelihood score per target
event, rather than a likelihood score per grid cell). The most important
difference, however, is that the pseudo-likelihood tests do not use a
Poisson likelihood.

The pseudo-likelihood approach is based on the continuous point process
likelihood function. A continuous marked space-time point process can be
specified by a conditional intensity function
:math:`\lambda(\boldsymbol{e}|H_t)`, in which :math:`H_t` describes the
history of the process in time. The log-likelihood function for any
point process in :math:`\boldsymbol{R}` is given by

.. math:: L = \sum_{i=1}^{N} \log \lambda(e_i|H_t) - \int_{\boldsymbol{R}}\lambda(\boldsymbol{e}|H_t)d\boldsymbol{R}

Not all models will have an explicit likelihood function, so instead we
approximate the expectation of :math:`\lambda(e|H_t)` using the forecast
catalogs. The approximate rate density is defined as the conditional
expectation, given a discretised region :math:`R_d`, of the continuous
rate

.. math:: \hat{\lambda}(\boldsymbol{e}|H_t) = E\big[\lambda(\boldsymbol{e}|H_t)|R_d\big]

We still regard the model as continuous, but the rate density is
approximated within a single cell. This is analogous to the gridded
approach where we count the number of events in discrete cells. The
pseudo-loglikelihood is then

.. math:: \hat{L} = \sum_{i=1}^N \log \hat{\lambda}(e_i|H_t) - \int_R \hat{\lambda}(\boldsymbol{e}|H_t) dR

and we can write the spatial approximate rate density as

.. math:: \hat{\lambda}_s(\boldsymbol{e}|H_t) = \sum_{\boldsymbol{M}} \hat{\lambda}(\boldsymbol{e}|H_t),

where we take the sum over all magnitude bins :math:`\boldsymbol{M}`. We
can calculate the observed pseudolikelihood as

.. math:: \hat{L}_{obs} = \sum_{i=1}^{N_{obs}} \log \hat{\lambda}_s(k_i) - \bar{N},

where :math:`\hat{\lambda}_s(k_i)` is the approximate rate density in
the spatial cell :math:`k_i` in which the :math:`i`\ th event occurs,
and :math:`\bar{N}` is the expected number of events in :math:`R_d`.
Similarly, we calculate the test distribution as

.. math:: \hat{L}_{j} = \Bigg[\sum_{i=1}^{N_{j}} \log\hat{\lambda}_s(k_{ij}) - \bar{N}\Bigg]; j = 1...J,

where :math:`\hat{\lambda}_s(k_{ij})` describes the approximate rate
density at the :math:`i`\ th event in the :math:`j`\ th catalog. We can
then calculate the quantile score as

.. math:: \gamma_L = F_L(\hat{L}_{obs})= P(\hat{L}_j \leq \hat{L}_{obs}).

Implementation in pyCSEP

.. code:: ipython3

    pseudolikelihood_test_result = catalog_evaluations.pseudolikelihood_test(
        forecast,
        comcat_catalog,
        verbose=False
    )
    ax = pseudolikelihood_test_result.plot()

.. image:: output_33_0.png

The histogram shows the test distribution of pseudolikelihood as
calculated above for each catalog :math:`j`. The dashed vertical line
shows the observed statistic :math:`\hat{L}_{obs} = \omega`. It is clear
that the observed statistic falls within the critical region of the test
distribution, as reflected in the quantile score of
:math:`\gamma_L = 0.02`.
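
To connect the expressions above to arrays, a minimal sketch of the
observed pseudolikelihood might look as follows. The rate densities and
event cell indices are invented, and :math:`\bar{N}` is taken here as
the sum of the per-cell expected counts;
``catalog_evaluations.pseudolikelihood_test`` computes the real
quantities from the forecast.

.. code:: python

    import numpy as np

    # Toy approximate rate densities per spatial cell (summed over magnitude bins)
    lambda_s = np.array([0.5, 1.2, 0.8, 0.3, 2.1])
    n_bar = lambda_s.sum()  # expected number of events in the discretised region

    # Spatial cell index k_i of each observed event (illustrative only)
    event_cells = np.array([1, 4, 4, 2, 0])

    # Observed pseudolikelihood: log rate density summed over target events,
    # minus the expected number of events
    l_obs = np.sum(np.log(lambda_s[event_cells])) - n_bar

    # gamma_L is the fraction of simulated catalogs whose pseudolikelihood
    # is less than or equal to l_obs (empirical CDF of the test distribution)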

Spatial test
~~~~~~~~~~~~

Aim: The spatial test again aims to isolate the spatial component of the
forecast and test the consistency of spatial rates with observed events.

Method: We perform the spatial test in the catalog-based approach in a
similar way to the grid-based spatial test: by normalising the
approximate rate density. In this case, we use the normalisation
:math:`\hat{\lambda}_s^* = \hat{\lambda}_s \big/ \sum_{\boldsymbol{R}} \hat{\lambda}_s`.
Then the observed spatial test statistic is calculated as

.. math:: S_{obs} = \Bigg[\sum_{i=1}^{N_{obs}} \log \hat{\lambda}_s^*(k_i)\Bigg]N_{obs}^{-1}

in which :math:`\hat{\lambda}_s^*(k_i)` is the normalised approximate
rate density in the spatial cell :math:`k_i` corresponding to the
:math:`i`\ th event in the observed catalog :math:`\Omega`. Similarly,
we define the test distribution using

.. math:: S_{j} = \bigg[\sum_{i=1}^{N_{j}} \log \hat{\lambda}_s^*(k_{ij})\bigg]N_{j}^{-1}; j= 1...J

for each catalog :math:`j`. Finally, the quantile score for the spatial
test is determined by once again comparing the observed and test
distribution statistics:

.. math:: \gamma_s = F_S(S_{obs}) = P(S_j \leq S_{obs})

Implementation in pyCSEP

.. code:: ipython3

    spatial_test_result = catalog_evaluations.spatial_test(
        forecast,
        comcat_catalog,
        verbose=False
    )
    ax = spatial_test_result.plot()

.. image:: output_36_0.png

The histogram shows the test distribution of normalised
pseudo-likelihood computed for each simulated catalog :math:`j`. The
dashed vertical line shows the observed test statistic
:math:`S_{obs} = \omega = -5.88`, which is clearly within the test
distribution. The quantile score :math:`\gamma_s = 0.36` is also printed
on the figure by default.

References
----------

Field, E. H., K. R. Milner, J. L. Hardebeck, M. T. Page, N. J. van der
Elst, T. H. Jordan, A. J. Michael, B. E. Shaw, and M. J. Werner (2017).
A spatiotemporal clustering model for the third Uniform California
Earthquake Rupture Forecast (UCERF3-ETAS): Toward an operational
earthquake forecast, Bull. Seismol. Soc. Am. 107, 1049–1081.

Harte, D., and D. Vere-Jones (2005), The entropy score and its uses in
earthquake forecasting, Pure Appl. Geophys. 162, 6-7, 1229-1253, DOI:
10.1007/s00024-004-2667-2.

Helmstetter, A., Y. Y. Kagan, and D. D. Jackson (2006). Comparison of
short-term and time-independent earthquake forecast models for southern
California, Bulletin of the Seismological Society of America 96, 90-106.

Imoto, M., and N. Hurukawa (2006), Assessing potential seismic activity
in Vrancea, Romania, using a stress-release model, Earth Planets Space
58, 1511-1514.

Imoto, M., and D.A. Rhoades (2010), Seismicity models of moderate
earthquakes in Kanto, Japan utilizing multiple predictive parameters,
Pure Appl. Geophys. 167, 6-7, 831-843, DOI: 10.1007/s00024-010-0066-4.

Rhoades, D.A., D. Schorlemmer, M.C. Gerstenberger, A. Christophersen,
J.D. Zechar and M. Imoto (2011), Efficient testing of earthquake
forecasting models, Acta Geophysica 59.

Savran, W., M. J. Werner, W. Marzocchi, D. Rhoades, D. D. Jackson, K. R.
Milner, E. H. Field, and A. J. Michael (2020). Pseudoprospective
evaluation of UCERF3-ETAS forecasts during the 2019 Ridgecrest Sequence,
Bulletin of the Seismological Society of America.
+ +Schorlemmer, D., and M.C. Gerstenberger (2007), RELM testing center, +Seismol. Res. Lett. 78, 30–36. + +Schorlemmer, D., M.C. Gerstenberger, S. Wiemer, D.D. Jackson, and D.A. +Rhoades (2007), Earthquake likelihood model testing, Seismol. Res. Lett. +78, 17–29. + +Schorlemmer, D., A. Christophersen, A. Rovida, F. Mele, M. Stucci and W. +Marzocchi (2010a). Setting up an earthquake forecast experiment in +Italy, Annals of Geophysics, 53, no.3 + +Schorlemmer, D., J.D. Zechar, M.J. Werner, E.H. Field, D.D. Jackson, and +T.H. Jordan (2010b), First results of the Regional Earthquake Likelihood +Models experiment, Pure Appl. Geophys., 167, 8/9, +doi:10.1007/s00024-010-0081-5. + +M. Taroni, W. Marzocchi, D. Schorlemmer, M. J. Werner, S. Wiemer, J. D. +Zechar, L. Heiniger, F. Euchner; Prospective CSEP Evaluation of 1‐Day, +3‐Month, and 5‐Yr Earthquake Forecasts for Italy. Seismological Research +Letters 2018;; 89 (4): 1251–1261. doi: +https://doi.org/10.1785/0220180031 + +Werner, M. J., A. Helmstetter, D. D. Jackson, and Y. Y. Kagan (2011a). +High-Resolution Long-Term and Short-Term Earthquake Forecasts for +California, Bulletin of the Seismological Society of America 101 +1630-1648 + +Werner, M.J. J.D. Zechar, W. Marzocchi, and S. Wiemer (2011b), +Retrospective evaluation of the five-year and ten-year CSEP-Italy +earthquake forecasts, Annals of Geophysics 53, no. 3, 11–30, +doi:10.4401/ag-4840. + +Zechar, 2011: Evaluating earthquake predictions and earthquake +forecasts: a guide for students and new researchers, CORSSA +(http://www.corssa.org/en/articles/theme_6/) + +Zechar, J.D., M.C. Gerstenberger, and D.A. Rhoades (2010a), +Likelihood-based tests for evaluating space-rate-magnitude forecasts, +Bull. Seis. Soc. Am., 100(3), 1184—1195, doi:10.1785/0120090192. + +Zechar, J.D., D. Schorlemmer, M. Liukis, J. Yu, F. Euchner, P.J. +Maechling, and T.H. Jordan (2010b), The Collaboratory for the Study of +Earthquake Predictability perspective on computational earthquake +science, Concurr. Comp-Pract. E., doi:10.1002/cpe.1519. diff --git a/docs/index.rst b/docs/index.rst index 83e37b76..3a91d53d 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -8,6 +8,7 @@ pyCSEP: Tools for Earthquake Forecast Developers getting_started/installing getting_started/core_concepts + getting_started/theory .. toctree:: :maxdepth: 2