Gin (Gaussian process Inference)

Gin is a collection of pure R tools for generating and manipulating Gaussian process (GP) models. It is also a nice spirit.

Outline

The current version includes the top-level functions:

  • gp_sim - generate N pseudo-random GP realisations.

  • gp_conditional - compute the mean and covariance of a GP given data points.

  • gp_fit - fit the (hyper-)parameters of the autocovariance function (ACV).

  • plot_ts - plot a time series data.frame.

  • plot_snake - plot a 'snake' representing the mean and std.dev of a GP.

  • ts_load - load a plain text data file into an array ready for analysis.

And some lower-level functions that do the heavy lifting:

  • gp_logLikelihood - compute Gaussian log likelihood given data and model.

  • gp_logPosterior - compute log posterior, given log likelihood and prior.

  • gp_lagMatrix - compute the matrix of lags tau_ij = |t_i - t_j| (a minimal sketch of this quantity is shown below).
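
As an aside, the quantity gp_lagMatrix computes is easy to sketch with base R's outer(); this is only an illustration of the lag matrix, not the package's own implementation:

   # illustrate the lag matrix tau_ij = |t_i - t_j| using base R
   times <- c(0, 1, 3)
   tau <- abs(outer(times, times, "-"))
   tau
   #      [,1] [,2] [,3]
   # [1,]    0    1    3
   # [2,]    1    0    2
   # [3,]    3    2    0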

Installation

Gin is an R package but is still in development. To install it from GitHub, first install (if you haven't already) Hadley Wickham's devtools package:

   install.packages("devtools")

Now you can install gin straight from GitHub:

   devtools::install_github("svdataman/gin")

Now you're good to go.

Simulation

First we must define an ACV function. This is the exponential ACV of a damped random walk (DRW) process:

   # define an ACV function: acv(tau) = A * exp(-tau / l)
   # theta[1] = A (amplitude), theta[2] = l (decay timescale)
   acv <- function(theta, tau) {
     A <- abs(theta[1])
     l <- abs(theta[2])
     acov <- A * exp(-(tau / l))
     return(acov)
   }

This accepts a vector of parameters (theta) and a matrix or vector of lags (tau), and returns the ACV value at each lag.
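
For example, calling the function directly with A = 1 and l = 2 (here theta[1] = A and theta[2] = l) gives an ACV that decays from 1 at zero lag; the values below are illustrative:

   # evaluate the example ACV at a few lags
   acv(c(1.0, 2.0), tau = c(0, 1, 5))
   # [1] 1.0000000 0.6065307 0.0820850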

Now we define some initial parameters:

   # define parameters of ACV
   # theta[1] = mu (mean)
   # theta[2] = nu (error scale factor) 
   # theta[3:p] = parameters of ACV
   theta <- c(0.0, 1.0, 1.0, 1.0)

Next, we define a vector of times at which to compute the simulated data:

   # define vector of times for reconstruction
   m <- 1000
   t <- seq(-0.5, 10.5, length = m)

Using the ACV model and parameters (theta) as above, and the grid of times (t), we can generate a random realisation of the GP:

   # produce Gaussian vector (simulation)
   y <- gin::gp_sim(theta, acv.model = acv, t.star = t)
   y <- as.vector(y)

   # plot the 'true' curve
   dat <- data.frame(t = t, y = y)
   gin::plot_ts(dat, type = "l")

[Figure: a random realisation of the GP]

Fitting data

The package comes with a dataset called drw, containing 40 observations of a random process. The drw data.frame has 3 columns: t (time), y (value), dy (error). Let's plot the data:

   gin::plot_ts(drw, col = "black", cex.lab = 1.4)

[Figure: the drw time series data]
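
Before fitting, it is worth confirming the layout of the data (a quick check; the drw dataset ships with the package and can be loaded explicitly if needed):

   # load the bundled dataset and confirm it has columns t, y, dy
   data(drw, package = "gin")
   str(drw)
   head(drw)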

Using the ACV model above, and the values for theta as our starting guess, we fit:

   # fit the model to the data, find Max.Like parameter values
   result <- gin::gp_fit(theta, acv.model = acv, dat = drw)

This should find the maximum likelihood solution. The fitted parameters (four of them in this case) are in result$par. We can use these to reconstruct the GP as follows.
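
To see the fitted values (the ordering follows the theta vector defined above: mu, nu, then the ACV parameters A and l):

   # compare the starting guess with the maximum likelihood estimates
   rbind(start = theta, fit = result$par)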

The gp_conditional function computes the mean and covariance of the GP at the simulation times t (above), with the specified ACV model, conditional on the given data.

   # reconstruct process: compute 'conditional' mean and covariance
   gp <- gin::gp_conditional(result$par, acv.model = acv, dat = drw, t.star = t)

Now we can overlay the conditional model:

   # plot a 'snake' showing mean +/- std.dev
   gin::plot_snake(gp, add = TRUE, col.line = 3)

[Figure: data with the conditional mean ± std.dev 'snake' overlaid]

We can also add pseudo-random realisations of this process (conditional on the data):

   # simulate 5 realisations conditional on the data and overlay them
   y.sim <- gin::gp_sim(result$par, dat = drw, acv.model = acv, t.star = t,
                        N.sim = 5, plot = FALSE)
   for (i in 1:5) lines(t, y.sim[, i], col = i)

[Figure: conditional realisations overlaid on the data]

Bayesian inference

Rather than maximum likelihood, we could specify priors on the model (hyper-)parameters and perform Bayesian inference.

   # define the 'priors' for the parameter values
   logPrior <- function(theta) {
     mu.d <- dnorm(theta[1], sd = 5, log = TRUE)
     nu.d <- dlnorm(theta[2], meanlog = 0, sdlog = 1, log = TRUE)
     A.d <- dlnorm(theta[3], meanlog = 0, sdlog = 1, log = TRUE)
     l.d <- dlnorm(theta[4], meanlog = 0, sdlog = 1, log = TRUE)
     return(mu.d + nu.d + A.d + l.d)
   }
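
Before sampling, it is sensible to check that the log prior is finite at the starting values (a quick sanity check, not part of the package):

   # the log prior should return a finite value at the initial parameters
   logPrior(theta)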

In general, to do this we need an MCMC tool to sample from the posterior. Here I use the Goodman-Weare (GW) sampler from the tonic package.

   # use gw_sampler to generate parameter samples
   chain <- tonic::gw_sampler(gin::gp_logPosterior, theta.0 = theta,
                              acv.model = acv, logPrior = logPrior,
                              dat = drw, burn.in = 1e4,
                              nsteps = 40e4,
                              chatter = 1, thin = 10)

This takes a minute or two to run. It should produce 2,000 samples after 'thinning' by a factor of 10 (it generates 20,000 samples but keeps only 1 in 10). First, we inspect the traces and autocorrelation of the chains.

   # plot MCMC diagnostics
   tonic::chain_diagnosis(chain)
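
You can also confirm how many samples were retained after burn-in and thinning (assuming, as used below, that chain$theta holds one row per retained sample):

   # one row per retained (post burn-in, thinned) sample
   nrow(chain$theta)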

Now we can visualise the posterior by plotting the one- and two-parameter marginal distributions:

   # name the parameters
   colnames(chain$theta) <- c("mu", "nu", "A", "l")

   # plot scatter diagrams
   tonic::contour_matrix(chain$theta, prob.levels = c(1,2,3), sigma = TRUE,
                         prob1d = 0)

The contours represent 1-, 2- and 3-sigma levels (in the sense of 68.3%, 95.4% and 99.7% of the mass).
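
Numerical summaries of the marginal distributions can be pulled from the same samples; here is a minimal sketch using base R quantiles:

   # posterior medians and central 68% intervals for each parameter
   apply(chain$theta, 2, quantile, probs = c(0.16, 0.5, 0.84))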

[Figure: matrix of one- and two-parameter posterior distributions]

To do

  • Include a range of ready-to-use ACV functions
  • Compute ACVs from the Fourier transform of a specified power spectrum
  • Allow for time binning (not point samples in time)
  • Allow for lognormal process
  • Use a matrix rather than data.frame format for data (for speed)

References

For more on GPs, the best reference is:

  • C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, available free online at http://www.gaussianprocess.org/gpml/

Excellent online resources include:

  • A Visual Exploration of Gaussian Processes
  • Intuition behind Gaussian Processes
  • Robust Gaussian Process Modeling