
Commit

Add post draft.
aterenin committed Nov 15, 2024
1 parent dde0961 commit d8fa13e
Showing 9 changed files with 251 additions and 18 deletions.
4 changes: 2 additions & 2 deletions content/2020-03-01-Variational-Integrator-Networks/index.md
@@ -44,7 +44,7 @@ Using VINs allows us to easily learn models with physical forecasting behaviour

# From Residual Networks to Variational Integrator Networks

The idea is simple: if we view neural networks as dynamical systems[^haber][^E][^chen]---and discretize them in a manner that preserves qualitative physical properties[^marsden]---we can define network architectures that obey the laws of physics.
A particularly salient example of the kind of inductive bias we are interested in is the presence of conservation laws, for instance conservation of energy or conservation of momentum.
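
To make the role of a structure-preserving discretization concrete, here is a minimal sketch comparing an explicit Euler step with a Störmer-Verlet (leapfrog) step on a unit harmonic oscillator. It is an illustrative toy rather than the VIN architecture itself: the function names, the step size, and the choice of system are all assumptions made for this example.

```python
def energy(q, p):
    """Total energy of a unit-mass, unit-stiffness harmonic oscillator."""
    return 0.5 * (p * p + q * q)


def euler_step(q, p, h):
    """Explicit Euler: both updates use the state at the start of the step."""
    return q + h * p, p - h * q


def verlet_step(q, p, h):
    """Stormer-Verlet (leapfrog): half kick, full drift, half kick."""
    p_half = p - 0.5 * h * q
    q_new = q + h * p_half
    p_new = p_half - 0.5 * h * q_new
    return q_new, p_new


def simulate(step, q0=1.0, p0=0.0, h=0.1, n_steps=1000):
    """Apply `step` repeatedly and return the final state."""
    q, p = q0, p0
    for _ in range(n_steps):
        q, p = step(q, p, h)
    return q, p


initial_energy = energy(1.0, 0.0)
euler_energy = energy(*simulate(euler_step))
verlet_energy = energy(*simulate(verlet_step))
```

For this system, each Euler step multiplies the energy by `1 + h**2`, so `euler_energy` grows without bound, whereas `verlet_energy` stays within a small band around `initial_energy`. This is the kind of qualitative behaviour a conservation-law inductive bias is meant to capture.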

A canonical description of classical physical dynamical systems is Lagrangian mechanics, where a system is completely characterized by its Lagrangian `$L(q, \dot{q}, t)$`, a scalar function that encodes underlying physical properties.
@@ -57,7 +57,7 @@ $$
$$
```

discretized using an Euler scheme,[^haber][^E][^chen] giving

```
$$
6 changes: 3 additions & 3 deletions content/2020-09-25-Riemannian-Matern-GP/index.md
@@ -39,7 +39,7 @@ where `$K_\nu$` is the modified Bessel function of the second kind, and `$\sigma
As `$\nu\to\infty$`, the Matérn kernel converges to the widely-used squared exponential kernel.

To generalize this class of Gaussian processes to the Riemannian setting, one might consider replacing Euclidean distances `$\Vert x-x' \Vert$` with the geodesic distance `$d_g(x, x')$`.
Unfortunately, this doesn't necessarily define a valid kernel: in particular, the geodesic squared exponential kernel already fails to be positive semi-definite for most manifolds, due to a recent no-go result.[^nogo][^nogo2]
We therefore adopt a different approach, which is not based on geodesics.
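
The failure of positive semi-definiteness can be checked numerically on the simplest compact manifold, the circle. The sketch below is an illustration under assumed settings (an equally spaced grid of 100 points and a length scale of 5, neither taken from the post): it builds the Gram matrix of the geodesic squared exponential kernel and computes its eigenvalues. Because the grid is equally spaced, the matrix is symmetric circulant, so its eigenvalues are given by a cosine transform of its first row.

```python
import math


def geodesic_distance(theta_a, theta_b):
    """Arc-length distance between two angles on the unit circle."""
    diff = abs(theta_a - theta_b) % (2.0 * math.pi)
    return min(diff, 2.0 * math.pi - diff)


def geodesic_sq_exp(theta_a, theta_b, length_scale):
    """Squared exponential kernel with the geodesic distance plugged in."""
    d = geodesic_distance(theta_a, theta_b)
    return math.exp(-d * d / (2.0 * length_scale ** 2))


def circulant_eigenvalues(first_row):
    """Eigenvalues of a symmetric circulant matrix via a cosine transform."""
    n = len(first_row)
    return [
        sum(c * math.cos(2.0 * math.pi * j * k / n) for j, c in enumerate(first_row))
        for k in range(n)
    ]


n_points = 100
length_scale = 5.0  # illustrative choice, not taken from the post
grid = [2.0 * math.pi * j / n_points for j in range(n_points)]

# On an equally spaced grid the Gram matrix is circulant, so its first row suffices.
first_row = [geodesic_sq_exp(grid[0], theta, length_scale) for theta in grid]
eigenvalues = circulant_eigenvalues(first_row)
min_eigenvalue = min(eigenvalues)
```

With these settings `min_eigenvalue` comes out negative, so the Gram matrix is not positive semi-definite and the function cannot serve as a covariance kernel for this length scale.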

# Stochastic partial differential equations
@@ -75,7 +75,7 @@ $$
```

where `$C$` is a constant chosen so that the variance is `$\sigma^2$` on average.[^sqexp]
By truncating this sum, we obtain a workable approximation for the kernel,[^sm] allowing us to train the process on data using standard methods, such as sparse inducing point techniques.[^vfe][^gpbd]
The resulting posterior Gaussian processes are visualized below.
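
As a rough illustration of what truncating the sum looks like in the simplest case, the sketch below evaluates a truncated spectral kernel on the unit circle. The spectral form used here is an assumption for the purpose of the example: Laplace-Beltrami eigenvalues `n**2` with eigenfunctions `cos(n x)` and `sin(n x)`, and weights `(2 nu / kappa**2 + n**2) ** -(nu + 1/2)`. The function name and default parameters are hypothetical.

```python
import math


def truncated_matern_circle(x, x_prime, nu=1.5, kappa=1.0, sigma2=1.0, n_terms=50):
    """Truncated spectral sum for a Matern-type kernel on the unit circle.

    Assumes eigenvalues n**2 and spectral weights
    (2 nu / kappa**2 + n**2) ** -(nu + 1/2), as stated in the lead-in.
    """
    def weight(n):
        return (2.0 * nu / kappa ** 2 + n * n) ** (-(nu + 0.5))

    # The n = 0 eigenfunction is constant; each n >= 1 contributes a cos/sin
    # pair, which combine into cos(n (x - x')).
    total = weight(0)
    normaliser = weight(0)
    for n in range(1, n_terms + 1):
        total += 2.0 * weight(n) * math.cos(n * (x - x_prime))
        normaliser += 2.0 * weight(n)
    # Dividing by the value at zero distance plays the role of the constant C.
    return sigma2 * total / normaliser
```

On the circle the variance is the same at every point, so normalising by the value at zero distance gives variance `sigma2` exactly. On a general manifold the constant would instead fix the variance on average.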


@@ -88,7 +88,7 @@ This equation is very well-studied, and a number of scalable techniques for solv
# Concluding remarks

We present techniques for computing the kernels, spectral measures, and Fourier feature approximations of Riemannian Matérn and squared exponential Gaussian processes, using spectral techniques via the Laplace--Beltrami operator.
This allows us to train these processes via standard techniques, such as variational inference via sparse inducing point methods,[^vfe][^gpbd] or Fourier feature methods.[^rff]
In turn, this allows Riemannian Matérn Gaussian processes to easily be deployed in mini-batch, online, and non-conjugate settings.
We hope this work enables practitioners to easily deploy techniques such as Bayesian optimization in this setting.

10 changes: 5 additions & 5 deletions content/2023-12-10-Stochastic-Gradient-Descent-GP/index.md
@@ -38,7 +38,7 @@ Let's see a simple comparison between standard large-scale Gaussian process appr
{% end %}

From this comparison, one can see that different large-scale Gaussian process approximations work well in different regimes.
Conjugate-gradient-based Gaussian processes[^cg] work well under large-domain asymptotics, whereas sparse Gaussian processes trained via variational inference[^ip-s][^ip-v] work well under infill asymptotics.
One can show theory which suggests that this distinction holds beyond one-dimensional problems.[^ip-theory]
In contrast, the stochastic gradient descent variant we present looks very reasonable in both cases: it empirically converges in most regions of state space under infill asymptotics, and converges everywhere under large-domain asymptotics.
Let's look at this algorithm in more detail.
@@ -48,7 +48,7 @@ Let's look at this algorithm in more details.

To formulate stochastic gradient descent for posterior sampling, let's begin by writing down a random quadratic optimization problem for computing posterior samples.
Let `$f \sim\mathrm{GP}(0,k)$` be the prior, and let `$\boldsymbol{y}\mid f\sim\mathrm{N}(f(\boldsymbol{x}), \mathbf\Sigma)$` be the likelihood.
Let's begin with the *pathwise conditioning*[^efficient-sampling][^pathwise-conditioning] formula for posterior random functions, namely

```
$$
@@ -70,7 +70,7 @@ $$

We can stochastically estimate the large sum using minibatches.
Similarly, we can apply a Fourier-feature-based stochastic estimator for the squared norm term.
We use *efficient sampling* to approximately sample the prior `$f(x_i)$` using Fourier features.[^efficient-sampling][^pathwise-conditioning]
This gives us a subquadratic stochastic estimator for this optimization objective which is almost unbiased, in the sense that the only bias present is from efficiently sampling the prior.
To reduce this objective's variance, we apply a number of tricks, including carefully shifting the `$\boldsymbol\varepsilon$` noise term into the regularizer, which are described in the paper.
The result is a practical stochastic optimization objective for Gaussian process posterior samples.
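
The minibatch step can be sketched as follows, for the data-fit term only. This is a simplified illustration and not the paper's implementation: it assumes a data-fit term of the form `sum_i (y_i - k(x_i, xs) v)**2 / noise_var` with a scalar noise variance, uses a squared exponential kernel, and omits the regularizer and its Fourier feature estimator. All names and values are hypothetical.

```python
import math
import random


def sq_exp_kernel(a, b, length_scale=1.0):
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def data_fit_gradient(v, xs, ys, indices, noise_var, scale):
    """Gradient of scale * sum_{i in indices} (y_i - k(x_i, xs) v)^2 / noise_var."""
    grad = [0.0] * len(v)
    for i in indices:
        row = [sq_exp_kernel(xs[i], xj) for xj in xs]
        residual = ys[i] - sum(r * vj for r, vj in zip(row, v))
        for j, r in enumerate(row):
            grad[j] += scale * (-2.0 * residual * r / noise_var)
    return grad


def minibatch_gradient(v, xs, ys, batch, noise_var):
    """Estimate of the full data-fit gradient from a subset of the data.

    Rescaling by n / len(batch) makes the estimate unbiased under uniform sampling.
    """
    return data_fit_gradient(v, xs, ys, batch, noise_var, scale=len(xs) / len(batch))


random.seed(0)
n = 12
xs = [i / n for i in range(n)]
ys = [math.sin(6.0 * x) for x in xs]
v = [random.gauss(0.0, 1.0) for _ in range(n)]
noise_var = 0.1

full = data_fit_gradient(v, xs, ys, range(n), noise_var, scale=1.0)

# Averaging the estimator over a partition into equal-size batches recovers the
# full gradient, which is a quick check of unbiasedness.
batches = [list(range(start, start + 4)) for start in range(0, n, 4)]
estimates = [minibatch_gradient(v, xs, ys, b, noise_var) for b in batches]
averaged = [sum(col) / len(batches) for col in zip(*estimates)]
```

Each minibatch gradient touches only `len(batch)` kernel rows rather than all `n`, which is where the savings over the full sum come from.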
@@ -83,7 +83,7 @@ In the paper, use stochastic gradient descent with Nesterov momentum, gradient c
Let's see how this algorithm performs, in particular how it is affected by observation noise in the likelihood.

{% figure(alt=["Convergence of stochastic gradient descent for the Gaussian process mean"] src=["exact_metrics.svg"] dark_invert=[true]) %}
**Figure 2.** Convergence of stochastic gradient descent for the Gaussian process posterior mean, in terms of training and test error, along with Euclidean error for the representer weights.
{% end %}

From this plot, it is clear that stochastic gradient descent does not converge approximately to the correct representer weights.
@@ -151,7 +151,7 @@ This suggests the benign non-convergence, which we previously saw in one dimensi

# Conclusion

In this work, we explored using stochastic gradient descent to approximately compute Gaussian process posteriors, by way of means and function samples.
We examined how to derive appropriate stochastic optimization objectives for doing so, and showed that SGD can produce accurate predictions even in cases where it does not converge to the respective optimum under the given compute budget.
We developed a spectral characterization of the effect of non-convergence in terms of the spectral basis functions.
We showed that, on a Thompson sampling benchmark where well-calibrated uncertainty is critical, SGD matches or exceeds the performance of more computationally expensive baselines.
8 changes: 4 additions & 4 deletions content/2024-05-02-vGPMP-Motion-Planning/index.md
@@ -31,13 +31,13 @@ Our results show the proposal achieves a reasonable balance between the motion p
# Applying Variational Gaussian Processes to Motion Planning

We begin with a motion planning framework, which we call *variational Gaussian process motion planning (vGPMP)*.
This framework is based on variational Gaussian processes, which were originally introduced for scalability:[^vfe][^gpbd] here, we instead apply them to create a straightforward way to parameterize motion plans.
Let `$\mathcal{T}$` represent time: our motion plan is a map `$f: \mathcal{T} \to \mathbb{R}^d$`, where the output space represents each of the robot's joints.
We parameterize `$f$` as a posterior Gaussian process, conditioned on `$f(\boldsymbol{z}) = \boldsymbol{u}$`, where `$\boldsymbol{z}$` is a set of inducing locations `$\boldsymbol{z} \in \mathcal{T}^m$`, and `$\boldsymbol{u}$` are robot joint states at times `$\boldsymbol{z}$`.
We interpret `$(z_j,u_j)$`-pairs as *waypoints* through which the robot should move.
Our precise formulation in the paper also includes a bijective map which accounts for joint constraints: we suppress this here for simplicity.

To draw motion plans, we apply *pathwise conditioning*,[^efficient-sampling][^pathwise-conditioning] and represent posterior samples as

```
$$
@@ -56,7 +56,7 @@ We illustrate this below.
Computing the motion plan therefore entails optimizing these parameters with respect to an appropriate variational objective.
Once optimized, in practice we can sample from the posterior using efficient sampling, that is, by first approximately sampling the prior `$f(\cdot)$` using Fourier features, then transforming the sampled prior motion plans into posterior motion plans.
This procedure allows us to draw random curves representing the posterior in a way that *resolves the stochasticity once in advance* per sample, after which we can evaluate and differentiate the motion plan at arbitrary time points without any additional sampling.
Compared to prior work such as GPMP2 and its variants,[^gpmp][^gpmp2][^igpmp2] we support general kernels and avoid relying on specialized techniques for stochastic differential equations, thereby enabling explicit control of motion plan smoothness properties.
Additionally, in contrast with prior work,[^gvi] our formulation bypasses the need to use interpolation to evaluate the posterior in-between a set of pre-specified time points.
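
The two-stage sampling procedure can be sketched for a single joint as follows. This is a simplified illustration under several assumptions that are not from the post: a squared exponential kernel, fixed rather than random waypoint values `u`, no joint-constraint map, and hypothetical function names and waypoints. The prior draw is a random Fourier feature expansion, and the pathwise update adds `K_tz K_zz^{-1} (u - prior(z))`.

```python
import math
import random


def sq_exp_kernel(s, t, length_scale=1.0):
    return math.exp(-0.5 * ((s - t) / length_scale) ** 2)


def sample_prior(n_features=200, length_scale=1.0, rng=random):
    """Draw an approximate prior sample as a fixed random Fourier feature expansion."""
    freqs = [rng.gauss(0.0, 1.0 / length_scale) for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    weights = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def f(t):
        terms = zip(weights, freqs, phases)
        return scale * sum(w * math.cos(om * t + b) for w, om, b in terms)

    return f


def solve(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - tail) / aug[r][r]
    return sol


def posterior_sample(prior, z, u, length_scale=1.0):
    """Pathwise update: prior(t) + K_tz K_zz^{-1} (u - prior(z))."""
    k_zz = [[sq_exp_kernel(a, b, length_scale) for b in z] for a in z]
    coeffs = solve(k_zz, [uj - prior(zj) for zj, uj in zip(z, u)])

    def g(t):
        correction = sum(c * sq_exp_kernel(t, zj, length_scale) for c, zj in zip(coeffs, z))
        return prior(t) + correction

    return g


random.seed(0)
z = [0.0, 0.5, 1.0]   # hypothetical waypoint times
u = [0.0, 1.2, -0.4]  # hypothetical joint values at those times
plan = posterior_sample(sample_prior(), z, u)
```

Once `plan` has been constructed, all randomness is fixed inside the closure: calling `plan(t)` at any time point is deterministic, and by construction `plan(z[j])` returns `u[j]` up to floating-point error.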

Following the framework of variational inference, the resulting variational posterior can be trained by solving the optimization problem
@@ -86,7 +86,7 @@ This is done by composing the forward kinematics map `$\operatorname{k}_{\operat
Then, we compute the hinge loss `$\operatorname{h}_\varepsilon(x) = \max(-x + \varepsilon, 0)$`, where `$\varepsilon$` is the *safety distance* parameter, and calculate its squared norm with respect to a diagonal scaling matrix `$\mathbf\Sigma_{\operatorname{obs}}$` which determines the overall importance of avoiding collisions in the objective.
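
A minimal sketch of this collision term, assuming the signed distances have already been computed and that the squared norm weights each entry by the inverse of the corresponding diagonal element of the scaling matrix (the weighting convention, the function names, and the numbers are assumptions for illustration):

```python
def hinge(signed_distance, safety_distance):
    """h_eps(x) = max(-x + eps, 0): zero once the clearance exceeds eps."""
    return max(-signed_distance + safety_distance, 0.0)


def collision_cost(signed_distances, safety_distance, scales):
    """Squared norm of the hinge losses, weighted by the inverse of a diagonal scaling.

    Whether the scaling enters directly or through its inverse is an assumption here.
    """
    return sum(
        hinge(d, safety_distance) ** 2 / s
        for d, s in zip(signed_distances, scales)
    )


# Hypothetical signed distances from three collision spheres to the nearest obstacle.
distances = [0.30, 0.02, -0.01]
cost = collision_cost(distances, safety_distance=0.05, scales=[0.1, 0.1, 0.1])
```

Only the second and third spheres contribute, since the first is further from the obstacle than the safety distance. Under this convention, smaller entries in `scales` penalize collisions more heavily.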

The soft constraint term, which can be used to encode desired behavior such as a grasping pose, is handled analogously.
Compared to prior work,[^gpmp][^gpmp2][^igpmp2][^gvi] one of the key differences is the introduction of `$\sigma$`, which guarantees that joint limits are respected without the need for clamping or other post-processing-based heuristics.

# Experiments


0 comments on commit d8fa13e
