Gaussian Processes

Mathias Engel

15 March 2018

The Problem

We have $N$ data points $\bm{X}_N, \bm{t}_N = \{\bm{x}^{(n)}, t_n\}_{n=1}^N$, where $\bm{x}$ is a vector and $t$ is a scalar.

We want a model $y(\bm{x})$.

The probability distribution of a function $y(\bm{x})$ is a Gaussian process if, for any finite selection of points $\bm{x}^{(1)}, \bm{x}^{(2)}, \dots, \bm{x}^{(N)}$, the marginal density $P(y(\bm{x}^{(1)}), y(\bm{x}^{(2)}), \dots, y(\bm{x}^{(N)}))$ is a Gaussian.
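
Below is a minimal sketch of this definition in code: pick any finite set of input points, build their joint covariance, and draw correlated function values from the resulting multivariate Gaussian. The squared-exponential kernel and the length scale $\ell = 1$ are illustrative assumptions, not something the definition prescribes.

    import numpy as np

    def sq_exp_kernel(xa, xb, ell=1.0):
        # k(x, x') = exp(-(x - x')^2 / (2 ell^2)) for scalar inputs
        return np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2 / ell ** 2)

    x = np.linspace(-5.0, 5.0, 100)     # any finite selection of points
    K = sq_exp_kernel(x, x)             # their joint covariance
    K += 1e-9 * np.eye(len(x))          # jitter for numerical stability

    rng = np.random.default_rng(0)
    samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
    # Each row of `samples` is one function y(x) evaluated at the chosen
    # points; by definition, their joint density is Gaussian.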

Visualizing the Gaussian Process

From parametric models to Gaussian Processes¹

  • $R_{nh} \equiv \phi_h(\bm{x}^{(n)})$, given $H$ basis functions $\{\phi_h(\bm{x})\}_{h=1}^H$.

  • $\bm{y}_N$ is defined by $y_n \equiv \sum_h R_{nh} w_h$, i.e. $\bm{y} = \bm{R}\bm{w}$, given basis weights $w_h$.

  • Assume the prior on $\bm{w}$ is $P(\bm{w}) = \mathcal{N}(0, \sigma_w^2 \bm{I})$.

$\bm{y}$ is linear in $\bm{w}$ and therefore also Gaussian distributed, with zero mean: $P(\bm{y}) = \mathcal{N}(0, \bm{Q})$.

$$\bm{Q} = \langle \bm{y}\bm{y}^\top \rangle = \langle \bm{R}\bm{w}\bm{w}^\top \bm{R}^\top \rangle = \bm{R} \langle \bm{w}\bm{w}^\top \rangle \bm{R}^\top = \sigma_w^2 \bm{R}\bm{R}^\top.$$
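
A minimal numeric check of this identity, using Gaussian bumps as the $H$ basis functions (an illustrative choice): the analytic covariance $\sigma_w^2 \bm{R}\bm{R}^\top$ should match the empirical covariance of many sampled $\bm{y} = \bm{R}\bm{w}$.

    import numpy as np

    rng = np.random.default_rng(1)
    H, N, sigma_w = 20, 5, 1.0
    centers = np.linspace(-3.0, 3.0, H)   # basis-function centres
    x = np.linspace(-2.0, 2.0, N)         # the N input points

    R = np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2)  # R_nh = phi_h(x^(n))
    Q = sigma_w ** 2 * R @ R.T            # analytic covariance of y

    # Empirical covariance of y = R w with w ~ N(0, sigma_w^2 I)
    w = rng.normal(0.0, sigma_w, size=(100000, H))
    y = w @ R.T
    print(np.abs(np.cov(y, rowvar=False) - Q).max())  # small; shrinks with more samples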

From parametric models to Gaussian Processes

  • Given measurement noise $\sigma_v^2$, $\bm{t}$ has the prior distribution $P(\bm{t}) = \mathcal{N}(0, \bm{C})$, with $\bm{C} = \bm{Q} + \sigma_v^2 \bm{I} = \sigma_w^2 \bm{R}\bm{R}^\top + \sigma_v^2 \bm{I}$.

  • In general, the entries of $\bm{C}$ are $C_{nn^\ast} = \sigma_w^2 \sum_h \phi_h(\bm{x}^{(n)}) \phi_h(\bm{x}^{(n^\ast)}) + \sigma_v^2 \delta_{nn^\ast}$.

  • Let $H \to \infty$. The sum over basis functions becomes an integral over their centres; solving this integral gives the kernel function of the Gaussian process (see the sketch below).
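
A sketch of that limit, under the illustrative assumption of Gaussian basis functions $\phi_c(x) = \exp(-(x - c)^2/2)$ on an ever denser uniform grid of centres $c$: the (Riemann) sum over basis functions converges to $\int \exp(-(x - c)^2/2)\,\exp(-(x' - c)^2/2)\,dc = \sqrt{\pi}\, e^{-(x - x')^2/4}$, a squared-exponential kernel.

    import numpy as np

    x, xp = 0.3, -0.8
    for H in (10, 100, 1000, 10000):
        c = np.linspace(-20.0, 20.0, H)   # basis-function centres
        dc = c[1] - c[0]                  # spacing, so the sum is a Riemann sum
        s = dc * np.sum(np.exp(-0.5 * (x - c) ** 2) * np.exp(-0.5 * (xp - c) ** 2))
        print(H, s)                       # converges as H grows

    print(np.sqrt(np.pi) * np.exp(-0.25 * (x - xp) ** 2))  # the analytic limit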

Example: Measurements of breast cancer cells

James Longden (data) & Mathias Engel (algorithm)

Gaussian process regression

  • The hyperparameters of $\bm{C}$ are optimized on the training data by maximizing the log likelihood (see the sketch after this list).
  • New points are predicted as marginal univariate Gaussian distributions.
  • Missing data and measurement errors are handled gracefully.
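
A minimal regression sketch along these lines: fit kernel hyperparameters by maximizing the log (marginal) likelihood of the training targets, then read off the Gaussian predictive mean and variance at new points. The squared-exponential kernel, the toy data, and the use of scipy's BFGS optimizer are illustrative assumptions.

    import numpy as np
    from scipy.optimize import minimize

    def kernel(xa, xb, amp, ell):
        return amp ** 2 * np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2 / ell ** 2)

    def neg_log_likelihood(log_params, x, t):
        amp, ell, noise = np.exp(log_params)            # log-parametrized for positivity
        C = kernel(x, x, amp, ell) + noise ** 2 * np.eye(len(x))
        L = np.linalg.cholesky(C)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, t))   # C^{-1} t
        # 0.5 t^T C^{-1} t + 0.5 log|C| + (N/2) log(2 pi)
        return 0.5 * t @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(x) * np.log(2 * np.pi)

    rng = np.random.default_rng(2)
    x = np.linspace(0.0, 5.0, 30)
    t = np.sin(x) + 0.1 * rng.normal(size=30)           # toy training data

    res = minimize(neg_log_likelihood, np.zeros(3), args=(x, t))
    amp, ell, noise = np.exp(res.x)

    # Predict: the marginal at each new point is a univariate Gaussian.
    xs = np.linspace(0.0, 5.0, 100)
    C = kernel(x, x, amp, ell) + noise ** 2 * np.eye(len(x))
    k = kernel(x, xs, amp, ell)
    mean = k.T @ np.linalg.solve(C, t)
    var = np.maximum(amp ** 2 - np.sum(k * np.linalg.solve(C, k), axis=0), 0.0)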

Example: Measurements of breast cancer cells

Thank you

You can learn more at gaussianprocess.org or read

Choosing initial values for the kernel hyperparameters

Equation 40 and Figure 4 from the paper.

Extra definition

  • We can represent a function as an unknown big vector $\bm{f}$.
  • We assume that $\bm{f}$ was drawn from a big correlated Gaussian distribution: a Gaussian process.
  • Observing elements of the vector (optionally corrupted by Gaussian noise) creates a posterior distribution.
  • The posterior over functions is still a Gaussian process.
  • Because marginalization in Gaussians is trivial, we can simply ignore all points that are neither observed nor queried: missing data and lazy evaluation come for free (see the sketch below).
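
A minimal sketch of this "big vector" view: put a joint Gaussian on $\bm{f}$ evaluated on a grid, observe a few noisy entries, and condition. Only the observed and queried rows and columns of the covariance ever enter; everything else is marginalized out simply by being dropped. The kernel, grid, and noise level are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(3)
    x = np.linspace(0.0, 1.0, 50)
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.1 ** 2)  # prior covariance of f

    obs = np.array([5, 20, 40])       # indices of observed entries
    query = np.array([10, 30])        # indices we actually care about
    noise = 0.05

    f_obs = np.sin(2 * np.pi * x[obs]) + noise * rng.normal(size=len(obs))
    A = K[np.ix_(obs, obs)] + noise ** 2 * np.eye(len(obs))
    B = K[np.ix_(query, obs)]

    # Standard Gaussian conditioning: posterior of f[query] given f[obs].
    # Entries outside obs and query never appear in the computation.
    post_mean = B @ np.linalg.solve(A, f_obs)
    post_cov = K[np.ix_(query, query)] - B @ np.linalg.solve(A, B.T)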

  1. Loosely following eqs. 16-23 in MacKay (1997).