# Gaussian Process Latent Variable Models

## What's that?

• A probabilistic, non-linear dimensionality reduction method.
• It is closely related to Gaussian process regression, where training pairs {x_i, y_i} are given and we want to learn the mapping f: X→Y.
• In a GPLVM, however, only the observations {y_i} are given.
• The corresponding latent points {x_i} are therefore treated as unknowns and become part of the model: M = ({x_i}, covariance kernel parameters).
• Training learns both the latent representations {x_i} and the mapping f: X→Y, i.e. the {x_i} and the hyperparameters of the covariance kernels of the Gaussian processes are optimized jointly.
• One Gaussian process is trained for each regression (output) dimension.
• Training minimizes -ln p(M|Y) using gradient descent, which is the same as maximizing p(M|Y).

So in one sentence:

GPLVM = learned latent points {x_i} + one Gaussian process per regression dimension
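
To make the training objective concrete, here is a minimal NumPy sketch of the likelihood part of -ln p(M|Y). It assumes an RBF (squared-exponential) kernel with an added noise term, a zero-mean GP, and a single kernel matrix shared by all output dimensions. The function names and default hyperparameter values are illustrative and not taken from any particular library. Priors on the {x_i} or on the hyperparameters would add further terms to this objective.

```python
import numpy as np

def rbf_kernel(X, variance=1.0, lengthscale=1.0, noise=1e-2):
    """RBF kernel matrix on latent points X (N x Q), plus noise on the diagonal."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = variance * np.exp(-0.5 * sq_dists / lengthscale**2)
    return K + noise * np.eye(X.shape[0])

def gplvm_neg_log_likelihood(X, Y, variance=1.0, lengthscale=1.0, noise=1e-2):
    """-ln p(Y | X, theta): one GP per column of Y, all sharing the same K."""
    N, D = Y.shape
    K = rbf_kernel(X, variance, lengthscale, noise)
    L = np.linalg.cholesky(K)                            # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))  # K^{-1} Y
    log_det_K = 2.0 * np.sum(np.log(np.diag(L)))
    return (0.5 * D * log_det_K            # model-complexity term
            + 0.5 * np.sum(Y * alpha)      # data-fit term, tr(K^{-1} Y Y^T)
            + 0.5 * N * D * np.log(2.0 * np.pi))
```

Because all D Gaussian processes share K in this sketch, one Cholesky factorization serves every output dimension.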

## Good explanations quoted

In order to perform GP regression (i.e. adjusting the parameters of the kernel), we theoretically need to know both the observed data Y and the latent space data X. As X is not known a priori, an initial estimation is given using PCA (see equation (3)).
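
The PCA initialization mentioned in the quote could look like the following NumPy sketch. The name `pca_init` and the `latent_dim` parameter are illustrative. Equation (3) of the quoted source is not reproduced in these notes, so this is a generic PCA projection and not necessarily that exact formula.

```python
import numpy as np

def pca_init(Y, latent_dim=2):
    """Project the observations Y (N x D) onto their top principal directions."""
    Y_centered = Y - Y.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by decreasing variance.
    _, _, Vt = np.linalg.svd(Y_centered, full_matrices=False)
    return Y_centered @ Vt[:latent_dim].T    # initial latent points, N x latent_dim
```

The result is only a linear embedding; the subsequent GPLVM optimization is what bends it into a non-linear one.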

Once X has been initialized, GP regressions and corrected estimations of X are performed iteratively until convergence has been achieved or until a maximum number of iterations has been reached.

To perform the GP regression, the likelihood of the GP and X given Y has to be maximized with respect to the parameters of the kernel. This likelihood function is chosen so as to favor smooth mappings from latent space to observation space. To reassess the values of the vectors x_i, they are chosen to maximize the likelihood of X given the GP and Y.

For large datasets (i.e. large values of m), one may reduce the computational complexity of these optimizations by performing the GP regression using only an active subset of X and Y, reassessing only the inactive subset of X and choosing a different active subset for the next iteration. With this approach, each x_i may be optimized independently.

GPLVMs were introduced in the context of visualization of high-dimensional data. GPLVMs perform nonlinear dimensionality reduction in the context of Gaussian processes. The underlying probabilistic model is still a GP regression model. However, the input values X are not given and become latent variables that need to be determined during learning. In the GPLVM, this is done by optimizing over both the latent space X and the hyperparameters: <X∗, θ∗> = argmax_{X,θ} log p(Y | X, θ).
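
For reference, the objective in this quote can be written out explicitly. The expression below is the standard GP marginal log-likelihood, under the assumptions that Y is an N×D matrix of centered observations and that all D output dimensions share one N×N kernel matrix K = K(X, θ), with the noise variance counted among the hyperparameters θ. The symbols N, D and K are notation introduced here, not taken from the quote.

```latex
\log p(Y \mid X, \theta)
  = -\frac{DN}{2}\log(2\pi)
    - \frac{D}{2}\log\lvert K \rvert
    - \frac{1}{2}\operatorname{tr}\!\left(K^{-1} Y Y^{\top}\right)
```

The log-determinant term penalizes overly flexible kernels, while the trace term measures how well the kernel built from X explains Y.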

This optimization can be performed using scaled conjugate gradient descent. In practice, the approach requires a good initialization to avoid local maxima. Typically, such initializations are done via PCA or Isomap.
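
The joint optimization over X and θ can be sketched end to end as follows. This is a toy version under several assumptions: plain gradient descent with an accept/reject step-size rule is swapped in for scaled conjugate gradient, the kernel is an RBF kernel with noise, and the hyperparameters are optimized in log space. All function names, the starting values, the step-size rule and the clipping range are ad hoc choices made for illustration.

```python
import numpy as np

def pca_init(Y, latent_dim):
    """Generic PCA projection used as the starting point for X."""
    Yc = Y - Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Yc @ Vt[:latent_dim].T

def objective_and_grads(X, log_theta, Y):
    """-ln p(Y | X, theta) up to a constant, with gradients w.r.t. X and log(theta).

    log_theta holds the logs of (signal variance, lengthscale, noise variance).
    """
    N, D = Y.shape
    variance, lengthscale, noise = np.exp(log_theta)
    diff = X[:, None, :] - X[None, :, :]
    sq = np.sum(diff**2, axis=-1)                     # squared latent distances
    Kf = variance * np.exp(-0.5 * sq / lengthscale**2)
    K = Kf + noise * np.eye(N)
    K_inv = np.linalg.inv(K)
    alpha = K_inv @ Y
    _, log_det = np.linalg.slogdet(K)
    value = 0.5 * D * log_det + 0.5 * np.sum(Y * alpha)
    G = 0.5 * (D * K_inv - alpha @ alpha.T)           # d(objective)/dK
    W = G * Kf
    grad_X = -2.0 / lengthscale**2 * (W.sum(axis=1)[:, None] * X - W @ X)
    grad_log_theta = np.array([
        np.sum(W),                                    # w.r.t. log(variance)
        np.sum(W * sq) / lengthscale**2,              # w.r.t. log(lengthscale)
        noise * np.trace(G),                          # w.r.t. log(noise)
    ])
    return value, grad_X, grad_log_theta

def fit_gplvm(Y, latent_dim=2, n_iters=200, step=1e-2):
    """Jointly optimize latent points and kernel hyperparameters."""
    Y = Y - Y.mean(axis=0)
    X = pca_init(Y, latent_dim)
    log_theta = np.zeros(3)                           # all hyperparameters start at 1
    value, gX, gT = objective_and_grads(X, log_theta, Y)
    for _ in range(n_iters):
        X_new = X - step * gX
        # Ad hoc safeguard: keep hyperparameters in a numerically safe range.
        lt_new = np.clip(log_theta - step * gT, -10.0, 10.0)
        new_value, gX_new, gT_new = objective_and_grads(X_new, lt_new, Y)
        if new_value < value:                         # accept, then grow the step
            X, log_theta, value = X_new, lt_new, new_value
            gX, gT = gX_new, gT_new
            step = min(step * 1.1, 1.0)
        else:                                         # reject and shrink the step
            step *= 0.5
    return X, np.exp(log_theta), value
```

Calling `fit_gplvm(Y, latent_dim=2)` on an N×D data matrix returns the learned latent points, the kernel hyperparameters and the final objective value. As the quote warns, the result depends on the initialization, so a different starting embedding may end in a different local optimum.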

## Mathematical explanations 