\section{Gaussian Processes Background}\label{sec:gaussian_processes}
The \acrfull{gp} belongs to the \textit{kernel machines} class of machine learning algorithms. The formal definition~\cite{rasmussenGaussianProcessesMachine2006} of a \acrlong{gp} states that:
\begin{displayquote}
A Gaussian Process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
\end{displayquote}
A \acrshort{gp} is completely specified by its mean function and covariance function:
\begin{equation}
\begin{aligned}
m(\mathbf{x}) &= \mathbb{E}[f(\mathbf{x})] \\
k(\mathbf{x}, \mathbf{x'}) &= \mathbb{E}[(f(\mathbf{x}) - m(\mathbf{x}))(f(\mathbf{x'}) - m(\mathbf{x'}))]
\end{aligned}
\end{equation}
This notation is commonly abbreviated as:
\begin{equation}
f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x'}))
\end{equation}
Under the simplifying assumption $m(\mathbf{x}) = 0$, the training outputs $\mathbf{f}$ and the test outputs $\mathbf{f_*}$ have the following joint distribution under the prior:
\begin{equation}
\begin{bmatrix}
\mathbf{f} \\
\mathbf{f_*} \\
\end{bmatrix}
\sim
\mathcal{N}\left(
\mathbf{0},
\begin{bmatrix}
K(X, X) & K(X, X_*) \\
K(X_*, X) & K(X_*, X_*) \\
\end{bmatrix}
\right)
\end{equation}
In the case of noisy observations, assuming $y = f + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma_n^2)$, the joint distribution becomes:
\begin{equation}
\begin{bmatrix}
\mathbf{y} \\
\mathbf{f_*} \\
\end{bmatrix}
\sim
\mathcal{N}\left(
\mathbf{0},
\begin{bmatrix}
K(X, X) + \sigma_n^2 I & K(X, X_*) \\
K(X_*, X) & K(X_*, X_*) \\
\end{bmatrix}
\right)
\end{equation}
which, for the rest of the section, will be used in the abbreviated form:
\begin{equation}
\begin{bmatrix}
\mathbf{y} \\
\mathbf{f_*} \\
\end{bmatrix}
\sim
\mathcal{N}\left(
\mathbf{0},
\begin{bmatrix}
K + \sigma_n^2 I & K_*^T \\
K_* & K_{**} \\
\end{bmatrix}
\right)
\end{equation}
with $K = K(X, X)$, $K_* = K(X_*, X)$ and $K_{**} = K(X_*, X_*)$.
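To make this definition concrete, the following minimal NumPy sketch (an illustration written for this section, with all parameter values chosen arbitrarily) evaluates the covariance of a zero-mean \acrshort{gp} with the \acrshort{se} kernel introduced in Section~\ref{sec:Kernels} on a finite grid and draws sample functions from the resulting multivariate Gaussian prior:
\begin{verbatim}
import numpy as np

def se_kernel(x1, x2, variance=1.0, lengthscale=1.0):
    """Squared exponential kernel k(x, x') for 1-D inputs."""
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

# The GP restricted to a finite set of inputs is just a multivariate
# Gaussian with covariance K(X, X).
X = np.linspace(0.0, 10.0, 200)
K = se_kernel(X, X) + 1e-9 * np.eye(len(X))  # jitter for numerical stability

# Draw three sample functions from the zero-mean prior.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(X)), K, size=3)
\end{verbatim}
Each draw is one possible function under the prior; conditioning on observed data, as described in the following subsections, narrows this distribution to the posterior.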
\subsection{Parameter learning}
``Training'' a \acrshort{gp} is the process of finding the kernel parameters that best explain the data. This is done by maximizing the probability density function of the observations $y$, also known as the marginal likelihood:
\begin{equation}\label{eq:gp_likelihood}
p(y) = \frac{1}{\sqrt{(2\pi)^{n}\det{\left(K + \sigma_n^2I\right)}}} \exp{\left(-\frac{1}{2}y^T\left(K + \sigma_n^2I\right)^{-1}y\right)}
\end{equation}
In order to simplify computation in actual \acrshort{gp} implementations, the quantity that gets maximized is the logarithm of Equation~\ref{eq:gp_likelihood}, the log marginal likelihood:
\begin{equation}\label{eq:gp_log_likelihood}
\log(p(y)) = - \frac{1}{2}\log{\left( \det{\left( K + \sigma_n^2I \right)} \right)} - \frac{1}{2}y^T\left( K + \sigma_n^2I \right)^{-1}y - \frac{n}{2}\log{\left(2\pi\right)}
\end{equation}
\subsection{Prediction}
Given the covariance matrices $K$, $K_*$ and $K_{**}$, predictions at new points $X_*$ can be made as follows:
\begin{equation}
\begin{aligned}
\bar{\mathbf{f}}_* = \mathbb{E}\left(\mathbf{f_*}|X, \mathbf{y}, X_*\right) &= K_*\left(K + \sigma_n^2I\right)^{-1}\mathbf{y} \\
\mathrm{cov}(\mathbf{f_*}) &= K_{**} - K_*\left(K +\sigma_n^2I\right)^{-1}K_*^T \\
\end{aligned}
\end{equation}
The extension of these predictions to a non-zero mean \acrshort{gp} comes naturally by applying the zero-mean \acrshort{gp} to the \textit{difference} between the observations and the fixed mean function:
\begin{equation}
\bar{\mathbf{f}}_* = \mathbf{m}(X_*) + K_*\left(K + \sigma_n^2I\right)^{-1}(\mathbf{y} - \mathbf{m}(X))
\end{equation}
\subsection{Kernels}\label{sec:Kernels}
The choice of kernel is an important part of any kernel machine algorithm. It shapes the behaviour of the \acrshort{gp} by imposing a desired level of smoothness on the resulting functions, a periodicity, linearity, etc. This extends the use cases of \acrshort{gp} models while allowing any available prior information about the system to be incorporated. For the purpose of identifying a dynamical system, a few kernels are popular choices~\cite{kocijanModellingControlDynamic2016}:
\subsubsection*{Squared Exponential Kernel}
This kernel is used when the system to be modelled is assumed to be smooth and continuous. The basic version of the \acrfull{se} kernel has the following form:
\begin{equation}
k(\mathbf{x}, \mathbf{x'}) = \sigma^2 \exp{\left(- \frac{1}{2}\frac{\norm{\mathbf{x} - \mathbf{x'}}^2}{l^2}\right)}
\end{equation}
with the model variance $\sigma^2$ and the lengthscale $l$ as its parameters. The lengthscale indicates how fast the correlation diminishes as the two points move further apart. This function can be generalized by using an arbitrary positive semi-definite matrix $\Lambda^{-1}$, which imposes a different weight on each combination of the dimensions of $\mathbf{x}$ and $\mathbf{x'}$:
\begin{equation}
\begin{aligned}
k(\mathbf{x}, \mathbf{x'}) &= \sigma^2\exp{\left[-\frac{1}{2} (\mathbf{x} - \mathbf{x'})^T \Lambda^{-1} (\mathbf{x} - \mathbf{x'})\right]} \\
&= \sigma^2 \exp{\left[-\frac{1}{2}\sum_{d=1}^D w_d(x_d - x'_d)^2\right]}
\end{aligned}
\end{equation}
where $w_d = \frac{1}{l_d^2},\ d = 1 ,\dots, D$, with $D$ being the dimension of the data. The special case $\Lambda^{-1} = \text{diag}{\left([l_1^{-2},\dots,l_D^{-2}]\right)}$ is equivalent to using a different lengthscale for each regressor, which makes it possible to assess the relative importance of each regressor through the values of the hyperparameters. This is the \acrfull{ard} property.
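As a concrete illustration of the equations above, the following NumPy sketch (written for this section; the function names are chosen here and not taken from any cited work) implements the \acrshort{se} kernel with \acrshort{ard}, the log marginal likelihood of Equation~\ref{eq:gp_log_likelihood}, and the predictive mean and covariance:
\begin{verbatim}
import numpy as np

def se_ard_kernel(X1, X2, variance, lengthscales):
    """SE kernel with ARD: one lengthscale per input dimension."""
    diff = X1[:, None, :] / lengthscales - X2[None, :, :] / lengthscales
    return variance * np.exp(-0.5 * np.sum(diff**2, axis=-1))

def log_marginal_likelihood(X, y, variance, lengthscales, noise_var):
    """Log marginal likelihood from the Parameter learning subsection."""
    n = len(y)
    Ky = se_ard_kernel(X, X, variance, lengthscales) + noise_var * np.eye(n)
    L = np.linalg.cholesky(Ky)                             # Ky = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # Ky^{-1} y
    log_det_term = -np.sum(np.log(np.diag(L)))             # -0.5 * log det(Ky)
    return -0.5 * y @ alpha + log_det_term - 0.5 * n * np.log(2.0 * np.pi)

def gp_predict(X, y, X_star, variance, lengthscales, noise_var):
    """Posterior mean and covariance at the test inputs X_star."""
    Ky = se_ard_kernel(X, X, variance, lengthscales) + noise_var * np.eye(len(y))
    K_star = se_ard_kernel(X_star, X, variance, lengthscales)
    K_ss = se_ard_kernel(X_star, X_star, variance, lengthscales)
    mean = K_star @ np.linalg.solve(Ky, y)
    cov = K_ss - K_star @ np.linalg.solve(Ky, K_star.T)
    return mean, cov
\end{verbatim}
The linear solves against $K + \sigma_n^2 I$ are the $\mathcal{O}(n^3)$ operations that motivate the sparse approximations discussed later in this section.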
\subsubsection*{Rational Quadratic Kernel}
The \acrfull{rq} kernel can be interpreted as an infinite sum of \acrshort{se} kernels with different lengthscales. It has the same smooth behaviour as the \acrshort{se} kernel, but can account for differences in function behaviour between large-scale and small-scale variations.
\begin{equation}
k(\mathbf{x}, \mathbf{x'}) = \sigma^2 \left(1+\frac{1}{2}\frac{\norm{\mathbf{x} - \mathbf{x'}}^2}{\alpha l^2}\right)^{-\alpha}
\end{equation}
When extended with the \acrshort{ard} property, it becomes:
\begin{equation}
k(\mathbf{x}, \mathbf{x'}) = \sigma^2\left[1 + \frac{1}{2\alpha} {(\mathbf{x} - \mathbf{x'})}^T \Lambda^{-1} (\mathbf{x} - \mathbf{x'})\right]^{-\alpha}
\end{equation}
\subsection{Extending GPs to larger datasets}
There exist multiple methods of scaling \acrshort{gp}s to larger amounts of data without incurring the full cost of inverting the $n \times n$ covariance matrix. An overview and comparison of multiple methods is given in~\cite{liuUnderstandingComparingScalable2019}. For the scope of this project, the \acrfull{svgp} model has been chosen, since it provides a very good balance of scalability, capability, robustness and controllability~\cite{liuUnderstandingComparingScalable2019}. The \acrshort{svgp} was first introduced by~\textcite{hensmanGaussianProcessesBig2013} as a way to scale \acrshort{gp}s to large datasets. A detailed explanation of the mathematics behind \acrshort{svgp}s and the reasoning behind them is given in~\cite{yiSparseVariationalGaussian2021}; a very brief summary is presented below.
\subsubsection{Summarizing the training data}
Trivially downsampling the available data to a size that fits within the computational budget is not the best solution, since it discards large amounts of potentially useful information. In order to reduce the effective dataset size without this downside, a new set of random variables $f(X_s)$, usually denoted $f_s$, is introduced, with the requirement that its size $n_s$ is smaller than the size $n$ of the original dataset. The $X_s$ are called \textit{inducing locations}, and $f_s$ --- \textit{inducing random variables}. They summarize the data in the sense that a model trained on this reduced set should be able to generate the original dataset with high probability. The multivariate Gaussian distribution is used to establish the relationship between $f_s$ and $f$, which serves the role of the prior, now called the sparse prior:
\begin{equation}
\begin{bmatrix}
\mathbf{f}(X) \\
\mathbf{f}(X_s) \\
\end{bmatrix}
\sim
\mathcal{N}\left(
\mathbf{0},
\begin{bmatrix}
K(X, X) & K(X, X_s) \\
K(X_s, X) & K(X_s, X_s) \\
\end{bmatrix}
\right)
\end{equation}
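As a sketch of how such a model can be used in practice, the following example relies on the GPflow library; the choice of library, the toy data and all numerical values are assumptions made here for illustration rather than part of the cited works. It constructs an \acrshort{svgp} model with a small set of inducing locations and trains it by maximizing the evidence lower bound (\acrshort{elbo}) objective derived in the next subsection:
\begin{verbatim}
import numpy as np
import tensorflow as tf
import gpflow

# Toy dataset: n = 2000 noisy observations of a 1-D function.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(2000, 1))
Y = np.sin(X) + 0.1 * rng.standard_normal(X.shape)

# n_s = 30 inducing locations X_s, initialised on a subset of the inputs.
X_s = X[:30].copy()

model = gpflow.models.SVGP(
    kernel=gpflow.kernels.SquaredExponential(),
    likelihood=gpflow.likelihoods.Gaussian(),
    inducing_variable=X_s,
    num_data=len(X),
)

# Training maximises the ELBO by minimising its negative.
loss = model.training_loss_closure((X, Y))
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
for _ in range(2000):
    optimizer.minimize(loss, var_list=model.trainable_variables)

elbo_value = model.elbo((X, Y))     # evidence lower bound after training
mean, var = model.predict_f(X[:5])  # posterior mean and variance
\end{verbatim}
Only the $n_s$ inducing locations enter the expensive kernel computations, which is what keeps training on the full dataset tractable.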
\subsubsection{Evidence Lower Bound}\label{sec:elbo}
Computing the log likelihood is one of the most expensive parts of model training, due to the inversion of the kernel matrix term ${\left(K + \sigma_n^2I\right)}^{-1}$ (cf. Equation~\ref{eq:gp_likelihood}), which has an algorithmic complexity of $\mathcal{O}(n^3)$. To address this, the log likelihood (Equation~\ref{eq:gp_log_likelihood}) used for training a classical \acrshort{gp} is replaced with an approximation that remains computationally tractable on larger datasets. The following derivation of the \acrfull{elbo} is based on the one presented in~\cite{yangUnderstandingVariationalLower}.

Let $X$ denote the observations and $Z$ the set of parameters of the \acrshort{gp} model, also known as the latent variables. The posterior distribution of the latent variables can be written as follows:
\begin{equation}
p(Z|X) = \frac{p(X|Z)p(Z)}{p(X)} = \frac{p(X|Z)p(Z)}{\int_Z p(X,Z)}
\end{equation}
The main idea of Variational Bayesian methods is to replace the true posterior distribution $p(Z|X)$ with an approximation $q(Z)$ that is easier to compute, but is as close as possible to the original. One measure of the closeness of the two distributions is the \acrfull{kl} divergence, which for variational inference takes the following form:
\vspace{5pt}
\begin{equation}
\begin{aligned}
KL\left[q(Z)||p(Z|X)\right] &= \int_Z q(Z)\log{\left(\frac{q(Z)}{p(Z|X)}\right)} \\
&= - \int_Z q(Z)\log{\left(\frac{p(Z|X)}{q(Z)}\right)} \\
&= - \left( \int_Z q(Z)\log{\left(\frac{p(X,Z)}{q(Z)}\right)} - \int_Z q(Z)\log{\left(p(X)\right)} \right) \\
&= - \int_Z q(Z) \log{\left(\frac{p(X,Z)}{q(Z)}\right)} + \log{\left(p(X)\right)}\int_Z q(Z) \\
&= -L + \log{\left(p(X)\right)} \\
\end{aligned}
\end{equation}
\vspace{5pt}
where $L$ is the \acrshort{elbo}. Rearranging this equation, we get:
\begin{equation}
L = \log{\left(p(X)\right)} - KL\left[q(Z)||p(Z|X)\right]
\end{equation}
Since the \acrshort{kl} divergence is always non-negative, it follows that $L \leq \log{\left(p(X)\right)}$, i.e.\ $L$ is a lower bound on the log probability of the observations.
\subsection{Gaussian Process Models for Dynamical Systems}\label{sec:gp_dynamical_system}
In the context of dynamical system identification and control, \acrshort{gp}s are used to represent different model structures, ranging from state-space and \acrfull{nfir} structures to the more complex \acrfull{narx}, \acrfull{noe} and \acrfull{narmax}. The general form of a \acrshort{narx} model is as follows:
\begin{equation}
\hat{y}(k) = f(w(k-1),\dots,w(k-l_w),y(k-1),\dots,y(k-l_y),u(k-1),\dots,u(k-l_u))
\end{equation}
where $w$ are the exogenous inputs with a maximum lag of $l_w$, $u$ are the model inputs with a maximum lag of $l_u$, and $y$ are the model outputs with a maximum lag of $l_y$. This structure is a \textit{prediction model}, since it uses the measured values of $y$ as regressors. The \acrshort{noe} structure instead uses the past predictions $\hat{y}$ as regressors:
\begin{equation}
\hat{y}(k) = f(w(k-1),\dots,w(k-l_w),\hat{y}(k-1),\dots,\hat{y}(k-l_y),u(k-1),\dots,u(k-l_u))
\end{equation}
Because it simulates the system behaviour multiple steps ahead, as opposed to predicting only one step ahead from current measurements, the \acrshort{noe} structure can be considered a \textit{simulation model}. To obtain the best simulation results from a \acrshort{gp} model, the \acrshort{noe} structure would have to be employed; however, due to the high algorithmic complexity of training and evaluating \acrshort{gp} models, this approach is computationally intractable. In practice, a \acrshort{narx} model is trained instead and validated through multi-step ahead prediction.
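To make the difference between the two structures concrete, the following NumPy sketch (illustrative only; the helper names and the single-step \texttt{predict} callable, e.g.\ the posterior mean of a trained \acrshort{gp}, are placeholders introduced here, and the exogenous inputs $w$ are omitted for brevity) builds \acrshort{narx} regressors from measured data and then evaluates a one-step model in \acrshort{noe} fashion by feeding its predictions back:
\begin{verbatim}
import numpy as np

def build_narx_regressors(y, u, l_y, l_u):
    """Rows [y(k-1)..y(k-l_y), u(k-1)..u(k-l_u)] and targets y(k)."""
    start = max(l_y, l_u)
    rows, targets = [], []
    for k in range(start, len(y)):
        rows.append(np.concatenate([y[k - l_y:k][::-1], u[k - l_u:k][::-1]]))
        targets.append(y[k])
    return np.asarray(rows), np.asarray(targets)

def simulate_noe(predict, y_init, u, l_y, l_u):
    """Multi-step simulation: feed predictions back as regressors
    (NOE-style evaluation of a model trained on NARX data).
    y_init must contain at least max(l_y, l_u) measured initial values."""
    y_sim = list(y_init)
    for k in range(len(y_init), len(u)):
        regressor = np.concatenate([np.asarray(y_sim[k - l_y:k])[::-1],
                                    u[k - l_u:k][::-1]])
        y_sim.append(predict(regressor[None, :])[0])
    return np.asarray(y_sim)
\end{verbatim}
The regressor ordering used for simulation must match the one used to build the training data, so that the one-step model sees inputs of the same structure in both cases.
\clearpage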