\section{Choice of Hyperparameters}
This section discusses and aims to validate the choice of all the
hyperparameters necessary for training a \acrshort{gp} model to capture
the CARNOT building's behaviour.

The class of black-box models is very versatile, being able to capture plant
behaviour directly from data. This is in contrast to white-box and grey-box
modelling techniques, which require much more physical insight into the plant's
behaviour. A further advantage of black-box models lies in the absence of
physical parameters to be fitted. On the flip side, the versatility of being
able to fit much more complex models purely from data comes at the cost of
having to properly define the model hyperparameters: the number of regressors,
the number of autoregressive lags for each class of inputs, and the shape of
the covariance function all have to be taken into account when designing a
\acrshort{gp} model. These choices have a direct influence on the resulting
model behaviour and on how well it generalizes, as well as an indirect
influence in the form of more time-consuming computations in the case of a
larger number of regressors and more complex kernel functions.
As described in Section~\ref{sec:gp_dynamical_system}, for the purpose of this
project, the \acrlong{gp} model will be trained using the \acrshort{narx}
structure. This already presents an important choice in the selection of
regressors and their respective autoregressive lags.

The output of the model has been chosen as the \textit{temperature measured
inside} the CARNOT building. This is a suitable choice for the \acrshort{ocp}
defined in Section~\ref{sec:mpc_problem}, where the goal is to track the
inside temperature of the building as closely as possible.
The input of the \acrshort{gp} model coincides with the input of the CARNOT
building, namely the \textit{heat} passed to the idealized \acrshort{hvac},
which is held constant over each step.
As for the exogenous inputs, the choice turned out to be more complex. The CARNOT
\acrshort{wdb} format (cf. Section~\ref{sec:CARNOT_WDB}) contains information
on all the solar angles, the different components of solar radiation, wind speed
and direction, temperature, precipitation, etc. All of this information is
required for CARNOT to function properly.

Including all of this information in the \acrshort{gp}'s exogenous inputs would
come with a few downsides. First, depending on the number of lags chosen for the
exogenous inputs, the number of inputs to the \acrshort{gp} could become very
large: an exogenous input vector of 10 elements with 2 lags would already yield
20 inputs for the \acrshort{gp} model. This is very computationally expensive
both when training and when using the model, as the algorithmic complexity of
\acrshort{gp} regression is $\mathcal{O}(n^3)$.

Second, this information may not always be available in experimental
implementations on real buildings, where legacy equipment might already be
installed, or where budget restrictions call for simpler equipment. An example
of this are the experimental datasets used for validation of the CARNOT model,
where the only available weather information is the \acrshort{ghi} and the
measurement of the outside temperature. This would also be a limitation when
obtaining weather predictions for the next steps during real-world experiments.

Last, while very detailed information, such as the solar angles and the
components of the solar radiation, is very useful for CARNOT, which simulates
each node individually knowing its absolute position, this information would
not always benefit the \acrshort{gp} model, at least not comparably to the
additional computational complexity it introduces.

For the exogenous inputs, the choice has therefore been made to use the
\textit{Global Solar Irradiance} and the \textit{Outside Temperature Measurement}.
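To make this concrete, the following minimal sketch (in Python, with
illustrative names rather than the exact implementation used in this project)
shows how the \acrshort{narx} regressor vectors could be assembled from the
measured signals, given the lag orders $l_w$, $l_u$ and $l_y$:
\begin{verbatim}
import numpy as np

def build_narx_dataset(w, u, y, l_w, l_u, l_y):
    """Assemble NARX regressors X and targets Y from measured signals.

    w : (N, n_w) exogenous inputs (here: GHI and outside temperature)
    u : (N,)     controlled input (heat to the idealized HVAC)
    y : (N,)     output (inside temperature)
    """
    lag = max(l_w, l_u, l_y)
    X, Y = [], []
    for k in range(lag, len(y)):
        X.append(np.concatenate([
            w[k - l_w:k].ravel(),  # lagged exogenous inputs
            u[k - l_u:k],          # lagged controlled inputs
            y[k - l_y:k],          # lagged outputs (autoregressive part)
        ]))
        Y.append(y[k])
    return np.asarray(X), np.asarray(Y)
\end{verbatim}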
\subsection{The Kernel}
The covariance function is an important choice when creating the \acrshort{gp}. A
properly chosen kernel can impose a desired prior behaviour on the
\acrshort{gp}, such as continuity of the function and its derivatives,
periodicity, linearity, etc. On the flip side, choosing the wrong kernel can
make computations more expensive, require more data to learn the proper
behaviour, or be outright numerically unstable and/or give erroneous predictions.
The \acrlong{se} kernel (cf. Section~\ref{sec:Kernels}) is very versatile,
theoretically being able to fit any continuous function given enough data. When
including the \acrshort{ard} behaviour, it also gives an insight into the
relative importance of each regressor, through their respective lengthscales.
Many different kernels have been used when identifying models for building
thermal control, such as a pure Rational Quadratic
Kernel~\cite{pleweSupervisoryModelPredictive2020}, a combination of
\acrshort{se}, \acrshort{rq} and a Linear
Kernel~\cite{jainLearningControlUsing2018}, and the Squared Exponential Kernel
together with kernels from the Mat\'ern family~\cite{massagrayThermalBuildingModelling2016}.
For the purpose of this project, the choice has been made to use the
\textit{\acrlong{se} Kernel}, as it provides a very good balance of versatility
and computational complexity for the modelling of the CARNOT building.
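As an illustration, such a kernel with \acrshort{ard} could be instantiated as
in the sketch below. It assumes the GPflow library (an assumption made for
illustration only; the exact toolchain is not prescribed here), and the number
of regressors is just an example:
\begin{verbatim}
import numpy as np
import gpflow

# One lengthscale per regressor enables ARD: after training, each
# fitted lengthscale reflects the relative importance of its regressor.
n_regressors = 6  # e.g. l_w = 1 (two weather signals), l_u = 1, l_y = 3
kernel = gpflow.kernels.SquaredExponential(
    variance=1.0,
    lengthscales=np.ones(n_regressors),
)
\end{verbatim}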
\subsection{Lengthscales}\label{sec:lengthscales}
The hyperparameters of the \acrshort{se} kernel can be useful when studying the
importance of regressors. The larger the distance between two inputs, the
less correlated they are. In fact, setting the kernel variance $\sigma^2 = 1$,
we can compute the correlation of two inputs located one, two and
three lengthscales apart.
\begin{table}[ht]
\centering
\begin{tabular}{||c c ||}
\hline
$\norm{\mathbf{x} - \mathbf{x}'}$ &
$\exp\left(-\frac{1}{2}\frac{\norm{\mathbf{x} - \mathbf{x}'}^2}{l^2}\right)$ \\
\hline \hline
$1l$ & 0.606 \\
$2l$ & 0.135 \\
$3l$ & 0.011 \\
\hline
\end{tabular}
\caption{Correlation of inputs relative to their distance}
\label{tab:se_correlation}
\end{table}
From Table~\ref{tab:se_correlation} it can be seen that at 3 lengthscales apart,
the inputs are already almost uncorrelated. In order to better visualize this
difference, the \textit{relative lengthscale importance} is introduced:
\begin{equation}
\lambda = \frac{1}{\sqrt{l}}
\end{equation}
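As a quick sanity check, the following snippet reproduces the correlations of
Table~\ref{tab:se_correlation} and computes $\lambda$ for a hypothetical
vector of fitted \acrshort{ard} lengthscales (the values are illustrative,
not taken from any fitted model):
\begin{verbatim}
import numpy as np

# Correlation of two inputs one, two and three lengthscales apart,
# with kernel variance sigma^2 = 1 (matches the table up to rounding).
for d in (1.0, 2.0, 3.0):
    print(f"{d:.0f}l: {np.exp(-0.5 * d**2):.3f}")

# Relative lengthscale importance lambda = 1 / sqrt(l) for a
# hypothetical vector of fitted ARD lengthscales.
lengthscales = np.array([1.9, 0.14, 3.1])
lam = 1.0 / np.sqrt(lengthscales)
\end{verbatim}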
Another indicator of model behaviour is the variance of the identified
\acrshort{se} kernel. The identified variance is expected to be around the
variance of the training outputs. An extremely high or extremely low value of
the variance could indicate a numerically unstable model.
Table~\ref{tab:GP_hyperparameters} presents the relative lengthscale importances
and the variance for different combinations of the exogenous input lags ($l_w$),
the controlled input lags ($l_u$) and the output lags ($l_y$) for a classical
\acrshort{gp} model.
\begin{table}[ht]
%\vspace{-8pt}
\centering
\resizebox{\columnwidth}{!}{%
\begin{tabular}{||c c c|c|c c c c c c c c c c c||}
\hline
\multicolumn{3}{||c|}{Lags} & Variance &\multicolumn{11}{c||}{Kernel
lengthscales relative importance} \\
$l_w$ & $l_u$ & $l_y$ & $\sigma^2$ &$\lambda_{w1,1}$ & $\lambda_{w1,2}$ &
$\lambda_{w1,3}$ & $\lambda_{w2,1}$ & $\lambda_{w2,2}$ &
$\lambda_{w2,3}$ & $\lambda_{u1,1}$ & $\lambda_{u1,2}$ &
$\lambda_{y1,1}$ & $\lambda_{y1,2}$ & $\lambda_{y1,3}$\\
\hline \hline
1 & 1 & 1 & 0.11     & 0.721 &       &       & 2.633 &       &       & 0.569 &       & 2.645 &       &       \\
1 & 1 & 2 & 22.68    & 0.222 &       &       & 0.751 &       &       & 0.134 &       & 3.154 & 3.073 &       \\
1 & 1 & 3 & 0.29     & 0.294 &       &       & 1.303 &       &       & 0.356 &       & 2.352 & 1.361 & 2.045 \\
1 & 2 & 1 & 7.55     & 0.157 &       &       & 0.779 &       &       & 0.180 & 0.188 & 0.538 &       &       \\
1 & 2 & 3 & 22925.40 & 0.018 &       &       & 0.053 &       &       & 0.080 & 0.393 & 0.665 & 0.668 & 0.018 \\
2 & 1 & 2 & 31.53    & 0.010 & 0.219 &       & 0.070 & 0.719 &       & 0.123 &       & 3.125 & 3.044 &       \\
2 & 1 & 3 & 0.44     & 0.007 & 0.251 &       & 0.279 & 1.229 &       & 0.319 &       & 2.705 & 1.120 & 2.510 \\
3 & 1 & 3 & 0.56     & 0.046 & 0.064 & 0.243 & 0.288 & 1.151 & 0.233 & 0.302 &       & 2.809 & 1.086 & 2.689 \\
3 & 2 & 2 & 1.65     & 0.512 & 0.074 & 0.201 & 0.161 & 1.225 & 0.141 & 0.231 & 0.331 & 0.684 & 0.064 &       \\
\hline
\end{tabular}%
}
\caption{GP hyperparameter values for different autoregressive lags}
\label{tab:GP_hyperparameters}
\end{table}
In general, the results of Table~\ref{tab:GP_hyperparameters} show that the
past outputs are important when predicting future values. The past inputs are
also important, with the exception of the models with very high variance, where
the relative importances stay almost constant across all the inputs. For the
exogenous inputs, the outside temperature ($w2$) is generally more important
than the solar irradiation ($w1$). When more autoregressive lags are used for
the exogenous inputs, the more recent information is usually more important in
the case of the solar irradiation, while the second-to-last measurement is
preferred for the outside temperature.
For the classical \acrshort{gp} model, the appropriate choice of lags would be
$l_u = 1$ and $l_y = 3$, with $l_w$ taking the value of either 1, 2 or 3,
depending on the results of further analysis.

As for the \acrlong{svgp}, the results for the classical
\acrshort{gp} (cf. Table~\ref{tab:GP_hyperparameters}) are not necessarily
representative of the relationships between the regressors of the
\acrshort{svgp} model, since its training relies on the \textit{inducing
variables}, which are not the real data, but a set of parameters chosen by the
training algorithm so as to best reproduce the original data.
Therefore, to better understand the behaviour of the \acrshort{svgp} models, the
same computations as in Table~\ref{tab:GP_hyperparameters} have been carried
out and are presented in Table~\ref{tab:SVGP_hyperparameters}:
\begin{table}[ht]
%\vspace{-8pt}
\centering
\resizebox{\columnwidth}{!}{%
\begin{tabular}{||c c c|c|c c c c c c c c c c c||}
\hline
\multicolumn{3}{||c|}{Lags} & Variance &\multicolumn{11}{c||}{Kernel
lengthscales relative importance} \\
$l_w$ & $l_u$ & $l_y$ & $\sigma^2$ &$\lambda_{w1,1}$ & $\lambda_{w1,2}$ &
$\lambda_{w1,3}$ & $\lambda_{w2,1}$ & $\lambda_{w2,2}$ &
$\lambda_{w2,3}$ & $\lambda_{u1,1}$ & $\lambda_{u1,2}$ &
$\lambda_{y1,1}$ & $\lambda_{y1,2}$ & $\lambda_{y1,3}$\\
\hline \hline
1 & 1 & 1 & 0.2970 & 0.415 & & & 0.748 & & & 0.675 & & 0.680 & & \\
1 & 1 & 2 & 0.2717 & 0.430 & & & 0.640 & & & 0.687 & & 0.559 & 0.584 & \\
1 & 1 & 3 & 0.2454 & 0.455 & & & 0.589 & & & 0.671 & & 0.522 & 0.512 & 0.529 \\
1 & 2 & 1 & 0.2593 & 0.310 & & & 0.344 & & & 0.534 & 0.509 & 0.597 & & \\
1 & 2 & 3 & 0.2139 & 0.330 & & & 0.368 & & & 0.537 & 0.447 & 0.563 & 0.410 & 0.363 \\
2 & 1 & 2 & 0.2108 & 0.421 & 0.414 & & 0.519 & 0.559 & & 0.680 & & 0.525 & 0.568 & \\
2 & 1 & 3 & 0.1795 & 0.456 & 0.390 & & 0.503 & 0.519 & & 0.666 & & 0.508 & 0.496 & 0.516 \\
3 & 1 & 3 & 0.1322 & 0.432 & 0.370 & 0.389 & 0.463 & 0.484 & 0.491 & 0.666 & & 0.511 & 0.501 & 0.526 \\
3 & 2 & 2 & 0.1228 & 0.329 & 0.317 & 0.325 & 0.334 & 0.337 & 0.331 & 0.527 & 0.441 & 0.579 & 0.435 & \\
\hline
\end{tabular}%
}
\caption{SVGP hyperparameter values for different autoregressive lags}
\label{tab:SVGP_hyperparameters}
\end{table}
The results of Table~\ref{tab:SVGP_hyperparameters} are not very surprising,
even if they differ markedly from the classical \acrshort{gp} case. The kernel
variance always has a reasonable value, and the relative importance of the
lengthscales is relatively constant across the board. It is, however, harder to
interpret these results in terms of the relevance of the chosen regressors. For
the \acrshort{svgp} model, the choice of the autoregressive lags has therefore
been made purely on the values of the loss functions, presented in
Table~\ref{tab:SVGP_loss_functions}.
\subsection{Loss functions}
The most important metric for measuring the performance of a model is the value
of the loss function, computed on a dataset separate from the one used for
training.
There exist a number of different loss functions, each focusing on a different
aspect of a model's performance. A selection of loss functions used in the
identification of Gaussian Process models is presented
below~\cite{kocijanModellingControlDynamic2016}.
The \acrfull{rmse} is a very commonly used performance measure. As the name
suggests, it computes the root of the mean squared error:
\begin{equation}\label{eq:rmse}
\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^N \left(y_i -
E(\hat{y}_i)\right)^{2}}
\end{equation}
This performance metric is very useful when training a model whose goal is
solely to minimize the difference between the measured values and the ones
predicted by the model.
A variant of the \acrshort{mse} is the \acrfull{smse}, which normalizes the
\acrlong{mse} by the variance of the output values of the validation dataset.
\begin{equation}\label{eq:smse}
\text{SMSE} = \frac{1}{N}\frac{\sum_{i=1}^N \left(y_i -
E(\hat{y}_i)\right)^{2}}{\sigma_y^2}
\end{equation}
While the \acrshort{rmse} and the \acrshort{smse} are very good at ensuring that
the predicted mean value of the Gaussian Process is close to the measured values
of the validation dataset, the confidence of the Gaussian Process prediction is
completely ignored. In this case, two models predicting the same mean values but
having very different confidence intervals would be equivalent according to
these performance metrics.
The \acrfull{lpd} is a performance metric which takes into account not only the
mean value of the GP prediction, but the entire distribution:
\begin{equation}
\text{LPD} = \frac{1}{2} \ln{\left(2\pi\right)} + \frac{1}{2N}
\sum_{i=1}^N\left(\ln{\left(\sigma_i^2\right)} + \frac{\left(y_i -
E(\hat{y}_i)\right)^{2}}{\sigma_i^2}\right)
\end{equation}
where $\sigma_i^2$ is the model's output variance at the \textit{i}-th step.
The \acrshort{lpd} scales the error of the mean value prediction $\left(y_i -
E(\hat{y}_i)\right)^{2}$ by the variance $\sigma_i^2$. This means that
overconfident models get penalized more than conservative ones for the same
mean prediction error, leading to models that better represent the real system.
The \acrfull{msll} is obtained by subtracting from the model's \acrshort{lpd}
the loss of a trivial model that predicts using a Gaussian with the mean
$E(\boldsymbol{y})$ and variance $\sigma_y^2$ of the measured data, and taking
the mean of the obtained result:
\begin{equation}
\text{MSLL} = \frac{1}{2N}\sum_{i=1}^N\left[
\ln{\left(\sigma_i^2\right) + \frac{\left(y_i -
E\left(\hat{y}_i\right)\right)^2}{\sigma_i^2}}
\right] - \frac{1}{2N}\sum_{i=1}^N\left[
\ln{\left(\sigma_y^2\right) + \frac{\left(y_i -
E\left(\boldsymbol{y}\right)\right)^2}{\sigma_y^2}}
\right]
\end{equation}
The \acrshort{msll} is approximately zero for simple models and negative for
better ones.
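For reference, the four metrics can be implemented in a few lines. The sketch
below directly follows the formulas above, with \texttt{mu} and \texttt{var}
denoting the predicted mean $E(\hat{y}_i)$ and variance $\sigma_i^2$ (the
function names are illustrative):
\begin{verbatim}
import numpy as np

def rmse(y, mu):
    return np.sqrt(np.mean((y - mu) ** 2))

def smse(y, mu):
    return np.mean((y - mu) ** 2) / np.var(y)

def lpd(y, mu, var):
    # Mean negative log predictive density of the Gaussian predictions.
    return 0.5 * np.log(2 * np.pi) + 0.5 * np.mean(
        np.log(var) + (y - mu) ** 2 / var)

def msll(y, mu, var):
    # Model log loss minus the log loss of the trivial Gaussian model
    # built from the mean and variance of the measured data.
    model_ll = np.log(var) + (y - mu) ** 2 / var
    naive_ll = np.log(np.var(y)) + (y - np.mean(y)) ** 2 / np.var(y)
    return 0.5 * np.mean(model_ll - naive_ll)
\end{verbatim}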
Table~\ref{tab:GP_loss_functions} and Table~\ref{tab:SVGP_loss_functions}
present the values of the different loss functions for the same lag combinations
as the ones analyzed in Section~\ref{sec:lengthscales} for the classical
\acrshort{gp} and the \acrshort{svgp} models respectively:
\begin{table}[ht]
%\vspace{-8pt}
\centering
\begin{tabular}{||c c c|c c c c||}
\hline
\multicolumn{3}{||c|}{Lags} & \multicolumn{4}{c||}{Loss functions}\\
$l_w$ & $l_u$ & $l_y$ & RMSE & SMSE & MSLL & LPD\\
\hline \hline
1 & 1 & 1 & 0.3464 & 0.36394 & 20.74 & 21.70 \\
1 & 1 & 2 & 0.1415 & 0.06179 & -9.62 & -8.67 \\
1 & 1 & 3 & 0.0588 & 0.01066 & -8.99 & -8.03 \\
1 & 2 & 1 & 0.0076 & 0.00017 & 71.83 & 72.79 \\
1 & 2 & 3 & \textbf{0.0041} & \textbf{0.00005} & 31.25 & 32.21 \\
2 & 1 & 2 & 0.1445 & 0.06682 & -9.57 & -8.61 \\
2 & 1 & 3 & 0.0797 & 0.02033 & -10.94 & -9.99 \\
3 & 1 & 3 & 0.0830 & 0.02219 & \textbf{-11.48} & \textbf{-10.53} \\
3 & 2 & 2 & 0.0079 & 0.00019 & 58.30 & 59.26 \\
\hline
\end{tabular}
\caption{GP Loss function values for different autoregressive lags}
\label{tab:GP_loss_functions}
\end{table}
For the classical \acrshort{gp} model (cf. Table~\ref{tab:GP_loss_functions}), a
number of lag combinations give rise to models with very large
\acrshort{msll}/\acrshort{lpd} values. This might indicate that those models are
overconfident, either due to a very large kernel variance parameter or to the
specific lengthscale combinations. The model with the best
\acrshort{rmse}/\acrshort{smse} metrics, \model{1}{2}{3}, has very bad
\acrshort{msll} and \acrshort{lpd} metrics, as well as by far the largest
variance of all the combinations. On the contrary, the \model{3}{1}{3} model has
the best \acrshort{msll} and \acrshort{lpd} performance, while still maintaining
small \acrshort{rmse} and \acrshort{smse} values. The inconvenience of this set
of lags is the large number of regressors, which leads to much more expensive
computations. Other good choices for the combinations of lags are
\model{2}{1}{3} and \model{1}{1}{3}, which have good performance on all four
metrics, while being cheaper from a computational perspective. In order to
make a more informed choice of the best hyperparameters, the simulation
performance of all three combinations has been analysed.
\clearpage
\begin{table}[ht]
%\vspace{-8pt}
\centering
\begin{tabular}{||c c c|c c c c||}
\hline
\multicolumn{3}{||c|}{Lags} & \multicolumn{4}{c||}{Loss functions}\\
$l_w$ & $l_u$ & $l_y$ & RMSE & SMSE & MSLL & LPD\\
\hline \hline
1 & 1 & 1 & 0.3253 & 0.3203 & 228.0278 & 228.9843 \\
1 & 1 & 2 & 0.2507 & 0.1903 & 175.5525 & 176.5075 \\
1 & 1 & 3 & 0.1983 & 0.1192 & 99.7735 & 100.7268 \\
1 & 2 & 1 & 0.0187 & 0.0012 & -9.5386 & -8.5836 \\
1 & 2 & 3 & \textbf{0.0182} & \textbf{0.0011} & \textbf{-10.2739} & \textbf{-9.3206} \\
2 & 1 & 2 & 0.2493 & 0.1884 & 165.0734 & 166.0284 \\
2 & 1 & 3 & 0.1989 & 0.1200 & 103.6753 & 104.6287 \\
3 & 1 & 3 & 0.2001 & 0.1214 & 104.4147 & 105.3681 \\
3 & 2 & 2 & 0.0206 & 0.0014 & -9.9360 & -8.9826 \\
\hline
\end{tabular}
\caption{SVGP Loss function values for different autoregressive lags}
\label{tab:SVGP_loss_functions}
\end{table}
The results for the \acrshort{svgp} model, presented in
Table~\ref{tab:SVGP_loss_functions}, are much less ambiguous. The \model{1}{2}{3}
model has the best performance according to all four metrics, with most of the
other combinations scoring much worse on the \acrshort{msll} and \acrshort{lpd}
loss functions. This model has, therefore, been chosen for the full year
simulations.
\subsection{Validation of hyperparameters}\label{sec:validation_hyperparameters}
The validation step has the purpose of testing the viability of the trained
models. If choosing a model according to loss function values on a new dataset
is a way of minimizing the possibility of over fitting the model to the training
data, validating the model by analyzing its multi-step prediction performance
ensures the model was able to learn the correct dynamics and is useful in
simulation scenarios.
The following subsections analyze the performance of the trained \acrshort{gp}
and \acrshort{svgp} models over 20-step ahead predictions. For the \acrshort{gp}
model, the final choice of parameters is made according to the simulation
performance. The simulation performance of the \acrshort{svgp} model is then
compared to that of the classical models, while speculating on the possible
reasons for the discrepancies.
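The multi-step predictions are obtained by iteratively feeding the model's
predictions back into the autoregressive part of the regressor vector. The
sketch below illustrates this simulation loop in a simplified form, feeding
back only the predicted mean (propagating the predictive variance is more
involved); the \texttt{model.predict} interface and the layout of the future
exogenous/controlled inputs are assumptions for illustration:
\begin{verbatim}
import numpy as np

def simulate(model, x0, w_future, u_future, l_y, steps=20):
    """Naive multi-step simulation: feed the predicted mean back
    into the autoregressive part of the regressor vector.

    model    : fitted GP exposing predict(X) -> (mean, variance)
    x0       : initial regressor vector [w lags | u lags | y lags]
    w_future : (steps, n_w * l_w) lagged exogenous inputs per step
    u_future : (steps, l_u)       lagged controlled inputs per step
    """
    x = np.asarray(x0, dtype=float)
    means, variances = [], []
    for k in range(steps):
        mu, var = model.predict(x[None, :])  # one-step prediction
        means.append(float(mu))
        variances.append(float(var))
        y_lags = np.roll(x[-l_y:], -1)       # drop the oldest output lag
        y_lags[-1] = float(mu)               # newest lag = predicted mean
        x = np.concatenate([w_future[k], u_future[k], y_lags])
    return np.array(means), np.array(variances)
\end{verbatim}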
\subsubsection{Conventional Gaussian Process}
The simulation performance of the three lag combinations chosen for the
classical \acrlong{gp} models has been analysed, with the results presented in
Figures~\ref{fig:GP_113_multistep_validation},~\ref{fig:GP_213_multistep_validation}
and~\ref{fig:GP_313_multistep_validation}. For reference, the one-step ahead
predictions for the training and test datasets are presented in
Appendix~\ref{apx:hyperparams_gp}.
\begin{figure}[ht]
\centering
\includegraphics[width =
\textwidth]{Plots/GP_113_-1pts_test_prediction_20_steps.pdf}
\vspace{-25pt}
\caption{20-step ahead simulation for \model{1}{1}{3}}
\label{fig:GP_113_multistep_validation}
\end{figure}
In the case of the simplest model (cf.
Figure~\ref{fig:GP_113_multistep_validation}), the predictions are overall
quite good. Large deviations from the true values start appearing at around 15
steps. This could impose an additional limit on the size of the control horizon
of the \acrlong{ocp}.
\begin{figure}[ht]
\centering
\includegraphics[width =
\textwidth]{Plots/GP_213_-1pts_test_prediction_20_steps.pdf}
\vspace{-25pt}
\caption{20-step ahead simulation for \model{2}{1}{3}}
\label{fig:GP_213_multistep_validation}
\end{figure}
The more complex model, presented in
Figure~\ref{fig:GP_213_multistep_validation}, has a much better prediction
performance, with only two predictions out of a total of twenty-five diverging
at the later steps. Except for this late-stage divergence on the two
predictions, it proves to be the best simulation model.
\begin{figure}[ht]
\centering
\includegraphics[width =
\textwidth]{Plots/GP_313_-1pts_test_prediction_20_steps.pdf}
\vspace{-25pt}
\caption{20-step ahead simulation for \model{3}{1}{3}}
\label{fig:GP_313_multistep_validation}
\end{figure}
Lastly, \model{3}{1}{3} has a much worse simulation performance than the other
two models. This could hint at an overfitting of the model on the training data.
This is consistent with the results found in Table~\ref{tab:GP_loss_functions}
for the \acrshort{rmse} and \acrshort{smse}, and can also be seen in
Appendix~\ref{apx:hyperparams_gp}, Figure~\ref{fig:GP_313_test_validation},
where the model has a much worse performance on the testing dataset predictions
than the other two models.
The performance of the three models in simulation mode is consistent with the
previously found results. It is of note that neither the model that scored
best on the \acrshort{rmse}/\acrshort{smse}, \model{1}{2}{3}, nor the one with
the best \acrshort{msll}/\acrshort{lpd}, \model{3}{1}{3}, performs best in a
simulation scenario. In the case of the former, this is due to numerical
instability, with training/prediction often failing depending on the inputs. In
the case of the latter, focusing only on the \acrshort{msll}/\acrshort{lpd}
performance metrics can lead to very conservative models that give good and
confident one-step ahead predictions, while still being unable to fit the true
behaviour of the plant.

Overall, \model{2}{1}{3} performed best in the simulation scenario, while still
having good performance on all loss functions. In implementation, however, this
model turned out to be very unstable, and the more conservative
\model{1}{1}{3} model was used instead.
\clearpage
\subsubsection{Sparse and Variational Gaussian Process}
For the \acrshort{svgp} models, only the performance of \model{1}{2}{3} was
investigated, since it had the best performance according to all four loss
metrics.
As a first validation step, it is of note that the \acrshort{svgp} model was
able to accurately reproduce the training dataset with only 150 inducing
locations (cf. Appendix~\ref{apx:hyperparams_svgp}). It also performs about as
well as the better \acrshort{gp} models for the one-step ahead prediction on
the testing datasets.
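For illustration, such a model could be set up as in the following sketch,
again assuming GPflow and using placeholder data; only the number of inducing
locations (150) is taken from the text, everything else is an assumption:
\begin{verbatim}
import numpy as np
import gpflow

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))  # placeholder NARX regressors

M = 150  # number of inducing locations, as used above
# Initialize the inducing inputs from a random subset of the data;
# the optimizer is then free to move them during training.
Z = X[rng.choice(len(X), size=M, replace=False)].copy()

model = gpflow.models.SVGP(
    kernel=gpflow.kernels.SquaredExponential(
        lengthscales=np.ones(X.shape[1])),
    likelihood=gpflow.likelihoods.Gaussian(),
    inducing_variable=Z,
    num_data=len(X),
)
\end{verbatim}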
In the case of the simulation performance, presented in
Figure~\ref{fig:SVGP_multistep_validation}, two things are of particular
interest. First, all 25 simulations have a good overall behaviour --- none of
the simulations starts to exhibit erratic behaviour --- which is a good
indicator of a lack of overfitting. Second, this behaviour is indicative of a
more conservative model than the ones identified for the classical
\acrshort{gp}. It is also possible to conclude that, given the same amount of
data, the classical \acrshort{gp} models can better learn the plant behaviour,
provided the correct choice of regressors.
\begin{figure}[ht]
\centering
\includegraphics[width =
\textwidth]{Plots/SVGP_123_test_prediction_20_steps.pdf}
\caption{20-step ahead simulation for \model{1}{2}{3}}
\label{fig:SVGP_multistep_validation}
\end{figure}
\clearpage