Approximate parameter inference in systems biology using gradient matching: a comparative evaluation

Background A challenging problem in current systems biology is that of parameter inference in biological pathways expressed as coupled ordinary differential equations (ODEs). Conventional methods that repeatedly numerically solve the ODEs have large associated computational costs. Aimed at reducing this cost, new concepts using gradient matching have been proposed, which bypass the need for numerical integration. This paper presents a recently established adaptive gradient matching approach, using Gaussian processes (GPs), combined with a parallel tempering scheme, and conducts a comparative evaluation with current state-of-the-art methods used for parameter inference in ODEs. Among these contemporary methods is a technique based on reproducing kernel Hilbert spaces (RKHS). This has previously shown promising results for parameter estimation, but under lax experimental settings. We look at a range of scenarios to test the robustness of this method. We also change the approach of inferring the penalty parameter from AIC to cross validation to improve the stability of the method. Methods Methodology for the recently proposed adaptive gradient matching method using GPs, upon which we build our new method, is provided. Details of a competing method using RKHS are also described here. Results We conduct a comparative analysis for the methods described in this paper, using two benchmark ODE systems. The analyses are repeated under different experimental settings, to observe the sensitivity of the techniques. Conclusions Our study reveals that for known noise variance, our proposed method based on GPs and parallel tempering achieves overall the best performance. When the noise variance is unknown, the RKHS method proves to be more robust.


Background
A central objective of current systems biology research is explaining the actions among components in biopathways. A standard approach is to view a biopathway as a network of biochemical reactions, which is modelled as a system of ordinary differential equations (ODEs).
This system can typically be expressed as: where s ∈ {1, . . . , N } denotes one of N components (referred to throughout as "species") in the biopathway, x s (t i ) denotes the concentration of species s at time t i and x(t i ) is a vector of concentrations of all system components that influence or regulate the concentration of species s at time t i . 1 If, for example, species s is an mRNA, then x(t i ) might contain the concentrations of transcription factors (proteins), from which s is transcribed, that bind to the promoter of the gene. The regulation is described by the regulation function f. The type of regulatory interaction depends on the species involved, e.g., f may describe mass action kinetics, Michaelis-Menten kinetics, etc. All of these interactions depend on a vector of kinetic parameters, θ s . For many biopathways, only a small fraction of θ s can be measured in practice. Therefore, in order to understand the dynamics of the biopathway, the majority of these kinetic parameters need to be inferred from observed (typically noisy and sparse) time course concentration profiles. Conventional inference methods typically rely on searching the space of θ values, and at each candidate, numerically solving the ODEs and comparing the output with that observed. After choosing an appropriate noise model, the form of the likelihood is defined, and a measure of similarity between the data signals and the signals described by the current set of ODE parameters can be calculated. This process is repeated, as part of either an iterative optimisation scheme or sampling procedure in order to estimate the parameters. However, the computational costs involved with repeatedly numerically solving the ODEs are usually high.
Several authors have adopted approaches based on gradient matching (e.g. [1,2]), aiming to reduce this computational complexity. These approaches are based on the following two-step procedure. At the first step, interpolation is used to smooth the time series data, in order to avoid modelling noisy observations; in a second step, the kinetic parameters θ of the ODEs are either optimised or sampled, whilst minimising some metric measuring the difference between the slopes of the tangents to the interpolants, and the θ-dependent time derivative from the ODEs. In this fashion, the ODEs never have to be numerically integrated, and the problem of inferring the typically unknown initial conditions of the system is removed, as it is not required for matching gradients. A downside to this two-step scheme is that the results of parameter inference are critically dependent on the quality of the initial interpolant. Alternatively, as first suggested in [3], we can allow the ODEs to regularise the interpolant. Dondelinger et al. [4] applied this to the nonparametric Bayesian approach in [1], which uses Gaussian processes (GPs), and demonstrated that it significantly improves the parameter inference 1 A summary of the notation used throughout this paper can be found in Table 1.
accuracy and robustness with respect to noise. Unlike in [3], all hyperparameters that control the smoothness of the interpolants are consistently inferred in the framework of nonparametric Bayesian statistics, which dispenses with the need to use heuristics and approximations in the configuration of the interpolation function.
We further the work of [4] in two respects. Firstly, we combine adaptive gradient matching using GPs with a parallel tempering scheme for the parameter that controls the mismatch between the gradients. This is conceptually different from the inference paradigm of the mismatch parameter that Dondelinger et al. [4] uses. Ideally, if the ODEs provide a correct mathematical description of the system, there should be no difference between the gradients of the interpolant and those predicted from the ODEs. However, in practice, forcing the gradients to be equal is likely to cause parameter inference methods to converge to a local optimum of the likelihood. Forcing the gradients to immediately be the same would restrict the inference procedure to a section of the likelihood corresponding to parameters that agree with the gradient match. However, there is no guarantee that these parameters are suitable for the data, see [5] for details. A parallel tempering scheme is the natural way to deal with such local optima, as opposed to inferring the degree of mismatch, since different tempering levels correspond to different strengths of penalising the mismatch between the gradients.Campbell and Steele [5] explore a parallel tempering scheme, but in order to get an understanding as to how well utilising this scheme improves inference, the rest of the set-up should be as similar as possible. Hence, comparing the results directly to the GP approach in [4], won't provide us with this understanding, since the approach in [5] uses a different methodological paradigm. In this paper, we present a comparative assessment of parallel tempering versus inference in the context of gradient matching for the same modelling framework, i.e. without any confounding influence from the modelling choice. Secondly, we compare the approach of Bayesian inference using GPs with a variety of other methodological paradigms, within the specific context of comparing the gradients from the interpolant to the gradients from the ODEs, which is highly relevant to current computational systems biology.
We test the methods on two benchmark ODE systems: the Fitz-Hugh Nagumo system [which can model the behaviour of cardiac conditions such as: electrical excitation-conduction [6]; cardiac action potentials [7], and arrhythmias [8], as well as neurodegenerative diseases ( [9,10])], and a protein signalling transduction pathway [where cell signalling pathways can model cancers [11] and neurodegenerative diseases such as: Alzheimer's disease; Parkinson's disease, and amyotrophic lateral sclerosis (ALS) [12]], systems that are highly relevant to current biomedical engineering.
This paper is an extended version of [13] and includes a full description of the RKHS method in [14], as well as a series of comparative simulation studies using this method on the Fitz-Hugh Nagumo system, under different observational noise scenarios, and a protein signalling transduction pathway. The description of the methodology in [13] and [14] is also outlined in the review article [15], included here with permission from the authors.

Adaptive gradient matching with Gaussian processes
Consider a set of T arbitrary timepoints t 1 < · · · < t i < · · · < t T , and noisy observations is the data vector of the observations of all species concentrations at time t i , x(t i ) is the vector of the concentrations of all species at time t i , y s is the data vector of the observations of species concentrations s at all timepoints, x s is the vector of concentrations of species s at all timepoints, y s (t i ) is the observed datapoint of the concentration of species s at time t i , x s (t i ) is the concentration of species s at time t i and ǫ is multivariate Gaussian noise, ǫ ∼ N (0, σ 2 s I). The time-dependent signals of the system can be described by ordinary differential equations which can be represented in scalar form as T and ẋ s is the vector containing the gradients from the ODEs for species s at all timepoints. Then, where the dimension of the matrices X and Y are N by T. We take the approach in [1], and place a GP prior on x s , where φ s is a mean vector, which for simplicity we set as the sample mean, and K η s is a positive definite matrix of covariance functions with hyperparameters η s . Differentiation is a linear operation, and therefore a GP is closed under differentiation ( [16,17]), meaning that the joint prior distribution of the concentrations of the species x s and their time derivatives ẋ s is multivariate Gaussian with mean (φ s , 0) T and covariance functions where K η s (t i , t j ) are the components of the covariance matrix K η s . The conditional distribution for the state derivatives is obtained using elementary transformations of Gaussian distributions (see page 87 of [18] for details), yielding where We assume additive Gaussian noise with a state-specific error variance γ s , and so, from Eq. (2), we get Using a product of experts approach, Calderhead et al. [1], Dondelinger et al. [4], link the interpolant in Eq. (10) with the ODE model in Eq. (12), giving us the following distribution Therefore, the joint distribution is where γ is the vector which contains all the gradient mismatch parameters and p(θ), p(η), p(γ ) are the prior distributions over the respective parameters. Dondelinger et al. [4] show that the marginalisation over the state derivatives yields a closed form solution to Using the noise model in Eq. (4) and the closed form solution to Eq. (15), our full joint distribution becomes where p(σ 2 ) is the prior over the variance of the observational error. The work in [4] shows (13) p(ẋ s |X, θ s , φ s , η s , γ s ) ∝ p(ẋ s |x s , φ s , η s )p(ẋ s |X, θ s , γ s ) = N (µ s , A s )N (f s (X, θ s , t), γ s I).
where C = s |2π(A s + γ s I)| 1 2 and f s is the vector containing the ODE predicted gradients for species s. Sampling is conducted using MCMC and the whitening approach of [19] is used to efficiently sample in the joint space of latent variables X and GP hyperparameters η.
Parallel tempering: Consider a series of "temperatures", 0 = α (1) < · · · < α (M) = 1 and a power posterior distribution of our ODE parameters ( [20]) It is clear that Eq. (18) becomes the prior for α (i) = 0 and is the posterior when α (i) = 1. For 0 < α (i) < 1 we get a distribution between our prior and posterior. The M α (i) s in Eq. (18) are annealed likelihoods that are used as the target densities of parallel MCMC chains ( [5]). At each MCMC step, all "temperature" chains independently perform a metropolishastings step to update θ (i) , the parameter vector associated with temperature α (i) where q( ) represents the proposal distribution and the superscripts "prop" and "curr" indicate whether the algorithm is being evaluated at the proposed or current state, respectively. At each MCMC step, two chains are randomly selected (uniformly) and the corresponding parameters are proposed to swap between them. This proposal has acceptance probability The method we develop in this paper focuses on the intrinsic slack parameter γ s (see Eq. 12), which theoretically should be γ s = 0, since this corresponds to no mismatch between the gradients. In practice, to prevent the inference scheme from getting stuck in sub-optimal states, it is allowed to take on larger values γ s > 0. However, rather than inferring γ s like a model parameter, as Dondelinger et al. [4] do, other authors (e.g. [5]) state that γ s should be gradually set to zero, since values closer to zero force the gradients to be more similar to one another and allow the interpolants to be informed by the ODEs. It is possible to abruptly set the values to zero, rather than gradually, however this is likely to cause the parameter inference techniques to converge to a local optimum of the likelihood. Hence, we combine the gradient matching with GPs approach in [4] with the tempering approach in [5] and temper this parameter to zero.
Prior to the parameter inference, we choose values of γ s and assign them to the variance parameter in Eq. (12) for each "temperature" α (i) , such that chains closer to the prior (α (i) values closer to 0) allow the gradients from the interpolant to have more freedom to deviate from those predicted by the ODEs (which corresponds to larger γ s values), chains closer to the posterior (α (i) values closer to 1) more closely match the gradients (corresponding to smaller γ s values), and for the chain corresponding to α (M) = 1 , we want the mismatch to be approximately zero (γ s ≈ 0). Since γ s corresponds to the (18) p α (i) (θ (i) |y) ∝ p(θ (i) )p(y|θ (i) ) α (i) . variance of our species-specific error (see Eq. 12), as γ s → 0, we have less difference between the gradients, and as γ s gets larger, the gradients have more freedom to deviate from one another. Hence, we temper γ s towards zero. Now, each α (i) chain in Eq. (18) has a γ (i) s (where the superscript (i) indicates the gradient mismatch parameter associated with "temperature" α (i) ) fixed in place for the strength of the gradient mismatch. The specific schedules of the gradient mismatch parameter are included in Table 4.

Reproducing kernel Hilbert space
Reproducing kernel Hilbert spaces (RKHS) allow for any function defined in an RKHS to be written as a linear combination of the kernel function evaluated at the training points. This provides a computationally fast process for interpolation. The objective function is expressed as where y denotes the data, K is a matrix of kernel elements for all combinations of observed timepoints and c is a vector of coefficients. Minimising with respect to c gives us Hence, where t * is the timepoint at which one wants to make predictions and k * is the vector of kernel elements for all combinations of t * and t i . Note that this form is the same form as a posterior mean of a GP predictive distribution. For more details on RKHS, see [21].

Penalised likelihood with RKHS
The aim of González et al. [14] is to create a penalised likelihood function that incorporates the information of the ODEs, then, using the properties of reproducing kernel Hilbert spaces, perform parameter estimation in a computationally fast manner. González et al. [14] consider ODEs of the form which can be represented in scalar form as where x s is the vector of concentrations for species s, δ s is the degradation rate of the concentrations for species s, ρ s is a parameter vector for species s and g s (t) = g s (t 1 ), . . . , g s (t T ) T . It is important to realise the difference between Eqs. (1) and (24). Whereas in Eq. (1), all parameter terms are included in the function f s (), Eq. (24) To make it clear how Dx s is computed, as an example let us consider x s = (x(t 1 ), . . . , x(t 5 )) T and t = (3, 4, 5, 6, 7) T . Then Now denote R = D + δ s I (where I is the identity matrix). This gives us the following penalty to be incorporated into the likelihood term: From Eq. (27), we can see that Rx s − g s (X, ρ s , t) = 0. However, since x s = 0 does not necessarily imply that (x s ) = 0, Eq. (29) cannot be expressed as a norm of x s within the RKHS framework. In order to make them compatible, the authors transform the state variables x s (and subsequently y s ). Instead, consider (where Eq. (33) is used as the term in the far right of Eq. (21) and c is given in Eq. (22)).
We are able to obtain closed form expressions for the transformed state variables by using Eqs. (22) and (23) (the original expressions can be recovered using Eq. 30) where is the covariance matrix of the data (generalising Eq. 22, since the observational error of our data may not be independent between species) and s is a penalty parameter.
In the case of homogeneous ODEs, where g s () = 0, a kernel in a Hillbert space can be constructed using the Green's function of the linear operator R. K is the Green's function of R T R, where R T is the adjoint operator of R. Aronszajin et al. [22] show Since the analytical form of Green functions of R T R is not available, the differential operator is approximated with the difference operator (D ). In the non-homogeneous ODE system, the model is linearised by feeding surrogate x s (using spline interpolation, in this case) into g s (). �(x s ) is still a valid RKHS norm for the transformed variable x s defined in Eq. (30).
The penalised log-likelihood function is now expressed as where α s is the vector containing the coefficients from the spline interpolant for species s. Parameter estimation using Eq. (35) can be carried out with standard non-linear optimisation algorithms such as quasi-Newton or conjugate gradients.
In the original paper of [14], the penalty parameter s is inferred using AIC. For a given value of s , Eq. (35) is optimised to estimate the ODE parameters and subsequently the AIC score of the procedure is calculated. This is repeated for different s values and the s value corresponding to the smallest AIC score is chosen. As well as using this approach for estimating s , we found that using threefold cross validation, instead of AIC, provided more robust parameter estimation. We present the results from both schemes.

Brief summary of methods
The detailed methodology we cover previously pertains to our new method and to the methods we were able to obtain the authors' code for and adapt to many experimental settings. We were unable to obtain the authors' code for the publication in Ramsay et al. [3]. The run times for the parallel tempering method of Campbell and Steele [5], using the authors' own code, were excessive (typically in the order of days), and we were therefore unable to carry out an exhaustive exploration of the method. The methods of Ramsay et al. [3] and Campbell and Steele [5] are included as a benchmark comparison and so we leave it to the readers to refer to the original publications for full details. The following is a brief summary of all the methods we compare. Since many methods and settings are used in this paper for comparison purposes, abbreviations are used for ease of reading. Table 2 contains a key for those methods.
C&S [5]: Parameter inference is carried out using adaptive gradient matching and tempering of the mismatch parameter. B-splines are used as the choice of interpolation scheme. INF [4]: This method conducts parameter inference through adaptive gradient matching using GPs. The penalty mismatch parameters γ s are inferred. LB2: This method conducts parameter inference through adaptive gradient matching using GPs. The penalty mismatch parameters γ s are tempered in log base 2 increments, see Table 4 for details. LB10: As with LB2, parameter inference is conducted through adaptive gradient matching using GPs, however, the penalty mismatch parameters γ s are tempered in log base 10 increments, see Table 4 for details. GON [14]: Parameter inference is conducted in a non-Bayesian fashion, implementing a reproducing kernel Hilbert space (RKHS) and penalised likelihood approach. Comparisons between RKHS and GPs have been previously explored conceptually (for example, see [21,23]), and in this paper we analyse them empirically in the specific context of inference in ODEs. The RKHS method that incorporates the information from the ODEs in [14] obtains the ODE kernel using a differencing operator. AIC is used to estimate the penalty parameter . GON Cross [14]: The method is the same as GON, however, cross validation is used to estimate the penalty parameter , instead of AIC. RAM [3]: This technique uses a non-Bayesian optimisation process for parameter inference. The method penalises the difference between the gradients using splines and a hierarchical 3 level regularisation approach is used to set the tuning parameters (see [3] for details). Table 3 describes particular settings with some of the methods in Table 2. The ranges of the penalty parameters γ s , for the LB2 and LB10 methods are given in Table 4. The increments are linear on the log scale. The M α s s from 0 to 1 are set by taking a series of M equally spaced values and raising them to the power 5, as described in [20].

Data
Fitz-Hugh Nagumo ([24, 25]): These equations originally were used to describe the voltage potential across the cell membrane of the axon of giant squid neurons. There are 3 parameters; α, β and ψ and two "species"; Voltage (V) and Recovery variable (R). Species in [ ] denote the time-dependent concentration for that species: The Fitz-Hugh Nagumo equations are used in biomedical engineering to model features such as cardiac conditions (i.e. electrical excitation-conduction in cardiac tissue [6], cardiac action potentials [7] and arrhythmias [8]) and neurodegenerative diseases (Drosophila courtship can be modelled using these equations and used to screen genes linked to memory-deficiency and human neurodegeneration [9] and the system can also be used for diagnosing leprosy [10]).
Protein signalling transduction pathway [26]: These equations describe protein signalling transduction pathways in a signal transduction cascade [26], where the kinetic parameters control how quickly the proteins ("species") convert to one another. There are 6 parameters (k 1 , k 2 , k 3 , k 4 , V , K m ) and 5 "species" (S, dS, R, RS, Rpp). The system describes the phosphorylation of a protein, R → Rpp (Eq.  Originally, the authors use 401 observations. We reduced this to a dataset size more usual with these types of experiments to observe the dependency of the methods on the amount of data 15K 15 knots The method in C&S uses B-splines interpolation. We changed the original tuning parameters from the authors' paper to observe the sensitivity of the parameter estimation by these tuning parameters P3 polynomial order 3 (cubic spline) The original polynomial order is 5 and again, we wanted to observe the sensitivity of the parameter estimation by these tuning parameters Table 4 Ranges of the penalty parameter γ s for LB2 and LB10 In this paper γ s = γ ∀s Cell signalling is a highly relevant topic in current biomedical engineering. The equations can model cancers [11] and neurodegenerative diseases that include Alzheimer's disease, Parkinson's disease and ALS [12]. These ODE systems give us benchmark data and produce periodic signals (in the Fitz-Hugh Nagumo system) and signals that make a transition to a stationary phase (protein signalling transduction pathway), which is representative of models in this area. Hence, we can assess the methods discussed in this paper on systems that are meaningful to the field of biomedical engineering.

Simulation
We have compared the proposed GP tempering scheme with the alternative methods summarised in the "Methods" section. For the comparison to Ramsay et al. [3], we were unable to obtain the authors' software and so we compared our results directly with the results from the original publications. Hence, we generated test data in the same manner as described by the authors and used them for the evaluation of our method. For the methods in Campbell and Steel [5], Dondelinger et al. [4] and González et al. [14], where we did receive the authors' software, we repeated the evaluation twice, first on data equivalent to those used in the original publications, and again on new data generated with different (more realistic) parameter settings. For comparisons using the Fitz-Hugh Nagumo model, Eqs. (36) and (37), we used the ODE prior distributions in [5] and for comparisons using the protein signalling transduction pathway model, Eqs. (38-42), we used the parameter priors from [4]. This gave us priors that were motivated by the current literature. Our code is available upon request.
Tempered mismatch parameter using splines-based smooth functional tempering (C&S) [5]: The authors tested their method on the Fitz-Hugh Nagumo system, Eqs. (36) and (37), with the following parameter settings: α = 0.2, β = 0.2 and ψ = 3, starting from initial values of (−1, 1) for the two "species". They generated 401 observations over the time course [0, 20] (producing 2 periods) and Gaussian noise with sd {0.5, 0.4} was used to corrupt each respective "species". To infer the ODE parameters with their approach, the authors chose the following settings: B-splines of polynomial order 5 with 301 knots; 4 parallel tempering chains, gradient mismatch parameter schedules {10,100,1000,10000}; parameter prior distributions for the ODE parameters: α ∼ N (0, 0.4 2 ), β ∼ N (0, 0.4 2 ) and ψ ∼ χ 2 2 . As well as comparing our method with the results the authors had obtained with their original settings, we made the following modifications to test the robustness of (39) their procedure. We reduced the number of observations from 401 to 20 over the time course [0, 10] (producing 1 period), which more closely reflects the amount of data typically available in current systems biology. In doing so, we also reduced the number of knots for the splines to 15 (preserving the same proportionality of knots to datapoints as before), and we tried a different polynomial order: 3 instead of 5. The method incurred high computational costs, (roughly 1 1 2 weeks for a run), and so we could only repeat the inference on 3 independent data sets. The posterior samples were combined in order to approximately marginalise over datasets and thereby remove their potential particularities. For a fair comparison, we also ran our methods with 4 rather than the 10 chains that we used as default.
Inference of the gradient mismatch parameter using GPs and adaptive gradient matching (INF) [4]: We applied the method in the same way as described in the original publication of [4], using the authors' software and selecting the same kernels and parameter/ hyperparameter priors for the method proposed in the present paper. We generated data from the protein signal transduction pathway described in Eqs. (38-42), with the same settings as in [4]; initial values of the species: (S = 1, dS = 0, R = 1, RS = 0, Rpp = 0) ; ODE parameters: (k 1 = 0.07, k 2 = 0.6, k 3 = 0.05, k 4 = 0.3, V = 0.017, K m = 0.3); 15 timepoints producing one period: {0, 1, 2, 4, 5, 7, 10, 15, 20, 30, 40, 50, 60, 80, 100}. As in [4], we used multiplicative iid Gaussian noise of standard deviation = 0.1 to corrupt the signals and reflect the noisy observations obtained in experiments. We chose the same gamma prior on the ODE parameters as used in [4], namely Ŵ(4, 0.5), for Bayesian inference. For the GP, we used the same kernel they originally used; see further on for details. In addition to this ODE system, we also applied this method to the rest of the described set-ups.
As well as comparing to the original publication set-up, we also tested the methods on a scenario with larger observational noise. We tested on 2 scenarios, when the signal to noise ratio was on average 10 for each species and when the average signal to noise ratio was 5. We used the average signal to noise ratio so that each species had the same observational error as one another. We reduced the dataset size to 25 timepoints over the time course [0, 10], producing 1 period, and show the results across 10 independent datasets.
To observe the variation between ODE models, we also ran the method on the protein signal transduction pathway in Eqs. (38-42). We generated data under the same settings as in [4]; ODE parameters: (k 1 = 0.07, k 2 = 0.6, k 3 = 0.05, k 4 = 0.3, V = 0.017, K m = 0.3) ; initial values of the species: (S = 1, dS = 0, R = 1, RS = 0, Rpp = 0); 15 timepoints covering one period: {0, 1, 2, 4, 5, 7, 10, 15, 20, 30, 40, 50, 60, 80, 100}. We examined 2 noise scenarios; when the average signal to noise ratio was 10, and when the average signal to noise ratio was 5. As opposed to the set-up in [4], we use additive Gaussian noise to corrupt the data, to correspond with the assumed noise model. [3]: González et al. [14] used the method of Ramsay et al. [3] to compare with their technique. We have used the results from the original publication of [14]. For fairness of comparison, our method was applied in the same way as with the set-up in [14].

Penalised splines and 2nd derivative penalty method (RAM)
Choice of kernel: For the GP, we need to choose a suitable kernel, which reflects our prior knowledge in function space. We considered two kernels in our study (to correspond with the authors' set-ups), the radial basis function (RBF) kernel with hyperparameters σ 2 RBF and l 2 , and the sigmoid variance kernel with hyperparameters σ 2 sig , a and b [23]. To initialise the hyperparameters, we fit a standard GP regression model (i.e. without information from the ODE) using maximum likelihood. We then checked to see whether the interpolant adequately represents our prior knowledge.
We found that the RBF kernel provided a good fit to the data for the data generated from the Fitz-Hugh Nagumo model. However, in confirmation of the findings in [4], we found that for the protein signalling transduction pathway, the non-stationary nature of the data is not represented properly with the RBF kernel, which is stationary [23]. As in [4], we used the sigmoid variance kernel, which is non-stationary [23] and found a considerable improvement to the fit to the data.
Other settings: We need to set the values for our variance mismatch parameter of the gradients, γ s . Since studies that indicate reasonable values for our technique are limited (see [1,20]), we used Log 2 and Log 10 increments with an initial start at 1. All parameters were initialised with a random draw from the respective priors (apart from GON, which did not use priors).

Results
Tempered mismatch parameter using splines-based smooth functional tempering (C&S) [5]: By examining Figs. 1, 2 and 3, we can see that the C&S method shows good performance over all parameters in the one case where the number of observations is 401, the number of knots is 301 and the polynomial order is 3 (cubic spline), since the bulk of the distributions of the sampled parameters surround the true parameters in Figs. 1 and 3 and are close to the true parameter in Fig. 2. These settings, however, require a great deal of "hand-tuning" or time expensive cross-validation and would be very difficult to set when using real data. We can observe the sensitivity of the method in the other set-ups, where the results are noticeably worse. An important point to note is when the dataset size was reduced, the cubic spline performed very badly. This lack of robustness makes these splines based methods very difficult to apply in practice. The INF, LB2 and LB10 methods consistently outperform the C&S method with distributions being closer to or overlapping the true parameters. On the set-up with 20 observations, for both 4 and  Inference of the gradient mismatch parameter using GPs, adaptive method (INF) [4]: In order to see how the LB2 and LB10 tempering methods perform in comparison to the INF method, we can examine the results from the protein signalling transduction pathway (see Eqs. 38-42), as well as comparing how each method did in the other setups. Figure 4 shows the distributions of parameter estimates minus the true values for the protein signalling transduction pathway. After implementing the authors' code, we noticed that some of the MCMC simulations had not converged. In order to present a fair depiction of the methods' performance, we show results from the dataset that produced the median performance. For each dataset the root mean square was calculated on the parameter samples minus the true values. The dataset that produced the median root mean square value is given.
We can see by examining Fig. 4, that for each parameter the methods are performing well, since the distributions are close to the true values. For this set-up, overall there does not appear to be a significant difference between the INF, LB2 and LB10 methods.
For the original set-up in [14], Fig. 5 shows the expected cumulative distribution functions (ECDFs) of the absolute errors of the parameter samples for the tempering and inference schemes. P-values for 2-sample, 1-sided Kolmogorov-Smirnov tests are given. Since the distributions are of the average error, if a distribution's ECDF is significantly higher than another's, this constitutes better parameter estimation. A higher curve means that there are more values located in the lower range of absolute error.    [13], with permission from Springer By examining Fig. 5 and using the standard significance level of 0.05 as a cut-off, we can see that the CDFs for LB2 and LB10 are significantly higher than those for INF. This means that the parameter estimates from the LB2 and LB10 methods are closer to the true parameters than the INF method, since we are dealing with absolute error. The LB2 and LB10 method show no significant difference to each other.
For the set-up in [5], Figs. 1, 2 and 3 show us that the LB2 and LB10 methods perform well across dataset size and over all the parameters, since most of the mass of the distributions surround or are situated close to the true parameters. One type of scheduling did not always outperform another, the LB2 does better than the LB10 for 4 parallel chains (distributions overlapping the true parameter for all three parameters) and the LB10 outperforms the LB2 for 10 parallel chains (distribution overlapping true parameter in Fig. 1, being closer to the true parameter in Fig. 2 and narrower and more symmetric around the true parameter in Fig. 3). The bulks of parameter sample distributions for the INF method are located close to the true parameters for all dataset sizes. However, the method produces less uncertainty at the expense of bias. When reducing the dataset size to 20 observations, for both 4 and 10 chains, the results deteriorate for the INF method and it is outperformed by the LB2 and LB10 methods.
Reproducing kernel Hilbert space (GON) [14] and Hierarchical regularisation splines based method (RAM) [3]: For these sets of results, to assess the performance of the methods, we used the same criterion as in GON. For each parameter, the absolute value of the difference between an estimator and the true parameter (|θ i − θ i |) was computed and the distribution across the datasets was examined. For the LB2, LB10 and INF methods, we used the median of the sampled parameters as an estimator, since it is a robust average. Fig. 6 Boxplots of the distributions of the absolute differences of an estimate to the true parameter over 50 datasets. The three sections from left to right represent the parameters α, β and ψ from the Fitz-Hugh Nagumo model (Eqs. 36, 37). Within each section, the boxplots from left to right are: LB2 method, INF method, LB10 method, GON method (boxplot reconstructed from [14]) and RAM method (boxplot reconstructed from [14]). For definitions, see Tables 2 and 3. Figure reproduced from [13], with permission from Springer Examining Fig. 6, the LB2, LB10 and INF methods do as well as the GON method for 2 parameters (INF doing slightly worse for ψ) and outperform it for 1 parameter with the width of the distributions of the absolute distances to the true parameter roughly 1 3 of the size. All methods outperform the RAM method.
Looking at Fig. 7, the performance of LB2 and LB10 is poorer for the 3 rd parameter than the other methods by about 1 unit in absolute difference to the true parameter. When the noise is increased, Fig. 8, the GON and GON Cross methods are more robust in estimating the final parameter, where overall the average absolute error to the true parameter is about 0.5 smaller. Fig. 7 Boxplots of the distributions of the absolute differences of an estimate to the true parameter over 10 datasets. The three sections from left to right represent the parameters α, β and ψ from the Fitz-Hugh Nagumo model (Eqs. 36, 37). Within each section, the boxplots from left to right are: LB2 method, INF method, LB10 method, GON method and GON method using cross validation for inferring the penalty parameter. The average signal to noise ratio for each "species" is 10. The standard deviation of the observational noise is inferred. For definitions, see Tables 2 and 3 Fig . 8 Boxplots of the distributions of the absolute differences of an estimate to the true parameter over 10 datasets. The three sections from left to right represent the parameters α, β and ψ from the Fitz-Hugh Nagumo model (Eqs. 36, 37). Within each section, the boxplots from left to right are: LB2 method, INF method, LB10 method, GON method and GON method using cross validation for inferring the penalty parameter. The average signal to noise ratio for each "species" is 5. The standard deviation of the observational noise is inferred. For definitions, see Tables 2 and 3 Examining the results for the protein signalling transduction pathway, Eqs. (38-42), in Figs. 9 and 10, we can see that the performance of INF, LB2 and LB10 vary in accuracy. The GON Cross method shows a more robust set of estimates, with results that are on average 0.2 units in absolute value closer to the true parameters. The GON method (which uses AIC to estimate the penalty parameter) was unable to optimise for this ODE system. Given certain values of s , the optimiser of the log likelihood function tends to choose kernel parameters which make (K + s σ s I) non-invertible and computationally singular. In the cross validation scheme, all problematic s s are rejected. We present the results for the GON Cross method only, for this ODE model.
We also present the root mean square (RMS) values in function space. Firstly, the signal was reconstructed with the sampled parameters, by numerically integrating the Fig. 9 Boxplots of the distributions of the absolute differences of an estimate to the true parameter over 10 datasets. The 5 sections from left to right represent the parameters for the protein signalling transduction pathway, Eqs. (38-42). Within each section, the boxplots from left to right are: LB2 method, INF method, LB10 method and GON method using cross validation for inferring the penalty parameter. The average signal to noise ratio for each "species" is 10. The standard deviation of the observational noise is inferred. For definitions, see Tables 2 and 3 Fig. 10 Boxplots of the distributions of the absolute differences of an estimate to the true parameter over 10 datasets. The 5 sections from left to right represent the parameters for the protein signalling transduction pathway, Eqs. (38-42). Within each section, the boxplots from left to right are: LB2 method, INF method, LB10 method and GON method using cross validation for inferring the penalty parameter. The average signal to noise ratio for each "species" is 5. The standard deviation of the observational noise is inferred. For definitions, see Tables 2 and 3 ODEs, and then the true signal was subtracted (signal created with true parameters and no observational noise added). The RMS was calculated on these residuals. It is important to assess the methods on this criterion as well as looking at the parameter uncertainty, as some parameters might only be weakly identifiable, corresponding to ridges in the likelihood landscape. In other words, large uncertainty in parameter estimates may not necessarily imply a poor performance by a method, if the reconstructed signals for all groups of sampled parameters were close to the truth.
By examining Fig. 11 we can see that the LB2 and LB10 methods perform poorer than the rest, with an average RMS value roughly 0.5 larger. In Fig. 12, the increased noise scenario, we can see that the LB2 and LB10 methods have an average RMS value about 0.5 units larger than the other methods. Figures 13 and 14 show that the GON Cross method is slightly outperforming the INF, LB2 and LB10 methods, with RMS distributions that are on average 0.1 units lower.
The wider range of estimates of the parameters (as well as the long tails in the posterior distributions in Figs. 1, 2 and 3), for the INF, LB2 and LB10 methods, were observed when occasionally the time course signals would flatten. An inspection of Eq. (17) reveals that when f s (X, θ , t) = 0 ∀s, then p(X|θ, φ, η, γ ) is maximised at x s = φ s ∀s . This corresponds to a flattening of the true concentration profiles, which usually can be assumed to be a poor fit to the data. Hence, this flattening should be discouraged by the likelihood term p(Y|X, σ ) in Eq. (4). However, for σ ≫ σ True (where σ True is the unknown true standard deviation of the observational error of the signals), the likelihood term is effectively switched off, which will allow the system to converge to a high probability attractor state corresponding to x s = φ s . In practice, we observe this effect  Tables 2 and 3 Fig. 13 Distribution of RMS values in function space, calculated on the residuals of the true signal (signal produced with true parameters and no observational error) minus the signal produced with the estimate of the parameters, for the protein signalling transduction pathway (Eqs. 38-42). Within each section, the boxplots from left to right are: LB2 method, INF method, LB10 method and GON method using cross validation for inferring the penalty parameter. The average signal to noise ratio for each "species" is 10. The standard deviation of the observational noise is inferred. For definitions, see Tables 2 and 3 for σ exceeding σ True . This attractor state is further self-enforcing by driving the length scales included in the GP hyperparameters η to very large values, as we have observed in our simulations. Obviously, x s = φ s is unrealistic. To test whether holding the standard deviation of the noise at the true value prevents the Markov chains from being driven to this unrealistic attractor state, we repeated the simulations of the comparison to GON and GON Cross, for the Fitz-Hugh Nagumo system and protein signalling transduction pathway for signal to noise ratios of 10 and 5. We held the standard deviation of the noise at the value that was used to generate the data, where in practice this could be estimated through a standard GP regression. We used the true value in order to observe whether this approach affects the results and to what extent, under the most favourable conditions.
Examining Fig. 15, where now the standard deviation of the noise is held fixed at the true value, the INF, LB2, LB10, GON and GON Cross methods perform similarly for the first 2 parameters and the GON and GON Cross do about 1 unit of absolute diferrence to the true parameter better for the 3rd. When the noise is increased, Fig. 16, the methods produce estimates that are similar to one another for all three parameters.
For the protein signalling transduction pathway, Eqs. (38-42), Fig. 17 shows that the INF, LB2 and LB10 methods perform on average 0.075 units in absolute value to the true parameter better than GON Cross, over the different parameters. Similarly, in Fig. 18, INF, LB2 and LB10 perform roughly 0.07 units better in absolute distance to the true parameter than GON Cross.
The RMS distributions in Fig. 19 show that the GON and GON Cross methods are producing slightly better estimates, reflected by the distributions being around 0.5 units  Tables 2 and 3 in RMS lower. For the increased noise scenario, Fig. 20, all methods are performing similarly.
In Fig. 21, we can see that the INF, LB2 and LB10 methods outperform the GON Cross method, shown by smaller RMS distributions that are roughly 0.05 units smaller. In Fig. 22, the INF and LB10 methods do better than LB2 and GON Cross with RMS values on average 0.05 units smaller. Fig. 15 Boxplots of the distributions of the absolute differences of an estimate to the true parameter over 10 datasets. The three sections from left to right represent the parameters α, β and ψ from the Fitz-Hugh Nagumo model (Eqs. 36, 37). Within each section, the boxplots from left to right are: LB2 method, INF method, LB10 method, GON method and GON method using cross validation for inferring the penalty parameter. The average signal to noise ratio for each "species" is 10. The standard deviation of the observational noise is held at the true value. For definitions, see Tables 2 and 3 Fig. 16 Boxplots of the distributions of the absolute differences of an estimate to the true parameter over 10 datasets. The three sections from left to right represent the parameters α, β and ψ from the Fitz-Hugh Nagumo model (Eqs. 36, 37). Within each section, the boxplots from left to right are: LB2 method, INF method, LB10 method, GON method and GON method using cross validation for inferring the penalty parameter. The average signal to noise ratio for each "species" is 5. The standard deviation of the observational noise is held at the true value. For definitions, see Tables 2 and 3

Discussion
We have modified a recently developed gradient matching approach for systems biology (INF) by combining it with a parallel tempering scheme for the gradient mismatch parameter (C&S). We have also carried out a comparative evaluation of this new method with various state-of-the-art gradient matching methods. These methods are based on different inference approaches and statistical models, namely: non-parametric Bayesian statistics using GPs (INF, LB2, LB10), splines-based smooth functional tempering (C&S), hierarchical regularisation using splines interpolation (RAM), and penalised likelihood based on reproducing kernel Hilbert spaces (GON). Our set-ups have also allowed us to Fig. 17 Boxplots of the distributions of the absolute differences of an estimate to the true parameter over 10 datasets. The 5 sections from left to right represent the parameters for the protein signalling transduction pathway, Eqs. (38-42). Within each section, the boxplots from left to right are: LB2 method, INF method, LB10 method and GON method using cross validation for inferring the penalty parameter. The average signal to noise ratio for each "species" is 10. The standard deviation of the observational noise is held at the true value. For definitions, see Tables 2 and 3 Fig. 18 Boxplots of the distributions of the absolute differences of an estimate to the true parameter over 10 datasets. The 5 sections from left to right represent the parameters for the protein signalling transduction pathway, Eqs. (38-42). Within each section, the boxplots from left to right are: LB2 method, INF method, LB10 method and GON method using cross validation for inferring the penalty parameter. The average signal to noise ratio for each "species" is 5. The standard deviation of the observational noise is held at the true value. For definitions, see Tables 2 and 3  Distribution of RMS values in function space, calculated on the residuals of the true signal (signal produced with true parameters and no observational error) minus the signal produced with the estimate of the parameters, for the Fitz-Hugh Nagumo model (Eqs. 36, 37). Within each section, the boxplots from left to right are: LB2 method, INF method, LB10 method, GON method and GON method using cross validation for inferring the penalty parameter. The average signal to noise ratio for each "species" is 10. The standard deviation of the observational noise is held at the true value. For definitions, see Tables 2 and 3 Fig. 20 Distribution of RMS values in function space, calculated on the residuals of the true signal (signal produced with true parameters and no observational error) minus the signal produced with the estimate of the parameters, for the Fitz-Hugh Nagumo model (Eqs. 36, 37). Within each section, the boxplots from left to right are: LB2 method, INF method, LB10 method, GON method and GON method using cross validation for inferring the penalty parameter. The average signal to noise ratio for each "species" is 5. The standard deviation of the observational noise is held at the true value. For definitions, see Tables 2 and 3 Fig. 21 Distribution of RMS values in function space, calculated on the residuals of the true signal (signal produced with true parameters and no observational error) minus the signal produced with the estimate of the parameters, for the protein signalling transduction pathway (Eqs. 38-42). Within each section, the boxplots from left to right are: LB2 method, INF method, LB10 method and GON method using cross validation for inferring the penalty parameter. The average signal to noise ratio for each "species" is 10. The standard deviation of the observational noise is held at the true value. For definitions, see Tables 2 and 3 Fig. 22 Distribution of RMS values in function space, calculated on the residuals of the true signal (signal produced with true parameters and no observational error) minus the signal produced with the estimate of the parameters, for the protein signalling transduction pathway (Eqs. 38-42). Within each section, the boxplots from left to right are: LB2 method, INF method, LB10 method and GON method using cross validation for inferring the penalty parameter. The average signal to noise ratio for each "species" is 5. The standard deviation of the observational noise is held at the true value. For definitions, see Tables 2 and 3 compare the opposing paradigms of Bayesian inference (INF) versus parallel tempering (LB2, LB10) of the slack parameters controlling the amount of mismatch between the gradients.
In one case, when the number of observations was very high (higher than what would be expected in these types of experiments) and the tuning parameters were finely adjusted (which is time-consuming in practice), the C&S method does well. When the dataset size was reduced, all settings for this method deteriorated significantly, including the previous tuning setting that performed well. It is also important to note that the particular settings that we found to be optimal were different than in the original paper, which highlights the sensitivity and lack of robustness in the splines based method.
The GON and GON Cross methods produce estimates that are close to the true parameters in terms of absolute uncertainty. For the Fitz-Hugh Nagumo ODE model, the method outperforms the other schemes for one parameter, in the case when the signal to noise ratio was 10 and 25 datapoints were generated. For the protein signalling transduction pathway, however, this method is outperformed by INF, LB2 and LB10. This method also has a drawback to practical implementation, on non-simulated data. The method, which uses a classical approach to parameter estimation (producing point estimates), cannot immediately produce confidence intervals for the parameters and so quantifying the uncertainty in the parameter estimates will be more difficult. For simulated data, this is not an issue, since it is possible to generate multiple datasets and quantify the accuracy of the method by observing the results across all datasets. In practice however, this is not available. One would need to rely on other processes, such as bootstrapping, and the effect on the accuracy and computational time is something that needs to be investigated.
The INF method performs reasonably by producing results close to the true parameters across the scenarios that we have examined. However, this method's decrease in uncertainty is at the expense of bias.
The LB2 and LB10 methods show good performance across the set-ups. The parameter inference is accurate across the different ODE models and the different settings of those models. The parallel tempering schedule has proven to be quite robust, as the methods perform similarly across the various set-ups.
For some simulations, we noticed a flattening of the time course signals for INF, LB2 and LB10. The uncertainty in the signals reduced the accuracy in the methods. In order to achieve a robust method that provides accurate parameter estimation, we examined holding the standard deviation at the true value. In this case, the GON and GON Cross outperformed INF, LB2 and LB10 on one parameter in the Fitz-Hugh Nagumo system, when the signal to noise ratio was 10. For the signal to noise ratio setting of 5, the methods all performed similarly. The INF, LB2 and LB10 methods outperform the GON Cross method for the protein signalling transduction pathway. Holding the standard deviation of the noise at the true value, for the INF, LB2 and LB10 methods, stops the likelihood term from effectively being switched off and prevents the flattening. In practice, this parameter could be estimated by a standard GP regression, in order to fix the standard deviation of the noise when the true value is unknown. This is a somewhat heuristic fix to the problem however, and a general robust solution should be the focus for future research.
It is also important to note that the methods in this paper were not derived in order to operate with a particular ODE model. The results therefore, should be similar across ODE type. We have seen evidence of this in other ODE systems, like the Lorenz attractor and the Lotka-Volterra predator-prey model, which are less relevant to the field of biomedical engineering, though, and are thus beyond the scope of the present paper.

Conclusions
The combination of adaptive gradient matching using GPs from Dondelinger et al. [4] and a parallel tempering scheme for the gradient mismatch parameter from Campbell and Steele [5], has yielded a method that provides accurate parameter estimates for ODEs when the true standard deviation of the noise is known. This method performs well across ODE models and variation of the scheduling of the tempered mismatch parameter.
We have found that the method in Dondelinger et al. [4] provides accurate estimation, although the decrease in uncertainty is at the expense of bias. The method in Campbell and Steele [5] shows a lack of robustness, due to the difficulty in configuring the splines settings. For the method in Ramsay et al. [3], we found it was outperformed by the other methods we looked at. The method in González et al. [14] is accurate and robust, but can be outperformed by Dondelinger et al. [4] and the proposed method in this paper. For a signal to noise ratio of 10 on the Fitz-Hugh Nagumo system, the González et al. [14] method is able to outperform the method in Dondelinger et al. [4] and our new method, for one parameter. We found that using cross validation as opposed to AIC for the González et al. [14] method, to estimate the penalty parameter, yielded results that were more robust.
In order to avoid a potential drawback to our proposed method, we hold the standard deviation of the noise at the true value, to avoid the signals deviating from the data and flattening. This remedy was found to lead to a significant improvement over the method with a flexible standard deviation of the error. In practice, the standard deviation of the noise could be estimated, for example by a standard GP regression, and general approaches to this should be the focus of future research.