# Approximate parameter inference in systems biology using gradient matching: a comparative evaluation

- Benn Macdonald
^{1}Email author, - Mu Niu
^{1}, - Simon Rogers
^{2}, - Maurizio Filippone
^{3}and - Dirk Husmeier
^{1}

**15(Suppl 1)**:80

https://doi.org/10.1186/s12938-016-0186-x

© The Author(s) 2016

**Published: **15 July 2016

## Abstract

### Background

A challenging problem in current systems biology is that of parameter inference in biological pathways expressed as coupled ordinary differential equations (ODEs). Conventional methods that repeatedly numerically solve the ODEs have large associated computational costs. Aimed at reducing this cost, new concepts using gradient matching have been proposed, which bypass the need for numerical integration. This paper presents a recently established adaptive gradient matching approach, using Gaussian processes (GPs), combined with a parallel tempering scheme, and conducts a comparative evaluation with current state-of-the-art methods used for parameter inference in ODEs. Among these contemporary methods is a technique based on reproducing kernel Hilbert spaces (RKHS). This has previously shown promising results for parameter estimation, but under lax experimental settings. We look at a range of scenarios to test the robustness of this method. We also change the approach of inferring the penalty parameter from AIC to cross validation to improve the stability of the method.

### Methods

Methodology for the recently proposed adaptive gradient matching method using GPs, upon which we build our new method, is provided. Details of a competing method using RKHS are also described here.

### Results

We conduct a comparative analysis for the methods described in this paper, using two benchmark ODE systems. The analyses are repeated under different experimental settings, to observe the sensitivity of the techniques.

### Conclusions

Our study reveals that for known noise variance, our proposed method based on GPs and parallel tempering achieves overall the best performance. When the noise variance is unknown, the RKHS method proves to be more robust.

## Keywords

## Background

A central objective of current systems biology research is explaining the actions among components in biopathways. A standard approach is to view a biopathway as a network of biochemical reactions, which is modelled as a system of ordinary differential equations (ODEs).

*N*components (referred to throughout as “species”) in the biopathway, \(x_s(t_i)\) denotes the concentration of species \(s\) at time \(t_i\) and \(\mathbf x (t_i)\) is a vector of concentrations of all system components that influence or regulate the concentration of species \(s\) at time \(t_i\).

^{1}If, for example, species \(s\) is an mRNA, then \(\mathbf x (t_i)\) might contain the concentrations of transcription factors (proteins), from which \(s\) is transcribed, that bind to the promoter of the gene. The regulation is described by the regulation function

*f*. The type of regulatory interaction depends on the species involved, e.g.,

*f*may describe mass action kinetics, Michaelis-Menten kinetics, etc. All of these interactions depend on a vector of kinetic parameters, \(\varvec{\theta }_s\). For many biopathways, only a small fraction of \(\varvec{\theta }_s\) can be measured in practice. Therefore, in order to understand the dynamics of the biopathway, the majority of these kinetic parameters need to be inferred from observed (typically noisy and sparse) time course concentration profiles.

Examples of the notation used throughout this paper

Notation | Meaning | Example |
---|---|---|

Bold face uppercase letter or symbol | Matrix | \(\mathbf X\) |

Bold face lowercase letter or symbol | Vector | \(\varvec{\theta }\) |

Vector at time \(t_i\) | Concentration for all species at time \(t_i\) | \(\mathbf y (t_i)\) or \(\mathbf x (t_i)\) |

Vector of concentrations for species \(s\) | Concentrations for species \(s\) over all timepoints | \(\mathbf y _s\) or \(\mathbf x _s\) |

Vector of concentrations | Concentrations over all timepoints for one species | \(\mathbf y\) or \(\mathbf x\) |

Lower case letter at time \(t_i\) for species \(s\) | Concentration for species \(s\) at timepoint \(t_i\) | \(y_s(t_i)\) or \(x_s(t_i)\) |

Conventional inference methods typically rely on searching the space of \(\varvec{\theta }\) values, and at each candidate, numerically solving the ODEs and comparing the output with that observed. After choosing an appropriate noise model, the form of the likelihood is defined, and a measure of similarity between the data signals and the signals described by the current set of ODE parameters can be calculated. This process is repeated, as part of either an iterative optimisation scheme or sampling procedure in order to estimate the parameters. However, the computational costs involved with repeatedly numerically solving the ODEs are usually high.

Several authors have adopted approaches based on gradient matching (e.g. [1, 2]), aiming to reduce this computational complexity. These approaches are based on the following two-step procedure. At the first step, interpolation is used to smooth the time series data, in order to avoid modelling noisy observations; in a second step, the kinetic parameters \(\varvec{\theta }\) of the ODEs are either optimised or sampled, whilst minimising some metric measuring the difference between the slopes of the tangents to the interpolants, and the \(\varvec{\theta }\)-dependent time derivative from the ODEs. In this fashion, the ODEs never have to be numerically integrated, and the problem of inferring the typically unknown initial conditions of the system is removed, as it is not required for matching gradients. A downside to this two-step scheme is that the results of parameter inference are critically dependent on the quality of the initial interpolant. Alternatively, as first suggested in [3], we can allow the ODEs to regularise the interpolant. Dondelinger et al. [4] applied this to the nonparametric Bayesian approach in [1], which uses Gaussian processes (GPs), and demonstrated that it significantly improves the parameter inference accuracy and robustness with respect to noise. Unlike in [3], all hyperparameters that control the smoothness of the interpolants are consistently inferred in the framework of nonparametric Bayesian statistics, which dispenses with the need to use heuristics and approximations in the configuration of the interpolation function.

We further the work of [4] in two respects. Firstly, we combine adaptive gradient matching using GPs with a parallel tempering scheme for the parameter that controls the mismatch between the gradients. This is conceptually different from the inference paradigm of the mismatch parameter that Dondelinger et al. [4] uses. Ideally, if the ODEs provide a correct mathematical description of the system, there should be no difference between the gradients of the interpolant and those predicted from the ODEs. However, in practice, forcing the gradients to be equal is likely to cause parameter inference methods to converge to a local optimum of the likelihood. Forcing the gradients to immediately be the same would restrict the inference procedure to a section of the likelihood corresponding to parameters that agree with the gradient match. However, there is no guarantee that these parameters are suitable for the data, see [5] for details. A parallel tempering scheme is the natural way to deal with such local optima, as opposed to inferring the degree of mismatch, since different tempering levels correspond to different strengths of penalising the mismatch between the gradients.Campbell and Steele [5] explore a parallel tempering scheme, but in order to get an understanding as to how well utilising this scheme improves inference, the rest of the set-up should be as similar as possible. Hence, comparing the results directly to the GP approach in [4], won’t provide us with this understanding, since the approach in [5] uses a different methodological paradigm. In this paper, we present a comparative assessment of parallel tempering versus inference in the context of gradient matching for the same modelling framework, i.e. without any confounding influence from the modelling choice. Secondly, we compare the approach of Bayesian inference using GPs with a variety of other methodological paradigms, within the specific context of comparing the gradients from the interpolant to the gradients from the ODEs, which is highly relevant to current computational systems biology.

We test the methods on two benchmark ODE systems: the Fitz–Hugh Nagumo system [which can model the behaviour of cardiac conditions such as: electrical excitation-conduction [6]; cardiac action potentials [7], and arrhythmias [8], as well as neurodegenerative diseases ([9, 10])], and a protein signalling transduction pathway [where cell signalling pathways can model cancers [11] and neurodegenerative diseases such as: Alzheimer’s disease; Parkinson’s disease, and amyotrophic lateral sclerosis (ALS) [12]], systems that are highly relevant to current biomedical engineering.

This paper is an extended version of [13] and includes a full description of the RKHS method in [14], as well as a series of comparative simulation studies using this method on the Fitz–Hugh Nagumo system, under different observational noise scenarios, and a protein signalling transduction pathway. The description of the methodology in [13] and [14] is also outlined in the review article [15], included here with permission from the authors.

## Methods

### Adaptive gradient matching with Gaussian processes

Consider a set of T arbitrary timepoints \(t_1< \dots< t_i< \dots < t_T\), and noisy observations \(\mathbf Y = (\mathbf y (t_1),...,\mathbf y (t_T))\), where \(\mathbf y (t_i) = \mathbf x (t_i) + \varvec{\epsilon }(t_i)\), N = dim\(\left( \mathbf x (t_i)\right)\), \(\mathbf X = (\mathbf x (t_1),...,\mathbf x (t_T))\), \(\mathbf y (t_i)\) is the data vector of the observations of all species concentrations at time \(t_i\), \(\mathbf x (t_i)\) is the vector of the concentrations of all species at time \(t_i\), \(\mathbf y _s\) is the data vector of the observations of species concentrations \(s\) at all timepoints, \(\mathbf x _s\) is the vector of concentrations of species \(s\) at all timepoints, \(y_s(t_i)\) is the observed datapoint of the concentration of species \(s\) at time \(t_i\), \(x_s(t_i)\) is the concentration of species \(s\) at time \(t_i\) and \(\varvec{\epsilon }\) is multivariate Gaussian noise, \(\varvec{\epsilon } \sim N(\mathbf 0 ,\sigma ^2_s\mathbf I )\).

*Parallel tempering:*Consider a series of “temperatures”, \(0 = \alpha ^{(1)}<\cdots< \alpha ^{(M)} = 1\) and a power posterior distribution of our ODE parameters ([20])

*q*( ) represents the proposal distribution and the superscripts “prop” and “curr” indicate whether the algorithm is being evaluated at the proposed or current state, respectively. At each MCMC step, two chains are randomly selected (uniformly) and the corresponding parameters are proposed to swap between them. This proposal has acceptance probability

Prior to the parameter inference, we choose values of \(\gamma _s\) and assign them to the variance parameter in Eq. (12) for each “temperature” \(\alpha ^{(i)}\), such that chains closer to the prior (\(\alpha ^{(i)}\) values closer to 0) allow the gradients from the interpolant to have more freedom to deviate from those predicted by the ODEs (which corresponds to larger \(\gamma _s\) values), chains closer to the posterior (\(\alpha ^{(i)}\) values closer to 1) more closely match the gradients (corresponding to smaller \(\gamma _s\) values), and for the chain corresponding to \(\alpha ^{(M)} = 1\), we want the mismatch to be approximately zero (\(\gamma _s\approx 0\)). Since \(\gamma _s\) corresponds to the variance of our species-specific error (see Eq. 12), as \(\gamma _s\rightarrow\) 0, we have less difference between the gradients, and as \(\gamma _s\) gets larger, the gradients have more freedom to deviate from one another. Hence, we temper \(\gamma _s\) towards zero. Now, each \(\alpha ^{(i)}\) chain in Eq. (18) has a \(\gamma _{s}^{(i)}\) (where the superscript \((i)\) indicates the gradient mismatch parameter associated with “temperature” \(\alpha ^{(i)}\)) fixed in place for the strength of the gradient mismatch. The specific schedules of the gradient mismatch parameter are included in Table 4.

### Reproducing kernel Hilbert space

### Penalised likelihood with RKHS

In the case of homogeneous ODEs, where \(g_s()=0\), a kernel in a Hillbert space can be constructed using the Green’s function of the linear operator \(\mathbf R\). \(\mathbf K\) is the Green’s function of \(\mathbf R ^\mathsf{T}{} \mathbf R\), where \(\mathbf R ^\mathsf{T}\) is the adjoint operator of \(\mathbf R\). Aronszajin et al. [22] show \(||\mathbf R \tilde{\mathbf{x }}_s||^2_{L^2} = ||\tilde{\mathbf{x }}_s||^2_{H_\mathbf K }=\Omega (\tilde{\mathbf{x }}_s)\). Since the analytical form of Green functions of \(\mathbf R ^\mathsf{T}{} \mathbf R\) is not available, the differential operator is approximated with the difference operator (\(\mathbf D\)). In the non-homogeneous ODE system, the model is linearised by feeding surrogate \(\hat{\mathbf{x }}_s\) (using spline interpolation, in this case) into \(g_s()\). \(\Omega (\tilde{\mathbf{x }}_s)\) is still a valid RKHS norm for the transformed variable \(\tilde{\mathbf{x }}_s\) defined in Eq. (30).

In the original paper of [14], the penalty parameter \(\lambda _s\) is inferred using AIC. For a given value of \(\lambda _s\), Eq. (35) is optimised to estimate the ODE parameters and subsequently the AIC score of the procedure is calculated. This is repeated for different \(\lambda _s\) values and the \(\lambda _s\) value corresponding to the smallest AIC score is chosen.

As well as using this approach for estimating \(\lambda _s\), we found that using threefold cross validation, instead of AIC, provided more robust parameter estimation. We present the results from both schemes.

### Brief summary of methods

Abbreviations of the methods used throughout this paper. Table reproduced from [13], with permission from Springer

Abbreviation | Method | Reference |
---|---|---|

C&S | Tempered mismatch parameter using splines-based smooth functional tempering | Campbell and Steele [5] |

INF | Inference of the gradient mismatch parameter using GPs | Dondelinger et al. [4] |

LB2 | Tempered mismatch parameter using GPs in log base 2 increments | Our method |

LB10 | Tempered mismatch parameter using GPs in log base 10 increments | Our method |

GON | Reproducing kernel Hilbert space and penalised likelihood. The penalty parameter is estimated using AIC | González et al. [14] |

GON Cross | Reproducing kernel Hilbert space and penalised likelihood. The penalty parameter is estimated using 3-fold cross validation | González et al. [14] |

RAM | Hierarchical 3 level regularisation approach using splines based interpolation | Ramsay et al. [3] |

**C&S**[5]: Parameter inference is carried out using adaptive gradient matching and tempering of the mismatch parameter. B-splines are used as the choice of interpolation scheme.

**INF**[4]: This method conducts parameter inference through adaptive gradient matching using GPs. The penalty mismatch parameters \(\varvec{\gamma }_s\) are inferred.

**LB2**: This method conducts parameter inference through adaptive gradient matching using GPs. The penalty mismatch parameters \(\varvec{\gamma }_s\) are tempered in log base 2 increments, see Table 4 for details.

**LB10**: As with LB2, parameter inference is conducted through adaptive gradient matching using GPs, however, the penalty mismatch parameters \(\varvec{\gamma }_s\) are tempered in log base 10 increments, see Table 4 for details.

**GON**[14]: Parameter inference is conducted in a non-Bayesian fashion, implementing a reproducing kernel Hilbert space (RKHS) and penalised likelihood approach. Comparisons between RKHS and GPs have been previously explored conceptually (for example, see [21, 23]), and in this paper we analyse them empirically in the specific context of inference in ODEs. The RKHS method that incorporates the information from the ODEs in [14] obtains the ODE kernel using a differencing operator. AIC is used to estimate the penalty parameter \(\lambda\).

**GON Cross**[14]: The method is the same as

**GON**, however, cross validation is used to estimate the penalty parameter \(\lambda\), instead of AIC.

**RAM**[3]: This technique uses a non-Bayesian optimisation process for parameter inference. The method penalises the difference between the gradients using splines and a hierarchical 3 level regularisation approach is used to set the tuning parameters (see [3] for details). Table 3 describes particular settings with some of the methods in Table 2. The ranges of the penalty parameters \(\varvec{\gamma }_s\), for the LB2 and LB10 methods are given in Table 4. The increments are linear on the log scale. The M \(\alpha _s\)s from 0 to 1 are set by taking a series of M equally spaced values and raising them to the power 5, as described in [20].

Particular settings of Campbell and Steele’s [5] method

Abbreviation | Definition | Details |
---|---|---|

10C | 10 chains | When comparing our methods, it was of interest to see how the performance depended on the number of parallel MCMC chains, as originally the authors used 4 chains |

Obs20 | 20 observations | Originally, the authors use 401 observations. We reduced this to a dataset size more usual with these types of experiments to observe the dependency of the methods on the amount of data |

15K | 15 knots | The method in C&S uses B-splines interpolation. We changed the original tuning parameters from the authors' paper to observe the sensitivity of the parameter estimation by these tuning parameters |

P3 | polynomial order 3 (cubic spline) | The original polynomial order is 5 and again, we wanted to observe the sensitivity of the parameter estimation by these tuning parameters |

Ranges of the penalty parameter \(\varvec{\gamma }_s\) for LB2 and LB10

Method | Chains | Range of penalty \(\varvec{\gamma }\) |
---|---|---|

LB2 | 4 | [1 , 0.125] |

LB2 | 10 | [1 , 0.00195] |

LB10 | 4 | [1 , 0.001] |

LB10 | 10 | \([1 , 1e^{-9}]\) |

## Data

*Fitz–Hugh Nagumo*([24, 25]): These equations originally were used to describe the voltage potential across the cell membrane of the axon of giant squid neurons. There are 3 parameters; \(\alpha\), \(\beta\) and \(\psi\) and two “species”; Voltage (V) and Recovery variable (R). Species in [ ] denote the time-dependent concentration for that species:

*Protein signalling transduction pathway*[26]: These equations describe protein signalling transduction pathways in a signal transduction cascade [26], where the kinetic parameters control how quickly the proteins (“species”) convert to one another. There are 6 parameters (\(k_1, k_2, k_3, k_4, V, K_m\)) and 5 “species” (

*S*,

*dS*,

*R*,

*RS*,

*Rpp*). The system describes the phosphorylation of a protein, \(R \rightarrow Rpp\) (Eq. 42), catalysed by an enzyme

*S*, via an active protein complex [

*RS*, Eq. (41)], where the enzyme is subject to degradation [\(S \rightarrow dS\), Eq. (39)]. The chemical kinetics are described by a combination of mass action kinetics [Eqs. (38), (39), (41)] and Michaelis–Menten kinetics [Eqs. (40), (42)]. Species in [ ] denote the time-dependent concentration for that species:

These ODE systems give us benchmark data and produce periodic signals (in the Fitz–Hugh Nagumo system) and signals that make a transition to a stationary phase (protein signalling transduction pathway), which is representative of models in this area. Hence, we can assess the methods discussed in this paper on systems that are meaningful to the field of biomedical engineering.

## Simulation

We have compared the proposed GP tempering scheme with the alternative methods summarised in the "Methods" section. For the comparison to Ramsay et al. [3], we were unable to obtain the authors’ software and so we compared our results directly with the results from the original publications. Hence, we generated test data in the same manner as described by the authors and used them for the evaluation of our method. For the methods in Campbell and Steel [5], Dondelinger et al. [4] and González et al. [14], where we did receive the authors’ software, we repeated the evaluation twice, first on data equivalent to those used in the original publications, and again on new data generated with different (more realistic) parameter settings. For comparisons using the Fitz–Hugh Nagumo model, Eqs. (36) and (37), we used the ODE prior distributions in [5] and for comparisons using the protein signalling transduction pathway model, Eqs. (38–42), we used the parameter priors from [4]. This gave us priors that were motivated by the current literature. Our code is available upon request.

*Tempered mismatch parameter using splines-based smooth functional tempering (C&S)* [5]: The authors tested their method on the Fitz–Hugh Nagumo system, Eqs. (36) and (37), with the following parameter settings: \(\alpha =0.2\), \(\beta =0.2\) and \(\psi =3\), starting from initial values of \((-1,1)\) for the two “species”. They generated 401 observations over the time course [0, 20] (producing 2 periods) and Gaussian noise with sd {0.5, 0.4} was used to corrupt each respective “species”. To infer the ODE parameters with their approach, the authors chose the following settings: B-splines of polynomial order 5 with 301 knots; 4 parallel tempering chains, gradient mismatch parameter schedules {10,100,1000,10000}; parameter prior distributions for the ODE parameters: \(\alpha \sim N(0, 0.4^2)\), \(\beta \sim N(0, 0.4^2)\) and \(\psi \sim \chi ^2_2\).

As well as comparing our method with the results the authors had obtained with their original settings, we made the following modifications to test the robustness of their procedure. We reduced the number of observations from 401 to 20 over the time course [0, 10] (producing 1 period), which more closely reflects the amount of data typically available in current systems biology. In doing so, we also reduced the number of knots for the splines to 15 (preserving the same proportionality of knots to datapoints as before), and we tried a different polynomial order: 3 instead of 5. The method incurred high computational costs, (roughly \(1 \frac{1}{2}\) weeks for a run), and so we could only repeat the inference on 3 independent data sets. The posterior samples were combined in order to approximately marginalise over datasets and thereby remove their potential particularities. For a fair comparison, we also ran our methods with 4 rather than the 10 chains that we used as default.

*Inference of the gradient mismatch parameter using GPs and adaptive gradient matching (INF)* [4]: We applied the method in the same way as described in the original publication of [4], using the authors’ software and selecting the same kernels and parameter/hyperparameter priors for the method proposed in the present paper. We generated data from the protein signal transduction pathway described in Eqs. (38–42), with the same settings as in [4]; initial values of the species: \((S=1, dS=0, R=1, RS=0, Rpp=0)\); ODE parameters: \((k_1=0.07, k_2=0.6, k_3=0.05, k_4=0.3, V=0.017, K_m=0.3)\); 15 timepoints producing one period: \(\{0, 1, 2, 4, 5, 7, 10, 15, 20, 30, 40, 50, 60, 80, 100\}\). As in [4], we used multiplicative iid Gaussian noise of standard deviation \(=0.1\) to corrupt the signals and reflect the noisy observations obtained in experiments. We chose the same gamma prior on the ODE parameters as used in [4], namely \(\Gamma (4, 0.5)\), for Bayesian inference. For the GP, we used the same kernel they originally used; see further on for details. In addition to this ODE system, we also applied this method to the rest of the described set-ups.

*Reproducing kernel Hilbert space method (GON)* [14]: The authors tested their method on the Fitz–Hugh Nagumo data (Eqs. 36, 37) with the following settings; initial values of \((-1 , -1)\) and ODE parameters of \(\alpha =0.2\); \(\beta =0.2\) and \(\psi =3\). The authors generated 50 datapoints over the time domain [0, 20] (producing 2 periods), with iid Gaussian noise (sd = 0.1) added to introduce error to the observations. 50 independent data sets were created in this way.

As well as comparing to the original publication set-up, we also tested the methods on a scenario with larger observational noise. We tested on 2 scenarios, when the signal to noise ratio was on average 10 for each species and when the average signal to noise ratio was 5. We used the average signal to noise ratio so that each species had the same observational error as one another. We reduced the dataset size to 25 timepoints over the time course [0, 10], producing 1 period, and show the results across 10 independent datasets.

To observe the variation between ODE models, we also ran the method on the protein signal transduction pathway in Eqs. (38–42). We generated data under the same settings as in [4]; ODE parameters: \((k_1=0.07, k_2=0.6, k_3=0.05, k_4=0.3, V=0.017, K_m=0.3)\); initial values of the species: \((S=1, dS=0, R=1, RS=0, Rpp=0)\); 15 timepoints covering one period: \(\{0, 1, 2, 4, 5, 7, 10, 15, 20, 30, 40, 50, 60, 80, 100\}\). We examined 2 noise scenarios; when the average signal to noise ratio was 10, and when the average signal to noise ratio was 5. As opposed to the set-up in [4], we use additive Gaussian noise to corrupt the data, to correspond with the assumed noise model.

*Penalised splines and 2nd derivative penalty method (RAM) *[3]: González et al. [14] used the method of Ramsay et al. [3] to compare with their technique. We have used the results from the original publication of [14]. For fairness of comparison, our method was applied in the same way as with the set-up in [14].

*Choice of kernel*: For the GP, we need to choose a suitable kernel, which reflects our prior knowledge in function space. We considered two kernels in our study (to correspond with the authors’ set-ups), the radial basis function (RBF) kernel

*a*and

*b*[23].

To initialise the hyperparameters, we fit a standard GP regression model (i.e. without information from the ODE) using maximum likelihood. We then checked to see whether the interpolant adequately represents our prior knowledge.

We found that the RBF kernel provided a good fit to the data for the data generated from the Fitz-Hugh Nagumo model. However, in confirmation of the findings in [4], we found that for the protein signalling transduction pathway, the non-stationary nature of the data is not represented properly with the RBF kernel, which is stationary [23]. As in [4], we used the sigmoid variance kernel, which is non-stationary [23] and found a considerable improvement to the fit to the data.

*Other settings*: We need to set the values for our variance mismatch parameter of the gradients, \(\gamma _s\). Since studies that indicate reasonable values for our technique are limited (see [1, 20]), we used \(Log_2\) and \(Log_{10}\) increments with an initial start at 1. All parameters were initialised with a random draw from the respective priors (apart from GON, which did not use priors).

## Results

*Tempered mismatch parameter using splines-based smooth functional tempering (C&S)*[5]: By examining Figs. 1, 2 and 3, we can see that the C&S method shows good performance over all parameters in the one case where the number of observations is 401, the number of knots is 301 and the polynomial order is 3 (cubic spline), since the bulk of the distributions of the sampled parameters surround the true parameters in Figs. 1 and 3 and are close to the true parameter in Fig. 2. These settings, however, require a great deal of “hand-tuning” or time expensive cross-validation and would be very difficult to set when using real data. We can observe the sensitivity of the method in the other set-ups, where the results are noticeably worse. An important point to note is when the dataset size was reduced, the cubic spline performed very badly. This lack of robustness makes these splines based methods very difficult to apply in practice. The INF, LB2 and LB10 methods consistently outperform the C&S method with distributions being closer to or overlapping the true parameters. On the set-up with 20 observations, for both 4 and 10 chains, the INF method produced largely different estimates across the datasets, as depicted by the wide boxplots and long tails.

*Inference of the gradient mismatch parameter using GPs, adaptive method (INF)*[4]: In order to see how the LB2 and LB10 tempering methods perform in comparison to the INF method, we can examine the results from the protein signalling transduction pathway (see Eqs. 38–42), as well as comparing how each method did in the other set-ups. Figure 4 shows the distributions of parameter estimates minus the true values for the protein signalling transduction pathway. After implementing the authors' code, we noticed that some of the MCMC simulations had not converged. In order to present a fair depiction of the methods’ performance, we show results from the dataset that produced the median performance. For each dataset the root mean square was calculated on the parameter samples minus the true values. The dataset that produced the median root mean square value is given.

We can see by examining Fig. 4, that for each parameter the methods are performing well, since the distributions are close to the true values. For this set-up, overall there does not appear to be a significant difference between the INF, LB2 and LB10 methods.

By examining Fig. 5 and using the standard significance level of 0.05 as a cut-off, we can see that the CDFs for LB2 and LB10 are significantly higher than those for INF. This means that the parameter estimates from the LB2 and LB10 methods are closer to the true parameters than the INF method, since we are dealing with absolute error. The LB2 and LB10 method show no significant difference to each other.

For the set-up in [5], Figs. 1, 2 and 3 show us that the LB2 and LB10 methods perform well across dataset size and over all the parameters, since most of the mass of the distributions surround or are situated close to the true parameters. One type of scheduling did not always outperform another, the LB2 does better than the LB10 for 4 parallel chains (distributions overlapping the true parameter for all three parameters) and the LB10 outperforms the LB2 for 10 parallel chains (distribution overlapping true parameter in Fig. 1, being closer to the true parameter in Fig. 2 and narrower and more symmetric around the true parameter in Fig. 3). The bulks of parameter sample distributions for the INF method are located close to the true parameters for all dataset sizes. However, the method produces less uncertainty at the expense of bias. When reducing the dataset size to 20 observations, for both 4 and 10 chains, the results deteriorate for the INF method and it is outperformed by the LB2 and LB10 methods.

*Reproducing kernel Hilbert space (GON)*[14]

*and Hierarchical regularisation splines based method (RAM)*[3]: For these sets of results, to assess the performance of the methods, we used the same criterion as in GON. For each parameter, the absolute value of the difference between an estimator and the true parameter (\(|\hat{\theta }_i - \theta _i|\)) was computed and the distribution across the datasets was examined. For the LB2, LB10 and INF methods, we used the median of the sampled parameters as an estimator, since it is a robust average. Examining Fig. 6, the LB2, LB10 and INF methods do as well as the GON method for 2 parameters (INF doing slightly worse for \(\psi\)) and outperform it for 1 parameter with the width of the distributions of the absolute distances to the true parameter roughly \(\frac{1}{3}\) of the size. All methods outperform the RAM method.

The wider range of estimates of the parameters (as well as the long tails in the posterior distributions in Figs. 1, 2 and 3), for the INF, LB2 and LB10 methods, were observed when occasionally the time course signals would flatten. An inspection of Eq. (17) reveals that when \(f_{s}(\mathbf X ,\varvec{\theta },\mathbf t ) = \mathbf{0}\) \(\forall s\), then \(p(\mathbf X |\varvec{\theta },\varvec{\phi },\varvec{\eta },\varvec{\gamma })\) is maximised at \(\mathbf x _{s} = {\varvec{\phi }_s}\) \(\forall s\). This corresponds to a flattening of the true concentration profiles, which usually can be assumed to be a poor fit to the data. Hence, this flattening should be discouraged by the likelihood term \(p(\mathbf Y |\mathbf X ,\varvec{\sigma })\) in Eq. (4). However, for \(\varvec{\sigma } \gg \varvec{\sigma }_{\mathrm {True}}\) (where \(\varvec{\sigma }_{\text {True}}\) is the unknown true standard deviation of the observational error of the signals), the likelihood term is effectively switched off, which will allow the system to converge to a high probability attractor state corresponding to \(\mathbf x _{s} = {\varvec{\phi }_s}\). In practice, we observe this effect for \(\varvec{\sigma }\) exceeding \(\varvec{\sigma }_{\text {True}}\). This attractor state is further self-enforcing by driving the length scales included in the GP hyperparameters \(\varvec{\eta }\) to very large values, as we have observed in our simulations. Obviously, \(\mathbf x _{s} = {\varvec{\phi }_s}\) is unrealistic. To test whether holding the standard deviation of the noise at the true value prevents the Markov chains from being driven to this unrealistic attractor state, we repeated the simulations of the comparison to GON and GON Cross, for the Fitz–Hugh Nagumo system and protein signalling transduction pathway for signal to noise ratios of 10 and 5. We held the standard deviation of the noise at the value that was used to generate the data, where in practice this could be estimated through a standard GP regression. We used the true value in order to observe whether this approach affects the results and to what extent, under the most favourable conditions.

## Discussion

We have modified a recently developed gradient matching approach for systems biology (INF) by combining it with a parallel tempering scheme for the gradient mismatch parameter (C&S). We have also carried out a comparative evaluation of this new method with various state-of-the-art gradient matching methods. These methods are based on different inference approaches and statistical models, namely: non-parametric Bayesian statistics using GPs (INF, LB2, LB10), splines-based smooth functional tempering (C&S), hierarchical regularisation using splines interpolation (RAM), and penalised likelihood based on reproducing kernel Hilbert spaces (GON). Our set-ups have also allowed us to compare the opposing paradigms of Bayesian inference (INF) versus parallel tempering (LB2, LB10) of the slack parameters controlling the amount of mismatch between the gradients.

In one case, when the number of observations was very high (higher than what would be expected in these types of experiments) and the tuning parameters were finely adjusted (which is time-consuming in practice), the C&S method does well. When the dataset size was reduced, all settings for this method deteriorated significantly, including the previous tuning setting that performed well. It is also important to note that the particular settings that we found to be optimal were different than in the original paper, which highlights the sensitivity and lack of robustness in the splines based method.

The GON and GON Cross methods produce estimates that are close to the true parameters in terms of absolute uncertainty. For the Fitz-Hugh Nagumo ODE model, the method outperforms the other schemes for one parameter, in the case when the signal to noise ratio was 10 and 25 datapoints were generated. For the protein signalling transduction pathway, however, this method is outperformed by INF, LB2 and LB10. This method also has a drawback to practical implementation, on non-simulated data. The method, which uses a classical approach to parameter estimation (producing point estimates), cannot immediately produce confidence intervals for the parameters and so quantifying the uncertainty in the parameter estimates will be more difficult. For simulated data, this is not an issue, since it is possible to generate multiple datasets and quantify the accuracy of the method by observing the results across all datasets. In practice however, this is not available. One would need to rely on other processes, such as bootstrapping, and the effect on the accuracy and computational time is something that needs to be investigated.

The INF method performs reasonably by producing results close to the true parameters across the scenarios that we have examined. However, this method’s decrease in uncertainty is at the expense of bias.

The LB2 and LB10 methods show good performance across the set-ups. The parameter inference is accurate across the different ODE models and the different settings of those models. The parallel tempering schedule has proven to be quite robust, as the methods perform similarly across the various set-ups.

For some simulations, we noticed a flattening of the time course signals for INF, LB2 and LB10. The uncertainty in the signals reduced the accuracy in the methods. In order to achieve a robust method that provides accurate parameter estimation, we examined holding the standard deviation at the true value. In this case, the GON and GON Cross outperformed INF, LB2 and LB10 on one parameter in the Fitz-Hugh Nagumo system, when the signal to noise ratio was 10. For the signal to noise ratio setting of 5, the methods all performed similarly. The INF, LB2 and LB10 methods outperform the GON Cross method for the protein signalling transduction pathway. Holding the standard deviation of the noise at the true value, for the INF, LB2 and LB10 methods, stops the likelihood term from effectively being switched off and prevents the flattening. In practice, this parameter could be estimated by a standard GP regression, in order to fix the standard deviation of the noise when the true value is unknown. This is a somewhat heuristic fix to the problem however, and a general robust solution should be the focus for future research.

It is also important to note that the methods in this paper were not derived in order to operate with a particular ODE model. The results therefore, should be similar across ODE type. We have seen evidence of this in other ODE systems, like the Lorenz attractor and the Lotka-Volterra predator-prey model, which are less relevant to the field of biomedical engineering, though, and are thus beyond the scope of the present paper.

## Conclusions

The combination of adaptive gradient matching using GPs from Dondelinger et al. [4] and a parallel tempering scheme for the gradient mismatch parameter from Campbell and Steele [5], has yielded a method that provides accurate parameter estimates for ODEs when the true standard deviation of the noise is known. This method performs well across ODE models and variation of the scheduling of the tempered mismatch parameter.

We have found that the method in Dondelinger et al. [4] provides accurate estimation, although the decrease in uncertainty is at the expense of bias. The method in Campbell and Steele [5] shows a lack of robustness, due to the difficulty in configuring the splines settings. For the method in Ramsay et al. [3], we found it was outperformed by the other methods we looked at. The method in González et al. [14] is accurate and robust, but can be outperformed by Dondelinger et al. [4] and the proposed method in this paper. For a signal to noise ratio of 10 on the Fitz-Hugh Nagumo system, the González et al. [14] method is able to outperform the method in Dondelinger et al. [4] and our new method, for one parameter. We found that using cross validation as opposed to AIC for the González et al. [14] method, to estimate the penalty parameter, yielded results that were more robust.

In order to avoid a potential drawback to our proposed method, we hold the standard deviation of the noise at the true value, to avoid the signals deviating from the data and flattening. This remedy was found to lead to a significant improvement over the method with a flexible standard deviation of the error. In practice, the standard deviation of the noise could be estimated, for example by a standard GP regression, and general approaches to this should be the focus of future research.

## Declarations

### Authors' contributions

The simulations for the LB2, LB10 and INF methods, as well as the writing of the manuscript, was carried out by Macdonald. Niu carried out the simulations for the GON and GON Cross methods. All authors contributed equally to selecting the methods, planning the simulations, interpreting the results and revising the manuscript. All authors read and approved the final manuscript.

### Acknowledgements

We would like to thank Dr Catherine Higham for helpful discussions on the research topic. We would like to thank Dr Caroline Haig for feedback on the manuscript. This project has been funded by the Engineering and Physical Sciences Research Council (EPSRC), grant agreement EP/L020319/1.

### Competing interests

The authors declare that they have no competing interests.

### Declarations

Publication costs for this article were funded by the Engineering and Physical Sciences Research Council (EPSRC), grant agreement EP/L020319/1.

**Open Access**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

## Authors’ Affiliations

## References

- Calderhead B, Girolami MA, Lawrence ND. Accelerating Bayesian inference over non-linear differential equations with Gaussian processes. Neural Inf Process Syst (NIPS). 2008;22.Google Scholar
- Liang H, Wu H. Parameter estimation for differential equation models using a framework of measurement error in regression models. J Am Stat Assoc. 2008;103:1570–83.MathSciNetView ArticleMATHGoogle Scholar
- Ramsay JO, Hooker G, Campbell D, Cao J. Parameter estimation for differential equations: a generalized smoothing approach. J R Statist. 2007;69:741–96.MathSciNetView ArticleGoogle Scholar
- Dondelinger F, Filippone M, Rogers S, Husmeier D. ODE parameter inference using adaptive gradient matching with Gaussian processes. The 16th Int Conf Artif Intell Stat (AISTATS) 31 JMLR. 2013:216–28.Google Scholar
- Campbell D, Steele RJ. Smooth functional tempering for nonlinear differential equation models. Stat Comput. 2012;22:429–43.MathSciNetView ArticleMATHGoogle Scholar
- Adon NA, Jabbar MH, Mahmud F. FPGA implementation for cardiac excitation-conduction simulation based on FitzHugh-Nagumo model. 5th Int Conf Biomed Eng Vietnam. 2015;46.Google Scholar
- Duckett G, Barkley D. Modeling the dynamics of cardiac action potentials. Phys Rev Lett. 2000;85:884–7.View ArticleGoogle Scholar
- Goktepe S, Kuhl, E. Computational modeling of cardiac electrophysiology: a novel finite element approach. Int J Numer Methods Eng. 2009.Google Scholar
- Bruggemeier B, Schusterreiter C, Pavlou H, Jenkins N, Lynch S, Bianchi A, Cai X. Improving the utility of drosophila melanogaster for neurodegenerative disease research by modelling courtship behaviour patterns. Report summarising the outcomes from the UK NC3R’s and POEM’s meeting. 2014.Google Scholar
- Vivekanandan S, Emmanuel DS, Kumari R. Propogation of action potential for Hansen’s disease affected nerve model using FitzHugh Nagumo like excitation. J Theor Appl Inf Technol. 2013.Google Scholar
- Martin GS. Cell signaling and cancer. Meeting review. Cancer. 2003.Google Scholar
- Kim EK, Choi E-J. Pathological roles of mapk signaling pathways in human diseases. Biochimica et Biophysica Acta (BBA)–molecular basis of disease. Amsterdam: Elsevier; 2010.Google Scholar
- Macdonald B, Husmeier D. Computational inference in systems biology. Bioinformatics and Biomedical Engineering:Third International Conference, IWBBIO. Proceedings, Part II. Series: Lecture Notes in Computer Science, vol. 9044. Berlin: Springer; 2015. p. 276–88.Google Scholar
- González J, Vujačić I, Wit E. Inferring latent gene regulatory network kinetics. Stat Appl Genet Mol Biol. 2013;12(1):109–27.MathSciNetGoogle Scholar
- Macdonald B, Husmeier D. Gradient matching methods for computational inference in mechanistic models for systems biology: a review and comparative analysis. Front Bioeng Biotechnol. 2015;3:180.View ArticleGoogle Scholar
- Solak E, Murray-Smith R, Leithead WE, Leith DJ, Rasmussen CE. Derivative observations in Gaussian process models of dynamic systems. Adv Neural Inf Process Syst. 2003; 9–14.Google Scholar
- Holsclaw, T., Sansó B, Lee HKH, Heitmann K, Habi S, Higdon D, Alam U. Gaussian process modeling of derivative curves. Technometrics. 2011.Google Scholar
- Bishop CM. Pattern recognition and machine learning. Berlin: Springer; 2006.Google Scholar
- Murray I, Adams R. Slice sampling covariance hyperparameters of latent gaussian models. Adv Neural Inf Process Syst (NIPS); 2010:23.Google Scholar
- Friel N, Pettitt AN. Marginal likelihood estimation via power posteriors. J R Stat Soc. 2008;70:589–607.MathSciNetView ArticleMATHGoogle Scholar
- Murphy KP. Machine learning. A probabilistic perspective. The MIT Press. 2012.Google Scholar
- Aronszajin N. Green’s functions and reproducing kernels. Proceedings of the Symposium on spectral theory and differential problems; 1951:355–411.Google Scholar
- Rasmussen CE, Williams CKI. Gaussian processes for machine learning. The MIT Press; 2006.Google Scholar
- FitzHugh R. Impulses and physiological states in models of nerve membrane. Biophys J. 1961;1:445–66.View ArticleGoogle Scholar
- Nagumo JS, Arimoto S, Yoshizawa S. An active pulse transmission line simulating a nerve axon. Proc Inst Radio Eng. 1962;50:2061–70.Google Scholar
- Vyshemirsky V, Girolami MA. Bayesian ranking of biochemical system models. Bioinformatics. 2008;24(6):839–83.View ArticleGoogle Scholar