Monte Carlo-based prediction intervals for nonlinear regression

May 11, 2018

Calculation of the propagated uncertainty \sigma_y using \nabla \Sigma \nabla^T (1), where \nabla is the gradient and \Sigma the covariance matrix of the coefficients \beta_i, is called the “Delta Method” and is widely applied in nonlinear least-squares (NLS) fitting. However, this method is based on first-order Taylor expansion and thus assummes linearity around \hat{y} = f(x, \hat{\beta}). The second-order approach can partially correct for this restriction by using a second-order polynomial around \hat{y}, which is \nabla \Sigma \nabla^T + \frac{1}{2} \rm{tr}(H \Sigma H \Sigma) (2), where \rm{tr}(\cdot) is the matrix trace and \rm{H} is the Hessian.
Confidence and prediction intervals for NLS models are calculated using t(1 - \frac{\alpha}{2}, \nu) \cdot \sigma_y (3) or t(1 - \frac{\alpha}{2}, \nu) \cdot \sqrt{\sigma_y^2 + \sigma_r^2} (4), respectively, where the residual variance \sigma_r^2 = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\nu} (5).
Now, how can we employ the matrix notation of error propagation for creating Taylor expansion- and Monte Carlo-based prediction intervals?
The inclusion of \sigma_r^2 in the prediction interval can be implemented as an extended gradient and “augmented” covariance matrix. So instead of using \hat{y} = f(x, \hat{\beta}) (6), we take \hat{y} = f(x, \hat{\beta}) + \sigma_r^2 (7) as the expression and augment the i \times i covariance matrix \Sigma to an (i+1) \times (i+1) covariance matrix, where \Sigma_{i+1, i+1} = \sigma_r^2. Partial differentiation and matrix multiplication will then yield, for example with two coefficients \beta_1 and \beta_2 and their corresponding covariance matrix \Sigma:
\left[\frac{\partial f}{\partial \beta_1}\; \frac{\partial f}{\partial \beta_2}\; 1\right] \left[ \begin{array}{ccc} \;\sigma_1^2\;\;\; \sigma_1\sigma_2\;\; 0 \\ \sigma_2\sigma_1\;\; \sigma_2^2\;\;\;\; 0 \\ \;\;\;0\;\;\;\;\;\; 0\;\;\;\;\; \sigma_r^2 \end{array} \right] \left[ \begin{array}{c} \frac{\partial f}{\partial \beta_1} \\ \frac{\partial f}{\partial \beta_2} \\ 1 \end{array} \right] (8)
= \left(\frac{\partial f}{\partial \beta_1}\right)^2\sigma_1^2 + 2 \frac{\partial f}{\partial \beta_1} \frac{\partial f}{\partial \beta_2} \sigma_1 \sigma_2 + \left(\frac{\partial f}{\partial \beta_2}\right)^2\sigma_2^2 + \sigma_r^2 (9)
\equiv \sigma_y^2 + \sigma_r^2, which then goes into (4).
The advantage of the augmented covariance matrix is that it can be exploited for creating Monte Carlo-based prediction intervals. This is new from propagate version 1.0-6 and is based on the paradigm that we add another dimension by employing the augmented covariance matrix of (8) in the multivariate t-distribution random number generator (in our case rtmvt), with \mu = 0. All n simulations are then evaluated with (7) and the usual [1 - \frac{\alpha}{2}, \frac{\alpha}{2}] quantiles calculated for the prediction interval. Using the original covariance matrix with (6) will deliver the MC-based confidence interval.
Application of second-order Taylor expansion or the MC-based approach demonstrates nicely that for the majority of nonlinear models, the confidence/prediction intervals around \hat{y} are quite asymmetric, which the classical Delta Method does not capture:


DNase1 <- subset(DNase, Run == 1)
fm3DNase1 <- nls(density ~ Asym/(1 + exp((xmid - log(conc))/scal)),
data = DNase1, start = list(Asym = 3, xmid = 0, scal = 1))
## first-order prediction interval
PROP1 <- predictNLS(fm3DNase1, newdata = data.frame(conc = 2), nsim = 1000000,
second.order = FALSE, interval = "prediction")

\begin{array}{cc}  Prop.Mean.1 & 0.74804722 \\ & 0.02081131 \\  Prop.2.5 & 0.70308712 \\  Prop.97.5 & 0.79300731 \\  \end{array}

## second-order prediction interval and MC
PROP2 <- predictNLS(fm3DNase1, newdata = data.frame(conc = 2), nsim = 1000000,
second.order = TRUE, interval = "prediction")

\begin{array}{cc}  Prop.Mean.1 & 0.74804722 \\  Prop.Mean.2 & 0.74815136 \\ &  0.02081131 \\ &  0.02081520 \\  Prop.2.5 &  0.70318286 \\  Prop.97.5 & 0.79311987 \\  Sim.Mean & 0.74815598 \\ & 0.02261884 \\  Sim.2.5 & 0.70317629 \\  Sim.97.5 & 0.79309874 \\  \end{array}

What we see here is that
i) the first-order prediction interval [0.70308712; 0.79300731] is symmetric and slightly down-biased compared to the second-order one [0.70317629; 0.79309874],
ii) the second-order prediction interval tallies nicely up to the 4th decimal with the new MC-based interval (0.70318286 and 0.70317629; 0.79311987 and 0.79309874).
I believe this clearly demonstrates the usefulness of the MC-based approach for NLS prediction interval estimation…

Introducing: Orthogonal Nonlinear Least-Squares Regression in R

January 18, 2015

With this post I want to introduce my newly bred ‘onls’ package which conducts Orthogonal Nonlinear Least-Squares Regression (ONLS):
Orthogonal nonlinear least squares (ONLS) is a not so frequently applied and maybe overlooked regression technique that comes into question when one encounters an “error in variables” problem. While classical nonlinear least squares (NLS) aims to minimize the sum of squared vertical residuals, ONLS minimizes the sum of squared orthogonal residuals. The method is based on finding points on the fitted line that are orthogonal to the data by minimizing for each (x_i, y_i) the Euclidean distance \|D_i\| to some point (x_{0i}, y_{0i}) on the fitted curve. There is a 25 year old FORTRAN implementation for ONLS available (ODRPACK,, which has been included in the ‘scipy’ package for Python ( Here, onls has been developed for easy future algorithm tweaking in R. The results obtained from onls are exactly similar to those found in the original implementation [1, 2]. It is based on an inner loop using optimize for each (x_i, y_i) to find \min \|D_i\| within some border [x_{i-w}, x_{i+w}] and an outer loop for the fit parameters using nls.lm of the ‘minpack’ package. Sensible starting parameters for onls are obtained by prior fitting with standard nls, as parameter values for ONLS are usually fairly similar to those from NLS.

There is a package vignette available with more details in the “/onls/inst” folder, especially on what to do if fitting fails or not all points are orthogonal. I will work through one example here, the famous DNase 1 dataset of the nls documentation, with 10% added error. The semantics are exactly as in nls, albeit with a (somewhat) different output:

> DNase1 <- subset(DNase, Run == 1)
> DNase1$density <- sapply(DNase1$density, function(x) rnorm(1, x, 0.1 * x))
> mod1 <- onls(density ~ Asym/(1 + exp((xmid - log(conc))/scal)),
data = DNase1, start = list(Asym = 3, xmid = 0, scal = 1))

Obtaining starting parameters from ordinary NLS...
Relative error in the sum of squares is at most `ftol'.
Optimizing orthogonal NLS...
Passed... Relative error in the sum of squares is at most `ftol'.

The print.onls method gives, as in nls, the parameter values and the vertical residual sum-of-squares. However, the orthogonal residual sum-of-squares is also returned and MOST IMPORTANTLY, information on how many points (x_{0i}, y_{0i}) are actually orthogonal to (x_i, y_i) after fitting:

> print(mod1)
Nonlinear orthogonal regression model
model: density ~ Asym/(1 + exp((xmid - log(conc))/scal))
data: DNase1
Asym xmid scal
2.422 1.568 1.099
vertical residual sum-of-squares: 0.2282
orthogonal residual sum-of-squares: 0.2234
PASSED: 16 out of 16 fitted points are orthogonal.

Number of iterations to convergence: 2
Achieved convergence tolerance: 1.49e-08

Checking all points for orthogonality is accomplished using the independent checking routine check_o which calculates the angle between the slope \mathrm{m}_i of the tangent obtained from the first derivative at (x_{0i}, y_{0i}) and the slope \mathrm{n}_i of the onls-minimized Euclidean distance between (x_{0i}, y_{0i}) and (x_i, y_i):

\tan(\alpha_i) = \left|\frac{\mathrm{m}_i - \mathrm{n}_i}{1 + \mathrm{m}_i \cdot \mathrm{n}_i}\right|
\mathrm{m}_i = \frac{df(x, \beta)}{dx_{0i}}, \,\, \mathrm{n}_i = \frac{y_i - y_{0i}}{x_i - x_{0i}}
=> \alpha_i[^{\circ}] = \tan^{-1} \left( \left|\frac{\mathrm{m}_i - \mathrm{n}_i}{1 + \mathrm{m}_i \cdot \mathrm{n}_i}\right| \right) \cdot \frac{360}{2\pi} which should be 90^{\circ}, if the Euclidean distance has been minimized.

When plotting an ONLS model with the plot.onls function, it is important to know that orthogonality is only evident with equal scaling of both axes:

> plot(mod1, xlim = c(0, 0.5), ylim = c(0, 0.5))


As with nls, all generics work:
print(mod1), plot(mod1), summary(mod1), predict(mod1, newdata = data.frame(conc = 6)), logLik(mod1), deviance(mod1), formula(mod1), weights(mod1), df.residual(mod1), fitted(mod1), residuals(mod1), vcov(mod1), coef(mod1), confint(mod1).
However, deviance and residuals deliver the vertical, standard NLS values. To calculate orthogonal deviance and obtain orthogonal residuals, use deviance_o and residuals_o.

[1] ALGORITHM 676 ODRPACK: Software for Weighted Orthogonal Distance Regression.
Boggs PT, Donaldson JR, Byrd RH and Schnabel RB.
ACM Trans Math Soft (1989), 15: 348-364.
[2] User’s Reference Guide for ODRPACK Version 2.01.
Software for Weighted Orthogonal Distance Regression.
Boggs PT, Byrd RH, Rogers JE and Schnabel RB.\\
NISTIR (1992), 4834: 1-113.


I’ll take my NLS with weights, please…

January 13, 2014

Today I want to advocate weighted nonlinear regression. Why so?
Minimum-variance estimation of the adjustable parameters in linear and non-linear least squares requires that the data be weighted inversely as their variances w_i \propto \sigma^{-2}. Only then \hat{\beta} is the BLUE (Best Linear Unbiased Estimator) for linear regression and nonlinear regression with small errors (, an important fact frequently neglected, especially in scenarios with heteroscedasticity.
The variance of a fit s^2 is also characterized by the statistic \chi^2 defined as followed:
\chi^2 \equiv \sum_{i=1}^n \frac{(y_i - f(x_i))^2}{\sigma_i^2}
The relationship between s^2 and \chi^2 can be seen most easily by comparison with the reduced \chi^2: \chi_\nu^2 = \frac{\chi^2}{\nu} = \frac{s^2}{\langle \sigma_i^2 \rangle}
whereas \nu = degrees of freedom (N – p), and \langle \sigma_i^2 \rangle is the weighted average of the individual variances. If the fitting function is a good approximation to the parent function, the value of the reduced chi-square should be approximately unity, \chi_\nu^2 = 1. If the fitting function is not appropriate for describing the data, the deviations will be larger and the estimated variance will be too large, yielding a value greater than 1. A value less than 1 can be a consequence of the fact that there exists an uncertainty in the determination of s^2, and the observed values of \chi_\nu^2 will fluctuate from experiment to experiment. To assign significance to the \chi^2 value, one can use the integral probability P_\chi(\chi^2;\nu) = \int_{\chi^2}^\infty P_\chi(x^2, \nu)dx^2 which describes the probability that a random set of n data points sampled from the parent distribution would yield a value of \chi^2 equal to or greater than the calculated one. This can be calculated by 1 - pchisq(chi^2, nu) in R.

To see that this actually works, we can Monte Carlo simulate some heteroscedastic data with defined variance as a function of y-magnitude and compare unweighted and weighted NLS.
First we take the example from the documentation to nls and fit an enzyme kinetic model:

DNase1 <- subset(DNase, Run == 1)
fm3DNase1 <- nls(density ~ Asym/(1 + exp((xmid - log(conc))/scal)),
data = DNase1,
start = list(Asym = 3, xmid = 0, scal = 1))

Then we take the fitted values \hat{y} (which are duplicated because of the initial replicates), create a new unique dataset on which we create 20 response values y_i for each concentration x sampled from a normal distribution with 2% random heteroscedastic gaussian noise as a function of the value’s magnitude y_i = \hat{y} + \mathcal{N}(0, 0.02 \cdot \hat{y}):

FITTED <- unique(fitted(fm3DNase1))
DAT <- sapply(FITTED, function(x) rnorm(20, mean = x, sd = 0.02 * x))
matplot(t(DAT), type = "p", pch = 16, lty = 1, col = 1)
lines(FITTED, col = 2)

Now we create the new dataframe to be fitted. For this we have to stack the unique x– and y_i-values into a 2-column dataframe:

CONC <- unique(DNase1$conc)
fitDAT <- data.frame(conc = rep(CONC, each = 20), density = matrix(DAT))

First we create the unweighted fit:

FIT1 <- nls(density ~ Asym/(1 + exp((xmid - log(conc))/scal)),
data = fitDAT,
start = list(Asym = 3, xmid = 0, scal = 1))

Then we fit the data with weights w = 1/var(y). IMPORTANT: we need to replicate the weight values by 20 in order to match the data length.

VAR <- tapply(fitDAT$density, fitDAT$conc, var)
VAR <- rep(VAR, each = 20)
FIT2 <- nls(density ~ Asym/(1 + exp((xmid - log(conc))/scal)),
data = fitDAT, weights = 1/VAR,
start = list(Asym = 3, xmid = 0, scal = 1))

For calculation of \chi^2_\nu and its corresponding p-value, we use the fitchisq function of my ‘qpcR’ package:

> fitchisq(FIT1)
[1] 191.7566
[1] 1.22138
[1] 0.03074883

> fitchisq(FIT2)
[1] 156.7153
[1] 0.9981866
[1] 0.4913983

Now we see the benefit of weighted fitting: Only the weighted model shows us with it’s reduced chi-square value of almost exactly 1 and its high p-value that our fitted model approximates the parent model. And of course it does, because we simulated our data from it…