predictNLS (Part 1, Monte Carlo simulation): confidence intervals for ‘nls’ models

August 14, 2013

Those that do a lot of nonlinear fitting with the nls function may have noticed that predict.nls does not have a way to calculate a confidence interval for the fitted value. Using confint you can obtain the error of the fit parameters, but how about the error in fitted values? ?predict.nls says: “At present se.fit and interval are ignored.” What a pity… This is largely to the fact that confidence intervals for nonlinear fits are not easily calculated and under some debate, see http://r.789695.n4.nabble.com/plotting-confidence-bands-from-predict-nls-td3505012.html or http://thr3ads.net/r-help/2011/05/1053390-plotting-confidence-bands-from-predict.nls. In principle, since calculating the error in the fitted values is a matter of “error propagation”, two different approaches can be used:

1) Error propagation using approximation by (first-order) Taylor expansion around $f(x)$,

2) Error propagation using Monte Carlo simulation.

Topic 1) will be subject of my next post, today we will stick with the MC approach.

When calculating the error in the fitted values, we need to propagate the error of all variables, i.e. the error in all predictor variables $x_m$ and the error of the fit parameters $\beta$, to the response $y = f(x_m, \beta)$. Often (as in the ‘Examples’ section of nls), there is no error in the $x_m$-values. The errors of the fit parameters $\beta$ are obtained, together with their correlations, in the variance-covariance matrix $\Sigma$ from vcov(object).

A Monte Carlo approach to nonlinear error propagation does the following:

1) Use as input $\mu_m$ and $\sigma_m^2$ of all predictor variables and the vcov matrix $\Sigma$ of the fit parameters $\beta$.

2) For each variable $m$, we create $n$ samples from a multivariate normal distribution using the variance-covariance matrix: $x_{m, n} \sim \mathcal{N}(\mu, \Sigma)$.

3) We evaluate the function on each simulated variable: $y_n = f(x_{m, n}, \beta)$

4) We calculate statistics (mean, s.d., median, mad) and quantile-based confidence intervals on the vector $y_n$.

This is exactly what the following function does: It takes an nls object, extracts the variables/parameter values/parameter variance-covariance matrix, creates an “augmented” covariance matrix (with the variance/covariance values from the parameters and predictor variables included, the latter often being zero), simulates from a multivariate normal distribution (using mvrnorm of the ‘MASS’ package), evaluates the function (object$call$formula) on the values and finally collects statistics. Here we go:

predictNLS <- function(
object,
newdata,
level = 0.95,
nsim = 10000,
...
)
{
require(MASS, quietly = TRUE)

## get right-hand side of formula
RHS <- as.list(object$call$formula)[[3]]
EXPR <- as.expression(RHS)

## all variables in model
VARS <- all.vars(EXPR)

## coefficients
COEF <- coef(object)

## extract predictor variable
predNAME <- setdiff(VARS, names(COEF))

## take fitted values, if 'newdata' is missing
if (missing(newdata)) {
newdata <- eval(object$data)[predNAME] colnames(newdata) <- predNAME } ## check that 'newdata' has same name as predVAR if (names(newdata)[1] != predNAME) stop("newdata should have name '", predNAME, "'!") ## get parameter coefficients COEF <- coef(object) ## get variance-covariance matrix VCOV <- vcov(object) ## augment variance-covariance matrix for 'mvrnorm' ## by adding a column/row for 'error in x' NCOL <- ncol(VCOV) ADD1 <- c(rep(0, NCOL)) ADD1 <- matrix(ADD1, ncol = 1) colnames(ADD1) <- predNAME VCOV <- cbind(VCOV, ADD1) ADD2 <- c(rep(0, NCOL + 1)) ADD2 <- matrix(ADD2, nrow = 1) rownames(ADD2) <- predNAME VCOV <- rbind(VCOV, ADD2) ## iterate over all entries in 'newdata' as in usual 'predict.' functions NR <- nrow(newdata) respVEC <- numeric(NR) seVEC <- numeric(NR) varPLACE <- ncol(VCOV) ## define counter function counter <- function (i) { if (i%%10 == 0) cat(i) else cat(".") if (i%%50 == 0) cat("\n") flush.console() } outMAT <- NULL for (i in 1:NR) { counter(i) ## get predictor values and optional errors predVAL <- newdata[i, 1] if (ncol(newdata) == 2) predERROR <- newdata[i, 2] else predERROR <- 0 names(predVAL) <- predNAME names(predERROR) <- predNAME ## create mean vector for 'mvrnorm' MU <- c(COEF, predVAL) ## create variance-covariance matrix for 'mvrnorm' ## by putting error^2 in lower-right position of VCOV newVCOV <- VCOV newVCOV[varPLACE, varPLACE] <- predERROR^2 ## create MC simulation matrix simMAT <- mvrnorm(n = nsim, mu = MU, Sigma = newVCOV, empirical = TRUE) ## evaluate expression on rows of simMAT EVAL <- try(eval(EXPR, envir = as.data.frame(simMAT)), silent = TRUE) if (inherits(EVAL, "try-error")) stop("There was an error evaluating the simulations!") ## collect statistics PRED <- data.frame(predVAL) colnames(PRED) <- predNAME FITTED <- predict(object, newdata = data.frame(PRED)) MEAN.sim <- mean(EVAL, na.rm = TRUE) SD.sim <- sd(EVAL, na.rm = TRUE) MEDIAN.sim <- median(EVAL, na.rm = TRUE) MAD.sim <- mad(EVAL, na.rm = TRUE) QUANT <- quantile(EVAL, c((1 - level)/2, level + (1 - level)/2)) RES <- c(FITTED, MEAN.sim, SD.sim, MEDIAN.sim, MAD.sim, QUANT[1], QUANT[2]) outMAT <- rbind(outMAT, RES) } colnames(outMAT) <- c("fit", "mean", "sd", "median", "mad", names(QUANT[1]), names(QUANT[2])) rownames(outMAT) <- NULL cat("\n") return(outMAT) }  The input is an ‘nls’ object, a data.frame ‘newdata’ of values to be predicted with the value $x_{new}$ in the first column and (optional) “errors-in-x” (as $\sigma$) in the second column. The number of simulations can be tweaked with nsim as well as the alpha-level for the confidence interval. The output is $f(x_{new}, \beta)$ (fitted value), $\mu(y_n)$ (mean of simulation), $\sigma(y_n)$ (s.d. of simulation), $median(y_n)$ (median of simulation), $mad(y_n)$ (mad of simulation) and the lower/upper confidence interval. Ok, let’s go to it (taken from the ‘?nls’ documentation):  DNase1 <- subset(DNase, Run == 1) fm1DNase1 <- nls(density ~ SSlogis(log(conc), Asym, xmid, scal), DNase1) ## usual predict.nls has no confidence intervals implemented predict(fm1DNase1, newdata = data.frame(conc = 5), interval = "confidence") [1] 1.243631 attr(,"gradient") Asym xmid scal [1,] 0.5302925 -0.5608912 -0.06804642  In the next post we will see how to use the gradient attribute to calculate a first-order Taylor expansion around $f(x)$ However, predictNLS gives us the error and confidence interval at $x = 5$:  predictNLS(fm1DNase1, newdata = data.frame(conc = 5)) . fit mean sd median mad 2.5% 97.5% [1,] 1.243631 1.243293 0.009462893 1.243378 0.009637439 1.224608 1.261575  Interesting to see, how close the mean of the simulation comes to the actual fitted value… We could also add some error in $x$ to propagate to $y$:  > predictNLS(fm1DNase1, newdata = data.frame(conc = 5, error = 0.1)) . fit mean sd median mad 2.5% 97.5% [1,] 1.243631 1.243174 0.01467673 1.243162 0.01488567 1.214252 1.272103  Have fun. If anyone know how to calculate a “prediction interval” (maybe quantile regression) give me hint… Cheers, Andrej Advertisements Trivial, but useful: sequences with defined mean/s.d. July 31, 2013 O.k., the following post may be (mathematically) trivial, but could be somewhat useful for people that do simulations/testing of statistical methods. Let’s say we want to test the dependence of p-values derived from a t-test to a) the ratio of means between two groups, b) the standard deviation or c) the sample size(s) of the two groups. For this setup we would need to i.e. generate two groups with defined $\mu, \sigma$ and $n$. Often encountered in simulations is that groups are generated with rnorm and then plugged into the simulation. However (and evidently), it is clear that sampling from a normal distribution does not deliver a vector with exactly defined statistical properties (although the “law of large numbers” states that with enough large sample size it converges to that…). For example, > x <- rnorm(1000, 5, 2) > mean(x) [1] 4.998388 > sd(x) [1] 2.032262 shows what I meant above ($\mu_x \neq 5, \sigma_x \neq 2$). Luckily, we can create vectors with exact mean and s.d. by a “scaled-and-shifted z-transformation” of an input vector $X$: $Z = \frac{X - \mu_X}{\sigma_X} \cdot \mathbf{sd} + \mathbf{mean}$ where sd is the desired standard deviation and mean the desired mean of the output vector Z. The code is simple enough: statVec <- function(x, mean, sd) { X <- x MEAN <- mean SD <- sd Z <- (((X - mean(X, na.rm = TRUE))/sd(X, na.rm = TRUE))) * SD MEAN + Z }  So, using this on the rnorm-generated vector x from above: > z <- statVec(x, 5, 2) > mean(z) [1] 5 > sd(z) [1] 2 we have created a vector with exact statistical properties, which is also normally distributed since multiplication and addition of a normal distribution preserves normality. Cheers, Andrej wapply: A faster (but less functional) ‘rollapply’ for vector setups April 23, 2013 For some cryptic reason I needed a function that calculates function values on sliding windows of a vector. Googling around soon brought me to ‘rollapply’, which when I tested it seems to be a very versatile function. However, I wanted to code my own version just for vector purposes in the hope that it may be somewhat faster. This is what turned out (wapply for “window apply”): wapply <- function(x, width, by = NULL, FUN = NULL, ...) { FUN <- match.fun(FUN) if (is.null(by)) by <- width lenX <- length(x) SEQ1 <- seq(1, lenX - width + 1, by = by) SEQ2 <- lapply(SEQ1, function(x) x:(x + width - 1)) OUT <- lapply(SEQ2, function(a) FUN(x[a], ...)) OUT <- base:::simplify2array(OUT, higher = TRUE) return(OUT) }  It is much more restricted than ‘rollapply’ (no padding, left/center/right adjustment etc). But interestingly, for some setups it is very much faster: library(zoo) x <- 1:200000 large window, small slides: > system.time(RES1 <- rollapply(x, width = 1000, by = 50, FUN = fun)) User System verstrichen 3.71 0.00 3.84 > system.time(RES2 <- wapply(x, width = 1000, by = 50, FUN = fun)) User System verstrichen 1.89 0.00 1.92 > all.equal(RES1, RES2) [1] TRUE small window, small slides: > system.time(RES1 <- rollapply(x, width = 50, by = 50, FUN = fun)) User System verstrichen 2.59 0.00 2.67 > system.time(RES2 <- wapply(x, width = 50, by = 50, FUN = fun)) User System verstrichen 0.86 0.00 0.89 > all.equal(RES1, RES2) [1] TRUE small window, large slides: > system.time(RES1 <- rollapply(x, width = 50, by = 1000, FUN = fun)) User System verstrichen 1.68 0.00 1.77 > system.time(RES2 <- wapply(x, width = 50, by = 1000, FUN = fun)) User System verstrichen 0.06 0.00 0.06 > all.equal(RES1, RES2) [1] TRUE There is about a 2-3 fold gain in speed for the above two setups but a 35-fold gain in the small window/large slides setup. Interesting… I noticed that zoo:::rollapply.zoo uses mapply internally, maybe there is some overhead for pure vector calculations… Cheers, Andrej bigcor: Large correlation matrices in R February 22, 2013 Here is the update to my bigcor function for creating large correlation/covariance matrices. What does it do? 1) It splits a matrix with N columns into equal size blocks (+ a remainder block) of $A_1 ... A_n$$B_1 ... B_n$, etc. Block size can be defined by user, 2000 is a good value because cor can handle this quite quickly. If the matrix has 13796 columns, the split will be 2000;2000;2000;2000;2000;2000;1796. 2) For all combinations of blocks, the correlation matrix is calculated, so$latex A_n/A_n, A_n/B_n, B_n/B_n etc.
3) The blockwise correlation matrices are transferred into the preallocated empty matrix at the appropriate position (where the correlations would usually reside). To ensure symmetry around the diagonal, this is done twice in the upper and lower triangle.

This way, the N x N empty correlation matrix is filled like a checkerboard with patches of n x n correlation sub-matrices. In our case, in which we split the N = 40000 columns into n = 8 blocks, we obtain 36 combinations (combn(1:8, 2) + 8; + 8 because of A/A, B/B etc) of blocks with dimension 5000 x 5000 each. This gives 36 x 5000 x 5000 x 2 (filling both triangles) – 8 x 5000 x 5000 (because the diagonals are created twice) = 1.6E9 = 40000 x 40000.

The function can found here: http://www.dr-spiess.de/scripts/bigcor.r
Timings for a 21796 x 21796 matrix roughly 2 min!

The magic empty bracket

January 30, 2013

I have been working with R for some time now, but once in a while, basic functions catch my eye that I was not aware of…
For some project I wanted to transform a correlation matrix into a covariance matrix. Now, since cor2cov does not exist, I thought about “reversing” the cov2cor function (stats:::cov2cor).
Inside the code of this function, a specific line jumped into my retina:

r[] <- Is * V * rep(Is, each = p)


What’s this [ ]?

Well, it stands for every element $E_{ij}$ of matrix $E$. Consider this:

mat <- matrix(NA, nrow = 5, ncol = 5)

> mat
[,1] [,2] [,3] [,4] [,5]
[1,]   NA   NA   NA   NA   NA
[2,]   NA   NA   NA   NA   NA
[3,]   NA   NA   NA   NA   NA
[4,]   NA   NA   NA   NA   NA
[5,]   NA   NA   NA   NA   NA

With the empty bracket, we can now substitute ALL values by a new value:

mat[] <- 1

> mat
[,1] [,2] [,3] [,4] [,5]
[1,]    1    1    1    1    1
[2,]    1    1    1    1    1
[3,]    1    1    1    1    1
[4,]    1    1    1    1    1
[5,]    1    1    1    1    1

Interestingly, this also works with lists:

L <- list(a = 1, b = 2, c = 3)

>L
$a [1] 1$b
[1] 2
$c [1] 3 L[] <- 5 > L$a
[1] 5

$b [1] 5$c
[1] 5

Cheers,
Andrej



Peer-reviewed R packages?

November 23, 2012

Dear R-Users,

a question: I am the author of the ‘qpcR’ package. Within this, there is a function ‘propagate’ that does error propagation based on Monte Carlo Simulation, permutation-based confidence intervals and Taylor expansion. For the latter I recently implemented a second-order Taylor expansion term that can correct for nonlinearity. The formulas are quite complex and I had quite a hard time to transform the second-order term that has covariance included into a matrix form. The first order expansion can be represented by GRAD * SIGMA * t(GRAD) with GRAD being the gradient matrix and SIGMA the covariance matrix. The second order term is 0.5 * trace(HESS * SIGMA * HESS * SIGMA) with HESS being the Hessian matrix of partial second derivatives.
So here the problem: Being a molecular biologist, my statistical skills have limitations. I am pretty sure my implementation is right (I checked the examples in GUM “Guide to the evaluation of uncertainties in experiments” and got the same results). BUT: Being sure says nothing. It could well be that I’m on the wrong track and people use a function that gives essentially wrong (maybe slightly wrong which may still be fatal…) results because they rely on me doing things right.
Now let’s suppose I’m not the only one with a queer feeling when submitting packages to CRAN in the hope that everything I did is right.
Which brings me to the question: What are the merits of a peer-review system for R packages? I would suppose (if the paradigm is right that peer-reviewing DOES increase quality…) that if packages were independently peer-reviewed in a way that a handful of other people inspect the code and check for detrimental bugs that are enigmatic to the packages author, then this would significantly increase the credibility of the packages functions.
Now I know that there is the “R Journal” or the “Journal of Statistical Software” that regularly publish papers where new R packages are explained in detail. However, they are restricted to the statistical background of the implementations and I’m not sure if the lines of code are actually checked.
An idea would be that such a review system is voluntary, but if a package has been reviewed and declared as “passed”, then this would be some sort of quality criterion. For scientists like me that need to write grant applications, this would give more impact on the work one has dedicated into developing a “good” package that could maybe be cited as being “peer-reviewed”.

Pros:
* increase of quality
* maybe citable
* beneficial for grant applications or C.V.

Cons:
* reviewers must meticulously inspect the code
* longer time to CRAN submission
* no platform yet to publish the reviews which can be cited like a journal

Cheers,
Andrej

A weighting function for ‘nls’ / ‘nlsLM’

July 19, 2012

Standard nonlinear regression assumes homoscedastic data, that is, all response values $y_i$ are distributed normally.  In case of heteroscedastic data (i.e. when the variance is dependent on the magnitude of the data), weighting the fit is essential. In nls (or nlsLM of the minpack.lm package), weighting can be conducted by two different methods:

1) by supplying a vector of weighting values $w_i$ for every $y_i$ that is supplied to the ‘weights’ argument.

2) by modifying the objective function in a way that it has the weights included in the body of the function. An example is given for this one in the documentation to nls.

1) is easy of course, but all $w_i$ values have to be derived from somewhere, for example the reciproce of the response values $w_i = 1/y_i$. 2) is a bit more cumbersome and one does not always want to define a new weighted version of the objective function.

I found it about time to design a weighting function that can be supplied to the ‘weights’ argument and that makes weighting by different regimes a breeze. wfct() returns a vector of weights that are calculated from a user-defined expression and transfers this vector within nls.

wfct <- function(expr)
{
expr <- deparse(substitute(expr))

## create new environment
newEnv <- new.env()

## get call
mc <- sys.calls()[[1]]
mcL <- as.list(mc)

## get data and write to newEnv
DATA <- mcL[["data"]]
DATA <- eval(DATA)
DATA <- as.list(DATA)
NAMES <- names(DATA)
for (i in 1:length(DATA)) assign(NAMES[i], DATA[[i]], envir = newEnv)

## get parameter, response and predictor names
formula <- as.formula(mcL[[2]])
VARS <- all.vars(formula)
RESP <- VARS[1]
RHS <- VARS[-1]
PRED <- match(RHS, names(DATA))
PRED <- names(DATA)[na.omit(PRED)]

## calculate variances for response values if "error" is in expression
## and write to newEnv
if (length(grep("error", expr)) > 0) {
y <- DATA[[RESP]]
x <- DATA[[PRED]]
## test for replication
if (!any(duplicated(x))) stop("No replicates available to calculate error from!")
## calculate error
error <- tapply(y, x, function(e) var(e, na.rm = TRUE))
error <- as.numeric(sqrt(error))
## convert to original repititions
error <- rep(error, as.numeric(table(x)))
assign("error", error, envir = newEnv)
}

## calculate fitted or residual values if "fitted"/"resid" is in expression
## and write to newEnv
if (length(grep("fitted", expr)) > 0 || length(grep("resid", expr)) > 0) {
mc2 <- mc
mc2$weights <- NULL MODEL <- eval(mc2) fitted <- fitted(MODEL) resid <- residuals(MODEL) assign("fitted", fitted, newEnv) assign("resid", resid, newEnv) } ## return evaluation in newEnv: vector of weights OUT <- eval(parse(text = expr), envir = newEnv) return(OUT) }  The weighting function can take 5 different variable definitions and combinations thereof: the name of the predictor (independent) variable, the name of the response (dependent) variable, error: if replicates $y_{ij}$ exist, the error $\sigma(y_{ij})$, fitted: the fitted values $\hat{y}_i$ of the model, resid: the residuals $y_i - \hat{y}_i$ of the model, For the last two, the model is fit unweighted, fitted values and residuals are extracted and the model is refit by the defined weights. Let’s have some examples (taken from the nls doc): Treated <- Puromycin[Puromycin$state == "treated", ]

Weighting by inverse of response $1/y_i$:

nls(rate ~ Vm * conc/(K + conc), data = Treated,
start = c(Vm = 200, K = 0.05), weights = wfct(1/rate))
Nonlinear regression model
model:  rate ~ Vm * conc/(K + conc)
data:  Treated
Vm         K
209.59669   0.06065
weighted residual sum-of-squares: 12.27

Number of iterations to convergence: 5
Achieved convergence tolerance: 5.2e-06


Weighting by square root of predictor $\sqrt{x_i}$:

nls(rate ~ Vm * conc/(K + conc), data = Treated,
start = c(Vm = 200, K = 0.05), weights = wfct(sqrt(conc)))
Nonlinear regression model
model:  rate ~ Vm * conc/(K + conc)
data:  Treated
Vm         K
216.74134   0.07106
weighted residual sum-of-squares: 332.6

Number of iterations to convergence: 5
Achieved convergence tolerance: 2.403e-06


Weighting by inverse square of fitted values $1/\hat{y_i}^2$:

nls(rate ~ Vm * conc/(K + conc), data = Treated,
start = c(Vm = 200, K = 0.05), weights = wfct(1/fitted^2))
Nonlinear regression model
model:  rate ~ Vm * conc/(K + conc)
data:  Treated
Vm         K
200.84098   0.04943
weighted residual sum-of-squares: 0.2272

Number of iterations to convergence: 4
Achieved convergence tolerance: 1.47e-06


A common weighting regime is to use the inverse variance $1/\sigma^2(y_i)$: