a (intercept) and b (slope).
It does so by:
1) Fitting linear model #1 to the x,y data.
2) Correcting y by multiplying it with the ratio of the desired to the fitted slope, y * (b/b1).
3) Refitting linear model #2.
4) Correcting y by adding the difference between the desired and the fitted intercept, y + (a - a2).
5) Refitting linear model #3, which is the final model with parameter estimates a and b.
Below is the code:
exactLM <- function(
x = 1:100,       ## predictor values
b = 0.01,        ## slope
a = 3,           ## intercept
error = NULL,    ## homoscedastic error
n = 1,           ## number of replicates
weights = NULL,  ## possible weights, i.e. when heteroscedastic
plot = TRUE,     ## plot data and regression
...              ## other parameters to 'lm'
)
{
  if (is.null(error)) stop("Please define some error!")
  ## create x and y values
  x <- rep(x, n)
  y <- a + b * x
  if (length(error) != length(x)) stop("'x' and 'error' must be of same length!")
  if (!is.null(weights) & length(weights) != length(x)) stop("'x' and 'weights' must be of same length!")
  ## add error
  y <- y + error
  ## create linear model #1
  LM1 <- lm(y ~ x, weights = weights, ...)
  COEF1 <- coef(LM1)
  ## correct slope and create linear model #2
  y <- y * (b/COEF1[2])
  LM2 <- lm(y ~ x, weights = weights, ...)
  COEF2 <- coef(LM2)
  ## correct intercept and create linear model #3
  y <- y + (a - COEF2[1])
  LM3 <- lm(y ~ x, weights = weights, ...)
  ## plot data and regression
  if (plot) {
    plot(x, y, pch = 16)
    abline(LM3, lwd = 2, col = "darkred")
  }
  return(list(model = LM3, x = x, y = y))
}
Here are some applications using replicates and weighted fitting:
############ Examples #################
## n = 1
exactLM(x = 1:100, a = 0.5, b = 0.2, error = rnorm(100, 0, 2))
## n = 3
exactLM(x = 1:100, a = 0.5, b = 0.2, error = rnorm(300, 0, 2), n = 3)
## weighted by exact 1/var
x <- 1:100
error <- rnorm(100, 0, 0.1 * x)
weights <- 1/(0.1 * x)^2
exactLM(x = x, a = 0.5, b = 0.2, error = error, weights = weights)
## weighted by empirical 1/var
x <- rep(1:100, 3)
error <- rnorm(300, 0, 0.1 * x)
weights <- rep(1/tapply(error, x, var), 3)
exactLM(x = x, a = 0.5, b = 0.2, error = error, weights = weights)
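The exactness of the correction can be verified independently of the function above, with a condensed sketch of steps 1–5:

```r
## simulate noisy data, then force slope and intercept to their targets
set.seed(123)
x <- 1:100
y <- 0.5 + 0.2 * x + rnorm(100, 0, 2)
y <- y * (0.2/coef(lm(y ~ x))[2])    # slope correction (step 2)
y <- y + (0.5 - coef(lm(y ~ x))[1])  # intercept correction (step 4)
coef(lm(y ~ x))                      # 0.5 and 0.2, up to floating point
```

The slope correction scales the fitted slope to exactly b, and the subsequent additive shift fixes the intercept at exactly a without changing the slope.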
I am curious about comments concerning simplification and, more importantly, applications (other than cheating with data…)!
Cheers,
Andrej
onls
has been developed for easy future algorithm tweaking in R. The results obtained from onls
are essentially identical to those of the original ODRPACK implementation [1, 2]. It is based on an inner loop that uses optimize
for each data point to find its orthogonal counterpart on the curve within some search border, and an outer loop for the fit parameters using nls.lm
of the 'minpack.lm' package. Sensible starting parameters for onls
are obtained by prior fitting with standard nls
, as parameter values for ONLS are usually fairly similar to those from NLS.
There is a package vignette available with more details in the "/onls/inst" folder, especially on what to do if fitting fails or not all points are orthogonal. I will work through one example here, the famous DNase 1 dataset from the nls
documentation, with 10% added error. The semantics are exactly as in nls
, albeit with a (somewhat) different output:
> DNase1 <- subset(DNase, Run == 1)
> DNase1$density <- sapply(DNase1$density, function(x) rnorm(1, x, 0.1 * x))
> mod1 <- onls(density ~ Asym/(1 + exp((xmid - log(conc))/scal)),
data = DNase1, start = list(Asym = 3, xmid = 0, scal = 1))
Obtaining starting parameters from ordinary NLS...
Passed...
Relative error in the sum of squares is at most `ftol'.
Optimizing orthogonal NLS...
Passed... Relative error in the sum of squares is at most `ftol'.
The print.onls
method gives, as in nls
, the parameter values and the vertical residual sum-of-squares. However, the orthogonal residual sum-of-squares is also returned and, MOST IMPORTANTLY, information on how many points are actually orthogonal to the fitted curve after fitting:
> print(mod1)
Nonlinear orthogonal regression model
model: density ~ Asym/(1 + exp((xmid - log(conc))/scal))
data: DNase1
Asym xmid scal
2.422 1.568 1.099
vertical residual sum-of-squares: 0.2282
orthogonal residual sum-of-squares: 0.2234
PASSED: 16 out of 16 fitted points are orthogonal.
Number of iterations to convergence: 2
Achieved convergence tolerance: 1.49e-08
Checking all points for orthogonality is accomplished using the independent checking routine check_o
, which, for each point, calculates the angle between the tangent to the curve at the fitted point (slope obtained from the first derivative) and the segment connecting the data point to its fitted point, whose Euclidean distance onls
has minimized. This angle should be 90°, if the Euclidean distance has indeed been minimized.
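The orthogonality criterion reduces to the usual angle-between-slopes formula; here is a self-contained sketch (angle_deg is a hypothetical helper, not part of onls):

```r
## angle (in degrees) between two lines with slopes m1 and m2;
## the lines are orthogonal exactly when 1 + m1 * m2 == 0
angle_deg <- function(m1, m2) atan(abs((m1 - m2)/(1 + m1 * m2))) * 180/pi
angle_deg(2, -1/2)  # 90: a tangent of slope 2 vs. a connecting segment of slope -1/2
```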
When plotting an ONLS model with the plot.onls
function, it is important to know that orthogonality is only evident with equal scaling of both axes:
> plot(mod1, xlim = c(0, 0.5), ylim = c(0, 0.5))
As with nls
, all generics work:
print(mod1), plot(mod1), summary(mod1), predict(mod1, newdata = data.frame(conc = 6)), logLik(mod1), deviance(mod1), formula(mod1), weights(mod1), df.residual(mod1), fitted(mod1), residuals(mod1), vcov(mod1), coef(mod1), confint(mod1).
However, deviance
and residuals
deliver the vertical, standard NLS values. To calculate the orthogonal deviance and obtain the orthogonal residuals, use deviance_o
and residuals_o
.
[1] ALGORITHM 676 ODRPACK: Software for Weighted Orthogonal Distance Regression.
Boggs PT, Donaldson JR, Byrd RH and Schnabel RB.
ACM Trans Math Soft (1989), 15: 348-364.
[2] User’s Reference Guide for ODRPACK Version 2.01.
Software for Weighted Orthogonal Distance Regression.
Boggs PT, Byrd RH, Rogers JE and Schnabel RB.
NISTIR (1992), 4834: 1113.
Cheers,
Andrej
I have added an interval
function to my 'propagate' package (now on CRAN) that conducts error propagation based on interval arithmetic. It calculates the uncertainty of a model by using interval arithmetic based on (what I call) a "combinatorial sequence grid evaluation" approach, thereby avoiding the classical dependency problem that often inflates the result interval. So for a function f(x1, …, xn) with n interval-valued variables, we have to create all combinations of the interval endpoints, evaluate their function values and select the minimum and maximum as the result interval.
The so-called dependency problem is a major obstacle to the application of interval arithmetic and arises when the same variable exists in several terms of a complicated and often nonlinear function. In these cases, overestimation can deliver a result interval that is significantly larger than the true range of the function. For an example, see here under "Dependency problem". A partial solution to this problem is to refine the input intervals by dividing them into smaller subranges, obtaining a sequence of grid points for each variable. Again, all combinations are evaluated as described above, resulting in a larger number of function values, in which the minimum and maximum may be closer to the true ones. This is the "combinatorial sequence grid evaluation" approach, which works quite well in scenarios where monotonicity changes direction, obviating the need to create multivariate derivatives (Hessians) or use some multivariate minimization algorithm.
If the interval contains zero, a zero is included into the middle of the sequence to avoid wrong results in case of even powers, i.e. x^2 over [-1, 1] would otherwise deliver [1, 1] when actually the correct interval is [0, 1], as exemplified by curve(x^2, -1, 1)
. Some examples to illustrate:
## Example 2: A complicated nonlinear model.
## Reduce sequence length to 2 => original interval
## for quicker evaluation.
EXPR2 <- expression(C * sqrt((520 * H * P)/(M * (t + 460))))
H <- c(64, 65)
M <- c(16, 16.2)
P <- c(361, 365)
t <- c(165, 170)
C <- c(38.4, 38.5)
DAT2 <- makeDat(EXPR2)
interval(DAT2, EXPR2, seq = 2)
[1317.494, 1352.277]
## Example 5: Overestimation from dependency problem.
# Original interval with seq = 2 => [1, 7]
EXPR5 <- expression(x^2 - x + 1)
x <- c(-2, 1)
DAT5 <- makeDat(EXPR5)
interval(DAT5, EXPR5, seq = 2)
[1, 7]
# Refine with large sequence => [0.75, 7]
interval(DAT5, EXPR5, seq = 100)
[0.7502296, 7]
# Tallies with the curve function.
curve(x^2 - x + 1, -2, 1)
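The core idea can be reproduced in a few lines of base R (interval_sketch is a hypothetical helper, not the package's interval function):

```r
## evaluate a function on all combinations of per-variable sequence grids
## and take min/max as the result interval
interval_sketch <- function(f, lims, n = 100) {
  grids <- lapply(lims, function(l) seq(l[1], l[2], length.out = n))
  vals <- do.call(f, expand.grid(grids))
  c(min(vals), max(vals))
}
interval_sketch(function(x) x^2 - x + 1, list(x = c(-2, 1)))
# close to the true range [0.75, 7]
```

With a fine enough grid, the interior minimum at x = 0.5 is captured, which pure endpoint evaluation (seq = 2) misses.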
Have fun!
Cheers,
Andrej
The corresponding p-value can be obtained by 1 - pchisq(chi^2, nu)
in R.
To see that this actually works, we can Monte Carlo simulate some heteroscedastic data with defined variance as a function of magnitude and compare unweighted and weighted NLS.
First, we take the example from the nls documentation and fit an enzyme kinetic model:
DNase1 <- subset(DNase, Run == 1)
fm3DNase1 <- nls(density ~ Asym/(1 + exp((xmid - log(conc))/scal)),
data = DNase1,
start = list(Asym = 3, xmid = 0, scal = 1))
Then we take the fitted values (which are duplicated because of the initial replicates), create a new unique dataset, and for each concentration sample 20 response values from a normal distribution with 2% heteroscedastic Gaussian noise as a function of the value's magnitude:
FITTED <- unique(fitted(fm3DNase1))
DAT <- sapply(FITTED, function(x) rnorm(20, mean = x, sd = 0.02 * x))
matplot(t(DAT), type = "p", pch = 16, lty = 1, col = 1)
lines(FITTED, col = 2)
Now we create the new dataframe to be fitted. For this we have to stack the unique conc values and the simulated density values into a 2-column dataframe:
CONC <- unique(DNase1$conc)
fitDAT <- data.frame(conc = rep(CONC, each = 20), density = matrix(DAT))
First we create the unweighted fit:
FIT1 <- nls(density ~ Asym/(1 + exp((xmid - log(conc))/scal)),
data = fitDAT,
start = list(Asym = 3, xmid = 0, scal = 1))
Then we fit the data with weights w = 1/var(y). IMPORTANT: we need to replicate each weight value 20 times in order to match the length of the data.
VAR <- tapply(fitDAT$density, fitDAT$conc, var)
VAR <- rep(VAR, each = 20)
FIT2 <- nls(density ~ Asym/(1 + exp((xmid - log(conc))/scal)),
data = fitDAT, weights = 1/VAR,
start = list(Asym = 3, xmid = 0, scal = 1))
For the calculation of chi-square and its corresponding p-value, we use the fitchisq function of my 'qpcR' package:
library(qpcR)
> fitchisq(FIT1)
$chi2
[1] 191.7566
$chi2.red
[1] 1.22138
$p.value
[1] 0.03074883
> fitchisq(FIT2)
$chi2
[1] 156.7153
$chi2.red
[1] 0.9981866
$p.value
[1] 0.4913983
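What fitchisq computes can be sketched by hand, assuming the per-point error s.d. is known (reduced_chisq is a hypothetical helper, not the qpcR code):

```r
## chi-square = sum of squared, sd-scaled residuals;
## nu = number of points minus number of fitted parameters
reduced_chisq <- function(observed, expected, sd, n_par) {
  chi2 <- sum(((observed - expected)/sd)^2)
  nu <- length(observed) - n_par
  list(chi2 = chi2, chi2.red = chi2/nu, p.value = 1 - pchisq(chi2, nu))
}
```

For a good fit with correct weights, chi2.red should be close to 1, which is exactly the behavior seen for FIT2 above.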
Now we see the benefit of weighted fitting: only the weighted model shows us, with its reduced chi-square value of almost exactly 1 and its high p-value, that our fitted model approximates the parent model. And of course it does, because we simulated our data from it…
Cheers,
Andrej
* propagate
: A general function for the calculation of uncertainty propagation by first/second-order Taylor expansion and Monte Carlo simulation, including covariances. Input data can be any symbolic/numeric differentiable expression and data based on replicates, summaries (mean & s.d.) or sampled from a distribution. Uncertainty propagation is based completely on matrix calculus accounting for the full covariance structure. Monte Carlo simulation is conducted using multivariate normal or t-distributions with covariance structure. The second-order Taylor approximation is the new aspect, because it is not based on the assumption of linearity around the mean but uses a second-order polynomial to account for nonlinearities, making heavy use of numerical or symbolical Hessian matrices. Interestingly, the second-order approximation gives results quite similar to the MC simulations!
* plot.propagate
: Graphing error propagation with the histograms of the MC simulations and MC/Taylor-based confidence intervals.
* predictNLS
: The propagate
function is used to calculate the propagated error to the fitted values of a nonlinear model of type nls
or nlsLM
. Please refer to my post here: https://rmazing.wordpress.com/2013/08/26/predictnlspart2taylorapproximationconfidenceintervalsfornlsmodels/.
* makeGrad, makeHess, numGrad, numHess
: functions to create symbolical or numerical gradient vectors and Hessian matrices containing the first/second-order partial derivatives of an expression
. These can then be evaluated in an environment with evalDerivs
.
* fitDistr
: This function fits 21 different continuous distributions by (weighted) NLS to the histogram or kernel density of the Monte Carlo simulation results obtained by propagate, or of any other vector containing large-scale observations. Finally, the fits are sorted by ascending AIC.
* random samplers for 15 continuous distributions under one hood, some of them previously unavailable:
Skewed-normal distribution, Generalized normal distribution, Scaled and shifted t-distribution, Gumbel distribution, Johnson SU distribution, Johnson SB distribution, 3P Weibull distribution, 4P Beta distribution, Triangular distribution, Trapezoidal distribution, Curvilinear Trapezoidal distribution, Generalized trapezoidal distribution, Laplacian distribution, Arcsine distribution, von Mises distribution.
Most of them sample from the inverse cumulative distribution function, but numbers 11, 12 and 15 use a vectorized version of "Rejection Sampling", giving roughly 100,000 random numbers/s.
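Inverse-CDF sampling is straightforward whenever the quantile function has a closed form. For the Gumbel distribution with the standard location/scale parameterization (written out here as a sketch, not the package's sampler):

```r
## Gumbel quantile function: Q(p) = loc - scale * log(-log(p)),
## so feeding uniform random numbers through Q yields Gumbel samples
rgumbel_sketch <- function(n, loc = 0, scale = 1) loc - scale * log(-log(runif(n)))
set.seed(1)
mean(rgumbel_sketch(1e5))  # close to the Euler-Mascheroni constant 0.5772
```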
An example (without covariance for simplicity):
> DAT <- data.frame(a = c(5, 0.1), b = c(10, 0.1), x = c(1, 0.1))
> EXPR <- expression(a^b * x)
> res <- propagate(EXPR, DAT)
Results from error propagation:
Mean.1 Mean.2 sd.1 sd.2 2.5% 97.5%
9765625 10067885 2690477 2739850 4677411 15414333
Results from Monte Carlo simulation:
Mean sd Median MAD 2.5% 97.5%
10072640 2826027 9713207 2657217 5635222 16594123
The plot reveals the resulting distribution obtained from Monte Carlo simulation:
> plot(res)
Seems like a skewed distribution. We can now use fitDistr
to find out which one comes closest:
> fitDistr(res$resSIM)
Fitting Normal distribution...Done.
Fitting Skewed-normal distribution...Done.
Fitting Generalized normal distribution...Done.
Fitting Lognormal distribution...Done.
Fitting Scaled/shifted t distribution...Done.
Fitting Logistic distribution...Done.
Fitting Uniform distribution...Done.
Fitting Triangular distribution...Done.
Fitting Trapezoidal distribution...Done.
Fitting Curvilinear Trapezoidal distribution...Done.
Fitting Generalized Trapezoidal distribution...Done.
Fitting Gamma distribution...Done.
Fitting Cauchy distribution...Done.
Fitting Laplace distribution...Done.
Fitting Gumbel distribution...Done.
Fitting Johnson SU distribution...........10.........20.........30.........40.........50
.........60.........70.........80.Done.
Fitting Johnson SB distribution...........10.........20.........30.........40.........50
.........60.........70.........80.Done.
Fitting 3P Weibull distribution...........10.........20.......Done.
Fitting 4P Beta distribution...Done.
Fitting Arcsine distribution...Done.
Fitting von Mises distribution...Done.
$aic
Distribution AIC
4 Lognormal -4917.823
16 Johnson SU -4861.960
15 Gumbel -4595.917
19 4P Beta -4509.716
12 Gamma -4469.780
9 Trapezoidal -4340.195
1 Normal -4284.706
5 Scaled/shifted t -4283.070
6 Logistic -4266.171
3 Generalized normal -4264.102
14 Laplace -4144.870
13 Cauchy -4099.405
2 Skewed-normal -4060.936
11 Generalized Trapezoidal -4032.484
10 Curvilinear Trapezoidal -3996.495
8 Triangular -3970.993
7 Uniform -3933.513
20 Arcsine -3793.793
18 3P Weibull -3783.041
21 von Mises -3715.034
17 Johnson SB -3711.034
Lognormal wins, which makes perfect sense after using an exponentiation function...
Have fun with the package. Comments welcome!
Cheers,
Andrej
Filed under: General, R Internals Tagged: confidence interval, first-order, fitting, Monte Carlo, nls, nonlinear, predict, second-order, Taylor approximation
predictNLS (Part 2, Taylor approximation): confidence intervals for ‘nls’ models
https://rmazing.wordpress.com/2013/08/26/predictnlspart2taylorapproximationconfidenceintervalsfornlsmodels/
Mon, 26 Aug 2013 11:15:26 +0000
As promised, here is the second part on how to obtain confidence intervals for fitted values obtained from nonlinear regression via nls or nlsLM (package ‘minpack.lm’).
I covered a Monte Carlo approach in https://rmazing.wordpress.com/2013/08/14/predictnlspart1montecarlosimulationconfidenceintervalsfornlsmodels/, but here we will take a different approach: first- and second-order Taylor approximation of the model function around the mean parameter values.
Using Taylor approximation for calculating confidence intervals is a matter of propagating the uncertainties of the parameter estimates obtained from vcov(model) to the fitted value. When using first-order Taylor approximation, this is also known as the "Delta method". Those familiar with error propagation will know the classical formula
sigma_y^2 = sum_i (df/dx_i)^2 * sigma_i^2 + 2 * sum_{i&lt;j} (df/dx_i)(df/dx_j) * sigma_ij.
Heavily underused is the matrix notation of the famous formula above, for which a good derivation can be found at http://www.nada.kth.se/~kaia/papers/arrasTR9801R3.pdf:
sigma_y^2 = grad(f) * Sigma * t(grad(f)),
where grad(f) is the gradient vector of first-order partial derivatives and Sigma is the variance-covariance matrix. This formula corresponds to the first-order Taylor approximation. Now the problem with first-order approximations is that they assume linearity around the expansion point. Using the "Delta method" for nonlinear confidence intervals in R has been discussed in http://thebiobucket.blogspot.de/2011/04/fitsigmoidcurvewithconfidence.html or http://finzi.psych.upenn.edu/R/Rhelp02a/archive/42932.html.
For highly nonlinear functions we need (at least) a second-order polynomial around the expansion point to realistically estimate the surrounding interval (in the original figure: red is the linear approximation, blue the second-order polynomial, on a sine function).
Interestingly, there are also matrix-like notations for the second-order mean and variance in the literature (see http://dml.cz/dmlcz/141418 or http://iopscience.iop.org/00261394/44/3/012/pdf/00261394_44_3_012.pdf):
Second-order mean: E[y] = f(mu) + 0.5 * tr(H * Sigma).
Second-order variance: sigma_y^2 = grad(f) * Sigma * t(grad(f)) + 0.5 * tr(H * Sigma * H * Sigma),
where H is the Hessian matrix of second-order partial derivatives and tr is the matrix trace (sum of diagonals).
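The first-order formula can be checked numerically against simulation; a toy sketch with made-up values for f(a, b) = a*b, sampling correlated pairs via a Cholesky factor so no extra packages are needed:

```r
set.seed(42)
mu <- c(a = 2, b = 3)
Sigma <- matrix(c(0.01, 0.002, 0.002, 0.02), 2, 2)
grad <- c(3, 2)                         # df/da = b, df/db = a, evaluated at mu
var1 <- drop(t(grad) %*% Sigma %*% grad)
## simulate correlated (a, b) pairs and compare variances
Z <- matrix(rnorm(2e5), ncol = 2) %*% chol(Sigma)
sim <- (mu[1] + Z[, 1]) * (mu[2] + Z[, 2])
c(first_order = var1, simulated = var(sim))  # both close to 0.194
```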
Enough theory, for wrapping this all up we need three utility functions:
1) numGrad for calculating numerical first-order partial derivatives.
numGrad <- function(expr, envir = .GlobalEnv)
{
  f0 <- eval(expr, envir)
  vars <- all.vars(expr)
  p <- length(vars)
  x <- sapply(vars, function(a) get(a, envir))
  eps <- 1e-04
  d <- 0.1
  r <- 4
  v <- 2
  zero.tol <- sqrt(.Machine$double.eps/7e-07)
  h0 <- abs(d * x) + eps * (abs(x) < zero.tol)
  D <- matrix(0, length(f0), p)
  Daprox <- matrix(0, length(f0), r)
  for (i in 1:p) {
    h <- h0
    for (k in 1:r) {
      x1 <- x2 <- x
      x1 <- x1 + (i == (1:p)) * h
      f1 <- eval(expr, as.list(x1))
      x2 <- x2 - (i == (1:p)) * h
      f2 <- eval(expr, envir = as.list(x2))
      Daprox[, k] <- (f1 - f2)/(2 * h[i])
      h <- h/v
    }
    for (m in 1:(r - 1)) for (k in 1:(r - m)) {
      Daprox[, k] <- (Daprox[, k + 1] * (4^m) - Daprox[, k])/(4^m - 1)
    }
    D[, i] <- Daprox[, 1]
  }
  return(D)
}
2) numHess for calculating numerical second-order partial derivatives.
numHess <- function(expr, envir = .GlobalEnv)
{
  f0 <- eval(expr, envir)
  vars <- all.vars(expr)
  p <- length(vars)
  x <- sapply(vars, function(a) get(a, envir))
  eps <- 1e-04
  d <- 0.1
  r <- 4
  v <- 2
  zero.tol <- sqrt(.Machine$double.eps/7e-07)
  h0 <- abs(d * x) + eps * (abs(x) < zero.tol)
  Daprox <- matrix(0, length(f0), r)
  Hdiag <- matrix(0, length(f0), p)
  Haprox <- matrix(0, length(f0), r)
  H <- matrix(NA, p, p)
  for (i in 1:p) {
    h <- h0
    for (k in 1:r) {
      x1 <- x2 <- x
      x1 <- x1 + (i == (1:p)) * h
      f1 <- eval(expr, as.list(x1))
      x2 <- x2 - (i == (1:p)) * h
      f2 <- eval(expr, envir = as.list(x2))
      Haprox[, k] <- (f1 - 2 * f0 + f2)/h[i]^2
      h <- h/v
    }
    for (m in 1:(r - 1)) for (k in 1:(r - m)) {
      Haprox[, k] <- (Haprox[, k + 1] * (4^m) - Haprox[, k])/(4^m - 1)
    }
    Hdiag[, i] <- Haprox[, 1]
  }
  for (i in 1:p) {
    for (j in 1:i) {
      if (i == j) {
        H[i, j] <- Hdiag[, i]
      }
      else {
        h <- h0
        for (k in 1:r) {
          x1 <- x2 <- x
          x1 <- x1 + (i == (1:p)) * h + (j == (1:p)) * h
          f1 <- eval(expr, as.list(x1))
          x2 <- x2 - (i == (1:p)) * h - (j == (1:p)) * h
          f2 <- eval(expr, envir = as.list(x2))
          Daprox[, k] <- (f1 - 2 * f0 + f2 - Hdiag[, i] * h[i]^2 - Hdiag[, j] * h[j]^2)/(2 * h[i] * h[j])
          h <- h/v
        }
        for (m in 1:(r - 1)) for (k in 1:(r - m)) {
          Daprox[, k] <- (Daprox[, k + 1] * (4^m) - Daprox[, k])/(4^m - 1)
        }
        H[i, j] <- H[j, i] <- Daprox[, 1]
      }
    }
  }
  return(H)
}
And a small function for the matrix trace:
tr <- function(mat) sum(diag(mat), na.rm = TRUE)
1) and 2) are modified versions of the genD function in the “numDeriv” package that can handle expressions.
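To see the idea behind these approximations, here is a bare central-difference derivative (a self-contained sketch; numGrad above adds Richardson extrapolation on top of exactly this step):

```r
## plain central difference: (f(x + h) - f(x - h)) / (2h)
cgrad <- function(f, x, h = 1e-6) (f(x + h) - f(x - h))/(2 * h)
cgrad(function(x) x^2, 3)  # close to 6, the exact derivative of x^2 at x = 3
```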
Now we need the predictNLS function that wraps it all up:
predictNLS <- function(
object,
newdata,
interval = c("none", "confidence", "prediction"),
level = 0.95,
...
)
{
  require(MASS, quietly = TRUE)
  interval <- match.arg(interval)
  ## get right-hand side of formula
  RHS <- as.list(object$call$formula)[[3]]
  EXPR <- as.expression(RHS)
  ## all variables in model
  VARS <- all.vars(EXPR)
  ## coefficients
  COEF <- coef(object)
  ## extract predictor variable
  predNAME <- setdiff(VARS, names(COEF))
  ## take fitted values, if 'newdata' is missing
  if (missing(newdata)) {
    newdata <- eval(object$data)[predNAME]
    colnames(newdata) <- predNAME
  }
  ## check that 'newdata' has same name as predVAR
  if (names(newdata)[1] != predNAME) stop("newdata should have name '", predNAME, "'!")
  ## get variance-covariance matrix
  VCOV <- vcov(object)
  ## augment variance-covariance matrix
  ## by adding a column/row for 'error in x'
  NCOL <- ncol(VCOV)
  ADD1 <- c(rep(0, NCOL))
  ADD1 <- matrix(ADD1, ncol = 1)
  colnames(ADD1) <- predNAME
  VCOV <- cbind(VCOV, ADD1)
  ADD2 <- c(rep(0, NCOL + 1))
  ADD2 <- matrix(ADD2, nrow = 1)
  rownames(ADD2) <- predNAME
  VCOV <- rbind(VCOV, ADD2)
  NR <- nrow(newdata)
  varPLACE <- ncol(VCOV)
  outMAT <- NULL
  ## define counter function
  counter <- function(i)
  {
    if (i %% 10 == 0) cat(i) else cat(".")
    if (i %% 50 == 0) cat("\n")
    flush.console()
  }
  ## calculate residual variance
  r <- residuals(object)
  w <- weights(object)
  rss <- sum(if (is.null(w)) r^2 else r^2 * w)
  df <- df.residual(object)
  res.var <- rss/df
  ## iterate over all entries in 'newdata' as in usual 'predict.' functions
  for (i in 1:NR) {
    counter(i)
    ## get predictor values and optional errors
    predVAL <- newdata[i, 1]
    if (ncol(newdata) == 2) predERROR <- newdata[i, 2] else predERROR <- 0
    names(predVAL) <- predNAME
    names(predERROR) <- predNAME
    ## create mean vector
    meanVAL <- c(COEF, predVAL)
    ## create augmented variance-covariance matrix
    ## by putting error^2 in lower-right position of VCOV
    newVCOV <- VCOV
    newVCOV[varPLACE, varPLACE] <- predERROR^2
    SIGMA <- newVCOV
    ## first-order mean: eval(EXPR), first-order variance: G.S.t(G)
    MEAN1 <- try(eval(EXPR, envir = as.list(meanVAL)), silent = TRUE)
    if (inherits(MEAN1, "try-error")) stop("There was an error in evaluating the first-order mean!")
    GRAD <- try(numGrad(EXPR, as.list(meanVAL)), silent = TRUE)
    if (inherits(GRAD, "try-error")) stop("There was an error in creating the numeric gradient!")
    VAR1 <- GRAD %*% SIGMA %*% matrix(GRAD)
    ## second-order mean: firstMEAN + 0.5 * tr(H.S),
    ## second-order variance: firstVAR + 0.5 * tr(H.S.H.S)
    HESS <- try(numHess(EXPR, as.list(meanVAL)), silent = TRUE)
    if (inherits(HESS, "try-error")) stop("There was an error in creating the numeric Hessian!")
    valMEAN2 <- 0.5 * tr(HESS %*% SIGMA)
    valVAR2 <- 0.5 * tr(HESS %*% SIGMA %*% HESS %*% SIGMA)
    MEAN2 <- MEAN1 + valMEAN2
    VAR2 <- VAR1 + valVAR2
    ## confidence or prediction interval
    if (interval != "none") {
      tfrac <- abs(qt((1 - level)/2, df))
      INTERVAL <- tfrac * switch(interval, confidence = sqrt(VAR2),
                                 prediction = sqrt(VAR2 + res.var))
      LOWER <- MEAN2 - INTERVAL
      UPPER <- MEAN2 + INTERVAL
      names(LOWER) <- paste((1 - level)/2 * 100, "%", sep = "")
      names(UPPER) <- paste((1 - (1 - level)/2) * 100, "%", sep = "")
    } else {
      LOWER <- NULL
      UPPER <- NULL
    }
    RES <- c(mu.1 = MEAN1, mu.2 = MEAN2, sd.1 = sqrt(VAR1), sd.2 = sqrt(VAR2), LOWER, UPPER)
    outMAT <- rbind(outMAT, RES)
  }
  cat("\n")
  rownames(outMAT) <- NULL
  return(outMAT)
}
With all functions at hand, we can now go through the same example as used in the Monte Carlo post:
DNase1 <- subset(DNase, Run == 1)
fm1DNase1 <- nls(density ~ SSlogis(log(conc), Asym, xmid, scal), DNase1)
> predictNLS(fm1DNase1, newdata = data.frame(conc = 5), interval = "confidence")
.
mu.1 mu.2 sd.1 sd.2 2.5% 97.5%
[1,] 1.243631 1.243288 0.03620415 0.03620833 1.165064 1.321511
The errors/confidence intervals are larger than with the MC approach (who knows why?), but it is very interesting to see how close the second-order corrected mean (1.243288) comes to the mean of the simulated values from the Monte Carlo approach (1.243293)!
Both approaches (MC/Taylor) will be found in the predictNLS function that will be part of the "propagate" package, in a few days on CRAN…
Cheers,
Andrej
predictNLS (Part 1, Monte Carlo simulation): confidence intervals for ‘nls’ models
https://rmazing.wordpress.com/2013/08/14/predictnlspart1montecarlosimulationconfidenceintervalsfornlsmodels/
Wed, 14 Aug 2013 06:15:37 +0000
Those who do a lot of nonlinear fitting with the nls function may have noticed that predict.nls does not have a way to calculate a confidence interval for the fitted value. Using confint you can obtain the error of the fit parameters, but how about the error in fitted values? ?predict.nls says: "At present se.fit and interval are ignored." What a pity… This is largely due to the fact that confidence intervals for nonlinear fits are not easily calculated and are under some debate, see http://r.789695.n4.nabble.com/plottingconfidencebandsfrompredictnlstd3505012.html or http://thr3ads.net/rhelp/2011/05/1053390plottingconfidencebandsfrompredict.nls. In principle, since calculating the error in the fitted values is a matter of "error propagation", two different approaches can be used:
1) Error propagation using approximation by (first-order) Taylor expansion around the mean,
2) Error propagation using Monte Carlo simulation.
Topic 1) will be subject of my next post, today we will stick with the MC approach.
When calculating the error in the fitted values, we need to propagate the error of all variables, i.e. the error in the predictor values and the error of the fit parameters, to the response. Often (as in the 'Examples' section of nls), there is no error in the predictor values. The errors of the fit parameters are obtained, together with their correlations, in the variance-covariance matrix from vcov(object).
A Monte Carlo approach to nonlinear error propagation does the following:
1) Use the means and variances of all predictor variables and the vcov matrix of the fit parameters as input.
2) Create n samples of all variables from a multivariate normal distribution using this variance-covariance matrix.
3) Evaluate the model function on each row of simulated variables.
4) Calculate statistics (mean, s.d., median, mad) and quantile-based confidence intervals on the vector of evaluations.
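In miniature, the recipe boils down to the following toy sketch (for y = a*x with made-up values and no covariance, so plain rnorm suffices):

```r
set.seed(1)
a <- rnorm(20000, mean = 2, sd = 0.1)  # simulated fit parameter
x <- rnorm(20000, mean = 5, sd = 0.2)  # simulated predictor with error
y <- a * x                             # evaluate the model on each sample
c(mean = mean(y), sd = sd(y), quantile(y, c(0.025, 0.975)))
# mean close to 2 * 5 = 10
```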
This is exactly what the following function does: It takes an nls object, extracts the variables/parameter values/parameter variancecovariance matrix, creates an “augmented” covariance matrix (with the variance/covariance values from the parameters and predictor variables included, the latter often being zero), simulates from a multivariate normal distribution (using mvrnorm of the ‘MASS’ package), evaluates the function (object$call$formula) on the values and finally collects statistics. Here we go:
predictNLS <- function(
object,
newdata,
level = 0.95,
nsim = 10000,
...
)
{
  require(MASS, quietly = TRUE)
  ## get right-hand side of formula
  RHS <- as.list(object$call$formula)[[3]]
  EXPR <- as.expression(RHS)
  ## all variables in model
  VARS <- all.vars(EXPR)
  ## coefficients
  COEF <- coef(object)
  ## extract predictor variable
  predNAME <- setdiff(VARS, names(COEF))
  ## take fitted values, if 'newdata' is missing
  if (missing(newdata)) {
    newdata <- eval(object$data)[predNAME]
    colnames(newdata) <- predNAME
  }
  ## check that 'newdata' has same name as predVAR
  if (names(newdata)[1] != predNAME) stop("newdata should have name '", predNAME, "'!")
  ## get variance-covariance matrix
  VCOV <- vcov(object)
  ## augment variance-covariance matrix for 'mvrnorm'
  ## by adding a column/row for 'error in x'
  NCOL <- ncol(VCOV)
  ADD1 <- c(rep(0, NCOL))
  ADD1 <- matrix(ADD1, ncol = 1)
  colnames(ADD1) <- predNAME
  VCOV <- cbind(VCOV, ADD1)
  ADD2 <- c(rep(0, NCOL + 1))
  ADD2 <- matrix(ADD2, nrow = 1)
  rownames(ADD2) <- predNAME
  VCOV <- rbind(VCOV, ADD2)
  ## iterate over all entries in 'newdata' as in usual 'predict.' functions
  NR <- nrow(newdata)
  varPLACE <- ncol(VCOV)
  ## define counter function
  counter <- function(i)
  {
    if (i %% 10 == 0) cat(i) else cat(".")
    if (i %% 50 == 0) cat("\n")
    flush.console()
  }
  outMAT <- NULL
  for (i in 1:NR) {
    counter(i)
    ## get predictor values and optional errors
    predVAL <- newdata[i, 1]
    if (ncol(newdata) == 2) predERROR <- newdata[i, 2] else predERROR <- 0
    names(predVAL) <- predNAME
    names(predERROR) <- predNAME
    ## create mean vector for 'mvrnorm'
    MU <- c(COEF, predVAL)
    ## create variance-covariance matrix for 'mvrnorm'
    ## by putting error^2 in lower-right position of VCOV
    newVCOV <- VCOV
    newVCOV[varPLACE, varPLACE] <- predERROR^2
    ## create MC simulation matrix
    simMAT <- mvrnorm(n = nsim, mu = MU, Sigma = newVCOV, empirical = TRUE)
    ## evaluate expression on rows of simMAT
    EVAL <- try(eval(EXPR, envir = as.data.frame(simMAT)), silent = TRUE)
    if (inherits(EVAL, "try-error")) stop("There was an error evaluating the simulations!")
    ## collect statistics
    PRED <- data.frame(predVAL)
    colnames(PRED) <- predNAME
    FITTED <- predict(object, newdata = data.frame(PRED))
    MEAN.sim <- mean(EVAL, na.rm = TRUE)
    SD.sim <- sd(EVAL, na.rm = TRUE)
    MEDIAN.sim <- median(EVAL, na.rm = TRUE)
    MAD.sim <- mad(EVAL, na.rm = TRUE)
    QUANT <- quantile(EVAL, c((1 - level)/2, level + (1 - level)/2))
    RES <- c(FITTED, MEAN.sim, SD.sim, MEDIAN.sim, MAD.sim, QUANT[1], QUANT[2])
    outMAT <- rbind(outMAT, RES)
  }
  colnames(outMAT) <- c("fit", "mean", "sd", "median", "mad", names(QUANT[1]), names(QUANT[2]))
  rownames(outMAT) <- NULL
  cat("\n")
  return(outMAT)
}
The input is an 'nls' object and a data.frame 'newdata' of values to be predicted, with
the predictor values in the first column and (optional) errors-in-x (as standard deviations) in the second column.
The number of simulations can be tweaked with nsim, as well as the alpha level for the
confidence interval.
The output contains fit (fitted value), mean (mean of the simulations), sd (s.d. of the simulations), median (median of the simulations), mad (mad of the simulations) and the lower/upper confidence interval.
Ok, let’s go to it (taken from the ‘?nls’ documentation):
DNase1 <- subset(DNase, Run == 1)
fm1DNase1 <- nls(density ~ SSlogis(log(conc), Asym, xmid, scal), DNase1)
## usual predict.nls has no confidence intervals implemented
predict(fm1DNase1, newdata = data.frame(conc = 5), interval = "confidence")
[1] 1.243631
attr(,"gradient")
Asym xmid scal
[1,] 0.5302925 -0.5608912 -0.06804642
In the next post we will see how to use the gradient attribute to calculate a first-order Taylor expansion around the fitted value…
However, predictNLS gives us the error and confidence interval at conc = 5:
predictNLS(fm1DNase1, newdata = data.frame(conc = 5))
.
fit mean sd median mad 2.5% 97.5%
[1,] 1.243631 1.243293 0.009462893 1.243378 0.009637439 1.224608 1.261575
Interesting to see how close the mean of the simulation comes to the actual fitted value…
We could also add some error in conc to propagate to the fitted value:
> predictNLS(fm1DNase1, newdata = data.frame(conc = 5, error = 0.1))
.
fit mean sd median mad 2.5% 97.5%
[1,] 1.243631 1.243174 0.01467673 1.243162 0.01488567 1.214252 1.272103
Have fun. If anyone knows how to calculate a "prediction interval" (maybe quantile regression), give me a hint…
Cheers,
Andrej
Trivial, but useful: sequences with defined mean/s.d.
https://rmazing.wordpress.com/2013/07/31/trivialbutusefulsequenceswithdefinedmeansd/
Wed, 31 Jul 2013 11:51:16 +0000
O.k., the following post may be (mathematically) trivial, but could be somewhat useful for people that do simulations/testing of statistical methods.
Let's say we want to test the dependence of p-values derived from a t-test on a) the ratio of means between two groups, b) the standard deviation or c) the sample size(s) of the two groups. For this setup we would need to generate two groups with exactly defined mean and standard deviation.
Often in simulations, groups are generated with rnorm and then plugged into the simulation. However (and evidently), sampling from a normal distribution does not deliver a vector with exactly defined statistical properties (although the “law of large numbers” states that with a large enough sample size it converges to them…).
For example,
> x <- rnorm(1000, 5, 2)
> mean(x)
[1] 4.998388
> sd(x)
[1] 2.032262
shows what I meant above (mean and s.d. are close to, but not exactly, 5 and 2).
Luckily, we can create vectors with exact mean and s.d. by a “scaled-and-shifted z-transformation” of an input vector x:
Z = ((x − mean(x)) / sd(x)) · sd + mean
where sd is the desired standard deviation and mean the desired mean of the output vector Z.
The code is simple enough:
statVec <- function(x, mean, sd)
{
  X <- x
  MEAN <- mean
  SD <- sd
  Z <- ((X - mean(X, na.rm = TRUE))/sd(X, na.rm = TRUE)) * SD
  MEAN + Z
}
So, using this on the rnorm-generated vector x from above:
> z <- statVec(x, 5, 2)
> mean(z)
[1] 5
> sd(z)
[1] 2
we have created a vector with exact statistical properties. It is also still normally distributed, since scaling and shifting a normal random variable preserves normality.
Cheers, Andrej
Filed under: General Tagged: mean, s.d., sequence, z-transformation
wapply: A faster (but less functional) ‘rollapply’ for vector setups
https://rmazing.wordpress.com/2013/04/23/wapplyafasterbutlessfunctionalrollapplyforvectorsetups/
Tue, 23 Apr 2013 06:50:19 +0000
For some cryptic reason I needed a function that calculates function values on sliding windows of a vector. Googling around soon brought me to ‘rollapply’, which, when I tested it, seemed to be a very versatile function. However, I wanted to code my own version for vector purposes only, in the hope that it might be somewhat faster.
This is what came out (wapply, for “window apply”):
wapply <- function(x, width, by = NULL, FUN = NULL, ...)
{
  FUN <- match.fun(FUN)
  if (is.null(by)) by <- width
  lenX <- length(x)
  SEQ1 <- seq(1, lenX - width + 1, by = by)
  SEQ2 <- lapply(SEQ1, function(x) x:(x + width - 1))
  OUT <- lapply(SEQ2, function(a) FUN(x[a], ...))
  OUT <- base::simplify2array(OUT, higher = TRUE)
  return(OUT)
}
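To see what wapply actually returns, here is a quick sanity check on a small vector. The function is restated in condensed form (sapply instead of lapply + simplify2array) so the snippet is self-contained; the semantics are the same:

```r
## condensed restatement of wapply, for illustration only
wapply <- function(x, width, by = NULL, FUN = NULL, ...)
{
  FUN <- match.fun(FUN)
  if (is.null(by)) by <- width
  SEQ1 <- seq(1, length(x) - width + 1, by = by)
  sapply(SEQ1, function(i) FUN(x[i:(i + width - 1)], ...))
}

## means of the sliding windows 1:5, 2:6, ..., 6:10
wapply(1:10, width = 5, by = 1, FUN = mean)
## [1] 3 4 5 6 7 8
```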
It is much more restricted than ‘rollapply’ (no padding, left/center/right adjustment etc).
But interestingly, for some setups it is very much faster:
library(zoo)
x <- 1:200000
fun <- median  ## note: 'fun' was not defined in the original post; any summary function will do
large window, small slides:
> system.time(RES1 <- rollapply(x, width = 1000, by = 50, FUN = fun))
   user  system elapsed
   3.71    0.00    3.84
> system.time(RES2 <- wapply(x, width = 1000, by = 50, FUN = fun))
   user  system elapsed
   1.89    0.00    1.92
> all.equal(RES1, RES2)
[1] TRUE
small window, small slides:
> system.time(RES1 <- rollapply(x, width = 50, by = 50, FUN = fun))
   user  system elapsed
   2.59    0.00    2.67
> system.time(RES2 <- wapply(x, width = 50, by = 50, FUN = fun))
   user  system elapsed
   0.86    0.00    0.89
> all.equal(RES1, RES2)
[1] TRUE
small window, large slides:
> system.time(RES1 <- rollapply(x, width = 50, by = 1000, FUN = fun))
   user  system elapsed
   1.68    0.00    1.77
> system.time(RES2 <- wapply(x, width = 50, by = 1000, FUN = fun))
   user  system elapsed
   0.06    0.00    0.06
> all.equal(RES1, RES2)
[1] TRUE
There is about a 2-3 fold gain in speed for the above two setups, but a roughly 30-fold gain in the small window/large slides setup. Interesting…
I noticed that zoo:::rollapply.zoo uses mapply internally, maybe there is some overhead for pure vector calculations…
Cheers,
Andrej
Filed under: General Tagged: function, rollapply, vector, window
bigcor: Large correlation matrices in R
https://rmazing.wordpress.com/2013/02/22/bigcorlargecorrelationmatricesinr/
Fri, 22 Feb 2013 14:33:30 +0000
Here is the update to my bigcor function for creating large correlation/covariance matrices.
What does it do?
1) It splits a matrix with N columns into equal-size blocks (+ a remainder block) of columns A, B, C, etc. The block size can be defined by the user; 2000 is a good value because cor can handle a block of that size quite quickly. If the matrix has 13796 columns, the split will be 2000; 2000; 2000; 2000; 2000; 2000; 1796.
2) For all combinations of blocks, the correlation submatrix is calculated: A/A, A/B, B/B, etc.
3) The blockwise correlation matrices are transferred into the preallocated empty matrix at the appropriate positions (where the correlations would usually reside). To ensure symmetry around the diagonal, this is done twice: once in the upper and once in the lower triangle.
This way, the N x N empty correlation matrix is filled like a checkerboard with patches of n x n correlation submatrices. In our case, in which we split the N = 40000 columns into 8 blocks of n = 5000, we obtain 36 block combinations (28 from combn(1:8, 2), + 8 for A/A, B/B, etc.). This gives 36 x 5000 x 5000 x 2 (filling both triangles) − 8 x 5000 x 5000 (because the diagonal blocks are created twice) = 1.6E9 = 40000 x 40000 cells.
The function can be found here: http://www.drspiess.de/scripts/bigcor.r
Timing for a 21796 x 21796 matrix: roughly 2 min!
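Since the full function lives at the URL above, here is a minimal, self-contained sketch of the blockwise idea, under the simplifying assumption that the block count divides the column count evenly (no remainder block). The name blockCor is hypothetical; this is not the real bigcor code:

```r
## Sketch: fill an N x N correlation matrix from blockwise cor() calls.
blockCor <- function(MAT, nblocks = 4)
{
  NCOL <- ncol(MAT)
  stopifnot(NCOL %% nblocks == 0)  # simplification: no remainder block
  ## column indices of each block
  SPLIT <- split(1:NCOL, rep(1:nblocks, each = NCOL/nblocks))
  ## all unique block pairs, including A/A, B/B, ...
  COMBS <- unique(t(apply(expand.grid(1:nblocks, 1:nblocks), 1, sort)))
  RES <- matrix(NA, NCOL, NCOL)
  for (i in 1:nrow(COMBS)) {
    G1 <- SPLIT[[COMBS[i, 1]]]
    G2 <- SPLIT[[COMBS[i, 2]]]
    COR <- cor(MAT[, G1], MAT[, G2])
    RES[G1, G2] <- COR       # upper patch
    RES[G2, G1] <- t(COR)    # mirror into the lower triangle
  }
  RES
}

## small check against the one-shot version
set.seed(42)
MAT <- matrix(rnorm(200), nrow = 20, ncol = 10)
all.equal(blockCor(MAT, nblocks = 2), cor(MAT), check.attributes = FALSE)
## [1] TRUE
```

The real bigcor additionally handles the remainder block and uses preallocated big-matrix storage, which is what makes the 40000-column case feasible.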
Filed under: General Tagged: big matrix, cor, correlation