Linear regression with random error giving EXACT predefined parameter estimates

January 26, 2016

When simulating linear models with a defined slope and intercept plus added Gaussian noise, the parameter estimates obtained after least-squares fitting vary around the predefined values. Here is some code I developed that applies a double transform to these models so as to obtain a fitted model with EXACT predefined parameter estimates a (intercept) and b (slope).

It does so by:
1) Fitting linear model #1, Y_i = \beta_0 + \beta_1X_i + \varepsilon_i, to the x,y data, yielding estimates \hat{\beta}_0 and \hat{\beta}_1.
2) Rescaling y by the slope estimate of model #1: Y_i \leftarrow Y_i \cdot (\mathrm{b}/\hat{\beta}_1).
3) Refitting linear model #2: Y_i = \beta_0 + \beta_1X_i + \varepsilon_i.
4) Shifting y by the intercept estimate of model #2: Y_i \leftarrow Y_i + (\mathrm{a} - \hat{\beta}_0).
5) Refitting linear model #3: Y_i = \beta_0 + \beta_1X_i + \varepsilon_i, which is the final model with exact parameter estimates a and b.
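The five steps can be sketched in a few lines of base R. This standalone sketch uses illustrative values for a, b and the noise level; it is not the full function below, just the core transform:

```r
## Standalone sketch of the double transform (illustrative values)
set.seed(123)
x <- 1:100
a <- 0.5; b <- 0.2
y <- a + b * x + rnorm(100, 0, 2)   # linear model with gaussian noise

c1 <- coef(lm(y ~ x))               # model #1
y  <- y * (b / c1[2])               # rescale => refitted slope is exactly b
c2 <- coef(lm(y ~ x))               # model #2
y  <- y + (a - c2[1])               # shift => refitted intercept is exactly a
c3 <- coef(lm(y ~ x))               # model #3: exact estimates

all.equal(as.numeric(c3), c(a, b))  # TRUE (up to machine precision)
```

The rescaling works because the least-squares slope is linear in y, so multiplying y by b/\hat{\beta}_1 scales the refitted slope to exactly b; the subsequent shift only moves the intercept and leaves the slope untouched.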

Below is the code:

exactLM <- function(
  x = 1:100,       ## predictor values
  b = 0.01,        ## slope
  a = 3,           ## intercept
  error = NULL,    ## homoscedastic error
  n = 1,           ## number of replicates
  weights = NULL,  ## possible weights, i.e. when heteroscedastic
  plot = TRUE,     ## plot data and regression
  ...              ## other parameters to 'lm'
)
{
  if (is.null(error)) stop("Please define some error!")

  ## create x- and y-values
  x <- rep(x, n)
  y <- a + b * x
  if (length(error) != length(x)) stop("'x' and 'error' must be of same length!")
  if (!is.null(weights) && length(weights) != length(x)) stop("'x' and 'weights' must be of same length!")

  ## add error
  y <- y + error

  ## fit linear model #1
  LM1 <- lm(y ~ x, weights = weights, ...)
  COEF1 <- coef(LM1)

  ## correct slope and fit linear model #2
  y <- y * (b/COEF1[2])
  LM2 <- lm(y ~ x, weights = weights, ...)
  COEF2 <- coef(LM2)

  ## correct intercept and fit linear model #3
  y <- y + (a - COEF2[1])
  LM3 <- lm(y ~ x, weights = weights, ...)

  ## optionally plot data and regression
  if (plot) {
    plot(x, y, pch = 16)
    abline(LM3, lwd = 2, col = "darkred")
  }

  return(list(model = LM3, x = x, y = y))
}

Here are some applications using replicates and weighted fitting:

############ Examples #################
## n = 1
exactLM(x = 1:100, a = 0.5, b = 0.2, error = rnorm(100, 0, 2))

## n = 3
exactLM(x = 1:100, a = 0.5, b = 0.2, error = rnorm(300, 0, 2), n = 3)

## weighted by exact 1/var
x <- 1:100
error <- rnorm(100, 0, 0.1 * x)
weights <- 1/(0.1 * x)^2
exactLM(x = x, a = 0.5, b = 0.2, error = error, weights = weights)

## weighted by empirical 1/var
x <- rep(1:100, 3)
error <- rnorm(300, 0, 0.1 * x)
weights <- rep(1/tapply(error, x, var), 3)
exactLM(x = x, a = 0.5, b = 0.2, error = error, weights = weights)

I am curious about comments concerning simplification and, more importantly, applications (other than cheating with data…)!

Cheers,
Andrej


Introducing: Orthogonal Nonlinear Least-Squares Regression in R

January 18, 2015

With this post I want to introduce my newly bred ‘onls’ package, which performs Orthogonal Nonlinear Least-Squares Regression (ONLS):
http://cran.r-project.org/web/packages/onls/index.html.
Orthogonal nonlinear least squares (ONLS) is a not-so-frequently applied and maybe overlooked regression technique that comes into question when one encounters an “error-in-variables” problem. While classical nonlinear least squares (NLS) aims to minimize the sum of squared vertical residuals, ONLS minimizes the sum of squared orthogonal residuals. The method is based on finding points on the fitted line that are orthogonal to the data by minimizing, for each (x_i, y_i), the Euclidean distance \|D_i\| to some point (x_{0i}, y_{0i}) on the fitted curve.

There is a 25-year-old FORTRAN implementation of ONLS available (ODRPACK, http://www.netlib.org/toms/869.zip), which has been included in the ‘scipy’ package for Python (http://docs.scipy.org/doc/scipy-0.14.0/reference/odr.html). Here, onls has been developed for easy future algorithm tweaking in R. The results obtained from onls are essentially identical to those of the original implementation [1, 2]. The algorithm is based on an inner loop using optimize for each (x_i, y_i) to find \min \|D_i\| within some window [x_{i-w}, x_{i+w}], and an outer loop for the fit parameters using nls.lm of the ‘minpack.lm’ package. Sensible starting parameters for onls are obtained by prior fitting with standard nls, as parameter values for ONLS are usually fairly similar to those from NLS.
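To give a rough idea of the inner loop (this is an illustration, not the actual onls code), one can use optimize to find, for a single data point, the point on an example curve that minimizes the squared Euclidean distance. The curve, data point and search interval below are made up for demonstration:

```r
## Sketch of the orthogonal-distance inner loop (illustrative values)
f  <- function(x) 2.4 / (1 + exp((1.6 - log(x)) / 1.1))  # example logistic curve
xi <- 2; yi <- 1.2                                       # one (x_i, y_i) data point

## squared Euclidean distance from (xi, yi) to a point (x0, f(x0)) on the curve
D2 <- function(x0) (xi - x0)^2 + (yi - f(x0))^2

opt <- optimize(D2, interval = c(0.1, 10))  # inner-loop minimization
x0  <- opt$minimum
y0  <- f(x0)
## (x0, y0) is the curve point closest to (xi, yi) in Euclidean distance
```

In the package itself this minimization is carried out for every data point inside the outer nls.lm parameter loop.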

There is a package vignette available with more details in the “/onls/inst” folder, especially on what to do if fitting fails or not all points are orthogonal. I will work through one example here, the famous DNase 1 dataset of the nls documentation, with 10% added error. The semantics are exactly as in nls, albeit with a (somewhat) different output:


> DNase1 <- subset(DNase, Run == 1)
> DNase1$density <- sapply(DNase1$density, function(x) rnorm(1, x, 0.1 * x))
> mod1 <- onls(density ~ Asym/(1 + exp((xmid - log(conc))/scal)),
data = DNase1, start = list(Asym = 3, xmid = 0, scal = 1))

Obtaining starting parameters from ordinary NLS...
Passed...
Relative error in the sum of squares is at most `ftol'.
Optimizing orthogonal NLS...
Passed... Relative error in the sum of squares is at most `ftol'.

The print.onls method gives, as in nls, the parameter values and the vertical residual sum-of-squares. However, the orthogonal residual sum-of-squares is also returned and MOST IMPORTANTLY, information on how many points (x_{0i}, y_{0i}) are actually orthogonal to (x_i, y_i) after fitting:

> print(mod1)
Nonlinear orthogonal regression model
model: density ~ Asym/(1 + exp((xmid - log(conc))/scal))
data: DNase1
Asym xmid scal
2.422 1.568 1.099
vertical residual sum-of-squares: 0.2282
orthogonal residual sum-of-squares: 0.2234
PASSED: 16 out of 16 fitted points are orthogonal.

Number of iterations to convergence: 2
Achieved convergence tolerance: 1.49e-08

Checking all points for orthogonality is accomplished with the independent checking routine check_o, which calculates the angle between the slope \mathrm{m}_i of the tangent obtained from the first derivative at (x_{0i}, y_{0i}) and the slope \mathrm{n}_i of the onls-minimized Euclidean distance between (x_{0i}, y_{0i}) and (x_i, y_i):

\tan(\alpha_i) = \left|\frac{\mathrm{m}_i - \mathrm{n}_i}{1 + \mathrm{m}_i \cdot \mathrm{n}_i}\right|, \quad \mathrm{m}_i = \left.\frac{df(x, \beta)}{dx}\right|_{x_{0i}}, \quad \mathrm{n}_i = \frac{y_i - y_{0i}}{x_i - x_{0i}}
\Rightarrow \alpha_i[^{\circ}] = \tan^{-1}\left(\left|\frac{\mathrm{m}_i - \mathrm{n}_i}{1 + \mathrm{m}_i \cdot \mathrm{n}_i}\right|\right) \cdot \frac{360}{2\pi}, which should be 90^{\circ} if the Euclidean distance has been minimized.
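The angle criterion can be verified numerically. The following standalone sketch reuses a made-up example curve and data point, finds the closest curve point with optimize, and computes \alpha_i from a numerical derivative (everything here is illustrative, not the actual check_o code):

```r
## Numerical orthogonality check (illustrative, not the actual check_o code)
f  <- function(x) 2.4 / (1 + exp((1.6 - log(x)) / 1.1))  # example curve
xi <- 2; yi <- 1.2                                       # one data point

D2 <- function(x0) (xi - x0)^2 + (yi - f(x0))^2          # squared distance
x0 <- optimize(D2, interval = c(0.1, 10))$minimum
y0 <- f(x0)

m <- (f(x0 + 1e-6) - f(x0 - 1e-6)) / 2e-6   # tangent slope at x0 (numerical)
n <- (yi - y0) / (xi - x0)                  # slope of the connecting segment
alpha <- atan(abs((m - n) / (1 + m * n))) * 180 / pi
round(alpha, 1)                             # close to 90 if distance minimized
```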

When plotting an ONLS model with the plot.onls function, it is important to know that orthogonality is only evident with equal scaling of both axes:

> plot(mod1, xlim = c(0, 0.5), ylim = c(0, 0.5))

[Figure: ONLS fit to the DNase1 data with equal axis scaling, showing the orthogonal residuals.]

As with nls, all generics work:
print(mod1), plot(mod1), summary(mod1), predict(mod1, newdata = data.frame(conc = 6)), logLik(mod1), deviance(mod1), formula(mod1), weights(mod1), df.residual(mod1), fitted(mod1), residuals(mod1), vcov(mod1), coef(mod1), confint(mod1).
However, deviance and residuals deliver the vertical, standard NLS values. To calculate orthogonal deviance and obtain orthogonal residuals, use deviance_o and residuals_o.

[1] ALGORITHM 676 ODRPACK: Software for Weighted Orthogonal Distance Regression.
Boggs PT, Donaldson JR, Byrd RH and Schnabel RB.
ACM Trans Math Soft (1989), 15: 348-364.
[2] User’s Reference Guide for ODRPACK Version 2.01.
Software for Weighted Orthogonal Distance Regression.
Boggs PT, Byrd RH, Rogers JE and Schnabel RB.
NISTIR (1992), 4834: 1-113.

Cheers,
Andrej