Today I want to advocate weighted nonlinear regression. Why so?

Minimum-variance estimation of the adjustable parameters in linear and nonlinear least squares requires that the data be weighted inversely to their variances. Only then is the estimator the BLUE (Best Linear Unbiased Estimator) for linear regression, and for nonlinear regression with small errors (http://en.wikipedia.org/wiki/Weighted_least_squares#Weighted_least_squares), an important fact that is frequently neglected, especially in scenarios with heteroscedasticity.
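To make the minimum-variance claim concrete, here is a small simulation sketch (a hypothetical linear example, not from the post): with heteroscedastic errors, the slope estimate from inverse-variance weighted least squares scatters less across replications than the unweighted one.

```r
# Hypothetical illustration: weighted vs. unweighted LS under heteroscedasticity.
# The error s.d. grows with x, so weights 1/sd^2 should give the minimum-variance
# (BLUE) slope estimate.
set.seed(1)
x <- 1:50
nsim <- 500
b_unw <- b_w <- numeric(nsim)
for (i in seq_len(nsim)) {
  y <- 2 + 3 * x + rnorm(50, sd = 0.2 * x)                 # true slope = 3
  b_unw[i] <- coef(lm(y ~ x))[2]                           # unweighted fit
  b_w[i]   <- coef(lm(y ~ x, weights = 1/(0.2 * x)^2))[2]  # inverse-variance weights
}
c(var_unweighted = var(b_unw), var_weighted = var(b_w))
```

Both estimators are unbiased, but the weighted one has markedly smaller sampling variance over the replications, which is exactly the minimum-variance property above.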

The variance of a fit, σ_fit², is also characterized by the statistic χ², defined as follows:

χ² = Σ_i (y_i − f(x_i))² / σ_i²

The relationship between σ_fit² and σ_i² can be seen most easily by comparison with the reduced χ²:

χ_ν² = χ²/ν = σ_fit² / ⟨σ_i²⟩

where ν = degrees of freedom (N − p) and ⟨σ_i²⟩ is the weighted average of the individual variances. If the fitting function is a good approximation to the parent function, the value of the reduced chi-square should be approximately unity, χ_ν² ≈ 1. If the fitting function is not appropriate for describing the data, the deviations will be larger and the estimated variance will be too large, yielding a value of χ_ν² greater than 1. A value of χ_ν² less than 1 can be a consequence of the fact that there exists an uncertainty in the determination of σ_i², and the observed values of χ_ν² will fluctuate from experiment to experiment. To assign significance to the χ² value, one can use the integral probability P_χ(χ²; ν), which describes the probability that a random set of data points sampled from the parent distribution would yield a value of χ² equal to or greater than the calculated one. This can be calculated by `1 - pchisq(chi^2, nu)`

in *R*.
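As a quick numerical check of this formula, take a χ² of 191.7566 on ν = 160 − 3 = 157 degrees of freedom (the numbers of the unweighted fit later in this post):

```r
# chi-square exceedance probability: P(chi^2 >= observed | nu d.f.)
chi2 <- 191.7566
nu <- 157
p <- 1 - pchisq(chi2, nu)
p  # about 0.031: a random dataset would rarely give a chi-square this large
```

Equivalently, `pchisq(chi2, nu, lower.tail = FALSE)` gives the same probability while avoiding the subtraction.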

To see that this actually works, we can Monte Carlo simulate some heteroscedastic data with defined variance as a function of the response magnitude and compare unweighted and weighted NLS.

First we take the example from the documentation of *nls* and fit an enzyme kinetics model:

```r
DNase1 <- subset(DNase, Run == 1)

fm3DNase1 <- nls(density ~ Asym/(1 + exp((xmid - log(conc))/scal)),
                 data = DNase1,
                 start = list(Asym = 3, xmid = 0, scal = 1))
```

Then we take the fitted values (which are duplicated because of the initial replicates), create a new unique dataset, and for each concentration sample 20 response values from a normal distribution with 2% heteroscedastic Gaussian noise proportional to the value’s magnitude:

```r
FITTED <- unique(fitted(fm3DNase1))
DAT <- sapply(FITTED, function(x) rnorm(20, mean = x, sd = 0.02 * x))
matplot(t(DAT), type = "p", pch = 16, lty = 1, col = 1)
lines(FITTED, col = 2)
```

Now we create the new dataframe to be fitted. For this we have to stack the concentration and response values into a 2-column dataframe:

```r
CONC <- unique(DNase1$conc)
fitDAT <- data.frame(conc = rep(CONC, each = 20), density = matrix(DAT))
```

First we create the unweighted fit:

```r
FIT1 <- nls(density ~ Asym/(1 + exp((xmid - log(conc))/scal)),
            data = fitDAT,
            start = list(Asym = 3, xmid = 0, scal = 1))
```

Then we fit the data with weights w = 1/σ², using the empirical variances. IMPORTANT: we need to replicate each weight value 20 times in order to match the length of the data.

```r
VAR <- tapply(fitDAT$density, fitDAT$conc, var)
VAR <- rep(VAR, each = 20)

FIT2 <- nls(density ~ Asym/(1 + exp((xmid - log(conc))/scal)),
            data = fitDAT, weights = 1/VAR,
            start = list(Asym = 3, xmid = 0, scal = 1))
```

For the calculation of χ_ν² and its corresponding p-value, we use the *fitchisq* function of my ‘qpcR’ package:

```r
library(qpcR)

> fitchisq(FIT1)
$chi2
[1] 191.7566

$chi2.red
[1] 1.22138

$p.value
[1] 0.03074883

> fitchisq(FIT2)
$chi2
[1] 156.7153

$chi2.red
[1] 0.9981866

$p.value
[1] 0.4913983
```

Now we see the benefit of weighted fitting: only the weighted model, with its reduced chi-square value of almost exactly 1 and its high p-value, shows us that our fitted model approximates the parent model. And of course it does, because we simulated our data from it…
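If the ‘qpcR’ package is not at hand, the statistic is easy to compute by hand. The following self-contained sketch re-simulates the data (with a fixed seed, so the numbers will differ slightly from the output above) and computes χ², χ_ν² and the p-value for the weighted fit directly:

```r
set.seed(123)  # fixed seed: numbers will differ from the post's run
DNase1 <- subset(DNase, Run == 1)
fm <- nls(density ~ Asym/(1 + exp((xmid - log(conc))/scal)),
          data = DNase1, start = list(Asym = 3, xmid = 0, scal = 1))

# simulate 20 responses per concentration with 2% heteroscedastic noise
FITTED <- unique(fitted(fm))
DAT <- sapply(FITTED, function(x) rnorm(20, mean = x, sd = 0.02 * x))
fitDAT <- data.frame(conc = rep(unique(DNase1$conc), each = 20),
                     density = as.vector(DAT))

# weighted fit with empirical inverse-variance weights
VAR <- rep(tapply(fitDAT$density, fitDAT$conc, var), each = 20)
FIT2 <- nls(density ~ Asym/(1 + exp((xmid - log(conc))/scal)),
            data = fitDAT, weights = 1/VAR,
            start = list(Asym = 3, xmid = 0, scal = 1))

chi2 <- sum((fitDAT$density - fitted(FIT2))^2 / VAR)  # weighted residual sum of squares
nu   <- nrow(fitDAT) - length(coef(FIT2))             # N - p degrees of freedom
c(chi2 = chi2, chi2.red = chi2 / nu, p.value = 1 - pchisq(chi2, nu))
```

Since the model is correct by construction, the reduced chi-square again lands close to 1.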

Cheers,

Andrej

Two small comments:

Instead of messing with `tapply` and `rep`, just use the little-known but very useful `ave` function:

```r
VAR <- ave(fitDAT$density, fitDAT$conc, FUN = var)
```
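A tiny toy check (made-up numbers) that `ave` reproduces the `tapply` + `rep` result element-wise:

```r
x <- c(1, 2, 3, 10, 12, 14)
g <- rep(c("a", "b"), each = 3)

v1 <- rep(tapply(x, g, var), each = 3)  # per-group variance, then replicate
v2 <- ave(x, g, FUN = var)              # same values, aligned with x directly
all.equal(unname(v1), v2)               # TRUE
```

Unlike the `rep(..., each = )` approach, `ave` also stays correctly aligned when the group members are not contiguous in the data.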

I haven’t actually used the `gnls` function before, but it seems better to use it to estimate the variance at each point within the actual fitting procedure than to manually estimate the variances beforehand and then treat the estimates as real/known parameters. Example:

```r
library(nlme)

FIT3 <- gnls(density ~ Asym/(1 + exp((xmid - log(conc))/scal)),
             data = fitDAT, weights = varIdent(form = ~ 1 | conc),
             start = list(Asym = 3, xmid = 0, scal = 1))
summary(FIT3)
coef(FIT3)
intervals(FIT3)
```

But unfortunately `fitchisq()` doesn’t work with `gnls` objects.

Hi Karl,

thanks for making me aware of ‘ave’, probably heavily underused because the name is not very intuitive 😉

If you can tell me how to extract the ‘gnls’-calculated error estimates, a reduced chi-square could also be calculated from ‘gnls’ objects. There is the `FIT3$apVar` item (a covariance matrix), however its diagonal does not even nearly match my calculated variance estimates. Any clue? Thanks in advance!

How to extract the error estimates is not very intuitive, but here’s how to do it.

First, let’s just check that the estimates from the `summary()` function agree (approximately) with the ones calculated manually:

```r
SD <- tapply(fitDAT$density, fitDAT$conc, FUN = sd)
SD/SD[1]
summary(FIT3)
```

(Look at the last line under the heading `Variance function:`.) These are the *relative* error estimates (i.e., scaled so that the first value is 1). To extract the absolute error estimates, we can use:

```r
FIT3$sigma * coef(FIT3$modelStruct$varStruct, unconstrained = FALSE, allCoef = TRUE)
```

The values are again very close to the values in the `SD` vector. Note that it’s possible to define other variance functions, e.g., for letting the variance increase as a power of the mean. See `?varClasses` for an overview.

@Karl,

thanks for the description! If I use the ‘gnls’-derived variances to calculate the reduced chi-square I get exactly 1, which surely comes from the variances being estimated during the fitting procedure that minimizes the residual sum-of-squares, so it must be 1. However, when I use my empirical variances in VAR on FIT3 I get 1.05, which is a bit inferior to FIT2. Or is a comparison this way out of line?

I’m not sure I understand the question. The empirical variances and the variances from `gnls` are *almost exactly* the same (and the latter ones will of course change a bit too if you make some adjustments to the `control` argument, like for all optimisation problems), so in practice there is no problem.

But the `gnls` estimates are in theory different (and if the model is correct, better) than the empirical ones. The empirical variances estimate the average square differences from the mean at each concentration, while the `gnls` estimates are the average square differences from the *model mean* (estimated by the fitted values). If the model is correct, the latter will be better estimators (since they always use *all* the data instead of only the data for the given concentration). If the model is wrong, the empirical variances might very well be better (but then the estimated means will be wrong, so what’s the use of the model?). Actually, if the model is wrong but not *too* wrong, I guess it’s still more useful to use the `gnls` estimates, as these measure the deviations from the fitted/predicted values (instead of from the ‘real’ values), which is what you’re really interested in.

BTW, for this dataset, where we *know* the correct form of the variance function (it’s proportional to the mean), we could have just used `varPower()`:

```r
FIT4 <- gnls(density ~ Asym/(1 + exp((xmid - log(conc))/scal)),
             data = fitDAT, weights = varPower(),
             start = list(Asym = 3, xmid = 0, scal = 1))
summary(FIT4)
```

The estimate of the coefficient 0.02 (from ‘sd = 0.02 * x’) is now listed as the ‘Residual standard error’.

@Karl: Thanks for your insights! Will use ‘gnls’ more in the future…