## Introducing ‘propagate’

With this post, I want to introduce the new ‘propagate’ package on CRAN.
It has one single purpose: propagation of uncertainties (“error propagation”). There is already one package on CRAN available for this task, named ‘metRology’ (http://cran.r-project.org/web/packages/metRology/index.html).
‘propagate’ has some additional functionality that some may find useful. The most important functions are:

* propagate: A general function for the calculation of uncertainty propagation by first-/second-order Taylor expansion and Monte Carlo simulation including covariances. Input data can be any symbolic/numeric differentiable expression and data based on replicates, summaries (mean & s.d.) or sampled from a distribution. Uncertainty propagation is based completely on matrix calculus accounting for full covariance structure. Monte Carlo simulation is conducted using multivariate normal or t-distributions with covariance structure. The second-order Taylor approximation is the new aspect, because it is not based on the assumption of linearity around $f(x)$ but uses a second-order polynomial to account for nonlinearities, making heavy use of numerical or symbolical Hessian matrices. Interestingly, the second-order approximation gives results quite similar to the MC simulations!
* plot.propagate: Graphing error propagation with the histograms of the MC simulations and MC/Taylor-based confidence intervals.
* predictNLS: The propagate function is used to calculate the propagated error to the fitted values of a nonlinear model of type nls or nlsLM. Please refer to my post here: https://rmazing.wordpress.com/2013/08/26/predictnls-part-2-taylor-approximation-confidence-intervals-for-nls-models/.
* makeGrad, makeHess, numGrad, numHess are functions to create symbolical or numerical gradient and Hessian matrices from an expression containing first/second-order partial derivatives. These can then be evaluated in an environment with evalDerivs.
* fitDistr: This function fits 21 different continuous distributions by (weighted) NLS to the histogram or kernel density of the Monte Carlo simulation results as obtained by propagate or any other vector containing large-scale observations. Finally, the fits are sorted by ascending AIC.
* random samplers for 15 continuous distributions under one hood, some of them previously unavailable:
Skewed-normal distribution, Generalized normal distributionm, Scaled and shifted t-distribution, Gumbel distribution, Johnson SU distribution, Johnson SB distribution, 3P Weibull distribution, 4P Beta distribution, Triangular distribution, Trapezoidal distribution, Curvilinear Trapezoidal distribution, Generalized trapezoidal distribution, Laplacian distribution, Arcsine distribution, von Mises distribution.
Most of them sample from the inverse cumulative distribution function, but 11, 12 and 15 use a vectorized version of “Rejection Sampling” giving roughly 100000 random numbers/s.

An example (without covariance for simplicity): $\mu_a = 5, \sigma_a = 0.1, \mu_b = 10, \sigma_b = 0.1, \mu_x = 1, \sigma_x = 0.1$
$f(x) = a^{bx}$:

>DAT <- data.frame(a = c(5, 0.1), b = c(10, 0.1), x = c(1, 0.1))
>EXPR <- expression(a^b*x)
>res <- propagate(EXPR, DAT)

Results from error propagation:
Mean.1 Mean.2 sd.1 sd.2 2.5% 97.5%
9765625 10067885 2690477 2739850 4677411 15414333

Results from Monte Carlo simulation:
Mean sd Median MAD 2.5% 97.5%
10072640 2826027 9713207 2657217 5635222 16594123

The plot reveals the resulting distribution obtained from Monte Carlo simulation:

>plot(res)

Seems like a skewed distributions. We can now use fitDistr to find out which comes closest:

> fitDistr(res$resSIM) Fitting Normal distribution...Done. Fitting Skewed-normal distribution...Done. Fitting Generalized normal distribution...Done. Fitting Log-normal distribution...Done. Fitting Scaled/shifted t- distribution...Done. Fitting Logistic distribution...Done. Fitting Uniform distribution...Done. Fitting Triangular distribution...Done. Fitting Trapezoidal distribution...Done. Fitting Curvilinear Trapezoidal distribution...Done. Fitting Generalized Trapezoidal distribution...Done. Fitting Gamma distribution...Done. Fitting Cauchy distribution...Done. Fitting Laplace distribution...Done. Fitting Gumbel distribution...Done. Fitting Johnson SU distribution...........10.........20.........30.........40.........50 .........60.........70.........80.Done. Fitting Johnson SB distribution...........10.........20.........30.........40.........50 .........60.........70.........80.Done. Fitting 3P Weibull distribution...........10.........20.......Done. Fitting 4P Beta distribution...Done. Fitting Arcsine distribution...Done. Fitting von Mises distribution...Done.$aic
Distribution AIC
4 Log-normal -4917.823
16 Johnson SU -4861.960
15 Gumbel -4595.917
19 4P Beta -4509.716
12 Gamma -4469.780
9 Trapezoidal -4340.195
1 Normal -4284.706
5 Scaled/shifted t- -4283.070
6 Logistic -4266.171
3 Generalized normal -4264.102
14 Laplace -4144.870
13 Cauchy -4099.405
2 Skewed-normal -4060.936
11 Generalized Trapezoidal -4032.484
10 Curvilinear Trapezoidal -3996.495
8 Triangular -3970.993
7 Uniform -3933.513
20 Arcsine -3793.793
18 3P Weibull -3783.041
21 von Mises -3715.034
17 Johnson SB -3711.034

Log-normal wins, which makes perfect sense after using an exponentiation function...

Have fun with the package. Comments welcome!
Cheers,
Andrej

### 15 Responses to Introducing ‘propagate’

1. Paul Prew says:

Is EXPR <- expression(a^b*x) in R the same as EXPR <- expression(a^(b*x))? I'm not familiar with the function 'expression'. Thanks for the useful package.

2. Katharine Mullen says:

This is a great package. Congrats on its publication, which significantly extends the framework for uncertainty quantification in R.

3. Karin Rivera says:

Hello,

I am new using this package, I am working in the uncertainty analysis of the biomass data obtained from a forest inventory (variables: DBH, height and wood density) and my formula for biomass is : 0.112*(wood.density*DBH^2*height)^0.916

Actually I used “propagate” as:

DAT <- inventory
EXPR <- function(DBH, height, wood.density)0.112*(wood.density*DBH^2*height)^0.916

res <- propagate(EXPR, DAT)

my results:

Results from error propagation:
Mean.1 Mean.2 sd.1 sd.2 2.5% 97.5%
13.82681 -188.77689 247.42702 390.59668 -952.64302 582.66181

Results from Monte Carlo simulation:
Mean sd Median MAD 2.5% 97.5%
39.89066744 69.42375773 13.90524871 19.39363104 0.03365433 232.22693809

and I obtained as the best distribution : 3P Weibull with an AIC of: -2874.265

I would like if someone can clear me the meaning of the results from error propagation??
and also if I need the standar error of the biomass, how can i get it?

Thanks and best regards

• anspiess says:

• Karin Rivera says:

Sorry????

4. Antoine Stevens says:

Hi,
First thank you for this very useful package that helped me already quite a lot. I am using it to compute uncertainties of a variable Y that is function of two variable X1 and X2, both of which being themeselves estimated from two linear regression models. I would like to have your advice on which values I am supposed to use for the estimates of the uncertainties (ie second row in the data.frame supplied to the argument ‘data’ in the ‘propagate’ function) in this case : (1) standard errors as returned by predict(lm.obj, se.fit=T)$se.fit ; (2) residual standard error of my linear models ; or (3) ?. Currently I use (1), which gives much lower uncertainties in Y than solution (2). Is it legitimate? • anspiess says: Hi Antoine, so X1 and X2 are actually two response values from two different linear models? And you are plugging them into a new equation to obtain Y? Then you should take the standard errors coming from predict(…)$se.fit. If the equation you are using to obtain Y also needs the the fit parameters of your linear models (slope1, intercept1, slope2, intercept2), then you should also take their parameter errors sqrt(diag(vcov(your_linear_model)).

Cheers,
Andrej

• Antoine Stevens says:

Thank for your answer. So yes, X1 and X2 are estimated from two different linear models. But Y can be simply derived by multiplying X1 and X2. Using predict(…)$se.fit yields very low uncertainties in Y, because for both linear models used to derive X1 and X2, I have a lots of data (and standard errors tend to zero when n is increasing). This is a bit counter-intuitive to me because the residual standard errors of the two models (sqrt(deviance(lm_fit)/df.residual(lm_fit))) are relatively high. Since predict(…)$se.fit gives the standard error of estimation of mu_x (used for confidence intervals), I was wondering whether I would rather need to compute the standard errors of prediction (used for prediction intervals and are un-compressible with n –> infinity) and propagate these values to compute the uncertainty in Y.

5. anspiess says:

Hi Antoine,

yes, of course, on second thought you are totally right, you have to use the “prediction standard deviation” for uncertainty propagation and not the error of the prediction mean.

Cheers,
Andrej

• Antoine Stevens says:

Perfect, thanks!

6. Karin says:

hi,
Could you help me telling me how can obtain s.e.m after runnig the package?

my results after running the package are:
Results from error propagation:
Mean.1 Mean.2 sd.1 sd.2 2.5% 97.5%
11.38983 -39.16052 217.16837 245.36785 -519.91888 443.65640

Results from Monte Carlo simulation:
37.65579274 70.53757552 11.62218452 16.43243847 0.01784693
97.5%
233.16307179

Karin

• anspiess says:

Hi Karin,

either something is wrong with your data, or you used the ‘propagate’ function wrong. Did you use raw data?
I’m asking because there is a very strong discrepancy between first-order mean, second-order mean and Monte Carlo mean.
Also the first-order s.d. and Monte Carlo s.d. are totally different! Does your data have extreme variance? Or did you supply raw data to ‘propagate’ without setting type = “raw”?

Cheers,
Andrej

7. Karin says:

Hi Andrej,

Thanks for your answer. Actually, I am working with a data from a forest inventory , (variables of each tree measurement: diameter, height, and wood density). in order to obtain the Biomass of the forest I use a general equation proposed by chave(2014):
0.0673*(wood.density*diameter^2*height)^0.976.

I would like to obtain the uncertainty equation with my data.
Errors in the three variables all result in error in estimating the biomass, This error is propagated to the AGB estimate through the general model.

I obtain the descriptive statistics I found a great variance.
Par example , for ecozone Selva Baja:
> summary(SB_30vd$correc_biomasa) Min. 1st Qu. Median Mean 3rd Qu. Max. 54.03 702.60 1041.00 1558.00 1750.00 21700.00 > sd(SB_30vd$correc_biomasa)
[1] 1617.079
> se se(SB_30vd\$correc_biomasa)
[1] 27.67981

> data_PRO DAT EXPR res res

Results from error propagation:
Mean.1 Mean.2 sd.1 sd.2 2.5% 97.5%
16639.76 -6833930.64 3045084.91 10251866.51 -26959008.91 13146021.03

Results from Monte Carlo simulation:
Mean sd Median MAD 2.5% 97.5%
46458.7920 85999.1315 14469.0721 20510.8366 23.0063 286044.0463

I am not sure if I am using the propagate package in the correct form. I would be appreciate it if you can give me some advices…

Thank yo very much

Karin

• anspiess says:

Hi Karin,

without your data and your function call I cannot be of real help. Also WordPress does some funny formatting of text.
I can helip you better if you mail me a.spiess@uke,de, maybe some example data and how you use the ‘propagate’ function. Anyway, your biomass function is highly nonlinear (two powers!), which inflates propagated error quite a lot!
Drop me a mail, that should work better…

Cheers,
Andrej