Rmazing

bigcor: Large correlation matrices in R

February 22, 2013

Here is the update to my bigcor function for creating large correlation/covariance matrices.
What does it do?
1) It splits a matrix with N columns into equal size blocks (+ a remainder block) of $A_1 ... A_n$ , $B_1 ... B_n$ , etc. Block size can be defined by user, 2000 is a good value because cor can handle this quite quickly. If the matrix has 13796 columns, the split will be 2000;2000;2000;2000;2000;2000;1796.
2) For all combinations of blocks, the correlation matrix is calculated, so $latex A_n/A_n, A_n/B_n, B_n/B_n etc.
3) The blockwise correlation matrices are transferred into the preallocated empty matrix at the appropriate position (where the correlations would usually reside). To ensure symmetry around the diagonal, this is done twice in the upper and lower triangle.

This way, the N x N empty correlation matrix is filled like a checkerboard with patches of n x n correlation sub-matrices. In our case, in which we split the N = 40000 columns into n = 8 blocks, we obtain 36 combinations (combn(1:8, 2) + 8; + 8 because of A/A, B/B etc) of blocks with dimension 5000 x 5000 each. This gives 36 x 5000 x 5000 x 2 (filling both triangles) – 8 x 5000 x 5000 (because the diagonals are created twice) = 1.6E9 = 40000 x 40000.

The function can found here: http://www.dr-spiess.de/scripts/bigcor.r
Timings for a 21796 x 21796 matrix roughly 2 min!

40 Comments | General | Tagged: big matrix, cor, correlation | Permalink
Posted by anspiess

The magic empty bracket

January 30, 2013

I have been working with R for some time now, but once in a while, basic functions catch my eye that I was not aware of…
For some project I wanted to transform a correlation matrix into a covariance matrix. Now, since cor2cov does not exist, I thought about “reversing” the cov2cor function (stats:::cov2cor).
Inside the code of this function, a specific line jumped into my retina:

r[] <- Is * V * rep(Is, each = p)

What’s this [ ]?

Well, it stands for every element $E_{ij}$ of matrix $E$ . Consider this:

mat <- matrix(NA, nrow = 5, ncol = 5)

> mat
     [,1] [,2] [,3] [,4] [,5]
[1,]   NA   NA   NA   NA   NA
[2,]   NA   NA   NA   NA   NA
[3,]   NA   NA   NA   NA   NA
[4,]   NA   NA   NA   NA   NA
[5,]   NA   NA   NA   NA   NA

With the empty bracket, we can now substitute ALL values by a new value:

mat[] <- 1

> mat
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    1    1    1
[2,]    1    1    1    1    1
[3,]    1    1    1    1    1
[4,]    1    1    1    1    1
[5,]    1    1    1    1    1

Interestingly, this also works with lists:

L <- list(a = 1, b = 2, c = 3)

>L

$a

[1] 1
$b

[1] 2
$c

[1] 3
L[] <- 5

> L
$a
[1] 5

$b
[1] 5

$c
[1] 5

Cheers,
Andrej

13 Comments | R Internals | Tagged: all elements, empty bracket, matrix | Permalink
Posted by anspiess

Peer-reviewed R packages?

November 23, 2012

Dear R-Users,

a question: I am the author of the ‘qpcR’ package. Within this, there is a function ‘propagate’ that does error propagation based on Monte Carlo Simulation, permutation-based confidence intervals and Taylor expansion. For the latter I recently implemented a second-order Taylor expansion term that can correct for nonlinearity. The formulas are quite complex and I had quite a hard time to transform the second-order term that has covariance included into a matrix form. The first order expansion can be represented by GRAD * SIGMA * t(GRAD) with GRAD being the gradient matrix and SIGMA the covariance matrix. The second order term is 0.5 * trace(HESS * SIGMA * HESS * SIGMA) with HESS being the Hessian matrix of partial second derivatives.
So here the problem: Being a molecular biologist, my statistical skills have limitations. I am pretty sure my implementation is right (I checked the examples in GUM “Guide to the evaluation of uncertainties in experiments” and got the same results). BUT: Being sure says nothing. It could well be that I’m on the wrong track and people use a function that gives essentially wrong (maybe slightly wrong which may still be fatal…) results because they rely on me doing things right.
Now let’s suppose I’m not the only one with a queer feeling when submitting packages to CRAN in the hope that everything I did is right.
Which brings me to the question: What are the merits of a peer-review system for R packages? I would suppose (if the paradigm is right that peer-reviewing DOES increase quality…) that if packages were independently peer-reviewed in a way that a handful of other people inspect the code and check for detrimental bugs that are enigmatic to the packages author, then this would significantly increase the credibility of the packages functions.
Now I know that there is the “R Journal” or the “Journal of Statistical Software” that regularly publish papers where new R packages are explained in detail. However, they are restricted to the statistical background of the implementations and I’m not sure if the lines of code are actually checked.
An idea would be that such a review system is voluntary, but if a package has been reviewed and declared as “passed”, then this would be some sort of quality criterion. For scientists like me that need to write grant applications, this would give more impact on the work one has dedicated into developing a “good” package that could maybe be cited as being “peer-reviewed”.

Pros:
* increase of quality
* maybe citable
* beneficial for grant applications or C.V.

Cons:
* reviewers must meticulously inspect the code
* longer time to CRAN submission
* no platform yet to publish the reviews which can be cited like a journal

Comments welcome!

Cheers,
Andrej

8 Comments | General | Tagged: packages, peer-review | Permalink
Posted by anspiess

A weighting function for ‘nls’ / ‘nlsLM’

July 19, 2012

Standard nonlinear regression assumes homoscedastic data, that is, all response values $y_i$ are distributed normally. In case of heteroscedastic data (i.e. when the variance is dependent on the magnitude of the data), weighting the fit is essential. In nls (or nlsLM of the minpack.lm package), weighting can be conducted by two different methods:

1) by supplying a vector of weighting values $w_i$ for every $y_i$ that is supplied to the ‘weights’ argument.

2) by modifying the objective function in a way that it has the weights included in the body of the function. An example is given for this one in the documentation to nls.

1) is easy of course, but all $w_i$ values have to be derived from somewhere, for example the reciproce of the response values $w_i = 1/y_i$ . 2) is a bit more cumbersome and one does not always want to define a new weighted version of the objective function.

I found it about time to design a weighting function that can be supplied to the ‘weights’ argument and that makes weighting by different regimes a breeze. wfct() returns a vector of weights that are calculated from a user-defined expression and transfers this vector within nls.

wfct <- function(expr)
{
expr <- deparse(substitute(expr))

## create new environment
newEnv <- new.env()

## get call
mc <- sys.calls()[[1]]
mcL <- as.list(mc)

## get data and write to newEnv
DATA <- mcL[["data"]]
DATA <- eval(DATA)
DATA <- as.list(DATA)
NAMES <- names(DATA)
for (i in 1:length(DATA)) assign(NAMES[i], DATA[[i]], envir = newEnv)

## get parameter, response and predictor names
formula <- as.formula(mcL[[2]])
VARS <- all.vars(formula)
RESP <- VARS[1]
RHS <- VARS[-1]
PRED <- match(RHS, names(DATA))
PRED <- names(DATA)[na.omit(PRED)]

## calculate variances for response values if "error" is in expression
## and write to newEnv
if (length(grep("error", expr)) > 0) {
y <- DATA[[RESP]]
x <- DATA[[PRED]]
## test for replication
if (!any(duplicated(x))) stop("No replicates available to calculate error from!")
## calculate error
error <- tapply(y, x, function(e) var(e, na.rm = TRUE))
error <- as.numeric(sqrt(error))
## convert to original repititions
error <- rep(error, as.numeric(table(x)))
assign("error", error, envir = newEnv)
}

## calculate fitted or residual values if "fitted"/"resid" is in expression
## and write to newEnv
if (length(grep("fitted", expr)) > 0 || length(grep("resid", expr)) > 0) {
mc2 <- mc
mc2$weights <- NULL
MODEL <- eval(mc2)
fitted <- fitted(MODEL)
resid <- residuals(MODEL)
assign("fitted", fitted, newEnv)
assign("resid", resid, newEnv)
}

## return evaluation in newEnv: vector of weights
OUT <- eval(parse(text = expr), envir = newEnv)
return(OUT)
}

The weighting function can take 5 different variable definitions and combinations thereof:
the name of the predictor (independent) variable,
the name of the response (dependent) variable,
error: if replicates $y_{ij}$ exist, the error $\sigma(y_{ij})$ ,
fitted: the fitted values $\hat{y}_i$ of the model,
resid: the residuals $y_i - \hat{y}_i$ of the model,
For the last two, the model is fit unweighted, fitted values and residuals are extracted and the model is refit by the defined weights.

Let’s have some examples (taken from the nls doc):

Treated <- Puromycin[Puromycin$state == "treated", ]

Weighting by inverse of response $1/y_i$ :

nls(rate ~ Vm * conc/(K + conc), data = Treated, 
    start = c(Vm = 200, K = 0.05), weights = wfct(1/rate))
Nonlinear regression model
  model:  rate ~ Vm * conc/(K + conc) 
   data:  Treated 
       Vm         K 
209.59669   0.06065 
 weighted residual sum-of-squares: 12.27

Number of iterations to convergence: 5 
Achieved convergence tolerance: 5.2e-06

Weighting by square root of predictor $\sqrt{x_i}$ :

nls(rate ~ Vm * conc/(K + conc), data = Treated, 
    start = c(Vm = 200, K = 0.05), weights = wfct(sqrt(conc)))
Nonlinear regression model
  model:  rate ~ Vm * conc/(K + conc) 
   data:  Treated 
       Vm         K 
216.74134   0.07106 
 weighted residual sum-of-squares: 332.6

Number of iterations to convergence: 5 
Achieved convergence tolerance: 2.403e-06

Weighting by inverse square of fitted values $1/\hat{y_i}^2$ :

nls(rate ~ Vm * conc/(K + conc), data = Treated, 
    start = c(Vm = 200, K = 0.05), weights = wfct(1/fitted^2))
Nonlinear regression model
  model:  rate ~ Vm * conc/(K + conc) 
   data:  Treated 
       Vm         K 
200.84098   0.04943 
 weighted residual sum-of-squares: 0.2272

Number of iterations to convergence: 4 
Achieved convergence tolerance: 1.47e-06

A common weighting regime is to use the inverse variance $1/\sigma^2(y_i)$ :

6 Comments | R Internals | Permalink
Posted by anspiess

A better ‘nls’ (?)

July 5, 2012

Those that do a lot of nonlinear regression will love the nls function of R. In most of the cases it works really well, but there are some mishaps that can occur when using bad starting values for the parameters. One of the most dreaded is the “singular gradient matrix at initial parameter estimates” which brings the function to a stop because the gradient check in stats:::nlsModel will terminate if the QR decomposition is not of full column rank.

Nearly all nonlinear fitting programs out there use the Levenberg-Marquardt algorithm for nonlinear regression. This is because its switching between Gauss-Newton and gradient descent is highly robust against far-off-optimal starting values. Unfortunately, the standard nls function has no LM implemented, instead it houses the Gauss-Newton type, the PORT routines and a partial linear fitter. The fabulous minpack.lm package from Katherine M. Mullen offers an R frontend to a Fortran LM implementation of the MINPACK package. The nls.lm function must be supplied with an objective function that returns a vector of residuals to be minimized. Together with Kate I developed a function nlsLM that has the gee wizz formulaic interface of nls, but it calls LM instead of Gauss-Newton. This has some advantages as we will see below. The function returns the usual result of class ‘nls’, and due to some modifications, all standard generics work. The modifications were made so that the formula is transformed into a function that returns a vector of (weighted) residuals whose sum square is minimized by nls.lm. The optimized parameters are then transferred to stats:::nlsModel in order to obtain an object of class ‘nlsModel’. The internal C functions C_nls_iter and stats:::nls_port_fit were removed to avoid subsequent “Gauss-Newton”, “port” or “plinear” optimization of nlsModel.

So, what’s similar and what’s, well, better…

library(minpack.lm)
### Examples from 'nls' doc ###
DNase1 <- subset(DNase, Run == 1)
fm1DNase1 <- nlsLM(density ~ Asym/(1 + exp((xmid - log(conc))/scal)), 
data = DNase1, start = list(Asym = 3, xmid = 0, scal = 1))
## all generics are applicable
coef(fm1DNase1)
confint(fm1DNase1)
deviance(fm1DNase1)
df.residual(fm1DNase1)
fitted(fm1DNase1)
formula(fm1DNase1)
logLik(fm1DNase1)
predict(fm1DNase1)
print(fm1DNase1)
profile(fm1DNase1)
residuals(fm1DNase1)
summary(fm1DNase1)
update(fm1DNase1)
vcov(fm1DNase1)
weights(fm1DNase1)

nlsLM can fit zero noise data, when nls fails:
x <- 1:10
y <- 2*x + 3                           
nls(y ~ a + b * x, start = list(a = 0.12345, b = 0.54321))
nlsLM(y ~ a + b * x, start = list(a = 0.12345, b = 0.54321))

Nonlinear regression model

  model:  y ~ a + b * x

   data:  parent.frame()

a b

3 2

 residual sum-of-squares: 0
Number of iterations to convergence: 3

Achieved convergence tolerance: 1.49e-08 
nlsLM often converges when nls gives the dreaded "singular gradient" error.

Example taken from here:
x <- 0:140
y <- 200 / (1 + exp(17 - x)/2) * exp(-0.02*x)
yeps <- y + rnorm(length(y), sd = 2)
nls(yeps ~ p1 / (1 + exp(p2 - x)) * exp(p4 * x), start=list(p1=410,p2=18,p4=-.03))

Optimal starting parameters work:

Nonlinear regression model

  model:  yeps ~ p1/(1 + exp(p2 - x)) * exp(p4 * x)

   data:  parent.frame()

       p1        p2        p4

200.67285  16.32430  -0.02005

 residual sum-of-squares: 569
Number of iterations to convergence: 6

Achieved convergence tolerance: 2.611e-07 
But off-optimal parameters give error:
nls(yeps ~ p1 / (1 + exp(p2 - x)) * exp(p4 * x), start=list(p1=10,p2=18,p4=-.03))

Fehler in nls(yeps ~ p1/(1 + exp(p2 - x)) * exp(p4 * x), start = list(p1 = 10,  :

  singulärer Gradient
nlsLM is more robust with these starting parameters:
nlsLM(yeps ~ p1 / (1 + exp(p2 - x)) * exp(p4 * x), start=list(p1=10,p2=18,p4=-.03))

Nonlinear regression model

  model:  yeps ~ p1/(1 + exp(p2 - x)) * exp(p4 * x)

   data:  parent.frame()

       p1        p2        p4

200.67285  16.32430  -0.02005

 residual sum-of-squares: 569
Number of iterations to convergence: 10

Achieved convergence tolerance: 1.49e-08 
Have fun!

Cheers, Andrej

16 Comments | R Internals | Permalink
Posted by anspiess

Don’t recycle me!

June 19, 2012

For me, one of the most annoying features of R is that by default, rbind, cbind and data.frame recycle the shorter vector to the length of the longer vector. I still don’t understand why the standard generics don’t have a parameter like cbind(1:10, 1:5, fill = TRUE) to fill up with ‘NA’s. There may be a deeper reason…

> cbind(1:10, 1:5)
      [,1] [,2]
 [1,]    1    1
 [2,]    2    2
 [3,]    3    3
 [4,]    4    4
 [5,]    5    5
 [6,]    6    1
 [7,]    7    2
 [8,]    8    3
 [9,]    9    4
[10,]   10    5
> rbind(1:10, 1:5)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    2    3    4    5    6    7    8    9    10
[2,]    1    2    3    4    5    1    2    3    4     5

> data.frame(x = 1:10, y = 1:5)
    x y
1   1 1
2   2 2
3   3 3
4   4 4
5   5 5
6   6 1
7   7 2
8   8 3
9   9 4
10 10 5

Even more, using unequal size matrices gives an error:

> mat1 <- matrix(rnorm(10), nrow = 5)
> mat2 <- matrix(rnorm(10), nrow = 2)
> cbind(mat1, mat2)
Fehler in cbind(mat1, mat2) : 
  Anzahl der Zeilen der Matrizen muss übereinstimmen (siehe Argument 2)

(You can learn some german here. Well, I suppose you know the error…)

Enter cbind.na, rbind.na, data.frame.na !

I modified the three generic functions so that all arguments are filled with ‘NA’s up to maximal occuring length (vector), rows (cbind) or columns (rbind). One must ‘source’ all three functions as cbind.na calls rbind.na etc:

cbind.na <- function (..., deparse.level = 1)
{
 na <- nargs() - (!missing(deparse.level))
 deparse.level <- as.integer(deparse.level)
 stopifnot(0 <= deparse.level, deparse.level <= 2)
 argl <- list(...)
 while (na > 0 && is.null(argl[[na]])) {
 argl <- argl[-na]
 na <- na - 1
 }
 if (na == 0)
 return(NULL)
 if (na == 1) {
 if (isS4(..1))
 return(cbind2(..1))
 else return(matrix(...)) ##.Internal(cbind(deparse.level, ...)))
 }
 if (deparse.level) {
 symarg <- as.list(sys.call()[-1L])[1L:na]
 Nms <- function(i) {
 if (is.null(r <- names(symarg[i])) || r == "") {
 if (is.symbol(r <- symarg[[i]]) || deparse.level ==
 2)
 deparse(r)
 }
 else r
 }
 }
 ## deactivated, otherwise no fill in with two arguments
 if (na == 0) {
 r <- argl[[2]]
 fix.na <- FALSE
 }
 else {
 nrs <- unname(lapply(argl, nrow))
 iV <- sapply(nrs, is.null)
 fix.na <- identical(nrs[(na - 1):na], list(NULL, NULL))
 ## deactivated, otherwise data will be recycled
 #if (fix.na) {
 # nr <- max(if (all(iV)) sapply(argl, length) else unlist(nrs[!iV]))
 # argl[[na]] <- cbind(rep(argl[[na]], length.out = nr),
 # deparse.level = 0)
 #}
 if (deparse.level) {
 if (fix.na)
 fix.na <- !is.null(Nna <- Nms(na))
 if (!is.null(nmi <- names(argl)))
 iV <- iV & (nmi == "")
 ii <- if (fix.na)
 2:(na - 1)
 else 2:na
 if (any(iV[ii])) {
 for (i in ii[iV[ii]]) if (!is.null(nmi <- Nms(i)))
 names(argl)[i] <- nmi
 }
 }

 ## filling with NA's to maximum occuring nrows
 nRow <- as.numeric(sapply(argl, function(x) NROW(x)))
 maxRow <- max(nRow, na.rm = TRUE)
 argl <- lapply(argl, function(x) if (is.null(nrow(x))) c(x, rep(NA, maxRow - length(x)))
 else rbind.na(x, matrix(, maxRow - nrow(x), ncol(x))))
 r <- do.call(cbind, c(argl[-1L], list(deparse.level = deparse.level)))
 }
 d2 <- dim(r)
 r <- cbind2(argl[[1]], r)
 if (deparse.level == 0)
 return(r)
 ism1 <- !is.null(d1 <- dim(..1)) && length(d1) == 2L
 ism2 <- !is.null(d2) && length(d2) == 2L && !fix.na
 if (ism1 && ism2)
 return(r)
 Ncol <- function(x) {
 d <- dim(x)
 if (length(d) == 2L)
 d[2L]
 else as.integer(length(x) > 0L)
 }
 nn1 <- !is.null(N1 <- if ((l1 <- Ncol(..1)) && !ism1) Nms(1))
 nn2 <- !is.null(N2 <- if (na == 2 && Ncol(..2) && !ism2) Nms(2))
 if (nn1 || nn2 || fix.na) {
 if (is.null(colnames(r)))
 colnames(r) <- rep.int("", ncol(r))
 setN <- function(i, nams) colnames(r)[i] <<- if (is.null(nams))
 ""
 else nams
 if (nn1)
 setN(1, N1)
 if (nn2)
 setN(1 + l1, N2)
 if (fix.na)
 setN(ncol(r), Nna)
 }
 r
}

rbind.na <- function (..., deparse.level = 1)
{
    na <- nargs() - (!missing(deparse.level))
    deparse.level <- as.integer(deparse.level)
    stopifnot(0 <= deparse.level, deparse.level <= 2)
    argl <- list(...)
    while (na > 0 && is.null(argl[[na]])) {
        argl <- argl[-na]
        na <- na - 1
    }
    if (na == 0)
        return(NULL)
    if (na == 1) {
        if (isS4(..1))
            return(rbind2(..1))
        else return(matrix(..., nrow = 1)) ##.Internal(rbind(deparse.level, ...)))
    }
    if (deparse.level) {
        symarg <- as.list(sys.call()[-1L])[1L:na]
        Nms <- function(i) {
            if (is.null(r <- names(symarg[i])) || r == "") {
                if (is.symbol(r <- symarg[[i]]) || deparse.level ==
                  2)
                  deparse(r)
            }
            else r
        }
    }

    ## deactivated, otherwise no fill in with two arguments
    if (na == 0) {
        r <- argl[[2]]
        fix.na <- FALSE
    }
    else {
        nrs <- unname(lapply(argl, ncol))
        iV <- sapply(nrs, is.null)
        fix.na <- identical(nrs[(na - 1):na], list(NULL, NULL))
        ## deactivated, otherwise data will be recycled
        #if (fix.na) {
        #    nr <- max(if (all(iV)) sapply(argl, length) else unlist(nrs[!iV]))
        #    argl[[na]] <- rbind(rep(argl[[na]], length.out = nr),
        #        deparse.level = 0)
        #}
        if (deparse.level) {
            if (fix.na)
                fix.na <- !is.null(Nna <- Nms(na))
            if (!is.null(nmi <- names(argl)))
                iV <- iV & (nmi == "")
            ii <- if (fix.na)
                2:(na - 1)
            else 2:na
            if (any(iV[ii])) {
                for (i in ii[iV[ii]]) if (!is.null(nmi <- Nms(i)))
                  names(argl)[i] <- nmi
            }
        }

        ## filling with NA's to maximum occuring ncols
        nCol <- as.numeric(sapply(argl, function(x) if (is.null(ncol(x))) length(x)
                                                    else ncol(x)))
        maxCol <- max(nCol, na.rm = TRUE)
        argl <- lapply(argl, function(x)  if (is.null(ncol(x))) c(x, rep(NA, maxCol - length(x)))
                                          else cbind(x, matrix(, nrow(x), maxCol - ncol(x))))

        ## create a common name vector from the
        ## column names of all 'argl' items
        namesVEC <- rep(NA, maxCol)
        for (i in 1:length(argl)) {
          CN <- colnames(argl[[i]])
          m <- !(CN %in% namesVEC)
          namesVEC[m] <- CN[m]
        }

        ## make all column names from common 'namesVEC'
        for (j in 1:length(argl)) {
          if (!is.null(ncol(argl[[j]]))) colnames(argl[[j]]) <- namesVEC
        }

        r <- do.call(rbind, c(argl[-1L], list(deparse.level = deparse.level)))
    }

    d2 <- dim(r)

    ## make all column names from common 'namesVEC'
    colnames(r) <- colnames(argl[[1]])

    r <- rbind2(argl[[1]], r)

    if (deparse.level == 0)
        return(r)
    ism1 <- !is.null(d1 <- dim(..1)) && length(d1) == 2L
    ism2 <- !is.null(d2) && length(d2) == 2L && !fix.na
    if (ism1 && ism2)
        return(r)
    Nrow <- function(x) {
        d <- dim(x)
        if (length(d) == 2L)
            d[1L]
        else as.integer(length(x) > 0L)
    }
    nn1 <- !is.null(N1 <- if ((l1 <- Nrow(..1)) && !ism1) Nms(1))
    nn2 <- !is.null(N2 <- if (na == 2 && Nrow(..2) && !ism2) Nms(2))
    if (nn1 || nn2 || fix.na) {
        if (is.null(rownames(r)))
            rownames(r) <- rep.int("", nrow(r))
        setN <- function(i, nams) rownames(r)[i] <<- if (is.null(nams))
            ""
        else nams
        if (nn1)
            setN(1, N1)
        if (nn2)
            setN(1 + l1, N2)
        if (fix.na)
            setN(nrow(r), Nna)
    }
    r
}

data.frame.na <- function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,
    stringsAsFactors = FALSE)
{
    data.row.names <- if (check.rows && is.null(row.names))
        function(current, new, i) {
            if (is.character(current))
                new <- as.character(new)
            if (is.character(new))
                current <- as.character(current)
            if (anyDuplicated(new))
                return(current)
            if (is.null(current))
                return(new)
            if (all(current == new) || all(current == ""))
                return(new)
            stop(gettextf("mismatch of row names in arguments of 'data.frame', item %d",
                i), domain = NA)
        }
    else function(current, new, i) {
        if (is.null(current)) {
            if (anyDuplicated(new)) {
                warning("some row.names duplicated: ", paste(which(duplicated(new)),
                  collapse = ","), " --> row.names NOT used")
                current
            }
            else new
        }
        else current
    }
    object <- as.list(substitute(list(...)))[-1L]
    mrn <- is.null(row.names)
    x <- list(...)
    n <- length(x)
    if (n < 1L) {
        if (!mrn) {
            if (is.object(row.names) || !is.integer(row.names))
                row.names <- as.character(row.names)
            if (any(is.na(row.names)))
                stop("row names contain missing values")
            if (anyDuplicated(row.names))
                stop("duplicate row.names: ", paste(unique(row.names[duplicated(row.names)]),
                  collapse = ", "))
        }
        else row.names <- integer(0L)
        return(structure(list(), names = character(0L), row.names = row.names,
            class = "data.frame"))
    }
    vnames <- names(x)
    if (length(vnames) != n)
        vnames <- character(n)
    no.vn <- !nzchar(vnames)
    vlist <- vnames <- as.list(vnames)
    nrows <- ncols <- integer(n)
    for (i in seq_len(n)) {
        xi <- if (is.character(x[[i]]) || is.list(x[[i]]))
            as.data.frame(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors)
        else as.data.frame(x[[i]], optional = TRUE)
        nrows[i] <- .row_names_info(xi)
        ncols[i] <- length(xi)
        namesi <- names(xi)
        if (ncols[i] > 1L) {
            if (length(namesi) == 0L)
                namesi <- seq_len(ncols[i])
            if (no.vn[i])
                vnames[[i]] <- namesi
            else vnames[[i]] <- paste(vnames[[i]], namesi, sep = ".")
        }
        else {
            if (length(namesi))
                vnames[[i]] <- namesi
            else if (no.vn[[i]]) {
                tmpname <- deparse(object[[i]])[1L]
                if (substr(tmpname, 1L, 2L) == "I(") {
                  ntmpn <- nchar(tmpname, "c")
                  if (substr(tmpname, ntmpn, ntmpn) == ")")
                    tmpname <- substr(tmpname, 3L, ntmpn - 1L)
                }
                vnames[[i]] <- tmpname
            }
        }
        if (missing(row.names) && nrows[i] > 0L) {
            rowsi <- attr(xi, "row.names")
            nc <- nchar(rowsi, allowNA = FALSE)
            nc <- nc[!is.na(nc)]
            if (length(nc) && any(nc))
                row.names <- data.row.names(row.names, rowsi,
                  i)
        }
        nrows[i] <- abs(nrows[i])
        vlist[[i]] <- xi
    }
    nr <- max(nrows)
    for (i in seq_len(n)[nrows < nr]) {
        xi <- vlist[[i]]
        if (nrows[i] > 0L) {
            xi <- unclass(xi)
            fixed <- TRUE
            for (j in seq_along(xi)) {
                ### added NA fill to max length/nrow
                xi1 <- xi[[j]]
                if (is.vector(xi1) || is.factor(xi1))
                  xi[[j]] <- c(xi1, rep(NA, nr - nrows[i]))
                else if (is.character(xi1) && class(xi1) == "AsIs")
                  xi[[j]] <- structure(c(xi1, rep(NA, nr - nrows[i])),
                    class = class(xi1))
                else if (inherits(xi1, "Date") || inherits(xi1,
                  "POSIXct"))
                  xi[[j]] <- c(xi1, rep(NA, nr - nrows[i]))
                else {
                  fixed <- FALSE
                  break
                }
            }
            if (fixed) {
                vlist[[i]] <- xi
                next
            }
        }
        stop("arguments imply differing number of rows: ", paste(unique(nrows),
            collapse = ", "))
    }
    value <- unlist(vlist, recursive = FALSE, use.names = FALSE)
    vnames <- unlist(vnames[ncols > 0L])
    noname <- !nzchar(vnames)
    if (any(noname))
        vnames[noname] <- paste("Var", seq_along(vnames), sep = ".")[noname]
    if (check.names)
        vnames <- make.names(vnames, unique = TRUE)
    names(value) <- vnames
    if (!mrn) {
        if (length(row.names) == 1L && nr != 1L) {
            if (is.character(row.names))
                row.names <- match(row.names, vnames, 0L)
            if (length(row.names) != 1L || row.names < 1L ||
                row.names > length(vnames))
                stop("row.names should specify one of the variables")
            i <- row.names
            row.names <- value[[i]]
            value <- value[-i]
        }
        else if (!is.null(row.names) && length(row.names) !=
            nr)
            stop("row names supplied are of the wrong length")
    }
    else if (!is.null(row.names) && length(row.names) != nr) {
        warning("row names were found from a short variable and have been discarded")
        row.names <- NULL
    }
    if (is.null(row.names))
        row.names <- .set_row_names(nr)
    else {
        if (is.object(row.names) || !is.integer(row.names))
            row.names <- as.character(row.names)
        if (any(is.na(row.names)))
            stop("row names contain missing values")
        if (anyDuplicated(row.names))
            stop("duplicate row.names: ", paste(unique(row.names[duplicated(row.names)]),
                collapse = ", "))
    }
    attr(value, "row.names") <- row.names
    attr(value, "class") <- "data.frame"
    value
}

So, let’s start!

> cbind.na(1:8, 1:7, 1:5, 1:10)
      [,1] [,2] [,3] [,4]
 [1,]    1    1    1    1
 [2,]    2    2    2    2
 [3,]    3    3    3    3
 [4,]    4    4    4    4
 [5,]    5    5    5    5
 [6,]    6    6   NA    6
 [7,]    7    7   NA    7
 [8,]    8   NA   NA    8
 [9,]   NA   NA   NA    9
[10,]   NA   NA   NA   10
> rbind.na(1:8, 1:7, 1:5, 1:10)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    2    3    4    5    6    7    8   NA    NA
[2,]    1    2    3    4    5    6    7   NA   NA    NA
[3,]    1    2    3    4    5   NA   NA   NA   NA    NA
[4,]    1    2    3    4    5    6    7    8    9    10
> data.frame.na(A = 1:7, B = 1:5, C = letters[1:3], 
+               D = factor(c(1, 1, 2, 2)))
  A  B    C  D
1 1  1    a  1
2 2  2    b  1
3 3  3    c  2
4 4  4 <NA>  2
5 5  5 <NA> NA
6 6 NA <NA> NA
7 7 NA <NA> NA

System time overhead for NA-filling is not too much:
> testList <- vector("list", length = 10000)
> for (i in 1:10000) testList[[i]] <- sample(1:100)
> system.time(res1 <- do.call(cbind, testList))
   user  system elapsed 
   0.01    0.00    0.02 

> testList <- vector("list", length = 10000)
> for (i in 1:10000) testList[[i]] <- sample(1:100, sample(1:100))
> system.time(res2 <- do.call(cbind.na, testList))
   user  system elapsed 
   0.23    0.00    0.24

Have fun!

12 Comments | R Internals | Permalink
Posted by anspiess

Rmazing

bigcor: Large correlation matrices in R

The magic empty bracket

Peer-reviewed R packages?

A weighting function for ‘nls’ / ‘nlsLM’

A better ‘nls’ (?)

Don’t recycle me!

Enter cbind.na, rbind.na, data.frame.na !

Recent Posts

Archives

Categories

Meta