## wapply: A faster (but less functional) ‘rollapply’ for vector setups

April 23, 2013

For some cryptic reason I needed a function that evaluates a function on sliding windows of a vector. Googling around soon brought me to ‘rollapply’ from the ‘zoo’ package, which turned out to be a very versatile function when I tested it. However, I wanted to code my own version purely for vectors, in the hope that it might be somewhat faster.
This is what came out of it (wapply, for “window apply”):

wapply <- function(x, width, by = NULL, FUN = NULL, ...)
{
  FUN <- match.fun(FUN)
  if (is.null(by)) by <- width

  ## start indices of all windows and the corresponding index vectors
  lenX <- length(x)
  SEQ1 <- seq(1, lenX - width + 1, by = by)
  SEQ2 <- lapply(SEQ1, function(i) i:(i + width - 1))

  ## apply FUN to each window and simplify the result
  OUT <- lapply(SEQ2, function(a) FUN(x[a], ...))
  OUT <- simplify2array(OUT, higher = TRUE)
  return(OUT)
}
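
A quick sanity check on a toy vector (my own example, not from the original post): with width = 4 and by = 2 the windows are 1:4, 3:6, 5:8 and 7:10, so we expect their means.

> wapply(1:10, width = 4, by = 2, FUN = mean)
[1] 2.5 4.5 6.5 8.5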

It is much more restricted than ‘rollapply’ (no padding, no left/center/right adjustment, etc.).
But interestingly, for some setups it is much faster:

library(zoo)
x <- 1:200000
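
The summary function ‘fun’ used in the timings below is not defined in the snippet above; any scalar summary will do, so assume something like this (my assumption, not necessarily the function used in the original timings):

fun <- function(x) mean(x)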

large window, small slides:

> system.time(RES1 <- rollapply(x, width = 1000, by = 50, FUN = fun))
user      system     elapsed
3.71        0.00        3.84
> system.time(RES2 <- wapply(x, width = 1000, by = 50, FUN = fun))
user      system     elapsed
1.89        0.00        1.92
> all.equal(RES1, RES2)
[1] TRUE

small window, small slides:

> system.time(RES1 <- rollapply(x, width = 50, by = 50, FUN = fun))
user      system     elapsed
2.59        0.00        2.67
> system.time(RES2 <- wapply(x, width = 50, by = 50, FUN = fun))
user      system     elapsed
0.86        0.00        0.89
> all.equal(RES1, RES2)
[1] TRUE

small window, large slides:

> system.time(RES1 <- rollapply(x, width = 50, by = 1000, FUN = fun))
user      system     elapsed
1.68        0.00        1.77
> system.time(RES2 <- wapply(x, width = 50, by = 1000, FUN = fun))
user      system     elapsed
0.06        0.00        0.06
> all.equal(RES1, RES2)
[1] TRUE

There is about a 2-3 fold gain in speed for the first two setups, but roughly a 30-fold gain in the small window/large slides setup. Interesting…
I noticed that zoo:::rollapply.zoo uses mapply internally; maybe that adds some overhead for pure vector calculations…

Cheers,
Andrej

## bigcor: Large correlation matrices in R

February 22, 2013

As I work with large gene expression matrices (microarray data) in my job, it is sometimes important to look at the correlation in expression between different genes. It has been shown that by calculating the Pearson correlation between genes, one can identify genes (by high $\varrho$ values, i.e. > 0.9) that share a common regulation mechanism, such as being induced/repressed by the same transcription factors:

http://www.jbc.org/content/279/17/17905.long

I had an idea: how about using my microarray data of 40000 genes measured in 28 samples and calculating the correlation between all 40000 genes (variables)? For this blog entry we will use simulated data, but the message is the same. So let’s create a 28 x 40000 matrix (samples in rows, genes in columns) with random normal values and calculate all correlations:

MAT <- matrix(rnorm(40000 * 28), nrow = 28)
cor(MAT)

“Error in cor(MAT) : cannot allocate vector of length 1600000000”

Ok, R can’t allocate the correlation matrix. No wonder, actually:
Each number of class “double” occupies 8 bytes. For a 40000 x 40000 correlation matrix we need
(8 * 40000 * 40000)/1024/1024/1024 ≈ 11.9 GB (!!), which definitely is more than most people’s RAM.
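
The same arithmetic in R, also for the 4-byte ‘single’ storage that ‘ff’ will use below (there the matrix lives on disk rather than in RAM):

> 8 * 40000 * 40000 / 1024^3   # dense matrix of doubles
[1] 11.92093
> 4 * 40000 * 40000 / 1024^3   # 'ff' single precision (4 bytes per value)
[1] 5.960464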

I came across the ‘ff’ package, with which we can preallocate an empty 40000 x 40000 matrix (stored on disk, not in RAM):

library(ff)
corMAT <- ff(vmode = "single", dim = c(40000, 40000))

Nice. However, since cor can’t calculate a correlation matrix of this size in one go, what to do?
With a few sleepless nights and a little help from Stack Overflow (http://stackoverflow.com/questions/14911489/large-covariance-matrix-in-r), a new function was born: bigcor.

What does it do?
1) It splits a matrix with $N$ columns into equal-sized blocks of columns $A_1 \ldots A_n$, $B_1 \ldots B_n$, etc. The block size can be defined by the user; 1000-2000 columns per block is a good value because cor can handle this quite quickly.
2) For all combinations of blocks, the correlation matrix is calculated, i.e. A/A, A/B, B/B, etc.
3) The block-wise correlation matrices are transferred into the preallocated empty matrix at the appropriate positions (where the correlations would usually reside). To ensure symmetry around the diagonal, this is done twice: once in the upper and once in the lower triangle.

This way, the $N \times N$ empty correlation matrix is filled like a checkerboard with patches of correlation sub-matrices. In our case, where we split the $N$ = 40000 columns into 8 blocks of 5000 columns each, we obtain 36 block combinations (choose(8, 2) + 8 = 36; the + 8 because of A/A, B/B, etc.), each yielding a 5000 x 5000 sub-matrix. This gives 36 x 5000 x 5000 x 2 (filling both triangles) - 8 x 5000 x 5000 (because the diagonal blocks are written twice) = 1.6E9 = 40000 x 40000 entries.
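
A quick check of this bookkeeping in R:

> nblocks <- 8; size <- 40000/nblocks
> choose(nblocks, 2) + nblocks                             # number of block combinations
[1] 36
> (2 * (choose(nblocks, 2) + nblocks) - nblocks) * size^2  # entries written in total
[1] 1.6e+09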

The function:

bigcor <- function(x, nblocks = 10, verbose = TRUE, ...)
{
  library(ff, quietly = TRUE)
  NCOL <- ncol(x)

  ## test if ncol(x) %% nblocks gives remainder 0
  if (NCOL %% nblocks != 0) stop("Choose different 'nblocks' so that ncol(x) %% nblocks = 0!")

  ## preallocate square matrix of dimension
  ## ncol(x) in 'ff' single format
  corMAT <- ff(vmode = "single", dim = c(NCOL, NCOL))

  ## split column numbers into 'nblocks' groups
  SPLIT <- split(1:NCOL, rep(1:nblocks, each = NCOL/nblocks))

  ## create all unique combinations of blocks
  COMBS <- expand.grid(1:length(SPLIT), 1:length(SPLIT))
  COMBS <- t(apply(COMBS, 1, sort))
  COMBS <- unique(COMBS)

  ## iterate through each block combination, calculate correlation matrix
  ## between blocks and store them in the preallocated matrix on both
  ## symmetric sides of the diagonal
  for (i in 1:nrow(COMBS)) {
    COMB <- COMBS[i, ]
    G1 <- SPLIT[[COMB[1]]]
    G2 <- SPLIT[[COMB[2]]]
    if (verbose) cat("Block", COMB[1], "with Block", COMB[2], "\n")
    flush.console()
    COR <- cor(x[, G1], x[, G2], ...)   # correlate the two column blocks of the input 'x'
    corMAT[G1, G2] <- COR
    corMAT[G2, G1] <- t(COR)
    COR <- NULL
  }

  gc()
  return(corMAT)
}

It is advised to choose ‘nblocks’ so that the block size is at most 2000-5000 columns. ‘nblocks’ must also divide the number of columns without remainder, so with 10 columns either 2 or 5 blocks (10 %% 5 = 0, 10 %% 2 = 0); the function stops if this is not the case.
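
A small helper to list admissible ‘nblocks’ values for a given number of columns (a hypothetical convenience snippet, not part of bigcor). For 40000 columns it returns 8, 10, 16 and 20, corresponding to block sizes of 5000, 4000, 2500 and 2000:

NCOL <- 40000
cand <- which(NCOL %% 1:NCOL == 0)             # all divisors of NCOL
cand[NCOL/cand >= 2000 & NCOL/cand <= 5000]    # keep those with block size between 2000 and 5000
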
An example with a small matrix:

> MAT <- matrix(rnorm(8 * 10), nrow = 10)
> res <- bigcor(MAT, nblocks = 2)
Block 1 with Block 1
Block 1 with Block 2
Block 2 with Block 2
> res
ff (open) single length=64 (64) dim=c(8,8) dimorder=c(1,2)
[,1]         [,2]         [,3]         [,4]         [,5]         [,6]         [,7]         [,8]
[1,]  1.000000000 -0.046641227  0.324235857  0.262680113 -0.665063083 -0.004026082 -0.606948912 -0.192466438
[2,] -0.046641227  1.000000000  0.506007075  0.440847754  0.380443692 -0.748274207 -0.396612495  0.124028035
[3,]  0.324235857  0.506007075  1.000000000  0.472342789 -0.095279060 -0.391388297 -0.298102200  0.412145644
[4,]  0.262680113  0.440847754  0.472342789  1.000000000 -0.001338097 -0.515028238 -0.566384733  0.500462592
[5,] -0.665063083  0.380443692 -0.095279060 -0.001338097  1.000000000 -0.162814990  0.173953563  0.235193089
[6,] -0.004026082 -0.748274207 -0.391388297 -0.515028238 -0.162814990  1.000000000  0.564979374 -0.124034598
[7,] -0.606948912 -0.396612495 -0.298102200 -0.566384733  0.173953563  0.564979374  1.000000000  0.186116025
[8,] -0.192466438  0.124028035  0.412145644  0.500462592  0.235193089 -0.124034598  0.186116025  1.000000000

The result is the same as from the original 'cor' (small deviations arise from the 'single' precision storage):
> cor(MAT)
[,1]        [,2]        [,3]         [,4]         [,5]         [,6]       [,7]       [,8]
[1,]  1.000000000 -0.04664123  0.32423586  0.262680111 -0.665063068 -0.004026082 -0.6069489 -0.1924664
[2,] -0.046641225  1.00000000  0.50600710  0.440847754  0.380443686 -0.748274237 -0.3966125  0.1240280
[3,]  0.324235855  0.50600710  1.00000000  0.472342793 -0.095279062 -0.391388298 -0.2981022  0.4121456
[4,]  0.262680111  0.44084775  0.47234279  1.000000000 -0.001338097 -0.515028219 -0.5663847  0.5004626
[5,] -0.665063068  0.38044369 -0.09527906 -0.001338097  1.000000000 -0.162814989  0.1739536  0.2351931
[6,] -0.004026082 -0.74827424 -0.39138830 -0.515028219 -0.162814989  1.000000000  0.5649793 -0.1240346
[7,] -0.606948889 -0.39661248 -0.29810221 -0.566384734  0.173953557  0.564979348  1.0000000  0.1861160
[8,] -0.192466433  0.12402803  0.41214564  0.500462599  0.235193095 -0.124034600  0.1861160  1.0000000
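
For a small case like this, the ‘ff’ object can be pulled back into an ordinary in-memory matrix by subscripting it, which makes such comparisons easy (a sketch; the tolerance accounts for the ‘single’ precision storage):

res_mat <- res[1:8, 1:8]       # subscripting an 'ff' matrix returns a base R matrix
all.equal(res_mat, cor(MAT), check.attributes = FALSE, tolerance = 1e-6)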

Timings for a 20000 x 20000 correlation matrix (input data: 10 rows x 20000 columns):
> MAT <- matrix(rnorm(20000 * 10), nrow = 10)
> system.time(res <- bigcor(MAT, nblocks = 10))
Block 1 with Block 1
Block 1 with Block 2
Block 1 with Block 3
Block 1 with Block 4
Block 1 with Block 5
Block 1 with Block 6
Block 1 with Block 7
Block 1 with Block 8
Block 1 with Block 9
Block 1 with Block 10
Block 2 with Block 2
Block 2 with Block 3
Block 2 with Block 4
Block 2 with Block 5
Block 2 with Block 6
Block 2 with Block 7
Block 2 with Block 8
Block 2 with Block 9
Block 2 with Block 10
Block 3 with Block 3
Block 3 with Block 4
Block 3 with Block 5
Block 3 with Block 6
Block 3 with Block 7
Block 3 with Block 8
Block 3 with Block 9
Block 3 with Block 10
Block 4 with Block 4
Block 4 with Block 5
Block 4 with Block 6
Block 4 with Block 7
Block 4 with Block 8
Block 4 with Block 9
Block 4 with Block 10
Block 5 with Block 5
Block 5 with Block 6
Block 5 with Block 7
Block 5 with Block 8
Block 5 with Block 9
Block 5 with Block 10
Block 6 with Block 6
Block 6 with Block 7
Block 6 with Block 8
Block 6 with Block 9
Block 6 with Block 10
Block 7 with Block 7
Block 7 with Block 8
Block 7 with Block 9
Block 7 with Block 10
Block 8 with Block 8
Block 8 with Block 9
Block 8 with Block 10
Block 9 with Block 9
Block 9 with Block 10
Block 10 with Block 10
user      system     elapsed
38.98        7.54       50.28

Roughly a minute. I think that's o.k.
Comments welcome (improvements, speed, memory management, parallel execution...)

Cheers,
Andrej

## Peer-reviewed R packages?

November 23, 2012

Dear R-Users,

a question: I am the author of the ‘qpcR’ package. Within it, there is a function ‘propagate’ that does error propagation based on Monte Carlo simulation, permutation-based confidence intervals and Taylor expansion. For the latter I recently implemented a second-order Taylor expansion term that can correct for nonlinearity. The formulas are quite complex, and I had quite a hard time transforming the second-order term, which includes the covariance, into matrix form. The first-order expansion can be represented by GRAD * SIGMA * t(GRAD), with GRAD being the gradient matrix and SIGMA the covariance matrix. The second-order term is 0.5 * trace(HESS * SIGMA * HESS * SIGMA), with HESS being the Hessian matrix of partial second derivatives.
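
In LaTeX notation, restating the two terms above (with $\nabla f$ the gradient, $H$ the Hessian of second partial derivatives, and $\Sigma$ the covariance matrix), the propagated variance is approximately

$$\sigma_y^2 \approx \nabla f \, \Sigma \, \nabla f^{\mathsf{T}} + \tfrac{1}{2}\,\operatorname{tr}\!\left(H \Sigma H \Sigma\right)$$
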
So here is the problem: being a molecular biologist, my statistical skills have their limitations. I am pretty sure my implementation is right (I checked the examples in the GUM, the “Guide to the Expression of Uncertainty in Measurement”, and got the same results). BUT: being sure says nothing. It could well be that I’m on the wrong track and people use a function that gives essentially wrong (maybe only slightly wrong, which may still be fatal…) results because they rely on me doing things right.
Now let’s suppose I’m not the only one with an uneasy feeling when submitting packages to CRAN, hoping that everything I did is right.
Which brings me to the question: what are the merits of a peer-review system for R packages? I would suppose (if the paradigm is right that peer-reviewing DOES increase quality…) that if packages were independently peer-reviewed, in the sense that a handful of other people inspect the code and check for detrimental bugs that are enigmatic to the package’s author, this would significantly increase the credibility of a package’s functions.
Now I know that there is the “R Journal” or the “Journal of Statistical Software”, which regularly publish papers in which new R packages are explained in detail. However, these papers are largely restricted to the statistical background of the implementations, and I’m not sure whether the code itself is actually checked line by line.
An idea would be that such a review system is voluntary, but if a package has been reviewed and declared as “passed”, this would serve as a quality criterion. For scientists like me who need to write grant applications, this would give more weight to the work one has dedicated to developing a “good” package, which could then perhaps be cited as being “peer-reviewed”.

Pros:
* increase of quality
* maybe citable
* beneficial for grant applications or C.V.

Cons:
* reviewers must meticulously inspect the code
* longer time to CRAN submission
* no platform yet to publish the reviews in a citable, journal-like form