As I am working with large gene expression matrices (microarray data) in my job, it is sometimes important to look at the correlation in gene expression of different genes. It has been shown that by calculating the Pearson correlation between genes, one can identify (by high values, i.e. > 0.9) genes that share a common regulation mechanism such as being induced/repressed by the same transcription factors:
http://www.jbc.org/content/279/17/17905.long
I had an idea. How about using my microarray data of gene expression of 40000 genes in 28 samples and calculate the correlation between all 40000 genes (variables). For this blog entry, we will use simulated data but the message is the same. So let’s create a 40000 x 28 matrix with random normal values and calculate all correlations:
MAT <- matrix(rnorm(40000 * 28), nrow = 28) cor(MAT)
“Fehler in cor(MAT) : kann Vektor der Länge 1600000000 nicht allozieren”
Ok, R can’t allocate the correlation matrix. No wonder actually:
Each number of class “double” will hold 32 bytes. For a 40000 x 40000 correlation matrix we have
(32 * 40000 * 40000)/1024/1024/1024 = 47.68 GB (!!) which definitely is more than most people’s RAM.
I came across the ‘ff’ package and here we can preallocate an empty 40000 x 40000 matrix:
corMAT <- ff(vmode = "single", dim = c(40000, 40000))
Nice. However, since cor doesn’t calculate a correlation matrix this size, what to do?
Answer: All-combinations-blockwise correlation (Hu?).
With a few sleepless nights and a little help (http://stackoverflow.com/questions/14911489/large-covariance-matrix-in-r) a new function was born: bigcor.
What does it do?
1) It splits a matrix with N columns into equal size blocks of ,
, etc. Block size can be defined by user, 1000-2000 is a good value because cor can handle this quite quickly.
2) For all combinations of blocks, the correlation matrix is calculated, so A/A, A/B, B/B etc.
3) The blockwise correlation matrices are transferred into the preallocated empty matrix at the appropriate position (where the correlations would usually reside). To ensure symmetry around the diagonal, this is done twice in the upper and lower triangle.
This way, the N x N empty correlation matrix is filled like a checkerboard with patches of n x n correlation sub-matrices. In our case, in which we split the N = 40000 columns into n = 8 blocks, we obtain 36 combinations (combn(1:8, 2) + 8; + 8 because of A/A, B/B etc) of blocks with dimension 5000 x 5000 each. This gives 36 x 5000 x 5000 x 2 (filling both triangles) – 8 x 5000 x 5000 (because the diagonals are created twice) = 1.6E9 = 40000 x 40000.
The function:
bigcor <- function(x, nblocks = 10, verbose = TRUE, ...)
{
library(ff, quietly = TRUE)
NCOL <- ncol(x)
## test if ncol(x) %% nblocks gives remainder 0
if (NCOL %% nblocks != 0) stop("Choose different 'nblocks' so that ncol(x) %% nblocks = 0!")
## preallocate square matrix of dimension
## ncol(x) in 'ff' single format
corMAT <- ff(vmode = "single", dim = c(NCOL, NCOL))
## split column numbers into 'nblocks' groups
SPLIT <- split(1:NCOL, rep(1:nblocks, each = NCOL/nblocks))
## create all unique combinations of blocks
COMBS <- expand.grid(1:length(SPLIT), 1:length(SPLIT))
COMBS <- t(apply(COMBS, 1, sort))
COMBS <- unique(COMBS)
## iterate through each block combination, calculate correlation matrix
## between blocks and store them in the preallocated matrix on both
## symmetric sides of the diagonal
for (i in 1:nrow(COMBS)) {
COMB <- COMBS[i, ]
G1 <- SPLIT[[COMB[1]]]
G2 <- SPLIT[[COMB[2]]]
if (verbose) cat("Block", COMB[1], "with Block", COMB[2], "\n")
flush.console()
COR <- cor(MAT[, G1], MAT[, G2], ...)
corMAT[G1, G2] <- COR
corMAT[G2, G1] <- t(COR)
COR <- NULL
}
gc()
return(corMAT)
}
It is advised to choose ‘nblocks’ so that blocksize is max. 2000 – 5000. It should also be so that number of colums divided by ‘nblocks’ gives a remainder of 0, so with 10 colums, either 2 or 5 (10 %% 5 = 0, 10 %% 2 = 0), but the function stops if this is not the case.
An example with a small matrix:
> MAT <- matrix(rnorm(8 * 10), nrow = 10)
> res <- bigcor(MAT, nblocks = 2)
Block 1 with Block 1
Block 1 with Block 2
Block 2 with Block 2
> res
ff (open) single length=64 (64) dim=c(8,8) dimorder=c(1,2)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1.000000000 -0.046641227 0.324235857 0.262680113 -0.665063083 -0.004026082 -0.606948912 -0.192466438
[2,] -0.046641227 1.000000000 0.506007075 0.440847754 0.380443692 -0.748274207 -0.396612495 0.124028035
[3,] 0.324235857 0.506007075 1.000000000 0.472342789 -0.095279060 -0.391388297 -0.298102200 0.412145644
[4,] 0.262680113 0.440847754 0.472342789 1.000000000 -0.001338097 -0.515028238 -0.566384733 0.500462592
[5,] -0.665063083 0.380443692 -0.095279060 -0.001338097 1.000000000 -0.162814990 0.173953563 0.235193089
[6,] -0.004026082 -0.748274207 -0.391388297 -0.515028238 -0.162814990 1.000000000 0.564979374 -0.124034598
[7,] -0.606948912 -0.396612495 -0.298102200 -0.566384733 0.173953563 0.564979374 1.000000000 0.186116025
[8,] -0.192466438 0.124028035 0.412145644 0.500462592 0.235193089 -0.124034598 0.186116025 1.000000000
Similar to original 'cor':
> cor(MAT)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1.000000000 -0.04664123 0.32423586 0.262680111 -0.665063068 -0.004026082 -0.6069489 -0.1924664
[2,] -0.046641225 1.00000000 0.50600710 0.440847754 0.380443686 -0.748274237 -0.3966125 0.1240280
[3,] 0.324235855 0.50600710 1.00000000 0.472342793 -0.095279062 -0.391388298 -0.2981022 0.4121456
[4,] 0.262680111 0.44084775 0.47234279 1.000000000 -0.001338097 -0.515028219 -0.5663847 0.5004626
[5,] -0.665063068 0.38044369 -0.09527906 -0.001338097 1.000000000 -0.162814989 0.1739536 0.2351931
[6,] -0.004026082 -0.74827424 -0.39138830 -0.515028219 -0.162814989 1.000000000 0.5649793 -0.1240346
[7,] -0.606948889 -0.39661248 -0.29810221 -0.566384734 0.173953557 0.564979348 1.0000000 0.1861160
[8,] -0.192466433 0.12402803 0.41214564 0.500462599 0.235193095 -0.124034600 0.1861160 1.0000000
Timings for a 20000 x 20000 matrix:
> MAT <- matrix(rnorm(20000 * 10), nrow = 10)
> system.time(res <- bigcor(MAT, nblocks = 10))
Block 1 with Block 1
Block 1 with Block 2
Block 1 with Block 3
Block 1 with Block 4
Block 1 with Block 5
Block 1 with Block 6
Block 1 with Block 7
Block 1 with Block 8
Block 1 with Block 9
Block 1 with Block 10
Block 2 with Block 2
Block 2 with Block 3
Block 2 with Block 4
Block 2 with Block 5
Block 2 with Block 6
Block 2 with Block 7
Block 2 with Block 8
Block 2 with Block 9
Block 2 with Block 10
Block 3 with Block 3
Block 3 with Block 4
Block 3 with Block 5
Block 3 with Block 6
Block 3 with Block 7
Block 3 with Block 8
Block 3 with Block 9
Block 3 with Block 10
Block 4 with Block 4
Block 4 with Block 5
Block 4 with Block 6
Block 4 with Block 7
Block 4 with Block 8
Block 4 with Block 9
Block 4 with Block 10
Block 5 with Block 5
Block 5 with Block 6
Block 5 with Block 7
Block 5 with Block 8
Block 5 with Block 9
Block 5 with Block 10
Block 6 with Block 6
Block 6 with Block 7
Block 6 with Block 8
Block 6 with Block 9
Block 6 with Block 10
Block 7 with Block 7
Block 7 with Block 8
Block 7 with Block 9
Block 7 with Block 10
Block 8 with Block 8
Block 8 with Block 9
Block 8 with Block 10
Block 9 with Block 9
Block 9 with Block 10
Block 10 with Block 10
User System verstrichen
38.98 7.54 50.28
Roughly a minute. I think that's o.k.
Comments welcome (improvements, speed, memory management,
parallel execution...)
Cheers,
Andrej
This might sound a bit like nitpicking, but each element in a numeric vector is 64 bits or 8 bytes in size, i.e. your vector would require 11,92 GiB, which is still over the RAM size of a typical computer those days.
Thanks for the notice, my fault.
But can you explain this:
> object.size(5.25)
32 bytes
> object.size(c(5.25, 5.25))
40 bytes
> object.size(c(5.25, 5.25, 5.25))
56 bytes
> object.size(c(5.25, 5.25, 5.25, 5.25))
56 bytes
> object.size(c(5.25, 5.25, 5.25, 5.25, 5.25))
72 bytes
Of course – object.size gives the memory usage of the whole vector object, i.e. the size of the numeric data plus the size of any metadata and/or control structures, associated with the object (collectively called overhead). With huge vectors the overhead becomes negligible and their storage size is almost equal to the size of the pure numeric data.
Nice! I implemented a parallel version to speed up things a bit. http://brainchronicle.blogspot.com/2013/02/large-correlation-in-parallel.html
Wow, awesome! Thanks David!
Cheers,
Andrej
I’m just playing with this and trying to write out the data into a csv or other file for another system to leverage, what is the easiest way to do that. I’ve been trying write.csv.ffdf but that isn’t working. Gives the error, Error: inherits(x, “ffdf”) is not TRUE
Hmm, of course, export of the large matrix is essential…
Give me one or two days to figure it out!
Cheers,
Andrej
Got it! This works.
res <- as.ffdf(res)
write.csv.ffdf(res, "c:\\temp\\test.csv")
For a 10000 x 10000 correlation matrix timings are:
res 35s
write.csv.ffdf(res, “c:\\temp\\test.csv”) => 17 min!!
resulting in a 1.8 GB csv-file.
If there is a quicker way to export, please let me know!
Cheers,
Andrej
@Andrej
Thanks for sharing this function.
Please change ‘MAT’ to ‘x’ within the for-loop as your function will fail if MAT is not specified globally beforehand.
If you know that the optimal blocksize is ~2k, why not having an ‘automatic’ calculation for this parameter within the function depending on the input matrix. This would allow you as well to deal with an inequally sized final block. Or am I mistaken here?
@David {couldn’t comment at your blog directly}
Same issue as above. Additionally it would have been great to mention two further points within your post/function:
- require(multicore) at the function head
- doMC/multicore packages are *not* available for Windows
Hence, a Platform independent version (loading the appropriate packeges) would be excellent.
Nevertheless, thanks to both of you again. Happy science!
Thanks for pointing me to the bugs! I will make modifications to ‘x’, incorporate an automatic 2k size calculation and then give a link to the new version on my blog. Havn’t got a clue however in respect to platform-independent parallelization though… Can anyone give me hint?
Cheers,
Andrej
You can query the Platform using:
.Platform
You need then to set up a different backend for the ad hoc cluster. On my machine this seems to work (using the framework of your function):
library(doSNOW, quietly = TRUE)
stopifnot(length(ncore)==1 && !is.na(as.integer(ncore)))
cl <- makeCluster(spec=rep("localhost",ncore), type = "SOCK")
registerDoSNOW(cl)
foreach(i = 1:nrow(x.subs.comb)) %dopar% {
…
}
stopCluster(cl)
However, I couldn't get David's function to work. Not sure if it is a Unix/Windows problem. The point is that I can not assign to the pre-defined output matrix (corMAT) within the %dopar% loop, which makes sense to me, as 2 parallel processes may try to write into the same object at the same time.
Quick and dirty solution is to store the resulting list from
foreach () %*dopar% {}
and assign it's values in a second (standard) for-loop (I can paste the full code if it is unclear what I mean).
Anyway, I expect that parallelization will unfortunately increase the demand of memory, a problem you tried to solve in the first place. So I wouldn't bother too much with it but just enjoy the memory save you already achieved.