Here is the update to my bigcor function for creating large correlation/covariance matrices.
What does it do?
1) It splits a matrix with N columns into equal size blocks (+ a remainder block) of , , etc. Block size can be defined by user, 2000 is a good value because cor can handle this quite quickly. If the matrix has 13796 columns, the split will be 2000;2000;2000;2000;2000;2000;1796.
2) For all combinations of blocks, the correlation matrix is calculated, so $latex A_n/A_n, A_n/B_n, B_n/B_n etc.
3) The blockwise correlation matrices are transferred into the preallocated empty matrix at the appropriate position (where the correlations would usually reside). To ensure symmetry around the diagonal, this is done twice in the upper and lower triangle.

This way, the N x N empty correlation matrix is filled like a checkerboard with patches of n x n correlation sub-matrices. In our case, in which we split the N = 40000 columns into n = 8 blocks, we obtain 36 combinations (combn(1:8, 2) + 8; + 8 because of A/A, B/B etc) of blocks with dimension 5000 x 5000 each. This gives 36 x 5000 x 5000 x 2 (filling both triangles) – 8 x 5000 x 5000 (because the diagonals are created twice) = 1.6E9 = 40000 x 40000.

This entry was posted on Friday, February 22nd, 2013 at 3:33 pm and is filed under General. You can follow any responses to this entry through the RSS 2.0 feed.
You can leave a response, or trackback from your own site.

21 Responses to bigcor: Large correlation matrices in R

This might sound a bit like nitpicking, but each element in a numeric vector is 64 bits or 8 bytes in size, i.e. your vector would require 11,92 GiB, which is still over the RAM size of a typical computer those days.

Of course – object.size gives the memory usage of the whole vector object, i.e. the size of the numeric data plus the size of any metadata and/or control structures, associated with the object (collectively called overhead). With huge vectors the overhead becomes negligible and their storage size is almost equal to the size of the pure numeric data.

I’m just playing with this and trying to write out the data into a csv or other file for another system to leverage, what is the easiest way to do that. I’ve been trying write.csv.ffdf but that isn’t working. Gives the error, Error: inherits(x, “ffdf”) is not TRUE

Got it! This works.
res <- as.ffdf(res)
write.csv.ffdf(res, "c:\\temp\\test.csv")

For a 10000 x 10000 correlation matrix timings are:
res 35s
write.csv.ffdf(res, “c:\\temp\\test.csv”) => 17 min!!
resulting in a 1.8 GB csv-file.
If there is a quicker way to export, please let me know!

Please change ‘MAT’ to ‘x’ within the for-loop as your function will fail if MAT is not specified globally beforehand.

If you know that the optimal blocksize is ~2k, why not having an ‘automatic’ calculation for this parameter within the function depending on the input matrix. This would allow you as well to deal with an inequally sized final block. Or am I mistaken here?

@David {couldn’t comment at your blog directly}
Same issue as above. Additionally it would have been great to mention two further points within your post/function:
– require(multicore) at the function head
– doMC/multicore packages are *not* available for Windows

Hence, a Platform independent version (loading the appropriate packeges) would be excellent.

Nevertheless, thanks to both of you again. Happy science!

Thanks for pointing me to the bugs! I will make modifications to ‘x’, incorporate an automatic 2k size calculation and then give a link to the new version on my blog. Havn’t got a clue however in respect to platform-independent parallelization though… Can anyone give me hint?

You need then to set up a different backend for the ad hoc cluster. On my machine this seems to work (using the framework of your function):
library(doSNOW, quietly = TRUE)
stopifnot(length(ncore)==1 && !is.na(as.integer(ncore)))
cl <- makeCluster(spec=rep("localhost",ncore), type = "SOCK")
registerDoSNOW(cl)
foreach(i = 1:nrow(x.subs.comb)) %dopar% {
…
}
stopCluster(cl)

However, I couldn't get David's function to work. Not sure if it is a Unix/Windows problem. The point is that I can not assign to the pre-defined output matrix (corMAT) within the %dopar% loop, which makes sense to me, as 2 parallel processes may try to write into the same object at the same time.

Quick and dirty solution is to store the resulting list from
foreach () %*dopar% {}
and assign it's values in a second (standard) for-loop (I can paste the full code if it is unclear what I mean).

Anyway, I expect that parallelization will unfortunately increase the demand of memory, a problem you tried to solve in the first place. So I wouldn't bother too much with it but just enjoy the memory save you already achieved.

I’m trying this out in par with a 4901×340,000 matrix. A regular cor takes about 3 hours on my node, and a little under 2 with crossprod. I’d love to get this under an hour.

Hmm, have you sped up my function with ‘crossprod’? Please tell me how!
You might increase speed if you increase the size of the submatrices, but there seems to be some magic border.
What is your setup at the moment?

In my case, I wanted a correlation matrix of of genetic relatedness between 4901 people. So 4901 people genotyped at 340k markers.

I just modified you and David’s replacing cor with crossprod. Which gives me a NxN matrix of people’s relatedness. Dividing row and then columns by the diagonal gets me the same as cor().

G <- crossprod(G)
D <- diag(G)
C <- sweep(G,2,D,"/")
C <- sweep(C,1,D,"/")

[…] values (think arrayfun, matfun, etc.) This didn't work. I then learned about a function called bigcor in R from a colleague, and exported my data to try that out. Even my data was too big for it, and I […]

I’m writing up my results of a significantly smaller but still quite large matrix (75×75). I’m finding I’m using a lot of words but if I try to print the matrix at a table it takes up 8 pages. A journal is going to charge me a lot for that. I’ve done cluster analysis too which I will present as a graphic but that doesn’t show which variables are negatively or positively correlated, just the ones that have the strongest relationship with. does anyone know of a nifty way to summarise a correlation matrix?

This might sound a bit like nitpicking, but each element in a numeric vector is 64 bits or 8 bytes in size, i.e. your vector would require 11,92 GiB, which is still over the RAM size of a typical computer those days.

Thanks for the notice, my fault.

But can you explain this:

> object.size(5.25)

32 bytes

> object.size(c(5.25, 5.25))

40 bytes

> object.size(c(5.25, 5.25, 5.25))

56 bytes

> object.size(c(5.25, 5.25, 5.25, 5.25))

56 bytes

> object.size(c(5.25, 5.25, 5.25, 5.25, 5.25))

72 bytes

Of course – object.size gives the memory usage of the whole vector object, i.e. the size of the numeric data plus the size of any metadata and/or control structures, associated with the object (collectively called overhead). With huge vectors the overhead becomes negligible and their storage size is almost equal to the size of the pure numeric data.

Nice! I implemented a parallel version to speed up things a bit. http://brainchronicle.blogspot.com/2013/02/large-correlation-in-parallel.html

Wow, awesome! Thanks David!

Cheers,

Andrej

I’m just playing with this and trying to write out the data into a csv or other file for another system to leverage, what is the easiest way to do that. I’ve been trying write.csv.ffdf but that isn’t working. Gives the error, Error: inherits(x, “ffdf”) is not TRUE

Hmm, of course, export of the large matrix is essential…

Give me one or two days to figure it out!

Cheers,

Andrej

Got it! This works.

res <- as.ffdf(res)

write.csv.ffdf(res, "c:\\temp\\test.csv")

For a 10000 x 10000 correlation matrix timings are:

res 35s

write.csv.ffdf(res, “c:\\temp\\test.csv”) => 17 min!!

resulting in a 1.8 GB csv-file.

If there is a quicker way to export, please let me know!

Cheers,

Andrej

@Andrej

Thanks for sharing this function.

Please change ‘MAT’ to ‘x’ within the for-loop as your function will fail if MAT is not specified globally beforehand.

If you know that the optimal blocksize is ~2k, why not having an ‘automatic’ calculation for this parameter within the function depending on the input matrix. This would allow you as well to deal with an inequally sized final block. Or am I mistaken here?

@David {couldn’t comment at your blog directly}

Same issue as above. Additionally it would have been great to mention two further points within your post/function:

– require(multicore) at the function head

– doMC/multicore packages are *not* available for Windows

Hence, a Platform independent version (loading the appropriate packeges) would be excellent.

Nevertheless, thanks to both of you again. Happy science!

Thanks for pointing me to the bugs! I will make modifications to ‘x’, incorporate an automatic 2k size calculation and then give a link to the new version on my blog. Havn’t got a clue however in respect to platform-independent parallelization though… Can anyone give me hint?

Cheers,

Andrej

You can query the Platform using:

.Platform

You need then to set up a different backend for the ad hoc cluster. On my machine this seems to work (using the framework of your function):

library(doSNOW, quietly = TRUE)

stopifnot(length(ncore)==1 && !is.na(as.integer(ncore)))

cl <- makeCluster(spec=rep("localhost",ncore), type = "SOCK")

registerDoSNOW(cl)

foreach(i = 1:nrow(x.subs.comb)) %dopar% {

…

}

stopCluster(cl)

However, I couldn't get David's function to work. Not sure if it is a Unix/Windows problem. The point is that I can not assign to the pre-defined output matrix (corMAT) within the %dopar% loop, which makes sense to me, as 2 parallel processes may try to write into the same object at the same time.

Quick and dirty solution is to store the resulting list from

foreach () %*dopar% {}

and assign it's values in a second (standard) for-loop (I can paste the full code if it is unclear what I mean).

Anyway, I expect that parallelization will unfortunately increase the demand of memory, a problem you tried to solve in the first place. So I wouldn't bother too much with it but just enjoy the memory save you already achieved.

I’m trying this out in par with a 4901×340,000 matrix. A regular cor takes about 3 hours on my node, and a little under 2 with crossprod. I’d love to get this under an hour.

Hmm, have you sped up my function with ‘crossprod’? Please tell me how!

You might increase speed if you increase the size of the submatrices, but there seems to be some magic border.

What is your setup at the moment?

Cheers,

Andrej

In my case, I wanted a correlation matrix of of genetic relatedness between 4901 people. So 4901 people genotyped at 340k markers.

I just modified you and David’s replacing cor with crossprod. Which gives me a NxN matrix of people’s relatedness. Dividing row and then columns by the diagonal gets me the same as cor().

G <- crossprod(G)

D <- diag(G)

C <- sweep(G,2,D,"/")

C <- sweep(C,1,D,"/")

Oh, and I split dopar into 88 blocks, so 4928/88=56 people with 340K markers.

Now, on 16 cores, it finishes the full matrix in under and hour and uses about 30GB of my 64GB of ram.

Indeed cross product have fascinating property. I use RcppEigen to compute it. It can take pretty huge matrices. The function is here http://stackoverflow.com/questions/16938508/speeding-up-computation-of-dice-coefficient-in-c-rcpp/16943857#16943857

I managed to compute the cross product on a 20,000×10,000 matrix in under 3 min compared to 15+ min with cross prod()

Might it be worthwhile to publish your complete code using Rcpp on your blog?

I could then crossreference to the final function from Rmazing…

Cheers,

Andrej

I like your crossprod function with Rcpp, do you know how to make it work for something like:

MAT <- matrix(rnorm(20000 * 10), nrow = 10)

?

[…] values (think arrayfun, matfun, etc.) This didn't work. I then learned about a function called bigcor in R from a colleague, and exported my data to try that out. Even my data was too big for it, and I […]

I’m writing up my results of a significantly smaller but still quite large matrix (75×75). I’m finding I’m using a lot of words but if I try to print the matrix at a table it takes up 8 pages. A journal is going to charge me a lot for that. I’ve done cluster analysis too which I will present as a graphic but that doesn’t show which variables are negatively or positively correlated, just the ones that have the strongest relationship with. does anyone know of a nifty way to summarise a correlation matrix?

I think you need a colored representation of your correlations (heatmap). You could use “image”, “heatmap” or “levelplot” (lattice).

Cheers,

Andrej