Here is the update to my bigcor function for creating large correlation/covariance matrices.
What does it do?
1) It splits a matrix with N columns into equal size blocks (+ a remainder block) of , , etc. Block size can be defined by user, 2000 is a good value because cor can handle this quite quickly. If the matrix has 13796 columns, the split will be 2000;2000;2000;2000;2000;2000;1796.
2) For all combinations of blocks, the correlation matrix is calculated, so $latex A_n/A_n, A_n/B_n, B_n/B_n etc.
3) The blockwise correlation matrices are transferred into the preallocated empty matrix at the appropriate position (where the correlations would usually reside). To ensure symmetry around the diagonal, this is done twice in the upper and lower triangle.

This way, the N x N empty correlation matrix is filled like a checkerboard with patches of n x n correlation sub-matrices. In our case, in which we split the N = 40000 columns into n = 8 blocks, we obtain 36 combinations (combn(1:8, 2) + 8; + 8 because of A/A, B/B etc) of blocks with dimension 5000 x 5000 each. This gives 36 x 5000 x 5000 x 2 (filling both triangles) – 8 x 5000 x 5000 (because the diagonals are created twice) = 1.6E9 = 40000 x 40000.

This entry was posted on Friday, February 22nd, 2013 at 3:33 pm and is filed under General. You can follow any responses to this entry through the RSS 2.0 feed.
You can leave a response, or trackback from your own site.

31 Responses to bigcor: Large correlation matrices in R

This might sound a bit like nitpicking, but each element in a numeric vector is 64 bits or 8 bytes in size, i.e. your vector would require 11,92 GiB, which is still over the RAM size of a typical computer those days.

Of course – object.size gives the memory usage of the whole vector object, i.e. the size of the numeric data plus the size of any metadata and/or control structures, associated with the object (collectively called overhead). With huge vectors the overhead becomes negligible and their storage size is almost equal to the size of the pure numeric data.

I’m just playing with this and trying to write out the data into a csv or other file for another system to leverage, what is the easiest way to do that. I’ve been trying write.csv.ffdf but that isn’t working. Gives the error, Error: inherits(x, “ffdf”) is not TRUE

Got it! This works.
res <- as.ffdf(res)
write.csv.ffdf(res, "c:\\temp\\test.csv")

For a 10000 x 10000 correlation matrix timings are:
res 35s
write.csv.ffdf(res, “c:\\temp\\test.csv”) => 17 min!!
resulting in a 1.8 GB csv-file.
If there is a quicker way to export, please let me know!

Please change ‘MAT’ to ‘x’ within the for-loop as your function will fail if MAT is not specified globally beforehand.

If you know that the optimal blocksize is ~2k, why not having an ‘automatic’ calculation for this parameter within the function depending on the input matrix. This would allow you as well to deal with an inequally sized final block. Or am I mistaken here?

@David {couldn’t comment at your blog directly}
Same issue as above. Additionally it would have been great to mention two further points within your post/function:
– require(multicore) at the function head
– doMC/multicore packages are *not* available for Windows

Hence, a Platform independent version (loading the appropriate packeges) would be excellent.

Nevertheless, thanks to both of you again. Happy science!

Thanks for pointing me to the bugs! I will make modifications to ‘x’, incorporate an automatic 2k size calculation and then give a link to the new version on my blog. Havn’t got a clue however in respect to platform-independent parallelization though… Can anyone give me hint?

You need then to set up a different backend for the ad hoc cluster. On my machine this seems to work (using the framework of your function):
library(doSNOW, quietly = TRUE)
stopifnot(length(ncore)==1 && !is.na(as.integer(ncore)))
cl <- makeCluster(spec=rep("localhost",ncore), type = "SOCK")
registerDoSNOW(cl)
foreach(i = 1:nrow(x.subs.comb)) %dopar% {
…
}
stopCluster(cl)

However, I couldn't get David's function to work. Not sure if it is a Unix/Windows problem. The point is that I can not assign to the pre-defined output matrix (corMAT) within the %dopar% loop, which makes sense to me, as 2 parallel processes may try to write into the same object at the same time.

Quick and dirty solution is to store the resulting list from
foreach () %*dopar% {}
and assign it's values in a second (standard) for-loop (I can paste the full code if it is unclear what I mean).

Anyway, I expect that parallelization will unfortunately increase the demand of memory, a problem you tried to solve in the first place. So I wouldn't bother too much with it but just enjoy the memory save you already achieved.

I’m trying this out in par with a 4901×340,000 matrix. A regular cor takes about 3 hours on my node, and a little under 2 with crossprod. I’d love to get this under an hour.

Hmm, have you sped up my function with ‘crossprod’? Please tell me how!
You might increase speed if you increase the size of the submatrices, but there seems to be some magic border.
What is your setup at the moment?

In my case, I wanted a correlation matrix of of genetic relatedness between 4901 people. So 4901 people genotyped at 340k markers.

I just modified you and David’s replacing cor with crossprod. Which gives me a NxN matrix of people’s relatedness. Dividing row and then columns by the diagonal gets me the same as cor().

G <- crossprod(G)
D <- diag(G)
C <- sweep(G,2,D,"/")
C <- sweep(C,1,D,"/")

[…] values (think arrayfun, matfun, etc.) This didn't work. I then learned about a function called bigcor in R from a colleague, and exported my data to try that out. Even my data was too big for it, and I […]

I’m writing up my results of a significantly smaller but still quite large matrix (75×75). I’m finding I’m using a lot of words but if I try to print the matrix at a table it takes up 8 pages. A journal is going to charge me a lot for that. I’ve done cluster analysis too which I will present as a graphic but that doesn’t show which variables are negatively or positively correlated, just the ones that have the strongest relationship with. does anyone know of a nifty way to summarise a correlation matrix?

I would be very happy if you could add the possibility from the original cor() function where it is possible to calculate correlations not only between all columns of x, but also between x and y (where y is a vector or matrix).

Well, there are situations where you have milions of columns in x and just want each of their correlation with a vector y. To cbind x and y in this case would lead to unnecessary computations and actually an unfeasible problem. So, yes, I insist!

Thanks for creating this package!
I installed R and loaded “propagate” after I found your website through Google.

I loaded my datafile (n=2000; 8602 variables) and got bigcor to run fine and quickly. However only a few correlations are being displayed in the console; not the 8602 x 8602 matrix I was looking for.

I used: bigcor(mydata, fun = “cor” , size=2000 , verbose= TRUE)

Is there a way to display the whole correlation matrix in the console? Or to export it in csv (or any other format)?

It might be because i do not understand what ff is doing?

to look at it in the console, you have to convert it to a matrix:
COR <- bigcor(mydata, fun = “cor” , size=2000 , verbose= TRUE)
COR <- COR[1:nrow(COR), 1:ncol(COR)]

However, if your data is too big to view in R, you can convert it to ‘ffdf’ (ff dataframe) and export it to .csv:
COR <- bigcor(mydata, fun = “cor” , size=2000 , verbose= TRUE)
COR <- as.ffdf(COR)
write.table(COR, file = "C:\\temp\\test.csv")

Hi Andrej,
Thanks so much for your answer. It worked!

In case there are more beginners in R; the script I finally used is:
COR <- bigcor(mydata, fun="cor", size=2000, verbose= TRUE)
COR <- as.ffdf(COR)
write.table(COR, file = "C:/myexportfile.csv", sep = ",", qmethod = "double")

Hi, I would respectfully ask you something related with bigcor (and ff apparently). Maybe I am wrong, but with a small test matrix all is Ok with bigcor, but with my real matrix I am receiving this warning:

Loading required package: minpack.lm
Error in if (length .Machine$integer.max) stop(“length must be between 1 and .Machine$integer.max”) :
missing value where TRUE/FALSE needed
Calls: bigcor -> ff
In addition: Warning message:
In ff(vmode = “double”, dim = c(NCOL, NCOL)) :
NAs introduced by coercion to integer range
Execution halted

What does it means? I am wrong with something or is a kind of ff constraint?

> dim(dat)
[1] 70736 72

I am sure that the matrix does not have missed values. So I am really lost, do you understand the warning? Any clue could be great.

Happy New Year!
Is “dat” your complete matrix? If so, you can use the R function “cor” without any problems, because it only has 72 columns:

res <- cor(dat)

If your real matrix is bigger, then try the "size" parameter smaller than the number of columns, e.g. if you have 2000 columns, use "size = 1000". Otherwise I don't really know, is class(dat) = "matrix"?

Thanks a lot for your reply, I did´t received any alert by email. Just now I saw your reply by chance (maybe I forgot set the notification).
I was wrong, my suitable data set is the transposed matrix
dim(dat)
[1] 72 70736

but anyway the problem was related with Rstudio and some permissions problems in the server apparently. Now is working. Thanks a lot !!!!!

This might sound a bit like nitpicking, but each element in a numeric vector is 64 bits or 8 bytes in size, i.e. your vector would require 11,92 GiB, which is still over the RAM size of a typical computer those days.

Thanks for the notice, my fault.

But can you explain this:

> object.size(5.25)

32 bytes

> object.size(c(5.25, 5.25))

40 bytes

> object.size(c(5.25, 5.25, 5.25))

56 bytes

> object.size(c(5.25, 5.25, 5.25, 5.25))

56 bytes

> object.size(c(5.25, 5.25, 5.25, 5.25, 5.25))

72 bytes

Of course – object.size gives the memory usage of the whole vector object, i.e. the size of the numeric data plus the size of any metadata and/or control structures, associated with the object (collectively called overhead). With huge vectors the overhead becomes negligible and their storage size is almost equal to the size of the pure numeric data.

Nice! I implemented a parallel version to speed up things a bit. http://brainchronicle.blogspot.com/2013/02/large-correlation-in-parallel.html

Wow, awesome! Thanks David!

Cheers,

Andrej

I’m just playing with this and trying to write out the data into a csv or other file for another system to leverage, what is the easiest way to do that. I’ve been trying write.csv.ffdf but that isn’t working. Gives the error, Error: inherits(x, “ffdf”) is not TRUE

Hmm, of course, export of the large matrix is essential…

Give me one or two days to figure it out!

Cheers,

Andrej

Got it! This works.

res <- as.ffdf(res)

write.csv.ffdf(res, "c:\\temp\\test.csv")

For a 10000 x 10000 correlation matrix timings are:

res 35s

write.csv.ffdf(res, “c:\\temp\\test.csv”) => 17 min!!

resulting in a 1.8 GB csv-file.

If there is a quicker way to export, please let me know!

Cheers,

Andrej

@Andrej

Thanks for sharing this function.

Please change ‘MAT’ to ‘x’ within the for-loop as your function will fail if MAT is not specified globally beforehand.

If you know that the optimal blocksize is ~2k, why not having an ‘automatic’ calculation for this parameter within the function depending on the input matrix. This would allow you as well to deal with an inequally sized final block. Or am I mistaken here?

@David {couldn’t comment at your blog directly}

Same issue as above. Additionally it would have been great to mention two further points within your post/function:

– require(multicore) at the function head

– doMC/multicore packages are *not* available for Windows

Hence, a Platform independent version (loading the appropriate packeges) would be excellent.

Nevertheless, thanks to both of you again. Happy science!

Thanks for pointing me to the bugs! I will make modifications to ‘x’, incorporate an automatic 2k size calculation and then give a link to the new version on my blog. Havn’t got a clue however in respect to platform-independent parallelization though… Can anyone give me hint?

Cheers,

Andrej

You can query the Platform using:

.Platform

You need then to set up a different backend for the ad hoc cluster. On my machine this seems to work (using the framework of your function):

library(doSNOW, quietly = TRUE)

stopifnot(length(ncore)==1 && !is.na(as.integer(ncore)))

cl <- makeCluster(spec=rep("localhost",ncore), type = "SOCK")

registerDoSNOW(cl)

foreach(i = 1:nrow(x.subs.comb)) %dopar% {

…

}

stopCluster(cl)

However, I couldn't get David's function to work. Not sure if it is a Unix/Windows problem. The point is that I can not assign to the pre-defined output matrix (corMAT) within the %dopar% loop, which makes sense to me, as 2 parallel processes may try to write into the same object at the same time.

Quick and dirty solution is to store the resulting list from

foreach () %*dopar% {}

and assign it's values in a second (standard) for-loop (I can paste the full code if it is unclear what I mean).

Anyway, I expect that parallelization will unfortunately increase the demand of memory, a problem you tried to solve in the first place. So I wouldn't bother too much with it but just enjoy the memory save you already achieved.

I’m trying this out in par with a 4901×340,000 matrix. A regular cor takes about 3 hours on my node, and a little under 2 with crossprod. I’d love to get this under an hour.

Hmm, have you sped up my function with ‘crossprod’? Please tell me how!

You might increase speed if you increase the size of the submatrices, but there seems to be some magic border.

What is your setup at the moment?

Cheers,

Andrej

In my case, I wanted a correlation matrix of of genetic relatedness between 4901 people. So 4901 people genotyped at 340k markers.

I just modified you and David’s replacing cor with crossprod. Which gives me a NxN matrix of people’s relatedness. Dividing row and then columns by the diagonal gets me the same as cor().

G <- crossprod(G)

D <- diag(G)

C <- sweep(G,2,D,"/")

C <- sweep(C,1,D,"/")

Oh, and I split dopar into 88 blocks, so 4928/88=56 people with 340K markers.

Now, on 16 cores, it finishes the full matrix in under and hour and uses about 30GB of my 64GB of ram.

Indeed cross product have fascinating property. I use RcppEigen to compute it. It can take pretty huge matrices. The function is here http://stackoverflow.com/questions/16938508/speeding-up-computation-of-dice-coefficient-in-c-rcpp/16943857#16943857

I managed to compute the cross product on a 20,000×10,000 matrix in under 3 min compared to 15+ min with cross prod()

Might it be worthwhile to publish your complete code using Rcpp on your blog?

I could then crossreference to the final function from Rmazing…

Cheers,

Andrej

I like your crossprod function with Rcpp, do you know how to make it work for something like:

MAT <- matrix(rnorm(20000 * 10), nrow = 10)

?

[…] values (think arrayfun, matfun, etc.) This didn't work. I then learned about a function called bigcor in R from a colleague, and exported my data to try that out. Even my data was too big for it, and I […]

I’m writing up my results of a significantly smaller but still quite large matrix (75×75). I’m finding I’m using a lot of words but if I try to print the matrix at a table it takes up 8 pages. A journal is going to charge me a lot for that. I’ve done cluster analysis too which I will present as a graphic but that doesn’t show which variables are negatively or positively correlated, just the ones that have the strongest relationship with. does anyone know of a nifty way to summarise a correlation matrix?

I think you need a colored representation of your correlations (heatmap). You could use “image”, “heatmap” or “levelplot” (lattice).

Cheers,

Andrej

I would be very happy if you could add the possibility from the original cor() function where it is possible to calculate correlations not only between all columns of x, but also between x and y (where y is a vector or matrix).

Hi Patrik,

is that so important, because all you have to do is ‘cbind’?

## two vectors

x <- rnorm(10)

y <- rnorm(10)

mat <- cbind(x, y)

bigcor(mat)

cor(mat)

## two matrices

x <- matrix(rnorm(100), ncol = 10)

y <- matrix(rnorm(100), ncol = 10)

mat <- cbind(x, y)

res1 <- bigcor(mat)

res1 <- res1[1:20, 1:20] # convert 'ff' to 'matrix

res2 TRUE

But if you insist, I might do it😉

Cheers,

Andrej

Well, there are situations where you have milions of columns in x and just want each of their correlation with a vector y. To cbind x and y in this case would lead to unnecessary computations and actually an unfeasible problem. So, yes, I insist!

Ok, that makes sense…

Will see what I can do, maybe not a package update but just function posted here…

Cheers,

Andrej

Hello,

Thanks for creating this package!

I installed R and loaded “propagate” after I found your website through Google.

I loaded my datafile (n=2000; 8602 variables) and got bigcor to run fine and quickly. However only a few correlations are being displayed in the console; not the 8602 x 8602 matrix I was looking for.

I used: bigcor(mydata, fun = “cor” , size=2000 , verbose= TRUE)

Is there a way to display the whole correlation matrix in the console? Or to export it in csv (or any other format)?

It might be because i do not understand what ff is doing?

Best regards,

Erwan

Hi Erwan,

to look at it in the console, you have to convert it to a matrix:

COR <- bigcor(mydata, fun = “cor” , size=2000 , verbose= TRUE)

COR <- COR[1:nrow(COR), 1:ncol(COR)]

However, if your data is too big to view in R, you can convert it to ‘ffdf’ (ff dataframe) and export it to .csv:

COR <- bigcor(mydata, fun = “cor” , size=2000 , verbose= TRUE)

COR <- as.ffdf(COR)

write.table(COR, file = "C:\\temp\\test.csv")

Does that work?

Cheers,

Andrej

Hi Andrej,

Thanks so much for your answer. It worked!

In case there are more beginners in R; the script I finally used is:

COR <- bigcor(mydata, fun="cor", size=2000, verbose= TRUE)

COR <- as.ffdf(COR)

write.table(COR, file = "C:/myexportfile.csv", sep = ",", qmethod = "double")

Kind regards,

Erwan

Dear Andrej,

Hi, I would respectfully ask you something related with bigcor (and ff apparently). Maybe I am wrong, but with a small test matrix all is Ok with bigcor, but with my real matrix I am receiving this warning:

Loading required package: minpack.lm

Error in if (length .Machine$integer.max) stop(“length must be between 1 and .Machine$integer.max”) :

missing value where TRUE/FALSE needed

Calls: bigcor -> ff

In addition: Warning message:

In ff(vmode = “double”, dim = c(NCOL, NCOL)) :

NAs introduced by coercion to integer range

Execution halted

What does it means? I am wrong with something or is a kind of ff constraint?

> dim(dat)

[1] 70736 72

I am sure that the matrix does not have missed values. So I am really lost, do you understand the warning? Any clue could be great.

Thank you very much,

alex

Hi Alex,

Happy New Year!

Is “dat” your complete matrix? If so, you can use the R function “cor” without any problems, because it only has 72 columns:

res <- cor(dat)

If your real matrix is bigger, then try the "size" parameter smaller than the number of columns, e.g. if you have 2000 columns, use "size = 1000". Otherwise I don't really know, is class(dat) = "matrix"?

Cheers, Andrej

Dear anspiess,

Thanks a lot for your reply, I did´t received any alert by email. Just now I saw your reply by chance (maybe I forgot set the notification).

I was wrong, my suitable data set is the transposed matrix

dim(dat)

[1] 72 70736

but anyway the problem was related with Rstudio and some permissions problems in the server apparently. Now is working. Thanks a lot !!!!!