## bigcor: Large correlation matrices in R

Here is the updated version of my bigcor function for creating large correlation/covariance matrices.
What does it do?
1) It splits a matrix with N columns into equal-sized blocks (plus a remainder block) $A_1 \ldots A_n$, $B_1 \ldots B_n$, etc. The block size can be defined by the user; 2000 is a good value, because cor can handle matrices of this size quite quickly. If the matrix has 13796 columns, the split will be 2000; 2000; 2000; 2000; 2000; 2000; 1796.
2) For all combinations of blocks, the correlation matrix is calculated: $A_n/A_n$, $A_n/B_n$, $B_n/B_n$, etc.
3) The blockwise correlation matrices are transferred into the preallocated empty matrix at the appropriate positions (where the correlations would usually reside). To ensure symmetry around the diagonal, this is done twice: once in the upper and once in the lower triangle.

This way, the N x N empty correlation matrix is filled like a checkerboard with patches of n x n correlation submatrices. In our case, in which we split the N = 40000 columns into 8 blocks of n = 5000 columns each, we obtain 36 block combinations (28 from combn(1:8, 2), plus 8 for the diagonal pairs A/A, B/B, etc.). This gives 36 x 5000 x 5000 x 2 (filling both triangles) − 8 x 5000 x 5000 (because the diagonal blocks are created twice) = 1.6E9 = 40000 x 40000 cells.
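The scheme above can be sketched in a few lines of plain R. This is a toy version with a 10-column matrix and block size 4; the real bigcor additionally uses ff-backed storage, so treat this only as an illustration of the blockwise filling:

```r
# Toy sketch of the blockwise scheme (bigcor itself stores the result in 'ff').
set.seed(123)
X    <- matrix(rnorm(200 * 10), ncol = 10)   # N = 10 columns for the demo
size <- 4                                    # block size (bigcor uses ~2000)

# split the column indices into blocks of 'size' plus a remainder block
splits <- split(1:ncol(X), ceiling(seq_along(1:ncol(X)) / size))

res <- matrix(NA_real_, ncol(X), ncol(X))    # preallocated N x N matrix

for (i in seq_along(splits)) {
  for (j in i:length(splits)) {
    G1  <- splits[[i]]
    G2  <- splits[[j]]
    COR <- cor(X[, G1], X[, G2])   # blockwise correlation, e.g. A/B
    res[G1, G2] <- COR             # fill the upper triangle ...
    res[G2, G1] <- t(COR)          # ... and mirror into the lower triangle
  }
}

all.equal(res, cor(X))   # TRUE: identical to the full correlation matrix
```

Because each block pair is computed with cor itself, the patched-together result is exactly the full correlation matrix, just assembled piecewise.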

The function can be found here: http://www.dr-spiess.de/scripts/bigcor.r
Timing for a 21796 x 21796 matrix: roughly 2 min!

### 33 Responses to bigcor: Large correlation matrices in R

1. Hristo Iliev says:

This might sound a bit like nitpicking, but each element of a numeric vector is 64 bits (8 bytes) in size, i.e. your vector would require 11.92 GiB, which is still more than the RAM size of a typical computer these days.

• anspiess says:

Thanks for the notice, my fault.
But can you explain this:

> object.size(5.25)
32 bytes
> object.size(c(5.25, 5.25))
40 bytes
> object.size(c(5.25, 5.25, 5.25))
56 bytes
> object.size(c(5.25, 5.25, 5.25, 5.25))
56 bytes
> object.size(c(5.25, 5.25, 5.25, 5.25, 5.25))
72 bytes

• Hristo Iliev says:

Of course – object.size gives the memory usage of the whole vector object, i.e. the size of the numeric data plus the size of any metadata and/or control structures, associated with the object (collectively called overhead). With huge vectors the overhead becomes negligible and their storage size is almost equal to the size of the pure numeric data.
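To illustrate that point with a quick sketch: the fixed per-object header stays constant while the data grows, so the bytes-per-element ratio converges toward the raw 8 bytes of a double:

```r
# object.size = 8 bytes per double + a fixed object header,
# so the overhead per element vanishes for long vectors.
for (n in c(1L, 100L, 1000000L)) {
  sz <- as.numeric(object.size(numeric(n)))
  cat(n, "elements:", sz, "bytes =>", round(sz / n, 3), "bytes/element\n")
}
```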

2. David says:

Nice! I implemented a parallel version to speed up things a bit. http://brainchronicle.blogspot.com/2013/02/large-correlation-in-parallel.html

• anspiess says:

Wow, awesome! Thanks David!

Cheers,
Andrej

I’m just playing with this and trying to write the data out into a csv or other file for another system to use. What is the easiest way to do that? I’ve been trying write.csv.ffdf, but that isn’t working; it gives the error: Error: inherits(x, "ffdf") is not TRUE

• anspiess says:

Hmm, of course, export of the large matrix is essential…
Give me one or two days to figure it out!

Cheers,
Andrej

• anspiess says:

Got it! This works.
res <- as.ffdf(res)
write.csv.ffdf(res, "c:\\temp\\test.csv")

For a 10000 x 10000 correlation matrix, the timings are:
computing res => 35 s
write.csv.ffdf(res, "c:\\temp\\test.csv") => 17 min!!
resulting in a 1.8 GB csv file.
If there is a quicker way to export, please let me know!

Cheers,
Andrej

4. Jan says:

@Andrej
Thanks for sharing this function.

Please change ‘MAT’ to ‘x’ within the for-loop as your function will fail if MAT is not specified globally beforehand.

If you know that the optimal block size is ~2k, why not have an ‘automatic’ calculation of this parameter within the function, depending on the input matrix? This would also allow you to deal with an unequally sized final block. Or am I mistaken here?

@David {couldn’t comment at your blog directly}
Same issue as above. Additionally it would have been great to mention two further points within your post/function:
– require(multicore) at the function head
– doMC/multicore packages are *not* available for Windows

Hence, a platform-independent version (loading the appropriate packages) would be excellent.

Nevertheless, thanks to both of you again. Happy science!

• anspiess says:

Thanks for pointing me to the bugs! I will change ‘MAT’ to ‘x’, incorporate an automatic 2k block-size calculation and then post a link to the new version on my blog. I haven’t got a clue with respect to platform-independent parallelization though… Can anyone give me a hint?

Cheers,
Andrej

• Jan says:

You can query the Platform using:
.Platform

You then need to set up a different backend for the ad hoc cluster. On my machine this seems to work (using the framework of your function):
library(doSNOW, quietly = TRUE)
stopifnot(length(ncore)==1 && !is.na(as.integer(ncore)))
cl <- makeCluster(spec=rep("localhost",ncore), type = "SOCK")
registerDoSNOW(cl)
foreach(i = 1:nrow(x.subs.comb)) %dopar% {

}
stopCluster(cl)

However, I couldn't get David's function to work. Not sure if it is a Unix/Windows problem. The point is that I cannot assign to the pre-defined output matrix (corMAT) within the %dopar% loop, which makes sense to me, as two parallel processes may try to write into the same object at the same time.

A quick-and-dirty solution is to store the resulting list from
foreach() %dopar% {}
and assign its values in a second (standard) for-loop (I can paste the full code if it is unclear what I mean).

Anyway, I expect that parallelization will unfortunately increase the memory demand, a problem you tried to solve in the first place. So I wouldn't bother too much with it, but just enjoy the memory savings you have already achieved.
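Jan's two-step workaround can be sketched with base R's parallel package alone, which also sidesteps the Windows problem (PSOCK clusters run everywhere). The variable names here are illustrative, not taken from David's code:

```r
library(parallel)  # ships with R; PSOCK clusters work on Unix and Windows

set.seed(123)
X      <- matrix(rnorm(200 * 12), ncol = 12)
size   <- 4
splits <- split(1:ncol(X), ceiling(seq_along(1:ncol(X)) / size))

# all unique block pairs, including the diagonal ones (A/A, B/B, ...)
combs <- which(upper.tri(diag(length(splits)), diag = TRUE), arr.ind = TRUE)

cl <- makePSOCKcluster(2)
clusterExport(cl, c("X", "splits", "combs"))

# step 1 (parallel): each task returns its block correlation as a list element
blocks <- parLapply(cl, 1:nrow(combs), function(k)
  cor(X[, splits[[combs[k, "row"]]]], X[, splits[[combs[k, "col"]]]]))
stopCluster(cl)

# step 2 (serial): assign the results, so no two processes ever write
# into the output matrix concurrently
res <- matrix(NA_real_, ncol(X), ncol(X))
for (k in 1:nrow(combs)) {
  G1 <- splits[[combs[k, "row"]]]
  G2 <- splits[[combs[k, "col"]]]
  res[G1, G2] <- blocks[[k]]
  res[G2, G1] <- t(blocks[[k]])
}

all.equal(res, cor(X))   # TRUE
```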

5. I’m trying this out in parallel with a 4901 x 340,000 matrix. A regular cor takes about 3 hours on my node, and a little under 2 with crossprod. I’d love to get this under an hour.

• anspiess says:

Hmm, have you sped up my function with ‘crossprod’? Please tell me how!
You might increase speed if you increase the size of the submatrices, but there seems to be some magic border.
What is your setup at the moment?

Cheers,
Andrej

6. In my case, I wanted a correlation matrix of genetic relatedness between 4901 people. So: 4901 people genotyped at 340k markers.

I just modified yours and David’s versions, replacing cor with crossprod, which gives me an N x N matrix of people’s relatedness. With centered columns, dividing the rows and then the columns by the square root of the diagonal gets me the same as cor().

G <- crossprod(G)      # G has centered columns
D <- sqrt(diag(G))
C <- sweep(G, 2, D, "/")
C <- sweep(C, 1, D, "/")

7. Oh, and I split the dopar into 88 blocks, so 4928/88 = 56 people with 340K markers each.

Now, on 16 cores, it finishes the full matrix in under an hour and uses about 30 GB of my 64 GB of RAM.

8. […] values (think arrayfun, matfun, etc.)  This didn't work.  I then learned about a function called bigcor in R from a colleague, and exported my data to try that out.  Even my data was too big for it, and I […]

9. prettygurrly says:

I’m writing up the results of a significantly smaller, but still quite large, matrix (75 x 75). I’m finding I’m using a lot of words, but if I try to print the matrix as a table it takes up 8 pages, and a journal is going to charge me a lot for that. I’ve done a cluster analysis too, which I will present as a graphic, but that doesn’t show which variables are negatively or positively correlated, just the ones that have the strongest relationships. Does anyone know of a nifty way to summarise a correlation matrix?

• anspiess says:

I think you need a colored representation of your correlations (heatmap). You could use “image”, “heatmap” or “levelplot” (lattice).

Cheers,
Andrej
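A minimal sketch of that suggestion in base R (random data stands in for the 75 variables; reordering by hierarchical clustering first makes the correlated blocks visible):

```r
set.seed(123)
X   <- matrix(rnorm(100 * 75), ncol = 75)   # stand-in for the 75 variables
COR <- cor(X)

# order variables by hierarchical clustering so correlated groups sit together
ord <- hclust(as.dist(1 - COR))$order

png("cor_heatmap.png", width = 600, height = 600)
image(COR[ord, ord], zlim = c(-1, 1),
      col  = colorRampPalette(c("blue", "white", "red"))(100),
      main = "Correlations: blue = -1, white = 0, red = +1")
dev.off()
```

The resulting single-page figure replaces the 8-page table while still showing the sign of each correlation through the color scale.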

10. Patrik Waldmann says:

I would be very happy if you could add the option from the original cor() function to calculate correlations not only between all columns of x, but also between x and y (where y is a vector or matrix).

• anspiess says:

Hi Patrik,

Is that so important? Because all you have to do is ‘cbind’:

## two vectors
x <- rnorm(10)
y <- rnorm(10)
mat <- cbind(x, y)
bigcor(mat)
cor(mat)

## two matrices
x <- matrix(rnorm(100), ncol = 10)
y <- matrix(rnorm(100), ncol = 10)
mat <- cbind(x, y)
res1 <- bigcor(mat)
res1 <- res1[1:20, 1:20] # convert 'ff' to 'matrix'
res2 <- cor(mat)
all.equal(res1, res2) # TRUE

But if you insist, I might do it 😉

Cheers,
Andrej

• Patrik Waldmann says:

Well, there are situations where you have millions of columns in x and just want each column’s correlation with a vector y. To cbind x and y in this case would lead to unnecessary computations and actually an unfeasible problem. So, yes, I insist!

• anspiess says:

Ok, that makes sense…
Will see what I can do, maybe not a package update but just function posted here…

Cheers,
Andrej
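Until that update exists, a chunked x-versus-y correlation can be sketched like this (corChunk is a hypothetical helper for illustration, not part of the bigcor package):

```r
# Hypothetical helper: correlate every column of x with the vector y in
# chunks of 'size' columns, so no N x N matrix is ever allocated.
corChunk <- function(x, y, size = 2000) {
  splits <- split(1:ncol(x), ceiling(seq_along(1:ncol(x)) / size))
  unlist(lapply(splits, function(idx) cor(x[, idx, drop = FALSE], y)))
}

set.seed(123)
x <- matrix(rnorm(100 * 5000), ncol = 5000)  # "many" columns
y <- rnorm(100)

r <- corChunk(x, y, size = 1000)
all.equal(unname(r), as.vector(cor(x, y)))   # TRUE
```

Memory use is bounded by one chunk of x at a time, which is exactly why cbind-ing y onto a huge x is unnecessary here.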

11. Erwan says:

Hello,

Thanks for creating this package!

I loaded my datafile (n=2000; 8602 variables) and got bigcor to run fine and quickly. However only a few correlations are being displayed in the console; not the 8602 x 8602 matrix I was looking for.

I used: bigcor(mydata, fun = "cor", size = 2000, verbose = TRUE)

Is there a way to display the whole correlation matrix in the console? Or to export it in csv (or any other format)?

It might be because I do not understand what ff is doing?

Best regards,
Erwan

• anspiess says:

Hi Erwan,

to look at it in the console, you have to convert it to a matrix:
```
COR <- bigcor(mydata, fun = "cor", size = 2000, verbose = TRUE)
COR <- COR[1:nrow(COR), 1:ncol(COR)]
```

However, if your data is too big to view in R, you can convert it to ‘ffdf’ (ff dataframe) and export it to .csv:
```
COR <- bigcor(mydata, fun = "cor", size = 2000, verbose = TRUE)
COR <- as.ffdf(COR)
write.table(COR, file = "C:\\temp\\test.csv")
```

Does that work?

Cheers,
Andrej

• Erwan says:

Hi Andrej,

In case there are more beginners in R; the script I finally used is:
COR <- bigcor(mydata, fun="cor", size=2000, verbose= TRUE)
COR <- as.ffdf(COR)
write.table(COR, file = "C:/myexportfile.csv", sep = ",", qmethod = "double")

Kind regards,
Erwan

12. Dear Andrej,

Hi, I would respectfully like to ask you something related to bigcor (and ff, apparently). Maybe I am wrong, but while with a small test matrix everything is OK with bigcor, with my real matrix I am receiving this error:

Error in if (length > .Machine$integer.max) stop("length must be between 1 and .Machine$integer.max") :
missing value where TRUE/FALSE needed
Calls: bigcor -> ff
In ff(vmode = "double", dim = c(NCOL, NCOL)) :
NAs introduced by coercion to integer range
Execution halted

What does it mean? Am I doing something wrong, or is it a kind of ff constraint?

> dim(dat)
[1] 70736 72

I am sure that the matrix does not have missing values. So I am really lost; do you understand the warning? Any clue would be great.

Thank you very much,

alex

• anspiess says:

Hi Alex,

Happy New Year!
Is “dat” your complete matrix? If so, you can use the R function “cor” without any problems, because it only has 72 columns:

res <- cor(dat)

If your real matrix is bigger, then try a "size" parameter smaller than the number of columns, e.g. if you have 2000 columns, use "size = 1000". Otherwise I don't really know; is class(dat) = "matrix"?

Cheers, Andrej

13. Dear anspiess,

I was wrong, my suitable data set is the transposed matrix
dim(dat)
[1] 72 70736

but anyway, the problem was apparently related to RStudio and some permission problems on the server. Now it is working. Thanks a lot!!!!!

14. Nica says:

Hi,

Nice post!

I’ve been trying to run the bigcor function on a matrix with nrow = 144 and ncol = 156267, but I get an error message like the one below:

Error in if (length > .Machine$integer.max) stop("length must be between 1 and .Machine$integer.max") :
missing value where TRUE/FALSE needed
In addition: Warning message:
In ff(vmode = "single", dim = c(NCOL, NCOL)) :

What can I do?

• anspiess says:

Hi Nica,

when your matrix has 156267 columns, library ‘ff’ needs to preallocate a matrix with 156267 * 156267 cells:

```
> traceback()
2: ff(vmode = "double", dim = c(NCOL, NCOL))
1: bigcor(X)
```

The cells are of class double (8 bytes each), so the preallocated matrix takes 156267 * 156267 * 8 bytes = 1.95355E11 bytes,
which is 1.95355E11 bytes / 1024 / 1024 / 1024 = 181 GiB! I suppose that exceeds your RAM…
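That back-of-envelope calculation as an R snippet (numbers taken from above):

```r
ncols <- 156267
bytes <- ncols^2 * 8          # 8 bytes per double cell
bytes / 1024^3                # about 181.9 GiB
```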
Maybe you can split your data somehow…

Cheers,
Andrej