bigcor: Large correlation matrices in R

Here is the update to my bigcor function for creating large correlation/covariance matrices.
What does it do?
1) It splits a matrix with N columns into equal size blocks (+ a remainder block) of $A_1 ... A_n$ , $B_1 ... B_n$ , etc. Block size can be defined by user, 2000 is a good value because cor can handle this quite quickly. If the matrix has 13796 columns, the split will be 2000;2000;2000;2000;2000;2000;1796.
2) For all combinations of blocks, the correlation matrix is calculated, so $latex A_n/A_n, A_n/B_n, B_n/B_n etc.
3) The blockwise correlation matrices are transferred into the preallocated empty matrix at the appropriate position (where the correlations would usually reside). To ensure symmetry around the diagonal, this is done twice in the upper and lower triangle.

This way, the N x N empty correlation matrix is filled like a checkerboard with patches of n x n correlation sub-matrices. In our case, in which we split the N = 40000 columns into n = 8 blocks, we obtain 36 combinations (combn(1:8, 2) + 8; + 8 because of A/A, B/B etc) of blocks with dimension 5000 x 5000 each. This gives 36 x 5000 x 5000 x 2 (filling both triangles) – 8 x 5000 x 5000 (because the diagonals are created twice) = 1.6E9 = 40000 x 40000.

The function can found here: http://www.dr-spiess.de/scripts/bigcor.r
Timings for a 21796 x 21796 matrix roughly 2 min!

This entry was posted on Friday, February 22nd, 2013 at 3:33 pm and is filed under General. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

40 Responses to bigcor: Large correlation matrices in R

Hristo Iliev says:

February 23, 2013 at 12:25 pm

This might sound a bit like nitpicking, but each element in a numeric vector is 64 bits or 8 bytes in size, i.e. your vector would require 11,92 GiB, which is still over the RAM size of a typical computer those days.

Reply
- anspiess says:
  
  February 23, 2013 at 1:53 pm
  
  Thanks for the notice, my fault.
  But can you explain this:
  
  > object.size(5.25)
  32 bytes
  > object.size(c(5.25, 5.25))
  40 bytes
  > object.size(c(5.25, 5.25, 5.25))
  56 bytes
  > object.size(c(5.25, 5.25, 5.25, 5.25))
  56 bytes
  > object.size(c(5.25, 5.25, 5.25, 5.25, 5.25))
  72 bytes
  
  Reply
  - Hristo Iliev says:
    
    February 23, 2013 at 9:29 pm
    
    Of course – object.size gives the memory usage of the whole vector object, i.e. the size of the numeric data plus the size of any metadata and/or control structures, associated with the object (collectively called overhead). With huge vectors the overhead becomes negligible and their storage size is almost equal to the size of the pure numeric data.
David says:

February 24, 2013 at 6:14 pm

Nice! I implemented a parallel version to speed up things a bit. http://brainchronicle.blogspot.com/2013/02/large-correlation-in-parallel.html

Reply
- anspiess says:
  
  February 25, 2013 at 9:53 am
  
  Wow, awesome! Thanks David!
  
  Cheers,
  Andrej
  
  Reply
adub1979 says:

March 1, 2013 at 11:31 pm

I’m just playing with this and trying to write out the data into a csv or other file for another system to leverage, what is the easiest way to do that. I’ve been trying write.csv.ffdf but that isn’t working. Gives the error, Error: inherits(x, “ffdf”) is not TRUE

Reply
- anspiess says:
  
  March 2, 2013 at 8:10 am
  
  Hmm, of course, export of the large matrix is essential…
  Give me one or two days to figure it out!
  
  Cheers,
  Andrej
  
  Reply
- anspiess says:
  
  March 2, 2013 at 10:24 pm
  
  Got it! This works.
  res <- as.ffdf(res)
  write.csv.ffdf(res, "c:\\temp\\test.csv")
  
  For a 10000 x 10000 correlation matrix timings are:
  res 35s
  write.csv.ffdf(res, “c:\\temp\\test.csv”) => 17 min!!
  resulting in a 1.8 GB csv-file.
  If there is a quicker way to export, please let me know!
  
  Cheers,
  Andrej
  
  Reply
Jan says:

May 16, 2013 at 10:17 am

@Andrej
Thanks for sharing this function.

Please change ‘MAT’ to ‘x’ within the for-loop as your function will fail if MAT is not specified globally beforehand.

If you know that the optimal blocksize is ~2k, why not having an ‘automatic’ calculation for this parameter within the function depending on the input matrix. This would allow you as well to deal with an inequally sized final block. Or am I mistaken here?

@David {couldn’t comment at your blog directly}
Same issue as above. Additionally it would have been great to mention two further points within your post/function:
– require(multicore) at the function head
– doMC/multicore packages are *not* available for Windows

Hence, a Platform independent version (loading the appropriate packeges) would be excellent.

Nevertheless, thanks to both of you again. Happy science!

Reply
- anspiess says:
  
  May 16, 2013 at 10:54 am
  
  Thanks for pointing me to the bugs! I will make modifications to ‘x’, incorporate an automatic 2k size calculation and then give a link to the new version on my blog. Havn’t got a clue however in respect to platform-independent parallelization though… Can anyone give me hint?
  
  Cheers,
  Andrej
  
  Reply
  - Jan says:
    
    May 16, 2013 at 1:20 pm
    
    You can query the Platform using:
    .Platform
    
    You need then to set up a different backend for the ad hoc cluster. On my machine this seems to work (using the framework of your function):
    library(doSNOW, quietly = TRUE)
    stopifnot(length(ncore)==1 && !is.na(as.integer(ncore)))
    cl <- makeCluster(spec=rep("localhost",ncore), type = "SOCK")
    registerDoSNOW(cl)
    foreach(i = 1:nrow(x.subs.comb)) %dopar% {
    …
    }
    stopCluster(cl)
    
    However, I couldn't get David's function to work. Not sure if it is a Unix/Windows problem. The point is that I can not assign to the pre-defined output matrix (corMAT) within the %dopar% loop, which makes sense to me, as 2 parallel processes may try to write into the same object at the same time.
    
    Quick and dirty solution is to store the resulting list from
    foreach () %*dopar% {}
    and assign it's values in a second (standard) for-loop (I can paste the full code if it is unclear what I mean).
    
    Anyway, I expect that parallelization will unfortunately increase the demand of memory, a problem you tried to solve in the first place. So I wouldn't bother too much with it but just enjoy the memory save you already achieved.
kipkestonkeston says:

June 19, 2013 at 5:51 am

I’m trying this out in par with a 4901×340,000 matrix. A regular cor takes about 3 hours on my node, and a little under 2 with crossprod. I’d love to get this under an hour.

Reply
- anspiess says:
  
  June 19, 2013 at 8:28 am
  
  Hmm, have you sped up my function with ‘crossprod’? Please tell me how!
  You might increase speed if you increase the size of the submatrices, but there seems to be some magic border.
  What is your setup at the moment?
  
  Cheers,
  Andrej
  
  Reply
kipkestonkeston says:

June 20, 2013 at 1:40 am

In my case, I wanted a correlation matrix of of genetic relatedness between 4901 people. So 4901 people genotyped at 340k markers.

I just modified you and David’s replacing cor with crossprod. Which gives me a NxN matrix of people’s relatedness. Dividing row and then columns by the diagonal gets me the same as cor().

G <- crossprod(G)
D <- diag(G)
C <- sweep(G,2,D,"/")
C <- sweep(C,1,D,"/")

Reply
kipkestonkeston says:

June 20, 2013 at 1:42 am

Oh, and I split dopar into 88 blocks, so 4928/88=56 people with 340K markers.

Now, on 16 cores, it finishes the full matrix in under and hour and uses about 30GB of my 64GB of ram.

Reply
- David says:
  
  June 20, 2013 at 11:32 am
  
  Indeed cross product have fascinating property. I use RcppEigen to compute it. It can take pretty huge matrices. The function is here http://stackoverflow.com/questions/16938508/speeding-up-computation-of-dice-coefficient-in-c-rcpp/16943857#16943857
  I managed to compute the cross product on a 20,000×10,000 matrix in under 3 min compared to 15+ min with cross prod()
  
  Reply
  - anspiess says:
    
    June 20, 2013 at 1:40 pm
    
    Might it be worthwhile to publish your complete code using Rcpp on your blog?
    I could then crossreference to the final function from Rmazing…
    
    Cheers,
    Andrej
  - kipkestonkeston says:
    
    June 21, 2013 at 12:50 am
    
    I like your crossprod function with Rcpp, do you know how to make it work for something like:
    
    MAT <- matrix(rnorm(20000 * 10), nrow = 10)
    
    ?
Tracing a Functional Brain Network | vbmis says:

September 10, 2013 at 12:30 am

[…] values (think arrayfun, matfun, etc.) This didn't work. I then learned about a function called bigcor in R from a colleague, and exported my data to try that out. Even my data was too big for it, and I […]

Reply
prettygurrly says:

August 22, 2014 at 1:02 pm

I’m writing up my results of a significantly smaller but still quite large matrix (75×75). I’m finding I’m using a lot of words but if I try to print the matrix at a table it takes up 8 pages. A journal is going to charge me a lot for that. I’ve done cluster analysis too which I will present as a graphic but that doesn’t show which variables are negatively or positively correlated, just the ones that have the strongest relationship with. does anyone know of a nifty way to summarise a correlation matrix?

Reply
- anspiess says:
  
  September 1, 2014 at 8:00 am
  
  I think you need a colored representation of your correlations (heatmap). You could use “image”, “heatmap” or “levelplot” (lattice).
  
  Cheers,
  Andrej
  
  Reply
Patrik Waldmann says:

September 29, 2014 at 10:39 pm

I would be very happy if you could add the possibility from the original cor() function where it is possible to calculate correlations not only between all columns of x, but also between x and y (where y is a vector or matrix).

Reply
- anspiess says:
  
  September 30, 2014 at 8:16 am
  
  Hi Patrik,
  
  is that so important, because all you have to do is ‘cbind’?
  
  ## two vectors
  x <- rnorm(10)
  y <- rnorm(10)
  mat <- cbind(x, y)
  bigcor(mat)
  cor(mat)
  
  ## two matrices
  x <- matrix(rnorm(100), ncol = 10)
  y <- matrix(rnorm(100), ncol = 10)
  mat <- cbind(x, y)
  res1 <- bigcor(mat)
  res1 <- res1[1:20, 1:20] # convert 'ff' to 'matrix
  res2 TRUE
  
  But if you insist, I might do it 😉
  
  Cheers,
  Andrej
  
  Reply
  - Patrik Waldmann says:
    
    September 30, 2014 at 10:14 am
    
    Well, there are situations where you have milions of columns in x and just want each of their correlation with a vector y. To cbind x and y in this case would lead to unnecessary computations and actually an unfeasible problem. So, yes, I insist!
  - anspiess says:
    
    September 30, 2014 at 10:25 am
    
    Ok, that makes sense…
    Will see what I can do, maybe not a package update but just function posted here…
    
    Cheers,
    Andrej
Erwan says:

April 7, 2015 at 2:50 pm

Hello,

Thanks for creating this package!
I installed R and loaded “propagate” after I found your website through Google.

I loaded my datafile (n=2000; 8602 variables) and got bigcor to run fine and quickly. However only a few correlations are being displayed in the console; not the 8602 x 8602 matrix I was looking for.

I used: bigcor(mydata, fun = “cor” , size=2000 , verbose= TRUE)

Is there a way to display the whole correlation matrix in the console? Or to export it in csv (or any other format)?

It might be because i do not understand what ff is doing?

Best regards,
Erwan

Reply
- anspiess says:
  
  April 7, 2015 at 9:02 pm
  
  Hi Erwan,
  
  to look at it in the console, you have to convert it to a matrix:
  COR <- bigcor(mydata, fun = “cor” , size=2000 , verbose= TRUE) COR <- COR[1:nrow(COR), 1:ncol(COR)]
  
  However, if your data is too big to view in R, you can convert it to ‘ffdf’ (ff dataframe) and export it to .csv:
  COR <- bigcor(mydata, fun = “cor” , size=2000 , verbose= TRUE) COR <- as.ffdf(COR) write.table(COR, file = "C:\\temp\\test.csv")
  
  Does that work?
  
  Cheers,
  Andrej
  
  Reply
  - Erwan says:
    
    April 9, 2015 at 6:19 pm
    
    Hi Andrej,
    Thanks so much for your answer. It worked!
    
    In case there are more beginners in R; the script I finally used is:
    COR <- bigcor(mydata, fun="cor", size=2000, verbose= TRUE)
    COR <- as.ffdf(COR)
    write.table(COR, file = "C:/myexportfile.csv", sep = ",", qmethod = "double")
    
    Kind regards,
    Erwan
alexander vergara says:

December 30, 2015 at 12:19 am

Dear Andrej,

Hi, I would respectfully ask you something related with bigcor (and ff apparently). Maybe I am wrong, but with a small test matrix all is Ok with bigcor, but with my real matrix I am receiving this warning:

Loading required package: minpack.lm
Error in if (length .Machine$integer.max) stop(“length must be between 1 and .Machine$integer.max”) :
missing value where TRUE/FALSE needed
Calls: bigcor -> ff
In addition: Warning message:
In ff(vmode = “double”, dim = c(NCOL, NCOL)) :
NAs introduced by coercion to integer range
Execution halted

What does it means? I am wrong with something or is a kind of ff constraint?

> dim(dat)
[1] 70736 72

I am sure that the matrix does not have missed values. So I am really lost, do you understand the warning? Any clue could be great.

Thank you very much,

alex

Reply
- anspiess says:
  
  January 3, 2016 at 12:03 pm
  
  Hi Alex,
  
  Happy New Year!
  Is “dat” your complete matrix? If so, you can use the R function “cor” without any problems, because it only has 72 columns:
  
  res <- cor(dat)
  
  If your real matrix is bigger, then try the "size" parameter smaller than the number of columns, e.g. if you have 2000 columns, use "size = 1000". Otherwise I don't really know, is class(dat) = "matrix"?
  
  Cheers, Andrej
  
  Reply
  - Khan Irfan says:
    
    October 9, 2016 at 12:11 am
    
    Dear Andrej
    Hi. I want to calculate the correlation for the following dimension (assuming bigcor calculates the column * column correlation so I took transpose of my original matrix and converted to the following because I want to calculate correlation between isoforms of zea mays)
    
    dim(data)
    [1] 1803 63480
    
    I get the following errors. Kindly let me know how can I calculate the correlation and how do I fix the errors and avoid the warnings
    
    system.time(res <- bigcor(data, nblocks = 10))
    Error in if (length .Machine$integer.max) stop(“length must be between 1 and .Machine$integer.max”) :
    missing value where TRUE/FALSE needed
    In addition: Warning message:
    In ff(vmode = “single”, dim = c(NCOL, NCOL)) :
    NAs introduced by coercion to integer range
    Timing stopped at: 0.001 0 0.001
alexander vergara says:

February 23, 2016 at 3:57 pm

Dear anspiess,

Thanks a lot for your reply, I did´t received any alert by email. Just now I saw your reply by chance (maybe I forgot set the notification).
I was wrong, my suitable data set is the transposed matrix
dim(dat)
[1] 72 70736

but anyway the problem was related with Rstudio and some permissions problems in the server apparently. Now is working. Thanks a lot !!!!!

Reply
Nica says:

August 31, 2016 at 1:51 am

Hi,

Nice post!

I’ve been trying to run the bigcor function on a matrix with nrow=144 y ncol=156267, but get an error message, like this one below:

Error in if (length .Machine$integer.max) stop(“length must be between 1 and .Machine$integer.max”) : missing value where TRUE/FALSE needed In addition: Warning message: In ff(vmode = “single”, dim = c(NCOL, NCOL)) :

What can I do?

Reply
- anspiess says:
  
  August 31, 2016 at 6:07 pm
  
  Hi Nica,
  
  when your matrix has 156267 columns, library ‘ff’ needs to preallocate a matrix with 156267 * 156267 cells:
  
  > traceback() 2: ff(vmode = "double", dim = c(NCOL, NCOL)) 1: bigcor(X)
  
  The cells are of class double (8 Byte), so that the preallocated matrix has 156267 * 156267 * 8 Byte = 1.95355E11 Byte,
  which is 1.95355E11 Byte /1024/1024/1024 = 181 Gbyte! I suppose that exceeds your RAM…
  Maybe you can split your data somehow…
  
  Cheers,
  Andrej
  
  Reply
none says:

October 6, 2016 at 3:15 pm

Great stuff!! Very useful!
Is it possible to include “corr.test” from psych package instead of “cor” or something similar (https://sites.google.com/site/manabusakamoto/home/r-functions-and-scripts/pair-cor) that gives p-values after multiple correction. thanks

Reply
Muhammad Irfan Khan says:

October 8, 2016 at 11:34 pm

Dear Andrej
I am trying to compute correlation for the following (assuming that your function only calculates column to column correlation so i took transponse of my matrix and converted it to following)

1803 63480

I get the following errors

Error in if (length .Machine$integer.max) stop(“length must be between 1 and .Machine$integer.max”) :
missing value where TRUE/FALSE needed
In addition: Warning message:
In ff(vmode = “single”, dim = c(NCOL, NCOL)) :
NAs introduced by coercion to integer range
Timing stopped at: 0 0.001 0.001

Kindly guide me what could be wrong

Reply
Khan Irfan says:

October 8, 2016 at 11:52 pm

Dear Andrej
Hi. I want to calculate the correlation for the following dimension (assuming bigcor calculates the column * column correlation so I took transpose of my original matrix and converted to the following because I want to calculate correlation between isoforms of zea mays)

dim(data)
[1] 1803 63480

I get the following errors. Kindly let me know how can I calculate the correlation and how do I fix the errors and avoid the warnings

system.time(res <- bigcor(data, nblocks = 10))
Error in if (length .Machine$integer.max) stop(“length must be between 1 and .Machine$integer.max”) :
missing value where TRUE/FALSE needed
In addition: Warning message:
In ff(vmode = “single”, dim = c(NCOL, NCOL)) :
NAs introduced by coercion to integer range
Timing stopped at: 0.001 0 0.001

Cheers
Khan

Reply
yijianhuang says:

August 21, 2017 at 10:13 pm

I have a few missing values. Is it possible to use bigcor?

Reply
kb472@georgetown.edu says:

May 24, 2019 at 2:16 am

Is there a way to do pairwise Pearson correlation test as well ? Thanks.

Reply
anspiess says:

May 24, 2019 at 4:50 pm

… from a few requests lately, I’m in the process of integrating a cor.test version into bigcor. Probably in 1 to 2 weeks.

Reply