bigcor: Large correlation matrices in R

Here is the update to my bigcor function for creating large correlation/covariance matrices.
What does it do?
1) It splits a matrix with N columns into equal-size blocks (plus a remainder block) A, B, C, etc. The block size can be defined by the user; 2000 is a good value because cor can handle this quite quickly. If the matrix has 13796 columns, the split will be 2000; 2000; 2000; 2000; 2000; 2000; 1796.
2) For all combinations of blocks, the correlation matrix is calculated, i.e. A/A, A/B, B/B, etc.
3) The blockwise correlation matrices are transferred into the preallocated empty matrix at the appropriate positions (where the correlations would usually reside). To ensure symmetry around the diagonal, this is done twice: once for the upper and once for the lower triangle.

This way, the N x N empty correlation matrix is filled like a checkerboard with patches of n x n correlation sub-matrices. In our case, where we split the N = 40000 columns into 8 blocks of n = 5000 columns each, we obtain choose(8, 2) = 28 off-diagonal block combinations plus 8 diagonal ones (A/A, B/B, etc.), i.e. 36 block pairs of dimension 5000 x 5000 each. This gives 36 x 5000 x 5000 x 2 (filling both triangles) - 8 x 5000 x 5000 (because the diagonal blocks are filled twice) = 1.6E9 = 40000 x 40000 cells.
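For illustration, here is a minimal in-memory sketch of the scheme (the name blockcor and the plain matrix in place of the preallocated 'ff' object are simplifications for exposition, not the real function):

blockcor <- function(x, size = 2000) {
  N <- ncol(x)
  ## split the column indices into blocks of 'size' plus a remainder block
  splits <- split(1:N, ceiling(seq_len(N) / size))
  nblocks <- length(splits)
  res <- matrix(NA_real_, N, N)
  for (i in 1:nblocks) {
    for (j in i:nblocks) {      # all block pairs, including A/A, B/B, ...
      g1 <- splits[[i]]
      g2 <- splits[[j]]
      cc <- cor(x[, g1], x[, g2])
      res[g1, g2] <- cc         # patch in the upper triangle ...
      res[g2, g1] <- t(cc)      # ... and its mirror image in the lower one
    }
  }
  res
}

## quick check on toy data:
m <- matrix(rnorm(100 * 500), ncol = 500)
all.equal(blockcor(m, size = 200), cor(m))  # TRUE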

The function can be found here: http://www.dr-spiess.de/scripts/bigcor.r
Timing for a 21796 x 21796 matrix: roughly 2 min!

40 Responses to bigcor: Large correlation matrices in R

  1. Hristo Iliev says:

    This might sound a bit like nitpicking, but each element in a numeric vector is 64 bits or 8 bytes in size, i.e. your vector would require 11.92 GiB, which is still over the RAM size of a typical computer these days.

    • anspiess says:

      Thanks for the notice, my fault.
      But can you explain this:

      > object.size(5.25)
      32 bytes
      > object.size(c(5.25, 5.25))
      40 bytes
      > object.size(c(5.25, 5.25, 5.25))
      56 bytes
      > object.size(c(5.25, 5.25, 5.25, 5.25))
      56 bytes
      > object.size(c(5.25, 5.25, 5.25, 5.25, 5.25))
      72 bytes

      • Hristo Iliev says:

        Of course – object.size gives the memory usage of the whole vector object, i.e. the size of the numeric data plus the size of any metadata and/or control structures, associated with the object (collectively called overhead). With huge vectors the overhead becomes negligible and their storage size is almost equal to the size of the pure numeric data.
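        For illustration (the exact byte counts vary with R version and platform):

        object.size(numeric(0))    # pure overhead, no data at all
        object.size(numeric(1e6))  # ~ 8e6 bytes, i.e. essentially 8 bytes per element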

  2. adub1979 says:

    I’m just playing with this and trying to write out the data into a csv or other file for another system to leverage. What is the easiest way to do that? I’ve been trying write.csv.ffdf, but that isn’t working; it gives the error: Error: inherits(x, "ffdf") is not TRUE

    • anspiess says:

      Hmm, of course, export of the large matrix is essential…
      Give me one or two days to figure it out!

      Cheers,
      Andrej

    • anspiess says:

      Got it! This works.
      res <- as.ffdf(res)
      write.csv.ffdf(res, "c:\\temp\\test.csv")

      For a 10000 x 10000 correlation matrix, timings are:
      computing res with bigcor => 35 s
      write.csv.ffdf(res, "c:\\temp\\test.csv") => 17 min!!
      resulting in a 1.8 GB csv file.
      If there is a quicker way to export, please let me know!

      Cheers,
      Andrej

  3. Jan says:

    @Andrej
    Thanks for sharing this function.

    Please change ‘MAT’ to ‘x’ within the for-loop as your function will fail if MAT is not specified globally beforehand.

    If you know that the optimal block size is ~2k, why not have an ‘automatic’ calculation of this parameter within the function, depending on the input matrix? This would also allow you to deal with an unequally sized final block. Or am I mistaken here?
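    Something along these lines might already do (just a thought):

    ## cap the block size at ~2k and derive the blocks from it
    size <- min(ncol(x), 2000)
    splits <- split(1:ncol(x), ceiling(seq_len(ncol(x)) / size))  # last block = remainder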

    @David {couldn’t comment at your blog directly}
    Same issue as above. Additionally, it would have been great to mention two further points within your post/function:
    – require(multicore) at the function head
    – the doMC/multicore packages are *not* available for Windows

    Hence, a platform-independent version (loading the appropriate packages) would be excellent.

    Nevertheless, thanks to both of you again. Happy science!

    • anspiess says:

      Thanks for pointing me to the bugs! I will make the modifications to ‘x’, incorporate an automatic 2k size calculation and then post a link to the new version on my blog. Haven’t got a clue with respect to platform-independent parallelization though… Can anyone give me a hint?

      Cheers,
      Andrej

      • Jan says:

        You can query the Platform using:
        .Platform

        You then need to set up a different backend for the ad hoc cluster. On my machine this seems to work (using the framework of your function):

        library(doSNOW, quietly = TRUE)  # also attaches 'foreach' and 'snow'
        stopifnot(length(ncore) == 1 && !is.na(as.integer(ncore)))
        ## a socket cluster works on Windows and Unix alike
        cl <- makeCluster(spec = rep("localhost", ncore), type = "SOCK")
        registerDoSNOW(cl)
        foreach(i = 1:nrow(x.subs.comb)) %dopar% {
          ## ... blockwise work ...
        }
        stopCluster(cl)

        However, I couldn't get David's function to work. Not sure if it is a Unix/Windows problem. The point is that I cannot assign to the pre-defined output matrix (corMAT) within the %dopar% loop, which makes sense to me, as two parallel processes may try to write into the same object at the same time.

        A quick and dirty solution is to store the resulting list from
        foreach() %dopar% {}
        and assign its values in a second (standard) for-loop (I can paste the full code if it is unclear what I mean).
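        To illustrate (a sketch, not the full code; x.subs.comb as in my snippet above, with x.subs standing in for the list of column-index blocks and corMAT for the preallocated output):

        ## workers only *return* the blocks; nothing writes to corMAT in parallel
        res.list <- foreach(i = 1:nrow(x.subs.comb)) %dopar% {
          g1 <- x.subs[[x.subs.comb[i, 1]]]
          g2 <- x.subs[[x.subs.comb[i, 2]]]
          cor(x[, g1], x[, g2])
        }
        ## the master process then fills the matrix serially
        for (i in 1:nrow(x.subs.comb)) {
          g1 <- x.subs[[x.subs.comb[i, 1]]]
          g2 <- x.subs[[x.subs.comb[i, 2]]]
          corMAT[g1, g2] <- res.list[[i]]
          corMAT[g2, g1] <- t(res.list[[i]])
        }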

        Anyway, I expect that parallelization will unfortunately increase the memory demand, which is the problem you tried to solve in the first place. So I wouldn't bother too much with it but just enjoy the memory savings you already achieved.

  4. I’m trying this out in parallel with a 4901 x 340,000 matrix. A regular cor takes about 3 hours on my node, and a little under 2 with crossprod. I’d love to get this under an hour.

    • anspiess says:

      Hmm, have you sped up my function with ‘crossprod’? Please tell me how!
      You might increase speed if you increase the size of the submatrices, but there seems to be some magic border.
      What is your setup at the moment?

      Cheers,
      Andrej

  5. In my case, I wanted a correlation matrix of genetic relatedness between 4901 people. So 4901 people genotyped at 340k markers.

    I just modified your and David’s functions, replacing cor with crossprod, which gives me an N x N matrix of people’s relatedness. Dividing rows and columns by the square root of the diagonal gets me the same as cor() (for column-centered data):

    G <- crossprod(G)         # cross-product of the (column-centered) data
    D <- sqrt(diag(G))        # column norms
    C <- sweep(G, 2, D, "/")  # scale the columns
    C <- sweep(C, 1, D, "/")  # scale the rows

  6. Oh, and I split the dopar loop into 88 blocks, so 4928/88 = 56 people with 340K markers each.

    Now, on 16 cores, it finishes the full matrix in under an hour and uses about 30 GB of my 64 GB of RAM.

  7. […] values (think arrayfun, matfun, etc.)  This didn't work.  I then learned about a function called bigcor in R from a colleague, and exported my data to try that out.  Even my data was too big for it, and I […]

  8. prettygurrly says:

    I’m writing up the results of a significantly smaller but still quite large matrix (75 x 75). I’m finding I’m using a lot of words, but if I try to print the matrix as a table it takes up 8 pages, and a journal is going to charge me a lot for that. I’ve done cluster analysis too, which I will present as a graphic, but that doesn’t show which variables are negatively or positively correlated, just the ones that have the strongest relationships with each other. Does anyone know of a nifty way to summarise a correlation matrix?

    • anspiess says:

      I think you need a colored representation of your correlations (heatmap). You could use “image”, “heatmap” or “levelplot” (lattice).
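      For example, with base graphics (a quick sketch on toy data; substitute your own 75 x 75 matrix for cm):

      cm <- cor(matrix(rnorm(75 * 200), ncol = 75))  # toy 75 x 75 correlation matrix
      ## blue = negative, white = zero, red = positive correlation
      image(1:75, 1:75, cm, zlim = c(-1, 1), xlab = "", ylab = "",
            col = colorRampPalette(c("blue", "white", "red"))(100))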

      Cheers,
      Andrej

  9. Patrik Waldmann says:

    I would be very happy if you could add the possibility, as in the original cor() function, to calculate correlations not only between all the columns of x, but also between x and y (where y is a vector or matrix).

    • anspiess says:

      Hi Patrik,

      is that so important, given that all you have to do is ‘cbind’?

      ## two vectors
      x <- rnorm(10)
      y <- rnorm(10)
      mat <- cbind(x, y)
      bigcor(mat)
      cor(mat)

      ## two matrices
      x <- matrix(rnorm(100), ncol = 10)
      y <- matrix(rnorm(100), ncol = 10)
      mat <- cbind(x, y)
      res1 <- bigcor(mat)
      res1 <- res1[1:20, 1:20] # convert 'ff' to 'matrix'
      res2 <- cor(mat)
      all.equal(res1, res2) # TRUE

      But if you insist, I might do it 😉

      Cheers,
      Andrej

      • Patrik Waldmann says:

        Well, there are situations where you have millions of columns in x and just want each of their correlations with a vector y. To cbind x and y in this case would lead to unnecessary computations and actually an unfeasible problem. So, yes, I insist!

      • anspiess says:

        Ok, that makes sense…
        Will see what I can do, maybe not a package update but just a function posted here…
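        Roughly along these lines, perhaps (an untested sketch; corXY is just a working title, with y a single vector):

        corXY <- function(x, y, size = 2000) {
          N <- ncol(x)
          ## correlate the columns of x with y block by block,
          ## never forming the full N x N matrix
          splits <- split(1:N, ceiling(seq_len(N) / size))
          res <- numeric(N)
          for (g in splits) res[g] <- cor(x[, g], y)
          res
        }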

        Cheers,
        Andrej

  10. Erwan says:

    Hello,

    Thanks for creating this package!
    I installed R and loaded “propagate” after I found your website through Google.

    I loaded my datafile (n=2000; 8602 variables) and got bigcor to run fine and quickly. However only a few correlations are being displayed in the console; not the 8602 x 8602 matrix I was looking for.

    I used: bigcor(mydata, fun = "cor", size = 2000, verbose = TRUE)

    Is there a way to display the whole correlation matrix in the console? Or to export it in csv (or any other format)?

    It might be because I do not understand what ff is doing?

    Best regards,
    Erwan

    • anspiess says:

      Hi Erwan,

      to look at it in the console, you have to convert it to a matrix:

      COR <- bigcor(mydata, fun = "cor", size = 2000, verbose = TRUE)
      COR <- COR[1:nrow(COR), 1:ncol(COR)]

      However, if your data is too big to view in R, you can convert it to ‘ffdf’ (ff dataframe) and export it to .csv:

      COR <- bigcor(mydata, fun = "cor", size = 2000, verbose = TRUE)
      COR <- as.ffdf(COR)
      write.table(COR, file = "C:\\temp\\test.csv")

      Does that work?

      Cheers,
      Andrej

      • Erwan says:

        Hi Andrej,
        Thanks so much for your answer. It worked!

        In case there are more beginners in R; the script I finally used is:
        COR <- bigcor(mydata, fun="cor", size=2000, verbose= TRUE)
        COR <- as.ffdf(COR)
        write.table(COR, file = "C:/myexportfile.csv", sep = ",", qmethod = "double")

        Kind regards,
        Erwan

  11. Dear Andrej,

    Hi, I would respectfully like to ask you something related to bigcor (and ff, apparently). Maybe I am wrong, but with a small test matrix all is OK with bigcor, whereas with my real matrix I am receiving this warning:

    Loading required package: minpack.lm
    Error in if (length > .Machine$integer.max) stop("length must be between 1 and .Machine$integer.max") :
    missing value where TRUE/FALSE needed
    Calls: bigcor -> ff
    In addition: Warning message:
    In ff(vmode = "double", dim = c(NCOL, NCOL)) :
    NAs introduced by coercion to integer range
    Execution halted

    What does it mean? Am I doing something wrong, or is it a kind of ff constraint?

    > dim(dat)
    [1] 70736 72

    I am sure that the matrix does not have missing values. So I am really lost; do you understand the warning? Any clue would be great.

    Thank you very much,

    alex

    • anspiess says:

      Hi Alex,

      Happy New Year!
      Is “dat” your complete matrix? If so, you can use the R function “cor” without any problems, because it only has 72 columns:

      res <- cor(dat)

      If your real matrix is bigger, then try setting the "size" parameter smaller than the number of columns, e.g. if you have 2000 columns, use "size = 1000". Otherwise I don't really know; is class(dat) == "matrix"?

      Cheers, Andrej


  12. Dear anspiess,

    Thanks a lot for your reply, I didn’t receive any alert by email. Just now I saw your reply by chance (maybe I forgot to set the notification).
    I was wrong, my suitable data set is the transposed matrix:
    dim(dat)
    [1] 72 70736

    but anyway, the problem was apparently related to RStudio and some permission problems on the server. Now it is working. Thanks a lot !!!!!

  13. Nica says:

    Hi,

    Nice post!

    I’ve been trying to run the bigcor function on a matrix with nrow=144 and ncol=156267, but get an error message like the one below:

    Error in if (length > .Machine$integer.max) stop("length must be between 1 and .Machine$integer.max") : missing value where TRUE/FALSE needed
    In addition: Warning message: In ff(vmode = "single", dim = c(NCOL, NCOL)) :
    NAs introduced by coercion to integer range

    What can I do?

    • anspiess says:

      Hi Nica,

      when your matrix has 156267 columns, library ‘ff’ needs to preallocate a matrix with 156267 * 156267 cells:


      > traceback()
      2: ff(vmode = "double", dim = c(NCOL, NCOL))
      1: bigcor(X)

      The cells are of class double (8 bytes), so the preallocated matrix takes 156267 * 156267 * 8 bytes = 1.95355E11 bytes,
      which is 1.95355E11 / 1024/1024/1024 = 182 GiB! I suppose that exceeds your RAM…
      (The cell count itself, 156267 * 156267 = 2.44E10, also exceeds .Machine$integer.max, which is why ff bails out with the error above.)
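      The back-of-the-envelope check in R:

      NCOL <- 156267
      NCOL * NCOL * 8 / 1024^3  # ~ 182 GiB for a double-precision NCOL x NCOL matrix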
      Maybe you can split your data somehow…

      Cheers,
      Andrej

  14. none says:

    Great stuff!! Very useful!
    Is it possible to include "corr.test" from the psych package instead of "cor", or something similar (https://sites.google.com/site/manabusakamoto/home/r-functions-and-scripts/pair-cor) that gives p-values after multiple-testing correction? Thanks

  15. Muhammad Irfan Khan says:

    Dear Andrej
    I am trying to compute the correlation for the following (assuming that your function only calculates column-to-column correlations, I took the transpose of my matrix and converted it to the following)

    1803 63480

    I get the following errors

    Error in if (length > .Machine$integer.max) stop("length must be between 1 and .Machine$integer.max") :
    missing value where TRUE/FALSE needed
    In addition: Warning message:
    In ff(vmode = "single", dim = c(NCOL, NCOL)) :
    NAs introduced by coercion to integer range
    Timing stopped at: 0 0.001 0.001

    Kindly guide me what could be wrong

  16. Khan Irfan says:

    Dear Andrej
    Hi. I want to calculate the correlation for the following dimensions (assuming bigcor calculates column * column correlations, I took the transpose of my original matrix and converted it to the following, because I want to calculate correlations between isoforms of Zea mays)

    dim(data)
    [1] 1803 63480

    I get the following errors. Kindly let me know how I can calculate the correlation, how to fix the errors and how to avoid the warnings.

    system.time(res <- bigcor(data, nblocks = 10))
    Error in if (length > .Machine$integer.max) stop("length must be between 1 and .Machine$integer.max") :
    missing value where TRUE/FALSE needed
    In addition: Warning message:
    In ff(vmode = "single", dim = c(NCOL, NCOL)) :
    NAs introduced by coercion to integer range
    Timing stopped at: 0.001 0 0.001

    Cheers
    Khan

  17. yijianhuang says:

    I have a few missing values. Is it possible to use bigcor?

  18. kb472@georgetown.edu says:

    Is there a way to do pairwise Pearson correlation tests as well? Thanks.

  19. anspiess says:

    … from a few requests lately, I’m in the process of integrating a cor.test version into bigcor. Probably in 1 to 2 weeks.
