Original: 2017-11-20 Updated: 2017-12-02
Many users tell me that R is slow. With old R releases that was largely true, because those versions used R's own numerical libraries instead of optimized ones.
But numerical libraries are not the complete story. In many cases slow execution can be attributed to inefficient code, and more precisely to not following one or more of these good practices:
I would add another good practice: "use the tidyverse". Because tidyverse packages such as dplyr benefit from Rcpp, their C++ backend can be faster than the equivalent operations in base (i.e. plain vanilla) R.
The idea of this post is to clarify some ideas. R does not compete with C or C++ because they are different kinds of languages. R with the data.table package may compete with Python with the numpy library. This does not mean that I'm defending R over Python or vice versa. The reason is that both the R and Python reference implementations are interpreters, while C and C++ implementations are compilers, and this means that C and C++ will always be faster because, in really over-simplified terms, compiled code is closer to the machine.
Open RStudio and run sessionInfo(). If you read something like:
```
Matrix products: default
BLAS/LAPACK: /opt/intel/compilers_and_libraries_2018.0.128/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
```

or

```
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
```
The important part is the libmkl or libopenblas at the end of those lines. Either of those means that you are ok and using your resources properly.
But, if you see something like this:
```
Matrix products: default
BLAS: /opt/R/R-3.4.2-defaults/lib/R/lib/libRblas.so
LAPACK: /opt/R/R-3.4.2-defaults/lib/R/lib/libRlapack.so
```
With variants such as libRblas or libRlapack at the end of the lines, you are wasting time because of setup inefficiencies, and I invite you to reinstall R properly.
As an Ubuntu user I can say the basic R installation from the Canonical or CRAN repositories works for most of the things I do on my laptop.
When I use RStudio Server Pro© that's a different story: there I really want to optimize things, because when I work with large data (i.e. 100 GB in RAM) even a 3% gain in resource efficiency or execution time is valuable.
Installing R with OpenBLAS will give you a tremendous performance boost, and that will work for most laptop situations. I explain how to do that in detail for Ubuntu 17.10 and Ubuntu 16.04, but a general setup would be as simple as one of these two options:
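A sketch of what those two options can look like on Ubuntu (package and alternative names are the ones I know from recent releases and may differ on yours):

```shell
# Option (1): install OpenBLAS first, then R from the repositories.
# Ubuntu's r-base links against the system BLAS through the alternatives
# mechanism, so it picks up OpenBLAS automatically.
sudo apt-get update
sudo apt-get install -y libopenblas-dev
sudo apt-get install -y r-base r-base-dev

# Option (2): R is already installed; add OpenBLAS and, if needed, point
# the BLAS alternative at it (the alternative name varies by release).
sudo apt-get install -y libopenblas-dev
sudo update-alternatives --config libblas.so.3
```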
Option (1) is a substitute for option (2). It's totally up to you which one to use, and both will give you a really fast R compared to installing it without OpenBLAS.
Like I explained in this post, one reason to take some time to install R properly is when data.table or other packages return curious messages when you load them. In particular, the R binaries for OS X are not optimized, and if you install and load data.table it will show this message:
This installation of data.table has not detected OpenMP support. It will still work but in single-threaded mode. If this is a Mac and you obtained the Mac binary of data.table from CRAN, CRAN's Mac does not support OpenMP.
If that's your case you will really benefit from following my OS X post and installing R using Homebrew. I know it is slow to compile the sources, but it is not cool to have a cool MacBook© and do data analysis using only one core of the processor.
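At the time of writing, a Homebrew-based installation looked roughly like this (the tap and the build option may well have changed since, so check `brew info r` first):

```shell
# Install R from the homebrew/science tap, compiled from source on your
# machine and linked against OpenBLAS instead of CRAN's single-threaded binary.
brew tap homebrew/science
brew install r --with-openblas
```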
I am somewhat ignorant about Windows. When I used it I found no numerical libraries that can be installed easily, or at least more easily than what I explain in the rest of this post.
Since Microsoft© R Open is an R instance that comes with the Intel© MKL numerical libraries enabled by default, I'd install that on Windows; its graphical installer is straightforward.
I already use R with OpenBLAS just like the setup above. I will compile parallel R instances to do the benchmarking.
My benchmarks indicate that, in my case, it's worth the time it takes to install Intel© MKL. The execution time is strongly reduced for some operations when compared to R with OpenBLAS.
Run this to install MKL:
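A sketch of the installation using Intel's apt repository (the key URL and the exact package version below are examples from around 2018; list what is actually available with `apt-cache search intel-mkl`):

```shell
# Add Intel's repository key and the MKL repository, then install MKL.
wget -qO - https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SWPRODUCTS-2019.PUB | sudo apt-key add -
sudo sh -c 'echo "deb https://apt.repos.intel.com/mkl all main" > /etc/apt/sources.list.d/intel-mkl.list'
sudo apt-get update
sudo apt-get install -y intel-mkl-64bit-2018.0-033
```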
To use MKL with R, compiling R from source is the only option in this case. So as not to interfere with my working installation (installed via apt-get), I decided to compile a parallel instance:
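A sketch of such a build, assuming MKL is installed under /opt/intel; the prefix and the MKL link line are examples, so check the "R Installation and Administration" manual and Intel's link line advisor for the flags that match your setup:

```shell
# Make MKL's libraries visible to the build.
source /opt/intel/mkl/bin/mklvars.sh intel64

# Download and unpack the R sources.
wget https://cran.r-project.org/src/base/R-3/R-3.4.2.tar.gz
tar -xzf R-3.4.2.tar.gz
cd R-3.4.2

# Configure a parallel instance under its own prefix, linked against MKL.
./configure --prefix=/opt/R/R-3.4.2-mkl \
            --enable-R-shlib \
            --with-blas="-fopenmp -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -lpthread -lm" \
            --with-lapack
make -j"$(nproc)"
sudo make install
```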
There is a lot of discussion, and strong evidence from different stakeholders in the R community, indicating that this is by far the most inefficient option. I compiled it just to make the benchmark complete:
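A sketch of that build; when configure is given no external BLAS/LAPACK, R compiles its bundled reference routines (the libRblas.so and libRlapack.so shown earlier):

```shell
# Download and unpack the R sources.
wget https://cran.r-project.org/src/base/R-3/R-3.4.2.tar.gz
tar -xzf R-3.4.2.tar.gz
cd R-3.4.2

# No --with-blas / --with-lapack: R builds and uses its own
# unoptimized reference BLAS/LAPACK.
./configure --prefix=/opt/R/R-3.4.2-defaults --enable-R-shlib
make -j"$(nproc)"
sudo make install
```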
This R version includes MKL by default and it's supposed to be easy to install. I could not make it run, and that's bad because different articles (like this post by Brett Klamer) state that this R version is really efficient, though no different from standard CRAN R with MKL numerical libraries.
In any case here's the code to install this version:
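For reference, the installation I attempted looked roughly like this (the MRAN download URL is the one I recall for 3.4.2 and may have moved):

```shell
# Download and unpack Microsoft R Open, then run its installer.
wget https://mran.blob.core.windows.net/install/mro/3.4.2/microsoft-r-open-3.4.2.tar.gz
tar -xzf microsoft-r-open-3.4.2.tar.gz
cd microsoft-r-open
sudo ./install.sh
```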
Update: I followed the same steps above and it works on Ubuntu 16.04, but I still can't install it on a machine running Ubuntu 17.10.
My scripts above edit ~/.profile. This is so I can open RStudio and work with differently configured R instances on my computer.
I released the benchmark results and scripts on GitHub. The idea is to run the same scripts from AT&T© and Microsoft© to see how the different setups perform.
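As a minimal sketch (not the actual benchmark scripts), timing one linear-algebra task with whichever R instance is active looks like this:

```shell
# Time an SVD of a 1000 x 1000 matrix with the R that is first on the PATH.
Rscript -e 'set.seed(1)
            m <- matrix(rnorm(1e6), nrow = 1000)
            print(system.time(svd(m)))'
```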
To work with CRAN R with MKL I had to edit ~/.profile because of how I configured the instances. So I ran nano ~/.profile and commented out the last part of the file to obtain this result:
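The relevant tail of ~/.profile is a hypothetical sketch like the following; RStudio on Linux reads the RSTUDIO_WHICH_R variable to decide which R binary to launch, so I keep one line per instance and leave exactly one uncommented (the paths are examples matching my parallel installations):

```shell
# Tail of ~/.profile: pick the R instance RStudio should use.
export RSTUDIO_WHICH_R=/usr/bin/R                        # CRAN R with MKL
# export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-openblas/bin/R   # CRAN R with OpenBLAS
# export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-defaults/bin/R   # no optimized libraries
```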
After that I logged out and back in, then opened RStudio to run the benchmark.
The other two cases are similar: their benchmark results were obtained by editing ~/.profile, logging out and in, and opening RStudio with the corresponding instance.
As an example, this result starts with the R version and the corresponding numerical libraries used in that session. All other results are reported in the same way.
And here are the results from the AT&T© script:

And here are the results from the Microsoft© script:
| Task | CRAN R with MKL (seconds) | CRAN R with OpenBLAS (seconds) | CRAN R with no optimized libraries (seconds) |
|------|---------------------------|-------------------------------|----------------------------------------------|
| Singular Value Decomposition | 7.268 | 18.325 | 47.076 |
| Principal Components Analysis | 14.932 | 40.612 | 162.338 |
| Linear Discriminant Analysis | 26.195 | 43.75 | 117.537 |
The benchmarks presented here are in no way a definitive end to the long discussion on numerical libraries. My results show some evidence that, because of the extra speed in some operations, I should use MKL.
One of the advantages of the setup I explained is that you can also use MKL with Python; in that case numpy calculations will be boosted.
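To check which libraries your Python is actually using, numpy can report its own build configuration (assuming numpy is installed):

```shell
# Print the BLAS/LAPACK configuration numpy was built with; with MKL you
# should see "mkl" mentioned among the libraries.
python3 -c 'import numpy; numpy.show_config()'
```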
Using MKL with AMD© processors might not provide an important improvement compared to using OpenBLAS. This is because MKL uses processor-specific instructions that work well with Intel i3 or i5 processors but not necessarily with non-Intel models.