In the first part of this series we discussed how to code efficiently in R on a data science project, and we touched a little on graphics as well. Now we will elaborate a bit more on graphics, and then we will discuss speeding things up in R and extending R’s capabilities with external frameworks.
Business people are often interested in data science but rarely in computer code. To bridge this gap we make slides, write reports, render charts and draw diagrams. If PowerPoint is on your mind now, please hold on. The choice is much wider.
- knitr, rmarkdown, Sweave, bookdown and others are R tools for publishing documents that contain R code. The idea is similar to computational notebooks, but here the focus is more on the final document than on interactivity. You can choose markdown or LaTeX for text formatting, and all of these tools let you publish high-quality documents with a high degree of automation. The simplest way to use them is through their integration with RStudio, where everything works smoothly and the journey from your code to your report is really short. Extensive customization is possible as well. If you are not familiar with markdown or LaTeX it can feel a bit intimidating at first, but imho it is worth getting past the initially steep learning curve. Note that, depending on the package, you have to install pandoc or a TeX distribution on your machine to render PDF documents.
- Notebooks (e.g. Jupyter) are great for sharing ideas and explaining code, as already mentioned. You can export a notebook to PDF and hide the code in order to present only the results. It is quick and elegant.
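
To make the idea of a document that contains R code more concrete, here is a minimal sketch using knitr. It assumes the knitr package is installed; the document text, the chunk label and the variable names are made up for illustration. The source is passed as a character vector, so neither pandoc nor RStudio is needed for this step, and the chunk fence is built with strrep() only to avoid nesting code fences in this article.

```r
library(knitr)

# Three backticks, built programmatically so this example does not
# contain a nested code fence.
fence <- strrep("`", 3)

# A tiny R Markdown source: a heading, one sentence, and one R chunk.
# echo=FALSE hides the chunk's code and keeps only its result.
rmd_source <- c(
  "# Speed summary",
  "",
  "The average speed in the built-in cars dataset is:",
  "",
  paste0(fence, "{r speed-summary, echo=FALSE}"),
  "mean(cars$speed)",
  fence
)

# knit() evaluates the chunk and returns the resulting markdown as a string.
md_output <- knit(text = rmd_source, quiet = TRUE)
cat(md_output)
```

In a real project the source would normally live in an .Rmd file and be rendered to HTML or PDF, for example with the Knit button in RStudio.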
R is appreciated for its wide and active community and its great choice of packages, designed either to deal with R’s shortcomings or to do specialized tasks for you. The choice in the official R package repository, CRAN, is really wide. Moreover, you can install packages from other sources (e.g. Bioconductor, MRAN, GitHub). In general, for almost any data-related problem you can think of, there is very probably already an R package written for it. Here I would like to bring to your attention a couple of packages that have proved useful in our data science projects and have helped us execute our analyses more efficiently.
- Tidyverse by Hadley Wickham. Hadley works as chief data scientist at RStudio and is something of an R celebrity. His tidyverse is a set of high-quality packages that share a common design and the general idea of tidy data. The packages and their vignettes are designed with a focus on code readability and mutual compatibility. In my experience, using the tidyverse makes your analysis significantly faster and cleaner. The packages I find most useful are:
- dplyr for data manipulation. It is computationally efficient, easy to use because it shares its data manipulation principles with SQL, and it allows in-database code evaluation. It has recently become almost a standard tool for data wrangling in R, and I find it definitely superior to base R capabilities.
- magrittr for the pipe operator %>%, which makes your code more readable and less verbose. It took me a while to get used to piping in R, but it is worth the pain.
- ggplot2 for charting. We already discussed ggplot2 above. What dplyr is for data manipulation, ggplot2 is for graphics in R.
- The caret package provides functions and a workflow for machine learning in R. It contains tools for data pre-processing, model training, evaluation, feature selection and more. It provides a clear and easy-to-use framework and an abstraction layer, so you can focus on results instead of technical details. Caret works smoothly with many machine learning algorithms from various packages (e.g. randomForest, elasticnet, ctree etc.). It is also easy to train models in parallel, which comes in handy for computationally heavy tasks such as hyperparameter tuning. Last but not least, caret makes it easy to reuse code across datasets, algorithms or target variables.
- The data.table package focuses on efficient data manipulation in R. Its use case is similar to dplyr’s, but data.table puts computational speed and efficiency ahead of a clear and easy-to-use API. Personally I never got used to its syntax, which is quite similar to base R. I admit, though, that it is really quick and can usually cut running times significantly, especially when working with bigger data.
- The odbc package is there for connecting to a SQL database. There are many packages for that (e.g. RODBC), but in my experience odbc, which plugs into the common DBI interface, is the fastest and the most stable of them all. It is also fully compatible with dplyr, so you can do the majority of your data wrangling in the database without moving your data back and forth all the time. The fact that it works well across various database platforms is just a nice bonus.
- Packrat is here to get you out of version hell. I mentioned a couple of times that R is great because of its packages, right? Well, there is a dark side to this as well. Packages have dependencies and versions, R has versions, dependencies have dependencies and versions… Yes, it is a mess, and packrat is here to help you with it. It basically takes a snapshot of your current package library and remembers it, so the next time you run your scripts or app on a different machine it knows which packages and which versions should be used. Trust me, it comes in handy. A typical use case for packrat is running a Shiny app on a different machine, but it is useful in general for keeping track of package versions. Alternatively you can use the checkpoint package by Microsoft, which is more lightweight and easier to use. Or, if you are paranoid enough, you might wrap everything in a Docker container, which is probably the most robust approach.
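
To give a feel for the difference in style between dplyr with pipes and data.table, here is a small sketch that computes the same summary both ways. It assumes both packages are installed and uses the built-in mtcars dataset; the 100 hp threshold and the result names are arbitrary choices for illustration.

```r
library(dplyr)
library(data.table)

# dplyr with the %>% pipe: each step reads like a verb in a sentence.
dplyr_result <- mtcars %>%
  filter(hp > 100) %>%                       # keep the more powerful cars
  group_by(cyl) %>%                          # one group per cylinder count
  summarise(mean_mpg = mean(mpg),            # average fuel economy per group
            n_cars   = n()) %>%              # number of cars per group
  arrange(desc(mean_mpg))                    # best fuel economy first

# data.table: the same logic in the compact DT[i, j, by] form.
dt <- as.data.table(mtcars)
dt_result <- dt[hp > 100,
                .(mean_mpg = mean(mpg), n_cars = .N),
                by = cyl][order(-mean_mpg)]

print(dplyr_result)
print(dt_result)
```

Both versions produce the same numbers. Which one reads better is largely a matter of taste and of how much the running time matters for the data at hand.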
R is indeed relatively slow. However, there are many ways to make it run faster. First, you should always go through your code and clean out unnecessary evaluations and duplicated work. Significant speed gains can be unlocked by using optimized packages (e.g. data.table for data manipulation). If that is still not enough, there are four main options for you:
- Vectorization: R is a vectorized language, i.e. its basic data types are vectors. Even a single number is a vector with one element, and a data frame is a list of equal-length vectors. You can take advantage of that by using vectorized functions, which operate on whole vectors at once, instead of loops. The apply family of functions is a common and more compact replacement for a for loop as well. Alternatively you can use the functional programming interface provided by the purrr package, which does pretty much the same thing with slightly different syntax and is easier to generalize to different use cases.
- Parallelization (SMP) is the way to go if your problem is easy to divide into smaller pieces and your machine has more than one core. The usual suspects for parallelization in R are hyperparameter tuning, cross-validation, applying the same function to many variables, for loops etc. There is a bunch of packages for parallelization in R (parallel, foreach, doParallel, doSNOW…). Applying the whole concept to your problem and code may seem intimidating at first, but fortunately there are packages that work in parallel by default or are very easy to run in parallel (e.g. caret). The advantages of parallelism are significant, especially when working with large datasets on a remote server with many processor cores.
- C++ is a lower-level programming language, and it is blazing fast compared to R. Hence it often makes sense to use R just as an orchestrator and let C++ do the actual computation. The Rcpp package is based on this idea. It provides an interface between R and C++, so you do not have to deal with passing variables and converting between data types yourself. I recommend this comprehensive tutorial for using Rcpp.
- A different backend is a similar idea to the previous one. In this case you also use R just as an orchestrator while the actual computation happens elsewhere. It may seem like making three right turns instead of one left, but it often makes sense to have R as a simple and easy-to-use interface for data operations and a separate engine for storage and computation. A good example of this approach is using Apache Spark for the heavy lifting from R. There are two main packages for this: sparklyr and SparkR. The choice depends on your environment and preferences, but I find the first one slightly easier to use. Other examples are in-database computation with dplyr, or Microsoft ML Server.
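
The first two options from the list can be sketched with base R alone. The example below compares an explicit loop, a vectorized expression and sapply() on the same task, and then runs a deliberately slow toy function on two worker processes with the parallel package. The function slow_square, the vector sizes and the choice of two workers are assumptions made for illustration only.

```r
library(parallel)  # part of the standard R distribution

set.seed(42)
x <- runif(1e5)

# 1) Explicit for loop: one element at a time.
loop_result <- numeric(length(x))
for (i in seq_along(x)) {
  loop_result[i] <- sqrt(x[i]) * 2
}

# 2) Vectorized: the same arithmetic applied to the whole vector at once.
vec_result <- sqrt(x) * 2

# 3) apply family: more compact than the loop, but the function is still
#    called once per element, so it is not truly vectorized.
sapply_result <- sapply(x, function(v) sqrt(v) * 2)

# 4) Parallelization: split independent tasks across worker processes.
#    slow_square is a toy stand-in for an expensive computation.
slow_square <- function(v) {
  Sys.sleep(0.01)
  v^2
}

cl <- makeCluster(2)                           # start two local workers
par_result <- parLapply(cl, 1:20, slow_square) # tasks are spread over workers
stopCluster(cl)                                # always release the workers

par_values <- unlist(par_result)
```

Wrapping each variant in system.time() on your own machine shows the difference; the vectorized form is usually the fastest for simple arithmetic like this, while parallelization pays off only when each task is expensive enough to outweigh the cost of starting and feeding the workers.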
There are of course more options than those above. You should also consider whether R is the right tool for your problem.
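
If you stay in R but have one hot loop that dominates the running time, the Rcpp route mentioned in the list above can be sketched as follows. This assumes the Rcpp package and a working C++ toolchain are installed; the function name sum_squares_cpp and the input vector are invented for the example.

```r
library(Rcpp)

# Compile a small C++ function and expose it to R under the same name.
# Rcpp converts the R numeric vector to NumericVector and the returned
# double back to an R numeric automatically.
cppFunction('
double sum_squares_cpp(NumericVector x) {
  double total = 0.0;
  for (int i = 0; i < x.size(); ++i) {
    total += x[i] * x[i];   // plain C++ loop, no R overhead per element
  }
  return total;
}
')

x <- c(1, 2, 3, 4)

cpp_value <- sum_squares_cpp(x)  # computed in C++
r_value   <- sum(x^2)            # the same quantity in vectorized R

print(c(cpp = cpp_value, r = r_value))
```

For a case this simple the vectorized R expression is already fast; C++ tends to pay off for loops that cannot easily be vectorized, such as those where each step depends on the previous one.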
- Microsoft R Open is Redmond’s idea of what R should look like. It is derived from Revolution R, created by the start-up Revolution Analytics, which was acquired by Microsoft in 2015. MRO is 100% compatible with vanilla R, so you might not even notice you are using it. It has its own package repository, MRAN. Why should you give a damn about it? MRO allows for multithreading by taking advantage of more efficient backend libraries for linear algebra (BLAS and LAPACK). Given this and the fact that MRO is completely free, it sounds like a pretty good deal. Another of Microsoft’s R-related products, ML Server, goes even further. It aims to abstract completely from where the data is stored. Hence you can evaluate R code in a database, on a remote server, on a Hadoop cluster or just on your local machine. To achieve such a high level of abstraction it introduces the RevoScaleR package, which contains scalable and distributable functions for data manipulation, descriptive statistics and machine learning models. However, in my opinion the best feature of the RevoScaleR library is that it brings its own file format, .xdf, which can be used for persistent data storage. As a result it effectively breaks R’s dependency on main memory. Yes, you can now use R with files bigger than your RAM, and that is awesome.
- Rtools is an extension toolkit for R on the MS Windows platform. Mainly it brings its own C/C++ compiler and other utilities common on UNIX-like operating systems. Rtools is required by some packages (typically those that need to be compiled from source), so it is good to know about it.
- H2O is a machine learning and deep learning framework. There are many similar frameworks (especially for deep learning), and their number is growing so fast that it is almost impossible to stay up to date. I chose H2O for this document because it has a nice API for R, is easy to use and works smoothly out of the box. The main advantages of H2O and similar frameworks are scalability and abstraction. You could indeed write your convolutional neural network in base R, but nobody would understand the code, the training would take ages, and reusing the same code on different data would be a pain. Frameworks solve all those problems. If you are up for deep learning, make sure to check out TensorFlow, Keras and CNTK as well.