R may look like a dreadful programming language, especially if you come from a computer science background. That said, there is no doubt that R is a brilliant tool for interactive data analysis.
Or as my friend puts it: “The difference between R and Python is that R was designed by statisticians who knew a bit of programming while Python was designed by programmers who knew a bit of statistics.”
We do not think it matters that much how good, expressive or fast R is as a programming language. The real comparative advantage of R lies in its huge and active community of users and developers, a nearly infinite universe of high-quality packages, and easily accessible online resources that are at your fingertips for free. Be that as it may, in this article we would like to give you some advice on how to take advantage of R in practice and survive.
The purpose of this document is not to serve as an introduction to R’s syntax, so you will not find sections like How to concatenate vectors in R here. Instead, this mini series should help you feel more comfortable when working with R, speed up your analyses, and avoid the mistakes and inefficiencies we had to overcome the hard way. Whether you are taking your first steps with R or are already an experienced R user, we believe you will find relevant advice here, or at least some inspiration.
In a typical data science workflow there is less pressure on code refactoring, readability, testing and expressivity than in full-blown software development. Still, it is worth following at least the basic rules of coding etiquette, especially in a collaborative project or for code that goes into production. For R specifically there are two really well designed style guides:
Personally I prefer the latter one. A naming convention for files, functions and variables should also be agreed on at the beginning of every project. You might consider using a code linting tool (e.g. lintr) to help keep your code clean and readable. Last but not least: comment your code! Trust us, the time and effort you invest in clean and readable code will pay off.
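To make this concrete, here is a minimal sketch of what consistently named and commented code might look like. The function name, the variable names and the data are made up for illustration, and snake_case is only one possible convention, not the one your project has to use.

```r
# Illustrative sketch: all names and values below are hypothetical.

# Return the share of missing values in a numeric vector.
# x: numeric vector. The result is a single number between 0 and 1.
share_missing <- function(x) {
  stopifnot(is.numeric(x))
  mean(is.na(x))
}

monthly_sales <- c(120, NA, 135, 150, NA)
missing_share <- share_missing(monthly_sales)
print(missing_share)

# If the lintr package is installed, a script can be checked like this
# (the file name is hypothetical):
# lintr::lint("my_analysis.R")
```

The comments say what the function expects and returns, so a colleague does not have to read the body to use it.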
For efficient and enjoyable coding you should set up a development environment that suits your needs and does not bother you with a distracting workflow or an ill-designed interface. Depending on the type of project, personal taste and experience, programmers tend to prefer either a full-fledged IDE (integrated development environment) or a solution based on a lightweight text editor. The default option for R is its own GUI, which comes with R when you download it to your computer. However, I do not recommend using it, as it has no advantages over the following options.
- RStudio is a very well designed IDE developed specifically for R. Everything works out of the box, so you do not have to tinker with settings. It offers integration with Git and SQL databases, code templates, debugging tools and much more. RStudio is simply great and in my opinion it should be your number one choice. Check it out before you try anything else.
- Visual Studio is Microsoft’s flagship IDE, designed primarily for .NET and C#. However, since Microsoft has supported R in recent years, Visual Studio integrates very well with R too. Unlike RStudio, Visual Studio is designed as a heavy-duty tool for software development. Hence it offers more sophisticated project management tools and better integration with Git and SQL. In general it feels more robust but also more complicated. It probably does not make much sense as a tool for your regression experiments, though it can be beneficial for larger collaborative projects.
- Text editors are a popular choice for coding. After all, code is nothing more than a text file. The main advantage of text editors is that they are simple, universal and lightweight, and do not distract you from coding with a plethora of panels, toolbars and options. Personally I have had a great experience with Sublime Text 3, a text editor with a clean design and a package manager that lets you extend its functionality. Sublime Text is very popular among web developers, so there is already a bunch of packages that turn it into a lightweight yet capable IDE. What I like most about Sublime Text is the SublimeREPL package, which allows you to run multiple interpreters side by side (e.g. R, Python 2.7, Python 3.5 and LaTeX in one window). You can also give the beautifully designed Atom editor by GitHub a try.
- Jupyter Notebook is a web-based application, primarily designed for Python, that allows you to create documents containing code that can be evaluated, visualizations, formatted text and interactive elements. It is brilliant for sharing and explaining code, ideas and insights. Its beauty lies in the reproducibility of an analysis and in the ability to effectively combine human language for explanation with code for computation. Unfortunately there are some drawbacks. Integration with version control tools is possible, but so far unsatisfactory. A notebook also might not be the ideal format for larger projects with a complicated code structure and many dependencies. Note that Jupyter is not the only computational notebook project out there. Check out Zeppelin and R Notebooks as well.
Note that the list above is not exhaustive. You can easily set up R syntax highlighting and evaluation in the majority of IDEs and text editors (e.g. Emacs, Eclipse, Notepad++, or Vim if you dare).
R is known and appreciated for its high-quality graphical output. You can use base R functionality to get charts quickly (e.g. hist(dataset$numeric_variable) for a histogram). However, I really advise you to check out and use the following packages, which take the quality of your charts to the next level.
- ggplot2: A great graphics package with almost infinite adjustability and a well-designed API based on Leland Wilkinson’s grammar of graphics.
- Plotly: An API for interactive plots that can be published on the web. Its syntax is not as coherent and pragmatic as that of ggplot2, but it is great for sharing ideas and publishing online. It also contains the great ggplotly function, which instantly turns your ggplot2 chart into an interactive web-based visualization. Resources:
- Plotly for R Documentation
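As a rough illustration of the three approaches mentioned above, the sketch below draws the same histogram with base R, with ggplot2, and then converts the ggplot2 chart with ggplotly. It uses the built-in mtcars dataset as a stand-in for your own data, the chart titles and colours are arbitrary choices, and it assumes the ggplot2 and plotly packages are installed (those parts are skipped otherwise).

```r
# Base R: a quick histogram of a numeric column from the built-in mtcars data.
mpg_hist <- hist(mtcars$mpg, main = "Miles per gallon", xlab = "mpg")

# ggplot2: the same chart built up layer by layer.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  mpg_plot <- ggplot(mtcars, aes(x = mpg)) +
    geom_histogram(bins = 10, fill = "steelblue", colour = "white") +
    labs(title = "Miles per gallon", x = "mpg", y = "Number of cars")
  print(mpg_plot)

  # plotly: turn the ggplot2 chart into an interactive, web-based one.
  if (requireNamespace("plotly", quietly = TRUE)) {
    interactive_plot <- plotly::ggplotly(mpg_plot)
  }
}
```

In an interactive session, typing interactive_plot at the console opens the chart in the viewer or browser, where you can hover over the bars and zoom.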
The design of a chart is of course a matter of personal taste, audience, use case and so on. Still, I recommend Edward Tufte’s book The Visual Display of Quantitative Information, which covers great examples and best practices of charting and plotting. Most importantly, always remember that “pies are good for eating and not for charting”.
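If you are tempted to reach for a pie chart, a plain bar chart usually shows the same counts more readably. Here is a small base R sketch, again using the built-in mtcars dataset as a stand-in for your own data, with labels chosen purely for illustration.

```r
# Count the cars in each cylinder category of the built-in mtcars data.
cylinder_counts <- table(mtcars$cyl)

# A bar chart lets the reader compare the categories by bar height.
barplot(
  cylinder_counts,
  main = "Cars by number of cylinders",
  xlab = "Number of cylinders",
  ylab = "Number of cars"
)

# For comparison, the pie version of the same data would be:
# pie(cylinder_counts)
```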
In the first part of the Living with R series we shared our tips on how to make coding in R efficient and how to produce top-quality graphical output. In the next part we will take a look at publishing the results of an analysis, speeding things up and more.