R for newbs like me

R is a free, very powerful programming language that is used a lot in data science and increasingly often in biomedical science. Like any statistical program, it has a bit of a learning curve associated with it.

R is the base program, but much of its functionality comes from packages - bundles of functions, data and documentation that you install to expand its capabilities. There are thousands of them, so it can be pretty confusing.

What follows are my preferences and recommendations as a non-statistician researcher who likes to do his own statistics and play with data (and, frankly, learn computer languages).


Posit, the company that makes RStudio, used to be called RStudio as well. They rebranded in 2022 to reflect that their tools now support Python and its data science capabilities alongside R, but you can still just learn R and use it through RStudio to keep things simple.

Link for the interested (not necessary to get started)


Posit have developed an entire ecosystem of packages mainly to enable data science in R. Called the Tidyverse, these packages work together to help you import data into R, clean it, transform it, and visualize it so that you can run analyses. You can get all of them at once by installing the meta-package "tidyverse." There are many packages and methods for handling data with R, but the Tidyverse is thoughtful, popular and makes sense, so I have decided it's enough to get good at these packages to handle the non-statistical parts of working with data.


Tidy Data and Reproducible Research

There is a strong philosophy behind all of Posit's work that involves two major concepts:

Tidy Data - Data comes to us in spreadsheets and tables in all sorts of formats, with header rows, pivot tables, multiple data elements per cell, etc. It's all very messy and presents challenges for any program trying to analyze or visualize it. The concept of tidy data can be summarized as "1. Each variable forms a column. 2. Each observation forms a row. 3. Each type of observational unit forms a table." Sounds simple, but we often conflate variables and observations. You can read a theoretical paper about this by Hadley Wickham, Posit's chief scientist, or get more details from R for Data Science described below.

Reproducible Research - This is a big and important idea - especially for the Open Science movement - that is operationalized in the following way: keep your data sources unaltered and use your data science programs (R, tidyverse, etc.) to write a script that performs all the necessary import, data cleaning, transformation, visualization and analysis. Readers and users should be able to see exactly how you got from the raw data to your analysis at every step, and re-run the script to replicate it for themselves.
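Here is a rough sketch of how the two ideas can look together in a single script. The patient table, its column names and the blood pressure values are all made up for illustration, and it assumes the tidyr and dplyr packages (both part of the tidyverse) are installed. In a real project the raw table would come from reading an untouched file rather than being typed in.

```r
library(tidyr)
library(dplyr)

# Hypothetical "messy" raw data: one row per patient, one column per visit.
# The variable "visit" is hidden in the column names.
raw <- data.frame(
  patient = c("A", "B", "C"),
  visit_1 = c(120, 135, 128),
  visit_2 = c(118, 130, 131)
)

# Tidy it: each variable (patient, visit, systolic_bp) gets a column,
# and each observation (one patient at one visit) gets a row.
# `raw` itself is never changed; the result goes into a new object.
tidy <- raw %>%
  pivot_longer(
    cols = starts_with("visit_"),
    names_to = "visit",
    names_prefix = "visit_",
    values_to = "systolic_bp"
  ) %>%
  mutate(visit = as.integer(visit))

# Analysis step: mean blood pressure at each visit.
summary_by_visit <- tidy %>%
  group_by(visit) %>%
  summarise(mean_bp = mean(systolic_bp), .groups = "drop")

print(tidy)
print(summary_by_visit)
```

Because every step from raw table to summary lives in the script, anyone with the same raw data can re-run it and get the same result.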

Getting Started with R

Below are my recommendations to get started with R.

If you just get R, you'll get a basic console where you type commands, which is very functional, but not that friendly. RStudio is an integrated development environment (IDE) for R, with a script editor, console, plot viewer and file browser in one window, and it has become standard for new users. You will have to download both R and RStudio, which you can do conveniently at the link above. Once up and running, I recommend installing the "tidyverse" meta-package.
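As a minimal sketch, this is one way to do that from the R console. It needs an internet connection, and the CRAN mirror URL is my own choice for the example; RStudio usually has a mirror configured already, in which case you can leave the repos argument out.

```r
# Install the tidyverse meta-package only if it is not already present.
# Installation is needed once per R installation.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse", repos = "https://cloud.r-project.org")
}

# Load the core tidyverse packages. Do this at the top of every script
# or at the start of every session.
library(tidyverse)

# List the packages that belong to the tidyverse.
tidyverse_packages()
```

Installing happens once, but the library() call has to be repeated each time you start R.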

R for Data Science is a web-based textbook, written by Hadley Wickham (Posit's chief scientist) and co-authors, that outlines a coherent way to do data science using the tidyverse packages. It doesn't go into a lot of statistical analysis. It's clear, accessible and recently updated. Installing the "tidyverse" meta-package will enable you to follow along with this book.

This e-book focuses on statistical learning using R and some of the Tidyverse functions.

Additional Notes on Using R and RStudio

Other R statistics tutorial websites to try