# R for newbs like me

R is a free, very powerful programming language that is used a lot in data science and increasingly often in biomedical science. Like any statistical program, it has a bit of a learning curve associated with it.

R is the base program, but much of its functionality comes from packages - bundles of functions, data and documentation that expand its capabilities. It can be pretty confusing.

What follows below are my preferences and recommendations as a non-statistician researcher who likes to do his own statistics and play with data (and, frankly, learn computer languages).

## Posit

Posit (formerly RStudio) are the company that makes RStudio, the most popular interface for working with R. They re-branded to reflect their support for Python and other data science tools alongside R, but you can still just learn R and use it through RStudio to keep things simple.

Link for the interested (not necessary to get started)

## Tidyverse

Posit have developed an entire ecosystem of packages mainly to enable data science in R. Called the Tidyverse, all these packages work together to help import data into R, clean it, transform it, and visualize it in order to run analyses. You can get all of those packages at once by installing the meta-package "tidyverse." There are many packages and methods for handling data with R, but the Tidyverse is thoughtful, popular and makes sense, so I have decided it's enough to get good at these packages to handle the non-statistical parts of working with data.

## Tidy Data and Reproducible Research

There is a strong philosophy behind all of Posit's work that involves two major concepts:

Tidy Data - Data comes to us in spreadsheets and tables in all sorts of formats, with header rows, pivot tables, multiple data elements per cell, etc. It's all very messy and makes the data hard for any program to analyze or visualize. The concept of tidy data can be summarized as "1. Each variable forms a column. 2. Each observation forms a row. 3. Each type of observational unit forms a table." Sounds simple, but we often conflate variables and observations. You can read a theoretical paper about this from Hadley Wickham, Chief Scientist at Posit, or get more details from R for Data Science described below.

Reproducible Research - This is a big and important idea, especially for the Open Science movement, that is put into practice in the following way: keep your data sources unaltered and use your data science programs (R, tidyverse, etc.) to create a script that performs all the necessary importation, data cleaning, transformation, visualization and analysis. Readers and users should be able to see exactly how you got from the raw data to your analysis at every step, and replicate it for themselves.
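To make the tidy data idea concrete, here is a minimal sketch using a made-up table of blood pressure readings (the data and the column names are invented for illustration). It assumes the tidyr package, which is installed as part of the tidyverse. In the messy version the visit number is hidden in the column names; the tidy version gives each variable its own column and each observation its own row.

```r
library(tidyr)

# Hypothetical "messy" data: one row per patient, one column per visit.
# The visit number (a variable) is stored in the column names.
messy <- data.frame(
  patient    = c("A", "B", "C"),
  sbp_visit1 = c(120, 135, 128),
  sbp_visit2 = c(118, 130, 125)
)

# Tidy version: one row per patient per visit, one column per variable.
tidy <- pivot_longer(
  messy,
  cols         = c(sbp_visit1, sbp_visit2),
  names_to     = "visit",       # new column holding the visit number
  names_prefix = "sbp_visit",   # strip this text from the old column names
  values_to    = "sbp"          # new column holding the measurements
)

print(tidy)
```

Note that `messy` is never overwritten: the reshaping lives in the script, so anyone can rerun it from the original data and get the same result, which is the reproducible research idea in miniature.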

## Getting Started with R

Below are my recommendations to get started with R.

If you just get R, you'll get only a command-line interface, which is very functional, but not that friendly. RStudio is a graphical front end (an integrated development environment, or IDE) for R that has become standard for new users. You will have to download both R and RStudio, which you can do conveniently at the link above. Once up and running, I recommend installing the "tidyverse" meta-package.
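Installing the meta-package takes one command typed into the R console. The sketch below only installs it if it is missing, and then loads it. The `repos` argument names a CRAN mirror explicitly, which is my assumption for running this outside RStudio; inside RStudio a mirror is normally already set and you can leave it out.

```r
# Run once per computer: download and install the tidyverse packages
# if they are not already there (needs an internet connection).
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse", repos = "https://cloud.r-project.org")
}

# Run at the start of each R session (or at the top of each script)
# to make the core tidyverse functions available.
library(tidyverse)
```

Installing happens once; loading with `library()` has to be repeated every time you start R.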

R for Data Science is a web-based textbook, co-written by Hadley Wickham, Chief Scientist at Posit, that outlines a coherent way to do data science using the tidyverse packages. It doesn't go into a lot of statistical analysis. It's clear and accessible and recently updated. Installing the "tidyverse" meta-package will enable you to use this book.

This e-book focuses on statistical learning using R and some of the Tidyverse functions.

## Additional Notes on Using R and RStudio

### Using R just for analyses (attention medical students!)

It's possible (and desirable) to stick to the precepts of Reproducible Research without learning the whole tidyverse approach to data, and, for students, this might be the most efficient approach.

When you get your data files, save a copy and lock it as the original data file, so that you always have a copy of the unaltered dataset.

Use some sort of version control if you're cleaning and preparing the data in a spreadsheet. The easiest way is to keep brief step-by-step notes of what you did ("renamed columns A-F with single variable terms", "created a new SUM variable from the total of columns F-H") so that you can reproduce it, and to save a new version of your spreadsheet after each set of changes.

When your data is ready for analysis (work with the statistician/mentor to understand the format needed), load the data into R and use the visualization commands (ggplot2 or other) and the statistical commands to run your analyses.
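The last step might look something like the sketch below. It uses only functions that come with R, and everything about the data (the file, the `group` and `score` columns, the choice of a t-test) is made up for illustration. A small fake dataset is written to a temporary CSV file so that the example runs on its own.

```r
# Stand-in for your prepared data file: a small made-up dataset written
# to a temporary CSV so this example runs on its own.
# In practice, skip this step and point read.csv() at your own file.
data_file <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    group = rep(c("control", "treatment"), each = 5),
    score = c(10, 12, 11, 13, 12, 14, 15, 13, 16, 15)
  ),
  data_file,
  row.names = FALSE
)

# 1. Load the data into R (this does not change the file on disk)
dat <- read.csv(data_file)

# 2. Look at the data before analysing it
str(dat)
print(summary(dat))
boxplot(score ~ group, data = dat, ylab = "Score")

# 3. Run the analysis: here, a two-sample t-test comparing the groups
result <- t.test(score ~ group, data = dat)
print(result)
```

Save these commands as a script file rather than typing them one at a time. The script then documents exactly what was done to the data, and rerunning it reproduces the analysis.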

### Citing R in papers

Cite R and its version, not RStudio. R is what's running your analyses. "Citing RStudio is like citing Microsoft Word" (sez the internet).

It's possible to use the citation("whateverpackagenameyou'reusing") command in R to get a citation for each package, but I've had these removed from papers by editors in the past. If you just stick a statement in that says something like: "All analyses were performed using R, v. 4.1.2, using the psych package" (with your own version number), that should be sufficient. I don't think you should bother mentioning the tidyverse packages - just the package that contains your analysis methods.
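For reference, these are the commands that give you the citation and version details. The sketch uses the "stats" package as a stand-in because it ships with R and so runs anywhere; swap in whichever package you actually used.

```r
# Citation for R itself
print(citation())

# The R version you are running, for the methods section
print(R.version.string)

# Citation and version for a specific package. "stats" ships with R;
# replace it with the package you actually used, e.g. "psych".
print(citation("stats"))
print(packageVersion("stats"))

# BibTeX version of the R citation, handy for reference managers
print(toBibtex(citation()))
```

Run these in the same R installation you used for the analysis, so the version numbers you report match what actually ran.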

NEW - JAMA allows this: "The analyses were performed using R version 4.1.2 [22] and packages lme4 [23], lmerTest [24], pbkrtest [25], nloptr [26], and MASS [27]". The reference for base R is the citation obtained from R itself, but for the packages, the references are specific articles that have used the packages themselves (click on the citation numbers to see examples).

### Getting help

Honestly, the way most data scientists learn R, RStudio, etc. is by Googling. It's certainly the way I have learned what I know.

For specific questions, consider Stack Overflow, but look for existing answers BEFORE posting anything, or you risk getting flamed.

## Other R statistics tutorial websites to try

Introduction to Statistics Using the R Programming Language - brief intro tutorial. Larger site is good also.

UCLA Stats website has some good modules and tutorials HERE and HERE - graphics, R fundamentals, specific stats procedures, etc.

A pretty comprehensive biological statistics site with an R companion.