Basic R coding overview

This document aligns with the Stata cheat sheet.

1 Tips before you start

  • There is a lot of documentation and support available at your fingertips both within R and outside of R on blogs, other educational websites, Twitter, YouTube and more.

    • Within R, use help(function_name) to get an overview, more options and examples for a specific function. You can also use a question mark.

      help(function_name)
      ?function_name
    • You can also directly search for a package or function from the help panel in RStudio.

  • R is case sensitive (e.g. if a variable is ‘ID’, you must refer to it as ‘ID’, not ‘id’).

  • It is best to save and write your code in an R script (.R), RMarkdown (.Rmd), or Quarto (.qmd) file.

  • Notes/comments can be added for easy annotation of the program. A pound sign/hash (#) denotes a comment: R ignores everything on the line after it, whether in a script or typed interactively. For example:

    # This is a note that
    # R will not read
  • If you start to run a function and need to stop it, you can press the stop sign at the top right of the console.

2 Getting started with R

Open RStudio. If you have used R previously, an old workspace may still be active. However, you always want to start with a fresh session. Go to Tools -> Global Options and, under General, change the settings so that the workspace is never saved or restored: uncheck "Restore .RData into workspace at startup" and set "Save workspace to .RData on exit" to "Never".

Now, if something goes wrong, you can simply quit and reopen RStudio to get a fresh session. You can also go to Session -> Restart R to clear your session.

2.1 Create and execute an R script

In RStudio, you can run R code directly in the console. This is helpful for running quick one-off code. To save your code, write it in an .R script, which allows you to run the same commands in the future to reproduce your work. To open a new .R script, you can:

  1. Click on the “New File” icon in the top left of RStudio, then select “R Script”.
  2. File -> New File -> R Script

Installing packages

To use certain functions in R, you often need to install additional packages. To install a package, run:

install.packages("package_name")

You can also click on the "Packages" tab in the bottom-right window in RStudio. Next, click "Install"; in the small window that pops up, enter the package's name and hit "Install".

When a function belongs to a package, you can use that function by running package_name::function_name() or by running library(package_name) at some point before function_name().
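
For example, both of these approaches call the count() function from the {dplyr} package:

dplyr::count(dataset, varname)

library(dplyr)
count(dataset, varname)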

2.2 Read in a Dataset (.rds, .dta, .xlsx, .csv, .sas7bdat)

Datasets come in various formats, such as ".rds" (R), ".dta" (Stata), ".xlsx" (Excel), ".csv" (comma-separated values), and ".sas7bdat" (SAS) files. Reading some of these formats requires additional packages.

You will need to include the path to the dataset file in your code. To determine the location of a dataset stored on a Windows computer, you can right-click the file, click on 'Properties', and copy and paste the location path. On a Mac, right-click the file, then hold down the Option key so that "Copy filename as Pathname" appears in the menu.

Unlike Stata, you can have multiple datasets open in the same R session. When reading in a dataset, you must store it as an object so that you can refer to it later. This is done in R using the “assignment arrow” <-.
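
For example, a minimal sketch with hypothetical file names, holding two datasets in memory at once:

# Each dataset is stored under its own name
baseline <- readRDS("filepath/baseline.rds")
followup <- readRDS("filepath/followup.rds")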

Reading in datasets can also be done from “File” -> “Import Dataset”. This will bring up a window with options that will help you write the code that can be saved in an R script to use again. The “files” tab in the bottom right window of RStudio may also be helpful to import datasets. It operates similarly to Windows Explorer or the Finder on a Mac.

2.2.1 Reading an R “.rds”-file

The readRDS() function is part of "base R", meaning it does not require a separate package. The {readr} package includes many functions for reading in data that do the same things as the base R functions, often with better defaults.

dataset <- readRDS("filepath/filename.rds")

or

dataset <- readr::read_rds("filepath/filename.rds")

2.2.2 Reading a Stata “.dta”-file

dataset <- haven::read_stata("filepath/filename.dta")

2.2.3 Importing an Excel file

Because data in Excel can take many different forms (different sheets, extra rows with information, etc.), there are many arguments to this function to make sure you are reading in the right data. It will default to the first sheet, with the first row used as column names.

dataset <- readxl::read_excel("filepath/filename.xlsx")
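
For example, a sketch assuming the data live on a sheet named "Sheet2" with one extra title row to skip (both hypothetical):

dataset <- readxl::read_excel("filepath/filename.xlsx",
                              sheet = "Sheet2", # sheet to read, by name or position
                              skip = 1)         # rows to skip before reading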

2.2.4 Importing a “.csv”-file

dataset <- read.csv("filepath/filename.csv")

or

dataset <- readr::read_csv("filepath/filename.csv")

2.2.5 Reading in a SAS file

dataset <- haven::read_sas("filepath/filename.sas7bdat",
                           "filepath/catalog.sas7bcat")

2.3 Get to know your data

There are a number of ways to summarize your dataset. The summary() function will print summary statistics for each variable.

summary(dataset)

The glimpse() function tells you the number of rows and columns, what type of variable each column contains, and shows you the first few observations.

dplyr::glimpse(dataset)

To see what the entire dataset looks like, you can click on the name of the dataset in the Environment pane. This will open up a new tab with the dataset. You can also bring up this tab with:

View(dataset)

or

tibble::view(dataset)

To compactly list the variable names:

names(dataset)

To count the number of observations (i.e. rows of data) in a dataset:

nrow(dataset)

To count the number of observations in a subset of the data:

Code for character variables:

nrow(dataset[dataset$varname == "Male",])

Code for numeric variables:

nrow(dataset[dataset$varname == 1,])

You can also count the number of observations for each value of varname with:

dplyr::count(dataset, varname)

3 Basic data management

3.1 Generate quadratic term

dataset$newvar <- dataset$oldvar * dataset$oldvar

or

dataset <- dplyr::mutate(dataset, newvar = oldvar * oldvar)

You can also create a quadratic term directly in a model itself, but it must be wrapped in I(): var + I(var^2) + other_vars.
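
A minimal sketch of this (model type and variable names are generic):

model <- lm(outcome ~ var + I(var^2) + other_vars, data = dataset)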

3.2 Generate a categorical variable based on cut-points

There are a couple of ways to create a categorical variable 'newvar' based on cutpoint values of a continuous variable 'oldvar' at cutpoints x and y.

  1. One can use the cut() function to create a new categorical variable with cutpoints at the breaks = values. For example, say you have a variable with a minimum of 0 and a maximum of 100, with some missing values, and you would like to make three categories defined by the values (0,x] (category 1), (x,y] (category 2), and (y,100] (category 3). For the breaks = argument, supply a value lower than the minimum, x, y, and then a value larger than the maximum. (You can use -Inf and Inf if you don't know what the minimum and maximum are.) You can also use the include.lowest = and right = arguments to change how the intervals are created.

    dataset$newvar <- cut(dataset$oldvar, 
                          breaks = c(-Inf, x, y, Inf))
  2. You can also use case_when() if you want to be more explicit. For example, to generate a new categorical variable based on a set of cutpoints (ex. 40, 60, 80):

    library(dplyr)
    dataset <- mutate(dataset, newvar = case_when(
      oldvar < 40 ~ 0,
      oldvar >= 40 & oldvar < 60 ~ 1,
      oldvar >= 60 & oldvar < 80 ~ 2,
      oldvar >= 80 ~ 3
    ))

    Values of oldvar that don’t meet any of the conditions will be assigned NA.

3.3 Generate quantiles

Use the quantile() function to generate a categorical variable based on quantile cutoffs. In the code below, set the number of desired quantiles (here, tertiles) by changing the value of nquantiles.

nquantiles <- 3
dataset$newvar <- cut(dataset$oldvar, 
                      breaks = quantile(dataset$oldvar, 
                                        probs = seq(0, 1, 
                                                    length.out = nquantiles + 1), 
                                        na.rm = TRUE), # so missing values don't cause an error
                      include.lowest = TRUE)
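
To check that the resulting groups are roughly equal in size:

table(dataset$newvar)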

3.4 Generate an indicator/dummy variable

  1. You generally don’t need to create indicator/dummy variables. In models, R will treat any factor variable as a series of indicator variables. If varname has numeric values, turn it into a factor variable (see next section):
dataset$varname <- factor(dataset$varname)

and then use varname directly in a model.

  2. One could also code the indicator/dummy variables manually using case_when().
  3. In some scenarios, it may be helpful to generate the indicator variables used in a model with:
newvars <- as.data.frame(model.matrix(~varname + othervar, 
                                      data = dataset))

Note that this will also create an intercept (a column of ones) and will only contain the variables varname and othervar, no other variables from the dataset.

3.5 Label a categorical variable

To create a factor (categorical) variable, use the factor() function, with arguments levels = for the unique values and labels = for the names of those values. The labels will be assigned to the levels in the same order they are specified. For example, if oldvar has values 0, 1, and 2, which should be named "small" (0), "medium" (1), and "large" (2):

dataset$newvar <- factor(dataset$oldvar, levels = c(0, 1, 2), 
                         labels = c("small", "medium", "large"))

The reference level is always the first level (e.g. “small”). In models, R will treat character variables as factor variables, and the levels will be in alphabetical order. For example, if the variable in the example above was already labeled, but not a factor variable, putting it in a model or using the factor() function would result in levels “large”, “medium”, and “small” (“large” being the reference).
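
A quick demonstration of the alphabetical ordering in the console:

levels(factor(c("small", "medium", "large")))
#> [1] "large"  "medium" "small"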

To determine whether a variable is already a factor variable, or whether it is a character variable (or another type), use:

class(dataset$varname)

To see the levels of a factor variable, in order, use:

levels(dataset$varname)

The {forcats} package is designed to make working with categorical variables easier. If a variable has levels “large”, “medium”, and “small” and you want to change the order, you can use:

dataset$newvar <- forcats::fct_relevel(dataset$oldvar, 
                                       "small", "medium", "large")

To rename the levels “sm”, “md”, and “lg”:

dataset$newvar <- forcats::fct_recode(dataset$oldvar, 
                                      "sm" = "small", "md" = "medium", "lg" = "large")

If the variable is numeric, you need to turn it into a character or factor variable before using fct_recode():

dataset$newvar <- forcats::fct_recode(as.character(dataset$oldvar), 
                                      "sm" = "0", "md" = "1", "lg" = "2")

3.6 Sorting data

The arrange() function from the {dplyr} package sorts in ascending order by default.

dataset <- dplyr::arrange(dataset, varname)

To sort in descending order:

library(dplyr)
dataset <- arrange(dataset, desc(varname))

You can also sort on multiple variables: the code below arranges observations in ascending order of varname1 and, within each level of varname1, in descending order of varname2.

library(dplyr)
dataset <- arrange(dataset, varname1, desc(varname2))

3.7 Merge two datasets

To merge two datasets, use a _join() function from the {dplyr} package: left_join(), right_join(), full_join(). The functions differ by how they treat rows without a match in the “left-hand side” dataset and the “right-hand side” dataset.

This will keep all observations in dataset1 and merge only with observations from dataset2 with a matching id in dataset1:

library(dplyr)
dataset3 <- left_join(dataset1, dataset2, join_by(id))
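
To instead keep all observations from both datasets, whether or not they have a match:

dataset3 <- full_join(dataset1, dataset2, join_by(id))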

3.8 Keeping and dropping observations

To create a dataset that is a subset of another dataset, use filter(). This code will retain all observations that meet both the age and the sex requirement:

dataset2 <- dplyr::filter(dataset1, age > 35, sex == "F")

You can also explicitly use logical operators such as & (and) and | (or). To retain observations that are either older than 35 or female (or both):

dataset2 <- dplyr::filter(dataset1, age > 35 | sex == "F")

4 Descriptives, statistical tests, and visualization tools

4.1 Categorical data: frequencies, tests, and association measures

4.1.1 One-way frequency table

One can get a frequency table using the count() function.

dplyr::count(dataset, varname)

The tabyl() function from the {janitor} package is also helpful for frequency tables because it will also add percentages by default to a one-way frequency table.

janitor::tabyl(dataset, varname)

For a binary variable, you can use prop.test() or binom.test() (for exact intervals) to get a confidence interval for a proportion. Note that these functions treat the first count in the table as the number of "successes", so for a 0/1 variable the estimated proportion is that of the 0's; reorder the factor levels first if you want the proportion of 1's:

prop.test(table(dataset$binary_var))
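
The exact version works the same way (the first count is again treated as the successes):

binom.test(table(dataset$binary_var))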

4.1.2 RxC frequency tables

One can easily extend the functions above to cross-tabulate two categorical variables.

dplyr::count(dataset, varname1, varname2)
janitor::tabyl(dataset, varname1, varname2)

To get proportions when using tabyl():

library(janitor)
tabyl(dataset, varname1, varname2) |> adorn_percentages()

4.1.3 Two-sample test of proportions

To compare proportions of varname2 stratified by varname1 and get a confidence interval for the difference in proportions:

prop.test(table(dataset$varname1, dataset$varname2))

4.1.4 Estimate measures of association

The {epitools} package contains functions for estimating various simple epidemiologic quantities. For example, to estimate a risk ratio (and the corresponding proportions):

epitools::epitab(dataset$exposure, dataset$outcome, 
                 method = "riskratio")

4.1.5 Stratified frequency tables

You can estimate any summary statistic of one variable (ex. varname1 or varname2) within different levels of a categorical variable (ex. categorical_variable):

dplyr::summarize(dataset, mean_1 = mean(varname1), 
                 sd_2 = sd(varname2), .by = categorical_variable)

For two-way frequency tables (varname1 x varname2) stratified by a third variable (categorical_variable):

janitor::tabyl(dataset, varname1, varname2, categorical_variable)

4.2 Continuous data: summary statistics, visualization, and statistical tests

4.2.1 Measures of location and variance

Use the summary() function to obtain summary statistics for continuous data (minimum, first quartile, median, mean, third quartile, maximum):

summary(dataset$varname)

You can also estimate arbitrary summary statistics using dplyr::summarize():

dplyr::summarize(dataset, mean = mean(varname), sd = sd(varname))

One can estimate the standard error and 95% CI of the mean using the t.test() function:

t.test(dataset$varname)

4.2.2 Histogram

A histogram graphic for continuous data:

hist(dataset$age)

With the {ggplot2} package:

library(ggplot2)
ggplot(dataset, aes(x = age)) + 
    geom_histogram() +
    labs(title = "Histogram of age", 
         x = "Age")

4.2.3 Box plots

To show a box-and-whisker plot of age and a box plot of age by a categorical variable (ex. sex):

boxplot(dataset$age)                # box plot of age alone
boxplot(age ~ sex, data = dataset)  # box plot of age by sex
ggplot(dataset, aes(y = age, x = sex)) + 
    geom_boxplot() +
    labs(title = "Boxplot of age by sex", 
         x = "Sex", y = "Age")

4.2.4 Scatterplot

Scatterplot of two continuous variables:

plot(dataset$age, dataset$bmi)
ggplot(dataset, aes(x = age, y = bmi)) + 
    geom_point() +
    labs(x = "Age", y = "BMI")

4.2.5 Statistical Tests

To conduct a two-sample t-test (continuous variable by a binary variable), use the t.test() function. Welch's unequal-variance test is the default (var.equal = FALSE); set var.equal = TRUE for the classic pooled-variance test:

t.test(continuous_var ~ categorical_var, 
       data = dataset, var.equal = FALSE)

To conduct a one-way ANOVA (test differences of two or more means), use the aov() function, wrapped in summary() to see the F-test:

summary(aov(continuous_var ~ categorical_var, data = dataset))

To conduct the non-parametric Wilcoxon rank-sum test (aka Mann-Whitney U test), use the wilcox.test() function:

wilcox.test(continuous_var ~ categorical_var, data = dataset)

4.2.6 Correlation coefficients

One can estimate correlation coefficients using the cor() function. The default is Pearson; one can estimate the Spearman correlation coefficients using the method = "spearman" argument:

cor(dataset[, c("varname1", "varname2", "varname3")], 
    method = "spearman")

You can also provide two vectors if a single correlation is desired. For confidence intervals and hypothesis tests, use cor.test().

cor.test(dataset$varname1, dataset$varname2)

5 Regression analyses

5.1 Logistic Regression

5.1.1 Using continuous or binary variables

Fit a logistic regression model:

model <- glm(outcome ~ exposure + var1 + var2, 
             data = dataset, family = binomial())

Code to output the beta coefficients:

coef(model)

Code to output the odds ratios:

exp(coef(model))

Coefficients, standard errors, p-values:

summary(model)

Confidence intervals:

confint(model)

The {broom} package is helpful for getting model output:

broom::tidy(model, exponentiate = TRUE, conf.int = TRUE)

5.1.2 Using indicator variables

R automatically creates indicator variables for factor or character variables (see previous section). To force it to create indicator variables for a numeric variable, you can use factor():

model <- glm(outcome ~ exposure + factor(var1) + var2, 
             data = dataset, family = binomial())

To change the referent group, you can use fct_relevel() to move a new level to be the first level (the other levels will stay in their original order, unless you include them):

dataset$newvar <- forcats::fct_relevel(dataset$oldvar, "ref")
model <- glm(outcome ~ exposure + newvar, 
             data = dataset, family = binomial())

5.1.3 Stratified Logistic Regression

To fit a logistic regression model within levels of categorical_var:

library(dplyr)
models <- dataset |> 
  group_by(categorical_var) |> 
  group_map(~ glm(outcome ~ exposure, data = .x, family = binomial()))

5.2 Linear Regression

One can fit a linear regression model using either the lm() function or the glm() function with family = gaussian(). All of the same functions for extracting model output work on these models.

model <- lm(outcome ~ exposure + var1 + var2, data = dataset)
broom::tidy(model, conf.int = TRUE)
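
Equivalently, with glm():

model <- glm(outcome ~ exposure + var1 + var2, 
             data = dataset, family = gaussian())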

5.3 Survival Data

5.3.1 Set-up

The {survival} package is used for this entire section.

library(survival)

Within that package, the Surv() function "creates" the outcome object, which is then used in other functions:

Surv(time_to_outcome, outcome)

5.3.2 Graphs

Kaplan-Meier curves overall:

km <- survfit(Surv(time_to_outcome, outcome) ~ 1, 
              data = dataset)
plot(km)

KM curves by exposure groups:

km <- survfit(Surv(time_to_outcome, outcome) ~ exposure, 
              data = dataset)
plot(km)

To add 95% confidence intervals and censoring marks to the figure:

plot(km, conf.int = TRUE, mark.time = TRUE)

Log negative-log survivor plots:

plot(km, fun = "cloglog")

The {survminer} package is useful for creating survival curves for publication and can also include risk tables:

survminer::ggsurvplot(km, risk.table = TRUE)

5.3.3 Summary statistics

To see survivor function at various time points (ex. 30, 60, 90 days):

summary(km, times = c(30, 60, 90))

You can also use the tidy() function to extract the estimates and confidence intervals over all times in a nicer format:

broom::tidy(km)

To get median and other percentile survival times with confidence intervals:

quantile(km, probs = c(0.25, 0.5, 0.75))

5.3.4 Statistical tests

Log-rank test:

survdiff(Surv(time_to_outcome, outcome) ~ exposure, 
         data = dataset)

See the {survminer} vignette for details on plotting survival curves with p-values from weighted log-rank tests (here, Tarone-Ware):

survminer::ggsurvplot(km, data = dataset, 
                      pval = TRUE, pval.method = TRUE,
                      log.rank.weights = "sqrtN",
                      pval.method.coord = c(3, 0.1),
                      pval.method.size = 4)

5.3.5 Cox Regression

One can fit a Cox regression model using the coxph() function from the {survival} package.

cox <- coxph(Surv(time_to_outcome, outcome) ~ exposure, 
             data = dataset)

The functions from the logistic regression section can be used to extract model output.
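
For example, to get hazard ratios with 95% confidence intervals:

broom::tidy(cox, exponentiate = TRUE, conf.int = TRUE)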

Competing Risk Analysis: In carrying out a competing risk analysis, the "outcome" variable can take on three values. There must be a value for the actual event of interest (outcome=1); a value for censoring (outcome=0); and a value for the competing risk event (outcome=2). Note that finegray() expects this variable to be a factor whose first level represents censoring.

fg_data <- finegray(Surv(time_to_outcome, outcome) ~ exposure, 
                    data = dataset)
fg_model <- coxph(Surv(fgstart, fgstop, fgstatus) ~ exposure, 
                  weights = fgwt, data = fg_data)

5.4 Goodness of Fit Statistics

To get the log-likelihood, AIC, or BIC for a fitted regression model:

logLik(model)
AIC(model)
BIC(model)

The glance() function from the {broom} package will also give you goodness-of-fit statistics:

broom::glance(model)

To run a likelihood ratio test comparing two nested models:

lmtest::lrtest(mod_full, mod_nested)
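
A minimal sketch, where the nested model drops var2 from the full model (names are generic):

# Full model and nested model differ only in the terms included
mod_full <- glm(outcome ~ exposure + var1 + var2, 
                data = dataset, family = binomial())
mod_nested <- glm(outcome ~ exposure + var1, 
                  data = dataset, family = binomial())
lmtest::lrtest(mod_full, mod_nested)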