Basic R coding overview
This document aligns with the Stata cheat sheet.
1 Tips before you start
There is a lot of documentation and support available at your fingertips both within R and outside of R on blogs, other educational websites, Twitter, YouTube and more.
Within R, use
help(function_name)
to get an overview, more options, and examples for a specific function. You can also use a question mark: ?function_name. You can also search directly for a package or function from the Help panel in RStudio.
R is case sensitive (e.g. if a variable is ‘ID’, you must refer to it as ‘ID’, not ‘id’).
It is best to save and write your code in a R script (.R) or an RMarkdown (.Rmd) or Quarto (.qmd) file.
Notes/comments can be added for easy annotation of the program. A pound sign/hash (#) at the beginning of a line denotes a comment that R will not run. For example:
# This is a note that R will not read
If you start to run a function and need to stop it, you can press the stop sign at the top right of the console.
2 Getting started with R
Open RStudio. If you have used R previously, an old workspace may still be active; however, you always want to start with a fresh session. Go to Tools -> Global Options and, under General, change the workspace settings so that .RData is never restored at startup or saved on exit.
Now each session will start fresh, and you can safely quit RStudio if something goes wrong. You can also go to Session -> Restart R to clear your session.
2.1 Create and execute an R script
In RStudio, you can run R code directly in the console. This is helpful for running quick one-off code. To save your code, write it in an .R script, which allows you to run the same commands in the future to reproduce your work. To open a new .R script, you can:
- Click on the “New File” icon in the top left of RStudio, then select “R Script”.
- File -> New File -> R Script
Installing packages
To use certain functions in R, you often need to install additional packages. To install a package, run:
install.packages("package_name")
You can also click on the “Packages” tab in the bottom right window in RStudio. Next click “install” and a small window will pop up, enter the package’s name and hit “install”.
When a function belongs to a package, you can use that function by running package_name::function_name(), or by running library(package_name) at some point before calling function_name().
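As a minimal sketch of both approaches, using the built-in {stats} package (which R loads automatically, so this only illustrates the syntax):

```r
# Call a function with its package prefix, without an explicit library() call
stats::median(c(1, 3, 5, 7))

# Or load the package once, then call the function by name
library(stats)
median(c(1, 3, 5, 7))
```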
2.2 Read in a Dataset (.rds, .dta, .xlsx, .csv, .sas7bdat)
Datasets can be in various formats, such as “.rds” (R), “.dta” (Stata), “.xlsx” (Excel), “.csv” (Comma Separated Value), and “.sas7bdat” (SAS) files. Reading some of these formats requires installing additional packages.
You will need to include the path to the dataset file in your code. To determine the location of a dataset stored on a Windows computer, you can right-click the file, click on ‘Properties’, and copy and paste the location path. On a Mac, right-click the file, then hold down the Option key so that “Copy filename as Pathname” appears in the menu.
Unlike Stata, you can have multiple datasets open in the same R session. When reading in a dataset, you must store it as an object so that you can refer to it later. This is done in R using the “assignment arrow” <-.
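For example (the object names here are arbitrary):

```r
# Store a value in an object named x using the assignment arrow
x <- 5
# Typing the object's name prints its contents
x

# Objects can also hold entire datasets, e.g. R's built-in mtcars data frame
mydata <- mtcars
nrow(mydata)
```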
Reading in datasets can also be done from “File” -> “Import Dataset”. This will bring up a window with options that will help you write the code that can be saved in an R script to use again. The “files” tab in the bottom right window of RStudio may also be helpful to import datasets. It operates similarly to Windows Explorer or the Finder on a Mac.
2.2.1 Reading an R “.rds”-file
The readRDS()
function is part of “base R”. This means it does not need a separate package. The {readr}
package includes many functions for reading in data that do the same things as the base R packages, often with better defaults.
dataset <- readRDS("filepath/filename.rds")
or
dataset <- readr::read_rds("filepath/filename.rds")
2.2.2 Reading a Stata “.dta”-file
dataset <- haven::read_stata("filepath/filename.dta")
2.2.3 Importing an Excel file
Because data in Excel can take many different forms (different sheets, extra rows with information, etc.), there are many arguments to this function to make sure you are reading in the right data. It will default to the first sheet, with the first row used as column names.
dataset <- readxl::read_excel("filepath/filename.xlsx")
2.2.4 Importing a “.csv”-file
dataset <- read.csv("filepath/filename.csv")
or
dataset <- readr::read_csv("filepath/filename.csv")
2.2.5 Reading in a SAS file
haven::read_sas("filepath/filename.sas7bdat",
                "filepath/catalog.sas7bcat")
2.3 Get to know your data
There are a number of ways to summarize your dataset. The summary()
function will print summary statistics for each variable.
summary(dataset)
The glimpse()
function tells you the number of rows and columns, what type of variable each column contains, and shows you the first few observations.
dplyr::glimpse(dataset)
To see what the entire dataset looks like, you can click on the name of the dataset in the Environment pane. This will open up a new tab with the dataset. You can also bring up this tab with:
View(dataset)
or
tibble::view(dataset)
To compactly list the variable names:
names(dataset)
To count the number of observations (i.e. rows of data) in a dataset:
nrow(dataset)
To count the number of observations in a subset of the data:
Code for character variables:
nrow(dataset[dataset$varname == "Male",])
Code for numeric variables:
nrow(dataset[dataset$varname == 1,])
You can also count the number of observations for each value of varname
with:
dplyr::count(dataset, varname)
3 Basic data management
3.1 Generate quadratic term
dataset$newvar <- dataset$oldvar * dataset$oldvar
or
dataset <- dplyr::mutate(dataset, newvar = oldvar * oldvar)
You can also create a quadratic term directly in a model itself, but it must be wrapped in I(): var + I(var^2) + other_vars.
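A self-contained sketch with simulated data (the names example_data, var, and outcome are made up for illustration):

```r
# Simulated data where the outcome depends on var and its square
set.seed(1)
example_data <- data.frame(var = runif(100, 0, 10))
example_data$outcome <- 2 + 0.5 * example_data$var +
  0.1 * example_data$var^2 + rnorm(100)

# I() makes R square var arithmetically instead of treating
# ^ as part of the formula syntax
model <- lm(outcome ~ var + I(var^2), data = example_data)
coef(model)  # intercept, linear term, quadratic term
```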
3.2 Generate a categorical variable based on cut-points
There are a couple of ways to create a categorical variable newvar based on cutpoint values of a continuous variable oldvar at cutpoints x and y.
One can use the cut() function to create a new categorical variable with cutpoints at the breaks = values. For example, say you have a variable with a minimum of 0 and a maximum of 100, with some missing values, and you would like to make three categories defined by the values (0,x] (category 1), (x,y] (category 2), and (y,100] (category 3). For the breaks = argument, you will want to supply the left-hand breaks: a value lower than the minimum, x, y, and then a value larger than the maximum. (You can use -Inf and Inf if you don’t know what the minimum and maximum are.) You can also use the include.lowest = and right = arguments to change how the intervals are created.
dataset$newvar <- cut(dataset$oldvar, breaks = c(-Inf, x, y, Inf))
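A runnable sketch with made-up values and cutpoints (40 and 60) showing the labels = and right = arguments:

```r
values <- c(10, 40, 55, 60, 90, NA)  # hypothetical data with a missing value

# Default: intervals are right-closed, so 40 falls in (-Inf, 40]
cut(values, breaks = c(-Inf, 40, 60, Inf))

# labels = names the categories; right = FALSE makes the intervals
# left-closed instead, so 40 now falls in [40, 60)
labeled <- cut(values,
               breaks = c(-Inf, 40, 60, Inf),
               labels = c("low", "mid", "high"),
               right = FALSE)
labeled
```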
You can also use case_when() if you want to be more explicit. For example, to generate a new categorical variable based on a set of cutpoints (ex. 40, 60, 80):
library(dplyr)
dataset <- mutate(dataset, newvar = case_when(
  oldvar < 40 ~ 0,
  oldvar >= 40 & oldvar < 60 ~ 1,
  oldvar >= 60 & oldvar < 80 ~ 2,
  oldvar >= 80 ~ 3
))
Values of oldvar that don’t meet any of the conditions will be assigned NA.
3.3 Generate quantiles
Use the quantile() function to generate a categorical variable based on quantile cutoffs. In the code below, set the number of desired quantiles (here, tertiles) by changing the value of nquantiles.
nquantiles <- 3
dataset$newvar <- cut(dataset$oldvar,
                      breaks = quantile(dataset$oldvar,
                                        probs = seq(0, 1, length.out = nquantiles + 1)),
                      include.lowest = TRUE)
3.4 Generate an indicator/dummy variable
- You generally don’t need to create indicator/dummy variables. In models, R will treat any factor variable as a series of indicator variables. If varname has numeric values, turn it into a factor variable (see next section):
dataset$varname <- factor(dataset$varname)
and then use varname directly in a model.
- One could also code the indicator/dummy variables manually using case_when().
- In some scenarios, it may be helpful to generate the indicator variables used in a model with:
newvars <- as.data.frame(model.matrix(~ varname + othervar,
                                      data = dataset))
Note that this will also create an intercept (a column of ones) and will only contain the variables varname and othervar, no other variables from the dataset.
3.5 Label a categorical variable
To create a factor (categorical) variable, use the factor() function, with arguments levels = for the unique values and labels = for the names of those values. The labels will be assigned to the levels in the same order they are specified. For example, if oldvar has values 0, 1, and 2, which should be named “small” (0), “medium” (1), “large” (2):
dataset$newvar <- factor(dataset$oldvar, levels = c(0, 1, 2),
                         labels = c("small", "medium", "large"))
The reference level is always the first level (e.g. “small”). In models, R will treat character variables as factor variables, and the levels will be in alphabetical order. For example, if the variable in the example above was already labeled but not a factor variable, putting it in a model or using the factor() function would result in levels “large”, “medium”, and “small” (“large” being the reference).
To determine whether a variable is already a factor variable, or whether it is a character variable (or another type), use:
class(dataset$varname)
To see the levels in order of a factor variable, use:
levels(dataset$varname)
The {forcats}
package is designed to make working with categorical variables easier. If a variable has levels “large”, “medium”, and “small” and you want to change the order, you can use:
dataset$newvar <- forcats::fct_relevel(dataset$oldvar,
                                       "small", "medium", "large")
To rename the levels “sm”, “md”, and “lg”:
dataset$newvar <- forcats::fct_recode(dataset$oldvar,
                                      "sm" = "small", "md" = "medium", "lg" = "large")
If the variable is numeric, you need to turn it into a character or factor variable before using fct_recode()
:
dataset$newvar <- forcats::fct_recode(as.character(dataset$oldvar),
                                      "sm" = "0", "md" = "1", "lg" = "2")
3.6 Sorting data
The arrange() function from the {dplyr} package sorts in ascending order by default.
dataset <- dplyr::arrange(dataset, varname)
To sort in descending order:
library(dplyr)
dataset <- arrange(dataset, desc(varname))
You can also sort on multiple variables: for example, to arrange observations in ascending order of varname1 and, within each varname1 category, in descending order of varname2:
library(dplyr)
dataset <- arrange(dataset, varname1, desc(varname2))
3.7 Merge two datasets
To merge two datasets, use a _join()
function from the {dplyr}
package: left_join()
, right_join()
, full_join()
. The functions differ by how they treat rows without a match in the “left-hand side” dataset and the “right-hand side” dataset.
This will keep all observations in dataset1
and merge only with observations from dataset2
with a matching id
in dataset1
:
library(dplyr)
dataset3 <- left_join(dataset1, dataset2, join_by(id))
3.8 Keeping and dropping observations
To create a dataset that is a subset of another dataset, use filter()
. This code will retain all observations with both the age and sex requirement:
dataset2 <- dplyr::filter(dataset1, age > 35, sex == "F")
You can also explicitly use logic operators such as &
(and) and |
(or). To retain observations that are either older than 35 or female (or both):
dataset2 <- dplyr::filter(dataset1, age > 35 | sex == "F")
4 Descriptives, statistical tests, and visualization tools
4.1 Categorical data: frequencies, tests, and association measures
4.1.1 One-way frequency table
One can get a frequency table using the count()
function.
dplyr::count(dataset, varname)
The tabyl()
function from the {janitor}
package is also helpful for frequency tables because it will also add percentages by default to a one-way frequency table.
janitor::tabyl(dataset, varname)
For a binary variable, you can use prop.test()
or binom.test()
(exact intervals) to get confidence intervals for the proportion of 1’s (or the “higher” level for a factor variable):
prop.test(table(dataset$binary_var))
4.1.2 RxC frequency tables
One can easily extend the functions above to cross-tabulate two categorical variables.
dplyr::count(dataset, varname1, varname2)
janitor::tabyl(dataset, varname1, varname2)
To get proportions when using tabyl()
:
library(janitor)
tabyl(dataset, varname1, varname2) |> adorn_percentages()
4.1.3 Two-sample test of proportions
To compare proportions of varname2 stratified by varname1 and get a confidence interval for the difference in proportions:
prop.test(table(dataset$varname1, dataset$varname2))
4.1.4 Estimate measures of association
The {epitools}
package contains functions for estimating various simple epidemiologic quantities. For example, to estimate a risk ratio (and the corresponding proportions):
epitools::epitab(dataset$exposure, dataset$outcome,
                 method = "riskratio")
4.1.5 Stratified frequency tables
You can estimate any summary statistic of one variable (ex. varname1
or varname2
) within different levels of a categorical variable (ex. categorical_variable
):
dplyr::summarize(dataset, mean_1 = mean(varname1),
                 sd_2 = sd(varname2), .by = categorical_variable)
For two-way frequency tables (varname1 x varname2) stratified by a third variable (categorical_variable):
janitor::tabyl(dataset, varname1, varname2, categorical_variable)
4.2 Continuous data: summary statistics, visualization, and statistical tests
4.2.1 Measures of location and variance
Use the summary() function to obtain summary statistics for continuous data (mean, minimum, maximum, median, IQR):
summary(dataset$varname)
You can also estimate arbitrary summary statistics using dplyr::summarize()
:
dplyr::summarize(dataset, mean = mean(varname), sd = sd(varname))
One can estimate the standard error and 95% CI of the mean (the equivalent of Stata’s ci means command) using the t.test() function:
t.test(dataset$varname)
4.2.2 Histogram
A histogram graphic for continuous data:
hist(dataset$age)
With the {ggplot2}
package:
ggplot(dataset, aes(x = age)) +
geom_histogram() +
labs(title = "Histogram of age",
x = "Age")
4.2.3 Box plots
To show a box-and-whisker plot of age and a box plot of age by a categorical variable (ex. sex):
boxplot(dataset$age)
boxplot(age ~ sex, data = dataset)
ggplot(dataset, aes(y = age, x = sex)) +
geom_boxplot() +
labs(title = "Boxplot of age by sex",
x = "Sex", y = "Age")
4.2.4 Scatterplot
Scatterplot of two continuous variables:
plot(dataset$age, dataset$bmi)
ggplot(dataset, aes(x = age, y = bmi)) +
  geom_point() +
  labs(x = "Age", y = "BMI")
4.2.5 Statistical Tests
To conduct a two-sample t-test (continuous variable by a binary variable), use the t.test() function. Unequal variances are assumed by default; set var.equal = TRUE if you want to assume equal variances:
t.test(continuous_var ~ categorical_var,
data = dataset, var.equal = FALSE)
To conduct a one-way ANOVA (test differences of two or more means), use the aov()
function:
aov(continuous_var ~ categorical_var, data = dataset)
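Note that printing the aov() object alone shows little; wrap it in summary() to get the ANOVA table with the F statistic and p-value. A self-contained sketch with simulated data (example_data and its columns are made up):

```r
# Simulated data: a continuous outcome across three groups
set.seed(42)
example_data <- data.frame(
  categorical_var = rep(c("a", "b", "c"), each = 20),
  continuous_var  = rnorm(60)
)

fit <- aov(continuous_var ~ categorical_var, data = example_data)
summary(fit)  # ANOVA table: Df, Sum Sq, F value, Pr(>F)
```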
To conduct a non-parametric test Wilcoxon Rank-sum test (aka Mann-Whitney U test), use the wilcox.test()
function:
wilcox.test(continuous_var ~ categorical_var, data = dataset)
4.2.6 Correlation coefficients
One can estimate correlation coefficients using the cor()
function. The default is Pearson; one can estimate the Spearman correlation coefficients using the method = "spearman"
argument:
cor(dataset[, c("varname1", "varname2", "varname3")],
method = "spearman")
You can also provide two vectors if a single correlation is desired. For confidence intervals and hypothesis tests, use cor.test().
cor.test(dataset$varname1, dataset$varname2)
5 Regression analyses
5.1 Logistic Regression
5.1.1 Using continuous or binary variables
Fit a logistic regression model:
model <- glm(outcome ~ exposure + var1 + var2,
             data = dataset, family = binomial())
Code to output the beta coefficients:
coef(model)
Code to output the odds ratios:
exp(coef(model))
Coefficients, standard errors, p-values:
summary(model)
Confidence intervals:
confint(model)
The {broom}
package is helpful for getting model output:
broom::tidy(model, exponentiate = TRUE, conf.int = TRUE)
5.1.2 Using indicator variables
R automatically creates indicator variables for factor or character variables (see previous section). To force it to create indicator variables for a numeric variable, you can use factor()
:
model <- glm(outcome ~ exposure + factor(var1) + var2,
             data = dataset, family = binomial())
To change the referent group, you can use fct_relevel()
to move a new level to be the first level (the other levels will stay in their original order, unless you include them):
dataset$newvar <- forcats::fct_relevel(dataset$oldvar, "ref")
model <- glm(outcome ~ exposure + newvar,
             data = dataset, family = binomial())
5.1.3 Stratified Logistic Regression
To fit a logistic regression model within levels of categorical_var
:
library(dplyr)
models <- dataset |>
  group_by(categorical_var) |>
  group_map(~ glm(outcome ~ exposure, data = .x, family = binomial()))
5.2 Linear Regression
One can conduct a linear regression model using either the lm()
function or the glm()
function with family = gaussian()
. All of the same functions work on these models to extract model output.
model <- lm(outcome ~ exposure + var1 + var2, data = dataset)
broom::tidy(model, conf.int = TRUE)
5.3 Survival Data
5.3.1 Set-up
The {survival}
package is used for this entire section.
library(survival)
Within that package, the Surv()
function is used to “create” the outcome and is used in other functions:
Surv(time_to_outcome, outcome)
5.3.2 Graphs
Kaplan-Meier curves overall:
km <- survfit(Surv(time_to_outcome, outcome) ~ 1,
              data = dataset)
plot(km)
KM curves by exposure groups:
km <- survfit(Surv(time_to_outcome, outcome) ~ exposure,
              data = dataset)
plot(km)
To add 95% confidence intervals and censoring marks to the figure:
plot(km, conf.int = TRUE, mark.time = TRUE)
Log negative-log survivor plots:
plot(km, fun = "cloglog")
The {survminer}
package is useful for creating survival curves for publication and can also include risk tables:
survminer::ggsurvplot(km, risk.table = TRUE)
5.3.3 Summary statistics
To see survivor function at various time points (ex. 30, 60, 90 days):
summary(km, times = c(30, 60, 90))
You can also use the tidy()
function to extract the estimates and confidence intervals over all times in a nicer format:
broom::tidy(km)
To get median and other percentile survival times with confidence intervals:
quantile(km, probs = c(0.25, 0.5, 0.75))
5.3.4 Statistical tests
Log-rank test:
survdiff(Surv(time_to_outcome, outcome) ~ exposure,
data = dataset)
See the {survminer}
vignette for details on plotting survival curves with p-values from weighted log-rank tests (here, Tarone-Ware):
survminer::ggsurvplot(km, data = dataset,
                      pval = TRUE, pval.method = TRUE,
                      log.rank.weights = "sqrtN",
                      pval.method.coord = c(3, 0.1),
                      pval.method.size = 4)
5.3.5 Cox Regression
One can conduct a Cox regression model using the coxph()
function from the {survival}
package.
cox <- coxph(Surv(time_to_outcome, outcome) ~ exposure,
             data = dataset)
The functions from the logistic regression section can be used to extract model output.
Competing Risk Analysis: In carrying out a competing risk analysis, the “outcome” variable can take on three values. There must be a value for the actual event of interest (outcome=1); a value for censoring (outcome=0); and a value for the competing risk event (outcome=2). The finegray() function expects this variable to be a factor, with censoring as the first level.
fg_data <- finegray(Surv(time_to_outcome, outcome) ~ exposure,
                    data = dataset)
fg_model <- coxph(Surv(fgstart, fgstop, fgstatus) ~ exposure,
                  weight = fgwt, data = fg_data)
5.4 Goodness of Fit Statistics
To get the log likelihood/AIC/BIC for a regression model:
logLik(model)
AIC(model)
BIC(model)
The glance()
function from the {broom}
package will also give you goodness-of-fit statistics:
broom::glance(model)
To run a likelihood ratio test comparing two nested models:
lmtest::lrtest(mod_full, mod_nested)