Introduction to R
ggplot2
and make some simple figuresWe saw that we could compute summary statistics using summarize()
:
nlsy |>
group_by(sex_cat) |>
summarize(prop_glasses = mean(glasses),
mean_age_bir = mean(age_bir),
n_vg_eye = sum(eyesight_cat == "Very Good"),
prop_vg_eye = mean(eyesight_cat == "Very Good"),
n_g_eye = sum(eyesight_cat == "Good"),
prop_g_eye = mean(eyesight_cat == "Good")
)
# A tibble: 2 × 7
sex_cat prop_glasses mean_age_bir n_vg_eye prop_vg_eye n_g_eye prop_g_eye
<fct> <dbl> <dbl> <int> <dbl> <int> <dbl>
1 Male 0.441 25.1 162 0.323 85 0.170
2 Female 0.572 22.2 223 0.317 164 0.233
The {janitor}
package has nice functionality to create tables:
sex_cat n percent
Male 501 0.4157676
Female 704 0.5842324
Tip
Just like many of our other functions, it takes a dataset as its first argument so can pipe
tabyl()
It’s easy to make a 2 x 2 (or n x m) table with tabyl()
tabyl()
We might want to see the percentages instead of counts
sex_cat Doesn't wear glasses Wears glasses/contacts
Male 55.9% 44.1%
Female 42.8% 57.2%
Tip
By default, adorn_percentages()
will give you row percentages (sums to 100% across the row)
sex_cat Doesn't wear glasses Wears glasses/contacts
Male 48.2% 35.4%
Female 51.8% 64.6%
sex_cat Doesn't wear glasses Wears glasses/contacts
Male 23.2% 18.3%
Female 25.0% 33.4%
tabyl()
These are just dataframes, so we can save as csv, etc.
Warning
If you don’t have a “results” folder in your project, that code won’t work until you make one!
Pearson's Chi-squared test
data: tabyl(nlsy, sex_cat, eyesight_cat)
X-squared = 22.738, df = 4, p-value = 0.0001428
Tip
There are a lot of other helpful functions in the janitor
package. My favorite is clean_names()
. Check out the documentation for more!
There are lots of packages for making tables in R
One of my favorites is {gtsummary}
gtsummary::tbl_summary()
Characteristic | Male, N = 5011 | Female, N = 7041 |
---|---|---|
race_eth_cat | ||
Hispanic | 81 (16%) | 130 (18%) |
Black | 138 (28%) | 169 (24%) |
Non-Black, Non-Hispanic | 282 (56%) | 405 (58%) |
eyesight_cat | ||
Excellent | 228 (46%) | 246 (35%) |
Very Good | 162 (32%) | 223 (32%) |
Good | 85 (17%) | 164 (23%) |
Fair | 21 (4.2%) | 57 (8.1%) |
Poor | 5 (1.0%) | 14 (2.0%) |
glasses | 221 (44%) | 403 (57%) |
age_bir | 24.0 (21.0, 29.0) | 21.0 (18.0, 26.0) |
1 n (%); Median (IQR) |
Characteristic | Male, N = 5011 | Female, N = 7041 |
---|---|---|
Race/ethnicity | ||
Hispanic | 81 (16%) | 130 (18%) |
Black | 138 (28%) | 169 (24%) |
Non-Black, Non-Hispanic | 282 (56%) | 405 (58%) |
Eyesight | ||
Excellent | 228 (46%) | 246 (35%) |
Very Good | 162 (32%) | 223 (32%) |
Good | 85 (17%) | 164 (23%) |
Fair | 21 (4.2%) | 57 (8.1%) |
Poor | 5 (1.0%) | 14 (2.0%) |
Wears glasses | 221 (44%) | 403 (57%) |
Age at first birth | 24.0 (21.0, 29.0) | 21.0 (18.0, 26.0) |
1 n (%); Median (IQR) |
tbl_summary(
nlsy,
by = sex_cat,
include = c(race_eth_cat, eyesight_cat,
glasses, age_bir),
label = list(
race_eth_cat ~ "Race/ethnicity",
eyesight_cat ~ "Eyesight",
glasses ~ "Wears glasses",
age_bir ~ "Age at first birth"
)) |>
add_p(test = list(
all_continuous() ~ "t.test",
all_categorical() ~ "chisq.test")) |>
add_overall(col_label = "**Total**") |>
bold_labels() |>
modify_footnote(update = everything() ~ NA) |>
modify_header(label = "**Variable**",
p.value = "**P**")
Variable | Total | Male, N = 501 | Female, N = 704 | P |
---|---|---|---|---|
Race/ethnicity | 0.3 | |||
Hispanic | 211 (18%) | 81 (16%) | 130 (18%) | |
Black | 307 (25%) | 138 (28%) | 169 (24%) | |
Non-Black, Non-Hispanic | 687 (57%) | 282 (56%) | 405 (58%) | |
Eyesight | <0.001 | |||
Excellent | 474 (39%) | 228 (46%) | 246 (35%) | |
Very Good | 385 (32%) | 162 (32%) | 223 (32%) | |
Good | 249 (21%) | 85 (17%) | 164 (23%) | |
Fair | 78 (6.5%) | 21 (4.2%) | 57 (8.1%) | |
Poor | 19 (1.6%) | 5 (1.0%) | 14 (2.0%) | |
Wears glasses | 624 (52%) | 221 (44%) | 403 (57%) | <0.001 |
Age at first birth | 22.0 (19.0, 27.0) | 24.0 (21.0, 29.0) | 21.0 (18.0, 26.0) | <0.001 |
tbl_summary()
as_flex_table()
to output using the flextable
packagegt
package with as_gt()
Regressions take a formula: y ~ x1 + x2 + x3
x1
and x2
with y ~ x1*x2 + x3
x1
and x2
will be included tooI()
: y ~ x1 + I(x1^2)
To fit a linear regression (by ordinary least squares), use the lm()
function
To fit a generalized linear model (e.g., logistic regression, Poisson regression) use glm()
and specify the family =
argument
family = gaussian()
is the default: another way of fitting a linear regresionfamily = binomial()
gives you logistic regressionfamily = poisson()
is Poisson regressionWe tell R what dataset to pull the variables from with data =
(Intercept) sex_catFemale
1029.3339 -4681.2484
age_bir race_eth_catBlack
482.4650 -299.6490
race_eth_catNon-Black, Non-Hispanic sex_catFemale:age_bir
6418.4568 161.5008
2.5 % 97.5 %
(Intercept) -3746.58543 5805.2531
sex_catFemale -10553.32090 1190.8242
age_bir 303.50836 661.4217
race_eth_catBlack -2448.39108 1849.0932
race_eth_catNon-Black, Non-Hispanic 4500.91244 8336.0012
sex_catFemale:age_bir -77.30931 400.3109
(Intercept) eyesight_catVery Good eyesight_catGood
0.6338301 0.9172091 0.8608297
eyesight_catFair eyesight_catPoor sex_catFemale
0.5798278 1.1624355 1.8362460
income
1.0000174
Call:
glm(formula = glasses ~ eyesight_cat + sex_cat + income, family = binomial(),
data = nlsy)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.560e-01 1.383e-01 -3.298 0.000975 ***
eyesight_catVery Good -8.642e-02 1.400e-01 -0.617 0.537031
eyesight_catGood -1.499e-01 1.604e-01 -0.934 0.350267
eyesight_catFair -5.450e-01 2.535e-01 -2.150 0.031585 *
eyesight_catPoor 1.505e-01 4.786e-01 0.315 0.753121
sex_catFemale 6.077e-01 1.210e-01 5.021 5.14e-07 ***
income 1.745e-05 4.613e-06 3.782 0.000155 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1668.9 on 1204 degrees of freedom
Residual deviance: 1626.8 on 1198 degrees of freedom
AIC: 1640.8
Number of Fisher Scoring iterations: 4
{broom}
helps “tidy” regression results
# A tibble: 6 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1029. 2434. 0.423 6.72e- 1
2 sex_catFemale -4681. 2993. -1.56 1.18e- 1
3 age_bir 482. 91.2 5.29 1.46e- 7
4 race_eth_catBlack -300. 1095. -0.274 7.84e- 1
5 race_eth_catNon-Black, Non-Hispanic 6418. 977. 6.57 7.63e-11
6 sex_catFemale:age_bir 162. 122. 1.33 1.85e- 1
broom::tidy()
# A tibble: 7 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.634 0.138 -3.30 9.75e-4 0.483 0.830
2 eyesight_catVery Good 0.917 0.140 -0.617 5.37e-1 0.697 1.21
3 eyesight_catGood 0.861 0.160 -0.934 3.50e-1 0.628 1.18
4 eyesight_catFair 0.580 0.254 -2.15 3.16e-2 0.350 0.949
5 eyesight_catPoor 1.16 0.479 0.315 7.53e-1 0.458 3.08
6 sex_catFemale 1.84 0.121 5.02 5.14e-7 1.45 2.33
7 income 1.00 0.00000461 3.78 1.55e-4 1.00 1.00
broom::tidy()
can also help other statistics# A tibble: 1 × 4
statistic p.value parameter method
<dbl> <dbl> <int> <chr>
1 22.7 0.000143 4 Pearson's Chi-squared test
# A tibble: 1 × 10
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2397. 16690. 14292. 3.00 0.00273 963. 831. 3963.
# ℹ 2 more variables: method <chr>, alternative <chr>
gtsummary::tbl_regression()
Characteristic | Beta | 95% CI1 | p-value |
---|---|---|---|
(Intercept) | 1,029 | -3,747, 5,805 | 0.7 |
Sex | |||
Male | — | — | |
Female | -4,681 | -10,553, 1,191 | 0.12 |
Age at first birth | 482 | 304, 661 | <0.001 |
Race/ethnicity | |||
Hispanic | — | — | |
Black | -300 | -2,448, 1,849 | 0.8 |
Non-Black, Non-Hispanic | 6,418 | 4,501, 8,336 | <0.001 |
Sex/age interaction | |||
Female * Age at first birth | 162 | -77, 400 | 0.2 |
1 CI = Confidence Interval |
gtsummary::tbl_regression()
Characteristic | OR1 | 95% CI1 | p-value |
---|---|---|---|
Eyesight | |||
Excellent | — | — | |
Very Good | 0.92 | 0.70, 1.21 | 0.5 |
Good | 0.86 | 0.63, 1.18 | 0.4 |
Fair | 0.58 | 0.35, 0.95 | 0.032 |
Poor | 1.16 | 0.46, 3.08 | 0.8 |
Sex | |||
Male | — | — | |
Female | 1.84 | 1.45, 2.33 | <0.001 |
Income | 1.00 | 1.00, 1.00 | <0.001 |
1 OR = Odds Ratio, CI = Confidence Interval |
linear_model_no_int <- lm(income ~ sex_cat + age_bir + race_eth_cat, data = nlsy)
tbl_no_int <- tbl_regression(
linear_model_no_int,
intercept = TRUE,
label = list(
sex_cat ~ "Sex",
race_eth_cat ~ "Race/ethnicity",
age_bir ~ "Age at first birth"
))
tbl_int <- tbl_regression(
linear_model,
intercept = TRUE,
label = list(
sex_cat ~ "Sex",
race_eth_cat ~ "Race/ethnicity",
age_bir ~ "Age at first birth",
`sex_cat:age_bir` ~ "Sex/age interaction"
))
Characteristic | Model 1 | Model 2 | ||||
---|---|---|---|---|---|---|
Beta | 95% CI1 | p-value | Beta | 95% CI1 | p-value | |
(Intercept) | -1,201 | -4,657, 2,256 | 0.5 | 1,029 | -3,747, 5,805 | 0.7 |
Sex | ||||||
Male | — | — | — | — | ||
Female | -833 | -2,283, 617 | 0.3 | -4,681 | -10,553, 1,191 | 0.12 |
Age at first birth | 571 | 448, 693 | <0.001 | 482 | 304, 661 | <0.001 |
Race/ethnicity | ||||||
Hispanic | — | — | — | — | ||
Black | -287 | -2,436, 1,863 | 0.8 | -300 | -2,448, 1,849 | 0.8 |
Non-Black, Non-Hispanic | 6,434 | 4,516, 8,352 | <0.001 | 6,418 | 4,501, 8,336 | <0.001 |
Sex/age interaction | ||||||
Female * Age at first birth | 162 | -77, 400 | 0.2 | |||
1 CI = Confidence Interval |
autoReg
Dependent: income |
| unit | value | Coefficient (multivariable) |
---|---|---|---|---|
sex_cat | Female (N=704) | Mean ± SD | 16689.6 ± 14526.5 | |
Male (N=501) | Mean ± SD | 14292.3 ± 12321.0 | -4681.25 (-10553.32 to 1190.82, p=.118) | |
age_bir | [13,52] | Mean ± SD | 23.4 ± 6.0 | 482.47 (303.51 to 661.42, p<.001) |
race_eth_cat | Black (N=307) | Mean ± SD | 10794.8 ± 9468.8 | |
Hispanic (N=211) | Mean ± SD | 10489.7 ± 9209.5 | -299.65 (-2448.39 to 1849.09, p=.784) | |
Non-Black, Non-Hispanic (N=687) | Mean ± SD | 18814.0 ± 14750.7 | 6418.46 (4500.91 to 8336.00, p<.001) | |
sex_cat:age_bir | Male: | |||
sex_cat:age_bir | Female: | 161.50 (-77.31 to 400.31, p=.185) | ||
ggplot()
ggplot()
{data}
: must be a dataframe (or tibble!){xvar}
and {yvar}
are the names (unquoted) of the variables on the x- and y-axes
{othvar}
is some other unquoted variable name that defines a grouping or other characteristic you want to map to an aesthetic<characteristic>
: you can map {othvar}
(or a fixed "value"
) to any of a number of aesthetic features of the figure; e.g., color, shape, size, linetype, etc.<geom>
: the geometric feature you want to use; e.g., point (scatterplot), line, histogram, bar, etc."value"
: a fixed value that defines some characteristic of the figure; e.g., “red”, 10, “dashed”ggplot()
doesn’t plot any data itself, it just sets up the data and variablesgeom_bar()
creates a bar graph for the number of observations with a certain value of the x
variable
y
variableTip
use geom_col()
if you have a y
variable that you want to use as the height of the bars
facet_grid()
creates a panel for each value of another variable
rows =
vars()
(you can use helpers like starts_with()
)Tip
use facet_wrap()
if you want to create panels that expand along rows and columns (e.g., to facet by many countries)
scale_{fill/color}_{...}()
functions change the color palette
Tip
scale_fill_viridis_d()
good color-blind and black & white-friendly options
scale_{x/y}_{...}()
functions change the axis scale and/or labelingTip
scale_y_log10()
is helpful when plotting odds or risk ratios
theme_{...}()
changes the “look” of the plot
Tip
find lots of themes and color palettes at https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/
ggplot(data = nlsy,
aes(x = eyesight_cat,
fill = eyesight_cat)) +
geom_bar() +
facet_grid(cols = vars(glasses_cat)) +
scale_fill_brewer(palette = "Spectral",
direction = -1) +
scale_x_discrete(breaks = c("Excellent",
"Good", "Poor"),
name = "Eyesight quality") +
theme_minimal() +
theme(legend.position = "none",
axis.text.x = element_text(
angle = 45, vjust = 1, hjust = 1))
Tip
lots of arguments can be set to element_blank()
to get rid of them
ggplot(data = nlsy,
aes(x = eyesight_cat,
fill = eyesight_cat)) +
geom_bar() +
facet_grid(cols = vars(glasses_cat)) +
scale_fill_brewer(palette = "Spectral",
direction = -1) +
scale_x_discrete(breaks = c("Excellent",
"Good", "Poor"),
name = "Eyesight quality") +
theme_minimal() +
theme(legend.position = "none",
axis.text.x = element_text(
angle = 45, vjust = 1, hjust = 1))
labs()
can add subtitles, caption, alt text, as well as label any aesthetics (fill, color, etc.)Tip
there’s a lot of redundancy… we could have specified x = "Eyesight quality"
here instead.
ggplot(data = nlsy,
aes(x = eyesight_cat,
fill = eyesight_cat)) +
geom_bar() +
facet_grid(cols = vars(glasses_cat)) +
scale_fill_brewer(palette = "Spectral",
direction = -1) +
scale_x_discrete(breaks = c("Excellent",
"Good", "Poor"),
name = "Eyesight quality") +
theme_minimal() +
theme(legend.position = "none",
axis.text.x = element_text(
angle = 45, vjust = 1, hjust = 1)) +
labs(title = "Eyesight in NLSY",
y = NULL) +
coord_cartesian(expand = FALSE)
coord_{...}()
functions change the coordinate system
expand = FALSE
means there is no extra space beyond the axis limitsTip
coord_fixed(ratio = 1)
will ensure that 1 unit on the x-axis is the same length as one unit on the y-axis
Let’s walk through some more examples in depth
geom_point()
Question
How are we specifying the type of plot (scatterplot)? How are we specifying the variables to plot? How are we specifying the data used to plot it?
When we put color =
outside the aes()
, it means we’re giving it a specific color value that applies to all the points.
When we put color =
inside the aes()
– with no quotation marks – it means we’re telling it how it should assign colors.
Note that we could also put the aes()
(aesthetics) in the geom_()
itself.
Warning
If within geom_point()
, it will only apply to that geom. Here it doesn’t matter because we only have one geom.
We add on another layer to specify the colors we want.
Tip
There are a lot of options that follow the same naming scheme.
There are tons of different options in R for color palettes.
You can play around with those in the RColorBrewer
package here.
You can access the scales in that package with scale_color_brewer()
Tip
Each of the scale_color_x()
functions has a lot of the same arguments. For a lot more info visit the ggplot2 book
There are a lot of scale_x_()
and scale_y_()
functions for you to explore
Tip
The naming schemes work similarly to the scale_color
ones, just with different options!
The {scales}
packages contains lots of helpful number formatting functions
See the the examples in help(scale_x_log10)
for some of them
{ggplot2}
is the ability to “facet” a graph by splitting it up according to the values of some variableWe’ll introduce bar graphs at the same time!
Notice how we only need an x =
argument - the y-axis is automatically the count with this geom.
The facet_grid()
function splits up the data according to a variable(s).
Here we’ve split it by region into columns.
Since that was hard to read, we’ll probably want to split by rows instead.
We can also add a row for all of the data combined.
That squishes the other rows though! We can allow them all to have their own axis limits with the scales =
argument.
Other options are “free_x” if we want to allow the x-axis scale to vary, or just “free” to allow for both.
We can use facet_wrap()
instead, if we want to use both multiple rows and columns for all the values of a variable.
It tries to make a good decision, but you can override how many columns you want!
When we have a variable with a lot of possible values, we may want to bin them with a histogram
stat_bin()
using bins = 30
. Pick better value with binwidth
.We used discrete values with geom_bar()
, but with geom_histogram()
we’re combining values: the default is into 30 bins.
This is one of the most common warning messages I get in R!
We can use bins =
instead, if we want!
Note
Note how this fits into the <characteristic> = "value"
structure
Warning
Be aware that you may interpret your data differently depending on how you bin it!
Sometimes the bin width actually has some meaning so we want to specify that
Here, each bin is $1000 – you can see the $5000 and $10000 increments
You probably recognize the ggplot theme. But did you know you can trick people into thinking you made your figures in Stata?
p <- ggplot(nlsy,
aes(x = factor(sleep_wknd),
y = sleep_wkdy,
fill = factor(sleep_wknd))) +
geom_boxplot() +
scale_fill_discrete(guide = "none") +
labs(x = "hours slept on weekends",
y = "hours slept on weekends",
title = "The more people sleep on weekends, the more they\n sleep on weekdays",
subtitle = "According to NLSY data")
p
Let’s store our plot first.
Plots work just like other R objects, meaning we can use the assignment arrow.
Question
Can you figure out what each chunk of this code is doing to the figure?
We can change the overall theme
Since we stored the plot as p
, it’s easy to add on / try different things
Other packages contain themes, too.
In case you miss Excel….
You can even make your own!
If your data changes, you can easily run the whole script again:
The ggsave()
function will automatically save the most recent plot in your output.
To be safe, you can store your plot, e.g., p <- ggplot(...) + ...
and then
janitor
and gtsummary
packages to make tablesbroom
to tidy regression resultsggplot2
tabyl()
: create frequency tablestbl_summary()
: create summary tables with a lot of covariateslm()
: fit linear regression modelsglm()
: fit generalized linear modelstbl_regression()
: create regression tablestidy()
: tidy regression resultsggplot()
: create figuresgeom_...()
: add a geometric feature to a figurescale_...()
: change the scale of an axis or aestheticfacet_...()
: split a figure into panelstheme_...()
: change the look of a figure