Introduction to R
ggplot2
and make some simple figuresWe saw that we could compute summary statistics using summarize()
:
nlsy |>
group_by(sex_cat) |>
summarize(prop_glasses = mean(glasses),
mean_age_bir = mean(age_bir),
n_vg_eye = sum(eyesight_cat == "Very Good"),
prop_vg_eye = mean(eyesight_cat == "Very Good"),
n_g_eye = sum(eyesight_cat == "Good"),
prop_g_eye = mean(eyesight_cat == "Good")
)
# A tibble: 2 × 7
sex_cat prop_glasses mean_age_bir n_vg_eye prop_vg_eye n_g_eye prop_g_eye
<fct> <dbl> <dbl> <int> <dbl> <int> <dbl>
1 Male 0.441 25.1 162 0.323 85 0.170
2 Female 0.572 22.2 223 0.317 164 0.233
The {janitor}
package has nice functionality to create tables:
sex_cat n percent
Male 501 0.4157676
Female 704 0.5842324
Tip
Just like many of our other functions, it takes a dataset as its first argument so can pipe
tabyl()
It’s easy to make a 2 x 2 (or n x m) table with tabyl()
tabyl()
We might want to see the percentages instead of counts
sex_cat Doesn't wear glasses Wears glasses/contacts
Male 55.9% 44.1%
Female 42.8% 57.2%
Tip
By default, adorn_percentages()
will give you row percentages (sums to 100% across the row)
sex_cat Doesn't wear glasses Wears glasses/contacts
Male 48.2% 35.4%
Female 51.8% 64.6%
sex_cat Doesn't wear glasses Wears glasses/contacts
Male 23.2% 18.3%
Female 25.0% 33.4%
adorn_percentages()
with a 1-way table because it comes with them automaticallyadorn_ns()
adorn_totals()
to either type of table (argument where =
determines whether they are row, column, or totals for both)adorn_rounding()
tabyl()
These are just dataframes, so we can save as csv, etc.
Warning
If you don’t have a “results” folder in your project, that code won’t work until you make one!
Pearson's Chi-squared test
data: tabyl(nlsy, sex_cat, eyesight_cat)
X-squared = 22.738, df = 4, p-value = 0.0001428
Tip
There are a lot of other helpful functions in the janitor
package. My favorite is clean_names()
. Check out the documentation for more!
There are lots of packages for making tables in R
One of my favorites is {gtsummary}
gtsummary::tbl_summary()
Characteristic | Male N = 5011 |
Female N = 7041 |
---|---|---|
race_eth_cat | ||
Hispanic | 81 (16%) | 130 (18%) |
Black | 138 (28%) | 169 (24%) |
Non-Black, Non-Hispanic | 282 (56%) | 405 (58%) |
eyesight_cat | ||
Excellent | 228 (46%) | 246 (35%) |
Very Good | 162 (32%) | 223 (32%) |
Good | 85 (17%) | 164 (23%) |
Fair | 21 (4.2%) | 57 (8.1%) |
Poor | 5 (1.0%) | 14 (2.0%) |
glasses | 221 (44%) | 403 (57%) |
age_bir | 24.0 (21.0, 29.0) | 21.0 (18.0, 26.0) |
1 n (%); Median (Q1, Q3) |
Characteristic | Male N = 5011 |
Female N = 7041 |
---|---|---|
Race/ethnicity | ||
Hispanic | 81 (16%) | 130 (18%) |
Black | 138 (28%) | 169 (24%) |
Non-Black, Non-Hispanic | 282 (56%) | 405 (58%) |
Eyesight | ||
Excellent | 228 (46%) | 246 (35%) |
Very Good | 162 (32%) | 223 (32%) |
Good | 85 (17%) | 164 (23%) |
Fair | 21 (4.2%) | 57 (8.1%) |
Poor | 5 (1.0%) | 14 (2.0%) |
Wears glasses | 221 (44%) | 403 (57%) |
Age at first birth | 24.0 (21.0, 29.0) | 21.0 (18.0, 26.0) |
1 n (%); Median (Q1, Q3) |
tbl_summary(
nlsy,
by = sex_cat,
include = c(race_eth_cat, eyesight_cat,
glasses, age_bir),
label = list(
race_eth_cat ~ "Race/ethnicity",
eyesight_cat ~ "Eyesight",
glasses ~ "Wears glasses",
age_bir ~ "Age at first birth"
)) |>
add_p(test = list(
all_continuous() ~ "t.test",
all_categorical() ~ "chisq.test")) |>
add_overall(col_label = "**Total**") |>
bold_labels() |>
modify_header(label = "**Variable**",
p.value = "**P**") |>
remove_footnote_header() |>
modify_caption("Participant characteristics")
Variable | Total | Male N = 501 |
Female N = 704 |
P |
---|---|---|---|---|
Race/ethnicity | 0.3 | |||
Hispanic | 211 (18%) | 81 (16%) | 130 (18%) | |
Black | 307 (25%) | 138 (28%) | 169 (24%) | |
Non-Black, Non-Hispanic | 687 (57%) | 282 (56%) | 405 (58%) | |
Eyesight | <0.001 | |||
Excellent | 474 (39%) | 228 (46%) | 246 (35%) | |
Very Good | 385 (32%) | 162 (32%) | 223 (32%) | |
Good | 249 (21%) | 85 (17%) | 164 (23%) | |
Fair | 78 (6.5%) | 21 (4.2%) | 57 (8.1%) | |
Poor | 19 (1.6%) | 5 (1.0%) | 14 (2.0%) | |
Wears glasses | 624 (52%) | 221 (44%) | 403 (57%) | <0.001 |
Age at first birth | 22.0 (19.0, 27.0) | 24.0 (21.0, 29.0) | 21.0 (18.0, 26.0) | <0.001 |
tbl_summary()
as_flex_table()
to output using the flextable
packageBesides gtsummary
, there are other packages that can help with tables and customizing them (particularly for html, but also if you want to output to Word or LaTeX)
{gt}
: html table package that {gtsummary}
uses under the hood{kableExtra}
: great for html tables, but also helpful for LaTeX{flextable}
: great for Word (and Powerpoint!) tables, but also works with other formatsSee also the winners of the Posit table contest!
Regressions take a formula: y ~ x1 + x2 + x3
x1
and x2
with y ~ x1*x2 + x3
x1
and x2
will be included tooI()
: y ~ x1 + I(x1^2)
To fit a linear regression (by ordinary least squares), use the lm()
function
To fit a generalized linear model (e.g., logistic regression, Poisson regression) use glm()
and specify the family =
argument
family = gaussian()
is the default: another way of fitting a linear regresionfamily = binomial()
gives you logistic regressionfamily = poisson()
is Poisson regressionWe tell R what dataset to pull the variables from with data =
(Intercept) sex_catFemale
1029.3339 -4681.2484
age_bir race_eth_catBlack
482.4650 -299.6490
race_eth_catNon-Black, Non-Hispanic sex_catFemale:age_bir
6418.4568 161.5008
2.5 % 97.5 %
(Intercept) -3746.58543 5805.2531
sex_catFemale -10553.32090 1190.8242
age_bir 303.50836 661.4217
race_eth_catBlack -2448.39108 1849.0932
race_eth_catNon-Black, Non-Hispanic 4500.91244 8336.0012
sex_catFemale:age_bir -77.30931 400.3109
(Intercept) eyesight_catVery Good eyesight_catGood
0.6338301 0.9172091 0.8608297
eyesight_catFair eyesight_catPoor sex_catFemale
0.5798278 1.1624355 1.8362460
income
1.0000174
Call:
glm(formula = glasses ~ eyesight_cat + sex_cat + income, family = binomial(),
data = nlsy)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.560e-01 1.383e-01 -3.298 0.000975 ***
eyesight_catVery Good -8.642e-02 1.400e-01 -0.617 0.537031
eyesight_catGood -1.499e-01 1.604e-01 -0.934 0.350267
eyesight_catFair -5.450e-01 2.535e-01 -2.150 0.031585 *
eyesight_catPoor 1.505e-01 4.786e-01 0.315 0.753121
sex_catFemale 6.077e-01 1.210e-01 5.021 5.14e-07 ***
income 1.745e-05 4.613e-06 3.782 0.000155 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1668.9 on 1204 degrees of freedom
Residual deviance: 1626.8 on 1198 degrees of freedom
AIC: 1640.8
Number of Fisher Scoring iterations: 4
{broom}
helps “tidy” regression results
# A tibble: 6 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1029. 2434. 0.423 6.72e- 1
2 sex_catFemale -4681. 2993. -1.56 1.18e- 1
3 age_bir 482. 91.2 5.29 1.46e- 7
4 race_eth_catBlack -300. 1095. -0.274 7.84e- 1
5 race_eth_catNon-Black, Non-Hispanic 6418. 977. 6.57 7.63e-11
6 sex_catFemale:age_bir 162. 122. 1.33 1.85e- 1
broom::tidy()
# A tibble: 7 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.634 0.138 -3.30 9.75e-4 0.483 0.830
2 eyesight_catVery Good 0.917 0.140 -0.617 5.37e-1 0.697 1.21
3 eyesight_catGood 0.861 0.160 -0.934 3.50e-1 0.628 1.18
4 eyesight_catFair 0.580 0.254 -2.15 3.16e-2 0.350 0.949
5 eyesight_catPoor 1.16 0.479 0.315 7.53e-1 0.458 3.08
6 sex_catFemale 1.84 0.121 5.02 5.14e-7 1.45 2.33
7 income 1.00 0.00000461 3.78 1.55e-4 1.00 1.00
broom::tidy()
can also help other statistics# A tibble: 1 × 4
statistic p.value parameter method
<dbl> <dbl> <int> <chr>
1 22.7 0.000143 4 Pearson's Chi-squared test
# A tibble: 1 × 10
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2397. 16690. 14292. 3.00 0.00273 963. 831. 3963.
# ℹ 2 more variables: method <chr>, alternative <chr>
gtsummary::tbl_regression()
Characteristic | Beta | 95% CI | p-value |
---|---|---|---|
(Intercept) | 1,029 | -3,747, 5,805 | 0.7 |
Sex | |||
Male | — | — | |
Female | -4,681 | -10,553, 1,191 | 0.12 |
Age at first birth | 482 | 304, 661 | <0.001 |
Race/ethnicity | |||
Hispanic | — | — | |
Black | -300 | -2,448, 1,849 | 0.8 |
Non-Black, Non-Hispanic | 6,418 | 4,501, 8,336 | <0.001 |
Sex/age interaction | |||
Female * Age at first birth | 162 | -77, 400 | 0.2 |
Abbreviation: CI = Confidence Interval |
gtsummary::tbl_regression()
Characteristic | OR | 95% CI | p-value |
---|---|---|---|
Eyesight | |||
Excellent | — | — | |
Very Good | 0.92 | 0.70, 1.21 | 0.5 |
Good | 0.86 | 0.63, 1.18 | 0.4 |
Fair | 0.58 | 0.35, 0.95 | 0.032 |
Poor | 1.16 | 0.46, 3.08 | 0.8 |
Sex | |||
Male | — | — | |
Female | 1.84 | 1.45, 2.33 | <0.001 |
Income | 1.00 | 1.00, 1.00 | <0.001 |
Abbreviations: CI = Confidence Interval, OR = Odds Ratio |
linear_model_no_int <- lm(income ~ sex_cat + age_bir + race_eth_cat, data = nlsy)
tbl_no_int <- tbl_regression(
linear_model_no_int,
intercept = TRUE,
label = list(
sex_cat ~ "Sex",
race_eth_cat ~ "Race/ethnicity",
age_bir ~ "Age at first birth"
))
tbl_int <- tbl_regression(
linear_model,
intercept = TRUE,
label = list(
sex_cat ~ "Sex",
race_eth_cat ~ "Race/ethnicity",
age_bir ~ "Age at first birth",
`sex_cat:age_bir` ~ "Sex/age interaction"
))
Characteristic |
Model 1
|
Model 2
|
||||
---|---|---|---|---|---|---|
Beta | 95% CI | p-value | Beta | 95% CI | p-value | |
(Intercept) | -1,201 | -4,657, 2,256 | 0.5 | 1,029 | -3,747, 5,805 | 0.7 |
Sex | ||||||
Male | — | — | — | — | ||
Female | -833 | -2,283, 617 | 0.3 | -4,681 | -10,553, 1,191 | 0.12 |
Age at first birth | 571 | 448, 693 | <0.001 | 482 | 304, 661 | <0.001 |
Race/ethnicity | ||||||
Hispanic | — | — | — | — | ||
Black | -287 | -2,436, 1,863 | 0.8 | -300 | -2,448, 1,849 | 0.8 |
Non-Black, Non-Hispanic | 6,434 | 4,516, 8,352 | <0.001 | 6,418 | 4,501, 8,336 | <0.001 |
Sex/age interaction | ||||||
Female * Age at first birth | 162 | -77, 400 | 0.2 | |||
Abbreviation: CI = Confidence Interval |
ggplot()
ggplot()
{data}
: must be a dataframe (or tibble!){xvar}
and {yvar}
are the names (unquoted) of the variables on the x- and y-axes
{othvar}
is some other unquoted variable name that defines a grouping or other characteristic you want to map to an aesthetic<characteristic>
: you can map {othvar}
(or a fixed "value"
) to any of a number of aesthetic features of the figure; e.g., color, shape, size, linetype, etc.<geom>
: the geometric feature you want to use; e.g., point (scatterplot), line, histogram, bar, etc."value"
: a fixed value that defines some characteristic of the figure; e.g., “red”, 10, “dashed”ggplot()
doesn’t plot any data itself, it just sets up the data and variablesgeom_bar()
creates a bar graph for the number of observations with a certain value of the x
variable
y
variableTip
use geom_col()
if you have a y
variable that you want to use as the height of the bars
facet_grid()
creates a panel for each value of another variable
rows =
vars()
(you can use helpers like starts_with()
)Tip
use facet_wrap()
if you want to create panels that expand along rows and columns (e.g., to facet by many countries)
scale_{fill/color}_{...}()
functions change the color palette
Tip
scale_fill_viridis_d()
good color-blind and black & white-friendly options