ID 543: Day 4

Introduction to R

Homework review

Today’s goals

  • Make a simple frequency tables
  • Make a more complex “Table 1”
  • Run regressions in R and show the results in a table
  • Introduce the framework of ggplot2 and make some simple figures

Tables

We saw that we could compute summary statistics using summarize():

nlsy |> 
  group_by(sex_cat) |> 
  summarize(prop_glasses = mean(glasses),
            mean_age_bir = mean(age_bir),
            n_vg_eye = sum(eyesight_cat == "Very Good"),
            prop_vg_eye = mean(eyesight_cat == "Very Good"),
            n_g_eye = sum(eyesight_cat == "Good"),
            prop_g_eye = mean(eyesight_cat == "Good")
  )
# A tibble: 2 × 7
  sex_cat prop_glasses mean_age_bir n_vg_eye prop_vg_eye n_g_eye prop_g_eye
  <fct>          <dbl>        <dbl>    <int>       <dbl>   <int>      <dbl>
1 Male           0.441         25.1      162       0.323      85      0.170
2 Female         0.572         22.2      223       0.317     164      0.233

Easier way to make frequency tables

The {janitor} package has nice functionality to create tables:

# install.packages("janitor")
library(janitor)
tabyl(nlsy, sex_cat) 
 sex_cat   n   percent
    Male 501 0.4157676
  Female 704 0.5842324

Tip

Just like many of our other functions, it takes a dataset as its first argument so can pipe

Cross-tabulations with tabyl()

It’s easy to make a 2 x 2 (or n x m) table with tabyl()

tabyl(nlsy, sex_cat, eyesight_cat)
 sex_cat Excellent Very Good Good Fair Poor
    Male       228       162   85   21    5
  Female       246       223  164   57   14

tabyl()

We might want to see the percentages instead of counts

nlsy |> 
  tabyl(sex_cat, glasses_cat) |> 
  adorn_percentages() |> 
  adorn_pct_formatting()
 sex_cat Doesn't wear glasses Wears glasses/contacts
    Male                55.9%                  44.1%
  Female                42.8%                  57.2%

Tip

By default, adorn_percentages() will give you row percentages (sums to 100% across the row)

Other percentages

nlsy |>
    tabyl(sex_cat, glasses_cat) |>  
    adorn_percentages("col") |>             
    adorn_pct_formatting()
 sex_cat Doesn't wear glasses Wears glasses/contacts
    Male                48.2%                  35.4%
  Female                51.8%                  64.6%
nlsy |>
    tabyl(sex_cat, glasses_cat) |>
    adorn_percentages("all") |>             
    adorn_pct_formatting()
 sex_cat Doesn't wear glasses Wears glasses/contacts
    Male                23.2%                  18.3%
  Female                25.0%                  33.4%

Other functions

  • Don’t need to use adorn_percentages() with a 1-way table because it comes with them automatically
  • Add back n’s with adorn_ns()
  • Can add totals with adorn_totals() to either type of table (argument where = determines whether they are row, column, or totals for both)
  • Round with adorn_rounding()

tabyl()

These are just dataframes, so we can save as csv, etc.

two_by_two <- nlsy |>
    tabyl(sex_cat, glasses_cat)

write_csv(two_by_two, 
          here::here("results", "sex-eyesight-table.csv"))

Warning

If you don’t have a “results” folder in your project, that code won’t work until you make one!

We can easily do a chi-squared test

tabyl(nlsy, sex_cat, eyesight_cat) |> 
  chisq.test()

    Pearson's Chi-squared test

data:  tabyl(nlsy, sex_cat, eyesight_cat)
X-squared = 22.738, df = 4, p-value = 0.0001428

Tip

There are a lot of other helpful functions in the janitor package. My favorite is clean_names(). Check out the documentation for more!

Exercise

Making more complex tables

There are lots of packages for making tables in R

One of my favorites is {gtsummary}

# install.packages("gtsummary")
library(gtsummary)

gtsummary::tbl_summary()

tbl_summary(
  nlsy,
  by = sex_cat,
  include = c(race_eth_cat,
              eyesight_cat, 
              glasses, 
              age_bir))
Characteristic Male
N = 5011
Female
N = 7041
race_eth_cat

    Hispanic 81 (16%) 130 (18%)
    Black 138 (28%) 169 (24%)
    Non-Black, Non-Hispanic 282 (56%) 405 (58%)
eyesight_cat

    Excellent 228 (46%) 246 (35%)
    Very Good 162 (32%) 223 (32%)
    Good 85 (17%) 164 (23%)
    Fair 21 (4.2%) 57 (8.1%)
    Poor 5 (1.0%) 14 (2.0%)
glasses 221 (44%) 403 (57%)
age_bir 24.0 (21.0, 29.0) 21.0 (18.0, 26.0)
1 n (%); Median (Q1, Q3)

tbl_summary(
  nlsy,
  by = sex_cat,
  include = c(race_eth_cat, eyesight_cat, 
              glasses, age_bir),
  label = list(
    race_eth_cat ~ "Race/ethnicity",
    eyesight_cat ~ "Eyesight",
    glasses ~ "Wears glasses",
    age_bir ~ "Age at first birth"
  ))
Characteristic Male
N = 5011
Female
N = 7041
Race/ethnicity

    Hispanic 81 (16%) 130 (18%)
    Black 138 (28%) 169 (24%)
    Non-Black, Non-Hispanic 282 (56%) 405 (58%)
Eyesight

    Excellent 228 (46%) 246 (35%)
    Very Good 162 (32%) 223 (32%)
    Good 85 (17%) 164 (23%)
    Fair 21 (4.2%) 57 (8.1%)
    Poor 5 (1.0%) 14 (2.0%)
Wears glasses 221 (44%) 403 (57%)
Age at first birth 24.0 (21.0, 29.0) 21.0 (18.0, 26.0)
1 n (%); Median (Q1, Q3)

tbl_summary(
  nlsy,
  by = sex_cat,
  include = c(race_eth_cat, eyesight_cat, 
              glasses, age_bir),
  label = list(
    race_eth_cat ~ "Race/ethnicity",
    eyesight_cat ~ "Eyesight",
    glasses ~ "Wears glasses",
    age_bir ~ "Age at first birth"
  )) |> 
  add_p(test = list(
        all_continuous() ~ "t.test", 
        all_categorical() ~ "chisq.test")) |> 
  add_overall(col_label = "**Total**") |> 
  bold_labels() |> 
  modify_header(label = "**Variable**", 
                p.value = "**P**") |> 
  remove_footnote_header() |> 
  modify_caption("Participant characteristics")
Participant characteristics
Variable Total Male
N = 501
Female
N = 704
P
Race/ethnicity


0.3
    Hispanic 211 (18%) 81 (16%) 130 (18%)
    Black 307 (25%) 138 (28%) 169 (24%)
    Non-Black, Non-Hispanic 687 (57%) 282 (56%) 405 (58%)
Eyesight


<0.001
    Excellent 474 (39%) 228 (46%) 246 (35%)
    Very Good 385 (32%) 162 (32%) 223 (32%)
    Good 249 (21%) 85 (17%) 164 (23%)
    Fair 78 (6.5%) 21 (4.2%) 57 (8.1%)
    Poor 19 (1.6%) 5 (1.0%) 14 (2.0%)
Wears glasses 624 (52%) 221 (44%) 403 (57%) <0.001
Age at first birth 22.0 (19.0, 27.0) 24.0 (21.0, 29.0) 21.0 (18.0, 26.0) <0.001

tbl_summary()

  • Incredibly customizeable
  • Really helpful with Table 1
  • I often just view in the web browser and copy and paste into a Word document
  • Can also be used within quarto/R Markdown
  • If output is Word, I use as_flex_table() to output using the flextable package
  • There are a million other ways to customize tables

Table customization

Besides gtsummary, there are other packages that can help with tables and customizing them (particularly for html, but also if you want to output to Word or LaTeX)

  • {gt}: html table package that {gtsummary} uses under the hood
  • {kableExtra}: great for html tables, but also helpful for LaTeX
  • {flextable}: great for Word (and Powerpoint!) tables, but also works with other formats

See also the winners of the Posit table contest!

Exercise

Regression

Regressions take a formula: y ~ x1 + x2 + x3

  • Include interaction terms between x1 and x2 with y ~ x1*x2 + x3
    • Main effects of x1 and x2 will be included too
  • Indicator (“dummy”) variables will automatically be created for factors
    • The first level will be the reference level
  • If you want to include a squared term (for example), you can make the squared variable first, or wrap in I(): y ~ x1 + I(x1^2)

Regression

To fit a linear regression (by ordinary least squares), use the lm() function

To fit a generalized linear model (e.g., logistic regression, Poisson regression) use glm() and specify the family = argument

  • family = gaussian() is the default: another way of fitting a linear regresion
  • family = binomial() gives you logistic regression
  • family = poisson() is Poisson regression

We tell R what dataset to pull the variables from with data =

Regression

linear_model <- lm(income ~ sex_cat*age_bir + race_eth_cat, 
                   data = nlsy)
logistic_model <- glm(glasses ~ eyesight_cat + sex_cat + income, 
                      data = nlsy, family = binomial())

Regression

coef(linear_model)
                        (Intercept)                       sex_catFemale 
                          1029.3339                          -4681.2484 
                            age_bir                   race_eth_catBlack 
                           482.4650                           -299.6490 
race_eth_catNon-Black, Non-Hispanic               sex_catFemale:age_bir 
                          6418.4568                            161.5008 
confint(linear_model)
                                           2.5 %    97.5 %
(Intercept)                          -3746.58543 5805.2531
sex_catFemale                       -10553.32090 1190.8242
age_bir                                303.50836  661.4217
race_eth_catBlack                    -2448.39108 1849.0932
race_eth_catNon-Black, Non-Hispanic   4500.91244 8336.0012
sex_catFemale:age_bir                  -77.30931  400.3109

Regression

exp(coef(logistic_model))
          (Intercept) eyesight_catVery Good      eyesight_catGood 
            0.6338301             0.9172091             0.8608297 
     eyesight_catFair      eyesight_catPoor         sex_catFemale 
            0.5798278             1.1624355             1.8362460 
               income 
            1.0000174 
summary(logistic_model)

Call:
glm(formula = glasses ~ eyesight_cat + sex_cat + income, family = binomial(), 
    data = nlsy)

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -4.560e-01  1.383e-01  -3.298 0.000975 ***
eyesight_catVery Good -8.642e-02  1.400e-01  -0.617 0.537031    
eyesight_catGood      -1.499e-01  1.604e-01  -0.934 0.350267    
eyesight_catFair      -5.450e-01  2.535e-01  -2.150 0.031585 *  
eyesight_catPoor       1.505e-01  4.786e-01   0.315 0.753121    
sex_catFemale          6.077e-01  1.210e-01   5.021 5.14e-07 ***
income                 1.745e-05  4.613e-06   3.782 0.000155 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1668.9  on 1204  degrees of freedom
Residual deviance: 1626.8  on 1198  degrees of freedom
AIC: 1640.8

Number of Fisher Scoring iterations: 4

Helpful regression packages

{broom} helps “tidy” regression results

library(broom)
tidy(linear_model)
# A tibble: 6 × 5
  term                                estimate std.error statistic  p.value
  <chr>                                  <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                            1029.    2434.      0.423 6.72e- 1
2 sex_catFemale                         -4681.    2993.     -1.56  1.18e- 1
3 age_bir                                 482.      91.2     5.29  1.46e- 7
4 race_eth_catBlack                      -300.    1095.     -0.274 7.84e- 1
5 race_eth_catNon-Black, Non-Hispanic    6418.     977.      6.57  7.63e-11
6 sex_catFemale:age_bir                   162.     122.      1.33  1.85e- 1

broom::tidy()

tidy(logistic_model, conf.int = TRUE, exponentiate = TRUE)
# A tibble: 7 × 7
  term                  estimate  std.error statistic p.value conf.low conf.high
  <chr>                    <dbl>      <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 (Intercept)              0.634 0.138         -3.30  9.75e-4    0.483     0.830
2 eyesight_catVery Good    0.917 0.140         -0.617 5.37e-1    0.697     1.21 
3 eyesight_catGood         0.861 0.160         -0.934 3.50e-1    0.628     1.18 
4 eyesight_catFair         0.580 0.254         -2.15  3.16e-2    0.350     0.949
5 eyesight_catPoor         1.16  0.479          0.315 7.53e-1    0.458     3.08 
6 sex_catFemale            1.84  0.121          5.02  5.14e-7    1.45      2.33 
7 income                   1.00  0.00000461     3.78  1.55e-4    1.00      1.00 

broom::tidy() can also help other statistics

tabyl(nlsy, sex_cat, eyesight_cat) |> 
  chisq.test() |> 
  tidy()
# A tibble: 1 × 4
  statistic  p.value parameter method                    
      <dbl>    <dbl>     <int> <chr>                     
1      22.7 0.000143         4 Pearson's Chi-squared test
t.test(income ~ sex_cat, data = nlsy) |> 
  tidy()
# A tibble: 1 × 10
  estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
     <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
1    2397.    16690.    14292.      3.00 0.00273      963.     831.     3963.
# ℹ 2 more variables: method <chr>, alternative <chr>

gtsummary::tbl_regression()

tbl_regression(
  linear_model, 
  intercept = TRUE,
  label = list(
    sex_cat ~ "Sex",
    race_eth_cat ~ "Race/ethnicity",
    age_bir ~ "Age at first birth",
    `sex_cat:age_bir` ~ "Sex/age interaction"
  ))
Characteristic Beta 95% CI p-value
(Intercept) 1,029 -3,747, 5,805 0.7
Sex


    Male
    Female -4,681 -10,553, 1,191 0.12
Age at first birth 482 304, 661 <0.001
Race/ethnicity


    Hispanic
    Black -300 -2,448, 1,849 0.8
    Non-Black, Non-Hispanic 6,418 4,501, 8,336 <0.001
Sex/age interaction


    Female * Age at first birth 162 -77, 400 0.2
Abbreviation: CI = Confidence Interval

gtsummary::tbl_regression()

tbl_regression(
  logistic_model, 
  exponentiate = TRUE,
  label = list(
    sex_cat ~ "Sex",
    eyesight_cat ~ "Eyesight",
    income ~ "Income"
  ))
Characteristic OR 95% CI p-value
Eyesight


    Excellent
    Very Good 0.92 0.70, 1.21 0.5
    Good 0.86 0.63, 1.18 0.4
    Fair 0.58 0.35, 0.95 0.032
    Poor 1.16 0.46, 3.08 0.8
Sex


    Male
    Female 1.84 1.45, 2.33 <0.001
Income 1.00 1.00, 1.00 <0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

You could put several together

linear_model_no_int <- lm(income ~ sex_cat + age_bir + race_eth_cat, data = nlsy)

tbl_no_int <- tbl_regression(
  linear_model_no_int, 
  intercept = TRUE,
  label = list(
    sex_cat ~ "Sex",
    race_eth_cat ~ "Race/ethnicity",
    age_bir ~ "Age at first birth"
  ))

tbl_int <- tbl_regression(
  linear_model, 
  intercept = TRUE,
  label = list(
    sex_cat ~ "Sex",
    race_eth_cat ~ "Race/ethnicity",
    age_bir ~ "Age at first birth",
    `sex_cat:age_bir` ~ "Sex/age interaction"
  ))

You could put several together

tbl_merge(list(tbl_no_int, tbl_int), 
          tab_spanner = c("**Model 1**", "**Model 2**"))
Characteristic
Model 1
Model 2
Beta 95% CI p-value Beta 95% CI p-value
(Intercept) -1,201 -4,657, 2,256 0.5 1,029 -3,747, 5,805 0.7
Sex





    Male

    Female -833 -2,283, 617 0.3 -4,681 -10,553, 1,191 0.12
Age at first birth 571 448, 693 <0.001 482 304, 661 <0.001
Race/ethnicity





    Hispanic

    Black -287 -2,436, 1,863 0.8 -300 -2,448, 1,849 0.8
    Non-Black, Non-Hispanic 6,434 4,516, 8,352 <0.001 6,418 4,501, 8,336 <0.001
Sex/age interaction





    Female * Age at first birth


162 -77, 400 0.2
Abbreviation: CI = Confidence Interval

Exercise

Figures in R using ggplot()

Figures in R using ggplot()

Why ggplot?

  • Powerful and flexible: create complex and customized visualizations easily
  • Reproducibility and efficiency: promotes reproducibility by offering consistent syntax and saves time through automation of plot creation
  • Layered approach: incrementally build visualizations with multiple layers, exploring different aspects of data
  • Extensive customization: essentially infinite options to tailor visualizations

ggplot builds figures by adding on pieces via a particular “grammar of graphics”

Basic structure of a ggplot

ggplot(data = {data}, 
       aes(x = {xvar}, y = {yvar}, <characteristic> = {othvar}, ...)) +
      <geom>(<characteristic> = "value", ...) + 
      ...
  • {data}: must be a dataframe (or tibble!)
  • {xvar} and {yvar} are the names (unquoted) of the variables on the x- and y-axes
    • some graphs may not require both, or may require other parameters
  • {othvar} is some other unquoted variable name that defines a grouping or other characteristic you want to map to an aesthetic
  • <characteristic>: you can map {othvar} (or a fixed "value") to any of a number of aesthetic features of the figure; e.g., color, shape, size, linetype, etc.
  • <geom>: the geometric feature you want to use; e.g., point (scatterplot), line, histogram, bar, etc.
  • "value": a fixed value that defines some characteristic of the figure; e.g., “red”, 10, “dashed”
  • … : there are numerous other options to discover!

Let’s walk through the creation of a figure

ggplot(data = nlsy, 
       aes(x = eyesight_cat, 
           fill = eyesight_cat))

  • ggplot() doesn’t plot any data itself, it just sets up the data and variables

Let’s walk through the creation of a figure

ggplot(data = nlsy, 
       aes(x = eyesight_cat, 
           fill = eyesight_cat)) +
  geom_bar()

  • geom_bar() creates a bar graph for the number of observations with a certain value of the x variable
    • does not need a y variable

Tip

use geom_col() if you have a y variable that you want to use as the height of the bars

Let’s walk through the creation of a figure

ggplot(data = nlsy, 
       aes(x = eyesight_cat, 
           fill = eyesight_cat)) +
  geom_bar() +
  facet_grid(cols = vars(glasses_cat))

  • facet_grid() creates a panel for each value of another variable
    • can also do rows =
    • variable name should be within vars() (you can use helpers like starts_with())

Tip

use facet_wrap() if you want to create panels that expand along rows and columns (e.g., to facet by many countries)

Let’s walk through the creation of a figure

ggplot(data = nlsy, 
       aes(x = eyesight_cat, 
           fill = eyesight_cat)) +
  geom_bar() +
  facet_grid(cols = vars(glasses_cat)) +
  scale_fill_brewer(palette = "Spectral",
                    direction = -1)

  • scale_{fill/color}_{...}() functions change the color palette
    • some are appropriate for continuous variables, others discrete

Tip

scale_fill_viridis_d() good color-blind and black & white-friendly options

Let’s walk through the creation of a figure

ggplot(data = nlsy, 
       aes(x = eyesight_cat, 
           fill = eyesight_cat)) +
  geom_bar() +
  facet_grid(cols = vars(glasses_cat)) +
  scale_fill_brewer(palette = "Spectral",
                    direction = -1) +
  scale_x_discrete(breaks = c("Excellent", 
                          "Good", "Poor"),
                name = "Eyesight quality")