ID 543: Day 3

Introduction to R

Homework review

Today’s goals

  • Understand how to use summarize() to get summary statistics
  • Learn how to read in data from different file types
  • Understand how to use the here package to refer to files
  • Learn some tools for dealing with missing data
  • Learn how to join datasets using left_join(), right_join(), full_join(), and inner_join()

Summary statistics

We can get certain summary statistics about our data with summary(), which we can use either on an entire dataframe or on a single variable

nlsy_sleep <- nlsy |> 
  select(id, contains("sleep"), age_bir, sex)
summary(nlsy_sleep$age_bir)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  13.00   19.00   22.00   23.45   27.00   52.00 
summary(nlsy_sleep)
       id          sleep_wkdy       sleep_wknd        age_bir     
 Min.   :    3   Min.   : 0.000   Min.   : 0.000   Min.   :13.00  
 1st Qu.: 2317   1st Qu.: 6.000   1st Qu.: 6.000   1st Qu.:19.00  
 Median : 4744   Median : 7.000   Median : 7.000   Median :22.00  
 Mean   : 5229   Mean   : 6.643   Mean   : 7.267   Mean   :23.45  
 3rd Qu.: 7937   3rd Qu.: 8.000   3rd Qu.: 8.000   3rd Qu.:27.00  
 Max.   :12667   Max.   :13.000   Max.   :14.000   Max.   :52.00  
      sex       
 Min.   :1.000  
 1st Qu.:1.000  
 Median :2.000  
 Mean   :1.584  
 3rd Qu.:2.000  
 Max.   :2.000  

Summary statistics

We can also apply certain functions to a variable(s) to get a single statistic: mean(), median(), var(), sd(), cov(), cor(), min(), max(), quantile(), etc.

median(nlsy$age_bir)
[1] 22
cor(nlsy$sleep_wkdy, nlsy$sleep_wknd)
[1] 0.7101579
quantile(nlsy$income, probs = c(0.1, 0.9))
    10%     90% 
 3177.2 33024.0 

New function: summarize()

But what if we want a lot of summary statistics – just not those that come with the summary() function?

  • For example, it doesn’t give us a standard deviation!

We can use summarize()

summarize(nlsy, 
          sd_age_bir = sd(age_bir),
          cor_sleep = cor(sleep_wkdy, sleep_wknd),
          ten_pctle_inc = quantile(income, probs = 0.1),
          ninety_pctle_inc = quantile(income, probs = 0.9))
# A tibble: 1 × 4
  sd_age_bir cor_sleep ten_pctle_inc ninety_pctle_inc
       <dbl>     <dbl>         <dbl>            <dbl>
1       5.99     0.710         3177.            33024

summarize() specifics

Important to note:

  • Takes a dataframe as its first argument. That means we can use pipes!
  • Returns a tibble – helpful if you want to use those values in a figure or table.
  • Can give the summary statistics names.
  • Can ask for any type of function of the variables (including one you make up yourself).

Combining with other functions

Because we can pipe, we can also look at statistics of variables that we make using mutate(), in a dataset we’ve subsetted with filter().

nlsy |>
  mutate(age_bir_stand = (age_bir - mean(age_bir)) / sd(age_bir)) |>
  filter(sex == 1) |>
  summarize(mean_men = mean(age_bir_stand))
# A tibble: 1 × 1
  mean_men
     <dbl>
1    0.283

Note

Note that we’re standardizing the data before filtering. Or else the mean would be 0!

Exercise

What if we want both groups at once?

nlsy |>
  filter(sex == 1) |>
  summarize(age_bir_men = mean(age_bir))
# A tibble: 1 × 1
  age_bir_men
        <dbl>
1        25.1
nlsy |>
  filter(sex == 2) |>
  summarize(age_bir_women = mean(age_bir))
# A tibble: 1 × 1
  age_bir_women
          <dbl>
1          22.2

We can group our calculations using the .by = argument in summarize()

Instead of creating separate summaries, we can get statistics for all groups at once:

nlsy |>
  summarize(mean_age_bir = mean(age_bir), 
            .by = sex_cat)
# A tibble: 2 × 2
  sex_cat mean_age_bir
  <fct>          <dbl>
1 Female          22.2
2 Male            25.1

This is cleaner and more efficient than filtering multiple times!

Why .by =?

You’ll see some optional arguments in the tidyverse start with a dot, like .by =.

  • Since you can name a variable whatever you want within summarize(), by = (with no dot) looks like it will be another variable
  • The dot is a convention to indicate that this is an optional argument, and it won’t conflict with any variable names in your data (just don’t start your variable names with dots!)

Counting groups

Sometimes we just want to know how many observations are in a group. We already saw how to do that with count(), but we can also do it with summarize() and the n() function, which counts the number of rows:

nlsy |> 
  count(sex)
# A tibble: 2 × 2
    sex     n
  <dbl> <int>
1     1   501
2     2   704
nlsy |>
  summarize(n = n(), 
            .by = sex)
# A tibble: 2 × 2
    sex     n
  <dbl> <int>
1     2   704
2     1   501

Group by multiple variables

You can group by multiple variables by putting them in a vector:

nlsy |>
  summarize(mean_income = mean(income),
            .by = c(sex_cat, race_eth_cat))
# A tibble: 6 × 3
  sex_cat race_eth_cat            mean_income
  <fct>   <fct>                         <dbl>
1 Female  Non-Black, Non-Hispanic      17404.
2 Male    Non-Black, Non-Hispanic      20839.
3 Female  Black                         9457.
4 Female  Hispanic                     10884.
5 Male    Black                        11755.
6 Male    Hispanic                     10651.

Alternative group_by() approach

You will also see the traditional group_by() approach:

nlsy |> 
  group_by(region) |>
  summarize(mean_inc = mean(income))
# A tibble: 4 × 2
  region mean_inc
   <dbl>    <dbl>
1      1   17771.
2      2   16698.
3      3   14101.
4      4   13360.

Tibbles are “group-ed” when using group_by()

We can tell it’s “grouped” and how many groups there are by printing out the data.

The data itself won’t look (much) different, but we’ll be able to perform grouped functions on it.

nlsy_by_region <- group_by(nlsy, region)
nlsy_by_region
# A tibble: 1,205 × 15
# Groups:   region [4]
      id glasses eyesight sleep_wkdy sleep_wknd nsibs race_eth   sex region
   <dbl>   <dbl>    <dbl>      <dbl>      <dbl> <dbl>    <dbl> <dbl>  <dbl>
 1     3       0        1          5          7     3        3     2      1
 2     6       1        2          6          7     1        3     1      1
 3     8       0        2          7          9     7        3     2      1
 4    16       1        3          6          7     3        3     2      1
 5    18       0        3         10         10     2        3     1      3
 6    20       1        2          7          8     2        3     2      1
 7    27       0        1          8          8     1        3     2      1
 8    49       1        1          8          8     6        3     2      1
 9    57       1        2          7          8     1        3     2      1
10    67       0        1          8          8     1        3     1      1
# ℹ 1,195 more rows
# ℹ 6 more variables: income <dbl>, age_bir <dbl>, eyesight_cat <fct>,
#   glasses_cat <fct>, race_eth_cat <fct>, sex_cat <fct>

Stratify with group_by() |> summarize()

Instead of using the .by = argument, we can use group_by() to create a grouped dataframe, then use summarize() to get the summary statistics for each group:

nlsy |>
  mutate(income_stand = (income - mean(income))/sd(income)) |>
  group_by(region) |>
   summarize(mean_inc = mean(income_stand),
            sd_inc = sd(income_stand))
# A tibble: 4 × 3
  region mean_inc sd_inc
   <dbl>    <dbl>  <dbl>
1      1   0.186   1.17 
2      2   0.106   0.958
3      3  -0.0891  1.03 
4      4  -0.145   0.810

1. by = vs. group_by()

The .by = approach is generally simpler and doesn’t leave your data grouped (which can cause issues later)

Use group_by() when you need to perform multiple grouped operations in sequence or just create a new variable with mutate()

  • For example, you want to group your data by person id when you have multiple rows per person, e.g., to create a lagged variable
data |> 
  group_by(person_id) |> 
  mutate(last_visit = lag(visit))

Exercise

Getting other data into R

We have been reading in data as an .rds file:

nlsy_rds <- read_rds("https://github.com/louisahsmith/data/raw/main/nlsy/nlsy.rds")

We could also read it in as a .csv file:

nlsy_csv <- read_csv("https://github.com/louisahsmith/data/raw/main/nlsy/nlsy-cc.csv")

What do you notice about the differences?

nlsy_rds |> 
  select(id, eyesight_cat)
# A tibble: 1,205 × 2
      id eyesight_cat
   <dbl> <fct>       
 1     3 Excellent   
 2     6 Very Good   
 3     8 Very Good   
 4    16 Good        
 5    18 Good        
 6    20 Very Good   
 7    27 Excellent   
 8    49 Excellent   
 9    57 Very Good   
10    67 Excellent   
# ℹ 1,195 more rows
nlsy_csv |> 
  select(id, eyesight_cat)
# A tibble: 1,205 × 2
      id eyesight_cat
   <dbl> <chr>       
 1     3 Excellent   
 2     6 Very Good   
 3     8 Very Good   
 4    16 Good        
 5    18 Good        
 6    20 Very Good   
 7    27 Excellent   
 8    49 Excellent   
 9    57 Very Good   
10    67 Excellent   
# ℹ 1,195 more rows

.rds is an R-specific file for a single object

It will be the exact same object when you read it back in.

write_rds(nlsy_rds, "nlsy.rds")

You can save any object, not just a dataframe:

x <- c(4, 5, 6)
write_rds(x, "numbers.rds")

What is y going to print?

y <- read_rds("numbers.rds")
y

.csv files are much more general but don’t maintain things like factors

.csv files might need a little more specification to read in

read_csv("https://github.com/louisahsmith/data/raw/main/nlsy/nlsy.csv") |> print(n = 2)
# A tibble: 12,686 × 14
  H0012400 H0012500 H0022300 H0022500 R0000100 R0009100 R0173600 R0214700
     <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1       -4       -4       -4       -4        1        1        5        3
2        0        1        4        3        2        8        5        3
# ℹ 12,684 more rows
# ℹ 6 more variables: R0214800 <dbl>, R0216400 <dbl>, R0217900 <dbl>,
#   R0402800 <dbl>, R7090700 <dbl>, T4120500 <dbl>
nlsy_full <- read_csv(
  "https://github.com/louisahsmith/data/raw/main/nlsy/nlsy.csv", skip = 1,
  col_names = c("glasses", "eyesight", "sleep_wkdy", "sleep_wknd", 
                "id", "nsibs", "samp", "race_eth", "sex", "region", 
                "income", "res_1980", "res_2002", "age_bir"),
                na = c("-1", "-2", "-3", "-4", "-5", "-998"))
print(nlsy_full, n = 2)
# A tibble: 12,686 × 14
  glasses eyesight sleep_wkdy sleep_wknd    id nsibs  samp race_eth   sex region
    <dbl>    <dbl>      <dbl>      <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl>  <dbl>
1      NA       NA         NA         NA     1     1     5        3     2      1
2       0        1          4          3     2     8     5        3     2      1
# ℹ 12,684 more rows
# ℹ 4 more variables: income <dbl>, res_1980 <dbl>, res_2002 <dbl>,
#   age_bir <dbl>

Corresponding write_() function

If you are sharing data with collaborators who don’t use R, or you want to look at it in Excel, you can save a dataframe as a .csv file:

nlsy_rds <- read_rds("https://github.com/louisahsmith/data/raw/main/nlsy/nlsy.rds")
write_csv(nlsy_rds, "nlsy.csv", na = "")

The data will be saved in your “working directory” (see the top of your console)

Note

We’ll talk about directories in a little bit!

Other functions come from the {haven} package

library(haven)
medical_dta <- read_dta("http://www.principlesofeconometrics.com/stata/medical.dta")
medical_sas <- read_sas("http://www.principlesofeconometrics.com/sas/medical.sas7bdat")
glimpse(medical_dta)
Rows: 1,000
Columns: 6
$ id     <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, …
$ year   <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, …
$ medexp <dbl> 9, 9, 9, 10, 11, 6, 7, 7, 7, 7, 4, 3, 5, 4, 4, 5, 3, 6, 6, 3, 4…
$ inc    <dbl> 49, 51, 55, 58, 61, 48, 48, 58, 59, 63, 46, 51, 55, 58, 63, 68,…
$ age    <dbl> 51, 52, 53, 54, 55, 62, 63, 64, 65, 66, 57, 58, 59, 60, 61, 48,…
$ insur  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, …
glimpse(medical_sas)
Rows: 1,000
Columns: 6
$ ID     <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, …
$ YEAR   <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, …
$ MEDEXP <dbl> 9, 9, 9, 10, 11, 6, 7, 7, 7, 7, 4, 3, 5, 4, 4, 5, 3, 6, 6, 3, 4…
$ INC    <dbl> 49, 51, 55, 58, 61, 48, 48, 58, 59, 63, 46, 51, 55, 58, 63, 68,…
$ AGE    <dbl> 51, 52, 53, 54, 55, 62, 63, 64, 65, 66, 57, 58, 59, 60, 61, 48,…
$ INSUR  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, …

Or {readxl}

library(readxl)
dat <- read_excel("excel_data.xlsx")

All these functions take arguments, but read_excel() takes a ton of arguments – which sheet, how many rows to read, whether there are column names, a specific range to read in, etc….

  • See help(read_excel) for details!

Exercise

Where are these files? File paths

list.files()
 [1] "_extensions"                      "_freeze"                         
 [3] "_publish.yml"                     "_quarto_internal_scss_error.scss"
 [5] "_quarto.yml"                      "_site"                           
 [7] "cheat_sheet.html"                 "cheat_sheet.qmd"                 
 [9] "data"                             "data.qmd"                        
[11] "decktape calls"                   "exercises"                       
[13] "homeworks"                        "ID543 2025.Rproj"                
[15] "images"                           "img"                             
[17] "index.html"                       "index.qmd"                       
[19] "intro-to-r-syllabus-20250526.pdf" "intro-to-r-syllabus.docx"        
[21] "intro-to-r-syllabus.pdf"          "LICENSE"                         
[23] "pages"                            "README.md"                       
[25] "site_libs"                        "slides"                          
[27] "www"                             
getwd()
[1] "/Users/louisa-teaching/Desktop/ID543-git"
file.path("data", "my_dataset.csv")
[1] "data/my_dataset.csv"
file.path("~", "Downloads", "my_dataset.csv")
[1] "~/Downloads/my_dataset.csv"
file.path("C:", "Users", "Downloads", "my_dataset.csv")
[1] "C:/Users/Downloads/my_dataset.csv"

The problem with setwd()

  • setwd() changes the working directory, leading to potential issues in collaboration and reproducibility

  • You and I don’t have the same file structure!

  • For example, my current working directory is

getwd()
[1] "/Users/louisa-teaching/Desktop/ID543-git"
  • It’s also really annoying to change your working directory when you move around files and folders, even if it’s just you using them

Do you think this code from 2015 still works?

R Projects

my-project/
├─ my-project.Rproj
├─ README
├─ data/
│   ├── raw/
│   └── processed/
├─ R/
├─ results/
│   ├── tables/
│   ├── figures/
│   └── output/
└─ docs/
  • An .Rproj file is mostly just a placeholder. It remembers various options, and makes it easy to open a new RStudio session that starts up in the correct working directory. You never need to edit it directly.

  • A README file can just be a text file that includes notes for yourself or future users.

  • I like to have a folder for raw data – which I never touch – and a folder(s) for datasets that I create along the way.

R Projects

Demo

Referring to files with the here package

source(here::here("R", "functions.R"))

dat <- read_csv(here::here("data", "raw", "data.csv"))

p <- ggplot(dat) + geom_point(aes(x, y))

ggsave(plot = p, 
       filename = here::here("results", "figures", "fig.pdf"))
  • The here package lets you refer to files without worrying too much about relative file paths.

  • Construct file paths with reference to the top directory holding your .Rproj file.

  • here::here("data", "raw", "data.csv") for me, here, becomes "/Users/l.smith/Documents/Teaching/Harvard/ID543 2024/data/raw/data.csv"

  • But if I send you my code to run, it will become whatever file path you need it to be, as long as you’re running it within the R Project.

Referring to the here package

here::here()

is equivalent to

library(here)
here()

I just prefer to write out the package name whenever I need it, but you can load the package for your entire session if you want.

Note

Note that you can refer to any function without loading the whole package this way, e.g. haven::read_dta()

Exercise

Missing values

  • R uses NA for missing values
  • Unlike some other statistical software, it will return NA to any logical statement
  • This makes it somewhat harder to deal with but also harder to make mistakes
3 < NA
[1] NA
mean(c(1, 2, NA))
[1] NA
mean(c(1, 2, NA), na.rm = TRUE)
[1] 1.5

Special NA functions

Certain functions deal with missing values explicitly

vals <- c(1, 2, NA)
is.na(vals)
[1] FALSE FALSE  TRUE
anyNA(vals)
[1] TRUE
na.omit(vals)
[1] 1 2

Specific missingness

You know some value is implausible, whether for everyone or for a specific observation

nlsy <- nlsy |>
  mutate(sleep_wknd = if_else(sleep_wknd > 24, NA, sleep_wknd),
        # OR
         sleep_wknd = case_when(
           sleep_wknd > 24 ~ NA,
           .default = sleep_wknd
         ),
         income = if_else(id == 283, NA, income),
         nsibs = na_if(nsibs, 99))

na_if(x, y) will replace values in x that are equal to y with NA

Read in NA’s directly

In NLSY, -1 = Refused, -2 = Don’t know, -3 = Invalid missing, -4 = Valid missing, -5 = Non-interview

Other files might have . for missing, or 999.

nlsy_cols <- c("glasses", "eyesight", "sleep_wkdy", "sleep_wknd", 
               "id", "nsibs", "samp", "race_eth", "sex", "region", 
               "income", "res_1980", "res_2002", "age_bir")
nlsy <- read_csv("https://github.com/louisahsmith/data/raw/main/nlsy/nlsy.csv",
    na = c("-1", "-2", "-3", "-4", "-5", "-998"),
                 skip = 1, col_names = nlsy_cols)
  • You have to write the values as strings, even if they’re numbers

Reasons for missingness

Caveat: This previous way, you lose the info about the reason for missingness. If that’s important, read in the data first, create a variable for missingness reason (e.g., use fct_recode()), then changes the values to NA.

nlsy <- read_csv("https://github.com/louisahsmith/data/raw/main/nlsy/nlsy.csv",
                 skip = 1, col_names = nlsy_cols) |> 
  mutate(age_bir_missing = ifelse(age_bir > 0, NA, age_bir),
         age_bir_missing = fct_recode(
           factor(age_bir_missing), "Refused" = "-1", 
           "Don't know" = "-2", "Invalid missing" = "-3",
           "Valid missing" = "-4", "Non-interview" = "-5",
           "Other missing" = "-998"))
summary(nlsy$age_bir_missing)
  Other missing   Non-interview Invalid missing            NA's 
           1343            5385              15            5943 

Complete cases

Sometimes you may just want to get rid of all the rows with missing values.

nrow(nlsy)
[1] 12686
nlsy_cc <- nlsy |> filter(complete.cases(nlsy))
nrow(nlsy_cc)
[1] 6743
nlsy2 <- nlsy |> na.omit() # same

Caution

Don’t do this without good reason! It will exlude rows with any missing values, even in variables you’re not using.

Exercise

Joins

There are multiple functions in the tidyverse (specifically, the {dplyr} package) for joining/merging data

Mutating joins merge two datasets based on matching variable(s), adding together the new columns from the joined dataframe

left_join(x, y, by = join_by(xcol == ycol))

Note

We will also refer to the x dataframe as the left-hand side (LHS) and the y dataframe as the right-hand side (RHS)

Why join?

I may have people and dates in one dataset, and want to merge with a separate dataset on prescriptions:

id start_date end_date
1 01/01/2021 01/01/2024
2 03/01/2023 01/01/2025
3 09/01/2020 10/01/2021
id Rx_date Rx_name
1 02/21/2020 Tylenol
1 04/06/2022 Aspirin
2 06/09/2023 Claritin

Merging with kids

The NLSY also included the kids of the moms in the NLSY79 survey that we’re using.

nlsy_kids
# A tibble: 11,551 × 6
   id_kid id_mom sex_kid dob_kid agebir_mom bwt_kid
    <dbl>  <dbl>   <dbl>   <dbl>      <dbl>   <dbl>
 1    201      2       2    1993         34     139
 2    202      2       2    1994         35      NA
 3    301      3       2    1981         19     162
 4    302      3       2    1983         22     144
 5    303      3       2    1986         24     112
 6    401      4       1    1980         18     107
 7    403      4       2    1997         34      NA
 8    801      8       2    1976         17     119
 9    802      8       1    1979         20     107
10    803      8       2    1982         24     146
# ℹ 11,541 more rows

Note

Birthweight is in ounces

Left join

left_join(nlsy_sleep, nlsy_kids)
Error in `left_join()`:
! `by` must be supplied when `x` and `y` have no common variables.
ℹ Use `cross_join()` to perform a cross-join.

It will automatically look for matching columns (can be dangerous!) but if none, need to specify:

left_join(nlsy_sleep, nlsy_kids, 
          by = join_by(id == id_mom))
# A tibble: 2,284 × 10
      id sleep_wkdy sleep_wknd age_bir   sex id_kid sex_kid dob_kid agebir_mom
   <dbl>      <dbl>      <dbl>   <dbl> <dbl>  <dbl>   <dbl>   <dbl>      <dbl>
 1     3          5          7      19     2    301       2    1981         19
 2     3          5          7      19     2    302       2    1983         22
 3     3          5          7      19     2    303       2    1986         24
 4     6          6          7      30     1     NA      NA      NA         NA
 5     8          7          9      17     2    801       2    1976         17
 6     8          7          9      17     2    802       1    1979         20
 7     8          7          9      17     2    803       2    1982         24
 8    16          6          7      31     2   1601       1    1990         31
 9    16          6          7      31     2   1602       1    1993         34
10    16          6          7      31     2   1603       2    1996         37
# ℹ 2,274 more rows
# ℹ 1 more variable: bwt_kid <dbl>

by = join_by() syntax

The most flexible way to specify the join is with by = join_by()

  • This can accommodate not only variables you want to match exactly, but also variables that are related in other ways (e.g., within or overlapping intervals)
  • You’ll also see by = c("id" = "id_mom") syntax, which is more limited but works for simple cases

Left join

LHS rows are duplicated if we have multiple matches, but we lose any rows in the RHS dataset that don’t have a match

n_distinct(nlsy_kids$id_kid)
[1] 11551
nlsy_left <- left_join(nlsy_sleep, nlsy_kids,
                       by = join_by(id == id_mom))

n_distinct(nlsy_left$id_kid)
[1] 1784

In this case, the moms of some of the kids aren’t in the nlsy_sleep dataset, so kids without moms are lost

Right join

right_join(nlsy_sleep, nlsy_kids, by = join_by(id == id_mom))
# A tibble: 11,551 × 10
      id sleep_wkdy sleep_wknd age_bir   sex id_kid sex_kid dob_kid agebir_mom
   <dbl>      <dbl>      <dbl>   <dbl> <dbl>  <dbl>   <dbl>   <dbl>      <dbl>
 1     3          5          7      19     2    301       2    1981         19
 2     3          5          7      19     2    302       2    1983         22
 3     3          5          7      19     2    303       2    1986         24
 4     8          7          9      17     2    801       2    1976         17
 5     8          7          9      17     2    802       1    1979         20
 6     8          7          9      17     2    803       2    1982         24
 7    16          6          7      31     2   1601       1    1990         31
 8    16          6          7      31     2   1602       1    1993         34
 9    16          6          7      31     2   1603       2    1996         37
10    20          7          8      30     2   2001       2    1990         30
# ℹ 11,541 more rows
# ℹ 1 more variable: bwt_kid <dbl>
  • Now we don’t have the dads, because there are no matching ids in the RHS dataset

  • But we do keep all the kids, even those without moms in the LHS

Full join: we want everything!

full_join(nlsy_sleep, nlsy_kids, by = join_by(id == id_mom))
# A tibble: 12,052 × 10
      id sleep_wkdy sleep_wknd age_bir   sex id_kid sex_kid dob_kid agebir_mom
   <dbl>      <dbl>      <dbl>   <dbl> <dbl>  <dbl>   <dbl>   <dbl>      <dbl>
 1     3          5          7      19     2    301       2    1981         19
 2     3          5          7      19     2    302       2    1983         22
 3     3          5          7      19     2    303       2    1986         24
 4     6          6          7      30     1     NA      NA      NA         NA
 5     8          7          9      17     2    801       2    1976         17
 6     8          7          9      17     2    802       1    1979         20
 7     8          7          9      17     2    803       2    1982         24
 8    16          6          7      31     2   1601       1    1990         31
 9    16          6          7      31     2   1602       1    1993         34
10    16          6          7      31     2   1603       2    1996         37
# ℹ 12,042 more rows
# ℹ 1 more variable: bwt_kid <dbl>

This dataset is larger than either of the initial datasets alone: it has the dads without kids and the kids without moms

Inner join: we only want matches

inner_join(nlsy_sleep, nlsy_kids, by = join_by(id == id_mom))
# A tibble: 1,783 × 10
      id sleep_wkdy sleep_wknd age_bir   sex id_kid sex_kid dob_kid agebir_mom
   <dbl>      <dbl>      <dbl>   <dbl> <dbl>  <dbl>   <dbl>   <dbl>      <dbl>
 1     3          5          7      19     2    301       2    1981         19
 2     3          5          7      19     2    302       2    1983         22
 3     3          5          7      19     2    303       2    1986         24
 4     8          7          9      17     2    801       2    1976         17
 5     8          7          9      17     2    802       1    1979         20
 6     8          7          9      17     2    803       2    1982         24
 7    16          6          7      31     2   1601       1    1990         31
 8    16          6          7      31     2   1602       1    1993         34
 9    16          6          7      31     2   1603       2    1996         37
10    20          7          8      30     2   2001       2    1990         30
# ℹ 1,773 more rows
# ℹ 1 more variable: bwt_kid <dbl>
  • This dataset has only the moms with kids (no dads) and the kids with moms
  • It still has multiple rows per mom – one for each kid

Join by multiple variables

  • I only want the kid that was the mom’s first
  • I’m going to match on the age at first birth on the RHS
first_births <- inner_join(nlsy_sleep, nlsy_kids, 
                           by = join_by(id == id_mom,
                                        age_bir == agebir_mom))

first_births
# A tibble: 708 × 9
      id sleep_wkdy sleep_wknd age_bir   sex id_kid sex_kid dob_kid bwt_kid
   <dbl>      <dbl>      <dbl>   <dbl> <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
 1     3          5          7      19     2    301       2    1981     162
 2     8          7          9      17     2    801       2    1976     119
 3    16          6          7      31     2   1601       1    1990     109
 4    20          7          8      30     2   2001       2    1990     129
 5    27          8          8      27     2   2701       2    1988     117
 6    49          8          8      24     2   4901       1    1982     139
 7    57          7          8      21     2   5701       1    1979     148
 8    86          8          8      17     2   8601       2    1977      97
 9    96          7          7      19     2   9601       2    1980     124
10    97          7          8      29     2   9701       1    1987      48
# ℹ 698 more rows

Exercise

Today’s summary

  • We can use summarize() to get summary statistics of our data
  • We can use group_by() to group our data and then get summary statistics within those groups
  • Missing values in R are NA
  • R projects are a good way to keep your files organized
  • We can use the here package to refer to files in a project
  • We can use left_join(), right_join(), full_join(), and inner_join() to merge datasets

Today’s functions

  • summarize(): calculate summary statistics (with .by = argument for stratified statistics)
  • group_by(): group data
  • here::here(): refer to files in a project
  • read_rds(), read_csv(), read_dta(), read_sas(), read_excel(): read in data
  • complete.cases(), na.omit(): remove missing values
  • is.na(), anyNA(): check for missing values
  • na_if(): replace values with NA
  • left_join(), right_join(), full_join(), inner_join(): merge datasets