Introduction to R
summarize()
and group_by()
to get summary statisticshere
package to refer to filesleft_join()
, right_join()
, full_join()
, and inner_join()
We can get certain summary statistics about our data with summary()
, which we can use either on an entire dataframe or on a single variable
Min. 1st Qu. Median Mean 3rd Qu. Max.
13.00 19.00 22.00 23.45 27.00 52.00
id sleep_wkdy sleep_wknd age_bir
Min. : 3 Min. : 0.000 Min. : 0.000 Min. :13.00
1st Qu.: 2317 1st Qu.: 6.000 1st Qu.: 6.000 1st Qu.:19.00
Median : 4744 Median : 7.000 Median : 7.000 Median :22.00
Mean : 5229 Mean : 6.643 Mean : 7.267 Mean :23.45
3rd Qu.: 7937 3rd Qu.: 8.000 3rd Qu.: 8.000 3rd Qu.:27.00
Max. :12667 Max. :13.000 Max. :14.000 Max. :52.00
sex
Min. :1.000
1st Qu.:1.000
Median :2.000
Mean :1.584
3rd Qu.:2.000
Max. :2.000
We can also apply certain functions to a variable(s) to get a single statistic: mean()
, median()
, var()
, sd()
, cov()
, cor()
, min()
, max()
, quantile()
, etc.
summarize()
But what if we want a lot of summary statistics – just not those that come with the summary()
function?
We can use summarize()
summarize()
specificsImportant to note:
Because we can pipe, we can also look at statistics of variables that we make using mutate()
, in a dataset we’ve subsetted with filter()
.
nlsy |>
mutate(age_bir_stand = (age_bir - mean(age_bir)) / sd(age_bir)) |>
filter(sex == 1) |>
summarize(mean_men = mean(age_bir_stand))
# A tibble: 1 × 1
mean_men
<dbl>
1 0.283
Note
Note that we’re standardizing the data before filtering. Or else the mean would be 0!
group_by()
We can tell it’s “grouped” and how many groups there are by printing out the data.
The data itself won’t look (much) different, but we’ll be able to perform grouped functions on it.
# A tibble: 1,205 × 15
# Groups: region [4]
id glasses eyesight sleep_wkdy sleep_wknd nsibs race_eth sex region
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 0 1 5 7 3 3 2 1
2 6 1 2 6 7 1 3 1 1
3 8 0 2 7 9 7 3 2 1
4 16 1 3 6 7 3 3 2 1
5 18 0 3 10 10 2 3 1 3
6 20 1 2 7 8 2 3 2 1
7 27 0 1 8 8 1 3 2 1
8 49 1 1 8 8 6 3 2 1
9 57 1 2 7 8 1 3 2 1
10 67 0 1 8 8 1 3 1 1
# ℹ 1,195 more rows
# ℹ 6 more variables: income <dbl>, age_bir <dbl>, eyesight_cat <fct>,
# glasses_cat <fct>, race_eth_cat <fct>, sex_cat <fct>
group_by()
and summarize()
This function is especially important when calculating summary statistics, which we often want to be stratified.
group_by() |> summarize()
Like the other functions we’ve seen, we can use pipes:
Sometimes we just want to know how many observations are in a group. We already saw how to do that with count()
, but we can also do it with group_by() |> summarize()
:
We have been reading in data as an .rds
file:
We could also read it in as a .csv
file:
# A tibble: 1,205 × 5
id eyesight_cat glasses_cat race_eth_cat sex_cat
<dbl> <fct> <fct> <fct> <fct>
1 3 Excellent Doesn't wear glasses Non-Black, Non-Hispanic Female
2 6 Very Good Wears glasses/contacts Non-Black, Non-Hispanic Male
3 8 Very Good Doesn't wear glasses Non-Black, Non-Hispanic Female
4 16 Good Wears glasses/contacts Non-Black, Non-Hispanic Female
5 18 Good Doesn't wear glasses Non-Black, Non-Hispanic Male
6 20 Very Good Wears glasses/contacts Non-Black, Non-Hispanic Female
7 27 Excellent Doesn't wear glasses Non-Black, Non-Hispanic Female
8 49 Excellent Wears glasses/contacts Non-Black, Non-Hispanic Female
9 57 Very Good Wears glasses/contacts Non-Black, Non-Hispanic Female
10 67 Excellent Doesn't wear glasses Non-Black, Non-Hispanic Male
# ℹ 1,195 more rows
# A tibble: 1,205 × 6
id eyesight_cat glasses_cat race_eth_cat sex_cat slp_cat_wkdy
<dbl> <chr> <chr> <chr> <chr> <chr>
1 3 Excellent Doesn't wear glasses Non-Black, No… Female some
2 6 Very Good Wears glasses/contacts Non-Black, No… Male some
3 8 Very Good Doesn't wear glasses Non-Black, No… Female ideal
4 16 Good Wears glasses/contacts Non-Black, No… Female some
5 18 Good Doesn't wear glasses Non-Black, No… Male lots
6 20 Very Good Wears glasses/contacts Non-Black, No… Female ideal
7 27 Excellent Doesn't wear glasses Non-Black, No… Female ideal
8 49 Excellent Wears glasses/contacts Non-Black, No… Female ideal
9 57 Very Good Wears glasses/contacts Non-Black, No… Female ideal
10 67 Excellent Doesn't wear glasses Non-Black, No… Male ideal
# ℹ 1,195 more rows
.rds
is an R-specific file for a single objectIt will be the exact same object when you read it back in.
You can save any object, not just a dataframe:
What is y
going to print?
.csv
files are much more general but don’t maintain things like factors.csv
files might need a little more specification to read in# A tibble: 12,686 × 14
H0012400 H0012500 H0022300 H0022500 R0000100 R0009100 R0173600 R0214700
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -4 -4 -4 -4 1 1 5 3
2 0 1 4 3 2 8 5 3
# ℹ 12,684 more rows
# ℹ 6 more variables: R0214800 <dbl>, R0216400 <dbl>, R0217900 <dbl>,
# R0402800 <dbl>, R7090700 <dbl>, T4120500 <dbl>
nlsy_full <- read_csv(
"https://github.com/louisahsmith/data/raw/main/nlsy/nlsy.csv", skip = 1,
col_names = c("glasses", "eyesight", "sleep_wkdy", "sleep_wknd",
"id", "nsibs", "samp", "race_eth", "sex", "region",
"income", "res_1980", "res_2002", "age_bir"),
na = c("-1", "-2", "-3", "-4", "-5", "-998"))
print(nlsy_full, n = 2)
# A tibble: 12,686 × 14
glasses eyesight sleep_wkdy sleep_wknd id nsibs samp race_eth sex region
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA NA NA NA 1 1 5 3 2 1
2 0 1 4 3 2 8 5 3 2 1
# ℹ 12,684 more rows
# ℹ 4 more variables: income <dbl>, res_1980 <dbl>, res_2002 <dbl>,
# age_bir <dbl>
write_()
functionIf you are sharing data with collaborators who don’t use R, or you want to look at it in Excel, you can save a dataframe as a .csv
file:
The data will be saved in your “working directory” (see the top of your console)
Note
We’ll talk about directories in a little bit!
{haven}
packageRows: 1,000
Columns: 6
$ id <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, …
$ year <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, …
$ medexp <dbl> 9, 9, 9, 10, 11, 6, 7, 7, 7, 7, 4, 3, 5, 4, 4, 5, 3, 6, 6, 3, 4…
$ inc <dbl> 49, 51, 55, 58, 61, 48, 48, 58, 59, 63, 46, 51, 55, 58, 63, 68,…
$ age <dbl> 51, 52, 53, 54, 55, 62, 63, 64, 65, 66, 57, 58, 59, 60, 61, 48,…
$ insur <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, …
Rows: 1,000
Columns: 6
$ ID <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, …
$ YEAR <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, …
$ MEDEXP <dbl> 9, 9, 9, 10, 11, 6, 7, 7, 7, 7, 4, 3, 5, 4, 4, 5, 3, 6, 6, 3, 4…
$ INC <dbl> 49, 51, 55, 58, 61, 48, 48, 58, 59, 63, 46, 51, 55, 58, 63, 68,…
$ AGE <dbl> 51, 52, 53, 54, 55, 62, 63, 64, 65, 66, 57, 58, 59, 60, 61, 48,…
$ INSUR <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, …
{readxl}
All these functions take arguments, but read_excel()
takes a ton of arguments – which sheet, how many rows to read, whether there are column names, a specific range to read in, etc….
help(read_excel)
for details! [1] "_extensions" "_freeze" "_publish.yml" "_quarto.yml"
[5] "_site" "cheat_sheet.html" "cheat_sheet.qmd" "data"
[9] "data.html" "data.qmd" "decktape calls" "exercises"
[13] "homeworks" "ID543 2024.Rproj" "img" "index.html"
[17] "index.qmd" "pages" "site_libs" "slides"
[21] "www"
[1] "/Users/l.smith/Documents/Teaching/Harvard/ID543 2024"
[1] "data/my_dataset.csv"
[1] "~/Downloads/my_dataset.csv"
[1] "C:/Users/Downloads/my_dataset.csv"
setwd()
setwd()
changes the working directory, leading to potential issues in collaboration and reproducibility
You and I don’t have the same file structure!
For example, my current working directory is
my-project/
├─ my-project.Rproj
├─ README
├─ data/
│ ├── raw/
│ └── processed/
├─ R/
├─ results/
│ ├── tables/
│ ├── figures/
│ └── output/
└─ docs/
An .Rproj
file is mostly just a placeholder. It remembers various options, and makes it easy to open a new RStudio session that starts up in the correct working directory. You never need to edit it directly.
A README file can just be a text file that includes notes for yourself or future users.
I like to have a folder for raw data – which I never touch – and a folder(s) for datasets that I create along the way.
Demo
here
packageThe here
package lets you refer to files without worrying too much about relative file paths.
Construct file paths with reference to the top directory holding your .Rproj
file.
here::here("data", "raw", "data.csv")
for me, here, becomes "/Users/l.smith/Documents/Teaching/Harvard/ID543 2024/data/raw/data.csv"
But if I send you my code to run, it will become whatever file path you need it to be, as long as you’re running it within the R Project.
here
packageis equivalent to
I just prefer to write out the package name whenever I need it, but you can load the package for your entire session if you want.
Note
Note that you can refer to any function without loading the whole package this way, e.g. haven::read_dta()
NA
for missing valuesNA
to any logical statementNA
functionsCertain functions deal with missing values explicitly
[1] 1 2
You know some value is implausible, whether for everyone or for a specific observation
na_if(x, y)
will replace values in x
that are equal to y
with NA
In NLSY, -1 = Refused, -2 = Don’t know, -3 = Invalid missing, -4 = Valid missing, -5 = Non-interview
Other files might have .
for missing, or 999
.
nlsy_cols <- c("glasses", "eyesight", "sleep_wkdy", "sleep_wknd",
"id", "nsibs", "samp", "race_eth", "sex", "region",
"income", "res_1980", "res_2002", "age_bir")
nlsy <- read_csv("https://github.com/louisahsmith/data/raw/main/nlsy/nlsy.csv",
na = c("-1", "-2", "-3", "-4", "-5", "-998"),
skip = 1, col_names = nlsy_cols)
Caveat: This previous way, you lose the info about the reason for missingness. If that’s important, read in the data first, create a variable for missingness reason (e.g., use fct_recode()
), then changes the values to NA
.
nlsy <- read_csv("https://github.com/louisahsmith/data/raw/main/nlsy/nlsy.csv",
skip = 1, col_names = nlsy_cols) |>
mutate(age_bir_missing = ifelse(age_bir > 0, NA, age_bir),
age_bir_missing = fct_recode(
factor(age_bir_missing), "Refused" = "-1",
"Don't know" = "-2", "Invalid missing" = "-3",
"Valid missing" = "-4", "Non-interview" = "-5",
"Other missing" = "-998"))
summary(nlsy$age_bir_missing)
Other missing Non-interview Invalid missing NA's
1343 5385 15 5943
Sometimes you may just want to get rid of all the rows with missing values.
[1] 12686
[1] 6743
Caution
Don’t do this without good reason! It will exlude rows with any missing values, even in variables you’re not using.
There are multiple functions in the tidyverse
(specifically, the {dplyr}
package) for joining/merging data
Mutating joins merge two datasets based on matching variable(s), adding together the new columns from the joined dataframe
Note
We will also refer to the x
dataframe as the left-hand side (LHS) and the y
dataframe as the right-hand side (RHS)
The NLSY also included the kids of the moms in the NLSY79 survey that we’re using.
# A tibble: 11,551 × 6
id_kid id_mom sex_kid dob_kid agebir_mom bwt_kid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 201 2 2 1993 34 139
2 202 2 2 1994 35 NA
3 301 3 2 1981 19 162
4 302 3 2 1983 22 144
5 303 3 2 1986 24 112
6 401 4 1 1980 18 107
7 403 4 2 1997 34 NA
8 801 8 2 1976 17 119
9 802 8 1 1979 20 107
10 803 8 2 1982 24 146
# ℹ 11,541 more rows
Note
Birthweight is in ounces
Error in `left_join()`:
! `by` must be supplied when `x` and `y` have no common variables.
ℹ Use `cross_join()` to perform a cross-join.
It will automatically look for matching columns (can be dangerous!) but if none, need to specify:
# A tibble: 2,284 × 10
id sleep_wkdy sleep_wknd age_bir sex id_kid sex_kid dob_kid agebir_mom
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 5 7 19 2 301 2 1981 19
2 3 5 7 19 2 302 2 1983 22
3 3 5 7 19 2 303 2 1986 24
4 6 6 7 30 1 NA NA NA NA
5 8 7 9 17 2 801 2 1976 17
6 8 7 9 17 2 802 1 1979 20
7 8 7 9 17 2 803 2 1982 24
8 16 6 7 31 2 1601 1 1990 31
9 16 6 7 31 2 1602 1 1993 34
10 16 6 7 31 2 1603 2 1996 37
# ℹ 2,274 more rows
# ℹ 1 more variable: bwt_kid <dbl>
LHS rows are duplicated if we have multiple matches, but we lose any rows in the RHS dataset that don’t have a match
[1] 11551
nlsy_left <- left_join(nlsy_sleep, nlsy_kids,
by = join_by(id == id_mom))
n_distinct(nlsy_left$id_kid)
[1] 1784
In this case, the moms of some of the kids aren’t in the nlsy_sleep
dataset, so kids without moms are lost
# A tibble: 11,551 × 10
id sleep_wkdy sleep_wknd age_bir sex id_kid sex_kid dob_kid agebir_mom
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 5 7 19 2 301 2 1981 19
2 3 5 7 19 2 302 2 1983 22
3 3 5 7 19 2 303 2 1986 24
4 8 7 9 17 2 801 2 1976 17
5 8 7 9 17 2 802 1 1979 20
6 8 7 9 17 2 803 2 1982 24
7 16 6 7 31 2 1601 1 1990 31
8 16 6 7 31 2 1602 1 1993 34
9 16 6 7 31 2 1603 2 1996 37
10 20 7 8 30 2 2001 2 1990 30
# ℹ 11,541 more rows
# ℹ 1 more variable: bwt_kid <dbl>
Now we don’t have the dads, because there are no matching ids in the RHS dataset
But we do keep all the kids, even those without moms in the LHS
# A tibble: 12,052 × 10
id sleep_wkdy sleep_wknd age_bir sex id_kid sex_kid dob_kid agebir_mom
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 5 7 19 2 301 2 1981 19
2 3 5 7 19 2 302 2 1983 22
3 3 5 7 19 2 303 2 1986 24
4 6 6 7 30 1 NA NA NA NA
5 8 7 9 17 2 801 2 1976 17
6 8 7 9 17 2 802 1 1979 20
7 8 7 9 17 2 803 2 1982 24
8 16 6 7 31 2 1601 1 1990 31
9 16 6 7 31 2 1602 1 1993 34
10 16 6 7 31 2 1603 2 1996 37
# ℹ 12,042 more rows
# ℹ 1 more variable: bwt_kid <dbl>
This dataset is larger than either of the initial datasets alone: it has the dads without kids and the kids without moms
# A tibble: 1,783 × 10
id sleep_wkdy sleep_wknd age_bir sex id_kid sex_kid dob_kid agebir_mom
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 5 7 19 2 301 2 1981 19
2 3 5 7 19 2 302 2 1983 22
3 3 5 7 19 2 303 2 1986 24
4 8 7 9 17 2 801 2 1976 17
5 8 7 9 17 2 802 1 1979 20
6 8 7 9 17 2 803 2 1982 24
7 16 6 7 31 2 1601 1 1990 31
8 16 6 7 31 2 1602 1 1993 34
9 16 6 7 31 2 1603 2 1996 37
10 20 7 8 30 2 2001 2 1990 30
# ℹ 1,773 more rows
# ℹ 1 more variable: bwt_kid <dbl>
first_births <- inner_join(nlsy_sleep, nlsy_kids,
by = join_by(id == id_mom,
age_bir == agebir_mom))
first_births
# A tibble: 708 × 9
id sleep_wkdy sleep_wknd age_bir sex id_kid sex_kid dob_kid bwt_kid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 5 7 19 2 301 2 1981 162
2 8 7 9 17 2 801 2 1976 119
3 16 6 7 31 2 1601 1 1990 109
4 20 7 8 30 2 2001 2 1990 129
5 27 8 8 27 2 2701 2 1988 117
6 49 8 8 24 2 4901 1 1982 139
7 57 7 8 21 2 5701 1 1979 148
8 86 8 8 17 2 8601 2 1977 97
9 96 7 7 19 2 9601 2 1980 124
10 97 7 8 29 2 9701 1 1987 48
# ℹ 698 more rows
summarize()
to get summary statistics of our datagroup_by()
to group our data and then get summary statistics within those groupsNA
here
package to refer to files in a projectleft_join()
, right_join()
, full_join()
, and inner_join()
to merge datasetssummarize()
: calculate summary statisticsgroup_by()
: group datahere::here()
: refer to files in a projectread_rds()
, read_csv()
, read_dta()
, read_sas()
, read_excel()
: read in datacomplete.cases()
, na.omit()
: remove missing valuesis.na()
, anyNA()
: check for missing valuesna_if()
: replace values with NA
left_join()
, right_join()
, full_join()
, inner_join()
: merge datasets