Introduction to R
summarize()
to get summary statisticshere
package to refer to filesleft_join()
, right_join()
, full_join()
, and inner_join()
We can get certain summary statistics about our data with summary()
, which we can use either on an entire dataframe or on a single variable
Min. 1st Qu. Median Mean 3rd Qu. Max.
13.00 19.00 22.00 23.45 27.00 52.00
id sleep_wkdy sleep_wknd age_bir
Min. : 3 Min. : 0.000 Min. : 0.000 Min. :13.00
1st Qu.: 2317 1st Qu.: 6.000 1st Qu.: 6.000 1st Qu.:19.00
Median : 4744 Median : 7.000 Median : 7.000 Median :22.00
Mean : 5229 Mean : 6.643 Mean : 7.267 Mean :23.45
3rd Qu.: 7937 3rd Qu.: 8.000 3rd Qu.: 8.000 3rd Qu.:27.00
Max. :12667 Max. :13.000 Max. :14.000 Max. :52.00
sex
Min. :1.000
1st Qu.:1.000
Median :2.000
Mean :1.584
3rd Qu.:2.000
Max. :2.000
We can also apply certain functions to a variable(s) to get a single statistic: mean()
, median()
, var()
, sd()
, cov()
, cor()
, min()
, max()
, quantile()
, etc.
summarize()
But what if we want a lot of summary statistics – just not those that come with the summary()
function?
We can use summarize()
summarize()
specificsImportant to note:
Because we can pipe, we can also look at statistics of variables that we make using mutate()
, in a dataset we’ve subsetted with filter()
.
nlsy |>
mutate(age_bir_stand = (age_bir - mean(age_bir)) / sd(age_bir)) |>
filter(sex == 1) |>
summarize(mean_men = mean(age_bir_stand))
# A tibble: 1 × 1
mean_men
<dbl>
1 0.283
Note
Note that we’re standardizing the data before filtering. Or else the mean would be 0!
.by =
argument in summarize()
Instead of creating separate summaries, we can get statistics for all groups at once:
# A tibble: 2 × 2
sex_cat mean_age_bir
<fct> <dbl>
1 Female 22.2
2 Male 25.1
This is cleaner and more efficient than filtering multiple times!
.by =
?You’ll see some optional arguments in the tidyverse start with a dot, like .by =
.
summarize()
, by =
(with no dot) looks like it will be another variableSometimes we just want to know how many observations are in a group. We already saw how to do that with count()
, but we can also do it with summarize()
and the n()
function, which counts the number of rows:
You can group by multiple variables by putting them in a vector:
# A tibble: 6 × 3
sex_cat race_eth_cat mean_income
<fct> <fct> <dbl>
1 Female Non-Black, Non-Hispanic 17404.
2 Male Non-Black, Non-Hispanic 20839.
3 Female Black 9457.
4 Female Hispanic 10884.
5 Male Black 11755.
6 Male Hispanic 10651.
group_by()
approachYou will also see the traditional group_by()
approach:
group_by()
We can tell it’s “grouped” and how many groups there are by printing out the data.
The data itself won’t look (much) different, but we’ll be able to perform grouped functions on it.
# A tibble: 1,205 × 15
# Groups: region [4]
id glasses eyesight sleep_wkdy sleep_wknd nsibs race_eth sex region
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 0 1 5 7 3 3 2 1
2 6 1 2 6 7 1 3 1 1
3 8 0 2 7 9 7 3 2 1
4 16 1 3 6 7 3 3 2 1
5 18 0 3 10 10 2 3 1 3
6 20 1 2 7 8 2 3 2 1
7 27 0 1 8 8 1 3 2 1
8 49 1 1 8 8 6 3 2 1
9 57 1 2 7 8 1 3 2 1
10 67 0 1 8 8 1 3 1 1
# ℹ 1,195 more rows
# ℹ 6 more variables: income <dbl>, age_bir <dbl>, eyesight_cat <fct>,
# glasses_cat <fct>, race_eth_cat <fct>, sex_cat <fct>
group_by() |> summarize()
Instead of using the .by =
argument, we can use group_by()
to create a grouped dataframe, then use summarize()
to get the summary statistics for each group:
by =
vs. group_by()
The .by =
approach is generally simpler and doesn’t leave your data grouped (which can cause issues later)
Use group_by()
when you need to perform multiple grouped operations in sequence or just create a new variable with mutate()
We have been reading in data as an .rds
file:
We could also read it in as a .csv
file:
.rds
is an R-specific file for a single objectIt will be the exact same object when you read it back in.
You can save any object, not just a dataframe:
What is y
going to print?
.csv
files are much more general but don’t maintain things like factors.csv
files might need a little more specification to read in# A tibble: 12,686 × 14
H0012400 H0012500 H0022300 H0022500 R0000100 R0009100 R0173600 R0214700
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -4 -4 -4 -4 1 1 5 3
2 0 1 4 3 2 8 5 3
# ℹ 12,684 more rows
# ℹ 6 more variables: R0214800 <dbl>, R0216400 <dbl>, R0217900 <dbl>,
# R0402800 <dbl>, R7090700 <dbl>, T4120500 <dbl>
nlsy_full <- read_csv(
"https://github.com/louisahsmith/data/raw/main/nlsy/nlsy.csv", skip = 1,
col_names = c("glasses", "eyesight", "sleep_wkdy", "sleep_wknd",
"id", "nsibs", "samp", "race_eth", "sex", "region",
"income", "res_1980", "res_2002", "age_bir"),
na = c("-1", "-2", "-3", "-4", "-5", "-998"))
print(nlsy_full, n = 2)
# A tibble: 12,686 × 14
glasses eyesight sleep_wkdy sleep_wknd id nsibs samp race_eth sex region
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA NA NA NA 1 1 5 3 2 1
2 0 1 4 3 2 8 5 3 2 1
# ℹ 12,684 more rows
# ℹ 4 more variables: income <dbl>, res_1980 <dbl>, res_2002 <dbl>,
# age_bir <dbl>
write_()
functionIf you are sharing data with collaborators who don’t use R, or you want to look at it in Excel, you can save a dataframe as a .csv
file:
The data will be saved in your “working directory” (see the top of your console)
Note
We’ll talk about directories in a little bit!
{haven}
packageRows: 1,000
Columns: 6
$ id <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, …
$ year <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, …
$ medexp <dbl> 9, 9, 9, 10, 11, 6, 7, 7, 7, 7, 4, 3, 5, 4, 4, 5, 3, 6, 6, 3, 4…
$ inc <dbl> 49, 51, 55, 58, 61, 48, 48, 58, 59, 63, 46, 51, 55, 58, 63, 68,…
$ age <dbl> 51, 52, 53, 54, 55, 62, 63, 64, 65, 66, 57, 58, 59, 60, 61, 48,…
$ insur <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, …
Rows: 1,000
Columns: 6
$ ID <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, …
$ YEAR <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, …
$ MEDEXP <dbl> 9, 9, 9, 10, 11, 6, 7, 7, 7, 7, 4, 3, 5, 4, 4, 5, 3, 6, 6, 3, 4…
$ INC <dbl> 49, 51, 55, 58, 61, 48, 48, 58, 59, 63, 46, 51, 55, 58, 63, 68,…
$ AGE <dbl> 51, 52, 53, 54, 55, 62, 63, 64, 65, 66, 57, 58, 59, 60, 61, 48,…
$ INSUR <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, …
{readxl}
All these functions take arguments, but read_excel()
takes a ton of arguments – which sheet, how many rows to read, whether there are column names, a specific range to read in, etc….
help(read_excel)
for details! [1] "_extensions" "_freeze"
[3] "_publish.yml" "_quarto_internal_scss_error.scss"
[5] "_quarto.yml" "_site"
[7] "cheat_sheet.html" "cheat_sheet.qmd"
[9] "data" "data.qmd"
[11] "decktape calls" "exercises"
[13] "homeworks" "ID543 2025.Rproj"
[15] "images" "img"
[17] "index.html" "index.qmd"
[19] "intro-to-r-syllabus-20250526.pdf" "intro-to-r-syllabus.docx"
[21] "intro-to-r-syllabus.pdf" "LICENSE"
[23] "pages" "README.md"
[25] "site_libs" "slides"
[27] "www"
[1] "/Users/louisa-teaching/Desktop/ID543-git"
[1] "data/my_dataset.csv"
[1] "~/Downloads/my_dataset.csv"
[1] "C:/Users/Downloads/my_dataset.csv"
setwd()
setwd()
changes the working directory, leading to potential issues in collaboration and reproducibility
You and I don’t have the same file structure!
For example, my current working directory is
my-project/
├─ my-project.Rproj
├─ README
├─ data/
│ ├── raw/
│ └── processed/
├─ R/
├─ results/
│ ├── tables/
│ ├── figures/
│ └── output/
└─ docs/
An .Rproj
file is mostly just a placeholder. It remembers various options, and makes it easy to open a new RStudio session that starts up in the correct working directory. You never need to edit it directly.
A README file can just be a text file that includes notes for yourself or future users.
I like to have a folder for raw data – which I never touch – and a folder(s) for datasets that I create along the way.
Demo
here
packageThe here
package lets you refer to files without worrying too much about relative file paths.
Construct file paths with reference to the top directory holding your .Rproj
file.
here::here("data", "raw", "data.csv")
for me, here, becomes "/Users/l.smith/Documents/Teaching/Harvard/ID543 2024/data/raw/data.csv"
But if I send you my code to run, it will become whatever file path you need it to be, as long as you’re running it within the R Project.
here
packageis equivalent to
I just prefer to write out the package name whenever I need it, but you can load the package for your entire session if you want.
Note
Note that you can refer to any function without loading the whole package this way, e.g. haven::read_dta()
NA
for missing valuesNA
to any logical statementNA
functionsCertain functions deal with missing values explicitly
[1] 1 2
You know some value is implausible, whether for everyone or for a specific observation
na_if(x, y)
will replace values in x
that are equal to y
with NA
In NLSY, -1 = Refused, -2 = Don’t know, -3 = Invalid missing, -4 = Valid missing, -5 = Non-interview
Other files might have .
for missing, or 999
.
nlsy_cols <- c("glasses", "eyesight", "sleep_wkdy", "sleep_wknd",
"id", "nsibs", "samp", "race_eth", "sex", "region",
"income", "res_1980", "res_2002", "age_bir")
nlsy <- read_csv("https://github.com/louisahsmith/data/raw/main/nlsy/nlsy.csv",
na = c("-1", "-2", "-3", "-4", "-5", "-998"),
skip = 1, col_names = nlsy_cols)
Caveat: This previous way, you lose the info about the reason for missingness. If that’s important, read in the data first, create a variable for missingness reason (e.g., use fct_recode()
), then changes the values to NA
.
nlsy <- read_csv("https://github.com/louisahsmith/data/raw/main/nlsy/nlsy.csv",
skip = 1, col_names = nlsy_cols) |>
mutate(age_bir_missing = ifelse(age_bir > 0, NA, age_bir),
age_bir_missing = fct_recode(
factor(age_bir_missing), "Refused" = "-1",
"Don't know" = "-2", "Invalid missing" = "-3",
"Valid missing" = "-4", "Non-interview" = "-5",
"Other missing" = "-998"))
summary(nlsy$age_bir_missing)
Other missing Non-interview Invalid missing NA's
1343 5385 15 5943
Sometimes you may just want to get rid of all the rows with missing values.
[1] 12686
[1] 6743
Caution
Don’t do this without good reason! It will exlude rows with any missing values, even in variables you’re not using.
There are multiple functions in the tidyverse
(specifically, the {dplyr}
package) for joining/merging data
Mutating joins merge two datasets based on matching variable(s), adding together the new columns from the joined dataframe
Note
We will also refer to the x
dataframe as the left-hand side (LHS) and the y
dataframe as the right-hand side (RHS)
I may have people and dates in one dataset, and want to merge with a separate dataset on prescriptions:
id | start_date | end_date |
---|---|---|
1 | 01/01/2021 | 01/01/2024 |
2 | 03/01/2023 | 01/01/2025 |
3 | 09/01/2020 | 10/01/2021 |
id | Rx_date | Rx_name |
---|---|---|
1 | 02/21/2020 | Tylenol |
1 | 04/06/2022 | Aspirin |
2 | 06/09/2023 | Claritin |
The NLSY also included the kids of the moms in the NLSY79 survey that we’re using.
# A tibble: 11,551 × 6
id_kid id_mom sex_kid dob_kid agebir_mom bwt_kid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 201 2 2 1993 34 139
2 202 2 2 1994 35 NA
3 301 3 2 1981 19 162
4 302 3 2 1983 22 144
5 303 3 2 1986 24 112
6 401 4 1 1980 18 107
7 403 4 2 1997 34 NA
8 801 8 2 1976 17 119
9 802 8 1 1979 20 107
10 803 8 2 1982 24 146
# ℹ 11,541 more rows
Note
Birthweight is in ounces
Error in `left_join()`:
! `by` must be supplied when `x` and `y` have no common variables.
ℹ Use `cross_join()` to perform a cross-join.
It will automatically look for matching columns (can be dangerous!) but if none, need to specify:
# A tibble: 2,284 × 10
id sleep_wkdy sleep_wknd age_bir sex id_kid sex_kid dob_kid agebir_mom
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 5 7 19 2 301 2 1981 19
2 3 5 7 19 2 302 2 1983 22
3 3 5 7 19 2 303 2 1986 24
4 6 6 7 30 1 NA NA NA NA
5 8 7 9 17 2 801 2 1976 17
6 8 7 9 17 2 802 1 1979 20
7 8 7 9 17 2 803 2 1982 24
8 16 6 7 31 2 1601 1 1990 31
9 16 6 7 31 2 1602 1 1993 34
10 16 6 7 31 2 1603 2 1996 37
# ℹ 2,274 more rows
# ℹ 1 more variable: bwt_kid <dbl>
by = join_by()
syntaxThe most flexible way to specify the join is with by = join_by()
by = c("id" = "id_mom")
syntax, which is more limited but works for simple casesLHS rows are duplicated if we have multiple matches, but we lose any rows in the RHS dataset that don’t have a match
[1] 11551
nlsy_left <- left_join(nlsy_sleep, nlsy_kids,
by = join_by(id == id_mom))
n_distinct(nlsy_left$id_kid)
[1] 1784
In this case, the moms of some of the kids aren’t in the nlsy_sleep
dataset, so kids without moms are lost
# A tibble: 11,551 × 10
id sleep_wkdy sleep_wknd age_bir sex id_kid sex_kid dob_kid agebir_mom
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 5 7 19 2 301 2 1981 19
2 3 5 7 19 2 302 2 1983 22
3 3 5 7 19 2 303 2 1986 24
4 8 7 9 17 2 801 2 1976 17
5 8 7 9 17 2 802 1 1979 20
6 8 7 9 17 2 803 2 1982 24
7 16 6 7 31 2 1601 1 1990 31
8 16 6 7 31 2 1602 1 1993 34
9 16 6 7 31 2 1603 2 1996 37
10 20 7 8 30 2 2001 2 1990 30
# ℹ 11,541 more rows
# ℹ 1 more variable: bwt_kid <dbl>
Now we don’t have the dads, because there are no matching ids in the RHS dataset
But we do keep all the kids, even those without moms in the LHS
# A tibble: 12,052 × 10
id sleep_wkdy sleep_wknd age_bir sex id_kid sex_kid dob_kid agebir_mom
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 5 7 19 2 301 2 1981 19
2 3 5 7 19 2 302 2 1983 22
3 3 5 7 19 2 303 2 1986 24
4 6 6 7 30 1 NA NA NA NA
5 8 7 9 17 2 801 2 1976 17
6 8 7 9 17 2 802 1 1979 20
7 8 7 9 17 2 803 2 1982 24
8 16 6 7 31 2 1601 1 1990 31
9 16 6 7 31 2 1602 1 1993 34
10 16 6 7 31 2 1603 2 1996 37
# ℹ 12,042 more rows
# ℹ 1 more variable: bwt_kid <dbl>
This dataset is larger than either of the initial datasets alone: it has the dads without kids and the kids without moms
# A tibble: 1,783 × 10
id sleep_wkdy sleep_wknd age_bir sex id_kid sex_kid dob_kid agebir_mom
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 5 7 19 2 301 2 1981 19
2 3 5 7 19 2 302 2 1983 22
3 3 5 7 19 2 303 2 1986 24
4 8 7 9 17 2 801 2 1976 17
5 8 7 9 17 2 802 1 1979 20
6 8 7 9 17 2 803 2 1982 24
7 16 6 7 31 2 1601 1 1990 31
8 16 6 7 31 2 1602 1 1993 34
9 16 6 7 31 2 1603 2 1996 37
10 20 7 8 30 2 2001 2 1990 30
# ℹ 1,773 more rows
# ℹ 1 more variable: bwt_kid <dbl>
first_births <- inner_join(nlsy_sleep, nlsy_kids,
by = join_by(id == id_mom,
age_bir == agebir_mom))
first_births
# A tibble: 708 × 9
id sleep_wkdy sleep_wknd age_bir sex id_kid sex_kid dob_kid bwt_kid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 5 7 19 2 301 2 1981 162
2 8 7 9 17 2 801 2 1976 119
3 16 6 7 31 2 1601 1 1990 109
4 20 7 8 30 2 2001 2 1990 129
5 27 8 8 27 2 2701 2 1988 117
6 49 8 8 24 2 4901 1 1982 139
7 57 7 8 21 2 5701 1 1979 148
8 86 8 8 17 2 8601 2 1977 97
9 96 7 7 19 2 9601 2 1980 124
10 97 7 8 29 2 9701 1 1987 48
# ℹ 698 more rows
summarize()
to get summary statistics of our datagroup_by()
to group our data and then get summary statistics within those groupsNA
here
package to refer to files in a projectleft_join()
, right_join()
, full_join()
, and inner_join()
to merge datasetssummarize()
: calculate summary statistics (with .by =
argument for stratified statistics)group_by()
: group datahere::here()
: refer to files in a projectread_rds()
, read_csv()
, read_dta()
, read_sas()
, read_excel()
: read in datacomplete.cases()
, na.omit()
: remove missing valuesis.na()
, anyNA()
: check for missing valuesna_if()
: replace values with NA
left_join()
, right_join()
, full_join()
, inner_join()
: merge datasets