Introduction to R
summarize() to get summary statisticshere package to refer to filesleft_join(), right_join(), full_join(), and inner_join()We can get certain summary statistics about our data with summary(), which we can use either on an entire dataframe or on a single variable
Min. 1st Qu. Median Mean 3rd Qu. Max.
13.00 19.00 22.00 23.45 27.00 52.00
id sleep_wkdy sleep_wknd age_bir
Min. : 3 Min. : 0.000 Min. : 0.000 Min. :13.00
1st Qu.: 2317 1st Qu.: 6.000 1st Qu.: 6.000 1st Qu.:19.00
Median : 4744 Median : 7.000 Median : 7.000 Median :22.00
Mean : 5229 Mean : 6.643 Mean : 7.267 Mean :23.45
3rd Qu.: 7937 3rd Qu.: 8.000 3rd Qu.: 8.000 3rd Qu.:27.00
Max. :12667 Max. :13.000 Max. :14.000 Max. :52.00
sex
Min. :1.000
1st Qu.:1.000
Median :2.000
Mean :1.584
3rd Qu.:2.000
Max. :2.000
We can also apply certain functions to a variable(s) to get a single statistic: mean(), median(), var(), sd(), cov(), cor(), min(), max(), quantile(), etc.
summarize()But what if we want a lot of summary statistics – just not those that come with the summary() function?
We can use summarize()
summarize() specificsImportant to note:
Because we can pipe, we can also look at statistics of variables that we make using mutate(), in a dataset we’ve subsetted with filter().
# A tibble: 1 × 1
mean_men
<dbl>
1 0.283
Note
Note that we’re standardizing the data before filtering. Or else the mean would be 0!
.by = argument in summarize()Instead of creating separate summaries, we can get statistics for all groups at once:
# A tibble: 2 × 2
sex_cat mean_age_bir
<fct> <dbl>
1 Female 22.2
2 Male 25.1
This is cleaner and more efficient than filtering multiple times!
.by =?You’ll see some optional arguments in the tidyverse start with a dot, like .by =.
summarize(), by = (with no dot) looks like it will be another variableSometimes we just want to know how many observations are in a group. We already saw how to do that with count(), but we can also do it with summarize() and the n() function, which counts the number of rows:
You can group by multiple variables by putting them in a vector:
# A tibble: 6 × 3
sex_cat race_eth_cat mean_income
<fct> <fct> <dbl>
1 Female Non-Black, Non-Hispanic 17404.
2 Male Non-Black, Non-Hispanic 20839.
3 Female Black 9457.
4 Female Hispanic 10884.
5 Male Black 11755.
6 Male Hispanic 10651.
group_by() approachYou will also see the traditional group_by() approach:
group_by()We can tell it’s “grouped” and how many groups there are by printing out the data.
The data itself won’t look (much) different, but we’ll be able to perform grouped functions on it.
# A tibble: 1,205 × 15
# Groups: region [4]
id glasses eyesight sleep_wkdy sleep_wknd nsibs race_eth sex region
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 0 1 5 7 3 3 2 1
2 6 1 2 6 7 1 3 1 1
3 8 0 2 7 9 7 3 2 1
4 16 1 3 6 7 3 3 2 1
5 18 0 3 10 10 2 3 1 3
6 20 1 2 7 8 2 3 2 1
7 27 0 1 8 8 1 3 2 1
8 49 1 1 8 8 6 3 2 1
9 57 1 2 7 8 1 3 2 1
10 67 0 1 8 8 1 3 1 1
# ℹ 1,195 more rows
# ℹ 6 more variables: income <dbl>, age_bir <dbl>, eyesight_cat <fct>,
# glasses_cat <fct>, race_eth_cat <fct>, sex_cat <fct>
.by = vs. group_by()The .by = approach is generally simpler and doesn’t leave your data grouped (which can cause issues later)
Use group_by() when you need to perform multiple grouped operations in sequence or just create a new variable with mutate()
We have been reading in data as an .rds file:
We could also read it in as a .csv file:
.rds is an R-specific file for a single objectIt will be the exact same object when you read it back in.
You can save any object, not just a dataframe:
What is y going to print?
.csv files are much more general but don’t maintain things like factors.csv files might need a little more specification to read in# A tibble: 12,686 × 14
H0012400 H0012500 H0022300 H0022500 R0000100 R0009100 R0173600 R0214700
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -4 -4 -4 -4 1 1 5 3
2 0 1 4 3 2 8 5 3
# ℹ 12,684 more rows
# ℹ 6 more variables: R0214800 <dbl>, R0216400 <dbl>, R0217900 <dbl>,
# R0402800 <dbl>, R7090700 <dbl>, T4120500 <dbl>
nlsy_full <- read_csv(
"https://github.com/louisahsmith/data/raw/main/nlsy/nlsy.csv", skip = 1,
col_names = c("glasses", "eyesight", "sleep_wkdy", "sleep_wknd",
"id", "nsibs", "samp", "race_eth", "sex", "region",
"income", "res_1980", "res_2002", "age_bir"),
na = c("-1", "-2", "-3", "-4", "-5", "-998"))
print(nlsy_full, n = 2)# A tibble: 12,686 × 14
glasses eyesight sleep_wkdy sleep_wknd id nsibs samp race_eth sex region
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA NA NA NA 1 1 5 3 2 1
2 0 1 4 3 2 8 5 3 2 1
# ℹ 12,684 more rows
# ℹ 4 more variables: income <dbl>, res_1980 <dbl>, res_2002 <dbl>,
# age_bir <dbl>
write_() functionIf you are sharing data with collaborators who don’t use R, or you want to look at it in Excel, you can save a dataframe as a .csv file:
The data will be saved in your “working directory” (see the top of your console) unless you specify otherwise
Note
We’ll talk about directories in a little bit!
{haven} packageRows: 1,000
Columns: 6
$ id <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, …
$ year <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, …
$ medexp <dbl> 9, 9, 9, 10, 11, 6, 7, 7, 7, 7, 4, 3, 5, 4, 4, 5, 3, 6, 6, 3, 4…
$ inc <dbl> 49, 51, 55, 58, 61, 48, 48, 58, 59, 63, 46, 51, 55, 58, 63, 68,…
$ age <dbl> 51, 52, 53, 54, 55, 62, 63, 64, 65, 66, 57, 58, 59, 60, 61, 48,…
$ insur <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, …
Rows: 1,000
Columns: 6
$ ID <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, …
$ YEAR <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, …
$ MEDEXP <dbl> 9, 9, 9, 10, 11, 6, 7, 7, 7, 7, 4, 3, 5, 4, 4, 5, 3, 6, 6, 3, 4…
$ INC <dbl> 49, 51, 55, 58, 61, 48, 48, 58, 59, 63, 46, 51, 55, 58, 63, 68,…
$ AGE <dbl> 51, 52, 53, 54, 55, 62, 63, 64, 65, 66, 57, 58, 59, 60, 61, 48,…
$ INSUR <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, …
{readxl}All these functions take arguments, but read_excel() takes a ton of arguments – which sheet, how many rows to read, whether there are column names, a specific range to read in, etc….
help(read_excel) for details!File paths tell R where the data is located
setwd()Often you’ll see code that uses setwd() to set the working directory to a specific folder
setwd() changes the working directory, leading to potential issues in collaboration and reproducibility
It’s really annoying to change your working directory when you move around files and folders, even if it’s just you using them
Note
This is a great blog post explaining more: https://www.tidyverse.org/blog/2017/12/workflow-vs-script/
my-project/
├─ my-project.Rproj
├─ README
├─ data/
│ ├── raw/
│ └── processed/
├─ R/
├─ results/
│ ├── tables/
│ ├── figures/
│ └── output/
└─ docs/
An .Rproj file is mostly just a placeholder. It remembers various options, and makes it easy to open a new RStudio session that starts up in the correct working directory. You never need to edit it directly.
A README file can just be a text file that includes notes for yourself or future users.
I like to have a folder for raw data – which I never touch – and a folder(s) for datasets that I create along the way.
Demo
here packageThe here package lets you refer to files without worrying too much about relative file paths.
Construct file paths with reference to the top directory holding your .Rproj file.
here::here("data/raw/data.csv") for me, here, becomes /Users/l.smith/Documents/Teaching/Harvard/ID543 2026/data/raw/data.csv
But if I send you my code to run, it will become whatever file path you need it to be, as long as you’re running it within the R Project.
here packageis equivalent to
I just prefer to write out the package name whenever I need it, but you can load the package for your entire session if you want.
Note
Note that you can refer to any function without loading the whole package this way, e.g. haven::read_dta()
NA for missing valuesNA to any logical statementNA functionsCertain functions deal with missing values explicitly
[1] 1 2
You know some value is implausible, whether for everyone or for a specific observation
na_if(x, y) will replace values in x that are equal to y with NA
In NLSY, -1 = Refused, -2 = Don’t know, -3 = Invalid missing, -4 = Valid missing, -5 = Non-interview
Other files might have . for missing, or 999.
nlsy_cols <- c("glasses", "eyesight", "sleep_wkdy", "sleep_wknd",
"id", "nsibs", "samp", "race_eth", "sex", "region",
"income", "res_1980", "res_2002", "age_bir")
nlsy <- read_csv("https://github.com/louisahsmith/data/raw/main/nlsy/nlsy.csv",
na = c("-1", "-2", "-3", "-4", "-5", "-998"),
skip = 1, col_names = nlsy_cols)Sometimes you may just want to get rid of all the rows with missing values.
[1] 12686
[1] 1205
Caution
Don’t do this without good reason! It will exlude rows with any missing values, even in variables you’re not using.
There are multiple functions in the tidyverse (specifically, the {dplyr} package) for joining/merging data
Mutating joins merge two datasets based on matching variable(s), adding together the new columns from the joined dataframe
Note
We will also refer to the x dataframe as the left-hand side (LHS) and the y dataframe as the right-hand side (RHS)
I may have people and dates in one dataset, and want to merge with a separate dataset on prescriptions:
| id | start_date | end_date |
|---|---|---|
| 1 | 01/01/2021 | 01/01/2024 |
| 2 | 03/01/2023 | 01/01/2025 |
| 3 | 09/01/2020 | 10/01/2021 |
| id | Rx_date | Rx_name |
|---|---|---|
| 1 | 02/21/2020 | Tylenol |
| 1 | 04/06/2022 | Aspirin |
| 2 | 06/09/2023 | Claritin |
Goal: person 1 was taking Aspirin at some point during that interval, person 2 was taking Claritin, and person 3 was not taking any medications
The NLSY also included the kids of the moms in the NLSY79 survey that we’re using.
# A tibble: 11,551 × 6
id_kid id_mom sex_kid dob_kid agebir_mom bwt_kid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 201 2 2 1993 34 139
2 202 2 2 1994 35 NA
3 301 3 2 1981 19 162
4 302 3 2 1983 22 144
5 303 3 2 1986 24 112
6 401 4 1 1980 18 107
7 403 4 2 1997 34 NA
8 801 8 2 1976 17 119
9 802 8 1 1979 20 107
10 803 8 2 1982 24 146
# ℹ 11,541 more rows
Note
Birthweight is in ounces
Error in `left_join()`:
! `by` must be supplied when `x` and `y` have no common variables.
ℹ Use `cross_join()` to perform a cross-join.
It will automatically look for matching columns (can be dangerous!) but if none, need to specify:
# A tibble: 2,284 × 10
id sleep_wkdy sleep_wknd age_bir sex id_kid sex_kid dob_kid agebir_mom
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 5 7 19 2 301 2 1981 19
2 3 5 7 19 2 302 2 1983 22
3 3 5 7 19 2 303 2 1986 24
4 6 6 7 30 1 NA NA NA NA
5 8 7 9 17 2 801 2 1976 17
6 8 7 9 17 2 802 1 1979 20
7 8 7 9 17 2 803 2 1982 24
8 16 6 7 31 2 1601 1 1990 31
9 16 6 7 31 2 1602 1 1993 34
10 16 6 7 31 2 1603 2 1996 37
# ℹ 2,274 more rows
# ℹ 1 more variable: bwt_kid <dbl>
by = join_by() syntaxThe most flexible way to specify the join is with by = join_by()
by = c("id" = "id_mom") syntax, which is more limited but works for simple casesLHS rows are duplicated if we have multiple matches, but we lose any rows in the RHS dataset that don’t have a match
[1] 11551
[1] 1784
In this case, the moms of some of the kids aren’t in the nlsy_sleep dataset, so kids without moms are lost
# A tibble: 11,551 × 10
id sleep_wkdy sleep_wknd age_bir sex id_kid sex_kid dob_kid agebir_mom
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 5 7 19 2 301 2 1981 19
2 3 5 7 19 2 302 2 1983 22
3 3 5 7 19 2 303 2 1986 24
4 8 7 9 17 2 801 2 1976 17
5 8 7 9 17 2 802 1 1979 20
6 8 7 9 17 2 803 2 1982 24
7 16 6 7 31 2 1601 1 1990 31
8 16 6 7 31 2 1602 1 1993 34
9 16 6 7 31 2 1603 2 1996 37
10 20 7 8 30 2 2001 2 1990 30
# ℹ 11,541 more rows
# ℹ 1 more variable: bwt_kid <dbl>
Now we don’t have the dads, because there are no matching ids in the RHS dataset
But we do keep all the kids, even those without moms in the LHS
# A tibble: 12,052 × 10
id sleep_wkdy sleep_wknd age_bir sex id_kid sex_kid dob_kid agebir_mom
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 5 7 19 2 301 2 1981 19
2 3 5 7 19 2 302 2 1983 22
3 3 5 7 19 2 303 2 1986 24
4 6 6 7 30 1 NA NA NA NA
5 8 7 9 17 2 801 2 1976 17
6 8 7 9 17 2 802 1 1979 20
7 8 7 9 17 2 803 2 1982 24
8 16 6 7 31 2 1601 1 1990 31
9 16 6 7 31 2 1602 1 1993 34
10 16 6 7 31 2 1603 2 1996 37
# ℹ 12,042 more rows
# ℹ 1 more variable: bwt_kid <dbl>
This dataset is larger than either of the initial datasets alone: it has the dads without kids and the kids without moms
# A tibble: 1,783 × 10
id sleep_wkdy sleep_wknd age_bir sex id_kid sex_kid dob_kid agebir_mom
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 5 7 19 2 301 2 1981 19
2 3 5 7 19 2 302 2 1983 22
3 3 5 7 19 2 303 2 1986 24
4 8 7 9 17 2 801 2 1976 17
5 8 7 9 17 2 802 1 1979 20
6 8 7 9 17 2 803 2 1982 24
7 16 6 7 31 2 1601 1 1990 31
8 16 6 7 31 2 1602 1 1993 34
9 16 6 7 31 2 1603 2 1996 37
10 20 7 8 30 2 2001 2 1990 30
# ℹ 1,773 more rows
# ℹ 1 more variable: bwt_kid <dbl>
age_bir_mom)age_bir)in the moms dataset# A tibble: 708 × 9
id sleep_wkdy sleep_wknd age_bir sex id_kid sex_kid dob_kid bwt_kid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 5 7 19 2 301 2 1981 162
2 8 7 9 17 2 801 2 1976 119
3 16 6 7 31 2 1601 1 1990 109
4 20 7 8 30 2 2001 2 1990 129
5 27 8 8 27 2 2701 2 1988 117
6 49 8 8 24 2 4901 1 1982 139
7 57 7 8 21 2 5701 1 1979 148
8 86 8 8 17 2 8601 2 1977 97
9 96 7 7 19 2 9601 2 1980 124
10 97 7 8 29 2 9701 1 1987 48
# ℹ 698 more rows
summarize() to get summary statistics of our datagroup_by() to group our data and then get summary statistics within those groupsNAhere package to refer to files in a projectleft_join(), right_join(), full_join(), and inner_join() to merge datasetssummarize(): calculate summary statistics (with .by = argument for stratified statistics)group_by(): group datahere::here(): refer to files in a projectread_rds(), read_csv(), read_dta(), read_sas(), read_excel(): read in datacomplete.cases(), na.omit(): remove missing valuesis.na(), anyNA(): check for missing valuesna_if(): replace values with NAleft_join(), right_join(), full_join(), inner_join(): merge datasets