Final project

Overall goal

Your objective is to carry out an analysis of your choice, on data of your choice, following a few general steps:

  • Clean the data
  • Analyze the data
  • Present results

Specific objectives

Along the way, you should make sure you hit all these skills, though not necessarily in this order:

  • Use mutate() to change/create at least 3 variables.
  • Convert at least one of these variables from numeric to factor by creating categories
  • Use at least two fct_ functions to manage your factor variables
  • Use select() and at least one select helper (e.g. starts_with()), at least once
  • Use case_when() at least once
  • Apply some eligibility criteria (at least two criteria) to your dataset using filter()
  • Use the pipe ( |> or %>% ) to link a few of these steps together
  • Create a Table 1 of descriptive statistics about your sample
  • Fit a regression and present well-formatted results from the regression
  • Create at least 2 ggplots of different types (geoms). These should be reasonably well-formatted for publication (e.g., have appropriate labels, a title, etc.)

and at least one of these:

  • Use a _join() function
  • Deal with missing data in a thoughtful way (i.e., more than just using na.rm = TRUE)
  • Read in your data and save your results in different directories within an R project using the {here} package
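As a rough illustration of how several of the required skills can be chained together, the sketch below uses the built-in airquality dataset as a stand-in for your own data. The dataset, the temperature cutoffs, and the object names (analysis_data, temp_c, temp_cat, month_name) are assumptions chosen for the example, not requirements. The broom and ggplot2 calls show only one of many reasonable ways to present a model and a plot, and they assume those packages are installed.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# airquality ships with base R; it stands in for your own dataset here
analysis_data <- airquality |>
  # Eligibility criteria (two): keep days with a measured ozone value,
  # and restrict to June through September
  filter(!is.na(Ozone), Month >= 6) |>
  mutate(
    # Variable 1: convert temperature from Fahrenheit to Celsius
    temp_c = (Temp - 32) * 5 / 9,
    # Variable 2: turn the numeric temperature into categories
    # (cutoffs are arbitrary and only for illustration)
    temp_cat = case_when(
      temp_c < 25  ~ "Mild",
      temp_c < 30  ~ "Warm",
      temp_c >= 30 ~ "Hot"
    ),
    # fct_ function 1: put the categories in a sensible order
    temp_cat = fct_relevel(temp_cat, "Mild", "Warm", "Hot"),
    # Variable 3 and fct_ function 2: month names ordered by calendar month
    month_name = fct_reorder(factor(month.abb[Month]), Month)
  ) |>
  # starts_with() ignores case by default, so this keeps Temp, temp_c, and temp_cat
  select(Ozone, Wind, starts_with("temp"), month_name)

# Fit a linear regression and show tidy results with confidence intervals
fit <- lm(Ozone ~ temp_c + Wind, data = analysis_data)
broom::tidy(fit, conf.int = TRUE)

# One of the two required plots: ozone by temperature category
ggplot(analysis_data, aes(x = temp_cat, y = Ozone)) +
  geom_boxplot() +
  labs(
    title = "Ozone by temperature category, June to September",
    x = "Temperature category",
    y = "Ozone (ppb)"
  )
```

Your own project will need different variables and criteria. The point is only that the pipe lets you read the cleaning steps from top to bottom.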

Presentation

Provide a short description (~ 1 paragraph) of the data and your overall goal for the analysis.

In addition, describe in words what you are doing throughout your code, either in code comments or in text between code chunks. For example, when using filter() to restrict your data to meet some criteria, tell us: “Restrict data to older women with high blood pressure.” If you are ever worried that it might not be clear where you are meeting one of the required points, feel free to point it out in your description!
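For instance, a commented filter() step might look like the sketch below. The patients data, its column names, and the cutoffs (age 65 or older, systolic blood pressure of at least 140 mmHg) are hypothetical and exist only to make the example runnable.

```r
library(dplyr)

# Hypothetical example data; the column names and values are made up
patients <- tibble(
  id  = 1:6,
  sex = c("F", "F", "M", "F", "M", "F"),
  age = c(72, 45, 80, 68, 66, 70),
  sbp = c(150, 160, 155, 120, 145, 142)
)

# Restrict data to older women with high blood pressure
# (here: female, aged 65 or older, systolic blood pressure >= 140 mmHg)
older_women_htn <- patients |>
  filter(sex == "F", age >= 65, sbp >= 140)

older_women_htn
```

The comment states the intent in plain words, and the note in parentheses records the exact definitions so a reader does not have to infer them from the code.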

This is not a statistics or epidemiology class, so you don’t have to interpret your findings (they don’t even have to be meaningful, as with most of our examples in class!), but you are welcome to.

Include all of your code and output in a Quarto document. Render your document to html, docx, or pdf and submit it on Canvas.

Data

You may choose any dataset you’d like.

The data you use should have at least 5 meaningful variables (i.e., not id), including some numeric variables that you can turn into factor variables, and it should have at least 100 rows. It should be interesting enough to motivate you to create some interesting tables and plots! If you are not sure whether it will be appropriate for this project, feel free to email us to check.
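A quick way to screen a candidate dataset against the size requirements is sketched below, again with the built-in airquality data as a stand-in and with the object names as assumptions. Whether a variable is "meaningful" is a judgment call that code cannot make for you.

```r
# Stand-in for the dataset you are considering
candidate <- airquality

# At least 100 rows?
n_rows_ok <- nrow(candidate) >= 100

# At least 5 variables? (You still need to decide which ones are meaningful)
n_vars_ok <- ncol(candidate) >= 5

# How many numeric variables could be turned into categories?
n_numeric <- sum(vapply(candidate, is.numeric, logical(1)))

c(rows_ok = n_rows_ok, vars_ok = n_vars_ok, numeric_vars = n_numeric)
```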

Guidelines

Your work must be your own, in that it reflects your decisions about how to analyze the data, what variables to use, what results to present, etc. If you are using the same dataset as a classmate, these choices must be substantially different. When you get an error you just can’t fix, you are welcome to ask your classmates for help and to use the class materials and resources, the internet, and AI tools, keeping in mind the requirement that the work be your own and the focus on the particular skills and tools we learned in class. When you use an internet resource, please cite it (just a link is fine; you don’t need formal citations). Remember that you are bound by the standards of Academic Integrity of Harvard University and the Harvard T.H. Chan School of Public Health.

Do not enter confidential, patient, student, unpublished, or otherwise non-public data into a public AI tool. If you use AI, you are responsible for checking its suggestions and explaining any code you submit.

AI and reproducibility appendix

At the end of your submitted project, include an AI and reproducibility appendix. If you used AI, include your full conversation transcript, either pasted into the appendix or attached as a separate file. If you did not use AI, write: “I did not use AI for this assignment.”

Whether or not you used AI, include at least three checks showing that your data and code did what you think they did. Choose checks that match your project. Examples include:

  • glimpse(), summary(), or dim() after reading in data
  • count() or tabyl() after creating categories
  • row counts before and after filtering
  • row counts and unmatched record checks before and after joins
  • summaries of variables used in a table, model, or plot
  • a written explanation of what each plot layer or model term is doing
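The sketch below shows what three such checks could look like in practice. It uses the built-in airquality dataset, and the object names (raw, filtered, categorised, month_lookup) and the wind cutoff are assumptions made for the example. The lookup table deliberately leaves out one month so that the join check has something to find.

```r
library(dplyr)

raw <- airquality
dim(raw)  # rows and columns straight after reading in the data

# Check 1: row counts before and after filtering
filtered <- raw |>
  filter(!is.na(Ozone), Month >= 6)
c(before = nrow(raw), after = nrow(filtered))

# Check 2: confirm that new categories line up with the source variable
categorised <- filtered |>
  mutate(wind_cat = if_else(Wind >= 10, "Windy", "Calm"))
categorised |>
  group_by(wind_cat) |>
  summarise(n = n(), min_wind = min(Wind), max_wind = max(Wind))

# Check 3: row counts and unmatched records around a join
# (this made-up lookup table omits September on purpose)
month_lookup <- tibble(Month = 6:8, season = "Summer")
joined <- categorised |>
  left_join(month_lookup, by = "Month")
nrow(joined) == nrow(categorised)  # a left join on unique keys should not add rows

unmatched <- categorised |>
  anti_join(month_lookup, by = "Month")
count(unmatched, Month)  # which months found no match in the lookup table?
```

Each check compares the result with something you can reason about independently, such as the range of the source variable within each category, instead of re-running the step and trusting the output.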

AI appendix template

Use this structure at the end of your submitted project.

AI use

Choose one:

  • I did not use AI for this assignment.
  • I used AI for this assignment and have included the full conversation below or attached it as a separate file.

Full conversation

Paste the full AI conversation here, or write the name of the attached transcript file.

How AI helped

Briefly describe what you asked AI to help with. For example: understanding an error message, checking whether a function was appropriate, improving comments, or suggesting verification code.

What I accepted, changed, or rejected

Briefly describe which AI suggestions you used, which you changed, and which you ignored.

Verification

Show at least three pieces of code or output you used to check your work. These should be checks that would help catch a mistake, not just a repeat of the final answer.

# verification code here

In a few sentences, explain what the verification output shows.

Grading rubric

Criteria | Points

Overall goal
Analysis objective is clear and well-defined | 2
Dataset is appropriate and has at least 5 meaningful variables with sufficient rows | 1

Specific objectives
mutate() used to change/create at least 3 variables | 1
Conversion of at least one variable from numeric to factor using categories | 1
Use of at least two fct_ functions to manage factor variables | 1
Use of select() and at least one select helper function | 1
Use of case_when() at least once | 1
Application of at least two eligibility criteria using filter() | 1
Use of the pipe (|> or %>%) to link multiple steps | 1
Creation of Table 1 with descriptive statistics about the sample | 2
Regression is fitted and well-formatted results are presented | 2
Creation of at least 2 well-formatted ggplots of different types | 2

Additional objective
Use of _join() function, or dealing with missing data in a thoughtful way, or using the {here} package | 2

Code description
Clear and detailed description of steps taken in the code | 2
Any internet resources used are cited | 1

AI process and verification
AI use is documented, or non-use is stated | 1
Full AI transcript is included if AI was used | 1
Verification checks are appropriate for the analysis | 2
Verification output is explained in the student’s own words | 1