Dataset ideas from Yunge
Here is a list of publicly available datasets that I have personally enjoyed pulling from in my classes, research, and work, or have learned about from fellow researchers! Feel free to use this as a starting point for final project data ideas, and please reach out if you have any questions or want help exploring what is out there.
Kaggle: Awesome hub with many datasets spanning a wide range of topics, complexity levels, and file types. You can find simple learning datasets (my personal favorite is this cereal one) or more complex ones (I just found this health care analytics one, which can be used to practice predicting patient outcomes).
Panel Study of Income Dynamics (PSID): National, longitudinal household survey with a socioeconomic focus. I used this data for my master’s thesis, so I have gotten to know it quite well! For the scope of this class, it would be easiest to use data from a single year. If you would like to do a longitudinal analysis, you can take on the challenge, but there are some specific considerations for doing this with the PSID.
CDC Public Data Repository: Lots to look at here, but some highlights that I’ve picked out:
- Data regarding chronic diseases, injury and violence, vaccination, etc.
- Social Vulnerability Index (SVI) data if you are interested in geographic factors and health (i.e. social determinants of health with emphasis on “place” as a predictor of health outcomes/access): https://www.atsdr.cdc.gov/place-health/php/svi/svi-data-documentation-download.html
US Census Bureau and American Community Survey (ACS): Great for US demographic, socioeconomic, and housing data. I find them especially useful for merging with other datasets on a geographic identifier, e.g. a state FIPS code or ZIP code.
World Health Organization: Data for global health statistics
Environmental Protection Agency (EPA): Environmental data on topics like air pollution and water quality.
Analyze Boston: Boston-specific data on topics like property assessments, crime incident reports, etc.
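If you do try merging Census/ACS data with another dataset by a geographic identifier, here is a rough sketch of how that might look in Python with pandas. Everything in it is made up for illustration: the column names (`state_fips`, `asthma_rate`, `median_income`), the numbers, and the `normalize_fips` helper are placeholders, not values or names from any real file. The main idea is that FIPS codes should be treated as zero-padded text, since reading them as numbers drops the leading zero (Alabama's "01" becomes 1) and the merge can then silently miss rows.

```python
import pandas as pd

# Hypothetical health dataset where the state FIPS code was read as an integer,
# so the leading zeros are gone. All values are placeholders, not real data.
health = pd.DataFrame({
    "state_fips": [1, 6, 25],
    "asthma_rate": [9.1, 8.4, 10.2],
})

# Hypothetical ACS-style extract where the FIPS code was kept as text.
acs = pd.DataFrame({
    "state_fips": ["01", "06", "25"],
    "median_income": [50000, 78000, 84000],
})

def normalize_fips(codes: pd.Series, width: int = 2) -> pd.Series:
    """Convert codes to strings left-padded with zeros (2 digits for states)."""
    return codes.astype(str).str.zfill(width)

health["state_fips"] = normalize_fips(health["state_fips"])
acs["state_fips"] = normalize_fips(acs["state_fips"])

# Left merge keeps every row of the health data. `validate` raises an error if
# a key is duplicated, and `indicator` adds a `_merge` column showing which
# rows found a match.
merged = health.merge(
    acs,
    on="state_fips",
    how="left",
    validate="one_to_one",
    indicator=True,
)

print(merged)
print(merged["_merge"].value_counts())
```

Checking the `_merge` column after merging is a quick way to spot rows that did not find a match. For county-level data, the same helper could be used with `width=5`, assuming the codes are five-digit state-plus-county FIPS codes.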