25-papers.qmd

---
engine: knitr
---

# Papers {#sec-papers}

```{r}
#| include: false
#| eval: true
#| warning: false
#| message: false

library(tidyverse)
library(tinytable)

rubric <- read_csv(here::here("inputs/rubric.csv"))
```

One way to build understanding of material is by using it. The purpose of these papers is to give you a chance to implement what you have learnt in a real-world setting. Completing the papers is also important from the perspective of building a portfolio for job applications.

Expectations change from year to year so please treat the "previous examples" as examples rather than templates.


## *Donaldson* Paper {#sec-paper-one} 

### Task

- Working **individually** and in an entirely reproducible way, please find a dataset of interest on [Open Data Toronto](https://open.toronto.ca) and write a short paper telling a story about the data.
  - Create a well-organized folder with appropriate sub-folders, and add it to GitHub. You should use this [starter folder](https://github.com/RohanAlexander/starter_folder).
  - Find a dataset of interest on [Open Data Toronto](https://open.toronto.ca). (While not banned, please don't use a dataset about the pandemic.) 
    - Put together an R script, "scripts/00-simulate_data.R", that simulates the dataset of interest and develops some tests. Push to GitHub and include an informative commit message
    - Write an R script, "scripts/01-download_data.R" to download the actual data in a reproducible way using `opendatatoronto` [@citeSharla]. Save the data: "data/raw_data/unedited_data.csv" (use a meaningful name and appropriate file type). Push to GitHub and include an informative commit message.
  - Prepare a PDF using Quarto "paper/paper.qmd" with these sections: title, author, date, abstract, introduction, data, and references. 
    - The title should be descriptive, informative, and specific. 
    - The date should be in an unambiguous format. Add a link to the GitHub repo in the acknowledgments.
    - The abstract should be three or four sentences. The abstract must tell the reader the top-level finding. What is the one thing that we learn about the world because of this paper?
    - The introduction should be two or three paragraphs of content. And there should be an additional final paragraph that sets out the remainder of the paper.
    - The data section should thoroughly and precisely discuss the source of the data and the broader context that gave rise to it (ethical, statistical, and otherwise). Comprehensively describe and summarize the data using text, graphs, and tables. Graphs must be made with `ggplot2` [@citeggplot] and tables must be made with `tinytable` [@tinytable]. Graphs must show the actual data, or as close to it as possible, not summary statistics. Graphs and tables should be cross-referenced in the text e.g. "Table 1 shows...".
    - References should be added using BibTeX. Be sure to reference R, and any R packages you use, as well as the dataset. Strong submissions will draw on related literature and reference those.
    - The paper should be well-written, draw on relevant literature, and explain all technical concepts. Pitch it at an educated, but non-specialist, audience.
    - Use appendices for supporting, but not critical, material.
    - Push to GitHub and include an informative commit message
- Submit a link to the GitHub repo. Please do not update the repo after the deadline.
- There should be no evidence that this is a class assignment.

### Checks

- There should be no R code or raw R output in the final PDF.
- An example statement for the README on LLM usage that you could base yours on is: "Statement on LLM usage: Aspects of the code were written with the help of the autocomplete tool, Codriver. The abstract and introduction were written with the help of ChatHorse and the entire chat history is available in `other/llm/usage.txt`."
- The paper should render directly to PDF i.e. use "Render to PDF". 
- Graphs, tables, and text should be clear, and of comparable quality to those of the Financial Times.
- The date should be up-to-date and unambiguous (e.g. 2/3/2024 is ambiguous, 2 March 2024 is not).
- The entire workflow should be entirely reproducible.
- There should not be any typos.
- There should be no sign this is a school paper.
- There must be a link to the paper's GitHub repo using a footnote.
- The GitHub repo should be well-organized, and contain an informative README. 
- The paper should be well-written and able to be understood by the average reader of, say, the Financial Times This means that you are allowed to use mathematical notation, but you must explain all of it in plain language. All statistical concepts and terminology must be explained. Your reader is someone with a university education, but not necessarily someone who understands what a p-value is.
- Abstracts need to be "tightly written", almost terse. Remove unnecessary words. Do not include more than four sentences. (You can break this rule once you get experience.)
- Introduction needs paragraphs (leave a space between lines in the Quarto Document). 
- In the introduction, please telegraph the rest of the paper: "Section 2..., Section 3....". (You can break this rule once you get experience.)
- Please don't read the data from their server into the Quarto Document, read the saved version. Submissions that do this receive 0 overall.
- In the introduction, please be more specific about your findings.
- The data section is not about data cleaning, it is about the data. Put data cleaning into an appendix. Unless there is something critical, do not discuss data cleaning in the data section.
- Simulation needs a seed. 
- Do not call the repo "Paper 1" or similar. 
- Do not have sections called "graphs" or "tables" or similar.
- Use `usethis::git_vaccinate()` to get a better gitignore file, and specifically to ignore `DS_Store`.
- Please remember to cite both the dataset that you use and also `opendatatoronto` - they are separate things.

### FAQ

- Can I use a dataset from Kaggle instead? No, because they have done the hard work for you.
- I cannot use code to download my dataset, can I just manually download it? No, because your entire workflow needs to be reproducible. Please fix the download problem or pick a different dataset.
- How much should I write? Most students submit something in the two-to-six-page range, but it is up to you. Be precise and thorough.
- My data is about apartment blocks/NBA/League of Legends so there's no broader context, what do I do? Please re-read the relevant chapter and readings to better understand bias and ethics. If you really cannot think of something, then it might be worth picking a different dataset.
- Can I use Python? No. If you already know Python then it does not hurt to learn another language.
- Why do I need to cite R, when I don't need to cite Word? R is a free statistical programming language with academic origins, so it is appropriate to acknowledge the work of others. It is also important for reproducibility.
- What reference style should I use? Any major reference style is fine (APA, Harvard, Chicago, etc); just pick one that you are used to.
- The paper in the starter folder has a model section, so do I need to put together a model? No. The starter folder is designed to be applicable to all of the papers; just delete the aspects that you do not need. 
- The paper in the starter folder has a data sheets appendix, so do I need to put together a data sheet? No. The starter folder is designed to be applicable to all of the papers; just delete the aspects that you do not need. 
- What does "graph the actual data" mean? If you have, say 5,000 observations in the dataset and three variables, then for every variable there should be a graph that has 5,000 points in the case of dots, or adds up to 5,000 in the case of bar charts and histograms.


### Rubric

```{r}
#| eval: true
#| warning: false
#| message: false
#| echo: false

rubric |>
  filter(!Component %in% c("Data cited", "Class paper", "Estimand", "Replication", "Model", "Results", "Discussion", "Enhancements", "Idealized methodology", "Idealized survey", "Pollster review", "Datasheet", "Parquet", "Surveys, sampling, and observational data")) |>
  tt()
```

### Previous examples

- 2024 (Fall): 
[Julia Lee](https://github.com/JuliaJLee/Toronto_Paramedic_Services/blob/main/paper/paper.pdf), 
[Steven Li](https://github.com/stevenli-uoft/Toronto_BikeShare_Development/blob/main/paper/paper.pdf), and 
[Ziheng Zhong](https://github.com/iJustinn/Toronto_Cycling_Network/blob/main/paper/paper.pdf).
- 2024 (Winter): 
[Abbass Sleiman](inputs/pdfs/paper1-2024-SleimanAbbass.pdf), 
[Benny Rochwerg](inputs/pdfs/paper1-2024-RochwergBenny.pdf), 
[Carly Penrose](inputs/pdfs/paper-1-2024-CarlyPenrose.pdf) (an article based on this paper was later published by CBC News [@carlyfirearticle]), 
[Hadi Ahmad](inputs/pdfs/paper-1-2024-HadiAhmad.pdf), 
[Luca Carnegie](inputs/pdfs/paper-1-2024-LucaCarnegie.pdf), 
[Sami El Sabri](inputs/pdfs/paper-1-2024-SamiElSabri.pdf), 
[Thomas Fox](inputs/pdfs/paper-1-2024-ThomasFox.pdf), and 
[Timothius Prajogi](inputs/pdfs/paper-1-2024-TimothiusPrajogi.pdf).
- 2023: 
[Christina Wei](inputs/pdfs/paper-1-2023-Christina_Wei.pdf), and 
[Inessa De Angelis](inputs/pdfs/paper-1-2023-InessaDeAngelis.pdf).
- 2022: 
[Adam Labas](inputs/pdfs/paper_one-2022-adam_labas.pdf), 
[Alicia Yang](inputs/pdfs/paper_one-2022-alicia_yang.pdf), 
[Alyssa Schleifer](inputs/pdfs/paper_one-2022-alyssa_schleifer.pdf), 
[Ethan Sansom](inputs/pdfs/paper_one-2022-ethan_sansom.pdf), 
[Hudson Yuen](inputs/pdfs/paper_one-2022-hudson_yuen.pdf), 
[Jack McKay](inputs/pdfs/paper_one-2022-jack_mckay.pdf), 
[Roy Chan](inputs/pdfs/paper_one-2022-roy_chan.pdf), 
[Thomas D'Onofrio](inputs/pdfs/paper_one-2022-thomas_donofrio.pdf), and 
[William Gerecke](inputs/pdfs/paper_one-2022-william_gerecke.pdf).
- 2021: 
[Amy Farrow](inputs/pdfs/paper_one-2021-Amy_Farrow.pdf), 
[Morgaine Westin](inputs/pdfs/paper_one-2021-Morgaine_Westin.pdf), and 
[Rachael Lam](inputs/pdfs/paper_one-2021-Rachael_Lam.pdf).


## *Mawson* Paper {#sec-paper-two}

### Task

- Working as part of a team of one to three people, please pick a paper of interest to you, with code and data that are available from: 

    1. A paper published anytime since 2019, in an American Economic Association [journal](https://www.aeaweb.org/journals). These journals are: "American Economic Review", "AER: Insights", "AEJ: Applied Economics", "AEJ: Economic Policy", "AEJ: Macroeconomics", "AEJ: Microeconomics", "Journal of Economic Literature", "Journal of Economic Perspectives", "AEA Papers & Proceedings". 
    2. Any article from the Institute for Replication list available [here](https://i4replication.org/reports.html) that has a replicability status of "Looking for replicator".
    3. One of [Gilad Feldman's](https://mgto.org/check-me-replicate-me/) papers.^[Gilad gave explicit permission and encouragement to be included in this list.]

- Following the [*Guide for Accelerating Computational Reproducibility in the Social Sciences*](https://bitss.github.io/ACRE/), please complete a **replication**^[This terminology is used following @barba2018terminologies, but it is the opposite of that used by BITSS.] of at least three graphs, tables, or a combination, from that paper, using the [Social Science Reproduction Platform](https://www.socialsciencereproduction.org). Note the DOI of your replication.
- Working in an entirely reproducible way then conduct a **reproduction** based on two or three aspects of the paper, and write a short paper about that. 
  - Create a well-organized folder with appropriate sub-folders, add it to GitHub, and then prepare a PDF using Quarto with these sections (you should use this [starter folder](https://github.com/RohanAlexander/starter_folder)): title, author, date, abstract, introduction, data, results, discussion, and references.
  - The aspects that you focus on in your paper could be the same aspects that you replicated, but they do not need to be. Follow the direction of the paper, but make it your own. That means you should ask a slightly different question, or answer the same question in a slightly different way, but still use the same dataset.
  - Include the DOI of your replication in your paper and a link to the GitHub repo that underpins your paper.
  - The results section should convey findings.
  - The discussion should include three or four sub-sections that each focus on an interesting point, and there should be another sub-section on the weaknesses of your paper, and another on potential next steps for it.
  - In the discussion section, and any other relevant section, please be sure to discuss ethics and bias, with reference to relevant literature.
  - The paper should be well-written, draw on relevant literature, and explain all technical concepts. Pitch it at an educated, but non-specialist, audience.
  - Use appendices for supporting, but not critical, material. 
  - Code should be entirely reproducible, well-documented, and readable.
- Submit a link to the GitHub repo. Please do not update the repo after the deadline.
- There should be no evidence that this is a class assignment.


### Checks

- The paper should not just copy/paste the code from the original paper, but have instead used that as a foundation to work from.
- Your paper should have a link to the associated GitHub repository and the DOI of the Social Science Reproduction Platform replication that you conducted. 
- Make sure you have referenced everything, including R. Strong submissions will draw on related literature in the discussion (and other sections) and would be sure to also reference those. The style of references does not matter, provided it is consistent.


### FAQ

- How much should I write? Most students submit something in the 10-to-15-page range, but it is up to you. Be precise and thorough.
- Do I have to focus on a model result? No, it is likely best to stay away from that at this point, and instead focus on tables or graphs of summary or explanatory statistics.
- What if the paper I choose is in a language other than R? Both your replication and reproduction code should be in R. So you will need to translate the code into R for the replication. And the reproduction should be your own work, so that also should be in R. One common language is Stata, and @lost2022 might be useful as a "Rosetta Stone" of sorts, for R, Python, and Stata, or just use a LLM to help.
- Can I work by myself? Yes.
- Do the graphs/tables have to look identical to the original? No, you are welcome to, and should, make them look better as part of the reproduction. And even as part of the replication, they do not have to be identical, just similar enough.
- One of my graphs has four panels, do I have to do all of them for this to be counted as one element? No, for the purpose of this paper, every panel counts as a separate element, so all you would need to do is three panels and that would be enough.
- How do I automatically download the data if they are behind a sign-in? If the data are behind a sign-in, just add commented documentation for how to download it into the `download_data.R` R file, rather than code.
- Do we need to commit our original, unedited data to GitHub if it is really big? No, you do not necessarily need to commit the original, unedited data to GitHub if it is too large, just add a note explaining the situation in the README and how to obtain the data.
- What should the abstract and introduction be about? The abstract and introduction should reflect your own work and findings, rather than those of the original paper (even though those will necessarily nonetheless have some role). You are (almost surely) not replicating their entire paper, and so your abstract should be different. See the examples for guidance.

### Rubric

```{r}
#| eval: true 
#| warning: false
#| message: false
#| echo: false

rubric |>
  filter(!Component %in% c("Model", "Enhancements", "Idealized methodology", "Idealized survey", "Pollster review", "Datasheet", "Parquet", "Surveys, sampling, and observational data")) |>
  tt()
```

### Previous examples

- 2024: 
[Benny Rochwerg](inputs/pdfs/2024-paper2-BennyRochwerg.pdf); 
[Krishiv Jain, Julia Kim and Abbass Sleiman](inputs/pdfs/2024-paper2-KrishivJain_JuliaKim_AbbassSleiman.pdf); 
[Sami El Sabri and Liban Timir](inputs/pdfs/2024-paper2-SamiElSabri_LibanTimir.pdf); 
[Thomas Fox](inputs/pdfs/2024-paper2-ThomasFox.pdf); and 
[Yuanyi (Leo) Liu and Qi Er (Emma) Teng](inputs/pdfs/2024-paper2-YuanyiLiu_QiErTeng.pdf).
- 2023: [Jayden Jung, Finn Korol, and Sofia_Sellitto](inputs/pdfs/paper-2-2023-Jayden_Jung_Finn_Korol_Sofia_Sellitto.pdf).
- 2022: [Alyssa Schleifer, Hudson Yuen, Tamsen Yau](inputs/pdfs/paper_two-2022-Alyssa_Schleifer_Hudson_Yuen_Tamsen_Yau.pdf); [Olaedo Okpareke, Arsh Lakhanpal, Swarnadeep Chattopadhyay](inputs/pdfs/paper_two-2022-Olaedo_Okpareke_Arsh_Lakhanpal_Swarnadeep_Chattopadhyay.pdf); and [Kimlin Chin](inputs/pdfs/paper_two-2022-Kimlin_Chin.pdf).


## *Howrah* Paper {#sec-paper-three}

### Task

- Working as part of a team of one to three people, and in an entirely reproducible way, please obtain data from the [US General Social Survey](https://gss.norc.org/Get-The-Data)^[The US GSS is recommended here because individual-level data are publicly available, and the dataset is well-documented. But, often university students in particular countries have access to individual level data that are not available to the public, and if this is the case then you are welcome to use that instead.  Students at Australian universities will likely have access to individual-level data from the Australian General Social Survey, and could use that. Students at Canadian universities will likely have access to individual-level data from the Canadian General Social and may like to use that.]. (You are welcome to use a different government-run survey, but please obtain permission before starting.)
- Obtain the data, focus on one aspect of the survey, and then use it to tell a story.
  - Create a well-organized folder with appropriate sub-folders, add it to GitHub, and then use Quarto to prepare a PDF with these sections (you should use this [starter folder](https://github.com/RohanAlexander/starter_folder)): title, author, date, abstract, introduction, data, results, discussion, an appendix that will, at least, contain a survey, and references.
  - In addition to conveying a sense of the dataset of interest, the data section should include, but not be limited to:
      - A discussion of the survey's methodology, and its key features, strengths, and weaknesses. For instance: what is the population, frame, and sample; how is the sample recruited; what sampling approach is taken, and what are some of the trade-offs of this; how is non-response handled.
      - A discussion of the questionnaire: what is good and bad about it?
      - If this becomes too detailed, then use appendices for supporting but not essential aspects.
  - In an appendix, please put together a supplementary survey that could be used to augment the general social survey the paper focuses on. The purpose of the supplementary survey is to gain additional information on the topic that is the focus of the paper, beyond that gathered by the general social survey. The survey would be distributed in the same manner as the general social survey but needs to stand independently. The supplementary survey should be put together using a survey platform. A link to this should be included in the appendix. Additionally, a copy of the survey should be included in the appendix.
  - Please be sure to discuss ethics and bias, with reference to relevant literature.
  - Code should be entirely reproducible, well-documented, and readable.
- Submit a link to the GitHub repo. Please do not update the repo after the deadline.
- The paper should be well-written, draw on relevant literature, and explain all technical concepts. Pitch it at a university-educated, but non-specialist, audience. Use survey, sampling, and statistical terminology, but be sure to explain it. The paper should flow, and be easy to follow and understand.
- There should be no evidence that this is a class paper.


### Checks

- An appendix should contain both a link to the supplementary survey and the details of it, including questions (in case the link fails, and to make the paper self-contained).

### FAQ

- What should I focus on? You may focus on any year, aspect, or geography that is reasonable given the focus and constraints of the general social survey that you are interested in. Please consider the year and topics that you are interested in together, as some surveys focus on particular topics in some years.
- Do I need to include the raw GSS data in the repo? For most of the general social surveys you will not have permission to share the GSS data. If that is the case, then you should add clear details in the README explaining how the data could be obtained.
- How many graphs do I need? In general, you need at least as many graphs as you have variables, because you need to show all the observations for all variables. But you may be able to combine a few; or, vice versa, you may be interested in looking at different aspects or relationships.

### Rubric

```{r}
#| eval: true
#| warning: false
#| message: false
#| echo: false

rubric |>
  filter(!Component %in% c("Replication", "Enhancements", "Model", "Idealized methodology", "Datasheet", "Parquet", "Surveys, sampling, and observational data")) |>
  tt()
```

### Previous examples

- 2023: [Christina Wei and Michaela Drouillard](inputs/pdfs/paper-3-2023-Christina_Wei_Michaela_Drouillard.pdf); and 
[Inessa De Angelis](inputs/pdfs/paper-3-2023-InessaDeAngelis.pdf).
- 2022: [Anna Li and Mohammad Sardar Sheikh](inputs/pdfs/paper3-2022-Li_Sheikh.pdf); 
[Chyna Hui and Marco Chau](inputs/pdfs/paper3-2022-hui_chau.pdf);
[Ethan Sansom](inputs/pdfs/paper3-2022-Ethan_Sansom.pdf); 
[Luckyna Laurent, Samita Prabhasavat, and Zoie So](inputs/pdfs/paper3-2022-LuckynaLaurent_SamitaPrabhasavat_ZoieSo.pdf); 
[Pascal Lee Slew, and Yunkyung_Park](inputs/pdfs/paper3-2022-Pascal_Lee_Slew_Yunkyung_Park.pdf); and
[Ray Wen, Isfandyar Virani, and Rayhan Walia](inputs/pdfs/paper3-2022-Ray_Wen_Isfandyar_Virani_Rayhan_Walia.pdf).


## *Dysart* Paper {#sec-paper-four}

### Task

- Working as part of a team of one to three people, and in an entirely reproducible way, please convert 
at least one full-page table from 
one DHS Program "Final Report", from the 1980s or 1990s, as available [here](https://dhsprogram.com/search/index.cfm?_srchd=1&bydoctype=publication&bypubtype=26%2C5%2C39%2C30%2C21%2C100&byyear=1999&byyear=1998&byyear=1997&byyear=1996&byyear=1995&byyear=1994&byyear=1993&byyear=1992&byyear=1991&byyear=1990&byyear=1989&byyear=1988&byyear=1987&bylanguage=2), 
into a usable dataset, then write a short paper telling a story with the data.
- Create a well-organized folder with appropriate sub-folders, and add it to GitHub. You should use this [starter folder](https://github.com/RohanAlexander/starter_folder).
- Create and document a dataset:
  - Save the PDF to "inputs".
  - Put together a simulation of your plan for the usable dataset and save the script to "scripts/00-simulation.R".
  - Write R code, saved as "scripts/01-gather_data.R", to either OCR or parse the PDF, as appropriate, and save the output to "outputs/data/first_parse.csv".
  - Write R code, saved as "scripts/02-clean_and_prepare_data.R", that draws on "first_parse.csv" to clean and prepare the dataset. Use `pointblank` to put together tests that the dataset passes (at a minimum, every variable should have a test for class and another for content). Save the dataset to "outputs/data/cleaned_data.parquet".
  - Following @gebru2021datasheets, put together a data sheet for the dataset you put together (put this in the appendix of your paper). You are welcome to start from the template "inputs/data/datasheet_template.qmd" in the starter folder, although, again, you should add it to the appendix of your paper, rather than a stand-alone document.
- Use the dataset to tell a story by using Quarto to prepare a PDF with these sections: title, author, date, abstract, introduction, data, results, discussion, an appendix that will, at least, contain a datasheet for the dataset, and references.
  - In addition to conveying a sense of the dataset of interest, the data section should include details of the methodology used by the DHS you used, and its key features, strengths, and weaknesses. 
- Submit a link to the GitHub repo. Please do not update the repo after the deadline.
- There should be no evidence that this is a class paper.


### Checks

- Use GitHub in a well-developed way by making at least a few commits and using descriptive commit messages.

<!-- - In an appendix there is both a link to the supplementary survey and the details of it, including questions (in case the link fails, and to make the paper self-contained). -->
 

### FAQ

<!-- - What should I focus on? You may focus on any year, aspect, or geography that is reasonable given the focus and constraints of the general social survey that you are interested in. Please consider the year and topics that you are interested in together, as some surveys focus on particular topics in some years. -->
<!-- - Do I need to include the raw GSS data in the repo? For most of the general social surveys you will not have permission to share the GSS data. If that is the case, then you should add clear details in the README explaining how the data could be obtained. -->
<!-- - The Canadian GSS is available to University of Toronto students via the library. In order to use it you need to clean and prepare it. Code to do this for one year is being distributed alongside this problem set and was discussed in lectures.  -->
<!-- - You are welcome to simply use this code and this year, but the topic of that year will constrain your focus. Naturally, you are welcome to adapt the code to other years. If you use the code exactly as is then you must cite it. If you adapt the code then you don't have to cite it, as it has a MIT license, but it would be appropriate to at least mention and acknowledge it, depending on how close your adaption is. -->


### Rubric

```{r}
#| eval: true
#| warning: false
#| message: false
#| echo: false

rubric |>
  filter(!Component %in% c("Replication", "Enhancements", "Model", "Idealized methodology", "Idealized survey", "Pollster review", "Surveys, sampling, and observational data")) |>
  tt()
```


### Previous examples

- 2022: [Bilal Haq and Ritvik Puri](inputs/pdfs/paper-4-2022-BilalHaq_RitvikPuri.pdf); 
[Charles Lu, Mahak Jain, and Yujun Jiao](inputs/pdfs/paper-4-2022-CharlesLu_MahakJain_YujunJiao.pdf); 
[Jacob Yoke Hong Si](inputs/pdfs/paper-4-2022-jacob_yoke_hong_si.pdf); and 
[Pascal Lee Slew and Yunkyung Park](inputs/pdfs/paper-4-2022-PascalLeeSlew_YunkyungPark.pdf).


## *Spadina* Paper {#sec-spadina}

### Task

- Working as part of a team of one to three people, and in an entirely reproducible way, please build a linear, or generalized linear, model and then write a short paper telling a story. Some ideas for aspects you could tackle include:
    - Revisit the dataset that you used in @sec-paper-one. Build a linear model for one of the variables, and consider the results.
    - Pick one of the examples in @sec-its-just-a-generalized-linear-model, and change the situation slightly, and then build a generalized linear model.
- You should use this [starter folder](https://github.com/RohanAlexander/starter_folder).
- Submit a link to the GitHub repo. Please do not update the repo after the deadline.
- There should be no evidence that this is a class paper.


### Checks

- Be careful to thoroughly explain the model. Also consider the assumptions of the model and the threats to its validity.

### FAQ

- What does "change the situation slightly" mean? You are welcome to use the same, or similar, data, but consider a different aspect. For instance:
    - In the logistic regression example of US political support, you may use the CES from a different year, and/or with slightly different explanatory variables.
    - In the Poisson regression example of the letters used in *Jane Eyre*, you may consider a different novel.
    - In the negative binomial regression of mortality in Alberta, you may consider a different geographic area.
- Can I use Alberta mortality data? No.

### Rubric

```{r}
#| eval: true
#| warning: false
#| message: false
#| echo: false

rubric |>
  filter(!Component %in% c("Replication", "Enhancements", "Idealized methodology", "Idealized survey", "Pollster review", "Surveys, sampling, and observational data")) |>
  tt()
```


### Previous examples

- 2024: 
[Alaina Hu](inputs/pdfs/2024-spadina-alaina_hu.pdf); 
[Irene Huynh](inputs/pdfs/2024-spadina-irene_huynh.pdf); 
[Janssen Myer Rambaud, Timothius Prajogi](inputs/pdfs/2024-spadina-Rambaud_Prajogi.pdf); and 
[Qi Er (Emma) Teng, Wentao Sun, Yang Cheng](inputs/pdfs/2024-spadina-teng_sun_cheng.pdf).

<!-- ### Task -->

<!-- - Paper about causal inference. -->


<!-- - You must include a DAG (probably in the model section). -->


## *St George* Paper {#sec-st-george-paper}

### Task

- Working as part of a team of one to three people, and in an entirely reproducible way, please build a linear, or generalized linear, model to forecast the winner of the upcoming US presidential election using "poll-of-polls" [@Blumenthal2014; @Pasek2015] and then write a short paper telling a story.
- You should use this [starter folder](https://github.com/RohanAlexander/starter_folder).
- You are welcome to use R, Python, or a combination.
- You can get data about polling outcomes from [here](https://projects.fivethirtyeight.com/polls/president-general/2024/national/) (search for "Download the data", then select Presidential general election polls (current cycle), then "Download").
- Pick one pollster in your sample, and deep-dive on their methodology in an appendix to your paper. In particular, in addition to conveying a sense of the pollster of interest, this appendix should include a discussion of the survey's methodology, and its key features, strengths, and weaknesses. For instance: 
  - what is the population, frame, and sample; 
  - how is the sample recruited; 
  - what sampling approach is taken, and what are some of the trade-offs of this; 
  - how is non-response handled; 
  - what is good and bad about the questionnaire.
- In another appendix, please put together an idealized methodology and survey that you would run if you had a budget of $100K and the task of forecasting the US presidential election. You should detail the sampling approach that you would use, how you would recruit respondents, data validation, and any other relevant aspects of interest. Also be careful to address any poll aggregation or other features of your methodology. You should actually implement your survey using a survey platform like Google Forms. A link to this should be included in the appendix. Additionally, a copy of the survey should be included in the appendix.
- Submit a link to the GitHub repo. Please do not update the repo after the deadline.
- There should be no evidence that this is a class paper.


### Checks

- Check that you have both appendices required.

### FAQ

- Do I need to use all the predictors in the dataset? No, you should be deliberate and thoughtful about the predictors that you use.
- What about the electoral college? US presidential elections are won based on the electoral college. It is fine to just focus the popular vote. But exceptional submissions would consider the popular vote, possibly by state, and then construct an electoral college estimate, being careful to propagate uncertainty.


### Rubric

```{r}
#| eval: true
#| warning: false
#| message: false
#| echo: false

rubric |>
  filter(!Component %in% c("Replication", "Enhancements", "Datasheet", "Surveys, sampling, and observational data")) |>
  tt()
```


### Previous examples

- 2024: 
[Talia Fabregas, Aliza Mithwani, and Fatimah Yunusa](https://github.com/taliafabs/USPresidentialPollingForecast2024/blob/main/paper/USPresidentialPollingForecast2024.pdf); 
[Sophia Brothers, Deyi Kong, and Rayan Awad Alim](https://github.com/eeeee-cmd/US_Election/blob/main/paper/paper.pdf); 
[Colin Sihan Yang, Lexun Yu, and Siddharth Gowda](https://github.com/yulexun/2024uselectionprediction/blob/main/paper/paper.pdf); and 
[Yuanyi (Leo) Liu, Dezhen Chen, and Ziyuan Shen](https://github.com/leoyliu/Forecasting-the-2024-US-Presidential-Election/blob/main/paper/paper.pdf).


## *Spofforth* Paper {#sec-paper-five}

### Task

- Working as part of a team of one to three people, please forecast the popular vote of the upcoming US election using multilevel regression with post-stratification and then write a short paper telling a story. 
- This requires individual-level survey data, post-stratification data, and a model that brings them together. Given the expense of collecting these data, and the privilege of having access to them, please be sure to properly cite all datasets that you use.
- You will need to: 
    - Get access to an individual-level survey dataset.
    - Get access to a post-stratification dataset.
    - Clean and prepare both these datasets to make them useable together. 
    - Estimate a model using the survey dataset.
    - Apply the trained model to the post-stratification dataset to forecast the election result.
- You should use this [starter folder](https://github.com/RohanAlexander/starter_folder).
- Submit a link to the GitHub repo. Please do not update the repo after the deadline.
- There should be no evidence that this is a class paper.


### FAQ

- How much should I write? Most students submit something in the 10-to-15-page range, but it is up to you. Be precise and thorough.


### Rubric

```{r}
#| eval: true
#| warning: false
#| message: false
#| echo: false

rubric |>
  filter(!Component %in% c("Replication", "Enhancements", "Idealized methodology", "Idealized survey", "Pollster review", "Datasheet", "Surveys, sampling, and observational data")) |>
  tt()
```

### Previous examples

- 2024: 
[Jeongwoo Kim, Jiwon Choi](inputs/pdfs/kim_choi.pdf); 
[Talia Fabregas, Fatimah Yunusa, Aamishi Sundeep Avarsekar](inputs/pdfs/fabregas_yunusa_avarsekar.pdf).
- 2020: [Alen Mitrovski, Xiaoyan Yang, Matthew Wankiewicz](inputs/pdfs/paper_five_2020-Mitrovski_Yang_Wankiewicz.pdf) (this paper received an "Honorable Mention" in the ASA December 2020 Undergraduate Statistics Research Project competition.)


## Final paper {#sec-final-paper}

### Task

- Working **individually** and in an entirely reproducible way please write a paper that involves original work to tell a story with data.
- Develop a research question that is of interest to you, then obtain or create a relevant dataset and put together a paper that answers it.
- You should use this [starter folder](https://github.com/RohanAlexander/starter_folder).
- You are welcome to use R, Python, or a combination.
- Please include an Appendix where you focus on an aspect of surveys, sampling or observational data, related to your paper. This should be an in-depth exploration, akin to the "idealized methodology/survey/pollster methodology" sections of Paper 2. Some aspect of this is likely covered in the Measurement sub-section of your Data section, but this Appendix would be much more detailed, and might include aspects like simulation, links to the literature, explorations and comparisons, among other aspects.
- Some dataset ideas:
  - Jacob Filipp's groceries dataset [here](https://jacobfilipp.com/hammer/). 
  - An IJF dataset [here](https://theijf.org/) (you would then be eligible for the IJF best paper award). 
  - Revisiting a Open Data Toronto dataset (you would then be eligible for the Open Data Toronto best paper award)
  - A dataset from @sec-datasets.
- All the guidance and expectations from earlier papers applies to this one.
- Submit a link to the GitHub repo. Please do not update the repo after the deadline.
- There should be no evidence that this is a class paper.

### Checks

- Do not use a dataset from Kaggle, UCI, or Statistica. Mostly this is because everyone else uses these datasets and so it does nothing to make you stand out to employers, but there are sometimes also concerns that the data are old, or you do not know the provenance.

### FAQ

- Can I work as part of a team? No. You must have some work that is entirely your own. You really need your own work to show off for job applications etc.
- How much should I write? Most students submit something that has 10-to-20-pages of main content, with additional pages devoted to appendices, but it is up to you. Be concise but thorough.
- Can I use any model? You are welcome to use any model, but you need to thoroughly explain it and this can be difficult for more complicated models. Start small. Pick one or two predictors. Once you get that working, then complicate it. Remember that every predictor and the outcome variable needs to be graphed and explained in the data section.

### Rubric

```{r}
#| echo: false
#| eval: true
#| message: false
#| warning: false

rubric |>
  filter(!Component %in% c("Replication", "Idealized methodology", "Idealized survey", "Pollster review", "Datasheet")) |>
  tt()
```

### Previous examples

- 2024: [Mary Cheng](https://github.com/marycx/us_poverty_analysis_2019/blob/main/paper/US_Poverty_Analysis_2019.pdf); 
[Kaavya Kalani](https://github.com/kaavyakalani26/himalayan-expeditions-analysis/blob/main/paper/paper.pdf); 
[Julia Kim](https://github.com/julia-ve-kim/US_Climate_Change_Biodiversity/blob/main/paper/paper.pdf); 
[Yunzhao Li](https://github.com/yunzhaol/aerial_bomb_priority/blob/main/paper/paper.pdf); 
[Timothius Prajogi](https://github.com/prajogt/canadian_salmon_spawn/blob/main/output/paper.pdf); 
[Benny Rochwerg](https://github.com/bennyrochwerg/profiling/blob/main/paper/paper.pdf); 
[Abbass Sleiman](https://github.com/AbbassSleiman/US_Incarceration/blob/main/paper/paper.pdf); 
[Emily Su](https://github.com/moonsdust/top-songs/blob/main/paper/paper.pdf); and 
[Hannah Yu](https://github.com/hannahyu07/Fox-News/blob/main/paper/Fake_News_vs_Fox_News.pdf).
- 2023: [Aliyah Maxine Ramos](inputs/pdfs/final-2023-aliyah_maxine_ramos.pdf); 
[Chloe Thierstein](inputs/pdfs/final-2023-chloe_thierstein.pdf); 
[Jason Ngo](inputs/pdfs/final-2023-jason_ngo.pdf); 
[Jenny Shen](inputs/pdfs/final-2023-jenny_shen.pdf); 
[Laura Lee-Chu](inputs/pdfs/final-2023-laura_lee-chu.pdf); and 
[Sebastian Rodriguez](inputs/pdfs/final-2023-sebastian_rodriguez.pdf).
- 2022: [Alicia Yang](inputs/pdfs/final_paper-2022-alicia_yang.pdf); [Ethan Sansom](inputs/pdfs/final_paper-2022-ethan_sansom.pdf); [Ivan Li](inputs/pdfs/final_paper-2022-ivan_li.pdf); [Jack McKay](inputs/pdfs/final_paper-2022-jack_mckay.pdf); [Olaedo Okpareke](inputs/pdfs/final_paper-2022-olaedo_okpareke.pdf); and [Tian Yi Zhang](inputs/pdfs/final_paper-2022-tian_yi_zhang.pdf).
- 2021: [Amy Farrow](inputs/pdfs/final_paper-2021-amy_farrow.pdf); [Jia Jia Ji](inputs/pdfs/final_paper-2021-jia_jia_ji.pdf); [Laura Cline](inputs/pdfs/final_paper-2021-laura_cline.pdf); [Lorena Almaraz De La Garza](inputs/pdfs/final_paper-2021-lorena_almaraz_de_la_garza.pdf); and [Rachael Lam](inputs/pdfs/final_paper-2021-rachael_lam.pdf).
- 2020: [Annie Collins](inputs/pdfs/final_paper-2020-annie_collins.pdf).