Migrate external Huggingface data to 311 Data Huggingface repo #1714

Closed
3 tasks
ryanfchase opened this issue Apr 27, 2024 · 8 comments · Fixed by #1742 or #1747

Comments

@ryanfchase
Member

ryanfchase commented Apr 27, 2024

Overview

We need to port the 2016-2022 data into the 311-Data HF repo so that users have access to all available 311 request data.

Action Items

  • create 2016-2022 repos on 311's Hugging Face repo
  • scrub 2016-2022 CSVs into parquet files using a one-time Python script (could probably be done locally)
  • move the 2016-2022 scrubbed parquet data files into 311's Hugging Face

Resources/Instructions

@ryanfchase
Member Author

This ticket is ready to be picked up

@Skydodle Skydodle self-assigned this May 9, 2024
@Skydodle
Member

Skydodle commented May 9, 2024

ETA: Sunday 5/19
Availability: F Sat Sun 6-9pm

@Skydodle
Member

Updating the ETA to Sunday 6/1
Availability: F Sat Sun 6-9pm

@Skydodle Skydodle mentioned this issue May 31, 2024
4 tasks
@Skydodle
Member

Added a PR that enables 2022 data for now. Waiting for reviews to make sure there are no issues before adding the other years with the same implementation.

@Skydodle
Member

Skydodle commented Jun 4, 2024

The most recent PR only enables 2022 data, so I'm reopening this issue to continue migrating the older years.

@Skydodle Skydodle reopened this Jun 4, 2024
@ryanfchase
Member Author

PR is approved, we're ok to shelve this ticket until we decide we need even earlier data.

@ryanfchase
Member Author

@Skydodle I just wanted to get a paper trail on our reasoning for fully closing this. Is it correct that data from 2019 and prior would require a serious amount of data cleaning in order to integrate smoothly with 2020-2024? Could you outline some of the technical hurdles that you encountered when looking at those datasets?

@Skydodle
Member

@ryanfchase

  1. Structural changes: There were some structural changes prior to 2020. For example, the CSV column names and values may differ from what we have right now. We can't change how we extract and apply mutations to the data in our current FE setup because that would interfere with how we display recent years' data. For prior years we therefore need to examine each case and transform the abnormal columns into the form we currently accept.
  2. Data corruption: Some columns may be corrupt or missing values. For example, in the 2021 data an entire column was missing values in both the file from Edwin's HF and the source file from the LA data site.
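
To make the first point concrete, here is a minimal sketch of mapping legacy header names onto the names the current pipeline expects. The mapping entries and column names are purely illustrative assumptions, not the real differences found in the older files, and the helper names are hypothetical.

```python
# Hypothetical mapping from legacy header names to current ones; the real
# differences would have to be identified per year by inspecting each file.
LEGACY_TO_CURRENT = {
    "SR Number": "SRNumber",
    "Created Date": "CreatedDate",
}


def normalize_header(header: list[str]) -> list[str]:
    """Return the header with known legacy names replaced by current ones."""
    cleaned = [name.strip() for name in header]
    return [LEGACY_TO_CURRENT.get(name, name) for name in cleaned]


def missing_columns(header: list[str], expected: set[str]) -> set[str]:
    """Return the expected columns that are absent even after normalization."""
    return expected - set(normalize_header(header))
```

Keeping the mapping in one table per year would leave the FE extraction logic untouched, since the transformation happens before the data reaches it.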

What would consume the most time is that anomalies in the CSV would most likely not be detected until it has been transformed to parquet, uploaded to 311's HF, and configured to display on the UI. Only then would we see some data displaying incorrectly or not at all, and we would have to backtrack to make the correction and redo the entire process.
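
One way to shorten that loop would be to check each CSV before conversion. The sketch below is only an illustration of that idea using the Python standard library; it is not the tooling from any PR on this project, and the function names are hypothetical. It reports the fraction of blank values per column so that an entirely empty column is caught before the upload step.

```python
import csv
from collections import Counter


def empty_value_ratios(csv_path: str) -> dict[str, float]:
    """Return, for each column, the fraction of rows whose value is blank."""
    blanks = Counter()
    rows = 0
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = reader.fieldnames or []
        for row in reader:
            rows += 1
            for name in columns:
                # Short rows yield None for missing fields; treat as blank.
                if not (row.get(name) or "").strip():
                    blanks[name] += 1
    return {name: (blanks[name] / rows if rows else 0.0) for name in columns}


def fully_empty_columns(csv_path: str) -> list[str]:
    """Return the columns that have no value in any row."""
    ratios = empty_value_ratios(csv_path)
    return [name for name, ratio in ratios.items() if ratio == 1.0]
```

Running `fully_empty_columns` on a year's CSV before transforming it would flag a column like the one described in point 2 without going through the parquet, upload, and UI steps first.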

I've created some tools for debugging in PR #1747

Projects
Status: Done (without merge)

2 participants