Pull data from previous years (e.g. 2023, 2022, etc) #1666

traycn · 2024-02-02T20:52:29Z

Overview

We need to pull data from previous years (2023, 2022, etc.) so that users can run searches that go beyond the current year.

At this time, the site can only display data for the current year to date.

Action Items

For the Proof of Concept that we can query multiple files:
DuckDB pull multiple parquets docs - https://duckdb.org/docs/data/multiple_files/overview.html

Register a new file in the newDb instance to pull data from another 311-data/[year searched here] repo
loc: components/db/DbProvider.jsx
Query the new file using SQL
loc: components/Map/index.js
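
A minimal sketch of this first proof of concept, assuming each year's Parquet file gets registered under its own name and that DuckDB's `read_parquet` list form can read those registered names in DuckDB-Wasm (not verified here). The helper names `fileNameForYear`, `registerYearFiles`, and `buildMultiYearQuery`, and the `urlsByYear` map, are hypothetical; the actual Hugging Face URLs have to be supplied by the caller.

```javascript
// Hypothetical helper: the name each year's file is registered under.
function fileNameForYear(year) {
  const n = Number(year);
  if (!Number.isInteger(n)) {
    throw new Error(`Invalid year: ${year}`);
  }
  return `requests_${n}.parquet`;
}

// Register one remote Parquet file per year on an existing DuckDB-Wasm instance.
// `urlsByYear` maps a year to its file URL, e.g. { 2023: '<url>', 2024: '<url>' }.
async function registerYearFiles(db, urlsByYear) {
  for (const [year, url] of Object.entries(urlsByYear)) {
    // 4 = HTTP, the same protocol value DbProvider.jsx already passes.
    await db.registerFileURL(fileNameForYear(year), url, 4);
  }
}

// Build one query over several registered files using read_parquet's list form.
function buildMultiYearQuery(years) {
  const files = years.map((year) => `'${fileNameForYear(year)}'`).join(', ');
  return `SELECT * FROM read_parquet([${files}])`;
}
```

If this works, the string returned by `buildMultiYearQuery([2023, 2024])` could replace the `SELECT` in `createRequestsTable`. If the yearly files turn out to have slightly different columns, DuckDB's `union_by_name` option for `read_parquet` may be worth trying.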

For the Proof of Concept that we can make a query when a user makes a search:
loc: components/Map/index.js

Pass values of the dates being searched to the query function
Write a SQL query using the search param values
Set the data to populate on the map: setData() //??
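
A rough sketch of the query-building step for this second proof of concept. It assumes the search dates arrive as `YYYY-MM-DD` strings and that the `requests` table has a date/timestamp column named `CreatedDate`; that column name and the function name `buildDateRangeQuery` are assumptions, not taken from the codebase.

```javascript
// Accept only YYYY-MM-DD strings so nothing else can reach the SQL text.
const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

// Assumed name of the date column in the 'requests' table.
const DATE_COLUMN = 'CreatedDate';

// Build a query for all requests created between two dates (both inclusive).
function buildDateRangeQuery(startDate, endDate) {
  if (!ISO_DATE.test(startDate) || !ISO_DATE.test(endDate)) {
    throw new Error('Dates must be formatted as YYYY-MM-DD');
  }
  // ISO date strings sort chronologically, so a string comparison is enough.
  if (startDate > endDate) {
    throw new Error('startDate must not be after endDate');
  }
  // "< end + 1 day" keeps every request made during the end date itself.
  return (
    'SELECT * FROM requests ' +
    `WHERE ${DATE_COLUMN} >= DATE '${startDate}' ` +
    `AND ${DATE_COLUMN} < DATE '${endDate}' + INTERVAL 1 DAY`
  );
}
```

In `components/Map/index.js` the result could be passed to the existing `conn.query(...)` call, with the returned rows handed to whatever currently populates the map.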

More Information:

The following is a rough run-through of the control flow for how data is currently being populated.

Step 1: A Parquet file of the LA Open Data 311 calls is published to the HuggingFace repo

https://huggingface.co/datasets/311-data/2024

Step 2: The HuggingFace repo is defined in the datasets.parquet.hfYtd value

loc: components/db/DbProvider.jsx - line 7

// List of remote dataset locations used by db.registerFileURL
const datasets = {
  parquet: {
    // huggingface
    hfYtd:
      '[HUGGINGFACE REPO URL HERE]'
     ...
  },
 ...
};

Step 3: The `datasets.parquet.hfYtd` value is used to register a new File

loc: components/db/DbProvider.jsx - line 55

// register parquet
await newDb.registerFileURL(
  'requests.parquet',
  datasets.parquet.hfYtd,
  4    // HTTP = 4. For more options: https://tinyurl.com/DuckDBDataProtocol
);

Step 5: The DbContext (later used as `this.context`) is defined and passed to the application

loc: components/db/DbProvider.jsx - line 109

<DbContext.Provider value={{ db, conn, worker }}>
  {children}
</DbContext.Provider>

Step 6: The Data is queried and set to the front-end application

loc: components/Map/index.js - line 66, 76
....
createRequestsTable = async () => {
  const { conn } = this.context;

  // Create the 'requests' table.
  const createSQL =
    'CREATE TABLE requests AS SELECT * FROM "requests.parquet"'; // parquet

  await conn.query(createSQL);
};

async componentDidMount(props) {
  this.isSubscribed = true;
  this.processSearchParams();
  await this.createRequestsTable();
  await this.setData();
}

Previous Notes

1 - The parquets are in separate huggingface repos, so I'm not sure if we can query multiple files as shown in the DuckDB multiple-files doc (linked under Resources/Instructions).
… A potential solution would be putting the parquet files in a single repo (but consider the limitations of huggingface repos, also linked under Resources/Instructions).

2 - We may have to make a GET call in order for this to work, and I'm not sure if we have the capability to run a GET call after the application loads.
… Note: My understanding is that data is pulled once, at the beginning when the application loads, through a duckdb initialize() (loc: components/db/DbProvider.jsx line: 88), and set in the <DbContext.Provider value={{..}}> (loc: components/db/DbProvider.jsx line: 108).
2.5 - So my question now is: can we make API calls without a backend? Can we run an Express server to make the call?

3 - If we can't run an Express server, we can potentially look into putting the 2024, 2023, 2022, etc. parquet data into a single huggingface repo and reference the aforementioned DuckDB doc to execute a query that gathers all the data on load.
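
On note 2, one thing to try before reaching for a server: DuckDB-Wasm runs in the browser and reads files registered with the HTTP protocol directly from the page, so in principle another file can be registered after the initial load, provided the host allows cross-origin requests. The sketch below assumes the `db` object from `DbContext` is still available at search time; `createLazyRegistrar` and `ensureRegistered` are hypothetical names.

```javascript
// Wrap a DuckDB-Wasm instance so each remote file is registered at most once.
function createLazyRegistrar(db) {
  const registered = new Set();

  // Resolves to true if the file was registered by this call, false if it already was.
  return async function ensureRegistered(fileName, url) {
    if (registered.has(fileName)) {
      return false;
    }
    // 4 = HTTP, the same protocol value DbProvider.jsx already passes.
    await db.registerFileURL(fileName, url, 4);
    registered.add(fileName);
    return true;
  };
}
```

This does not guard against two overlapping calls for the same file; if searches can fire in parallel, the pending promise would need to be cached instead of the finished name.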

Resources/Instructions

DuckDB docs - https://duckdb.org/docs/api/wasm/overview
DuckDB pull multiple parquets docs - https://duckdb.org/docs/data/multiple_files/overview.html
Huggingface repo limitations - https://huggingface.co/docs/hub/repositories-recommendations


ryanfchase · 2024-02-12T02:21:23Z

Per some brainstorming I did with @Skydodle, I'm listing some general thoughts I have about loading data.

WIP

I was looking at the data being displayed on the map, and 1 month of data is probably the most you can visually comprehend before it becomes too much. We can probably get away with loading only 1 month of data at a time, which will keep load times relatively quick.

I propose that we limit the app to only allow the user to see 1 month of data at a given time. I think there are 2 ways that we can work with month-sized datasets...

We set a hard rule that only one calendar month can be viewed at a time, e.g. the user can only load January or February. Easy to implement, but not good for users who want to see data that spans calendar months.
We allow users to ask for data that spans calendar months (e.g. Jan 12 to Feb 12), and we load all of January and February in order to display the dates they asked for (longer load times, but still capped at 2 months of data).
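
To make the second option concrete, here is a small sketch that works out which calendar months a requested range touches and whether it stays within the 2-month cap. It assumes dates are `YYYY-MM-DD` strings with the start not after the end; the function names `monthsToLoad` and `isWithinMonthLimit` are made up for illustration.

```javascript
// Return every calendar month ("YYYY-MM") touched by an inclusive date range.
function monthsToLoad(startDate, endDate) {
  const [startYear, startMonth] = startDate.split('-').map(Number);
  const [endYear, endMonth] = endDate.split('-').map(Number);
  const months = [];
  let year = startYear;
  let month = startMonth;
  while (year < endYear || (year === endYear && month <= endMonth)) {
    months.push(`${year}-${String(month).padStart(2, '0')}`);
    month += 1;
    if (month > 12) {
      // Roll over from December to January of the next year.
      month = 1;
      year += 1;
    }
  }
  return months;
}

// True when the range touches at most `maxMonths` calendar months.
function isWithinMonthLimit(startDate, endDate, maxMonths = 2) {
  return monthsToLoad(startDate, endDate).length <= maxMonths;
}
```

For the example above, `monthsToLoad('2024-01-12', '2024-02-12')` gives the two months January and February 2024, which is within the cap. A range like Jan 31 to Mar 1 touches three months and would need to be rejected or trimmed.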

Continuing to write notes as we discuss... I'm aware that loading data when the user asks for it is the main concern that @traycn mentioned above. We'll investigate whether we can find workarounds for this.

traycn · 2024-02-15T05:01:03Z

Updated the ticket with new Action Items based on what was discussed in today's meeting. I've also added notes on how the data is loaded in the application.

bberhane · 2024-03-15T02:41:43Z

Hi @traycn, please update the overview to include "why" we are pulling the data. Formulation: "We need to do X for Y reason." Please also clarify the two proof of concept items under Action Items: are they two alternative options for pulling the data? Thank you!

ryanfchase · 2024-04-07T23:07:10Z

@Skydodle and I have reviewed this one more time, and we're going to move ahead with the Action steps that are present. We will have a check-in for the 1st action step so that the team can see what happens when we load multiple repos of data. We will then proceed with the 2nd action step once we've reviewed and discussed as a team.

ryanfchase · 2024-04-07T23:08:46Z

@Skydodle handing off to you, please assign yourself when ready.

When you're finished with Action Item 1, please post your branch name. Please also show any changes you make to the Hugging Face repo.

ryanfchase · 2024-04-27T23:14:13Z

Follow up ticket for @Skydodle: #1714

traycn added Role: Missing size: Missing Milestone: Missing p-feature: data labels Feb 2, 2024

traycn added this to the 04 - Map Page milestone Feb 2, 2024

traycn added Role: Frontend React front end work Role: Backend Related to API or other server-side work P-feature: Map Feature: Data Quality draft size: 8pt Can be done in 31-48 hours and removed Milestone: Missing Role: Missing size: Missing labels Feb 2, 2024

traycn self-assigned this Feb 2, 2024

traycn added Discussion Needs to be discussed as a team ready for dev lead ready for developer lead to review the issue labels Feb 2, 2024

ryanfchase mentioned this issue Feb 5, 2024

Add warning for when user tries to load data prior to 2024 #1664

Closed

3 tasks

cottonchristopher mentioned this issue Feb 11, 2024

Add warning for when user tries to load data prior to 2024 #1677

Closed

5 tasks

ryanfchase mentioned this issue Mar 8, 2024

311 Data Weekly PM Meeting Agenda #1546

Closed

2 tasks

bberhane added ready for dev lead ready for developer lead to review the issue draft and removed draft Discussion Needs to be discussed as a team ready for dev lead ready for developer lead to review the issue labels Mar 14, 2024

bberhane modified the milestones: 04 - Map Page, X - Technical Debt Mar 15, 2024

cottonchristopher mentioned this issue Mar 19, 2024

311 Data Weekly Meeting Agenda #1545

Closed

traycn removed their assignment Mar 19, 2024

ryanfchase removed Role: Backend Related to API or other server-side work ready for dev lead ready for developer lead to review the issue draft labels Apr 7, 2024

ryanfchase modified the milestones: X - Technical Debt, 06 - Product Maintenance Apr 7, 2024

ryanfchase added ready for prioritization ready for PMs to consider for prioritized backlog and removed ready for prioritization ready for PMs to consider for prioritized backlog labels Apr 7, 2024

Skydodle self-assigned this Apr 7, 2024

Skydodle mentioned this issue Apr 17, 2024

1666 pull data from previous years #1711

Merged

4 tasks

bberhane mentioned this issue Apr 18, 2024

Design System Updates: Search/Filter Modal #1600

Closed

13 tasks

Skydodle closed this as completed in #1711 May 9, 2024

ExperimentsInHonesty added this to P: 311: Project Board Jun 7, 2024

ExperimentsInHonesty moved this to Done (without merge) in P: 311: Project Board Jun 7, 2024

DrAcula27 mentioned this issue Jan 16, 2025

[MVP Huggingface] Define Request Type model and GET data from HuggingFace repos. #1891

Open

14 tasks

efrenmarin45 self-assigned this Jan 16, 2025

cottonchristopher added the Complexity: Missing This ticket needs a complexity (good first issue, small, medium, or large) label Feb 9, 2025
