Pull data from previous years (e.g. 2023, 2022, etc) #1666

Closed
5 tasks done
Tracked by #1600
traycn opened this issue Feb 2, 2024 · 6 comments · Fixed by #1711
Labels
Complexity: Missing; Feature: Data Quality; p-feature: data; P-feature: Map; Role: Frontend; size: 8pt

Comments

@traycn
Member

traycn commented Feb 2, 2024

Overview

We need to pull data from previous years (2023, 2022, etc.) into the application so that users can run searches that extend beyond the current year.

At this time, the site can only display data for the current year to date.

Action Items

For the Proof of Concept that we can query multiple files:
DuckDB pull multiple parquets docs - https://duckdb.org/docs/data/multiple_files/overview.html

  • Register a new file in the newDb instance to pull data from another 311-data/[year searched here] repo
    loc: components/db/DbProvider.jsx
  • Query the new file using SQL
    loc: components/Map/index.js
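
A minimal sketch of this first proof of concept, assuming each year's parquet is reachable at a URL we already know. The `yearUrls` map, the `requests_<year>.parquet` naming, and both helper names are hypothetical; `registerFileURL` and the protocol value `4` (HTTP) mirror the existing call in components/db/DbProvider.jsx.

```javascript
// Same protocol value the existing registration call passes (HTTP = 4).
const HTTP_PROTOCOL = 4;

// Assumed naming scheme for the registered files (not from the codebase).
function parquetNameForYear(year) {
  return `requests_${year}.parquet`;
}

// Hypothetical helper: registers one parquet file per year.
// `db` only needs to expose registerFileURL(name, url, protocol), as newDb does.
async function registerYearFiles(db, yearUrls) {
  const registered = [];
  for (const [year, url] of Object.entries(yearUrls)) {
    const name = parquetNameForYear(year);
    await db.registerFileURL(name, url, HTTP_PROTOCOL);
    registered.push(name);
  }
  return registered;
}

// Smoke-test query for the check-in: how many rows does a registered file hold?
function buildCountQuery(fileName) {
  return `SELECT COUNT(*) AS total FROM "${fileName}"`;
}
```

Running the count query through `conn.query` in components/Map/index.js for each returned name would be enough to show whether a second repo can be read at all.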

For the Proof of Concept that we can make a query when a user makes a search:
loc: components/Map/index.js

  • Pass values of the dates being searched to the query function
  • Write a SQL query using the search param values
  • Set the data to populate on the map: setData() //??
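
A rough sketch of the query-building step, assuming the searched dates arrive as JavaScript `Date` objects. The helper names are hypothetical, and the `CreatedDate` column is a guess at the schema that should be checked against the parquet; the `requests` table name comes from the existing `createRequestsTable`.

```javascript
// Formats a Date as YYYY-MM-DD using its UTC calendar day.
function toIsoDate(date) {
  return date.toISOString().slice(0, 10);
}

// Hypothetical helper: turns the searched dates into a SQL filter.
// `column` defaults to 'CreatedDate', which is an assumption about the schema.
function buildDateRangeQuery(startDate, endDate, column = 'CreatedDate') {
  if (startDate > endDate) {
    throw new RangeError('startDate must not be after endDate');
  }
  const start = toIsoDate(startDate);
  const end = toIsoDate(endDate);
  // Half-open range so that every row on the end date is included,
  // even if the column holds timestamps rather than plain dates.
  return (
    `SELECT * FROM requests ` +
    `WHERE ${column} >= DATE '${start}' ` +
    `AND ${column} < DATE '${end}' + INTERVAL 1 DAY`
  );
}
```

The resulting string could be passed to `conn.query` before calling `setData()`. Only values formatted from `Date` objects are interpolated, so free-form user text never reaches the SQL.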

More Information:

The following is a rough run-through of the control flow for how data is currently populated.

Step 1: A Parquet file of the LA Open Data 311 calls is stored in the HuggingFace repo

https://huggingface.co/datasets/311-data/2024

Step 2: The HuggingFace repo is defined in the datasets.parquet.hfYtd value

loc: components/db/DbProvider.jsx - line 7

// List of remote dataset locations used by db.registerFileURL
const datasets = {
  parquet: {
    // huggingface
    hfYtd:
      '[HUGGINGFACE REPO URL HERE]'
     ...
  },
 ...
};

Step 3: The datasets.parquet.hfYtd value is used to register a new File

loc: components/db/DbProvider.jsx - line 55

// register parquet
await newDb.registerFileURL(
  'requests.parquet',
  datasets.parquet.hfYtd,
  4    // HTTP = 4. For more options: https://tinyurl.com/DuckDBDataProtocol
);

Step 5: The DbContext (later used as this.context) is defined and passed to the application

loc: components/db/DbProvider.jsx - line 109

<DbContext.Provider value={{ db, conn, worker }}>
  {children}
</DbContext.Provider>

Step 6: The Data is queried and set to the front-end application

loc: components/Map/index.js - line 66, 76
....
createRequestsTable = async () => {
  const { conn } = this.context;

  // Create the 'requests' table.
  const createSQL =
    'CREATE TABLE requests AS SELECT * FROM "requests.parquet"'; // parquet

  await conn.query(createSQL);
};

async componentDidMount(props) {
  this.isSubscribed = true;
  this.processSearchParams();
  await this.createRequestsTable();
  await this.setData();
}
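
Because `createRequestsTable` runs once on mount, loading another year later would need a second entry point. One possible shape is sketched below, under the assumption that a `requests_<year>.parquet` file has already been registered for the year in question. The loader, its naming scheme, and the per-year tables are all hypothetical; `conn.query` is the same call used above.

```javascript
// Hypothetical loader: creates one table per year the first time it is requested.
// `conn` only needs to expose query(sql), as this.context.conn does.
function createYearTableLoader(conn) {
  const pending = new Map(); // year -> Promise, so concurrent calls share one load

  return function ensureYearTable(year) {
    if (!Number.isInteger(year)) {
      return Promise.reject(new TypeError('year must be an integer'));
    }
    if (!pending.has(year)) {
      const table = `requests_${year}`;
      const sql =
        `CREATE TABLE IF NOT EXISTS ${table} AS SELECT * FROM "${table}.parquet"`;
      const load = Promise.resolve(conn.query(sql))
        .then(() => table)
        .catch((err) => {
          pending.delete(year); // allow a retry after a failed load
          throw err;
        });
      pending.set(year, load);
    }
    return pending.get(year);
  };
}
```

A search handler could `await ensureYearTable(2023)` before querying, so repeated searches in the same year pay the load cost only once.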

Previous Notes

1 - The parquets are in separate huggingface repos, so I’m not sure if we can query multiple files as shown in the duckdb doc here.
… A potential solution would be putting the parquet files in a single repo (but consider the limitations of huggingface repos, doc here).

2 - We may have to make a GET call for this to work, and I’m not sure whether we can run a GET call after the application loads.
… Note: My understanding is that data is pulled once,
...... when the application loads, through a duckdb initialize()
(loc: components/db/DbProvider.jsx line: 88)
...... and is then set in the <DbContext.Provider value={{..}}>
(loc: components/db/DbProvider.jsx line: 108)
2.5 - So my question now is: can we make API calls without a backend? Can we run an Express server to make the call?

3 - If we can't run an Express server, we can potentially look into putting 2024, 2023, 2022, etc. parquet data into a single huggingface repo and reference the aforementioned doc here to execute a query that gathers all the data on load.
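
For note 3, the DuckDB multiple-files doc describes passing a list of files to `read_parquet`. A small sketch of building such a query follows; the builder name and the file names in the usage note are hypothetical, and it assumes all files share the same schema.

```javascript
// Hypothetical builder: one query that reads several registered parquet files at once.
function buildMultiFileQuery(fileNames) {
  if (fileNames.length === 0) {
    throw new Error('at least one parquet file is required');
  }
  // Single-quote each name for the SQL list, doubling any embedded quotes.
  const list = fileNames
    .map((name) => `'${name.replace(/'/g, "''")}'`)
    .join(', ');
  return `CREATE TABLE requests AS SELECT * FROM read_parquet([${list}])`;
}
```

Calling it with `['requests_2023.parquet', 'requests_2024.parquet']` yields a statement that could replace `createSQL` in `createRequestsTable`, at the cost of loading every listed year up front.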

Resources/Instructions

DuckDB docs - https://duckdb.org/docs/api/wasm/overview
DuckDB pull multiple parquets docs - https://duckdb.org/docs/data/multiple_files/overview.html
Huggingface repo limitations - https://huggingface.co/docs/hub/repositories-recommendations

@ryanfchase
Member

ryanfchase commented Feb 12, 2024

Per some brainstorming I did with @Skydodle, I'm listing some general thoughts I have about loading data.

WIP

I was looking at the data being displayed on the map, and 1 month of data is probably the most you can visually comprehend before it becomes too much. We can probably get away with loading only 1 month of data at a time, which will keep load times relatively quick.

I propose that we limit the app to only allow the user to see 1 month of data at a given time. I think there are 2 ways that we can work with month-sized datasets...

  1. We set a hard rule that one calendar month can be viewed at a time, e.g. the user can only load January or February. Easy to implement, but it isn't good for users who want to see data that spans calendar months
  2. We allow users to ask for data that spans calendar months (e.g. Jan 12 to Feb 12), and we would load all of January and February in order to display the dates that they asked for. (longer load times, but still maxed at 2 months of data)
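
To make option 2 concrete, here is a small sketch of working out which calendar months a requested range touches. The helper name and the `YYYY-MM` key format are hypothetical, and dates are read in UTC for simplicity.

```javascript
// Hypothetical helper: lists the calendar months ('YYYY-MM') that a date range touches.
function monthsCoveredBy(startDate, endDate) {
  if (startDate > endDate) {
    throw new RangeError('startDate must not be after endDate');
  }
  const months = [];
  let year = startDate.getUTCFullYear();
  let month = startDate.getUTCMonth(); // 0-based
  const lastYear = endDate.getUTCFullYear();
  const lastMonth = endDate.getUTCMonth();

  while (year < lastYear || (year === lastYear && month <= lastMonth)) {
    months.push(`${year}-${String(month + 1).padStart(2, '0')}`);
    month += 1;
    if (month === 12) {
      // Roll over into January of the next year.
      month = 0;
      year += 1;
    }
  }
  return months;
}
```

For Jan 12 to Feb 12 of the same year this returns two entries, matching the "maxed at 2 months" case above. A range that crosses New Year would also tell us that a second year's file is needed.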

Continuing to write notes as we discuss... I'm aware that loading data when the user asks for it is the main concern that @traycn mentioned above. We'll investigate whether or not we can find workarounds for this.

@traycn
Member Author

traycn commented Feb 15, 2024

Updated the ticket with new Action Items from what was discussed in today's meeting. I've also added notes on how the data is loaded in the application.

@bberhane bberhane added ready for dev lead ready for developer lead to review the issue draft and removed draft Discussion Needs to be discussed as a team ready for dev lead ready for developer lead to review the issue labels Mar 14, 2024
@bberhane
Member

Hi, @traycn, please update the overview to include "why" we are pulling the data. Formulation: "We need to do X for Y reason." Please also clarify the two proof of concept items in the action items step. Are the two items currently listed alternative options for pulling the data? Thank you!

@ryanfchase
Member

@Skydodle and I have reviewed this one more time, and we're going to move ahead with the Action steps that are present. We will have a check-in for the 1st action step so that the team can see what happens when we load multiple repos of data. We will then proceed with the 2nd action step once we've reviewed and discussed as a team.

@ryanfchase ryanfchase removed Role: Backend Related to API or other server-side work ready for dev lead ready for developer lead to review the issue draft labels Apr 7, 2024
@ryanfchase ryanfchase added ready for prioritization ready for PMs to consider for prioritized backlog and removed ready for prioritization ready for PMs to consider for prioritized backlog labels Apr 7, 2024
@ryanfchase
Member

ryanfchase commented Apr 7, 2024

@Skydodle handing off to you, please assign yourself when ready.

When you're finished with Action Item 1, please post your branch name. Please also show any changes you make to the Hugging Face repo.

@ryanfchase
Member

Follow up ticket for @Skydodle: #1714

@ExperimentsInHonesty ExperimentsInHonesty moved this to Done (without merge) in P: 311: Project Board Jun 7, 2024
@efrenmarin45 efrenmarin45 self-assigned this Jan 16, 2025
@cottonchristopher cottonchristopher added the Complexity: Missing This ticket needs a complexity (good first issue, small, medium, or large) label Feb 9, 2025
Projects
Status: Done (without merge)

6 participants