Make 311 Data's request data easily accessible for data scientists #1259

Closed
2 tasks
nichhk opened this issue Jun 24, 2022 · 7 comments
Labels: P-feature: Analytics · Role: Data Science (Data management, loading, or analysis) · Size: 5pt (Can be done in 19-30 hours)

Comments

@nichhk
Member

nichhk commented Jun 24, 2022

Overview

Currently, we have no way of providing our clean request data to our data scientists. @priyakalyan is working on PR #1257, but that script takes a long time to fetch all the data.

I see two solutions to this:

  1. Make our SQL DB accessible to data scientists + create Python methods that read from the DB and produce dataframes.
  2. During our nightly Prefect runs, add a step that also writes the request data out as a csv file to S3.

I'm leaning towards (1) right now, since it will use less disk space and has fewer moving parts. (1) will take longer to implement, however, since we currently don't have access to make infrastructure changes (waiting for SSH keys or Terraform configs from @mattyweb).
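
To make (1) more concrete, here is a rough sketch of what such a Python method could look like. It is only an illustration under assumptions: the `requests` table and its columns are hypothetical names rather than our actual schema, and an in-memory SQLite database stands in for the Postgres DB so the example runs anywhere.

```python
import sqlite3
from datetime import date


def fetch_requests(conn, start: date, end: date) -> list[dict]:
    """Return requests created in [start, end) as a list of row dicts.

    `conn` is any DB-API connection. The table and column names are
    hypothetical placeholders, not the real 311 Data schema.
    """
    cur = conn.cursor()
    # "?" is SQLite's placeholder style; psycopg2 (Postgres) uses "%s" instead.
    cur.execute(
        "SELECT request_id, request_type, created_date FROM requests "
        "WHERE created_date >= ? AND created_date < ? "
        "ORDER BY created_date",
        (start.isoformat(), end.isoformat()),
    )
    columns = [col[0] for col in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


def to_dataframe(records):
    """Turn the row dicts into a pandas DataFrame."""
    import pandas as pd  # imported lazily so fetching works without pandas

    return pd.DataFrame.from_records(records)


if __name__ == "__main__":
    # Stand-in database with made-up rows, just to exercise the helper.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE requests "
        "(request_id INTEGER, request_type TEXT, created_date TEXT)"
    )
    conn.execute("INSERT INTO requests VALUES (1, 'Graffiti', '2022-06-01')")
    print(fetch_requests(conn, date(2022, 6, 1), date(2022, 7, 1)))
```

A data scientist would then call `to_dataframe(fetch_requests(conn, start, end))` and never touch SQL directly.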

Action Items

  • Get SSH key passphrase for DB access
  • Make our SQL DB accessible to data scientists + create Python methods that read from the DB and produce dataframes
@nichhk
Member Author

nichhk commented Jun 24, 2022

cc @joshuayhwu, who might be interested in working on this.

@joshuayhwu joshuayhwu self-assigned this Jun 24, 2022
@joshuayhwu
Contributor

I was taking a look at the lacity API config and trying to connect to Postgres locally. What's the password for the database, if any? (The code says it defaults to none, but psycopg2 reports that the DB refused the connection.)

@nichhk
Member Author

nichhk commented Jun 30, 2022

Please take a look at this section of the Terraform README. Basically, we cannot directly access our DB right now because it's in a virtual private network (or cloud?). It can only be accessed through the bastion server, but we don't have the SSH key from Matt yet :(
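
Once the key is available, going through the bastion would usually mean an SSH local port forward. The sketch below shows the general shape, assuming a standard OpenSSH client; every host name, user, and key path in it is a placeholder and not a value from our Terraform setup.

```python
import os
import subprocess


def build_tunnel_command(bastion_host, db_host, user, key_path,
                         local_port=5433, db_port=5432):
    """Build the ssh command that forwards a local port to the DB via the bastion.

    All arguments are placeholders to be filled in from the real infrastructure.
    """
    return [
        "ssh",
        "-i", os.path.expanduser(key_path),         # private key for the bastion
        "-N",                                       # no remote command, forwarding only
        "-L", f"{local_port}:{db_host}:{db_port}",  # local port -> DB host and port
        f"{user}@{bastion_host}",
    ]


def open_tunnel(*args, **kwargs):
    """Start the tunnel in the background; call .terminate() on the result to close it."""
    return subprocess.Popen(build_tunnel_command(*args, **kwargs))
```

With the tunnel open, a client such as psycopg2 would connect to `localhost` on the forwarded local port instead of the DB's private address.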

@joshuayhwu
Contributor

Interesting. What are your thoughts on directly accessing the Socrata API and building a separate db?

@nichhk
Member Author

nichhk commented Jun 30, 2022

Hmm I think that would help us get a solution faster, but we'd have to maintain another component in an already too-complicated system.

If we think that this task is highest priority (i.e., we want it completed in the coming weeks), I would go with option 2 listed in the first comment.
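
For reference, the extra step in option 2 could be sketched roughly as follows. This is an illustration under assumptions, not our existing Prefect flow: the `requests` table name is hypothetical, an in-memory SQLite DB stands in for Postgres, and the S3 upload is only indicated in a comment.

```python
import csv
import io
import sqlite3


def export_table_to_csv(conn, table, out, batch_size=10_000):
    """Stream every row of `table` to the file-like `out` as CSV.

    Returns the number of data rows written. Rows are fetched in batches so
    the whole table never has to fit in memory.
    """
    # The table name is interpolated into SQL, so only accept plain identifiers.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    cur = conn.cursor()
    cur.execute(f"SELECT * FROM {table}")
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cur.description])  # header row
    written = 0
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        writer.writerows(rows)
        written += len(rows)
    return written


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE requests (request_id INTEGER, request_type TEXT)")
    conn.execute("INSERT INTO requests VALUES (1, 'Graffiti')")
    buffer = io.StringIO()
    print(export_table_to_csv(conn, "requests", buffer), "rows exported")
    # Assumption: in the nightly run the CSV would be written to a local file
    # and then uploaded, e.g. boto3.client("s3").upload_file(path, bucket, key),
    # with a bucket and key we would still need to choose.
```

Data scientists could then download the CSV directly without any DB credentials.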

@joshuayhwu
Contributor

Sounds good. We can discuss more tonight.

@nichhk added the Role: Data Science, Size: 5pt, and P-feature: Analytics labels Jul 19, 2022
@nichhk added this to the v2.1 Launch milestone Jul 19, 2022
@EchoProject removed this from the v2.1 Launch milestone Dec 8, 2022
@mc759
Member

mc759 commented Dec 13, 2022

Hey @joshuayhwu, do you have an update for us on this issue?

Please update:

  • Progress:
  • Blockers:
  • Availability:
  • ETA:

Thanks!

Projects
Status: Done (without merge)
Development

No branches or pull requests

5 participants