
Get Catalog Dump #4221

Open
1 task
jbrown-xentity opened this issue Feb 24, 2023 · 4 comments
Labels

Comments
@jbrown-xentity
Contributor

User Story

In order to examine all or most of data.gov's catalog datasets, Geoplatform wants a dump of the datasets made available as a single resource.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN a scan of the entire catalog is needed
    WHEN a dump of the datasets is created
    THEN a user is able to download and parse the details of the catalog

Background

Geoplatform scans our API regularly for a number of reasons:

  1. Complete re-harvest
  2. Harvest updates (changes)
  3. Evaluation of state (how many of the data assets does Geoplatform have, compared to how many data.gov has?)

They can be denied/blocked by the API based on the number of requests they sometimes make. In the past, catalog hasn't scaled well enough to handle this load, but that was in the FCS days. There should be a way for authorized users to make more than the limited number of public requests...

Security Considerations (required)

None at this time

Sketch

There are a number of approaches to this:

  1. Whitelist IPs for Geoplatform (may not be ideal, as it requires significant changes to create dedicated IPs and/or update them regularly on our end).
  2. Create an API key on catalog-admin that has limited access, and let them bypass the rate limiting in CloudFront (but admin isn't designed for this, and slamming our main Solr with extra queries may carry extra risks).
  3. Create a regularly scheduled job that takes a "dump" of our datasets, stores it on S3, and makes it available to the public.

I really don't like 1; I think it represents a lot of toil. 2 seems better, but may be risky long term. 3 represents the most work, but is the most stable and useful option for a variety of customers (I don't think Geoplatform is the only group that does complete scans of our system). If we start doing it weekly/daily, we could even do historical analysis in the future!
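A minimal sketch of what option 3 could look like, assuming a paginated catalog API. The function and parameter names here are hypothetical, not data.gov code; a real job would also need scheduling and an S3 upload step (e.g. via boto3's `upload_file`):

```python
import json

def dump_datasets_jsonl(fetch_page, out_path, page_size=1000):
    """Page through the catalog API and append each dataset as one JSON line.

    fetch_page(start, rows) -> list of dataset dicts; returns [] when exhausted.
    JSONL lets consumers parse the dump line by line without loading it all.
    """
    with open(out_path, "w", encoding="utf-8") as f:
        start = 0
        while True:
            page = fetch_page(start, page_size)
            if not page:
                break
            for dataset in page:
                f.write(json.dumps(dataset, sort_keys=True) + "\n")
            start += len(page)
    # A scheduled job would then upload out_path to a public S3 bucket
    # (e.g. boto3's s3.upload_file) and publish the object URL.
```

Injecting `fetch_page` keeps the dump logic independent of whether it is fed by Solr, the CKAN API, or a future post-CKAN datastore.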

@FuhuXia
Member

FuhuXia commented Feb 27, 2023

We used to have a jsonl-export function that dumped all datasets into a JSONL file on S3, available for download on a monthly basis. It became defunct at some point, but it should not be hard to get it working again.

@jbrown-xentity
Contributor Author

There's also a CSV option that we didn't deprecate (it would probably need some work though). Personally I think CSV makes more sense in this context; it's much easier to "stream" CSV than JSON (a JSON document must be closed before it can be parsed, while CSV can be streamed and parsed line by line). It's also more efficient in download and storage.
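The streaming point can be illustrated with a toy sketch (illustrative rows only, not data.gov code): a CSV dump is consumable record by record, while a single JSON array is invalid until its closing bracket arrives.

```python
import csv
import io
import json

csv_dump = "name,org\nds-1,gsa\nds-2,doi\n"  # illustrative rows only

# CSV: each line is a self-contained record, so even a partially
# downloaded dump is parseable up to the last complete line.
records = [row for row in csv.DictReader(io.StringIO(csv_dump))]

# JSON: a truncated array is simply invalid until the closing "]" arrives.
try:
    json.loads('[{"name": "ds-1"}, {"name": "ds-2"}')  # missing "]"
    truncated_json_ok = True
except json.JSONDecodeError:
    truncated_json_ok = False
```

(JSONL, as mentioned above, avoids the closing-bracket problem too, since each line is an independent JSON document; the trade-off against CSV is mostly size and tooling.)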

@hkdctol hkdctol moved this to 📔 Product Backlog in data.gov team board Mar 2, 2023
@robert-bryson robert-bryson moved this from 📔 Product Backlog to New Dev in data.gov team board Aug 31, 2023
@btylerburton btylerburton added the H2.0/Harvest-General General Harvesting 2.0 Issues label Oct 13, 2023
@btylerburton
Contributor

@jbrown-xentity Is this labeled as Harvesting 2.0 because it's a requirement for future dev?

@jbrown-xentity
Contributor Author

I don't remember ever adding that label; GitHub helpfully says you did (see above)... :) I think that label can be removed; this isn't relevant for Harvesting 2.0. As new development in a post-CKAN world, I would think making a CSV of a standardized data model publicly available would be fairly trivial... @btylerburton

@btylerburton btylerburton removed the H2.0/Harvest-General General Harvesting 2.0 Issues label Dec 12, 2023
@btylerburton btylerburton moved this from New Dev to 🧊 Icebox in data.gov team board Dec 14, 2023