
Get Catalog Dump #4221

Open
1 task
jbrown-xentity opened this issue Feb 24, 2023 · 4 comments
Labels

Comments
@jbrown-xentity
Contributor

User Story

In order to examine all or most of data.gov's catalog datasets, Geoplatform wants a dump of the datasets made available as a single resource.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN a scan of the entire catalog is needed
    WHEN a dump of the datasets is created
    THEN a user is able to download and parse the details of the catalog

Background

Geoplatform scans our API regularly for a number of reasons:

  1. Complete re-harvest
  2. Harvest updates (changes)
  3. Evaluation of state (how many of the data assets does Geoplatform have, compared to how many data.gov has?)

They can be denied/blocked by the API based on the number of requests they sometimes make. In the past, catalog hasn't scaled well enough to handle this load, but that was in the FCS days. There should be a way for authorized users to make more than the limited number of public requests...

Security Considerations (required)

None at this time

Sketch

There are a number of approaches to this:

  1. Whitelist IPs for Geoplatform (may not be ideal, as it requires significant changes to create dedicated IPs and/or update them regularly on our end).
  2. Create an API key on catalog-admin that has limited access, and let them bypass the rate limiting in CloudFront (but admin isn't designed for this, and slamming our main Solr with extra queries may carry extra risks).
  3. Create a regularly scheduled job that takes a "dump" of our datasets, stores it on S3, and makes it available to the public.

I really don't like 1; I think it represents a lot of toil. 2 seems better, but may be risky long term. 3 represents the most work, but is the most stable and useful option for a variety of customers (I don't think Geoplatform is the only group that does complete scans of our system). If we start doing it weekly/daily, we could even do historical analysis in the future!
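A minimal sketch of what option 3 could look like, assuming a paginated catalog API. The function and parameter names here are hypothetical, not data.gov code; a real job would also need scheduling and an S3 upload step (e.g. via boto3's `upload_file`):

```python
import json

def dump_datasets_jsonl(fetch_page, out_path, page_size=1000):
    """Page through the catalog API and append each dataset as one JSON line.

    fetch_page(start, rows) -> list of dataset dicts; returns [] when exhausted.
    JSONL lets consumers parse the dump line by line without loading it all.
    """
    with open(out_path, "w", encoding="utf-8") as f:
        start = 0
        while True:
            page = fetch_page(start, page_size)
            if not page:
                break
            for dataset in page:
                f.write(json.dumps(dataset, sort_keys=True) + "\n")
            start += len(page)
    # A scheduled job would then upload out_path to a public S3 bucket
    # (e.g. boto3's s3.upload_file) and publish the object URL.
```

Injecting `fetch_page` keeps the dump logic independent of whether it is fed by Solr, the CKAN API, or a future post-CKAN datastore.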

@FuhuXia
Member

FuhuXia commented Feb 27, 2023

We used to have a jsonl-export function that dumped all datasets into a JSONL file on S3, available for download on a monthly basis. It became defunct at some point, but it should not be hard to get it working again.

@jbrown-xentity
Contributor Author

There's also a CSV option that we didn't deprecate (it would probably need some work though). Personally I think CSV makes more sense in this context; it's much easier to "stream" CSV than JSON (a JSON document must be closed before it can be parsed, while CSV can be streamed and parsed line by line). It's also more efficient in download and storage.
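The streaming point can be illustrated with a toy sketch (illustrative rows only, not data.gov code): a CSV dump is consumable record by record, while a single JSON array is invalid until its closing bracket arrives.

```python
import csv
import io
import json

csv_dump = "name,org\nds-1,gsa\nds-2,doi\n"  # illustrative rows only

# CSV: each line is a self-contained record, so even a partially
# downloaded dump is parseable up to the last complete line.
records = [row for row in csv.DictReader(io.StringIO(csv_dump))]

# JSON: a truncated array is simply invalid until the closing "]" arrives.
try:
    json.loads('[{"name": "ds-1"}, {"name": "ds-2"}')  # missing "]"
    truncated_json_ok = True
except json.JSONDecodeError:
    truncated_json_ok = False
```

(JSONL, as mentioned above, avoids the closing-bracket problem too, since each line is an independent JSON document; the trade-off against CSV is mostly size and tooling.)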

@hkdctol hkdctol moved this to 📔 Product Backlog in data.gov team board Mar 2, 2023
@robert-bryson robert-bryson moved this from 📔 Product Backlog to New Dev in data.gov team board Aug 31, 2023
@btylerburton btylerburton added the H2.0/Harvest-General General Harvesting 2.0 Issues label Oct 13, 2023
@btylerburton
Contributor

@jbrown-xentity Is this labeled as Harvesting 2.0 because it's a requirement for future dev?

@jbrown-xentity
Contributor Author

I don't remember ever adding that label; GitHub helpfully says you did (see above)... :) I think that label can be removed; this isn't relevant for Harvesting 2.0. As new development in a post-CKAN world, I would think making a CSV of a standardized data model publicly available would be fairly trivial... @btylerburton

@btylerburton btylerburton removed the H2.0/Harvest-General General Harvesting 2.0 Issues label Dec 12, 2023
@btylerburton btylerburton moved this from New Dev to 🧊 Icebox in data.gov team board Dec 14, 2023