Get Catalog Dump #4221
We used to have a jsonl-export function that dumped all datasets into a JSONL file on S3, available for download on a monthly basis. It became defunct at a certain point, but it should not be hard to get it working again.
There's also a CSV option that we didn't deprecate (it would probably need some work, though). Personally, I think CSV makes more sense in this context; it's much easier to stream CSV than JSON (a JSON document needs to be closed before it's valid, while CSV can be streamed and parsed line by line). It's also more efficient, in download and storage.
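A minimal sketch of the streaming point above. The dataset records and field names here are hypothetical, not the actual catalog schema; the idea is just that each CSV row is independently parseable as soon as it's written, whereas a JSON array is only valid once the closing bracket arrives.

```python
import csv
import io
import json

# Hypothetical catalog records; field names are illustrative, not data.gov's real schema.
datasets = [
    {"id": "ds-1", "title": "Air Quality", "org": "EPA"},
    {"id": "ds-2", "title": "Census Tracts", "org": "Census"},
]

# CSV: each row can be flushed to S3 (or a consumer) as it's produced.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "title", "org"])
writer.writeheader()
for ds in datasets:
    writer.writerow(ds)  # a partial file up to here is still usable

# JSON array: only well-formed after the final "]" is written, so a
# truncated download is unparseable.
json_dump = json.dumps(datasets)

# Consumers can parse the CSV stream line by line:
reader = csv.DictReader(io.StringIO(buf.getvalue()))
rows = list(reader)
```

The same line-at-a-time property holds for the old JSONL export (one JSON object per line), which is why either format works better for a multi-gigabyte dump than a single JSON array.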
@jbrown-xentity Is this labeled as
I don't remember ever adding that label; GitHub helpfully says you did (see above)... :) I think that label can be removed, as this isn't relevant for Harvesting 2.0. As new development in a post-CKAN world, making a CSV of a standardized data model publicly available should be fairly trivial... @btylerburton
User Story
In order to examine all/most of data.gov's catalog datasets, Geoplatform wants a dump of the datasets made available as a single resource.
Acceptance Criteria
[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]
WHEN a dump of the datasets is created
THEN a user is able to download and parse the details of the catalog
Background
Geoplatform scans our API regularly, for a number of reasons:
They can be denied or blocked by the API based on the volume of requests they make. In the past, catalog hasn't scaled well enough to handle this load, though that was in the FCS days. There should be a way for authorized users to make more than the limited number of public requests...
Security Considerations (required)
None at this time
Sketch
There are a number of approaches to this:
I really don't like option 1; I think it represents a lot of toil. Option 2 seems better, but may be risky long term. Option 3 represents the most work, but would be the most stable and useful for a variety of customers (I don't think Geoplatform is the only group that does complete scans of our system). If we start doing dumps weekly or daily, we could even do historical analysis in the future!