From 130ed8ac2808697beeb6184168f7a9691330b238 Mon Sep 17 00:00:00 2001
From: Stephen Appel
Date: Mon, 3 Jun 2024 13:40:48 -0500
Subject: [PATCH] config.yaml info in opendataharvest doc

---
 docs/utils/opendataharvest.md | 87 +++++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)

diff --git a/docs/utils/opendataharvest.md b/docs/utils/opendataharvest.md
index 641058c..5e13c87 100644
--- a/docs/utils/opendataharvest.md
+++ b/docs/utils/opendataharvest.md
@@ -58,3 +58,90 @@ parent: GeoDiscovery Utilities
 Landing Page:
 DCAT: landingPage
 OGM Aardvark: dct_isPartOf_sm
+
+## The [config.yaml](https://github.com/UWM-Libraries/GeoDiscovery-Utils/blob/main/opendataharvest/config.yaml) file.
+
+YAML is a human-readable data serialization language.
+It is commonly used for configuration files.
+When a Python script loads a YAML file, the values become available as a dictionary object,
+which makes them easy to access.
+
+The opendataharvest tool reads three things from this file: its configuration parameters (e.g. where to store output and logs),
+default values for fields,
+and manifests of open data sites to harvest from.
+
+You can see the configuration options at the top:
+
+```yaml
+CONFIG:
+  CATALOG: "DCAT_Sites" # TestSites, DCAT_Sites, or CKAN_Sites
+  OUTPUTDIR: "opendataharvest/output_md"
+  LOGDIR: "opendataharvest/log"
+  DEFAULTBBOX: "opendataharvest/default_bbox.csv"
+  MAXRETRY: 3
+  SLEEPTIME: 2
+  SCHEMA: "https://raw.githubusercontent.com/UWM-Libraries/GeoDiscovery/main/schema/geoblacklight-schema-aardvark.json"
+```
+
+Next, the default values are set in the "Localization" section:
+
+```yaml
+DEFAULT:
+  MEMBEROF:
+    - "AGSLOpenDataHarvest"
+  RESOURCECLASS:
+    - "Datasets"
+  ACCESSRIGHTS: "public"
+  MDVERSION: "Aardvark"
+  LANG:
+    - "English"
+  PROVIDER: "American Geographical Society Library – UWM Libraries"
+  SUPPRESSED: false
+  RIGHTS:
+    - Although this data is being distributed by the American Geographical Society Library at the University of Wisconsin-Milwaukee Libraries, no warranty expressed or implied is made by the University as to the accuracy of the data and related materials. The act of distribution shall not constitute any such warranty, and no responsibility is assumed by the University in the use of this data, or related materials.
+  RESOURCETYPE:
+    - "Digital maps"
+  FORMAT: None
+  DESCRIPTION: This dataset was automatically cataloged from the creator's Open Data Portal. In some cases, publication year and bounding coordinates shown here may be incorrect. Additional download formats may be available on the author's website. Please check the 'More details at' link for additional information.
+```
+
+After a small section of test sites, the rest of the file contains a nested record for each hub or portal we harvest from.
+
+Here is an example of a record for the Wisconsin Department of Health Services data portal, which is DCAT-compliant:
+
+```yaml
+  DHS_OpenData:
+    CreatedBy: "Wisconsin Department of Health Services"
+    SiteURL: "https://data.dhsgis.wi.gov/data.json"
+    SiteName: "DHS"
+    Spatial: ["Wisconsin", "United States"]
+    DefaultBbox: "Wisconsin"
+    MapList: ""
+    AppList:
+      - UUID: "e1ca38bf16f54fb8ac879b386dbce422" # Flood Risk Map
+      - UUID: "861fc902539e436ebef7a86a10e9337b" # Immunization Map
+      - UUID: "43ed2d88cf1348608230572166d76697" # Radon Map
+    SkipList:
+      - UUID: "ca921d70bdd84ae8bc84cd09abd822d7" # link to census geography website
+      - UUID: "00883495714c42a9be53b76b24300c8e" # GIS data disclaimer
+      - UUID: "200036084844418bb3119d963cd7d98c" # OSDP Help?
+      - UUID: "29c62b7a834944ef8196573c123d7a9d"
+```
+
+The record stores the URL where we access the catalog information as `SiteURL`,
+some basic metadata fields that we want to remain consistent, such as `SiteName` and `Spatial`,
+a default bounding box, `DefaultBbox`
+(defined in [default_bbox.csv](https://github.com/UWM-Libraries/GeoDiscovery-Utils/blob/main/opendataharvest/default_bbox.csv) by default),
+to use when the script cannot parse spatial information from the dataset or the information is missing,
+and three lists: `MapList`, `AppList`, and `SkipList`.
+
+The DCAT harvest script will assign special metadata attributes to datasets defined in these lists.
+Datasets in the `AppList` will be assigned the Resource Class of "Websites".
+Datasets in the `MapList` will be assigned the Resource Class of "Maps".
+
+As the name implies, the script will skip over datasets listed in the `SkipList`.
+These are typically links to other open data portals, placeholder records, and copies of data from other repositories,
+including ESRI basemaps.
+We don't want to ingest these into our portal, so we add them to the `SkipList`.
+
+Some records have other elements, such as `DatasetPrefix`, that are not being used at this time.
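
The list handling described in the patch can be illustrated with a minimal Python sketch. It is not the harvest script's actual code. The `site` dictionary is a hand-written stand-in for one parsed record (with PyYAML, `yaml.safe_load` would produce a similar structure from config.yaml), and the helper names `uuids` and `resource_class`, the fallback to "Datasets", and the choice to check the SkipList first are all assumptions made for illustration.

```python
# Hypothetical stand-in for one parsed site record from config.yaml.
# Only two of the UUIDs from the example record are reproduced here.
site = {
    "SiteName": "DHS",
    "MapList": "",  # an empty string, as in the example record
    "AppList": [{"UUID": "e1ca38bf16f54fb8ac879b386dbce422"}],
    "SkipList": [{"UUID": "ca921d70bdd84ae8bc84cd09abd822d7"}],
}


def uuids(site, key):
    """Return the set of UUIDs under `key`; "" or a missing key counts as empty."""
    return {entry["UUID"] for entry in (site.get(key) or [])}


def resource_class(site, uuid, default="Datasets"):
    """Return None for a skipped dataset, otherwise the Resource Class to assign.

    Checking the SkipList before the other lists is an assumption of this sketch.
    """
    if uuid in uuids(site, "SkipList"):
        return None
    if uuid in uuids(site, "AppList"):
        return "Websites"
    if uuid in uuids(site, "MapList"):
        return "Maps"
    return default


if __name__ == "__main__":
    for candidate in ("e1ca38bf16f54fb8ac879b386dbce422", "not-in-any-list"):
        print(candidate, resource_class(site, candidate))
```

Treating an empty string the same as an empty list matters here because the example record sets `MapList: ""` rather than omitting the key.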