diff --git a/docs/utils/opendataharvest.md b/docs/utils/opendataharvest.md
index 5e13c87..125bed9 100644
--- a/docs/utils/opendataharvest.md
+++ b/docs/utils/opendataharvest.md
@@ -5,59 +5,28 @@ nav_order: 3
 parent: GeoDiscovery Utilities
 ---

-# OpenDataHarvest
+# OpenDataHarvest Tool

 [GitHub Repo](https://github.com/UWM-Libraries/GeoDiscovery-Utils/tree/main/opendataharvest)

-## Basic crosswalk mapping:
+### Overview
+The OpenDataHarvest tool is a component of the GeoDiscovery-Utils repository. This tool is designed to facilitate the harvesting and processing of open geospatial data for integration into the GeoDiscovery portal, a platform aimed at providing access to a wide range of geospatial datasets.

-    Title:
-        DCAT: title
-        OGM Aardvark: dct_title_s
+### Features
+- **Automated Data Harvesting**: OpenDataHarvest automates the process of collecting geospatial data from various open data sources. This ensures that the GeoDiscovery portal is continually updated with the latest datasets available.
+- **Data Transformation**: The tool includes functionalities to transform the harvested data into formats that are compatible with the GeoDiscovery portal, ensuring seamless integration.
+- **Metadata Handling**: OpenDataHarvest handles metadata extraction and processing, ensuring that all datasets are accompanied by comprehensive metadata for better discoverability and usability.

-    Description:
-        DCAT: description
-        OGM Aardvark: dct_description_sm
+### Usage
+The tool can be integrated into workflows for regularly updating the GeoDiscovery portal with new and updated datasets. It is suitable for use by libraries, research institutions, and other organizations involved in managing geospatial data.

-    Keywords/Tags:
-        DCAT: keyword
-        OGM Aardvark: dct_subject_sm
+### Integration
+OpenDataHarvest is part of a broader set of utilities in the GeoDiscovery-Utils repository, all of which support the functionalities of the GeoDiscovery portal. The tool is designed to work in conjunction with other components to provide a robust geospatial data management and discovery solution.

-    Publisher:
-        DCAT: publisher
-        OGM Aardvark: dct_publisher_sm
-
-    Contact Point:
-        DCAT: contactPoint
-        OGM Aardvark: dct_contributor_sm
-
-    Access Rights:
-        DCAT: accessLevel
-        OGM Aardvark: dct_accessRights_s
-
-    Temporal Coverage:
-        DCAT: temporal
-        OGM Aardvark: dct_temporal_sm
-
-    Spatial Coverage:
-        DCAT: spatial
-        OGM Aardvark: dct_spatial_sm
-
-    Identifier:
-        DCAT: identifier
-        OGM Aardvark: dct_identifier_s
-
-    Rights:
-        DCAT: rights
-        OGM Aardvark: dct_rights_sm
-
-    Format:
-        DCAT: format
-        OGM Aardvark: dct_format_s
-
-    Landing Page:
-        DCAT: landingPage
-        OGM Aardvark: dct_isPartOf_sm
+### Benefits
+- **Efficiency**: Automates the repetitive task of data harvesting, saving time and resources.
+- **Up-to-date Data**: Ensures that the GeoDiscovery portal remains current with the latest available geospatial data.
+- **Enhanced Discoverability**: Through comprehensive metadata processing, the tool enhances the discoverability of datasets within the portal.

 ## The [config.yaml](https://github.com/UWM-Libraries/GeoDiscovery-Utils/blob/main/opendataharvest/config.yaml) file.

@@ -70,7 +39,7 @@ The opendataharvest tool gets both it's configuration parameters (e.g. where to
 default values for fields, and manifests of open data sites to harvest from.

-You can see the configuration options at the top:
+### Configuration Options

 ```yaml
 CONFIG:
@@ -83,7 +52,7 @@ CONFIG:
   SCHEMA: "https://raw.githubusercontent.com/UWM-Libraries/GeoDiscovery/main/schema/geoblacklight-schema-aardvark.json"
 ```

-Next the default values are set in the "Localization" section:
+### Localization and Default Values

 ```yaml
 DEFAULT:
@@ -107,6 +76,8 @@ DEFAULT:

 Following a small section of test sites, the rest of the file has nested records for each of the Hubs or portals we harvest from.

+### Example of a data portal in the YAML file
+
 Here is an example of a record for the Wisconsin Department of Health Services Data Portal DCAT-compliant portal:

 ```yaml
@@ -145,3 +116,54 @@ including ESRI basemaps. We don't want to ingest these into our portal, so we
 add them to the skiplist.

 There are some datasets that have other elements such as `DatasetPrefix` that are not being used at this time.
+
+## Basic crosswalk mapping:
+
+    Title:
+        DCAT: title
+        OGM Aardvark: dct_title_s
+
+    Description:
+        DCAT: description
+        OGM Aardvark: dct_description_sm
+
+    Keywords/Tags:
+        DCAT: keyword
+        OGM Aardvark: dct_subject_sm
+
+    Publisher:
+        DCAT: publisher
+        OGM Aardvark: dct_publisher_sm
+
+    Contact Point:
+        DCAT: contactPoint
+        OGM Aardvark: dct_contributor_sm
+
+    Access Rights:
+        DCAT: accessLevel
+        OGM Aardvark: dct_accessRights_s
+
+    Temporal Coverage:
+        DCAT: temporal
+        OGM Aardvark: dct_temporal_sm
+
+    Spatial Coverage:
+        DCAT: spatial
+        OGM Aardvark: dct_spatial_sm
+
+    Identifier:
+        DCAT: identifier
+        OGM Aardvark: dct_identifier_s
+
+    Rights:
+        DCAT: rights
+        OGM Aardvark: dct_rights_sm
+
+    Format:
+        DCAT: format
+        OGM Aardvark: dct_format_s
+
+    Landing Page:
+        DCAT: landingPage
+        OGM Aardvark: dct_isPartOf_sm
+
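
The "Basic crosswalk mapping" section moved by this diff can be read as a lookup table from DCAT keys to OGM Aardvark fields. The following is a minimal sketch of that idea in Python, not the tool's actual implementation: the names `DCAT_TO_AARDVARK` and `crosswalk` and the sample record are hypothetical, and the list-wrapping rule assumes the usual Aardvark suffix convention (`_sm` multivalued, `_s` single-valued).

```python
# Hypothetical sketch of the crosswalk listed in the docs page.
# The key/field pairs come from the "Basic crosswalk mapping" section;
# the function name and behavior are illustrative assumptions only.
DCAT_TO_AARDVARK = {
    "title": "dct_title_s",
    "description": "dct_description_sm",
    "keyword": "dct_subject_sm",
    "publisher": "dct_publisher_sm",
    "contactPoint": "dct_contributor_sm",
    "accessLevel": "dct_accessRights_s",
    "temporal": "dct_temporal_sm",
    "spatial": "dct_spatial_sm",
    "identifier": "dct_identifier_s",
    "rights": "dct_rights_sm",
    "format": "dct_format_s",
    "landingPage": "dct_isPartOf_sm",
}


def crosswalk(dcat_record):
    """Copy known DCAT keys into their Aardvark field names.

    Values are passed through unchanged, except that fields ending in
    ``_sm`` (assumed multivalued) are wrapped in a list when needed.
    A real harvester would likely also flatten nested DCAT objects
    (for example ``publisher``) and apply configured defaults.
    """
    aardvark = {}
    for dcat_key, aardvark_key in DCAT_TO_AARDVARK.items():
        if dcat_key not in dcat_record:
            continue  # skip fields the source record does not provide
        value = dcat_record[dcat_key]
        if aardvark_key.endswith("_sm") and not isinstance(value, list):
            value = [value]
        aardvark[aardvark_key] = value
    return aardvark


# Illustrative record; not taken from any real portal.
sample = {
    "title": "Example Dataset",
    "keyword": ["health", "wisconsin"],
    "description": "An example description.",
    "accessLevel": "public",
}
result = crosswalk(sample)
```

Keeping the mapping in a single dictionary means a change to the crosswalk touches one place, and DCAT keys outside the table are simply ignored rather than raising an error.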