Latest changes: XML Harvester, CKAN API actions, fix bugs #16

Merged: 19 commits, Sep 12, 2023
4 changes: 2 additions & 2 deletions .env.example
@@ -14,8 +14,8 @@ DEFAULT_LICENSE=http://creativecommons.org/licenses/by/4.0/
DEFAULT_LICENSE_ID=cc-by
#TODO: Not implemented
PARALLELIZATION=False
## CKAN Dataset schema
CKAN_DATASET_SCHEMA=geodcatap
## CKAN Dataset schema (geodcatap_es, geodcatap_eu)
CKAN_DATASET_SCHEMA=geodcatap_eu
## ckan-ogc SSL unverified mode (True/False). If SSL_UNVERIFIED_MODE=True, the host's SSL certificate is downloaded to avoid SSL errors when the certificate is self-signed.
SSL_UNVERIFIED_MODE=False
## To export metadata records (GeoDCAT-AP/ISO19139) as distributions of the CKAN dataset, set METADATA_DISTRIBUTIONS=True
4 changes: 2 additions & 2 deletions .gitignore
@@ -10,9 +10,9 @@ metadata/*
!metadata/README.md
samples
.env
data/*
!data/README.md
!data/template-table.xlsx
!data/sample
data/*
ogc2ckan/mappings/organizations/*
!ogc2ckan/mappings/organizations/README.md
!ogc2ckan/mappings/organizations/template-org.yaml
64 changes: 55 additions & 9 deletions README.md
@@ -7,6 +7,7 @@
<a href="#overview">Overview</a> •
<a href="#quick-start">Quick start</a> •
<a href="#debug">Debug</a> •
<a href="#additional-info">Additional info</a> •
<a href="#containers">Containers</a>
</p>

@@ -17,18 +18,18 @@
Docker Compose environment for ingesting metadata from different spatial/semantic/general metadata sources into CKAN.

* OGC harvester (WCS/WFS, WMS & WMTS services)
* CSW harvester (ISO 19115/19139 Metadata)
* CKAN API. - WIP
* Tabular data (CSV, TSV)
* CSW harvester (ISO 19115/19139 Metadata Catalogue Services)
* Spreadsheets (XLS/XLSX)
* Metadata files (XML ISO19139) - WIP
* Metadata files (XML ISO19139)
* CKAN API - WIP
* Semantic metadata files (RDF/TTL) - WIP
* Tabular data (CSV, TSV) - WIP

>**Note**<br>
> It can be tested with a CKAN-based open data portal such as [mjanez/ckan-docker](https://github.com/mjanez/ckan-docker)[^1]

## Quick start
First, copy the `.env.example` template to `.env` and edit it. Change `PYCSW_URL` and `CKAN_URL`, as well as the published port `PYCSW_PORT`, if needed.
First, copy the `.env.example` template to `.env` and edit it. Change `PYCSW_URL` and `CKAN_URL`, as well as the Harvester info `OGC2CKAN INFO`, if needed.

```bash
cp .env.example .env
@@ -43,8 +44,9 @@ Custom ennvars:
- `DEFAULT_LICENSE`: Default license for the harvested datasets. Open Data default: `http://creativecommons.org/licenses/by/4.0/`
- `DEFAULT_LICENSE_ID`: Default license ID for the harvested datasets, ID list: `{ckan_site_url}/api/3/action/license_list`. Open Data default: `cc-by-4.0`
- `PARALLELIZATION`: [WIP] Parallelization of the harvesters. Default: `False`
- `CKAN_DATASET_SCHEMA`: Dataset schema of the CKAN Endpoint. Default: `geodcatap`
- `CKAN_DATASET_SCHEMA`: Dataset schema of the CKAN Endpoint. Default: `geodcatap_eu`
- `SSL_UNVERIFIED_MODE`: If `SSL_UNVERIFIED_MODE=True`, the host's SSL certificate is downloaded. Use this envvar to avoid SSL errors when the certificate is self-signed.
- `METADATA_DISTRIBUTIONS`: To create metadata distributions (GeoDCAT-AP/ISO19139) as CKAN resources, set `METADATA_DISTRIBUTIONS=True`. Default: `False`

>**Warning**<br>
> `SSL_UNVERIFIED_MODE=True` is not recommended for production environments. Update your certificate or use a valid one. **Check the container log if it fails, and put `True` in the `.env` file.**
@@ -101,7 +103,12 @@ pdm install --no-self
Configure your custom `config.yaml`. Define the harvest servers and the CKAN DCAT default info.

```bash
cp ckan-ogc/conf/config.yaml.template ckan-ogc/conf/config.yaml
cp ckan-ogc/conf/config.yaml.template ./config.yaml
```

Remember to configure your `.env`:
```bash
cp .env.example .env
```
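
When running outside Docker, the script has to pick up the `.env` values by itself. The following is only a minimal sketch of how such values could be parsed, not the project's actual loader: the helper names `parse_env` and `as_bool` are hypothetical, and the sample text reuses variables from `.env.example`.

```python
def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and `#` comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def as_bool(value: str) -> bool:
    """Interpret the True/False strings used in `.env.example`."""
    return value.strip().lower() in {"1", "true", "yes"}


# Sample content copied from the `.env.example` lines in this change.
sample = """\
## CKAN Dataset schema (geodcatap_es, geodcatap_eu)
CKAN_DATASET_SCHEMA=geodcatap_eu
SSL_UNVERIFIED_MODE=False
#TODO: Not implemented
PARALLELIZATION=False
"""

settings = parse_env(sample)
print(settings["CKAN_DATASET_SCHEMA"])
print(as_bool(settings["SSL_UNVERIFIED_MODE"]))
```

Parsing booleans explicitly matters because any non-empty string, including `"False"`, is truthy in Python.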

Run:
@@ -111,9 +118,48 @@ pdm run python ogc2ckan/ogc2ckan.py

## Debug
### VSCode
#### Python debugger with Docker
1. Build and run container.
2. Attach Visual Studio Code to container
3. Start debugging on `ogc2ckan.py` Python file (`Debug the currently active Python file`).
2. Attach Visual Studio Code to container.
3. Start debugging on `ogc2ckan.py` Python file (`Debug the currently active Python file`) in the container.

#### Python debugger without Docker
1. Update the previously created `.env` file in the root of the `ckan-ogc` repo and move it to: [`/ogc2ckan`](/ogc2ckan)
2. Open [`ogc2ckan.py`](/ogc2ckan/ogc2ckan.py).
3. Start debugging on `ogc2ckan.py` Python file (`Debug the currently active Python file`).

>**Note**<br>
> By default, the Python extension looks for and loads a file named `.env` in the current workspace folder. More info about the Python debugger and the [use of environment variables](https://code.visualstudio.com/docs/python/environments#_environment-variables).

## Additional info
### CKAN Schemas
The CKAN output schemas are located in the [`ogc2ckan/ckan_datasets`](./ogc2ckan/ckan_datasets) folder. The schemas are used to map the metadata fields from the different sources to the CKAN dataset fields. The following schemas are currently available:
* `geodcatap`: Schema based on [GeoDCAT-AP Schema for CKAN](https://github.com/mjanez/ckanext-scheming_dcat).
* `base`: A DCAT schema with the basic fields.

You can create your own Schema.

### Harvester
The harvester is located in the [`ogc2ckan/harvesters`](./ogc2ckan/harvesters) folder. The harvester is a Python script that harvests the metadata from the different sources and creates the datasets in CKAN.

There are different harvesters:
* `csw`: Harvests the metadata from a CSW server using OWSLib.
* `table`: Harvests the metadata from an XLS/XLSX file that contains the metadata records in a table format, using the CKAN `field_name` of the [custom schemas](./ogc2ckan/mappings/ckan_fields) as the column name.
* `ogc`: Harvests the metadata from an OGC server (WCS/WFS, WMS & WMTS services) using OWSLib.
* `xml`: Harvests the metadata from an XML file that contains the metadata records in ISO19139 format.

You can create your own Harvester.
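
One way to picture how harvester types could be selected is a small registry keyed by the type name. This is a sketch under assumptions, not the code in `ogc2ckan/harvesters`: `BaseHarvester`, `register`, `build_harvester`, and the `type`/`name` config keys are all hypothetical.

```python
# Hypothetical registry mapping a harvester type ("csw", "table", "ogc", "xml")
# to the class that handles it.
HARVESTERS = {}


class BaseHarvester:
    def __init__(self, config: dict):
        self.config = config

    def harvest(self) -> list:
        raise NotImplementedError


def register(kind: str):
    """Class decorator that records a harvester under its type name."""
    def decorator(cls):
        HARVESTERS[kind] = cls
        return cls
    return decorator


@register("xml")
class XmlHarvester(BaseHarvester):
    def harvest(self) -> list:
        # A real harvester would parse ISO19139 files here; this stub only
        # returns a placeholder record.
        return [{"name": self.config["name"], "source": "xml"}]


def build_harvester(config: dict) -> BaseHarvester:
    kind = config["type"]
    if kind not in HARVESTERS:
        raise ValueError(f"Unknown harvester type: {kind!r}")
    return HARVESTERS[kind](config)


harvester = build_harvester({"type": "xml", "name": "demo"})
print(harvester.harvest())
```

With this layout, adding a custom harvester means writing one subclass and decorating it, without touching the dispatch code.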

### Configuration file (`config.yaml`)
The configuration file template is located at [`ckan-ogc/conf/config.yaml.template`](./config.yaml.template). It is a YAML file that contains the configuration of the harvesters and the CKAN DCAT default info.

The required elements for each harvester are specified by the Harvester Schema in the [`ogc2ckan/model/harvest_schema.py`](./ogc2ckan/model/harvest_schema.py) file.
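
Checking the required elements before a run can be sketched as follows. The key names below (`harvest_servers`, `name`, `type`, `url`) are assumptions made for illustration; the authoritative list lives in `harvest_schema.py` and is not reproduced here. The config is shown as an already-parsed dictionary, so no YAML library is needed.

```python
# Hypothetical required keys; NOT taken from harvest_schema.py.
REQUIRED_KEYS = {"name", "type", "url"}


def missing_keys(harvester: dict) -> list:
    """Return the required keys that a harvester entry lacks, sorted."""
    return sorted(REQUIRED_KEYS - set(harvester))


def validate(config: dict) -> dict:
    """Map each invalid harvester (by name, or by position) to its missing keys."""
    problems = {}
    for index, harvester in enumerate(config.get("harvest_servers", [])):
        missing = missing_keys(harvester)
        if missing:
            problems[harvester.get("name", f"#{index}")] = missing
    return problems


# Example config as it might look after loading the YAML file.
config = {
    "harvest_servers": [
        {"name": "catalog", "type": "csw", "url": "https://example.org/csw"},
        {"name": "broken", "type": "ogc"},
    ]
}
print(validate(config))
```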

### Custom organizations
A custom organization is a YAML file that contains the custom metadata fields that will be used to create the dataset in CKAN. Custom organization files are located in the [`ogc2ckan/mappings/organizations`](./ogc2ckan/mappings/organizations) folder.

If you need to create your own custom organization YAML file, use [`template-org.yaml`](./ogc2ckan/mappings/organizations/template-org.yaml) as a starting point. The custom organization, specified by the `dataset_id`, is applied to the harvested datasets (if the organization exists in the CKAN instance) so that each dataset is created with the custom metadata fields.
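
One plausible reading of this behaviour is an overlay of the custom fields onto each harvested dataset, applied only when the organization exists in CKAN. The sketch below is an assumption, not the project's implementation: the function name, the lookup by `owner_org`, and the `publisher_name` field are hypothetical.

```python
def apply_custom_organization(dataset: dict, custom_orgs: dict, ckan_org_ids: set) -> dict:
    """Return a copy of `dataset` with custom organization fields applied.

    `custom_orgs` maps an organization id to its custom fields (as loaded from
    the YAML files); `ckan_org_ids` holds the ids present in the CKAN instance.
    """
    org_id = dataset.get("owner_org")
    custom_fields = custom_orgs.get(org_id)
    # Leave the dataset unchanged when there is no custom file for the
    # organization or the organization does not exist in CKAN.
    if custom_fields is None or org_id not in ckan_org_ids:
        return dict(dataset)
    # Custom fields take precedence over harvested values.
    return {**dataset, **custom_fields}


custom_orgs = {"org-a": {"publisher_name": "Organization A"}}
dataset = {"name": "roads", "owner_org": "org-a"}

print(apply_custom_organization(dataset, custom_orgs, {"org-a"}))
print(apply_custom_organization(dataset, custom_orgs, set()))
```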


## Containers
List of *containers*:
1 change: 0 additions & 1 deletion ckan-ogc/Dockerfile
@@ -28,6 +28,5 @@ RUN pdm install --no-self --group prod
COPY ckan-ogc/conf/config.yaml.template config.yaml
COPY ckan-ogc/docker-entrypoint.d/entrypoint.sh entrypoint.sh
COPY ogc2ckan ogc2ckan
COPY data data

ENTRYPOINT ["/bin/bash", "./entrypoint.sh"]
2 changes: 0 additions & 2 deletions ckan-ogc/Dockerfile.dev
@@ -27,8 +27,6 @@ RUN pdm install --no-self --group prod

COPY ckan-ogc/conf/config.yaml.template config.yaml
COPY ckan-ogc/docker-entrypoint.d/entrypoint_dev.sh entrypoint.sh
COPY ogc2ckan ogc2ckan
COPY data data

EXPOSE 5678/TCP

5 changes: 0 additions & 5 deletions ckan-ogc/Dockerfile.ghcr
@@ -14,13 +14,8 @@ ENV TIMEOUT=300
ENV SSL_UNVERIFIED_MODE=False

WORKDIR ${APP_DIR}
COPY pyproject.toml pdm.lock .

RUN pdm install --no-self --group prod

COPY ckan-ogc/conf/config.yaml.template config.yaml
COPY ckan-ogc/docker-entrypoint.d/entrypoint.sh entrypoint.sh
COPY ogc2ckan ogc2ckan
COPY data data

ENTRYPOINT ["/bin/bash", "./entrypoint.sh"]
5 changes: 0 additions & 5 deletions ckan-ogc/Dockerfile.ghcr.dev
@@ -14,14 +14,9 @@ ENV TIMEOUT=300
ENV SSL_UNVERIFIED_MODE=False

WORKDIR ${APP_DIR}
COPY pyproject.toml pdm.lock .

RUN pdm install --no-self --group prod

COPY ckan-ogc/conf/config.yaml.template config.yaml
COPY ckan-ogc/docker-entrypoint.d/entrypoint_dev.sh entrypoint.sh
COPY ogc2ckan ogc2ckan
COPY data data

EXPOSE 5678/TCP
