Extendable harvester framework built with TypeScript. Harvest data from a variety of sources into your PortalJS portal.
With out-of-the-box support for 🌀 PortalJS Cloud.
The following sources are supported out-of-the-box:
You can run this tool on any platform that supports Node, such as GitHub Actions.
- Install dependencies with npm install
- Set up the environment variables according to the configuration section
- Run npm run start
See the GitHub Action example.
The following environment variables can be used to configure the tool:
- HARVESTER_NAME
  - E.g., "CkanHarvester". Literally the name of the harvester class as defined in ./src/harvesters.
- SOURCE_API_URL
  - E.g., "http://ckan.com". The source URL from which you want to harvest datasets.
- SOURCE_API_KEY
  - (Optional) Used for authenticated requests when private data should be harvested.
- PORTALJS_CLOUD_API_URL
  - (Optional) Defaults to https://api.cloud.portaljs.com/.
- PORTALJS_CLOUD_MAIN_ORG
  - The name of your main organization in PortalJS Cloud.
- PORTALJS_CLOUD_API_KEY
  - You can create PortalJS Cloud API keys in your PortalJS Cloud account profile.
- DRY_RUN
  - (Optional) Whether data should be ingested or just logged. Either true or undefined.
You can set these environment variables either with a .env file or in the runner's environment.
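To make the variables above more concrete, here is a minimal sketch of how they could be read and validated in TypeScript. It is not the tool's actual configuration code: the names HarvesterSettings, RawEnv, requireVar and readSettings are hypothetical, and treating every variable not marked "(Optional)" as required is an assumption drawn from the list above.

```typescript
// Hypothetical settings shape; the fields mirror the environment variables listed above.
interface HarvesterSettings {
  harvesterName: string;
  sourceApiUrl: string;
  sourceApiKey: string | undefined;
  portalJsCloudApiUrl: string;
  portalJsCloudMainOrg: string;
  portalJsCloudApiKey: string;
  dryRun: boolean;
}

// A plain key/value map, so the sketch does not depend on Node typings.
type RawEnv = Record<string, string | undefined>;

// Returns the value of a variable or throws if it is missing or blank.
function requireVar(raw: RawEnv, key: string): string {
  const value = raw[key];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${key}`);
  }
  return value;
}

// Assumption: variables not marked "(Optional)" in the list are required.
function readSettings(raw: RawEnv): HarvesterSettings {
  return {
    harvesterName: requireVar(raw, "HARVESTER_NAME"),
    sourceApiUrl: requireVar(raw, "SOURCE_API_URL"),
    sourceApiKey: raw["SOURCE_API_KEY"],
    // Default taken from the configuration list above.
    portalJsCloudApiUrl: raw["PORTALJS_CLOUD_API_URL"] ?? "https://api.cloud.portaljs.com/",
    portalJsCloudMainOrg: requireVar(raw, "PORTALJS_CLOUD_MAIN_ORG"),
    portalJsCloudApiKey: requireVar(raw, "PORTALJS_CLOUD_API_KEY"),
    // DRY_RUN is documented as either true or undefined.
    dryRun: raw["DRY_RUN"] === "true",
  };
}
```

In a Node process this sketch would be called as readSettings(process.env); passing the map explicitly also makes it easy to exercise with hand-written values.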
For development and testing harvesters locally:
- Clone this repo
- Install dependencies with npm i
- Duplicate .env.example and rename it to .env
- Customize the .env as you'd like (see configuration)
- Start harvesting with npm run start
Tip
Dry runs are supported via the DRY_RUN=true environment variable.
This tool is built to be extendable by design.
It can be customized to harvest data from any source by extending either one of the built-in harvesters or the base harvester.
One common use case is harvesting data from a CKAN instance that uses a custom metadata schema.
In this case, you could simply create a new harvester extending the CKAN harvester and override the Source to Target mapping, as shown in the example below.
- Create a new file in the src/harvesters/ directory.
- Extend BaseHarvester (or any other pre-built harvester class) and decorate it with @Harvester.
- Implement overrides:
  - getSourceDatasets() → Fetch and return all datasets from your source.
  - mapSourceDatasetToTarget() → Convert the source dataset schema into the PortalJS Cloud dataset schema.
- Set HARVESTER_NAME=YourCustomHarvester in .env and run. The name of your custom harvester is simply the name of the class that defines it.
The base harvester handles concurrency, rate limiting, retries, and upserts, but all of these can be freely overridden and customized.
import { CkanPackage } from "@/schemas/ckanPackage";
import { PortalJsCloudDataset } from "@/schemas/portaljs-cloud";
import { Harvester } from ".";
import { BaseHarvesterConfig } from "./base";
import { CkanHarvester } from "./ckan";
import { env } from "../../config";

// Source schema: a standard CKAN package plus one custom field.
type CustomCkanPortalDataset = CkanPackage & {
  data_owner_email: string;
};

@Harvester
class CustomCkanPortalHarvester extends CkanHarvester<CustomCkanPortalDataset> {
  constructor(args: BaseHarvesterConfig) {
    super(args);
  }

  // Override only the source-to-target mapping; fetching is inherited from CkanHarvester.
  mapSourceDatasetToTarget(pkg: CustomCkanPortalDataset): PortalJsCloudDataset {
    const owner_org = env.PORTALJS_CLOUD_MAIN_ORG;
    return {
      owner_org,
      name: `${owner_org}--${pkg.name}`,
      title: pkg.title,
      notes: pkg.notes || "no description",
      resources: (pkg.resources || []).map((r: any) => ({
        name: r.name,
        url: r.url,
        format: r.format,
        ...(r.id ? { id: r.id } : {}),
      })),
      language: pkg.language || "EN",
      contact_point: pkg.data_owner_email // <== Custom field to PortalJS Cloud mapping
    };
  }
}

export { CustomCkanPortalHarvester };
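
If you decide to override the concurrency or retry behavior mentioned above, the following standalone sketch shows one generic way to combine retries with exponential backoff and a concurrency limit. It is not the base harvester's implementation: sleep, withRetries and mapWithConcurrency are hypothetical helpers, and the default attempt count and delay are illustrative assumptions.

```typescript
// Resolves after the given number of milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Runs a task, retrying on failure with exponential backoff
// (baseDelayMs, 2 * baseDelayMs, 4 * baseDelayMs, ...).
async function withRetries<T>(
  task: () => Promise<T>,
  maxAttempts = 3, // illustrative default, not taken from the framework
  baseDelayMs = 100 // illustrative default, not taken from the framework
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        await sleep(baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  throw lastError;
}

// Applies worker to every item with at most `limit` calls in flight,
// preserving the input order in the returned array.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  const runners: Promise<void>[] = [];
  for (let i = 0; i < runnerCount; i++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results;
}
```

A custom harvester could, for example, wrap each upload in withRetries and pass the mapped datasets through mapWithConcurrency. Whether and where the base harvester exposes hooks for this depends on its actual API, which is not shown here.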