PortalJS Harvesters

Extendable harvester framework built with TypeScript. Harvest data from a variety of sources into your PortalJS portal.

Includes out-of-the-box support for 🌀 PortalJS Cloud.

Built-in Harvesters

The following sources are supported out-of-the-box:

CKAN
DKAN
🚧 Socrata Open Data
🚧 OpenDataSoft (ODS)
🚧 ArcGIS Hub/Portal
🚧 Dataverse Repository

Running Harvesters

You can run this tool on any platform that supports Node.js, such as GitHub Actions.

Install dependencies with npm install
Set up the environment variables as described in the Configuration section
Run npm run start

See the GitHub Action example.

Configuration

The following environment variables can be used to configure the tool:

HARVESTER_NAME - E.g., "CkanHarvester". Literally the name of the harvester class as defined in ./src/harvesters.
SOURCE_API_URL - E.g., "http://ckan.com". The source URL from which you want to harvest datasets.
SOURCE_API_KEY - (Optional) Used for authenticated requests when private data should be harvested.
PORTALJS_CLOUD_API_URL - (Optional) Defaults to https://api.cloud.portaljs.com/.
PORTALJS_CLOUD_MAIN_ORG - The name of your main organization in PortalJS Cloud.
PORTALJS_CLOUD_API_KEY - You can create PortalJS Cloud API keys in your PortalJS Cloud account profile.
DRY_RUN - (Optional) Set to true to log the harvested data instead of ingesting it. Leave it unset for a normal run.

You can set these environment variables either with a .env file or in the runner's environment.
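
As an illustration of the settings above, the following is a minimal sketch of how they could be validated at startup. It is not the project's config.ts: the HarvesterEnv type and the parseEnv function are hypothetical names, and the sketch assumes that every variable not marked optional above is required.

```typescript
// Hypothetical shape of the documented settings; the real config.ts may differ.
interface HarvesterEnv {
  HARVESTER_NAME: string;
  SOURCE_API_URL: string;
  SOURCE_API_KEY?: string;
  PORTALJS_CLOUD_API_URL: string;
  PORTALJS_CLOUD_MAIN_ORG: string;
  PORTALJS_CLOUD_API_KEY: string;
  DRY_RUN: boolean;
}

// Assumption: variables that the list above does not mark as optional are required.
const REQUIRED_KEYS = [
  "HARVESTER_NAME",
  "SOURCE_API_URL",
  "PORTALJS_CLOUD_MAIN_ORG",
  "PORTALJS_CLOUD_API_KEY",
] as const;

function parseEnv(raw: Record<string, string | undefined>): HarvesterEnv {
  // Collect every missing variable so the error reports them all at once.
  const missing = REQUIRED_KEYS.filter((key) => !raw[key]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return {
    HARVESTER_NAME: raw.HARVESTER_NAME as string,
    SOURCE_API_URL: raw.SOURCE_API_URL as string,
    SOURCE_API_KEY: raw.SOURCE_API_KEY,
    // Documented default for the PortalJS Cloud API URL.
    PORTALJS_CLOUD_API_URL: raw.PORTALJS_CLOUD_API_URL || "https://api.cloud.portaljs.com/",
    PORTALJS_CLOUD_MAIN_ORG: raw.PORTALJS_CLOUD_MAIN_ORG as string,
    PORTALJS_CLOUD_API_KEY: raw.PORTALJS_CLOUD_API_KEY as string,
    // Only the literal string "true" enables a dry run.
    DRY_RUN: raw.DRY_RUN === "true",
  };
}
```

In a Node process this would typically be called as parseEnv(process.env) after the .env file has been loaded.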

Development

For development and testing harvesters locally:

Clone this repo
Install dependencies with npm i
Duplicate .env.example and rename it to .env
Customize the .env file as needed (see Configuration)
Start harvesting with npm run start

Tip

Dry runs are supported via the DRY_RUN=true environment variable
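
To make the effect of a dry run concrete, here is a small sketch of an ingest loop that logs instead of writing when the flag is set. The Dataset type, the ingestAll function, and its upsert and log callbacks are hypothetical stand-ins, not the tool's actual internals.

```typescript
// Illustrative dataset shape; the real PortalJS Cloud schema has more fields.
interface Dataset {
  name: string;
  title: string;
}

// Hypothetical ingest loop: in a dry run nothing is written, datasets are only logged.
// Returns the number of datasets that were actually written.
function ingestAll(
  datasets: Dataset[],
  dryRun: boolean,
  upsert: (dataset: Dataset) => void,
  log: (message: string) => void = console.log,
): number {
  let written = 0;
  for (const dataset of datasets) {
    if (dryRun) {
      log(`[dry run] would upsert ${dataset.name}`);
      continue;
    }
    upsert(dataset);
    written += 1;
  }
  return written;
}
```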

Extending

This tool is built to be extendable by design.

It can be customized to harvest data from any source by extending either one of the built-in harvesters or the base harvester.

A common use case is harvesting data from a CKAN instance that uses a custom metadata schema.

In this case, you can create a new harvester that extends the CKAN harvester and overrides the source-to-target mapping, as shown in the example below.

Creating a Custom Harvester

Create a new file in the src/harvesters/ directory.
Extend BaseHarvester (or any other pre-built harvester class) and decorate it with @Harvester.
Implement overrides:
- getSourceDatasets() → Fetch and return all datasets from your source.
- mapSourceDatasetToTarget() → Convert source dataset schema into the PortalJS Cloud dataset schema.
Set HARVESTER_NAME=YourCustomHarvester in .env and run. The name of your custom harvester is simply the name of the class that defines it.
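
The last step relies on the class name being the lookup key. One way such a lookup could work is sketched below. This is an assumption about the mechanism, not the project's actual @Harvester implementation: the registry map and the resolveHarvester function are hypothetical, and the decorator is called as a plain function so the sketch does not depend on a particular decorator compiler setting.

```typescript
// Any class can be registered; real harvesters would extend BaseHarvester.
type HarvesterClass = new (...args: any[]) => object;

// Hypothetical registry keyed by class name.
const registry = new Map<string, HarvesterClass>();

// Stand-in for the @Harvester decorator: records the class under its own name.
function Harvester<T extends HarvesterClass>(ctor: T): T {
  registry.set(ctor.name, ctor);
  return ctor;
}

// Resolve the value of HARVESTER_NAME to a registered class.
function resolveHarvester(name: string): HarvesterClass {
  const ctor = registry.get(name);
  if (!ctor) {
    const known = Array.from(registry.keys()).join(", ");
    throw new Error(`Unknown harvester "${name}". Registered harvesters: ${known}`);
  }
  return ctor;
}

class YourCustomHarvester {}
// Roughly what decorating the class with @Harvester would do.
Harvester(YourCustomHarvester);

const selected = resolveHarvester("YourCustomHarvester");
```

Under this scheme, renaming the class also changes the value that HARVESTER_NAME must be set to.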

The base harvester handles concurrency, rate limiting, retries, and upserts, but all of these can be freely overridden and customized.
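
For readers unfamiliar with these concerns, the sketch below shows one generic way to combine retries with bounded concurrency. It does not reproduce the base harvester's code: withRetries, mapWithConcurrency, and their default values are hypothetical, and exponential backoff is an assumed retry strategy.

```typescript
// Retry an async task, doubling the wait after each failed attempt.
async function withRetries<T>(
  task: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 100,
): Promise<T> {
  let attempt = 0;
  for (;;) {
    try {
      return await task();
    } catch (err) {
      // Give up once the retry budget is spent and surface the last error.
      if (attempt >= maxRetries) throw err;
      const delayMs = baseDelayMs * 2 ** attempt;
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      attempt += 1;
    }
  }
}

// Process items with at most `limit` workers in flight, preserving input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function run(): Promise<void> {
    // Each runner claims the next unprocessed index until none remain.
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => run()));
  return results;
}
```

A harvester could, for example, wrap each dataset upsert in withRetries and feed the full dataset list through mapWithConcurrency.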

Example: Harvester for a CKAN instance with a custom dataset metadata schema

import { CkanPackage } from "@/schemas/ckanPackage";
import { PortalJsCloudDataset } from "@/schemas/portaljs-cloud";
import { Harvester } from ".";
import { BaseHarvesterConfig } from "./base";
import { CkanHarvester } from "./ckan";
import { env } from "../../config";

type CustomCkanPortalDataset = CkanPackage & {
  data_owner_email: string;
};

@Harvester
class CustomCkanPortalHarvester extends CkanHarvester<CustomCkanPortalDataset> {
  constructor(args: BaseHarvesterConfig) {
    super(args);
  }

  mapSourceDatasetToTarget(pkg: CustomCkanPortalDataset): PortalJsCloudDataset {
    const owner_org = env.PORTALJS_CLOUD_MAIN_ORG;
    return {
      owner_org,
      name: `${owner_org}--${pkg.name}`,
      title: pkg.title,
      notes: pkg.notes || "no description",
      resources: (pkg.resources || []).map((r: any) => ({
        name: r.name,
        url: r.url,
        format: r.format,
        ...(r.id ? { id: r.id } : {}),
      })),
      language: pkg.language || "EN",
      contact_point: pkg.data_owner_email // <== Custom field to PortalJS Cloud mapping
    };
  }
}

export { CustomCkanPortalHarvester };
