SingularData Data Engine

This repository hosts the data harvesting engine for SingularData.

Idea

Many open data providers offer an API that exposes the full list of their published datasets. A program can collect metadata from these APIs and build a unified search index.

A survey of data providers shows that the following platforms provide standardized APIs that third-party developers can harvest.

[image: supported platforms]

Design

A continuously running data harvesting system is built on several AWS services:

  • S3 to store the data source list

  • Lambda function to publish data harvesting jobs weekly

  • SQS to store data harvesting jobs for future execution

  • ECS to host a dockerized Node.js service that continuously executes data harvesting jobs and updates the search index with harvested metadata

  • Elasticsearch to provide the search service

[image: system design]

The workflow of the system is as follows:

  1. The bootstrapper lambda function reads the data source list from S3 and publishes a series of FetchSource jobs to the SQS queue (sketched after this list). This function is scheduled to run every week.

  2. The data engine keeps pulling jobs from the SQS queue and executes them based on their type (see the next section).

  3. The search index is updated every time a harvesting job completes.
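
For illustration, a minimal sketch of the bootstrapper function in TypeScript, assuming AWS SDK v3 clients and hypothetical bucket, key, and queue names (the real source list format may differ):

// Sketch of the bootstrapper Lambda: read the source list from S3 and
// publish one FetchSource job per source. Bucket, key, and queue URL
// are hypothetical.
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const s3 = new S3Client({});
const sqs = new SQSClient({});

export const handler = async (): Promise<void> => {
  // Read the data source list from S3.
  const object = await s3.send(
    new GetObjectCommand({ Bucket: "singulardata-sources", Key: "sources.json" })
  );
  const sources: { name: string; url: string }[] = JSON.parse(
    (await object.Body?.transformToString()) ?? "[]"
  );

  // Publish one FetchSource job per data source.
  for (const source of sources) {
    await sqs.send(
      new SendMessageCommand({
        QueueUrl: process.env.JOB_QUEUE_URL,
        MessageBody: JSON.stringify({ type: "FetchSource", data: source }),
      })
    );
  }
};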

Harvesting Engine

The harvesting engine is a program that continuously pulls jobs from the SQS queue and processes them. A job has the following data structure:

{
  // message id used by aws sdk
  "messageId": "SQS message id",
  // type of job
  "type": "job type",
  "data": {
    // type-specific data
  }
}
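
One way to express this shape in TypeScript (the per-type data payloads below are assumptions based on the job descriptions that follow):

// Possible typing of a job; the payload interfaces are assumptions.
type JobType = "FetchSource" | "FetchDataset" | "IndexDataset";

interface Job<T = unknown> {
  messageId: string; // SQS message id, used to delete the message after processing
  type: JobType;     // job type, determines which handler runs
  data: T;           // type-specific payload
}

// Hypothetical payloads for each job type.
interface FetchSourceData { name: string; url: string; }
interface FetchDatasetData { url: string; }
interface IndexDatasetData { datasets: Record<string, unknown>[]; } // DCAT metadata ready for indexing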

Each job type has different handling logic (a sketch of the dispatch loop follows this list):

  • For the FetchSource job, the data engine sends a request to the given data source and retrieves all scrapable URLs. A FetchDataset job is published for each URL for data scraping.

  • For the FetchDataset job, the data engine will

  1. download all dataset metadata from the given url
  2. filter out already existing dataset metadata
  3. convert new dataset metadata into the W3C DCAT schema
  4. publish the new dataset metadata in IndexDataset jobs

  • For the IndexDataset job, the data engine will index all dataset metadata with a bulk index request.
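
A minimal sketch of the pull-and-process loop in TypeScript, assuming AWS SDK v3 and hypothetical handler functions for each job type:

// Sketch of the engine's main loop: long-poll SQS, dispatch on job type,
// then delete the message. The queue URL and handlers are assumptions.
import {
  SQSClient,
  ReceiveMessageCommand,
  DeleteMessageCommand,
} from "@aws-sdk/client-sqs";

// Hypothetical per-type handlers; real implementations live in the engine.
declare function fetchSource(data: unknown): Promise<void>;
declare function fetchDataset(data: unknown): Promise<void>;
declare function indexDataset(data: unknown): Promise<void>;

const sqs = new SQSClient({});
const queueUrl = process.env.JOB_QUEUE_URL;

async function run(): Promise<void> {
  while (true) {
    // Long-poll the queue for the next batch of jobs.
    const { Messages = [] } = await sqs.send(
      new ReceiveMessageCommand({
        QueueUrl: queueUrl,
        MaxNumberOfMessages: 10,
        WaitTimeSeconds: 20,
      })
    );

    for (const message of Messages) {
      const job = JSON.parse(message.Body ?? "{}");

      // Dispatch on the job type.
      switch (job.type) {
        case "FetchSource":
          await fetchSource(job.data);  // publish FetchDataset jobs
          break;
        case "FetchDataset":
          await fetchDataset(job.data); // download, filter, convert to DCAT, publish IndexDataset jobs
          break;
        case "IndexDataset":
          await indexDataset(job.data); // bulk index into Elasticsearch
          break;
      }

      // Remove the message once the job has been handled.
      await sqs.send(
        new DeleteMessageCommand({
          QueueUrl: queueUrl,
          ReceiptHandle: message.ReceiptHandle,
        })
      );
    }
  }
}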

License

MIT