Home
The Sunshine State Digital Network harvester is composed of two tools:
- manatus - a configurable data harvesting client
- Apache Airflow - an open-source workflow scheduler used here to run the ETL tasks
manatus performs both the harvesting and the transformation of data. Airflow schedules and monitors the individual tasks. Here is a visual overview of all of the pieces:
Airflow organizes data pipelines into documents called DAGs (directed acyclic graphs). DAGs are Python files containing instructions for individual tasks. Tasks can be anything from Python functions to bash shell commands. Additionally, the DAG ties the tasks together using logic that defines upstream and downstream tasks, dependencies, and instructions for handling failed tasks.
DAGs can be scheduled to run automatically at set intervals, or they can be triggered manually from the Airflow dashboard.
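To make the idea of upstream and downstream tasks concrete, here is a minimal sketch in plain Python. It does not use Airflow's API; it only illustrates how a dependency map determines the order in which tasks run. The task names (`harvest`, `transform`, `report`) and the `run_pipeline` helper are hypothetical and are not taken from ssdn_dags.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical task functions; the real SSDN tasks are defined in ssdn_dags.
def harvest():
    return "harvested"


def transform():
    return "transformed"


def report():
    return "reported"


TASKS = {"harvest": harvest, "transform": transform, "report": report}

# Each key maps a task to the set of upstream tasks it depends on.
DEPENDENCIES = {
    "harvest": set(),
    "transform": {"harvest"},
    "report": {"transform"},
}


def run_pipeline(tasks, dependencies):
    """Run each task only after all of its upstream tasks have run."""
    order = list(TopologicalSorter(dependencies).static_order())
    results = {name: tasks[name]() for name in order}
    return order, results


order, results = run_pipeline(TASKS, DEPENDENCIES)
print(order)
```

In an actual Airflow DAG the same relationships are declared between operator objects, and the scheduler, rather than a loop like this one, decides when each task runs and what happens when one fails.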
Manatus is a metadata harvester written in Python. Harvests and transforms are controlled through configuration files or the command line. Manatus supports multiple metadata formats and custom partner metadata mapping.
See the manatus documentation for instructions on initial set up and running.
ssdn_manatus_configs stores the harvesting and transformation settings required by manatus. Partner data sources and transformation options are encoded here.
ssdn_manatus_maps is the repository holding the maps used by manatus for data transformation.
ssdn_dags contains the information Airflow needs to instantiate and control the data harvest and transform tasks.