Analysis job authoring and organization

Analysis Jobs

Motivation

Analysis jobs are tasks that process image and other captured data to extract information or make inferences about a tree capture. Analysis jobs can, and often will, have dependencies on outputs from other analysis jobs. Analysis jobs need to be started automatically when a new tree capture enters the system, automatically run to completion, and have all outputs stored and attached to the tree capture. Therefore, analysis jobs must run in a reliable pipeline. Additionally, formatting a coded analysis for inclusion in this pipeline needs to be straightforward for the data scientists researching and coding new analyses.

Technology

We will use Apache Airflow to implement the treetracker analysis pipeline. Airflow is an open source platform that facilitates the execution of analysis jobs with complex dependency relationships. It is a Python 3 based platform, and information is available at https://airflow.apache.org/.
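
For orientation, a minimal DAG along the following lines could express such a pipeline. The task names, callables, schedule, and trigger mechanism are placeholders for illustration, not decisions:

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def job_1():
    # placeholder callable standing in for an analysis job
    return {"value1": 42}


def job_2():
    # placeholder callable standing in for a dependent analysis job
    return {"value2": "done"}


with DAG(
    dag_id="treetracker_analysis_pipeline",  # hypothetical name
    start_date=datetime(2020, 8, 1),
    schedule_interval=None,  # triggered per tree capture rather than on a schedule
) as dag:
    task_1 = PythonOperator(task_id="job-name-1", python_callable=job_1)
    task_2 = PythonOperator(task_id="job-name-2", python_callable=job_2)

    task_1 >> task_2  # job-name-2 depends on outputs from job-name-1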

Authoring jobs

Jobs will be authored in Python 3 and stored in a Greenstand GitHub repository, probably https://github.com/Greenstand/treetracker-analysis-jobs. The repository will be structured as follows:

/README.md
/job-name-1
     analysis.py
     upstream-jobs.py
     lib/
        ... other files necessary for this analysis       
/job-name-2
... etc

Each analysis will thus be self-contained, using upstream-jobs.py to list the names of the jobs whose outputs it depends on.
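
The exact format of upstream-jobs.py is still open; one minimal possibility is a module-level list of job names (the variable name below is only a suggestion):

# upstream-jobs.py for a hypothetical job-name-2
# Lists the jobs whose outputs this analysis depends on.
UPSTREAM_JOBS = [
    "job-name-1",
]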

We will need to identify a package management system for loading all required shared library dependencies for jobs.

Data scientists implementing jobs can submit new jobs in separate PRs.

Format of analysis.py

analysis.py should contain a single function, which is the entry point to the job and returns a dictionary of outputs from the job. A skeleton follows:

def runAnalysis(treeCaptureData, upstreamAnalysisOutputs):
    # ... compute analysis ...
    value1 = ...
    value2 = ...

    return {
        "value1": value1,
        "value2": value2,
    }
treeCaptureData - contains all data captured in the field, including URLs of tree images
upstreamAnalysisOutputs - a dictionary of previous analysis outputs that are available for this tree capture at run time
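
As a hypothetical illustration of this contract (the upstream job name and output keys are invented for the example, not part of any real job), a concrete analysis.py might look like:

# Hypothetical analysis.py; keys and upstream job name are illustrative only.
def runAnalysis(treeCaptureData, upstreamAnalysisOutputs):
    # Pull an output from an upstream job, if it ran for this capture.
    leaf_score = upstreamAnalysisOutputs.get("leaf-detection", {}).get("score", 0.0)

    # Combine it with data captured in the field.
    has_note = bool(treeCaptureData.get("note"))

    return {
        "leaf_score": leaf_score,
        "has_note": has_note,
    }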

Storage of outputs

Analysis outputs will be stored in Postgres as JSON, using the UUID of the tree capture and the name of the analysis as a compound unique key.
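
A sketch of what this could look like with SQLAlchemy (the table and column names are assumptions, not a settled schema):

from sqlalchemy import Column, Integer, String, UniqueConstraint
from sqlalchemy.dialects.postgresql import JSONB, UUID
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class AnalysisOutput(Base):
    __tablename__ = "analysis_output"  # hypothetical table name

    id = Column(Integer, primary_key=True)
    tree_capture_uuid = Column(UUID(as_uuid=True), nullable=False)
    analysis_name = Column(String, nullable=False)
    output = Column(JSONB, nullable=False)  # the dictionary returned by runAnalysis

    __table_args__ = (
        UniqueConstraint("tree_capture_uuid", "analysis_name",
                         name="uq_tree_capture_analysis"),
    )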

Images

For the base implementation, images will need to be loaded by URL for each analysis that needs them. We could also consider implementing a local LRU cache for images that are used repeatedly by our Airflow cluster, since loading the same data from S3 repeatedly incurs transfer costs and latency.
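
A minimal in-process sketch of such a cache, assuming images are fetched over HTTP with requests (cache size and timeout are placeholders); a cache shared across the Airflow cluster would need more than this:

from functools import lru_cache

import requests


@lru_cache(maxsize=256)  # placeholder size; repeated URLs are served from memory
def load_image_bytes(url):
    # Fetch the image once; later calls with the same URL reuse the cached bytes.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.content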

GIS Data

Some analyses may need to make use of GIS data beyond that provided by treeCaptureData and stored in Postgres/PostGIS. In this case analysis.py will need access to a database connection configured in Airflow, so runAnalysis or analysis.py will need a mechanism for handling this connection, which makes use of the SQLAlchemy ORM. We could consider implementing each analysis.py as an object that takes external context in a constructor, or simply continue with the functional approach by passing context to runAnalysis().
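
Both options are sketched below, assuming a SQLAlchemy session built from a connection configured in Airflow (the connection URL and query are placeholders):

from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

# The connection URL would come from Airflow configuration, not be hard-coded.
engine = create_engine("postgresql://user:password@host/treetracker")
Session = sessionmaker(bind=engine)


# Option 1: keep the functional style and pass the session as extra context.
def runAnalysis(treeCaptureData, upstreamAnalysisOutputs, session):
    rows = session.execute(text("SELECT 1"))  # placeholder for a real PostGIS query
    return {"row_count": len(list(rows))}


# Option 2: wrap the analysis in an object that takes external context in its constructor.
class Analysis:
    def __init__(self, session):
        self.session = session

    def runAnalysis(self, treeCaptureData, upstreamAnalysisOutputs):
        rows = self.session.execute(text("SELECT 1"))
        return {"row_count": len(list(rows))}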