Analysis job authoring and organization

# Analysis Jobs

## Motivation

Analysis jobs are tasks that process image data and other captured data to extract information or make inferences about a tree capture. Analysis jobs can, and often will, depend on the outputs of other analysis jobs. Analysis jobs need to start automatically when a new tree capture enters the system, run to completion without intervention, and have all of their outputs stored and attached to the tree capture. Analysis jobs must therefore run in a reliable pipeline. Additionally, packaging a coded analysis for inclusion in this pipeline needs to be straightforward for the data scientists researching and coding new analyses.

## Technology

We will use Apache Airflow to implement the treetracker analysis pipeline. Airflow is an open source platform that orchestrates the execution of jobs with complex dependency relationships. It is a Python 3 based platform; documentation is available at https://airflow.apache.org/.
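
For orientation, an Airflow pipeline is defined as a DAG (directed acyclic graph) of tasks, with dependencies declared between them. The sketch below shows the general shape of such a definition using the Airflow 1.10 `PythonOperator`; the DAG id, task names, and callables are hypothetical placeholders, not the actual treetracker jobs.

```python
# Minimal Airflow DAG sketch. The DAG id, task ids, and callables are
# hypothetical placeholders, not part of the actual treetracker pipeline.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def segment_tree(**context):
    # Placeholder: would load the capture image and segment the tree.
    pass


def estimate_species(**context):
    # Placeholder: would infer species from the segmentation output.
    pass


dag = DAG(
    dag_id="tree_capture_analysis",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,  # triggered per tree capture, not on a schedule
)

segmentation = PythonOperator(
    task_id="segmentation",
    python_callable=segment_tree,
    provide_context=True,
    dag=dag,
)

species = PythonOperator(
    task_id="species_estimation",
    python_callable=estimate_species,
    provide_context=True,
    dag=dag,
)

# species_estimation depends on the output of segmentation
segmentation >> species
```

Given such a definition, Airflow handles scheduling, retries, and dependency ordering, and a run can be triggered externally for each new tree capture.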

## Authoring jobs

Jobs will be authored in Python 3 and stored in a Greenstand GitHub repository, probably https://github.com/Greenstand/treetracker-analysis-jobs. The repository will be structured as follows:

```
/README.md
/job-name-1
    analysis.py
    upstream-jobs.py
    lib/
        ... other files necessary for this analysis
/job-name-2
... etc
```
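
The exact contents of `analysis.py` and `upstream-jobs.py` are not yet specified. One plausible convention, sketched below purely for illustration, is for `upstream-jobs.py` to declare the names of the jobs this analysis depends on, and for `analysis.py` to expose a single entry point the pipeline calls with the capture data and upstream outputs. All identifiers here are hypothetical.

```python
# upstream-jobs.py -- hypothetical convention for declaring dependencies.
# Job names whose outputs this analysis consumes; an empty list would mean
# the job runs directly on the raw capture data.
UPSTREAM_JOBS = ["segmentation"]
```

```python
# analysis.py -- hypothetical entry point the pipeline would invoke.

def run(capture, upstream_outputs):
    """Analyze one tree capture.

    capture: dict of capture metadata (e.g. image URL, GPS coordinates).
    upstream_outputs: dict mapping upstream job names to their outputs.
    Returns a dict of outputs to be stored and attached to the capture.
    """
    mask = upstream_outputs["segmentation"]
    # ... analysis logic using the capture image and the mask ...
    return {"summary": "not yet implemented"}
```

Under such a convention, a generator script could walk the repository, read each job's `UPSTREAM_JOBS` list, and emit the corresponding Airflow task dependencies automatically, so that data scientists would not need to write DAG code by hand.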