From 6b29898ea209e4b6da94237049147dc260a43546 Mon Sep 17 00:00:00 2001 From: Luca Cinquini Date: Fri, 10 Nov 2023 09:11:45 -0700 Subject: [PATCH 01/10] Adding ADR for integrating Apache Airflow --- ...sing-service-apache-airflow-integration.md | 48 +++++++++++++++++++ 1 file changed, 48 insertions(+) create mode 100644 architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md diff --git a/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md b/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md new file mode 100644 index 0000000..61d6412 --- /dev/null +++ b/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md @@ -0,0 +1,48 @@ +# Data Services custom metadata search + +## Data Service: Custom Metadata Searchability + +### **Status** + +What is the status, such as proposed, accepted, rejected, deprecated, superseded, etc.? Maintain the Date in this section and previous statuses as well: + +| Status | Date | +| -------- | ---------- | +| Proposed | 10/12/2023 | +| | | + +### **Context** + +As we begin to populate collection specific metadata within the data catalog, users want to be able to search for this information. We have 2 realistic options for this searching- one is to update or use the DAPA items search to create queries that can search custom metadata, the other is to expose the Elastic Search directly to users for searching and filtering on whatever metadata they prefer. 
+ +### Alternatives + +Option 1 - Expose Elastic Search cluster directly to user for custom metadata search + +Option 2 - Utilize Common Query Language (CQL) for custom metadata filtering **(proposed solution)** + +### **Decision and Rationale** + +Ultimately, the decision to expose elastic search directly to the user, while preferable from a technical level and the ability to filter/aggregate searches using elastic search queries is ideal, the custom metadata and collection/item metadata are housed in multiple elastic search instances and cannot be cross queries. For now, the CQL filter will be used to hide the multi-es instances behind the scenes. We will revisit this if/when the Elastic Search database become unified. + +The proposed solution also does not require exposing (and thus understanding the auth model) of ES. + +Furthermore, the ability to create STAC outputs from elasict search queries is not supported currently, and so some way of mapping the ES direct queries to elastic search qould be required to _use_ the results. + +### **Impacts** + +CQL will need to be documented and tested. Common filters (e.g. a single value, string, etc) might be pretty straight forward but complex metadata (e.g. nested JSON) might not be supported. This does allow us to migrate to different technologies (e.g. elastic search, databases, etc) in the future without impacting the users. + +The use of CQL will require development effort within the DAPA request and we're not sure this will be supported by the process mapper functionality. + +The CQL development will also duplicate native capabilties of elastic search and this was a primary concern with the cost of development. + +Lastly, the results will be returned via STAC so this should integrate with stage in requests as well where needed. + +### References + +(Optional) Any other references that make sense. Documentation links, other ADRs, etc. 
+ +{% embed url="https://docs.up42.com/developers/api-assets/stac-cql" %} + +{% embed url="https://pystac-client.readthedocs.io/en/latest/tutorials/cql2-filter.html" %} From ed29555e99741a8d833fe91a80ce7dee33a5910d Mon Sep 17 00:00:00 2001 From: Luca Cinquini Date: Fri, 10 Nov 2023 09:16:53 -0700 Subject: [PATCH 02/10] Update science-processing-service-apache-airflow-integration.md --- ...ssing-service-apache-airflow-integration.md | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md b/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md index 61d6412..79ec8af 100644 --- a/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md +++ b/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md @@ -1,20 +1,24 @@ -# Data Services custom metadata search - -## Data Service: Custom Metadata Searchability +## Science Processing Service: Apache Airflow Integration ### **Status** -What is the status, such as proposed, accepted, rejected, deprecated, superseded, etc.? Maintain the Date in this section and previous statuses as well: - | Status | Date | | -------- | ---------- | -| Proposed | 10/12/2023 | +| Proposed | 11/10/2023 | | | | ### **Context** -As we begin to populate collection specific metadata within the data catalog, users want to be able to search for this information. We have 2 realistic options for this searching- one is to update or use the DAPA items search to create queries that can search custom metadata, the other is to expose the Elastic Search directly to users for searching and filtering on whatever metadata they prefer. +In recent years, Apache Airflow has emerged as one of the leading open source orchestration engines for scalable jobs processing. 
Additionally, it is gaining attention and traction at JPL across several projects in Earth and Planetray sciences. We are proposig to integrate the Airflow architecture in the Unity model, as such: + +* The core components of Airflow (Web Server, Scheduler, Database) will compose the front-end EMS Unity layer (which provides orchestration and monitoring across multiple back-ends) +* The Airflow Operators will be used to submit workloads to multiple pluggable ADES back-ends (Celery Workers, EKS, ECS, etc.) +Additionally, Unity may decide to provide Airflow extensions as follows: +* An OGC WPS-T interface to allow clients to submit job requests that conform to this API specificiation +* An Airflow HySDS Operator to allow projects to execute workloads on the HySDS system +* An Airflow WPS-T Operatorn to allow projects to subnmit requests to any WPS-T compliant back-end + ### Alternatives Option 1 - Expose Elastic Search cluster directly to user for custom metadata search From 94b3d0a89fb42483fa018216054e49aa33804f3d Mon Sep 17 00:00:00 2001 From: Luca Cinquini Date: Fri, 10 Nov 2023 09:18:10 -0700 Subject: [PATCH 03/10] Update science-processing-service-apache-airflow-integration.md --- .../science-processing-service-apache-airflow-integration.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md b/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md index 79ec8af..2defea9 100644 --- a/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md +++ b/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md @@ -9,9 +9,9 @@ ### **Context** -In recent years, Apache Airflow has emerged as one of the leading open source orchestration engines for scalable jobs processing. 
Additionally, it is gaining attention and traction at JPL across several projects in Earth and Planetray sciences. We are proposig to integrate the Airflow architecture in the Unity model, as such: +In recent years, Apache Airflow has emerged as one of the leading open source orchestration engines for scalable jobs processing. Additionally, it is gaining attention and traction at JPL across several projects in Earth and Planetary sciences. We are proposig to integrate the Airflow architecture in the Unity model, as such: -* The core components of Airflow (Web Server, Scheduler, Database) will compose the front-end EMS Unity layer (which provides orchestration and monitoring across multiple back-ends) +* The core components of Airflow (Web Server, Scheduler, Database) will compose the front-end EMS layer (which provides orchestration and monitoring across multiple back-ends) * The Airflow Operators will be used to submit workloads to multiple pluggable ADES back-ends (Celery Workers, EKS, ECS, etc.) 
Additionally, Unity may decide to provide Airflow extensions as follows: From 596324d581ea79c487dd4885f7405f0a0fb59732 Mon Sep 17 00:00:00 2001 From: Luca Cinquini Date: Fri, 10 Nov 2023 09:22:51 -0700 Subject: [PATCH 04/10] Update science-processing-service-apache-airflow-integration.md --- ...sing-service-apache-airflow-integration.md | 21 ++++++++++++------- 1 file changed, 13 insertions(+), 8 deletions(-) diff --git a/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md b/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md index 2defea9..efab383 100644 --- a/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md +++ b/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md @@ -17,21 +17,26 @@ In recent years, Apache Airflow has emerged as one of the leading open source or Additionally, Unity may decide to provide Airflow extensions as follows: * An OGC WPS-T interface to allow clients to submit job requests that conform to this API specificiation * An Airflow HySDS Operator to allow projects to execute workloads on the HySDS system -* An Airflow WPS-T Operatorn to allow projects to subnmit requests to any WPS-T compliant back-end +* An Airflow WPS-T Operator to allow projects to submit requests to any WPS-T compliant back-end ### Alternatives -Option 1 - Expose Elastic Search cluster directly to user for custom metadata search +The following architectural options have been investigated to offer users the ability to leverage the Airflow functionality +(see the referenced presentation for details): -Option 2 - Utilize Common Query Language (CQL) for custom metadata filtering **(proposed solution)** +* Option 1: Integrate Airflow as simply a possible ADES back-end +* Option 2: Fork and maintain CWL-Airflow as a possible ADES back-end +* Option 3: Integrate Airflow as the Unity 
EMS layer and use Airflow operators to execute workloads on different ADES back-ends. This is the option that was chosen to provide the most functionality and long-term benefits. ### **Decision and Rationale** -Ultimately, the decision to expose elastic search directly to the user, while preferable from a technical level and the ability to filter/aggregate searches using elastic search queries is ideal, the custom metadata and collection/item metadata are housed in multiple elastic search instances and cannot be cross queries. For now, the CQL filter will be used to hide the multi-es instances behind the scenes. We will revisit this if/when the Elastic Search database become unified. - -The proposed solution also does not require exposing (and thus understanding the auth model) of ES. - -Furthermore, the ability to create STAC outputs from elasict search queries is not supported currently, and so some way of mapping the ES direct queries to elastic search qould be required to _use_ the results. 
+We propose to choose Option 3 above for the following reasons: +* It provides Unity with an EMS (orchestration) layer out-of-the-box, which otherwise Unity would have to custom develop (lengthy and costly) +* It offers the full Airflow functionality to our users +* It provides integration paths with the previous Unity APIs and workload engines, specifically: + * Supporting the OGC WPS-T spec + * Offering HySDS as processing engine +* Overall, it allows users the flexibility to author their workflows in pure Python (Airflow), or CWL, and to request execution via the WPS-T or Airflow APIs ### **Impacts** From 5b0d991b7040cdb22099c0d59534f0a7a8852a42 Mon Sep 17 00:00:00 2001 From: Luca Cinquini Date: Fri, 10 Nov 2023 09:23:48 -0700 Subject: [PATCH 05/10] Update science-processing-service-apache-airflow-integration.md --- .../science-processing-service-apache-airflow-integration.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md b/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md index efab383..6155f7e 100644 --- a/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md +++ b/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md @@ -24,7 +24,7 @@ Additionally, Unity may decide to provide Airflow extensions as follows: The following architectural options have been investigated to offer users the ability to leverage the Airflow functionality (see the referenced presentation for details): -* Option 1: Integrate Airflow as simply a possible ADES back-end +* Option 1: Integrate Airflow simply as a possible ADES back-end * Option 2: Fork and maintain CWL-Airflow as a possible ADES back-end * Option 3: Integrate Airflow as the Unity EMS layer and use Airflow operators to execute workloads on different ADES back-ends. 
This is the option that was chosen to provide the most functionality and long-term benefits. From 4f5b2e1406a2046b23e083f0d98d2f2f89ae9a0d Mon Sep 17 00:00:00 2001 From: Luca Cinquini Date: Fri, 10 Nov 2023 09:25:07 -0700 Subject: [PATCH 06/10] Update science-processing-service-apache-airflow-integration.md --- .../science-processing-service-apache-airflow-integration.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md b/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md index 6155f7e..5f1b5dc 100644 --- a/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md +++ b/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md @@ -34,8 +34,8 @@ We propose to choose Option 3 above for the following reasons: * It provides Unity with an EMS (orchestration) layer out-of-the-box, which otherwise Unity would have to custom develop (lengthy and costly) * It offers the full Airflow functionality to our users * It provides integration paths with the previous Unity APIs and workload engines, specifically: - * Supporting the OGC WPS-T spec - * Offering HySDS as processing engine + * Supporting the OGC WPS-T specification + * Offering HySDS as a possible processing engine * Overall, it allows users the flexibility to author their workflows in pure Python (Airflow), or CWL, and to request execution via the WPS-T or Airflow APIs ### **Impacts** From acc2938d553c7349380caf02be33a5fd39f30436 Mon Sep 17 00:00:00 2001 From: Luca Cinquini Date: Fri, 10 Nov 2023 09:27:33 -0700 Subject: [PATCH 07/10] Update science-processing-service-apache-airflow-integration.md --- ...-processing-service-apache-airflow-integration.md | 12 +++--------- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git 
a/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md b/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md index 5f1b5dc..20b9ff7 100644 --- a/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md +++ b/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md @@ -40,18 +40,12 @@ We propose to choose Option 3 above for the following reasons: ### **Impacts** -CQL will need to be documented and tested. Common filters (e.g. a single value, string, etc) might be pretty straight forward but complex metadata (e.g. nested JSON) might not be supported. This does allow us to migrate to different technologies (e.g. elastic search, databases, etc) in the future without impacting the users. - -The use of CQL will require development effort within the DAPA request and we're not sure this will be supported by the process mapper functionality. - -The CQL development will also duplicate native capabilties of elastic search and this was a primary concern with the cost of development. - -Lastly, the results will be returned via STAC so this should integrate with stage in requests as well where needed. +The development work of offering Airflow as part of the Unity infrastructure has been scoped and should not take more than a few months. Providing interoperability with WPS-T and HySDS will take longer, perhaps another few months. This architecture improvement will provide long term benefits and longevity to the Unity project. ### References (Optional) Any other references that make sense. Documentation links, other ADRs, etc. 
-{% embed url="https://docs.up42.com/developers/api-assets/stac-cql" %} +{% embed url="https://airflow.apache.org/" %} -{% embed url="https://pystac-client.readthedocs.io/en/latest/tutorials/cql2-filter.html" %} +{% embed url="https://docs.google.com/presentation/d/1ibGSqWkvZXVBxvkm08ZHAk4bjcs03oZYgFmV1DPPxWU/edit#slide=id.g291c340cd4b_0_0" %} From 6c500d0e7cd48c22c8dfee563a2ada66b8561abb Mon Sep 17 00:00:00 2001 From: Luca Cinquini Date: Fri, 10 Nov 2023 09:35:51 -0700 Subject: [PATCH 08/10] Update science-processing-service-apache-airflow-integration.md --- .../science-processing-service-apache-airflow-integration.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md b/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md index 20b9ff7..c2ae140 100644 --- a/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md +++ b/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md @@ -46,6 +46,6 @@ The development work of offering Airflow as part of the Unity infrastructure has (Optional) Any other references that make sense. Documentation links, other ADRs, etc. 
-{% embed url="https://airflow.apache.org/" %} +[Apache Airflow](https://airflow.apache.org/") -{% embed url="https://docs.google.com/presentation/d/1ibGSqWkvZXVBxvkm08ZHAk4bjcs03oZYgFmV1DPPxWU/edit#slide=id.g291c340cd4b_0_0" %} +[SPS presentation to Unity (private)](https://docs.google.com/presentation/d/1ibGSqWkvZXVBxvkm08ZHAk4bjcs03oZYgFmV1DPPxWU/edit#slide=id.g291c340cd4b_0_0") From 34b68a74281609da604cb23514eda2ddba715d9c Mon Sep 17 00:00:00 2001 From: Luca Cinquini Date: Fri, 10 Nov 2023 09:37:04 -0700 Subject: [PATCH 09/10] Update science-processing-service-apache-airflow-integration.md --- ...science-processing-service-apache-airflow-integration.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md b/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md index c2ae140..47cc216 100644 --- a/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md +++ b/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md @@ -44,8 +44,6 @@ The development work of offering Airflow as part of the Unity infrastructure has ### References -(Optional) Any other references that make sense. Documentation links, other ADRs, etc. 
+* [Apache Airflow](https://airflow.apache.org/) -[Apache Airflow](https://airflow.apache.org/") - -[SPS presentation to Unity (private)](https://docs.google.com/presentation/d/1ibGSqWkvZXVBxvkm08ZHAk4bjcs03oZYgFmV1DPPxWU/edit#slide=id.g291c340cd4b_0_0") +* [SPS presentation to Unity (private)](https://docs.google.com/presentation/d/1ibGSqWkvZXVBxvkm08ZHAk4bjcs03oZYgFmV1DPPxWU/edit#slide=id.g291c340cd4b_0_0) From 17195d3f9ff78813f6d84dba072ef80a7a232c67 Mon Sep 17 00:00:00 2001 From: Luca Cinquini Date: Fri, 10 Nov 2023 09:40:07 -0700 Subject: [PATCH 10/10] Update science-processing-service-apache-airflow-integration.md --- .../science-processing-service-apache-airflow-integration.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md b/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md index 47cc216..62f2c6a 100644 --- a/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md +++ b/architecture/architectural-decision-records/science-processing-service-apache-airflow-integration.md @@ -9,7 +9,7 @@ ### **Context** -In recent years, Apache Airflow has emerged as one of the leading open source orchestration engines for scalable jobs processing. Additionally, it is gaining attention and traction at JPL across several projects in Earth and Planetary sciences. We are proposig to integrate the Airflow architecture in the Unity model, as such: +In recent years, Apache Airflow has emerged as one of the leading open source orchestration engines for scalable jobs processing. Additionally, it is gaining attention and traction at JPL across several projects in Earth and Planetary sciences. 
We are proposing to integrate the Airflow architecture into the Unity model as follows (see diagram below): * The core components of Airflow (Web Server, Scheduler, Database) will compose the front-end EMS layer (which provides orchestration and monitoring across multiple back-ends) * The Airflow Operators will be used to submit workloads to multiple pluggable ADES back-ends (Celery Workers, EKS, ECS, etc.) @@ -18,6 +18,9 @@ Additionally, Unity may decide to provide Airflow extensions as follows: * An OGC WPS-T interface to allow clients to submit job requests that conform to this API specificiation * An Airflow HySDS Operator to allow projects to execute workloads on the HySDS system * An Airflow WPS-T Operator to allow projects to submit requests to any WPS-T compliant back-end + +Screenshot 2023-11-10 at 09 38 20 + ### Alternatives