Highly scalable, chainable and flexible processes centered around 'Collections' as inputs & outputs #47

Open
jerstlouis opened this issue Jul 30, 2019 · 22 comments

Comments

@jerstlouis
Member

jerstlouis commented Jul 30, 2019

One might see a process as acting on 0, 1 or multiple "collection" (data) inputs, possibly in addition to parameters, and producing 0, 1 or multiple "collection" (data) outputs.

In addition to nicely tying OGC API - Processes with Features and Coverages, it makes daisy chaining very easy to accomplish.

The "data delivery" building blocks (e.g. Features & Coverages) provide the mechanisms by which to identify "collections" of vector features or grid cells (data), and retrieve them. Built-in space partitioning mechanisms such as bounding boxes or more complex sub-setting can be used to efficiently retrieve part of the data, or special building blocks such as Tiles or DGGS can offer additional exchange mechanisms. This makes processes highly scalable as different tiles, DGGS cells or subsets of the data could be processed in a massively parallel manner, all through a potential chain of processes. The specific space partitioning modular API blocks (e.g. Tiles or DGGS cells), or the base Features/Coverage mechanisms can be negotiated at any exchange level throughout the chain.

Since it seems that we have to stick with the generic but loaded term of "collections" for e.g. OGC API - Features, I think it would be a mistake to overload the term of "collection" with a meaning other than a "set of geospatial data with particular characteristics", such as a generic group of things (e.g. a list of available processes). One can imagine the confusion if what OGC processes do is act on "collections", while "collection" is also used to mean a regrouping of multiple processes.

A simple example is vectorization of a digital elevation model.
You have a collection representing a DEM (it is important here that the level of a "collection" should represent a "collection of gridded cells", not a collection of a collection of gridded cells, unless said collection of collections is intended to be treated as if the cells were all together in one single DEM layer).
This is your input collection to a "vectorization" process.
You invoke it with parameters, and you get a "vector" collection back.
You can then chain another process working on a vector collection.

It can also be seen that rendering a "map" (as in the Maps API) is clearly a process rendering 1 or more "collections", taking a style as parameter, and producing an imagery ((A)RGB natural color image) "collection" as an output (the rendered map). The result can be retrieved through various data partitioning mechanisms (e.g. bounding box, or tiles), and the process can be optimized to render portions as required, cached, in parallel, etc.

We think it makes a lot of sense for OGC API - Processes to support both synchronous (POST and get the 200 output result) and asynchronous modes, Maps and Routes being an important example of how this is useful for simple POST requests returning an output collection (e.g. a route or rendered map) right away.

There could be standardized "well known" processes such as the Maps Rendering process, the Routing process (open routing pilot), Vectorization process, etc, which would further greatly enhance interoperability and chainability of different processes. Most processes would also provide the output in a "collection" interface, which any other process, data delivery, or geospatial viewer supporting the OGC API would readily support.
This also makes it possible to only process data on an as-needed basis, e.g. when first visualizing the output (or being requested by further down the raw data-process-visualization daisy chain) for a certain area or resolution level.

The concept of workflow could be simply described as a chain of OGC API resources and parameters. The actual communication between any 2 services in the chain can be negotiated (e.g. supporting tiles, supporting DGGS, supported data formats), and need not be included in that workflow description, so that alternate implementations with different support could apply the same workflow.

All this makes for a great universal geospatial processing framework by which all processes and data collections become readily compatible, and one can easily discover and combine data sets from all over and using processes hosted anywhere.

I brought up these ideas to the Process group at the OGC API Hackathon, as well as previously within OGC API Common discussions.

Additional background in opengeospatial/ogcapi-common#17

#33 #2 #35 #30

@jerstlouis
Member Author

jerstlouis commented Jan 7, 2020

The attached ogcapi.txt is an example request for an OGC API combining and chaining different modular blocks.
It would render a map combining 3 layers: satellite imagery, soil characteristics and elevation contours.
It uses the following OGC API modular blocks:

  • Processes (with 'Maps' as a specialized process),
  • Coverages (a 'data access' modular block, i.e. delivering a specific type of data layer aka 'collection')
  • Features (a 'data access' modular block, i.e. delivering a specific type of data layer aka 'collection', including STAC as a specialized Features API)
  • Styles
  • Tiles (optionally)

This JSON along with BBOX, WIDTH and HEIGHT parameters can be POSTed to an OGC API implementing the "Map" process at the end of the daisy chain.

It can also be used together with a Tiles API (or alternatively potentially the DGGS API) to provide an efficient mechanism to cache the results on either or both client & server.

Tiles (or DGGS, or subsetting, or other data partitioning/delivery modular blocks) can also be used along the daisy chain, between the final Map process and
the intermediate processes & data delivery APIs, as independently negotiated between the different hops of the daisy chain (invisible to the user at the end of the chain).
This could potentially follow the client's tile end-requests, or be driven by the hops themselves determining that to be the most efficient way to perform the requests,
based on declared API conformance classes.
This partitioning or chunking of the data opens up the possibility of distributed computing, allowing the whole process to complete in a short time synchronously.
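
As a rough sketch of this partitioning idea, the following Python splits a bounding box into a regular grid and hands each part to a worker. The function names and the `process_part` callback are hypothetical and stand in for whatever request a hop would actually issue (e.g. a tile or subset request); nothing here is prescribed by any OGC API.

```python
from concurrent.futures import ThreadPoolExecutor


def split_bbox(bbox, columns, rows):
    """Split (min_x, min_y, max_x, max_y) into columns x rows sub-boxes, row by row."""
    min_x, min_y, max_x, max_y = bbox
    dx = (max_x - min_x) / columns
    dy = (max_y - min_y) / rows
    return [
        (min_x + c * dx, min_y + r * dy, min_x + (c + 1) * dx, min_y + (r + 1) * dy)
        for r in range(rows)
        for c in range(columns)
    ]


def process_in_parallel(bbox, columns, rows, process_part):
    """Apply process_part (a hypothetical per-part request) to every sub-box concurrently."""
    parts = split_bbox(bbox, columns, rows)
    with ThreadPoolExecutor() as pool:
        # pool.map keeps the results in the same order as the parts
        return list(pool.map(process_part, parts))
```

A real hop would more likely follow a tile matrix set or DGGS rather than an arbitrary grid; the regular grid is used here only to keep the sketch short.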

The nested structure of this JSON document is based on the following simple types:

   class URLOrID    : String;  // A text string identifying either a resource of particular type on the local server (if a simple identifier), or a URL to a remote server
   class Process    : URLOrID; // A process resource
   class Collection : URLOrID; // A collection resource
                               // (e.g. a Feature Collection, or a Coverage, including ready-to-display imagery or a rendered map as a specialized kind of gridded coverage
                               // whose cells are pixels)

   class AbstractCollection
   {
      String id;                                  // An id allowing to name the individual inputs, e.g. useful to use as layer IDs within a style sheet
      Map<String, any_object> parameters;         // The list of parameters to supply for retrieving this collection (e.g. query parameters after the ?)
                                                  // These are parameters for either a collection end-point (e.g. Feature or Coverage) or for a process
                                                  // If a data partitioning/delivery API is additionally combined (e.g. tiles) (based on negotiation by the daisy chain)
                                                  // these parameters should still apply
      Map<String, any_object> inputParameters;    // The list of parameters specific to this one collection, which will be used by the process using this collection as an INPUT.
                                                  // If this collection is a process, these parameters will be used with the OUTPUT of that process.
   }

   class CollectionRetrieval : AbstractCollection
   {
      Collection collection;  // The collection resource
   }

   class ProcessInvokation : AbstractCollection
   {
      Process process;                    // The process resource
      Array<AbstractCollection> inputs;   // The collections to use as inputs (which could themselves be ProcessInvokation to form a daisy chain)
   }
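
For illustration only, the types above could be mirrored in Python roughly as follows. This is a sketch under assumptions: the `to_dict` helper and the exact JSON key names it emits are not taken from the attached ogcapi.txt, and the identifiers in the usage lines are placeholders.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class AbstractCollection:
    id: str
    parameters: Dict[str, Any] = field(default_factory=dict)
    input_parameters: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        # Key names are assumed; empty maps are simply omitted.
        doc: Dict[str, Any] = {"id": self.id}
        if self.parameters:
            doc["parameters"] = self.parameters
        if self.input_parameters:
            doc["inputParameters"] = self.input_parameters
        return doc


@dataclass
class CollectionRetrieval(AbstractCollection):
    collection: str = ""  # URL or local identifier of the collection

    def to_dict(self) -> Dict[str, Any]:
        doc = super().to_dict()
        doc["collection"] = self.collection
        return doc


@dataclass
class ProcessInvokation(AbstractCollection):
    process: str = ""  # URL or local identifier of the process
    inputs: List[AbstractCollection] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        doc = super().to_dict()
        doc["process"] = self.process
        # Inputs may themselves be ProcessInvokation objects, forming the daisy chain.
        doc["inputs"] = [item.to_dict() for item in self.inputs]
        return doc


# Placeholder identifiers, for illustration
dem = CollectionRetrieval(id="dem", collection="DTED")
contours = ProcessInvokation(id="contours", process="ElevationContours",
                             inputs=[dem], parameters={"distance": 250})
```

Calling `contours.to_dict()` then yields a nested dictionary that can be serialized with `json.dumps`.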

It also attempts to encode Cascading Map Style Sheets (http://docs.opengeospatial.org/per/18-025.pdf -- Appendix C) as a JSON object to define
portrayal rules directly within this document, as an alternative to referencing an external style sheet
or encoding a style sheet in a single text field.

The processes involved are the following:

  • Map (local) -- The map rendering processing end-point that this request is being POSTed to
  • Map (remote) -- A remote Map process from Landcare Research NZ
  • STACToCoverage -- A process taking as input the items returned from a STAC feature collection, accessing the data from where the imagery is hosted
    (e.g. S3 bucket), and providing this data as a Coverage API collection
  • Sentinel2BandsToARGB -- A process creating an ARGB image out of a Sentinel2 Coverage (e.g. selecting bands, panchromatic sharpening, atmospheric correction)
    The output is a specialized pixels-based gridded coverage, for which a simple WM(T)S-like API might be sufficient,
    i.e. it can simply return a PNG image based on a BBOX or tiles for example.
  • ElevationContours -- A process generating elevation contours from a Digital Elevation Model, at a specific distance interval

Some ways this request can be used include:

  • Include additional BBOX, WIDTH and HEIGHT parameters directly in this document (or as additional query parameters for the GET), and get a PNG map back
  • POST this document to a /collections/ end-point to create a new virtual collection, to which you could then also append a /tiles/ to retrieve it as tiles
  • POST this document to a /map/tiles/ end-point and get back a templated URL for retrieving this map rendered as tiles

If the user wants to perform client-side rendering, data layers of Coverage/Imagery or Features can be requested in the same manner (skipping the Map rendering process).

This system should also support more complex or resource intensive workflows.
An additional "callback" field on the top-level ProcessInvokation might be useful for asynchronous support.
Similarly, estimate and billing (including chainable calculation support) can be implemented as additional fields to this request (e.g. "estimateOnly" : true).
Even though the processes can (optionally?) all be linked/described at /processes/, the actual end-point for any process could be anywhere
(including on a separate server, so that /processes/ acts as a catalog of processes, just like /collections/ could act as a catalog of collections found elsewhere,
and both could provide a search mechanism for finding useful processes & collections from all over the world).

The processes can be discoverable at /processes .
A single process resource can be by itself a very versatile tool, effectively making it possible to upload code, virtual machines, containers, or algorithms as parameters, e.g.:

  • WCPS Runner
  • ADES Runner
  • R runner
  • Python runner

Alternatively one could also POST to /processes to create a new process pre-baking some parameters and/or inputs.

The top level object is a ProcessInvokation.

@jerstlouis
Member Author

jerstlouis commented Jul 27, 2020

Ecere, with 11 other collaborating organizations (9 of them OGC members, and 2 additional research centers), is investigating this approach in a GeoConnections project funded by NRCan (short title: Modular OGC API Workflows (MOAW)) running from June 2020 to March 2021. The project will also include application research in the context of Smart City / IoT / 3D and Earth Observation & the environment, to really put it in practice, evaluate the capabilities from the researchers' point of view, and demonstrate what can be accomplished with flexible client-driven parallel workflows. A first presentation about the project was given to the TC in June at the WPS SWG: https://portal.ogc.org/files/?artifact_id=93652

In our initial phase of the project, we have made some progress towards specifications for a process execution document converging from both the work done in the January Coverage & Analytics Code Sprint API Convergence group and at the April OGC API - Tiles Code Sprint, as well as the current OGC API - Processes process execution document schema.

We have summarized the recommended changes to the current OGC API - Processes process execution document schema needed in order to support this MOAW approach in 3 categories:

  • Minor general changes: syntactical improvements to the schema to make it simpler, mediaType rather than mimeType, and a new list type of input to support multiplicity
  • Extensions: functionality to support referencing collections and processes as first class objects, nested workflows, and adding support for input-specific parameters
  • Making things optional: A number of things are made optional to support specifying and negotiating them between workflow hops, outside of the workflow definition (outputs, mode, response, format...).

General changes

  • Change format's mimeType to mediaType (For execution schema 'format', use 'mediaType' rather than 'mimeType' #87)
  • Inside an input, drop the "input" : { "value" : { part, and directly have e.g. href next to the id. Other properties currently going under input or value (e.g. format when needed) would go at the same level as that.
  • Rename inlineValue to simply value. When using value, numbers can be given directly as numeric values, not strings (e.g. "value" : 0.05) -- it was confirmed that this is already possible.
  • Introduce list as a type of input which itself is a list of multiple inputs. This enables associating multiple inputs with a single input ID, supporting the multiplicity from traditional WPS.

@pvretano We are particularly interested in feedback from the processing group in terms of the possibility to adopt these general changes. We would also be happy to share more of the lengthy discussions and debates over the last few months which led us to make these recommendations.

Extensions for MOAW approach

  • Introduce collection and process as types of inputs. The value for process points to an OGC API .../processes/{processID} and also has an inputs array property along with it. The value for collection points to an OGC API .../collections/{collectionId}.
  • Allow specifying process at the top-level for the process executing the overall workflow (in effect the top-level document is like an inputs array element). This makes the process execution document self-describing, so that it can simply be loaded in a GIS client or published as a new virtual collection to any OGC API service. This also means that there can be an id at the top level which would give a name to the resulting output.
  • Introduce useWith as a property of an inputs array element for specifying parameters which are used by the process making use of this input (as opposed to nested process inputs contributing to the creation of that input). This solves e.g. the style specific to one layer/collection for Maps.

Properties made optional for MOAW approach (handled otherwise)

  • Don't require specifying outputs. If a process results in multiple outputs, the outputs property can be used to select one as part of the workflow. For the nested processes, for now focus on using one output at a time (select one with outputs if a process returns more than one). An overall workflow would return all outputs if outputs is not specified (e.g. so that all resulting layers would be added to a GIS client), and each of these would be identified by IDs. The ID(s) to choose could also be selected higher up (e.g. by the client).
  • Don't require specifying mode, response (the same workflow can be used in any mode).
  • In general we would not specify format (mediaType, schema), CRS, APIs or BBOX etc inside the workflow document (they are negotiated and added as query parameters or HTTP headers).

A UML diagram illustrating this (please ignore format being a string, it should be an object with schema and mediaType).
UMLInputs

Simple process execution

{
    "id" : "BufferedRoads",
    "process" : "http://geoprocessing.demo.52north.org:8080/javaps/rest/processes/SimpleBufferAlgorithm",
    "inputs" : [
        {
            "id" : "data",
            "collection" : "http://geoprocessing.demo.52north.org:8080/geoserver/ogcapi/collections/topp:tasmania_roads"
        },
        {
            "id" : "width",
            "value" : 0.05
        }
    ]
}

More complex workflow

{
   "id" : "ContoursRoadsRouteMap",
   "process" : "https://maps.ecere.com/ogcapi/processes/RenderMap",
   "inputs" : [
      { "id" : "transparency", "value" : true },
      {
         "id" : "layers",
         "list" : [
            {
               "id" : "contours",
               "process" : "https://maps.ecere.com/ogcapi/processes/ElevationContours",
               "inputs" : [
                  {
                     "id" : "data",
                     "collection" : "https://maps.ecere.com/ogcapi/datasets/vtp-Daraa/collections/DTED"
                  },
                  { "id" : "distance", "value" : 250 }
               ],
               "useWith" : { "style" : "night" }
            },
            {
               "id" : "roads",
               "collection" : "https://maps.ecere.com/ogcapi/datasets/vtp-Daraa/collections/TransportationGroundCrv",
               "useWith" : { "style" : "night" }
            },
            {
               "id" : "route",
               "process" : "https://maps.ecere.com/ogcapi/processes/OSMEcereRoutingEngine",
               "inputs" : [
                  {
                     "id" : "roadsNetworkSource",
                     "collection" : "https://maps.ecere.com/ogcapi/datasets/vtp-Daraa/collections/TransportationGroundCrv"
                  },
                  {
                     "id" : "elevationModel",
                     "collection" : "https://maps.ecere.com/ogcapi/datasets/vtp-Daraa/collections/DTED"
                  },
                  {
                     "id" : "request",
                     "value" :
                     {
                        "waypoints": {
                           "type": "MultiPoint",
                           "coordinates": [
                              [ 36.00210415, 32.54581061 ],
                              [ 36.11221587, 32.67020983 ]
                           ]
                        }
                     }
                  }
               ],
               "useWith" : { "style" : { "stroke" : { "color" : "lightBlue", "width" : 10.0 } } }
            }
         ]
      }
   ]
}
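
Since `inputs` and `list` nest arbitrarily, a server or client validating such a document would have to walk it recursively. The following Python is a minimal sketch of that traversal, collecting every referenced process and collection; the helper name and the shortened example.com URLs are hypothetical, and only the property names (`process`, `collection`, `inputs`, `list`) come from the documents above.

```python
from typing import Any, Dict, List, Tuple


def referenced_resources(node: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (kind, URL) pairs for every process and collection in a workflow, depth first."""
    found: List[Tuple[str, str]] = []
    if "process" in node:
        found.append(("process", node["process"]))
    if "collection" in node:
        found.append(("collection", node["collection"]))
    # Nested inputs of a process, and members of a "list" input, are walked the same way.
    for child in node.get("inputs", []) + node.get("list", []):
        found.extend(referenced_resources(child))
    return found


# A reduced workflow with placeholder URLs, for illustration
workflow = {
    "id": "ContoursRoadsMap",
    "process": "https://example.com/ogcapi/processes/RenderMap",
    "inputs": [
        {"id": "transparency", "value": True},
        {"id": "layers", "list": [
            {"id": "contours",
             "process": "https://example.com/ogcapi/processes/ElevationContours",
             "inputs": [
                 {"id": "data", "collection": "https://example.com/ogcapi/collections/DTED"},
                 {"id": "distance", "value": 250},
             ]},
            {"id": "roads", "collection": "https://example.com/ogcapi/collections/Roads"},
        ]},
    ],
}
```

Each returned URL is a hop that the receiving server would need to contact or resolve locally before the workflow can be considered valid.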

Details on how such a document could be POSTed to a .../processes/{processId} end-point, with potentially the new ?mode=deferred mode, are discussed in opengeospatial/ogcapi-maps#42

Suggested conformance classes for the MOAW approach so far are:

  • Processing-Nested -- Support for nested processes in execution document
  • Processing-Deferred -- Support for deferred approach (passing AOI+resolution or asking for tiles later)
  • Processing-Tiled -- Support for deferred tilesets approach (including directly asking for a specific TMS)

@m-mohr

m-mohr commented Jul 28, 2020

You may not need to reinvent the wheel here: As WPS did not support process chaining directly, we had to come up with our own things in openEO. It seems we do very similar things with exposing geospatial data at /collections (due to STAC/OGC API Features) and processes at /processes. Maybe you can use that as a basis: https://api.openeo.org/#section/Processes

Example: https://api.openeo.org/assets/pg-evi-example.json

@jerstlouis
Member Author

@m-mohr Thanks for the openEO info! We are also basing this on OGC API - Common - Part 2: Geospatial Data (/collections) and Processes (/processes), including process description.

However one objective is that a client can instantly query a complex workflow without it having to be first published, and the nesting is simply built into the Process execution document. Yet the first hand-shake request will be able to fully validate the workflow, and subsequent requests can query a specific AOI/format/API etc. It is also entirely driven from the end client (the researcher discovering data and processes, and building and tweaking the workflow), and we are focusing on tile-based and DGGS-based access, which will be very short individual duration requests, so /jobs will not be needed.

We are definitely interested in interoperability with openEO and I hope that will be possible as the OGC API standards evolve, including Common and Processes in particular. However I don't see us adopting that exact approach directly.

What we are implementing is mostly fully described above and in issue opengeospatial/ogcapi-maps#42, yet it offers a lot of flexibility.

@m-mohr

m-mohr commented Jul 28, 2020

However one objective is that a client can instantly query a complex workflow without it having to be first published, and the nesting is simply built into the Process execution document.

@jerstlouis If I understand you correctly: Totally possible in openEO, see /result. Basically what is synchronous in WPS would be /result in openEO and asynchronous would be /jobs.

What we are implementing is mostly fully described above and in issue opengeospatial/OGC-API-Maps#42, yet it offers a lot of flexibility.

Then there's also the option for tiled synchronous in /services as part of a Maps API or WMTS or so. Although /services would also allow more than tiled, could also be a WCS or whatever. On the other hand, this is not (yet) directly submitting the process chain with the Maps API / WMTS request. But from Maps issue 42 it seems you don't do that either as you are POSTing things and I'd assume that GET would be more helpful with web mapping libraries.

We are definitely interested in interoperability with openEO and I hope that will be possible as the OGC API standards evolve, including Common and Processes in particular.

Great to hear. Similarly, I could see a next iteration of openEO to work on the alignment with OGC API - Processes.

However I don't see us adopting that exact approach directly.

It was not my intention to say you should adopt it as is, but was more meant as inspiration.

The nice thing with our processes is the re-usability. You can define small "sub-algorithms" and re-use and share them. I think that's a big advantage and something that is missing from OGC API Processes (maybe I'm missing something).

@jerstlouis
Member Author

jerstlouis commented Jul 28, 2020

@m-mohr Thank you. Definitely inspired by the example, as the following shows :)
My attempt at showing what this would look like in our MOAW approach...

You could save the following in a .moaw file and load this directly in QGIS and pan & zoom around and the client would only fetch the visualized area/resolution (using ?mode=deferred described in Maps issue 42, and assuming some temporal controls as well).

Or you could POST it to:
http://maps.ecere.com/ogcapi/processes/CoverageProcessor?bbox=16.1,47.2,16.6,48.6&time=2018-01-01,2018-02-01&f=geotiff
to download (200 result) the full GeoTIFF file.

{
   "id" : "EVIExample",
   "process" : "http://maps.ecere.com/ogcapi/processes/CoverageProcessor",
   "inputs" : [
      {
         "id" : "data",
         "process" : "http://maps.ecere.com/ogcapi/processes/CoverageProcessor",
         "inputs" : [
            {
               "id" : "data",
               "collection" : "http://rasdaman.org/ogcapi/collections/sentinel-2"
            },
            {
               "id" : "code",
               "value" : "double BLUE = data[bands:'B02'], RED = data[bands:'B04'], NIR = data[bands:'B08']; return 2.5 * (NIR - RED) / (1 + NIR + 6*RED + -7.5*BLUE);"
            }
         ]
      },
      {
         "id" : "code",
         "value" : "return min[time](data);"
      }
   ]
}

To clarify, those POSTs I am talking about don't create something on the server but all return content / 200.
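
A minimal Python sketch of how a client might prepare such a POST is shown below. It only builds the request object and does not send it; the `build_execution_request` helper is hypothetical, and taking the target URL from the document's top-level `process` property is an assumption that matches the examples here rather than a defined rule.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_execution_request(workflow, **query):
    """Build (but do not send) a POST of a workflow document to its top-level process."""
    url = workflow["process"]
    if query:
        # Keep commas readable in values such as bbox=minx,miny,maxx,maxy
        url += "?" + urlencode(query, safe=",")
    body = json.dumps(workflow).encode("utf-8")
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/json"})


# Placeholder workflow and query parameters, for illustration
request = build_execution_request(
    {"id": "Example", "process": "http://example.com/ogcapi/processes/CoverageProcessor", "inputs": []},
    bbox="16.1,47.2,16.6,48.6",
    f="geotiff",
)
```

Passing `request` to `urllib.request.urlopen` would then perform the call, with the processed content expected directly in the 200 response.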

Of course that could be written simply with a single CoverageProcessor, but the separate ones were to showcase the nested processing there :) (they could have been on separate servers too)

{
   "id" : "EVIExample2",
   "process" : "http://maps.ecere.com/ogcapi/processes/CoverageProcessor",
   "inputs" : [
      { "id" : "data", "collection" : "http://rasdaman.org/ogcapi/collections/sentinel-2" },
      { "id" : "code", "value" :
         "double BLUE = data[bands:'B02'], RED = data[bands:'B04'], NIR = data[bands:'B08']; return min[time](2.5 * (NIR - RED) / (1 + NIR + 6*RED + -7.5*BLUE));"
      }
   ]
}

@m-mohr

m-mohr commented Jul 28, 2020

@jerstlouis Interesting! I indeed missed some parts of your #47 (comment) regarding MOAW.

I think the difference between openEO here and your MOAW approach is that I wasn't speaking about the whole processing instructions from loading data to storing results, but a subset. Like in the example I posted above there are some data cube operations (loading data, reducing operations, saving result) included and then a sub-process that computes the EVI. The EVI part could be made a separate process with its own defined parameters and return values. Then you can re-use that EVI process in other process chains again. Hard to really explain correctly in just a few sentences though.

You could save the following in a .moaw file and load this directly in QGIS and pan & zoom around and the client would only fetch the visualized area/resolution (using ?mode=deferred described in Maps issue 42).

Yeah, that's very similar. Your file is probably a bit more versatile as with our approach QGIS would first need to create a service and then could do the interactions described by you.

To clarify, those POSTs I am talking about don't create something on the server but all return content / 200.

Understood. I was more referring to the point that a normal WMTS/XYZ/whatever layer in OpenLayers/Leaflet usually just makes GET requests and not POSTs. Our approach of creating the service beforehand better suits the current way these libraries work, whereas your approach likely needs new extensions to those libraries. There are pros and cons with both alternatives and it depends on the use-case which is better suited.

Interesting example. That even goes beyond my example as it allows direct execution of code. We have that in openEO, too, but differently. What seems to be a bigger difference is what a process is. The http://maps.ecere.com/ogcapi/processes/CoverageProcessor process seems to be something bigger (the processing engine?) whereas our add/min/... processes are more fine-grained and you just use them in code and there's no reference to .../processes/min or so. So that's a bit confusing as it's unclear where they come from (where are they exposed to users? Also in /processes?).

@jerstlouis
Member Author

jerstlouis commented Jul 28, 2020

@m-mohr

The EVI part could be made a separate process with its own defined parameters and return values. Then you can re-use that EVI process in other process chains again

Yes I understand that, we were also thinking about that. Right now the inputs can be another process, but you could also potentially publish a MOAW workflow as itself being a process, by leaving some most-nested input data undefined, which I guess you would need to map somehow to the top-level inputs. We have not yet thought this part through fully, but that should be doable and make all that possible.

just usually makes GET requests and not POSTs.

Well GET is limited in terms of passing complex requests, so this is where POST can be used.
However we were also planning for the possibility to publish these workflows as virtual collections, allowing to create new collections at the regular /collections/{collectionId} (e.g. by POSTing to /collections) which can then be accessed using OGC API - Features, Coverage, Tiles, Maps GET methods just like any other collections... But that is for when you are satisfied and no longer tweaking your workflow and want to publish it. One could also optionally be allowed to do a GET /collections/{collectionId}/executionSchema to retrieve the source workflow definition, which could then be re-used as a basis for new workflows (another way to re-use these workflows).

So that's a bit confusing as it's unclear where they come from (where are they exposed to users? Also in /processes?).

Not sure what you meant by "they" there? CoverageProcessor was just an example of a potential process that might be sitting at /processes (discoverable and describable using OGC API - Processes), supporting that hypothetical coverage processing language I just made up, likely similar in some ways to WCPS. min() etc. in the code would all be constructs of that language understood by the CoverageProcessor process.

@m-mohr

m-mohr commented Jul 28, 2020

@jerstlouis

Right now the inputs can be another process, but you could also potentially publish a MOAW workflow as itself being a process

That sounds very useful! Good to hear that.

by leaving some most-nested input data undefined which I guess you would need to map somehow to the top-level inputs. We have not yet thought this part through fully, but that should be doable and make all that possible.

Yes, in openEO you can just define a full new process again with all the metadata you get from /processes. But yeah, in the end it's similar to your "undefined input" thought.

Well GET is limited in terms of passing complex requests, so this is where POST can be used.

Indeed, I was just saying that this probably needs work in the tooling, but it could well be that the virtual collections make up for that.

Not sure what you meant by they there?

I meant min, +, -, etc.

CoverageProcessor was just an example of a potential process that might be sitting at /processes (discoverable and describable using OGC API - Processes), supporting that hypothetical coverage processing language I just made up, likely similar in some ways to WCPS. min() etc. in the code would all be constructs of that language understood by the CoverageProcessor process.

Indeed, that looks very similar to WCPS. Not a big fan of that. The number of WCPS implementations is limited, but now I'm far away from the issue's topic. ;-) There just seems to be a bit of a difference between OAPro and openEO in what a process is. We express the WCPS code itself as individual processes so that it's just one "language". That enables us to be completely independent of the underlying processing engine, but needs a separate translation step (e.g. from openEO to WCPS for rasdaman).

@jerstlouis
Member Author

jerstlouis commented Jul 28, 2020

@m-mohr

Perhaps something like this could work, where this new input type of input will map to inputs used with the workflow itself (used as a process):

{
   "id" : "EVI",
   "process" : "http://maps.ecere.com/ogcapi/processes/CoverageProcessor",
   "inputs" : [
      { "id" : "data", "input" : "data" },
      { "id" : "code", "value" :
         "double BLUE = data[bands:'B02'], RED = data[bands:'B04'], NIR = data[bands:'B08']; return min[time](2.5 * (NIR - RED) / (1 + NIR + 6*RED + -7.5*BLUE));"
      }
   ]
}

Then this could be published to http://maps.ecere.com/ogcapi/processes/EVI and used simply as:

{
   "id" : "EVI",
   "process" : "http://maps.ecere.com/ogcapi/processes/EVI",
   "inputs" : [ { "id" : "data", "collection" : "http://rasdaman.org/ogcapi/collections/sentinel-2" } ]
}
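A minimal sketch of how the "input" placeholder above might be resolved when the published EVI workflow is invoked as a process. The function name bind_inputs and the substitution rule are assumptions made for illustration, not part of any existing MOAW implementation; the URLs and input names are the ones from the two examples above.

```python
import copy

# Published workflow: its "data" input is left open via an "input" placeholder.
EVI_TEMPLATE = {
    "id": "EVI",
    "process": "http://maps.ecere.com/ogcapi/processes/CoverageProcessor",
    "inputs": [
        {"id": "data", "input": "data"},
        {"id": "code", "value": "return min[time](2.5 * (NIR - RED) / (1 + NIR + 6*RED + -7.5*BLUE));"},
    ],
}


def _bind(node, supplied):
    """Replace {"input": name} placeholders in place, descending into nested processes."""
    for inp in node.get("inputs", []):
        if "input" in inp:
            name = inp.pop("input")
            # Copy what the caller supplied, keeping the template's own "id".
            inp.update({k: v for k, v in supplied[name].items() if k != "id"})
        elif "process" in inp:
            _bind(inp, supplied)


def bind_inputs(template, call_inputs):
    """Return a bound copy of `template`; the template itself is left untouched."""
    supplied = {inp["id"]: inp for inp in call_inputs}
    bound = copy.deepcopy(template)
    _bind(bound, supplied)
    return bound


call = [{"id": "data", "collection": "http://rasdaman.org/ogcapi/collections/sentinel-2"}]
bound = bind_inputs(EVI_TEMPLATE, call)
```

After binding, the "data" entry of the CoverageProcessor execution carries the caller's collection URL instead of the placeholder, while the "code" input is passed through unchanged.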

We were also considering being able to deploy code and/or a containerized process to a 'generic' process, much like the ADES (in fact being able to re-use containers meant for an ADES is one objective), so e.g. being able to POST that CoverageProcessor code directly to create a new "EVI" process would be another approach.

that looks very similar to WCPS. Not a big fan of that.

Maybe if we define something simple enough to implement it could get enough adoption.
It could be translated back and forth to the current openEO structure as well.
I am not a fan of expressing operations in JSON objects :)
These workflow files are meant to be easy to hand edit. I like nice and elegant operators / code.

@m-mohr

m-mohr commented Jul 28, 2020

@jerstlouis

Perhaps something like this could work, where this new type of input will map to inputs used with the workflow itself (used as a process):

Yes, indeed that should work if it is unambiguous. In openEO we have {from_parameter: "data"} for that, see for example the x and y parameters in https://github.com/Open-EO/openeo-processes/blob/master/normalized_difference.json

We were also considering being able to deploy code and/or a containerized process to a 'generic' process

Yes, we have that too. It's called user-defined functions (UDFs). Guess that's the most challenging task to get right with regards to performance, data exchange and security.

Maybe if we define something simple enough to implement it could get enough adoption.

I doubt it gets much simpler than WCPS (or openEO) without being too limiting (just considering the math part).

It could be translated back and forth to the current openEO structure as well.

Yes, it looks like that. At least when reading through many of the issues, it's all very similar.
We already translate openEO to WCPS. ;-)

I am not a fan of expressing operations in JSON objects :)

Well, that's JSON generated by clients. Our users don't hand-write that. We have JS, Python, R and a Web client for that. In the Python client, for example, you can just write native Python code, and through operator overloading the JSON is generated without the user noticing. Just for reference, that's the Python client code generating something very similar to the JSON example I posted above ( https://api.openeo.org/assets/pg-evi-example.json ):

import openeo

connection = openeo.connect("https://openeo.org")
sentinel2_data_cube = connection.datacube("Sentinel-2")

B02 = sentinel2_data_cube.band('B02')
B04 = sentinel2_data_cube.band('B04')
B08 = sentinel2_data_cube.band('B08')
evi_cube = (2.5 * (B08 - B04)) / ((B08 + 6.0 * B04 - 7.5 * B02) + 1.0)

min_composite = evi_cube.min_time(dimension="t")
min_composite.download("out.tiff", format="GTiff")

That should be easy enough and is actually in the language the user prefers.
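To illustrate the operator-overloading idea in general terms, here is a minimal sketch of a class whose arithmetic operators build a nested JSON-like structure instead of computing a value. The Expr class, the band() helper and the nested layout are hypothetical and much simpler than what the openEO clients actually produce (real process graphs are flat and reference nodes by id); only the process names add, subtract, multiply and divide with x / y arguments are borrowed from openEO.

```python
import json


class Expr:
    """Expression node: arithmetic on Expr objects records the operation as a dict."""

    def __init__(self, tree):
        self.tree = tree

    @staticmethod
    def _unwrap(value):
        # Plain numbers are stored as-is, Expr objects contribute their tree.
        return value.tree if isinstance(value, Expr) else value

    def _binary(self, op, other, swapped=False):
        x, y = self.tree, Expr._unwrap(other)
        if swapped:  # e.g. 2.5 * expr calls expr.__rmul__(2.5)
            x, y = y, x
        return Expr({"process_id": op, "arguments": {"x": x, "y": y}})

    def __add__(self, other): return self._binary("add", other)
    def __radd__(self, other): return self._binary("add", other, swapped=True)
    def __sub__(self, other): return self._binary("subtract", other)
    def __rsub__(self, other): return self._binary("subtract", other, swapped=True)
    def __mul__(self, other): return self._binary("multiply", other)
    def __rmul__(self, other): return self._binary("multiply", other, swapped=True)
    def __truediv__(self, other): return self._binary("divide", other)
    def __rtruediv__(self, other): return self._binary("divide", other, swapped=True)


def band(name):
    """Hypothetical leaf node standing in for one band of a data cube."""
    return Expr({"band": name})


B02, B04, B08 = band("B02"), band("B04"), band("B08")
evi = (2.5 * (B08 - B04)) / ((B08 + 6.0 * B04 - 7.5 * B02) + 1.0)
evi_json = json.dumps(evi.tree, indent=2)
```

The user writes the same formula as in the client code above; the JSON in evi_json is a by-product that never has to be edited by hand.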

These workflow files are meant to be easy to hand edit. I like nice and elegant operators / code.

Agreed, but we let the clients do the work and don't expect users to fiddle with JSON at all.

@jerstlouis
Member Author

@m-mohr Very cool. Thanks for sharing that Python code.

If I understand correctly, one JSON workflow file defines a workflow which ends up running like a single process on a single server?
In a sense it is more or less equivalent to WCPS in being able to define the processing to perform?
Or could one of the processes in the workflow also be located elsewhere?

@m-mohr

m-mohr commented Jul 28, 2020

@jerstlouis

Yes, indeed, similar to WCPS in most cases. Although it's not necessarily a single server, but parts of the workflow can be distributed across a cluster, run in parallel or even redirect things to external servers (currently through run_udf_externally, but that leads to its own challenges, e.g. the amount and speed of transmitting the files over the network). The important thing is that the user shouldn't really notice in most cases.

@jerstlouis
Member Author

jerstlouis commented Jul 28, 2020

@m-mohr With MOAW, a difference is that those processes are really distinct processes, which themselves could directly be on another server, and the interface to feed data from one process into another is an OGC API -- e.g. Processing for { "process" } input (which could involve a nested part of the MOAW workflow), or Coverages or Features, all of which could be using Tiles or DGGS or BBOX or SUBSET to request specific AOI/resolution in parallel.

So potentially we could have e.g. http://maps.ecere.com/ogcapi/processes/openeo as a process that can execute an openEO workflow, and that could be part of a larger MOAW workflow. And one approach could potentially feed into the other...

Because each hop of the workflow negotiates the most efficient way to transfer things (e.g. which API / formats / partitioning mechanism to use), and the requests can be done in parallel on subsets, things should perform fairly well over the network (as opposed to waiting for large files to be processed / encoded / decoded).
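As a rough illustration of "requests done in parallel on subsets", the sketch below splits an area of interest into a grid of bounding boxes and handles them concurrently. The function names and the process_subset placeholder are hypothetical; a real hop would issue an OGC API request (Tiles, bbox or subset) instead of returning a stub result.

```python
from concurrent.futures import ThreadPoolExecutor


def split_bbox(bbox, cols, rows):
    """Split (west, south, east, north) into cols x rows sub-boxes, row by row from the south."""
    west, south, east, north = bbox
    dx = (east - west) / cols
    dy = (north - south) / rows
    return [
        (west + c * dx, south + r * dy, west + (c + 1) * dx, south + (r + 1) * dy)
        for r in range(rows)
        for c in range(cols)
    ]


def process_subset(bbox):
    """Placeholder for one request on one subset (no network access here)."""
    return {"bbox": bbox, "status": "done"}


def run_parallel(bbox, cols=2, rows=2, workers=4):
    """Process every sub-box concurrently; results keep the order of split_bbox()."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_subset, split_bbox(bbox, cols, rows)))


parts = split_bbox((-180, -90, 180, 90), 2, 2)
results = run_parallel((-180, -90, 180, 90))
```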

@m-mohr

m-mohr commented Jul 28, 2020

Interesting, thanks for pointing that out. That sounds useful on the one hand, but also pretty complex on the other hand. Especially making clients support that in an easy way.

Because each hop of the workflow negotiates the most efficient way to transfer things (e.g. which API / formats / partitioning mechanism to use), and the requests can be done in parallel on subsets, things should perform fairly well over the network (as opposed to waiting for large files to be processed / encoded / decoded).

I'd be really interested to see evidence on that at some point. From our experiments: Once network transfer between providers started (e.g. with DIASes in Europe), performance degraded relatively quickly. So really interested in how others have solved that.
But we may discuss that separately. This issue already became quite long and now covers much more than originally intended, I guess.

@jerstlouis
Member Author

jerstlouis commented Jul 28, 2020

complex on the other hand. Especially making clients support that in an easy way.

Actually it is not that bad...

  • If you're dealing with a nested process, you simply POST the nested JSON object to the process identified by "process", potentially using the same mode you were invoked from, e.g.
    • Deferred (POST for Handshake/Validation first which returns links, then GET requests for Tile or BBOX requests later),
    • or Sync (POST returns data right away),
    • or Async (e.g. callback URLs, which themselves could chain).
  • If it's a "collection", you access that /collections/{collectionId} resource, and you see what APIs are available (potentially using links/relation types), e.g. Features and/or Coverages, and you go by what is being requested of you and what you support. Of course, if the collection is local to the server, the server can access it directly.
  • Media type negotiations can be done via HTTP headers and/or via an ?f= format parameter
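The dispatch described in the list above can be sketched as a small function that looks at each direct input of a workflow and decides which request to issue. This is only an illustration under assumptions: classify_inputs and the example.com URLs are hypothetical, and the workflow uses the "id"-based inputs array shown earlier in this thread.

```python
def classify_inputs(workflow):
    """For each direct input, describe the request a server might issue to obtain it."""
    actions = []
    for inp in workflow.get("inputs", []):
        if "process" in inp:
            # Nested process: forward the nested JSON object as-is to that process.
            actions.append({"input": inp["id"], "method": "POST", "url": inp["process"], "body": inp})
        elif "collection" in inp:
            # Collection: fetch its description first, then pick an API it offers.
            actions.append({"input": inp["id"], "method": "GET", "url": inp["collection"]})
        else:
            # Plain parameter: nothing to request.
            actions.append({"input": inp["id"], "method": None, "value": inp.get("value")})
    return actions


# Hypothetical workflow: a map rendering process fed by a nested contours process.
workflow = {
    "id": "ContourMap",
    "process": "http://a.example.com/ogcapi/processes/RenderMap",
    "inputs": [
        {
            "id": "layer",
            "process": "http://b.example.com/ogcapi/processes/ElevationContours",
            "inputs": [
                {"id": "data", "collection": "http://c.example.com/ogcapi/collections/elevation"},
                {"id": "distance", "value": 1000},
            ],
        }
    ],
}

outer = classify_inputs(workflow)
inner = classify_inputs(workflow["inputs"][0])
```

Each server only looks at its own direct inputs: the first server POSTs the nested object onward, and the second server is the one that resolves the collection.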

But you could also implement more complex mechanics, potentially handling different partitioning mechanisms (e.g. translating Tiles to BBOX for a server that does not support Tiles or a particular TileMatrixSet), and you may implement things like CRS conversions to bridge gaps.
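As one example of bridging partitioning mechanisms, a tile request can be translated into an equivalent BBOX request. The sketch below assumes the WorldCRS84Quad tile matrix set (two 180° x 180° tiles at level 0, rows counted from the north) and returns the box as (west, south, east, north) in degrees; the function name is hypothetical.

```python
def tile_to_bbox(level, row, col):
    """Bounding box (west, south, east, north) of a WorldCRS84Quad tile, in degrees."""
    size = 180.0 / (2 ** level)   # tile extent in degrees at this level
    west = -180.0 + col * size    # columns run eastward from the antimeridian
    north = 90.0 - row * size     # rows run southward from the north pole
    return (west, north - size, west + size, north)


def bbox_param(level, row, col):
    """Format the box as a comma-separated bbox query parameter value."""
    return ",".join(str(v) for v in tile_to_bbox(level, row, col))
```

A server that only supports bbox could then be queried with that value in place of the tile path segments.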

I'd be really interested to see evidence on that at some point.

Of course, a major goal of the project is to validate this approach. Although spreading things out across different servers will certainly degrade performance, I believe that some aspects of the approach (e.g. pre-establishing a connection / validating the workflow, small parallel requests per tile, tile caching and preemption of requests) will somewhat mitigate that impact. Assessing this is an important part of our project.

This issue already became quite long and now covers much more than originally intended, I guess.

This openEO input and comparison is actually very interesting and relevant to our on-going design and development, so thanks a lot for the great overview and feedback! We are definitely interested in integrating with the openEO platforms going forward, as it fits well within the key objectives of our project to improve the findability, discoverability, interoperability and usability of data.

@bpross-52n
Contributor

See also #53 and #35

@jerstlouis
Member Author

jerstlouis commented Aug 18, 2020

Practical examples based on initial successful experiments of the Modular OGC API Workflow approach:

Synchronous Processing

POST workflow (enhanced process execution document / .moaw) to any of these end-points to receive map, vector, or coverage data back:

  • (Maps) {serviceAPI}/processes/{processId}/map/default
  • (Features) {serviceAPI}/processes/{processId}/items
  • (Single feature) {serviceAPI}/processes/{processId}/items/{itemId}
  • (Coverage) {serviceAPI}/processes/{processId}/coverage
  • (Features / Coverage data tile) {serviceAPI}/processes/{processId}/tiles/{tileMatrixSetId}/{tileRow}/{tileCol}
  • (Map tile) {serviceAPI}/processes/{processId}/map/default/tiles/{tileMatrixSetId}/{tileRow}/{tileCol}

Non-tiled Maps/Features/Coverages support the bbox (ClipBox for Features), width, height and f parameters.

cURL Examples

With the following .moaw JSON file named 1000mContours.moaw:

{
   "id" : "1000mContours",
   "process" : "http://server.com/ogcapi/processes/ElevationContours",
   "inputs" : [
      {
         "id" : "data",
         "collection" : "http://server.com/ogcapi/collections/SRTM_ViewFinderPanorama"
      },
      { "id" : "distance", "value" : 1000 }
   ]
}
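A rough Python sketch of how the workflow above might be POSTed to one of the synchronous end-points, using only the standard library. The helper name build_sync_request and the Content-Type header value are assumptions; the request is only constructed here, not sent.

```python
import json
import urllib.request

# The 1000mContours.moaw workflow from above, as a Python dict.
WORKFLOW = {
    "id": "1000mContours",
    "process": "http://server.com/ogcapi/processes/ElevationContours",
    "inputs": [
        {"id": "data", "collection": "http://server.com/ogcapi/collections/SRTM_ViewFinderPanorama"},
        {"id": "distance", "value": 1000},
    ],
}


def build_sync_request(service_api, process_id, resource, workflow):
    """Build (but do not send) a POST of the workflow to {serviceAPI}/processes/{processId}/{resource}."""
    url = f"{service_api}/processes/{process_id}/{resource}"
    body = json.dumps(workflow).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},  # assumed media type
    )


# Request the contours as features; "coverage" or "map/default" would target other outputs.
req = build_sync_request("http://server.com/ogcapi", "ElevationContours", "items", WORKFLOW)
```

Passing req to urllib.request.urlopen() would perform the actual call against a server implementing these end-points.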

Deferred Processing

POST workflow (.moaw) to any of these end-points to set up a deferred workflow, receiving back a resource mapping directly to the corresponding OGC API based on the resource posted to. All of these will also return an HTTP 303 (See Other) status code re-directing to a URL returning the same results via a GET method. Any of the links returned may eventually return 410 Gone if the resource has expired, at which point the client should re-submit the workflow. The response header also explicitly says not to cache the response, as the resource may come back online at a later point (e.g. after re-submitting the workflow).

  • (Output collection description) {serviceAPI}/processes/{processId}
  • (Map) {serviceAPI}/processes/{processId}/map
  • (List of data tilesets) {serviceAPI}/processes/{processId}/tiles
  • (List of map tilesets) {serviceAPI}/processes/{processId}/map/default/tiles
  • (Single data tileset) {serviceAPI}/processes/{processId}/tiles/{tileMatrixSetId}
  • (Single map tileset) {serviceAPI}/processes/{processId}/map/default/tiles/{tileMatrixSetId}

GET from any resources linked (equivalent for each of the Synchronous Processing resources described earlier)
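The client-side handling of the status codes mentioned above (303 See Other, 410 Gone) could look roughly like the sketch below. The function name and the action labels are hypothetical; the only protocol facts relied on are that a 303 response carries the target URL in its Location header and that 410 signals an expired resource.

```python
def next_step(status, headers):
    """Decide what a client should do after a response in the deferred workflow."""
    if status == 303:
        # See Other: fetch the same results with GET at the indicated URL.
        return ("GET", headers["Location"])
    if status == 410:
        # Gone: the deferred resource expired, so the workflow must be POSTed again.
        return ("RESUBMIT", None)
    if 200 <= status < 300:
        return ("DONE", None)
    return ("ERROR", None)


# Hypothetical redirect target for illustration only.
step = next_step(303, {"Location": "http://server.com/ogcapi/some/deferred/resource"})
```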

cURL Examples

With the same .moaw JSON file named 1000mContours.moaw as above:

Virtual collections

  • GET {virtualCollection}/workflow to retrieve the MOAW workflow (process execution document), where {virtualCollection} is either a persistent virtual collection at {datasetAPI}/collections/{collectionId} or a deferred collection at an arbitrary collection description URL.

cURL Example

The following has not yet been implemented:

  • POST to {datasetAPI}/collections to create a new virtual collection (named by the top level id) from a .moaw workflow (just like it could also be done from a Feature Collection e.g. shapefile, or a Coverage e.g. GeoTIFF)
  • PUT to {datasetAPI}/collections/{collectionId} to create a virtual collection from a .moaw workflow specifically named {collectionId}
  • DELETE {datasetAPI}/collections/{collectionId} to delete an existing collection
  • PUT to {serviceAPI}/processes/{processId} to create a process from a .moaw workflow (taking in a flexible input) with a fixed name

@jerstlouis
Member Author

jerstlouis commented Aug 29, 2020

Work in progress OpenAPI with (mostly) working prototype in the Swagger interface:

https://app.swaggerhub.com/apis/jerstlouis/MOAW/MOAW-0.2#/

@m-mohr

m-mohr commented Dec 10, 2020

Is there an example?

@jerstlouis
Member Author

jerstlouis commented Dec 10, 2020

@m-mohr The OpenAPI specifications mentioned above contain examples, and there are also more as sample workflow execution documents on the process pages themselves for now:

https://maps.ecere.com/ogcapi/processes/RenderMap
https://maps.ecere.com/ogcapi/processes/ElevationContours
https://maps.ecere.com/ogcapi/processes/OSMERE
https://maps.ecere.com/ogcapi/processes/MOAWAdapter

The workflow execution document will be adjusted to follow the latest SWG decision on #105 where the inputs will be a dictionary/map instead of containing "id" to name them.
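A sketch of what that adjustment might mean for existing workflow documents: each entry of the "inputs" array is moved into a dictionary keyed by its former "id". The exact shape adopted in #105 is not shown in this thread, so the output layout and the function name below are assumptions.

```python
def inputs_to_dict(node):
    """Convert an "id"-based inputs array into a dictionary keyed by input name, recursively."""
    converted = {k: v for k, v in node.items() if k != "inputs"}
    if "inputs" in node:
        converted["inputs"] = {
            # The input's "id" becomes the key; nested processes are converted as well.
            inp["id"]: inputs_to_dict({k: v for k, v in inp.items() if k != "id"})
            for inp in node["inputs"]
        }
    return converted


old_style = {
    "id": "1000mContours",
    "process": "http://server.com/ogcapi/processes/ElevationContours",
    "inputs": [
        {"id": "data", "collection": "http://server.com/ogcapi/collections/SRTM_ViewFinderPanorama"},
        {"id": "distance", "value": 1000},
    ],
}
new_style = inputs_to_dict(old_style)
```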
