Highly scalable, chainable and flexible processes centered around 'Collections' as inputs & outputs #47
Attached is an example request for an OGC API combining and chaining different modular blocks.
This JSON, along with BBOX, WIDTH and HEIGHT parameters, can be POSTed to an OGC API implementing the "Map" process at the end of the daisy chain. It can also be used together with a Tiles API (or alternatively, potentially, the DGGS API) to provide an efficient mechanism to cache the results on either or both client & server. Tiles (or DGGS, or subsetting, or other data partitioning/delivery modular blocks) can also be used along the daisy chain, between the final Map process and
The nested structure of this JSON document is based on the following simple types:
class UrlOrID : String; // A text string identifying either a resource of particular type on the local server (if a simple identifier), or a URL to a remote server
class Process : UrlOrID; // A process resource
class Collection : UrlOrID; // A collection resource
// (e.g. a Feature Collection, a Coverage (including ready-to-display imagery or rendered map as a specialized kind of gridded coverage
// whose cells are pixels))
class AbstractCollection
{
String id; // An id allowing to name the individual inputs, e.g. useful to use as layer IDs within a style sheet
Map<String, any_object> parameters; // The list of parameters to supply for retrieving this collection (e.g. query parameters after the ?)
// These are parameters for either a collection end-point (e.g. Feature or Coverage) or for a process
// If a data partitioning/delivery API is additionally combined (e.g. tiles) (based on negotiation by the daisy chain)
// these parameters should still apply
Array<String, any_object> inputParameters; // The list of parameters specific to this one collection, which will be used by the process using this collection as an INPUT.
// If this collection is a process, these parameters will be used with the OUTPUT of that process.
}
class CollectionRetrieval : AbstractCollection
{
Collection collection; // The collection resource
}
class ProcessInvokation : AbstractCollection
{
Process process; // The process resource
Array<AbstractCollection> inputs; // The collections to use as inputs (which could themselves be ProcessInvokation to form a daisy chain)
}
It also attempts to encode Cascading Map Style Sheets (http://docs.opengeospatial.org/per/18-025.pdf -- Appendix C) as a JSON object to define
The processes involved are the following:
Some ways this request can be used include:
If the user wants to perform client-side rendering, data layers of Coverage/Imagery or Features can be requested in the same manner (skipping the Map rendering process). More complex or resource intensive workflows might be, which this system should also support. The processes can be discoverable at /processes .
The top level object is a |
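The simple types listed above can be mirrored in a few lines of Python, which may help when reading the nested JSON examples later in the thread. This is only a sketch: the class names follow the pseudocode, `inputParameters` is modelled as a dictionary (an assumption), and the `depth` helper is hypothetical and not part of the proposal.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class AbstractCollection:
    """Fields shared by anything that can be used as an input collection."""
    id: str = ""
    parameters: Dict[str, Any] = field(default_factory=dict)
    input_parameters: Dict[str, Any] = field(default_factory=dict)


@dataclass
class CollectionRetrieval(AbstractCollection):
    """A collection resource, given as a local identifier or a URL."""
    collection: str = ""


@dataclass
class ProcessInvokation(AbstractCollection):
    """A process resource plus its inputs, which may be further invocations."""
    process: str = ""
    inputs: List[AbstractCollection] = field(default_factory=list)


def depth(node: AbstractCollection) -> int:
    """Hypothetical helper: number of process hops at or below this node."""
    if isinstance(node, ProcessInvokation):
        return 1 + max((depth(child) for child in node.inputs), default=0)
    return 0
```

A map rendering process fed by a contours process fed by a DEM collection would have a depth of 2 under this definition.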
Ecere, with 11 other collaborating organizations (9 of them OGC members, and 2 additional research centers), is investigating this approach in a GeoConnections project funded by NRCan (short title: Modular OGC API Workflows (MOAW)) running from June 2020 to March 2021. The project will also include application research in the context of Smart City / IoT / 3D and Earth Observation & the environment, to really put it into practice, evaluate the capabilities from the researchers' point of view, and demonstrate what can be accomplished with flexible client-driven parallel workflows. A first presentation about the project was given to the TC in June at the WPS SWG: https://portal.ogc.org/files/?artifact_id=93652
In our initial phase of the project, we have made some progress towards specifications for a process execution document converging from both the work done in the January Coverage & Analytics Code Sprint API Convergence group and at the April OGC API - Tiles Code Sprint, as well as the current OGC API - Processes process execution document schema. We have summarized the recommended changes to the current OGC API - Processes process execution document schema needed in order to support this MOAW approach in 3 categories:
General changes
@pvretano We are particularly interested in feedback from the processing group in terms of the possibility to adopt these general changes. We would also be happy to share more of the lengthy discussions and debates over the last few months which led us to make these recommendations.
Extensions for MOAW approach
Properties made optional for MOAW approach (handled otherwise)
A UML diagram illustrating this (please ignore format being a string, it should be an object with
Simple process execution
{
"id" : "BufferedRoads",
"process" : "http://geoprocessing.demo.52north.org:8080/javaps/rest/processes/SimpleBufferAlgorithm",
"inputs" : [
{
"id" : "data",
"collection" : "http://geoprocessing.demo.52north.org:8080/geoserver/ogcapi/collections/topp:tasmania_roads"
},
{
"id" : "width",
"value" : 0.05
}
]
}
More complex workflow
{
"id" : "ContoursRoadsRouteMap",
"process" : "https://maps.ecere.com/ogcapi/processes/RenderMap",
"inputs" : [
{ "id" : "transparency", "value" : true },
{
"id" : "layers",
"list" : [
{
"id" : "contours",
"process" : "https://maps.ecere.com/ogcapi/processes/ElevationContours",
"inputs" : [
{
"id" : "data",
"collection" : "https://maps.ecere.com/ogcapi/datasets/vtp-Daraa/collections/DTED"
},
{ "id" : "distance", "value" : 250 }
],
"useWith" : { "style" : "night" }
},
{
"id" : "roads",
"collection" : "https://maps.ecere.com/ogcapi/datasets/vtp-Daraa/collections/TransportationGroundCrv",
"useWith" : { "style" : "night" }
},
{
"id" : "route",
"process" : "https://maps.ecere.com/ogcapi/processes/OSMEcereRoutingEngine",
"inputs" : [
{
"id" : "roadsNetworkSource",
"collection" : "https://maps.ecere.com/ogcapi/datasets/vtp-Daraa/collections/TransportationGroundCrv"
},
{
"id" : "elevationModel",
"collection" : "https://maps.ecere.com/ogcapi/datasets/vtp-Daraa/collections/DTED"
},
{
"id" : "request",
"value" :
{
"waypoints": {
"type": "MultiPoint",
"coordinates": [
[ 36.00210415, 32.54581061 ],
[ 36.11221587, 32.67020983 ]
]
}
}
}
],
"useWith" : { "style" : { "stroke" : { "color" : "lightBlue", "width" : 10.0 } } }
}
]
}
]
}
Details on how such a document could be POSTed to a
Suggested conformance classes for the MOAW approach so far are:
|
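The nested execution documents above lend themselves to a simple traversal, for instance to list every process and collection a workflow depends on before validating it. The following is a minimal Python sketch that assumes only the keys visible in the examples (`process`, `collection`, `inputs`, `list`); the function name `collect_resources` is hypothetical and not part of any specification.

```python
from typing import Any, Dict, List, Set, Tuple


def collect_resources(document: Dict[str, Any]) -> Tuple[Set[str], Set[str]]:
    """Return (process URLs, collection URLs) referenced anywhere in a workflow."""
    processes: Set[str] = set()
    collections: Set[str] = set()
    stack: List[Dict[str, Any]] = [document]
    while stack:
        node = stack.pop()
        if "process" in node:
            processes.add(node["process"])
        if "collection" in node:
            collections.add(node["collection"])
        # Members of "inputs" and "list" are nodes of the same shape,
        # so they are visited the same way; plain "value" inputs add nothing.
        stack.extend(node.get("inputs", []))
        stack.extend(node.get("list", []))
    return processes, collections
```

Applied to the "More complex workflow" example, this would report three processes (RenderMap, ElevationContours, OSMEcereRoutingEngine) and two collections (DTED, TransportationGroundCrv).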
You may not need to reinvent the wheel here: As WPS did not support process chaining directly, we had to come up with our own things in openEO. It seems we do very similar things with exposing geospatial data at /collections (due to STAC/OGC API Features) and processes at /processes. Maybe you can use that as a basis: https://api.openeo.org/#section/Processes |
@m-mohr Thanks for the openEO info! We are also basing this on OGC API - Common - Part 2: Geospatial Data (/collections) and Processes (/processes), including process description. However, one objective is that a client can instantly query a complex workflow without it having to be first published, and the nesting is simply built into the Process execution document. Yet the first hand-shake request will be able to fully validate the workflow, and subsequent requests can query a specific AOI/format/API etc. It is also entirely driven from the end client (the researcher discovering data and processes, and building and tweaking the workflow), and we are focusing on tile-based and DGGS-based access, which will be individual requests of very short duration, so /jobs will not be needed. We are definitely interested in interoperability with openEO and I hope that will be possible as the OGC API standards evolve, including Common and Processes in particular. However, I don't see us adopting that exact approach directly. What we are implementing is mostly fully described above and in issue opengeospatial/ogcapi-maps#42, yet it offers a lot of flexibility. |
@jerstlouis If I understand you correctly: Totally possible in openEO, see /results. Basically what is synchronous in WPS would be /result in openEO and asynchronous would be /jobs.
Then there's also the option for tiled synchronous in /services as part of a Maps API or WMTS or so. Although /services would also allow more than tiled, could also be a WCS or whatever. On the other hand, this is not (yet) directly submitting the process chain with the Maps API / WMTS request. But from Maps issue 42 it seems you don't do that either, as you are POSTing things, and I'd assume that GET would be more helpful with web mapping libraries.
Great to hear. Similarly, I could see a next iteration of openEO to work on the alignment with OGC API - Processes.
It was not my intention to say you should adopt it as is; it was meant more as inspiration. The nice thing with our processes is the re-usability. You can define small "sub-algorithms" and re-use and share them. I think that's a big advantage and something that is missing from OGC API Processes (maybe I'm missing something). |
@m-mohr Thank you. Definitely inspired by the example, as the following shows :) You could save the following in a .moaw file and load this directly in QGIS and pan & zoom around and the client would only fetch the visualized area/resolution (using
Or you could POST it to:
{
"id" : "EVIExample",
"process" : "http://maps.ecere.com/ogcapi/processes/CoverageProcessor",
"inputs" : [
{
"id" : "data",
"process" : "http://maps.ecere.com/ogcapi/processes/CoverageProcessor",
"inputs" : [
{
"id" : "data",
"collection" : "http://rasdaman.org/ogcapi/collections/sentinel-2"
},
{
"id" : "code",
"value" :
"double BLUE = data[bands:'B02'], RED = data[bands:'B04'], NIR = data[bands:'B08']; return 2.5 * (NIR - RED) / (1 + NIR + 6*RED + -7.5*BLUE);"
}
]
},
{
"id" : "code",
"value" : "return min[time](data);"
}
]
}
To clarify, those POSTs I am talking about don't create something on the server but all return content / 200. Of course that could be written more simply with a single CoverageProcessor, but the separate ones were to showcase the nested processing there :) (they could have been on separate servers too)
{
"id" : "EVIExample2",
"process" : "http://maps.ecere.com/ogcapi/processes/CoverageProcessor",
"inputs" : [
{ "id" : "data", "collection" : "http://rasdaman.org/ogcapi/collections/sentinel-2" },
{ "id" : "code", "value" :
"double BLUE = data[bands:'B02'], RED = data[bands:'B04'], NIR = data[bands:'B08']; return min[time](2.5 * (NIR - RED) / (1 + NIR + 6*RED + -7.5*BLUE));"
}
]
} |
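The two EVI documents above are meant to give the same result: one chains a per-time-step EVI process into a min-over-time process, the other does both in a single step. A small pure-Python sketch of that equivalence for one pixel follows; the reflectance values used with it are made up, and the function names are hypothetical stand-ins for what a CoverageProcessor might compute.

```python
from typing import Dict, List

# Reflectance time series of a single pixel, keyed by band name.
Pixel = Dict[str, List[float]]


def evi(blue: float, red: float, nir: float) -> float:
    """EVI as written in the 'code' inputs of the examples above."""
    return 2.5 * (nir - red) / (1 + nir + 6 * red + -7.5 * blue)


def nested_workflow(pixel: Pixel) -> float:
    """EVIExample: an inner process computes EVI per time step, an outer one takes the minimum."""
    evi_series = [
        evi(blue, red, nir)
        for blue, red, nir in zip(pixel["B02"], pixel["B04"], pixel["B08"])
    ]
    return min(evi_series)


def single_workflow(pixel: Pixel) -> float:
    """EVIExample2: one process evaluating min over time of the EVI expression."""
    return min(
        evi(blue, red, nir)
        for blue, red, nir in zip(pixel["B02"], pixel["B04"], pixel["B08"])
    )
```

Both functions return the same value for any pixel; the difference in the workflow documents is only where the intermediate EVI series lives (possibly on another server).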
@jerstlouis Interesting! I indeed missed some parts of your #47 (comment) regarding MOAW. I think the difference between openEO here and your MOAW approach is that I wasn't speaking about the whole processing instructions from loading data to storing results, but a subset. Like in the example I posted above, there are some data cube operations (loading data, reducing operations, saving result) included and then a sub-process that computes the EVI. The EVI part could be made a separate process with its own defined parameters and return values. Then you can re-use that EVI process in other process chains again. Hard to really explain correctly in just a few sentences though.
Yeah, that's very similar. Your file is probably a bit more versatile as with our approach QGIS would first need to create a service and then could do the interactions described by you.
Understood. I was more referring to the point that a normal WMTS/XYZ/whatever layer in OpenLayers/Leaflet just usually makes GET requests and not POSTs. Our approach with creating the service beforehand makes it better suit the current way these libraries work whereas your approach likely needs new extensions to those libraries. There's pros and cons with both alternatives and it depends on the use-case which is better suited. Interesting example. That even goes beyond my example as it allows direct execution of code. We have that in openEO, too, but differently. What seems to be a bigger difference is what a process is. The http://maps.ecere.com/ogcapi/processes/CoverageProcessor process seems to be something bigger (the processing engine?) whereas our add/min/... processes are more fine-grained and you just use them in code and there's no reference to .../processes/min or so. So that's a bit confusing as it's unclear where they come from (where are they exposed to users? Also in /processes?). |
Yes I understand that, we were also thinking about that. Right now the inputs can be another process, but you could also potentially publish a MOAW workflow as itself being a process, by leaving some most-nested input data undefined, which I guess you would need to map somehow to the top-level inputs. We have not yet thought this part through fully, but that should be doable and make all that possible.
Well GET is limited in terms of passing complex requests, so this is where POST can be used.
Not sure what you meant by they there? CoverageProcessor was just an example of a potential process that might be sitting at |
That sounds very useful! Good to hear that.
Yes, in openEO you can just define a full new process again with all the metadata you get from /processes. But yeah, in the end it's similar to your "undefined input" thought.
Indeed, I was just saying that this probably needs work in the tooling, but could likely be that the virtual collections make up for that.
I meant min, +, -, etc.
Indeed, that looks very similar to WCPS. Not a big fan of that. The number of WCPS implementations is limited, but now I'm far away from the issue's topic. ;-) There just seems to be a bit of a difference between OAPro and openEO in what a process is. We express the WCPS code itself as individual processes so that it's just one "language". That enables us to be completely independent of the underlying processing engine, but needs a separate translation step (e.g. from openEO to WCPS for rasdaman). |
Perhaps something like this could work, where this new
{
"id" : "EVI",
"process" : "http://maps.ecere.com/ogcapi/processes/CoverageProcessor",
"inputs" : [
{ "id" : "data", "input" : "data" },
{ "id" : "code", "value" :
"double BLUE = data[bands:'B02'], RED = data[bands:'B04'], NIR = data[bands:'B08']; return min[time](2.5 * (NIR - RED) / (1 + NIR + 6*RED + -7.5*BLUE));"
}
]
}
Then this could be published to
{
"id" : "EVI",
"process" : "http://maps.ecere.com/ogcapi/processes/EVI",
"inputs" : [ { "id" : "data", "collection" : "http://rasdaman.org/ogcapi/collections/sentinel-2" } ]
}
We were also considering being able to deploy code and/or a containerized process to a 'generic' process, much like the ADES (in fact being able to re-use containers meant for an ADES is one objective), so e.g. being able to POST that CoverageProcessor code directly to create a new "EVI" process would be another approach.
Maybe if we define something simple enough to implement it could get enough adoption. |
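One possible reading of the "input" placeholder idea above is that invoking the published process substitutes caller-supplied inputs for each placeholder. The thread states that this part has not been fully thought through, so the following Python sketch is only an illustration under that assumption; `bind_inputs` and its `bindings` argument are hypothetical names.

```python
import copy
from typing import Any, Dict


def bind_inputs(template: Dict[str, Any], bindings: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return a copy of `template` with each {"input": name} placeholder filled in."""
    node = copy.deepcopy(template)
    if "inputs" not in node:
        return node
    resolved = []
    for item in node["inputs"]:
        if "input" in item:
            # Placeholder: keep the local id, take the data source from the caller.
            name = item["input"]
            if name not in bindings:
                raise KeyError(f"no value supplied for input '{name}'")
            resolved.append({"id": item["id"], **bindings[name]})
        elif "process" in item:
            # Nested process invocation: placeholders may sit deeper in the chain.
            resolved.append(bind_inputs(item, bindings))
        else:
            resolved.append(item)
    node["inputs"] = resolved
    return node
```

With the "EVI" template above and `{"data": {"collection": "http://rasdaman.org/ogcapi/collections/sentinel-2"}}` as bindings, the result would be a fully specified CoverageProcessor execution document.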
Yes, indeed that should work if it is unambiguous. In openEO we have
Yes, we have that too. It's called user-defined functions (UDFs). Guess that's the most challenging task to get right with regards to performance, data exchange and security.
I doubt it gets much simpler than WCPS (or openEO) without being too limiting (just considering the math part).
Yes, it looks like that. At least when reading through many of the issues, it's all very similar.
Well, that's JSON generated by clients. Our users don't hand-write that. We have JS, Python, R and a Web client for that. For example in the Python client, you can just type native Python code through operator overloading for example and then the JSON stuff is handled without the user noticing it. Just for reference, that's the Python client code generating something very similar to the JSON example I posted above ( https://api.openeo.org/assets/pg-evi-example.json ):
import openeo
connection = openeo.connect("https://openeo.org")
sentinel2_data_cube = connection.datacube("Sentinel-2")
B02 = sentinel2_data_cube.band('B02')
B04 = sentinel2_data_cube.band('B04')
B08 = sentinel2_data_cube.band('B08')
evi_cube = (2.5 * (B08 - B04)) / ((B08 + 6.0 * B04 - 7.5 * B02) + 1.0)
min_composite = evi_cube.min_time(dimension="t")
min_composite.download("out.tiff", format="GTiff")
That should be easy enough and is actually in the language the user prefers.
Agreed, but we let the clients do the work and don't expect users to fiddle with JSON at all. |
@m-mohr Very cool. Thanks for sharing that Python code. If I understand correctly, one JSON workflow file defines a workflow which ends up running like a single process on a single server? |
Yes, indeed, similar to WCPS in most cases. Although it's not necessarily a single server, but parts of the workflow can be distributed across a cluster, run in parallel or even redirect things to external servers (currently through run_udf_externally, but that leads to its own challenges, e.g. the amount and speed of transmitting the files over the network). The important thing is that the user shouldn't really notice in most cases. |
@m-mohr With MOAW, a difference is that those processes are really distinct processes, which themselves could directly be on another server, and the interface to feed data from one process into another is an OGC API -- e.g. Processing for { "process" } input (which could involve a nested part of the MOAW workflow), or Coverages or Features, all of which could be using Tiles or DGGS or BBOX or SUBSET to request specific AOI/resolution in parallel. So potentially we could have e.g.
Because each hop of the workflow negotiates the most efficient way to transfer things (e.g. which API / formats / partitioning mechanism to use), and the requests can be done in parallel on subsets, things should perform fairly well over the network (as opposed to waiting for large files to be processed / encoded / decoded). |
Interesting, thanks for pointing that out. That sounds useful on the one hand, but also pretty complex on the other hand, especially when it comes to making clients support that in an easy way.
I'd be really interested to see evidence on that at some point. From our experiments: Once network transfer between providers started (e.g. with DIASes in Europe), performance degraded relatively quickly. So really interested in how others have solved that. |
Actually it is not that bad...
But you could also implement more complex mechanics, potentially handling different partitioning mechanisms (e.g. translating Tiles to BBOX for a server that does not support Tiles or a particular TileMatrixSet), and you may implement things like CRS conversions to bridge gaps.
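As an illustration of the Tiles-to-BBOX translation mentioned above, the sketch below converts a tile address into a bounding box. It assumes the WebMercatorQuad tile matrix set (EPSG:3857, one tile at level 0, rows counted from the top), which the thread does not specify; the function name `tile_to_bbox` is hypothetical.

```python
import math
from typing import Tuple

EARTH_RADIUS_M = 6378137.0
# Half the width of the Web Mercator square, in metres.
HALF_EXTENT = math.pi * EARTH_RADIUS_M


def tile_to_bbox(level: int, row: int, col: int) -> Tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) in EPSG:3857 for a WebMercatorQuad tile."""
    tiles_per_side = 2 ** level
    size = 2 * HALF_EXTENT / tiles_per_side
    min_x = -HALF_EXTENT + col * size
    # Row 0 is the northernmost row, so y decreases as the row index grows.
    max_y = HALF_EXTENT - row * size
    return (min_x, max_y - size, min_x + size, max_y)
```

A hop that only understands bounding boxes could then be queried with the box returned for each requested tile.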
Of course a major goal of the project is to validate this approach, and although for sure things spread out on different servers will degrade performance, I believe that some of the approach (e.g. pre-establishing a connection / validation of the workflow, and small parallel requests per tiles, and tiles caching and preemption of requests) will somewhat mitigate the impact of that. Assessing this is an important part of our project.
This openEO input and comparison is actually very interesting and relevant to our on-going design and development, so thanks a lot for the great overview and feedback! We are definitely interested in integrating with the openEO platforms going forward, as it fits well within the key objectives of our project to improve the findability, discoverability, interoperability and usability of data. |
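The mitigation discussed above (small parallel requests per tile, plus tile caching) can be sketched with the Python standard library. The `fetch_tile` function here is a stand-in that fabricates bytes locally instead of contacting a server, and all names are hypothetical; whether this actually offsets network costs is exactly what the project intends to assess.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache
from typing import Dict, List, Tuple

Tile = Tuple[int, int, int]  # (level, row, col)


@lru_cache(maxsize=1024)
def fetch_tile(tile: Tile) -> bytes:
    """Stand-in for one short request; a real client would call the server here."""
    level, row, col = tile
    return f"tile {level}/{row}/{col}".encode()


def fetch_visible(tiles: List[Tile], workers: int = 8) -> Dict[Tile, bytes]:
    """Fetch all tiles of the current view in parallel, reusing cached tiles."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(tiles, pool.map(fetch_tile, tiles)))
```

Panning back to an area already visited would then be served from the cache rather than triggering new requests along the chain.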
Practical examples based on initial successful experiments of the Modular OGC API Workflow approach:
Synchronous Processing
Non-tiled Maps/Features/Coverages supporting
cURL Examples
With the following .moaw JSON file named 1000mContours.moaw:
{
"id" : "1000mContours",
"process" : "http://server.com/ogcapi/processes/ElevationContours",
"inputs" : [
{
"id" : "data",
"collection" : "http://server.com/ogcapi/collections/SRTM_ViewFinderPanorama"
},
{ "id" : "distance", "value" : 1000 }
]
}
Deferred Processing
cURL Examples
With the same .moaw JSON file named 1000mContours.moaw as above:
Virtual collections
cURL Example
The following has not yet been implemented:
|
Work in progress OpenAPI with (mostly) working prototype in the Swagger interface: |
Is there an example? |
@m-mohr The OpenAPI specifications mentioned above contain examples, and there are also more as sample workflow execution documents on the processes pages themselves for now: https://maps.ecere.com/ogcapi/processes/RenderMap The workflow execution document will be adjusted to follow the latest SWG decision on #105 where the inputs will be a dictionary/map instead of containing "id" to name them. |
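To illustrate the change from "id"-named inputs to a dictionary/map, the following Python sketch converts the list form used in the examples of this thread into a keyed form. The exact shape adopted in #105 is not reproduced here, so the output layout is an assumption, and `inputs_to_map` is a hypothetical helper.

```python
from typing import Any, Dict


def inputs_to_map(node: Dict[str, Any]) -> Dict[str, Any]:
    """Rewrite "inputs": [{"id": name, ...}] as "inputs": {name: {...}}, recursively."""
    converted: Dict[str, Any] = {}
    for key, value in node.items():
        if key == "inputs":
            # Each input's "id" becomes its dictionary key; the rest is kept.
            converted[key] = {
                item["id"]: inputs_to_map({k: v for k, v in item.items() if k != "id"})
                for item in value
            }
        elif key == "list":
            # Members of a "list" input may themselves be process invocations.
            converted[key] = [inputs_to_map(member) for member in value]
        else:
            converted[key] = value
    return converted
```

For the "BufferedRoads" example this yields an `inputs` object with the keys `data` and `width`.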
One might see a process as acting on 0, 1 or multiple "collection" (data) inputs, possibly in addition to parameters, and producing 0, 1 or multiple "collection" (data) outputs.
In addition to nicely tying OGC API - Processes with Features and Coverages, it makes daisy chaining very easy to accomplish.
The "data delivery" building blocks (e.g. Features & Coverages) provide the mechanisms by which to identify "collections" of vector features or grid cells (data), and retrieve them. Built-in space partitioning mechanisms such as bounding boxes or more complex sub-setting can be used to efficiently retrieve part of the data, or special building blocks such as Tiles or DGGS can offer additional exchange mechanisms. This makes processes highly scalable as different tiles, DGGS cells or subsets of the data could be processed in a massively parallel manner, all through a potential chain of processes. The specific space partitioning modular API blocks (e.g. Tiles or DGGS cell), or the base Features/Coverage mechanisms can be negotiated at any exchange level throughout the chain.
Since it seems that we have to stick with the generic but loaded term of "collections" for e.g. OGC API - Features, I think it would be a mistake to overload the term of "collection" with a meaning other than a "set of geospatial data with particular characteristics", such as a generic group of things (e.g. a list of available processes). One can imagine the confusion if what OGC processes do is act on "collections", while "collection" is also used to mean a regrouping of multiple processes.
A simple example is vectorization of a digital elevation model.
You have a collection representing a DEM (it is important here that the level of a "collection" should represent a "collection of gridded cells", not a collection of a collection of gridded cells, unless said collection of collections is intended to be treated as if the cells were all together in one single DEM layer).
This is your input collection to a "vectorization" process.
You invoke it with parameters, and you get a "vector" collection back.
You can then chain another process working on a vector collection.
It can also be seen that rendering a "map" (as in the Maps API) is clearly a process rendering 1 or more "collections", taking a style as parameter, and producing an imagery ((A)RGB natural color image) "collection" as an output (the rendered map). The result can be retrieved through various data partitioning mechanisms (e.g. bounding box, or tiles), and the process can be optimized to render portions as required, cached, in parallel, etc.
We think it makes a lot of sense for OGC API - Processes to support both synchronous (POST and get the 200 output result) and asynchronous modes, Maps and Routes being an important example of how this is useful for simple POST requests returning an output collection (e.g. a route or rendered map) right away.
There could be standardized "well known" processes such as the Maps Rendering process, the Routing process (open routing pilot), Vectorization process, etc, which would further greatly enhance interoperability and chainability of different processes. Most processes would also provide the output in a "collection" interface, which any other process, data delivery, or geospatial viewer supporting the OGC API would readily support.
This also makes it possible to only process data on an as-needed basis, e.g. when first visualizing the output (or being requested by further down the raw data-process-visualization daisy chain) for a certain area or resolution level.
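The as-needed processing described above can be pictured as a chain of lazily evaluated steps, where requesting one tile of the final output pulls only the matching tile through each upstream process. The sketch below is a toy model with fake data; `LazyProcess` and `dem` are hypothetical names, and no OGC API calls are involved.

```python
from typing import Callable, Dict, List, Tuple

Tile = Tuple[int, int, int]  # (level, row, col)
Source = Callable[[Tile], List[float]]


class LazyProcess:
    """A process output exposed like a collection; each tile is computed on first request."""

    def __init__(self, source: Source, operation: Callable[[List[float]], List[float]]):
        self._source = source
        self._operation = operation
        self._cache: Dict[Tile, List[float]] = {}
        self.computed = 0  # number of tiles actually processed so far

    def __call__(self, tile: Tile) -> List[float]:
        if tile not in self._cache:
            # Pull only this tile from upstream, then process and remember it.
            self._cache[tile] = self._operation(self._source(tile))
            self.computed += 1
        return self._cache[tile]


def dem(tile: Tile) -> List[float]:
    """Stand-in for a DEM collection: fake elevations derived from the tile address."""
    level, row, col = tile
    return [float(level + row + col + i) for i in range(4)]
```

Chaining `LazyProcess(dem, ...)` into another `LazyProcess` gives a two-hop workflow in which nothing is computed until a viewer asks for a specific tile.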
The concept of workflow could be simply described as a chain of OGC API resources and parameters. The actual communication between any 2 services in the chain can be negotiated (e.g. supporting tiles, supporting DGGS, supported data formats), and need not be included in that workflow description, so that alternate implementations with different support could apply the same workflow.
All this makes for a great universal geospatial processing framework by which all processes and data collections become readily compatible, and one can easily discover and combine data sets from all over and using processes hosted anywhere.
I brought up these ideas to the Process group at the OGC API Hackathon, as well as previously within OGC API Common discussions.
Additional background in opengeospatial/ogcapi-common#17
#33 #2 #35 #30