Part 3 workflows are not very flexible and require replication (tree-like instead of graphs) #421

Open
m-mohr opened this issue Jun 21, 2024 · 7 comments

Comments

@m-mohr

m-mohr commented Jun 21, 2024

The workflow language as defined in Part 3 is pretty much a tree instead of a graph structure, if I understand the documentation correctly. This is not very flexible and makes workflows that start and end with a single node but split in-between overly complex and hard to maintain.

A process that is a graph such as the openEO process graph looks like this:
grafik

A similar workflow expressed in Part 3 looks conceptually like this:
grafik

You can also see this in the coastal erosion example, which loads the DEM collection twice:
https://docs.ogc.org/DRAFTS/21-009.html#_coastal_erosion_susceptibility_example_workflow

This leads to duplication, is hard to maintain by users and difficult to parallelize by servers.

The workflow language should ideally be a directed acyclic graph. Languages such as CWL and the openEO process graph allow that. Thus I'm wondering whether another existing option should be chosen instead of creating a new language with limitations that others have already faced and solved for a reason.
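To make the duplication concrete, here is a minimal Python sketch. It is not Part 3, CWL, or openEO syntax: the node names and the `GRAPH` layout are hypothetical. It expands a small DAG with one shared node into a nested tree and counts how many nodes each form contains.

```python
# Hypothetical process graph: node name -> names of the nodes it consumes.
# "reduce" is shared by "slope" and "aspect", so this is a DAG, not a tree.
GRAPH = {
    "load": [],
    "reduce": ["load"],
    "slope": ["reduce"],
    "aspect": ["reduce"],
    "merge": ["slope", "aspect"],
}


def expand_to_tree(graph, node):
    """Expand a DAG node into a nested tree, copying every shared input."""
    return {node: [expand_to_tree(graph, dep) for dep in graph[node]]}


def count_tree_nodes(tree):
    """Count all nodes of the nested tree, duplicates included."""
    _, children = next(iter(tree.items()))
    return 1 + sum(count_tree_nodes(child) for child in children)


tree = expand_to_tree(GRAPH, "merge")
dag_nodes = len(GRAPH)               # 5 distinct nodes in the graph
tree_nodes = count_tree_nodes(tree)  # 7 nodes once "reduce" and "load" are copied
```

Each additional consumer of a shared node copies that node's whole upstream chain again in the tree form, which is where the maintenance burden comes from.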

I'm not sure whether it's relevant here, but additions such as (remote) collection input & output can be solved through pre-defined processes for example.

@fmigneault
Contributor

I think that might not be the best example to represent (all) the issues about Part 3 Workflows, since there seem to be ways (allegedly, as discussed with @jerstlouis; please fill in on this with more specific references) to indicate parts to reuse in the chain to avoid duplicating the processing.

A good example IMO to better highlight issues about Part 3's tree-based workflow rather than a graph would be if we removed the part of the workflow doing the aggregation/merge, leaving us with more than one "topmost process node".

For example (please bear with my ASCII art), the following is possible with CWL (is it possible in openEO as well?).

                        +-> PROCESS_1 -> PROCESS_2 -+
                        |                           |
WORKFLOW_INPUT (array) -+ [scatter]                 +-> WORKFLOW_OUTPUT (array)
                        |                           |
                        +-> PROCESS_1 -> PROCESS_2 -+

There is no need for a dedicated "merge" process if the operation only consists of "collecting" the values into an array. The workflow language itself can resolve the graph and figure out what to do to combine the results. Since the final PROCESS_2 in the chain is not a unique call (it runs once per item of the input array), a tree cannot be built from it with nested processes. One would need to define a cwl-runner process and submit the workflow to /processes/cwl-runner/execution to resolve that, which adds unnecessary overhead if the OAP implementation already handles CWL as its main process execution engine.
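The scatter pattern in the diagram can be sketched in plain Python. This is only an illustration of the semantics, not CWL itself: `process_1`, `process_2`, and `run_scattered_workflow` are hypothetical placeholders, and the thread pool stands in for whatever parallel execution an engine would use.

```python
from concurrent.futures import ThreadPoolExecutor


def process_1(item):
    """Hypothetical first step of the chain."""
    return item + 1


def process_2(item):
    """Hypothetical second step of the chain."""
    return item * 10


def run_chain(item):
    """Run PROCESS_1 -> PROCESS_2 for a single item of the input array."""
    return process_2(process_1(item))


def run_scattered_workflow(workflow_input):
    """Scatter the chain over every input item and gather results in order.

    No explicit "merge" process is needed: the runner collects the N
    results into the output array itself.
    """
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_chain, workflow_input))


workflow_output = run_scattered_workflow([1, 2, 3])
```

The output array has one entry per input item, so there are N final PROCESS_2 calls and no single root node to nest the others under.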

> This leads to duplication, is hard to maintain by users and difficult to parallelize by servers.
> The workflow language should ideally be a directed acyclic graph. Languages such as CWL and the openEO process graph allow that. Thus I'm wondering whether another existing option should be chosen instead of creating a new language with limitations that others have already faced and solved for a reason.

I agree, Part 3's implicit definitions make it much harder to resolve the intended behavior. It is convenient for a simple linear chain of processes, but Workflows can be much more convoluted than that. I think both approaches are valid, but they should definitely be named differently since they cannot be resolved/converted back-and-forth in the same fashion.

@p3dr0
Member

p3dr0 commented Jun 22, 2024

I have said this often in the past, but just a reminder that I think Part 3 as it is should not be called "workflow" but "chaining".

@jerstlouis
Member

jerstlouis commented Jun 24, 2024

@fmigneault @m-mohr Issue #304 is about adding the $ref capability to re-use workflow components, which allows writing the workflow explicitly as illustrated with the reduce_dimension. However, even if the workflow duplicated the input as in the second diagram (which is also the result of expanding the $refs), the server could (and should) still be smart about it and process the input to reduce_dimension only once. The server could implement this smart logic any way it wants, but in particular with the collection input/output mechanisms, some caching should be in place that automatically resolves identical workflows and caches results in a 2DTMS tile or DGGS manner, and then this happens automagically.
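As a rough illustration of how a server might process a duplicated sub-workflow only once, the sketch below caches results under a hash of the canonical JSON of each nested node. Everything here is an assumption made for illustration: the `{"process": ..., "inputs": ...}` node shape is simplified, `execute` only builds a string instead of running anything, and a real implementation would also key the cache on the requested tile or DGGS zone.

```python
import hashlib
import json

_cache = {}      # canonical hash of a sub-workflow -> cached result
executions = []  # order in which processes were actually run


def workflow_key(node):
    """Hash the canonical JSON form so identical sub-workflows share a key."""
    canonical = json.dumps(node, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def execute(node):
    """Evaluate a nested node, reusing results of identical sub-workflows."""
    if not isinstance(node, dict) or "process" not in node:
        return node  # literal input, nothing to execute
    key = workflow_key(node)
    if key in _cache:
        return _cache[key]
    inputs = {name: execute(value) for name, value in node["inputs"].items()}
    executions.append(node["process"])
    # Stand-in for real processing: describe the call as a string.
    args = ", ".join(str(value) for value in inputs.values())
    result = node["process"] + "(" + args + ")"
    _cache[key] = result
    return result


def reduce_node():
    """Return a fresh copy each time, as after expanding a $ref twice."""
    return {"process": "reduce_dimension", "inputs": {"data": "DEM"}}


workflow = {
    "process": "merge",
    "inputs": {
        "a": {"process": "slope", "inputs": {"data": reduce_node()}},
        "b": {"process": "aspect", "inputs": {"data": reduce_node()}},
    },
}
result = execute(workflow)
```

Because the key depends on content rather than object identity, the two separate copies of the `reduce_dimension` node hit the same cache entry.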

Regarding:

> the coastal erosion example, which loads the DEM collection twice

We really need to do a deep dive into "Collection Input" vs. OpenEO "load collection" (which might be quite similar to the "loadCube" process of WHU in Testbed 19 -- see 5.1.6.5. OGC API — Processes implementation).

A "collection" (Collection Input) in Part 3 workflow simply indicates that this collection is an input.

It does not imply any actual loading of the data.

If the same collection is used for multiple processes (like the Slope and Aspect in Coastal Erosion workflow example) which end up executed on the same server, then of course the server could connect only once to retrieve description information and relevant links about that collection.

But most importantly, data is only retrieved from that collection as needed, that is, as output data is retrieved from the output GDC for a particular Area/Time/Resolution of interest. Of course this also means that the same DEM data retrieved for calculating the Slope can be re-used to compute the Aspect.

At no point is the entire DEM read (the DEM might be several gigabytes or terabytes).
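The on-demand behavior described above can be sketched as follows. The `CollectionInput` and `LazyProcess` classes are hypothetical stand-ins, not part of any OGC API. The first simulates a remote collection that counts the cells it serves, the second applies a per-cell function only to the window requested from its output.

```python
class CollectionInput:
    """Hypothetical remote collection: data is fetched one window at a time."""

    def __init__(self, width, height):
        self.width = width
        self.height = height
        self.cells_read = 0  # total number of cells actually fetched

    def read(self, x0, y0, x1, y1):
        """Return synthetic values for the requested window only."""
        self.cells_read += (x1 - x0) * (y1 - y0)
        return [[float(x + y) for x in range(x0, x1)] for y in range(y0, y1)]


class LazyProcess:
    """Hypothetical process whose output is computed only when requested."""

    def __init__(self, source, func):
        self.source = source
        self.func = func

    def read(self, x0, y0, x1, y1):
        """Pull just the needed window from the source and transform it."""
        window = self.source.read(x0, y0, x1, y1)
        return [[self.func(value) for value in row] for row in window]


# Declaring the input and the process reads no data at all.
dem = CollectionInput(width=100_000, height=100_000)
doubled = LazyProcess(dem, lambda value: value * 2)
cells_before = dem.cells_read

# Requesting a 4x4 window of the output triggers a 4x4 read of the input.
tile = doubled.read(0, 0, 4, 4)
```

In this sketch, naming the collection as an input costs nothing, and the cost of a request scales with the requested window rather than with the size of the DEM.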

So my question is: does the openEO "loadCollection" process necessarily imply reading the whole data of the collection?

A "Collection Input" is really not a "process" -- it is an input "data source", just like an "href" or "value" in the execution request. However, it is a data source not specific to a particular Time/Area/Resolution of interest, which easily adapts to the data of interest being requested as the output of the process, which in turn easily supports the "Collection Output". And this easily allows plugging Collection Outputs in as Collection Inputs to another process (including in distributed workflows, and supporting clients triggering processing simply by submitting GDC / OGC API - Coverages data access requests).

@m-mohr
Author

m-mohr commented Jun 24, 2024

> does the openEO "loadCollection" process necessarily imply reading the whole data of the collection?

No, it can further filter the datacube based on spatial extent, temporal extent, bands and other metadata.
See: https://processes.openeo.org/#load_collection

@jerstlouis
Member

@m-mohr But these are arguments to the process that need to be explicitly hardcoded in the workflow definition, right?

Rephrasing my question, does the loadCollection imply reading all of the data for the subset specified by the arguments to the process?

With Part 3 Collection Input, "collection" does not imply a process of reading any data at all, but is simply saying "this is the data source to use as an input".

Perhaps an analogy is pointing to a huge COG as a process input in an "href", which does not imply that the whole thing needs to be read by the process receiving it as an input if it does not need to.

@m-mohr
Author

m-mohr commented Jun 24, 2024

> @m-mohr But these are arguments to the process that need to be explicitly hardcoded in the workflow definition, right?

Yes and No. They can be flexible until they get submitted for an actual execution.

> Rephrasing my question, does the loadCollection imply reading all of the data for the subset specified by the arguments to the process?

No, back-ends don't need to read anything until they actually need the data.
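One possible reading of "flexible until submitted" is that the extent arguments are left as parameter placeholders and only filled in when a concrete request arrives. The sketch below assumes openEO-style `from_parameter` references. The `resolve_parameters` helper, the collection id "DEM", and the parameter names `bbox` and `interval` are hypothetical.

```python
def resolve_parameters(value, parameters):
    """Return a copy of `value` with {"from_parameter": name} placeholders filled in."""
    if isinstance(value, dict):
        if set(value) == {"from_parameter"}:
            return parameters[value["from_parameter"]]
        return {key: resolve_parameters(item, parameters) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_parameters(item, parameters) for item in value]
    return value


# Workflow template: the extents are left open instead of being hardcoded.
template = {
    "process_id": "load_collection",
    "arguments": {
        "id": "DEM",
        "spatial_extent": {"from_parameter": "bbox"},
        "temporal_extent": {"from_parameter": "interval"},
    },
}

# Values known only at request time, e.g. derived from a data access request.
request = {
    "bbox": {"west": 5.0, "south": 51.0, "east": 6.0, "north": 52.0},
    "interval": ["2024-01-01", "2024-02-01"],
}

resolved = resolve_parameters(template, request)
```

The template itself is left untouched, so the same definition can be resolved again for each new area or time of interest.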

@jerstlouis
Member

jerstlouis commented Jun 24, 2024

> Yes and No. They can be flexible until they get submitted for an actual execution.
> No, back-ends don't need to read anything until they actually need the data.

Thanks for the clarification. Then the two approaches are probably very similar conceptually, even if superficially different syntactically! :)

Could you elaborate a bit on "until they get submitted for an actual execution."?

Can you submit the workflow any other way than "an actual execution"?

In particular, would it be possible (now or with the planned integration efforts) to submit a flexible workflow definition resulting in a virtual GDC, which would access input GDC collections (e.g., via /coverage) filling in those spatiotemporal / bands arguments based on requests (e.g., at /coverage) from that virtual GDC (essentially achieving what I describe in opengeospatial/ogcapi-geodatacubes#8 (comment))?


5 participants