Populating Algorithms: Mogpr from Fusets #80

Open · wants to merge 15 commits into main
Conversation

Pratichhya (Contributor):

The service is also available at https://marketplace-portal.dataspace.copernicus.eu/catalogue/app-details/17

The modification in the openeo_udp of this repo, in comparison to the existing one, is:

  • The dependencies are passed within the files and need not be passed separately in the job_options

@Pratichhya marked this pull request as draft on January 9, 2025 at 12:53
@Pratichhya (Contributor, Author):

@soxofaan @HansVRP
Before changing this from Draft to Ready, could you please give me your suggestions on the following:

  • Here, the input parameter is only a datacube, and there is no definition/example of whether there is a restriction on passing this as a parameter instead of spatial and temporal extents, as done in other cases.
  • In the benchmark scenario, the process graph includes multiple nodes, but according to the documentation (correct me if I am wrong) it should only have a single node that points to the namespace.

"type": "apex_algorithm",
"title": "Multi output gaussian process regression",
"description": "Integrates timeseries in data cube using multi-output gaussian process regression. The service is designed to enable multi-output regression analysis using Gaussian Process Regression (GPR) on geospatial data. It provides a powerful tool for understanding and predicting spatiotemporal phenomena by filling gaps based on other indicators that are correlated with each other.",
"cost_estimate": 12,
Contributor:

Was this calculated with the standard job options, or did you re-evaluate this cost?

Contributor (Author):

No, this was when updating the job_options. As mentioned in the documentation, 'executor-memory': '7g' was set; however, I used:
"executor-memory": "1G",
"executor-memoryOverhead": "500m",
"python-memory": "3G"

Contributor:

Okay, so did this influence the cost estimate?

Also, did you use the same memory settings in your benchmark?

@Pratichhya (Contributor, Author), Jan 14, 2025:

Not much: by 4-5 credits. Usually with my settings it is around 15 credits, but with executor-memory: 7g it is 19 credits for a month and the same AOI.

"rel": "openeo-process",
"type": "application/json",
"title": "openEO Process Definition",
"href": "https://raw.githubusercontent.com/ESA-APEx/apex_algorithms/refs/heads/mogpr_v1/openeo_udp/fusets_mogpr/fusets_mogpr.json"
Contributor:

this branch will probably be deleted afterwards, so make sure to update once merged


### Synchronous calls

TODO: Replace with actual measurements!!!
Contributor:

please do so

Contributor (Author):

It fails due to a read timeout (minimum 13-15 mins)

Contributor (Author):

Removed the measurements section in the readme.


## Configuration & Resource Usage

Run configurations for different ROI/TOI with memory requirements and estimated run durations.
Contributor:

please do so

Contributor (Author):

Updated the readme to content similar to what is in the marketplace.


### Batch jobs

TODO: Replace with actual measurements!!!
Contributor:

please do so

Contributor (Author):

updated

"process_id": "apply_neighborhood",
"arguments": {
"data": {
"from_parameter": "input_raster_cube"
Contributor:

Here we need to discuss whether this fits the current APEx way of working...

Contributor:

I'd suggest to just use data as parameter name (like in the majority of raster cube processes) if you have a single data cube as input parameter

Contributor (Author):

> I'd suggest to just use data as parameter name (like in the majority of raster cube processes) if you have a single data cube as input parameter

Updated the parameter name from "input_raster_cube" to "data".

But @JanssenBrm @jdries: the usage example here (https://marketplace-portal.dataspace.copernicus.eu/catalogue/app-details/17, sketched after this comment) requires the user to

  • do the load_collection,
  • apply masking,
  • call the UDP for MOGPR, and
  • aggregate for the time series.

Is there a reason why this service requests a datacube, rather than only a spatial and temporal extent, returning the gap-filled time series directly?
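
A rough sketch of that client-side workflow with the openeo Python client; the collection, bands, masking step, geometry and reducer are illustrative assumptions, not the exact marketplace example, and the namespace URL is the one linked in this PR:

import openeo

connection = openeo.connect("openeo.dataspace.copernicus.eu").authenticate_oidc()

# 1. load_collection (illustrative extents and bands)
s2 = connection.load_collection(
    "SENTINEL2_L2A",
    spatial_extent={"west": 5.0, "south": 51.0, "east": 5.1, "north": 51.1},
    temporal_extent=["2022-06-01", "2022-06-30"],
    bands=["B04", "B08", "SCL"],
)

# 2. masking (hypothetical: keep only pixels classified as vegetation, SCL class 4)
masked = s2.mask(s2.band("SCL") != 4)
ndvi = masked.ndvi(nir="B08", red="B04")

# 3. call the UDP for MOGPR
filled = connection.datacube_from_process(
    "fusets_mogpr",
    namespace="https://raw.githubusercontent.com/ESA-APEx/apex_algorithms/refs/heads/mogpr_v1/openeo_udp/fusets_mogpr/fusets_mogpr.json",
    data=ndvi,
)

# 4. aggregate to a time series over an illustrative field polygon
field = {"type": "Polygon", "coordinates": [[[5.02, 51.02], [5.04, 51.02], [5.04, 51.04], [5.02, 51.04], [5.02, 51.02]]]}
timeseries = filled.aggregate_spatial(geometries=field, reducer="mean")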

Contributor (Author):

Discussed the reason in person; the upgraded version of this process "MOGPR" is "MOGPR_s1_s2". So I will update this process in the same PR, addressing all the suggested comments, since the content of both will be almost the same.

from openeo.processes import ProcessBuilder, apply_neighborhood
from openeo.rest.udp import build_process_dict

from fusets.openeo import load_mogpr_udf
Contributor:

I do not believe that fusets is part of this environment. @soxofaan, what would be the preferred way of working here?

Contributor:

this is just a dependency for generating the UDP, right?
In that case I would at least add a requirements.txt to this folder as an initial solution (see the sketch below)
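
e.g. a minimal requirements.txt for this folder, based only on the imports visible in generate.py (version pins left out on purpose):

openeo
fusets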

Contributor (Author):

In addition to generating the UDP, it is also needed when publishing and running the UDP.
Updated the requirements.txt.

from openeo.rest.udp import build_process_dict

from fusets.openeo import load_mogpr_udf
from fusets.openeo.services.publish_mogpr import NEIGHBORHOOD_SIZE
Contributor:

does it come with fixed input sizes?

Contributor (Author):

Yes, it seems the size is already defined within FuseTS.

Contributor:

This is also different from the standard way of working; could you take a look into the code to investigate what those standards are?

Contributor (Author):

The NEIGHBORHOOD_SIZE used is 32px:

https://github.com/Open-EO/FuseTS/blob/main/src/fusets/openeo/services/publish_mogpr.py#L12

Instead of importing the value, I typed 32 directly into the apply_neighborhood call in generate.py.
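
For context, a sketch of that hard-coded variant using the client's DataCube API, with the UDF loaded through fusets as in the imports above (the function name and the empty overlap are assumptions):

import openeo
from fusets.openeo import load_mogpr_udf

def apply_mogpr(data):
    # Hard-code the 32px neighbourhood instead of importing NEIGHBORHOOD_SIZE from fusets
    return data.apply_neighborhood(
        process=openeo.UDF(code=load_mogpr_udf(), runtime="Python"),
        size=[
            {"dimension": "x", "value": 32, "unit": "px"},
            {"dimension": "y", "value": 32, "unit": "px"},
        ],
        overlap=[],
    )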

sys.path.insert(0, directory)

@functools.lru_cache(maxsize=5)
def setup_dependencies(dependencies_url,DEPENDENCIES_DIR):
Contributor:

@soxofaan should we include file_locking to avoid concurrency issues?

Contributor:

you certainly risk concurrency problems here. However, it's not trivial, because you need a locking mechanism that works across multiple executors. I guess solving this properly is a bit out of scope for this PR
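
For illustration only, a minimal file-lock sketch with Python's standard library; note that this only serializes processes on the same machine and, as said, would not coordinate executors on different nodes:

import fcntl

def setup_once(lock_path, setup_fn):
    # Exclusive lock: only one local process downloads/extracts at a time
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            setup_fn()  # e.g. the download + unzip of the dependencies
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)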

@soxofaan (Contributor):

> Here, the input parameter is only a datacube, and there is no definition/example of whether there is a restriction on passing this as a parameter instead of spatial and temporal extents, as done in other cases.

I'm not sure I understand what you are asking. Is this about raising the issue of a lack of documentation? Or just whether it is OK to use a single data cube parameter instead of extent parameters?

@soxofaan (Contributor):

> In the benchmark scenario, the process graph includes multiple nodes, but according to the documentation (correct me if I am wrong) it should only have a single node that points to the namespace.

in the docs I just see "typically a single node":

> process graph will typically just contain a single node

so it does not say "it should". At first sight, I think it's OK to have more than one node in the benchmark process graph.
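
(For reference, a single-node graph pointing to the UDP namespace would look roughly like this, written as a Python dict; the namespace URL is the one linked in this PR and the arguments are illustrative:)

process_graph = {
    "mogpr1": {
        "process_id": "fusets_mogpr",
        "namespace": "https://raw.githubusercontent.com/ESA-APEx/apex_algorithms/refs/heads/mogpr_v1/openeo_udp/fusets_mogpr/fusets_mogpr.json",
        "arguments": {"data": {"from_parameter": "data"}},
        "result": True,
    }
}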

@soxofaan (Contributor) left a comment:

a couple of notes

},
"temporal_extent": [
"2022-05-01",
"2023-07-31"
Contributor:

Is it intentional to have such a large (15 months if I see correctly) temporal extent?

Contributor (Author):

No, that was not intentional; thanks for pointing it out, as it was one of the causes of the high credits 😅
Updated the scenario to use only a month.

"process_id": "apply_neighborhood",
"arguments": {
"data": {
"from_parameter": "input_raster_cube"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest to just use data as parameter name (like in the majority of raster cube processes) if you have a single data cube as input parameter

from openeo.processes import ProcessBuilder, apply_neighborhood
from openeo.rest.udp import build_process_dict

from fusets.openeo import load_mogpr_udf
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just a dependency for generating the UDP, right?
in that case I would at least add a requirements.txt to this folder as initial solution

"""
with zipfile.ZipFile(zip_path, "r") as zip_ref:
zip_ref.extractall(extract_to)
os.remove(zip_path) # Clean up the zip file after extraction
Contributor:

I've seen this pattern elsewhere, but I think it's bad style to hardcode removal of a zip file in a function that extracts it (separation of concerns). Removal should be handled by the context/function where it was downloaded

Contributor (Author):

Replaced: the os.remove(zip_path) is now handled within the setup_dependencies function.
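
A sketch of that separation of concerns (the download helper is hypothetical):

import os
import zipfile

def extract_zip(zip_path, extract_to):
    # Only extract; cleanup stays with the caller
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(extract_to)

def setup_dependencies(dependencies_url, dependencies_dir):
    zip_path = download(dependencies_url)  # hypothetical helper that downloaded the zip
    extract_zip(zip_path, dependencies_dir)
    os.remove(zip_path)  # removed in the same context where it was created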

Adds a directory to the Python sys.path if it's not already present.
"""
if directory not in sys.path:
sys.path.insert(0, directory)
Contributor:

I think it's kind of a security issue if prepending a path to sys.path is the only/default way. Appending should be the default, with a prepend-mode just for special cases.
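
Something along these lines (sketch):

import sys

def add_to_sys_path(directory, prepend=False):
    # Append by default; prepend only for special cases that really need
    # the directory to shadow already-importable packages
    if directory in sys.path:
        return
    if prepend:
        sys.path.insert(0, directory)
    else:
        sys.path.append(directory)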

@Pratichhya (Contributor, Author), Jan 14, 2025:

I went with insert instead of append, following the solution adopted for PR #78, because there was a conflict between modules that already existed and the extracted ones, and with append it also couldn't find fusets, as seen in "j-2501141413334eaab23c071c8db34078".

What do you mean by prepending it?

Contributor:

prepend = insert at index 0

Prepending to sys.path is usually an antipattern (because of security and stability risks), and we should not implicitly promote it by copying that pattern all over the place.

> because there was a conflict between modules that already existed and the extracted ones

That's what I mean: trying to fix this problem by prepending is like shooting a mosquito with a bazooka: you risk breaking a lot more than what you intend to.

> with append it also couldn't find fusets, as seen in "j-2501141413334eaab23c071c8db34078"

In these error logs I see ModuleNotFoundError: No module named 'fusets'. I don't really get how prepending instead of appending would fix that.

Contributor:

@Pratichhya I can assist you in digging deeper into this issue.

Perhaps there is another file added to the system path with the same name?

Contributor (Author):

This could be because the dependency includes not only fusets but also many other packages with specific versions. Not sure if they were affected.


DEPENDENCIES_DIR2 = 'venv_static'

DEPENDENCIES_URL1 = "https://artifactory.vgt.vito.be:443/artifactory/auxdata-public/ai4food/fusets_venv.zip"
DEPENDENCIES_URL2 = "https://artifactory.vgt.vito.be:443/artifactory/auxdata-public/ai4food/fusets.zip"
Contributor:

I'm not sure if it is very valuable to define these as constants here. Each value is only used once, so you could just use these values directly in the setup_dependencies() calls

Contributor (Author):

Updated: the values are now used directly in the setup_dependencies() calls.
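
e.g. (sketch; which URL pairs with which directory is an assumption, as only DEPENDENCIES_DIR2 is visible in this diff):

setup_dependencies(
    "https://artifactory.vgt.vito.be:443/artifactory/auxdata-public/ai4food/fusets_venv.zip",
    "venv_static",
)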

"""
import os

return Path(os.path.realpath(__file__)).read_text()
Contributor:

just curious: why do you need the os.path.realpath here (and the embedded import os)? Wouldn't just Path(__file__).read_text() work fine?

Also, it seems a bit overkill at the moment to define this load_set_path() function here. When you call it from generate.py, you can just do Path("set_path.py").read_text() directly, which does the same in less code (overall) and requires less clicking around to understand what's happening

Contributor (Author):

> just curious: why do you need the os.path.realpath here (and the embedded import os)? Wouldn't just Path(__file__).read_text() work fine?

I simply fetched the solution from fusets's load_mogpr_udf itself to make it work 😬

> Also, it seems a bit overkill at the moment to define this load_set_path() function here. When you call it from generate.py, you can just do Path("set_path.py").read_text() directly, which does the same in less code (overall) and requires less clicking around to understand what's happening

This is indeed a simpler and nicer solution and works exactly the same, thank you so much. I updated generate.py and set_path.py with the suggested solution.

@Pratichhya (Contributor, Author):

>> Here, the input parameter is only a datacube, and there is no definition/example of whether there is a restriction on passing this as a parameter instead of spatial and temporal extents, as done in other cases.
>
> I'm not sure I understand what you are asking. Is this about raising the issue of a lack of documentation?

Could be said so 👀, but no, not exactly. It is a concern about whether there is a standard way of naming, as with spatial_extent and temporal_extent.

> Or just whether it is OK to use a single data cube parameter instead of extent parameters?

Yes, this one especially, because I saw no example of doing so.

@Pratichhya marked this pull request as ready for review on January 14, 2025 at 14:46
@soxofaan (Contributor):

> It is a concern about whether there is a standard way of naming, as with spatial_extent and temporal_extent.

as mentioned elsewhere in this issue, I'd recommend in most cases to use data for raster cubes, to align with standard openEO process naming. The only standard openEO process that deviates from this, as far as I can think of, is merge_cubes with cube1 and cube2.

> Or just whether it is OK to use a single data cube parameter instead of extent parameters?
>
> Yes, this one especially, because I saw no example of doing so.

It basically depends on the usage/applicability of the "algorithm": if the algorithm is tied closely to a certain data set (e.g. biopar stuff), then it makes sense to include the load_collection in the UDP. But if it is a generic algorithm (e.g. producing monthly composites of whatever you feed it), then you can only expect a generic data parameter to get the raster data cube.
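
For illustration, the generic variant could declare its raster cube input like this with the openeo Python client (the process graph construction is elided; names are illustrative):

from openeo.api.process import Parameter
from openeo.rest.udp import build_process_dict

# Generic algorithm: expose the raster cube itself as a "data" parameter
data_param = Parameter.raster_cube(
    name="data", description="Input raster data cube to gap-fill"
)

spec = build_process_dict(
    process_graph=process_graph,  # hypothetical: a graph built on top of data_param
    process_id="fusets_mogpr",
    parameters=[data_param],
)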
