Populating Algorithms: Mogpr from Fusets #80

Open · wants to merge 15 commits into main
Conversation

Pratichhya (Contributor):

The service is also available at https://marketplace-portal.dataspace.copernicus.eu/catalogue/app-details/17

The modification in the openeo_udp of this repo, in comparison to the existing one, is:

  • The dependencies are passed within the files and need not be passed separately in the job_options

@Pratichhya marked this pull request as draft on January 9, 2025 at 12:53
@Pratichhya (Contributor, Author):

@soxofaan @HansVRP
Before changing this from Draft to Ready, could you please give me your suggestions on the following:

  • Here, the input parameter is only a datacube, and there is no definition/example of whether there is a restriction on passing this as a parameter instead of spatial and temporal extents, as done in other cases.
  • In the benchmark scenario, the process graph includes multiple nodes, but according to the documentation (correct me if I am wrong) it should only have a single node that points to the namespace.

"type": "apex_algorithm",
"title": "Multi output gaussian process regression",
"description": "Integrates timeseries in data cube using multi-output gaussian process regression. The service is designed to enable multi-output regression analysis using Gaussian Process Regression (GPR) on geospatial data. It provides a powerful tool for understanding and predicting spatiotemporal phenomena by filling gaps based on other indicators that are correlated with each other.",
"cost_estimate": 12,
Contributor:

Was this calculated with the standard job options, or did you re-evaluate this cost?

Contributor (Author):

No, this was when updating the job_options. As mentioned in the documentation, 'executor-memory': '7g' was set; however, I used:
"executor-memory": "1G",
"executor-memoryOverhead": "500m",
"python-memory": "3G"

Contributor:

Okay, so did this influence the cost estimate?

Also, did you use the same memory settings in your benchmark?

@Pratichhya (Contributor, Author), Jan 14, 2025:

Not much: by 4-5 credits. Usually with my settings it is around 15 credits, but with executor-memory: 7g it is 19 credits for a month and the same AOI.

"rel": "openeo-process",
"type": "application/json",
"title": "openEO Process Definition",
"href": "https://raw.githubusercontent.com/ESA-APEx/apex_algorithms/refs/heads/mogpr_v1/openeo_udp/fusets_mogpr/fusets_mogpr.json"
Contributor:

this branch will probably be deleted afterwards, so make sure to update once merged


### Synchronous calls

TODO: Replace with actual measurements!!!
Contributor:

please do so

Contributor (Author):

It fails due to a read timeout (minimum 13-15 mins)

Contributor (Author):

Removed the measurements section in the readme.


## Configuration & Resource Usage

Run configurations for different ROI/TOI with memory requirements and estimated run durations.
Contributor:

please do so

Contributor (Author):

Updated the readme to content similar to what is in the marketplace.


### Batch jobs

TODO: Replace with actual measurements!!!
Contributor:

please do so

Contributor (Author):

updated

"process_id": "apply_neighborhood",
"arguments": {
"data": {
"from_parameter": "input_raster_cube"
Contributor:

Here we need to discuss whether this fits the current APEx way of working...

Contributor:

I'd suggest to just use data as parameter name (like in the majority of raster cube processes) if you have a single data cube as input parameter

Contributor (Author):

> I'd suggest to just use data as parameter name (like in the majority of raster cube processes) if you have a single data cube as input parameter

Updated the parameter name from "input_raster_cube" to "data".

But @JanssenBrm @jdries: the usage example here (https://marketplace-portal.dataspace.copernicus.eu/catalogue/app-details/17, sketched after this comment) requires the user to

  • do the load_collection,
  • apply masking,
  • call the UDP for MOGPR, and
  • aggregate for the time series.

Is there a reason why this service requests a datacube, rather than only a spatial and temporal extent, returning the gap-filled time series directly?
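
A rough sketch of that client-side workflow with the openeo Python client; the collection, bands, masking step, geometry and reducer are illustrative assumptions, not the exact marketplace example, and the namespace URL is the one linked in this PR:

import openeo

connection = openeo.connect("openeo.dataspace.copernicus.eu").authenticate_oidc()

# 1. load_collection (illustrative extents and bands)
s2 = connection.load_collection(
    "SENTINEL2_L2A",
    spatial_extent={"west": 5.0, "south": 51.0, "east": 5.1, "north": 51.1},
    temporal_extent=["2022-06-01", "2022-06-30"],
    bands=["B04", "B08", "SCL"],
)

# 2. masking (hypothetical: keep only pixels classified as vegetation, SCL class 4)
masked = s2.mask(s2.band("SCL") != 4)
ndvi = masked.ndvi(nir="B08", red="B04")

# 3. call the UDP for MOGPR
filled = connection.datacube_from_process(
    "fusets_mogpr",
    namespace="https://raw.githubusercontent.com/ESA-APEx/apex_algorithms/refs/heads/mogpr_v1/openeo_udp/fusets_mogpr/fusets_mogpr.json",
    data=ndvi,
)

# 4. aggregate to a time series over an illustrative field polygon
field = {"type": "Polygon", "coordinates": [[[5.02, 51.02], [5.04, 51.02], [5.04, 51.04], [5.02, 51.04], [5.02, 51.02]]]}
timeseries = filled.aggregate_spatial(geometries=field, reducer="mean")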

Contributor (Author):

Discussed the reason in person; the upgraded version of this process "MOGPR" is "MOGPR_s1_s2". So I will update this process in the same PR, addressing all the suggested comments, since the content of both will be almost the same.

from openeo.processes import ProcessBuilder, apply_neighborhood
from openeo.rest.udp import build_process_dict

from fusets.openeo import load_mogpr_udf
Contributor:

I do not believe that fusets is part of this environment. @soxofaan, what would be the preferred way of working here?

Contributor:

this is just a dependency for generating the UDP, right?
In that case I would at least add a requirements.txt to this folder as an initial solution (see the sketch below)
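
e.g. a minimal requirements.txt for this folder, based only on the imports visible in generate.py (version pins left out on purpose):

openeo
fusets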

Contributor (Author):

In addition to generating the UDP, it is also needed when publishing and running the UDP.
Updated the requirements.txt.

from openeo.rest.udp import build_process_dict

from fusets.openeo import load_mogpr_udf
from fusets.openeo.services.publish_mogpr import NEIGHBORHOOD_SIZE
Contributor:

does it come with fixed input sizes?

Contributor (Author):

Yes, it seems the size is already defined within FuseTS.

Contributor:

This is also different from the standard way of working; could you take a look into the code to investigate what those standards are?

Contributor (Author):

The NEIGHBORHOOD_SIZE used is 32px:

https://github.com/Open-EO/FuseTS/blob/main/src/fusets/openeo/services/publish_mogpr.py#L12

Instead of importing the value, I typed 32 directly into the apply_neighborhood call in generate.py.
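
For context, a sketch of that hard-coded variant using the client's DataCube API, with the UDF loaded through fusets as in the imports above (the function name and the empty overlap are assumptions):

import openeo
from fusets.openeo import load_mogpr_udf

def apply_mogpr(data):
    # Hard-code the 32px neighbourhood instead of importing NEIGHBORHOOD_SIZE from fusets
    return data.apply_neighborhood(
        process=openeo.UDF(code=load_mogpr_udf(), runtime="Python"),
        size=[
            {"dimension": "x", "value": 32, "unit": "px"},
            {"dimension": "y", "value": 32, "unit": "px"},
        ],
        overlap=[],
    )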

sys.path.insert(0, directory)

@functools.lru_cache(maxsize=5)
def setup_dependencies(dependencies_url,DEPENDENCIES_DIR):
Contributor:

@soxofaan should we include file_locking to avoid concurrency issues?

Contributor:

you certainly risk concurrency problems here. However, it's not trivial, because you need a locking mechanism that works across multiple executors. I guess solving this properly is a bit out of scope for this PR
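
For illustration only, a minimal file-lock sketch with Python's standard library; note that this only serializes processes on the same machine and, as said, would not coordinate executors on different nodes:

import fcntl

def setup_once(lock_path, setup_fn):
    # Exclusive lock: only one local process downloads/extracts at a time
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            setup_fn()  # e.g. the download + unzip of the dependencies
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)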

@soxofaan (Contributor):

> Here, the input parameter is only a datacube, and there is no definition/example of whether there is a restriction on passing this as a parameter instead of spatial and temporal extents, as done in other cases.

I'm not sure I understand what you are asking. Is this about raising the issue of a lack of documentation? Or just whether it is OK to use a single data cube parameter instead of extent parameters?

@soxofaan (Contributor):

> In the benchmark scenario, the process graph includes multiple nodes, but according to the documentation (correct me if I am wrong) it should only have a single node that points to the namespace.

in the docs I just see "typically a single node":

> process graph will typically just contain a single node

so it does not say "it should". At first sight, I think it's OK to have more than one node in the benchmark process graph.
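
(For reference, a single-node graph pointing to the UDP namespace would look roughly like this, written as a Python dict; the namespace URL is the one linked in this PR and the arguments are illustrative:)

process_graph = {
    "mogpr1": {
        "process_id": "fusets_mogpr",
        "namespace": "https://raw.githubusercontent.com/ESA-APEx/apex_algorithms/refs/heads/mogpr_v1/openeo_udp/fusets_mogpr/fusets_mogpr.json",
        "arguments": {"data": {"from_parameter": "data"}},
        "result": True,
    }
}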

@soxofaan (Contributor) left a comment:

a couple of notes

},
"temporal_extent": [
"2022-05-01",
"2023-07-31"
Contributor:

Is it intentional to have such a large (15 months if I see correctly) temporal extent?

Contributor (Author):

No, that was not intentional; thanks for pointing it out, as it was one of the causes of the high credits 😅
Updated the scenario to use only a month.

"process_id": "apply_neighborhood",
"arguments": {
"data": {
"from_parameter": "input_raster_cube"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest to just use data as parameter name (like in the majority of raster cube processes) if you have a single data cube as input parameter

from openeo.processes import ProcessBuilder, apply_neighborhood
from openeo.rest.udp import build_process_dict

from fusets.openeo import load_mogpr_udf
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just a dependency for generating the UDP, right?
in that case I would at least add a requirements.txt to this folder as initial solution

"""
with zipfile.ZipFile(zip_path, "r") as zip_ref:
zip_ref.extractall(extract_to)
os.remove(zip_path) # Clean up the zip file after extraction
Contributor:

I've seen this pattern elsewhere, but I think it's bad style to hardcode removal of a zip file in a function that extracts it (separation of concerns). Removal should be handled by the context/function where it was downloaded

Contributor (Author):

Replaced: the os.remove(zip_path) is now handled within the setup_dependencies function.
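
A sketch of that separation of concerns (the download helper is hypothetical):

import os
import zipfile

def extract_zip(zip_path, extract_to):
    # Only extract; cleanup stays with the caller
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(extract_to)

def setup_dependencies(dependencies_url, dependencies_dir):
    zip_path = download(dependencies_url)  # hypothetical helper that downloaded the zip
    extract_zip(zip_path, dependencies_dir)
    os.remove(zip_path)  # removed in the same context where it was created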

Adds a directory to the Python sys.path if it's not already present.
"""
if directory not in sys.path:
sys.path.insert(0, directory)
Contributor:

I think it's kind of a security issue if prepending a path to sys.path is the only/default way. Appending should be the default, with a prepend-mode just for special cases.
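
Something along these lines (sketch):

import sys

def add_to_sys_path(directory, prepend=False):
    # Append by default; prepend only for special cases that really need
    # the directory to shadow already-importable packages
    if directory in sys.path:
        return
    if prepend:
        sys.path.insert(0, directory)
    else:
        sys.path.append(directory)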

@Pratichhya (Contributor, Author), Jan 14, 2025:

I went with insert instead of append, following the solution adopted for PR #78, because there was a conflict between modules that already existed and the extracted ones, and with append it also couldn't find fusets, as seen in "j-2501141413334eaab23c071c8db34078".

What do you mean by prepending it?

Contributor:

prepend = insert at index 0

Prepending to sys.path is usually an antipattern (because of security and stability risks), and we should not implicitly promote it by copying that pattern all over the place.

> because there was a conflict between modules that already existed and the extracted ones

That's what I mean: trying to fix this problem by prepending is like shooting a mosquito with a bazooka: you risk breaking a lot more than what you intend to.

> with append it also couldn't find fusets, as seen in "j-2501141413334eaab23c071c8db34078"

In these error logs I see ModuleNotFoundError: No module named 'fusets'. I don't really get how prepending instead of appending would fix that.

Contributor:

@Pratichhya I can assist you in digging deeper into this issue.

Perhaps there is another file added to the system path with the same name?

Contributor (Author):

This could be because the dependency includes not only fusets but also many other packages with specific versions. Not sure if they were affected.


DEPENDENCIES_DIR2 = 'venv_static'

DEPENDENCIES_URL1 = "https://artifactory.vgt.vito.be:443/artifactory/auxdata-public/ai4food/fusets_venv.zip"
DEPENDENCIES_URL2 = "https://artifactory.vgt.vito.be:443/artifactory/auxdata-public/ai4food/fusets.zip"
Contributor:

I'm not sure if it is very valuable to define these as constants here. Each value is only used once, so you could just use these values directly in the setup_dependencies() calls

Contributor (Author):

Updated: the values are now used directly in the setup_dependencies() calls.
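
e.g. (sketch; which URL pairs with which directory is an assumption, as only DEPENDENCIES_DIR2 is visible in this diff):

setup_dependencies(
    "https://artifactory.vgt.vito.be:443/artifactory/auxdata-public/ai4food/fusets_venv.zip",
    "venv_static",
)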

"""
import os

return Path(os.path.realpath(__file__)).read_text()
Contributor:

just curious: why do you need the os.path.realpath here (and the embedded import os)? Wouldn't just Path(__file__).read_text() work fine?

Also, it seems a bit overkill at the moment to define this load_set_path() function here. When you call it from generate.py, you can just do Path("set_path.py").read_text() directly, which does the same in less code (overall) and requires less clicking around to understand what's happening

Contributor (Author):

> just curious: why do you need the os.path.realpath here (and the embedded import os)? Wouldn't just Path(__file__).read_text() work fine?

I simply fetched the solution from fusets's load_mogpr_udf itself to make it work 😬

> Also, it seems a bit overkill at the moment to define this load_set_path() function here. When you call it from generate.py, you can just do Path("set_path.py").read_text() directly, which does the same in less code (overall) and requires less clicking around to understand what's happening

This is indeed a simpler and nicer solution and works exactly the same, thank you so much. I updated generate.py and set_path.py with the suggested solution.

@Pratichhya (Contributor, Author):

>> Here, the input parameter is only a datacube, and there is no definition/example of whether there is a restriction on passing this as a parameter instead of spatial and temporal extents, as done in other cases.
>
> I'm not sure I understand what you are asking. Is this about raising the issue of a lack of documentation?

Could be said so 👀, but no, not exactly. It is a concern about whether there is a standard way of naming, as with spatial_extent and temporal_extent.

> Or just whether it is OK to use a single data cube parameter instead of extent parameters?

Yes, this one especially, because I saw no example of doing so.

@Pratichhya marked this pull request as ready for review on January 14, 2025 at 14:46
@soxofaan (Contributor):

> It is a concern about whether there is a standard way of naming, as with spatial_extent and temporal_extent.

as mentioned elsewhere in this issue, I'd recommend in most cases to use data for raster cubes, to align with standard openEO process naming. The only standard openEO process that deviates from this, as far as I can think of, is merge_cubes with cube1 and cube2.

> Or just whether it is OK to use a single data cube parameter instead of extent parameters?
>
> Yes, this one especially, because I saw no example of doing so.

It basically depends on the usage/applicability of the "algorithm": if the algorithm is tied closely to a certain data set (e.g. biopar stuff), then it makes sense to include the load_collection in the UDP. But if it is a generic algorithm (e.g. producing monthly composites of whatever you feed it), then you can only expect a generic data parameter to get the raster data cube.
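
For illustration, the generic variant could declare its raster cube input like this with the openeo Python client (the process graph construction is elided; names are illustrative):

from openeo.api.process import Parameter
from openeo.rest.udp import build_process_dict

# Generic algorithm: expose the raster cube itself as a "data" parameter
data_param = Parameter.raster_cube(
    name="data", description="Input raster data cube to gap-fill"
)

spec = build_process_dict(
    process_graph=process_graph,  # hypothetical: a graph built on top of data_param
    process_id="fusets_mogpr",
    parameters=[data_param],
)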
