Populating Algorithms: Mogpr from Fusets #80
base: main

Conversation
@soxofaan @HansVRP
"type": "apex_algorithm",
"title": "Multi output gaussian process regression",
"description": "Integrates timeseries in data cube using multi-output gaussian process regression. The service is designed to enable multi-output regression analysis using Gaussian Process Regression (GPR) on geospatial data. It provides a powerful tool for understanding and predicting spatiotemporal phenomena by filling gaps based on other indicators that are correlated with each other.",
"cost_estimate": 12,
Was this calculated with the standard job options, or did you re-evaluate this cost?
No, this was when updating the job options. As mentioned in the documentation, 'executor-memory': '7g' was set; however, I used:
"executor-memory": "1G",
"executor-memoryOverhead": "500m",
"python-memory": "3G"
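For reference, a sketch of how these options would be passed when starting the batch job with the openeo Python client. Only the option values come from the comment above; the commented-out `execute_batch` call is illustrative standard client usage, not code from this PR.

```python
# Job option values from this thread; the keys follow the batch job option
# names used on the CDSE/VITO openEO back-end.
job_options = {
    "executor-memory": "1G",
    "executor-memoryOverhead": "500m",
    "python-memory": "3G",
}

# Illustrative (not executed here): passing the options when running the job.
# job = cube.execute_batch(title="fusets_mogpr benchmark", job_options=job_options)
```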
Okay, so did this influence the cost estimate?
Also, did you use the same memory settings in your benchmark?

Not much: by 4 to 5 credits.
Usually with my settings it is around 15 credits, but with executor-memory: 7g it is 19 credits for the same month and the same AOI.
"rel": "openeo-process",
"type": "application/json",
"title": "openEO Process Definition",
"href": "https://raw.githubusercontent.com/ESA-APEx/apex_algorithms/refs/heads/mogpr_v1/openeo_udp/fusets_mogpr/fusets_mogpr.json"
This branch will probably be deleted afterwards, so make sure to update the link once merged.
openeo_udp/fusets_mogpr/README.md
Outdated
### Synchronous calls

TODO: Replace with actual measurements!!!
please do so
It fails due to a read timeout (minimum 13 to 15 minutes).
Remove the measurements section in the README.
openeo_udp/fusets_mogpr/README.md
Outdated
## Configuration & Resource Usage

Run configurations for different ROI/TOI with memory requirements and estimated run durations.
please do so
Updated the README with content similar to that in the marketplace.
openeo_udp/fusets_mogpr/README.md
Outdated
### Batch jobs

TODO: Replace with actual measurements!!!
please do so
Updated.
"process_id": "apply_neighborhood",
"arguments": {
"data": {
"from_parameter": "input_raster_cube"
Here we need to discuss whether this fits the current APEx way of working.
I'd suggest just using data as the parameter name (as in the majority of raster cube processes) if you have a single data cube as input parameter.
> I'd suggest to just use data as parameter name (like in the majority of raster cube processes) if you have a single data cube as input parameter

Updated the parameter name from "input_raster_cube" to "data".
But, @JanssenBrm @jdries, instead of requesting a data cube and asking users to:
- do the load_collection
- apply masking
- call the UDP for MOGPR
- aggregate for the time series

as shown in the usage example here: https://marketplace-portal.dataspace.copernicus.eu/catalogue/app-details/17
Is there a reason why we are not requesting only a spatial and temporal extent and returning the gap-filled time series from this service?
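For illustration, a sketch of what the renamed UDP parameter entry could look like in the process definition JSON. Only the name `data` comes from this thread; the description text is hypothetical, and the schema follows the usual openEO raster-cube parameter format.

```python
# Hypothetical UDP parameter entry after renaming "input_raster_cube" to
# "data"; the raster-cube subtype schema follows the openEO process
# definition conventions.
parameter = {
    "name": "data",
    "description": "Input raster data cube",  # illustrative wording
    "schema": {"type": "object", "subtype": "raster-cube"},
}

# Inside the process graph the parameter is then referenced like this:
apply_neighborhood_arguments = {"data": {"from_parameter": "data"}}
```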
Discussed the reason in person; the upgraded version of this "MOGPR" process is "MOGPR_s1_s2". So I will update this process in the same PR, addressing all the suggested comments, since the content for both will be almost the same.
from openeo.processes import ProcessBuilder, apply_neighborhood
from openeo.rest.udp import build_process_dict

from fusets.openeo import load_mogpr_udf
I do not believe that fusets is part of this environment. @soxofaan, what would be the preferred way of working here?
This is just a dependency for generating the UDP, right? In that case I would at least add a requirements.txt to this folder as an initial solution.
In addition to generating the UDP, it is also needed when publishing and running the UDP.
Updated the requirements.txt.
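A minimal sketch of what that `requirements.txt` might contain, assuming only the two dependencies discussed in this PR (unpinned here; real pins would depend on the tested versions):

```
fusets
openeo
```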
openeo_udp/fusets_mogpr/generate.py
Outdated
from openeo.rest.udp import build_process_dict

from fusets.openeo import load_mogpr_udf
from fusets.openeo.services.publish_mogpr import NEIGHBORHOOD_SIZE
does it come with fixed input sizes?
Yes, it seems the size is already defined within FuseTS.
This is also different from the standard way of working; could you take a look at the code to investigate what those standards are?
NEIGHBORHOOD_SIZE is 32 px.
Instead of importing the value, I typed 32 directly in the apply_neighborhood call in generate.py.
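A sketch of how the hard-coded 32 px neighbourhood might appear in `generate.py`. Only the 32 px size comes from this thread; the dimension entries follow the openEO `apply_neighborhood` size format, and the commented-out call is illustrative.

```python
# Neighbourhood size hard-coded to 32 px instead of importing
# NEIGHBORHOOD_SIZE from fusets.openeo.services.publish_mogpr.
size = [
    {"dimension": "x", "value": 32, "unit": "px"},
    {"dimension": "y", "value": 32, "unit": "px"},
]

# Illustrative (not executed here): applying the MOGPR UDF per neighbourhood.
# result = apply_neighborhood(
#     cube,
#     lambda data: data.run_udf(udf=mogpr_udf, runtime="Python"),
#     size=size,
#     overlap=[],
# )
```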
sys.path.insert(0, directory)

@functools.lru_cache(maxsize=5)
def setup_dependencies(dependencies_url, DEPENDENCIES_DIR):
@soxofaan should we include file_locking to avoid concurrency issues?
You certainly risk concurrency problems here. However, it's not trivial, because you need a locking mechanism that works across multiple executors. I guess solving this properly is a bit out of scope for this PR.
I'm not sure I understand what you are asking. Is this about raising the issue of the lack of documentation? Or just whether it is OK to use a single data cube parameter instead of extent parameters?

In the docs I just see "typically a single node", so it does not say "it should". At first sight, I think it's OK to have more than one node in the benchmark process graph.
A couple of notes:
},
"temporal_extent": [
"2022-05-01",
"2023-07-31"
Is it intentional to have such a large (15 months if I see correctly) temporal extent?
No, that was not intentional; thanks for pointing it out, as it is one of the causes of the high credit usage 😅
Updated the scenario to use only a month.
openeo_udp/fusets_mogpr/set_path.py
Outdated
"""
with zipfile.ZipFile(zip_path, "r") as zip_ref:
    zip_ref.extractall(extract_to)
os.remove(zip_path)  # Clean up the zip file after extraction
I've seen this pattern elsewhere, but I think it's bad style to hardcode removal of a zip file from a function that extracts it (separation of concerns). Removal should be handled from the context/function where it was downloaded
Relocated the os.remove(zip_path) call into the setup_dependencies function.
Adds a directory to the Python sys.path if it's not already present.
"""
if directory not in sys.path:
    sys.path.insert(0, directory)
I think it's kind of a security issue if prepending a path to sys.path is the only/default way. Appending should be the default, with a prepend-mode just for special cases.
I went with inserting the path instead of appending, following the solution adopted in PR #78, because there was a conflict between modules already present and the extracted ones, and with append it couldn't find fusets, as seen in "j-2501141413334eaab23c071c8db34078".
What do you mean by prepending it?
Prepend = insert at index 0.
Prepending to sys.path is usually an antipattern (because of security and stability risks), and we should not implicitly promote that by copying the pattern all over the place.

> because there was a conflict in modules already existing to that of extracted

That's what I mean: trying to fix this problem by prepending is like shooting a mosquito with a bazooka: you risk breaking a lot more than you intend to.

> it couldn't find the fusets as done with append as seen in "j-2501141413334eaab23c071c8db34078"

In these error logs I see ModuleNotFoundError: No module named 'fusets'. I don't really get how prepending instead of appending would fix that.
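The append-by-default approach suggested here can be sketched as follows (`add_to_sys_path` is a hypothetical helper illustrating the suggestion, not code from this PR):

```python
import sys


def add_to_sys_path(directory: str, prepend: bool = False) -> None:
    """Add a directory to sys.path, appending by default.

    Appending lets already-installed modules keep precedence; prepending
    (insert at index 0) is an explicit opt-in for the rare case where the
    extracted dependencies must shadow existing ones.
    """
    if directory in sys.path:
        return
    if prepend:
        sys.path.insert(0, directory)
    else:
        sys.path.append(directory)
```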
@Pratichhya I can assist you in digging deeper into this issue; perhaps there is another file added to the system path with the same name?
This could be because the dependency includes not only fusets but also many other packages pinned to specific versions. Not sure if they were affected.
openeo_udp/fusets_mogpr/set_path.py
Outdated
DEPENDENCIES_DIR2 = 'venv_static'

DEPENDENCIES_URL1 = "https://artifactory.vgt.vito.be:443/artifactory/auxdata-public/ai4food/fusets_venv.zip"
DEPENDENCIES_URL2 = "https://artifactory.vgt.vito.be:443/artifactory/auxdata-public/ai4food/fusets.zip"
I'm not sure it is very valuable to define these as constants here. Each value is only used once, so you could just use the values directly in the setup_dependencies() calls.
Updated to use the values directly in setup_dependencies.
openeo_udp/fusets_mogpr/set_path.py
Outdated
"""
import os

return Path(os.path.realpath(__file__)).read_text()
Just curious: why do you need the os.path.realpath here (and the embedded import os)? Wouldn't just Path(__file__).read_text() work fine?
Also, it seems a bit overkill at the moment to define this load_set_path() function here. When you call it from generate.py, you can just do Path("set_path.py").read_text() directly, which does the same in less code (overall) and requires less clicking around to understand what's happening.
> Just curious: why do you need the os.path.realpath here (and the embedded import os)? Wouldn't just Path(__file__).read_text() work fine?

I simply fetched the solution from fusets's load_mogpr_udf itself to make it work 😬

> Also, it seems a bit overkill at the moment to define this load_set_path() function here. When you call it from generate.py, you can just do Path("set_path.py").read_text() directly, which does the same in less code (overall) and requires less clicking around to understand what's happening.

This is indeed a simpler and nicer solution and works exactly the same; thank you so much. I updated generate.py and set_path.py with the suggested solution.
It could be read that way 👀 but no, not exactly. It is a question of whether there is a standard naming convention, as with spatial_extent and temporal_extent.
Yes, this one especially, because I saw no example of doing it this way.
As mentioned elsewhere in this issue, I'd recommend in most cases to use
I think it basically depends on the usage/applicability of the "algorithm": if the algorithm is tied closely to a certain data set (e.g. biopar stuff), then it makes sense to include the
The service is also available at https://marketplace-portal.dataspace.copernicus.eu/catalogue/app-details/17
The modification in the openeo_udp of this repo in comparison to the existing one is: