Process to load a vector cube #322
Have you had a look at #319 (WIP)?
I'm confused. It has been present for years in JS, Web Editor and R, and it is also present in the GEE back-end. EODC also claims that it's available.
Indeed, sharing was never implemented or defined due to lack of time (for implementation). I'd be happy to work on a specification, but I don't see who would implement that anytime soon, as other long-existing features, such as the mere existence of a user file storage, are not even present yet.
That's something that is also missing from raster cubes. For both vector and raster you could likely re-use load_result with a URL, although it requires STAC, which is not really embracing vector data yet.
Yes, but actually only because we had no definition for vector cubes yet and the behavior was undefined. The intention there was to re-add it once we have that defined, which we are getting closer to.
My bad indeed, too many VITO-oriented assumptions apparently; I should have explored this more.
Yes, but I think it's significantly more important for vector data. Handling raster data is usually a big data problem, which is not ideal to solve with user-facing URLs. (Input) vector data will typically be small data (e.g. just one relatively small file), and handling it through URLs is straightforward, both on the client and the back-end side.
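To illustrate why URL-based loading of small vector files is considered simple, here is a minimal sketch in Python using only the standard library. It assumes the URL serves GeoJSON; the function names `parse_feature_collection` and `load_vector_from_url` are hypothetical and not part of any openEO client or back-end.

```python
import json
import urllib.request


def parse_feature_collection(text):
    """Parse GeoJSON text and return its features as a list."""
    doc = json.loads(text)
    if not isinstance(doc, dict):
        raise ValueError("expected a GeoJSON object")
    if doc.get("type") == "Feature":
        # Wrap a single feature so callers always get a list.
        return [doc]
    if doc.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON Feature or FeatureCollection")
    return doc.get("features", [])


def load_vector_from_url(url, timeout=30):
    """Download a (small) GeoJSON file and return its features.

    Assumption: the whole file fits comfortably in memory.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_feature_collection(response.read().decode("utf-8"))
```

A real back-end would additionally need to handle other vector formats, size limits and access control, which this sketch leaves out.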
I apparently missed that when searching for related GitHub tickets. Some quick notes:
In short, I think the (technically) simplest and most versatile way to have vector cube loading functionality in the geopyspark driver and aggregator is loading from URL. In VITO and related backends we already support that in the Several options:
Why? Backends could surely provide some larger commonly used datasets through that, e.g. https://developers.google.com/earth-engine/datasets/tags/boundaries
In principle, you could load any URL, but this highly depends on the implementation.
I think they are distinct enough, and while you can always implement load_external, load_uploaded_files only works if a workspace is present. So keeping them separate is better to avoid two versions of load_external in case no user workspace is present.
(FYI: @jdries pointed me to Open-EO/openeo-api#135 as well, which is closely related to this discussion)
at the moment I'm focused on loading vector data/cubes, so additional Moreover, the reason to have filter options inside |
Hmm, I don't really understand how Open-EO/openeo-api#135 is closely related? It's just another way of providing a user workspace for files, right? Anyway, I'll post a PR for load_external later. Seems pretty straightforward.
I would really prefer to have the same process for loading files both externally or from a workspace. This would again lead to the problem where process graphs would need to adapt depending on where the file is coming from.
It's not a good idea that we design a process that supports two options, but one may not be supported. So back-ends need to change the process definition based on the API capabilities, which is prone to issues. As the UDP is coming to the table again and again, we should probably focus on getting a solution for calling processes by parameter: Open-EO/openeo-api#413 / #307
I don't understand the question. If you don't implement/support uploading files, you don't implement load_uploaded_files and as such don't need a path?!
The workaround where we call processes by parameter is still more complex than simply having one process. openEO is about solving complex things in the backend, so that users don't have to deal with it. With the parameterized process proposal, both the UDP implementor and user are facing more complexity. My question was actually trying to reply to:
I don't get the problem with loading files when no user workspace is present?
But if this is now coming up on a nearly weekly basis, then it seems that we should look into a solution. We have the issue here for the reducers, it's not a generalization. It's an actual use case (outside of openEO Platform). How shall I solve it? It's simply not possible unless I do a lot of if/else, but that would also be possible for loading data, of course. The issue is that you need to adapt the process specification and remove one of the schemas if user uploads are not supported. That's all I meant. So it's not ideal, seeing that many providers just copy and paste the processes without adapting them and as such expose something that is not actually there. But I could say that's an issue for them, indeed.
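The adaptation step mentioned here (removing one of the schemas when user uploads are not supported) could look roughly like the following Python sketch. The process definition is a hypothetical, heavily shortened example; the process id `load_files`, the parameter name `paths` and the subtype values used to tell the two schemas apart are assumptions for illustration, not the agreed specification.

```python
import copy

# Hypothetical, shortened process definition with two alternative schemas.
EXAMPLE_PROCESS = {
    "id": "load_files",
    "parameters": [
        {
            "name": "paths",
            "schema": [
                {"type": "string", "subtype": "uri"},
                {"type": "string", "subtype": "file-path"},
            ],
        }
    ],
}


def adapt_process(process, unsupported_subtypes):
    """Return a copy of `process` without the unsupported schema alternatives."""
    adapted = copy.deepcopy(process)
    for param in adapted.get("parameters", []):
        schema = param.get("schema")
        if not isinstance(schema, list):
            # A single schema object has no alternatives to remove.
            continue
        kept = [s for s in schema if s.get("subtype") not in unsupported_subtypes]
        if not kept:
            raise ValueError(f"no supported schema left for parameter {param['name']!r}")
        param["schema"] = kept
    return adapted


# A back-end without a user workspace would drop the file-path alternative.
adapted = adapt_process(EXAMPLE_PROCESS, {"file-path"})
```

The original definition is left untouched, so the same source file can be shared between back-ends with and without a user workspace.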
No problem to work on parameterization of process names, it is a useful thing to have for certain cases. It's indeed not ideal if backends claim support for something because of copy-pasting, but it's already very good that a correct solution is in fact available through the schemas. If we then ever have a very advanced aggregator that automatically adapts strategy based on where a backend can load data from, then that would work. But this is all fairly futuristic; for now, users will indeed simply get an error if an unsupported feature is used, and then they can contact the backend.
We already do that with the As a user I appreciate it when software or a library takes care of all the annoying file format details and storage details, and I just can use the same |
I agree with @soxofaan and I would go with load_external, where it's possible to load from a URL or (if available) from the user workspace. Allowing loading from an external URL would also make it possible to use vector processes on one back-end and re-use the result on another one that does not support them yet.
Means eventually combining all load_* processes into a single one?
Okay, but user workspace is not external. So I'd propose renaming load_uploaded_files to load_file(s?) and add URL support.
That's already captured by load_results?!
Yes, maybe load_file(s) fits both scenarios better.
From my point of view, no. Currently load_result doesn't support vector_cubes and it has been updated to support the same parameters as load_collection (spatial_extent, temporal_extent and so on). So it has become more raster-specific than general-purpose. Anyway, load_result across back-ends would work only within the federation (openEO Platform) and not with other back-ends (EURAC, GEE).
No process really supports vector-cubes yet, we are just adding it in right now. See #319.
spatial/temporal can be used with vector, only bands is raster specific.
That's maybe a restriction of openEO Platform, but the process doesn't specify such a restriction. It allows retrieving results by URL. So if you have published your result, then you could in principle load it from everywhere, even from GEE.
Ultimately that could be an option, but I don't think we should aim for that at this point.
👍
But a complicated one: How do you decide between collections, files, results? They are all just strings and you eventually will run into conflicts.
Basically, the load_* processes are now structured as such:
The load_* processes are named in a way that they reflect the "source". "Data" is such an overloaded and general term that it could basically include all load_* processes.
not necessarily if you use scheme prefix e.g.
FYI: openeo-processes/proposals/load_uploaded_files.json Lines 12 to 16 in d0ce91f
Going for a single file is fine for loading vector cubes I guess, but for loading raster data, it's probably best to have an option to specify multiple files.
Good point regarding the array of paths. I guess then we should allow arrays of URLs too. So, replace all "single" with "multiple". So then the difference is that load_result points to STAC catalogs and load_files points to the data files (in STAC terminology: assets) directly.
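As a rough illustration of the "multiple instead of single" idea, the following Python sketch builds a process graph node that always carries an array of sources, whether the caller passes one path, one URL or several. The process id `load_files` and the argument names `paths` and `format` are assumptions taken from this discussion, not a finalized process definition.

```python
def load_files_node(sources, fmt=None):
    """Build a hypothetical `load_files` process graph node.

    `sources` may be a single path/URL string or an iterable of them;
    the node always stores them as a list.
    """
    if isinstance(sources, str):
        sources = [sources]
    sources = list(sources)
    if not sources:
        raise ValueError("at least one path or URL is required")
    arguments = {"paths": sources}
    if fmt is not None:
        # Assumed optional argument naming the file format.
        arguments["format"] = fmt
    return {"process_id": "load_files", "arguments": arguments, "result": True}


# One remote vector file and a set of workspace raster files use the same node shape.
vector_node = load_files_node("https://example.com/fields.geojson", fmt="GeoJSON")
raster_node = load_files_node(["tiles/a.tif", "tiles/b.tif"])
```

Normalizing to an array on the client side keeps the process graph identical for the single-file vector case and the multi-file raster case.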
"data" is indeed very generic. My concern is mainly that "file" has a very "static" sound to it, while URLs can be more dynamic, e.g. it can be an on-the-fly query. But that's probably just a matter of POV, the back-end usually does not have to be aware of that and can consider it to be a "file".
Update to listing of #322 (comment):
While there are various discussions about how to conceptually define and handle vector cubes,
I don't think we already have a standardized solution to load the vector data in the first place (except for inline GeoJSON).
I'll first try to list a couple of vector loading scenarios (with varying degrees of practicality and usefulness) and an initial discussion of possible solutions (if any)
load_uploaded_files (proposal), but it currently only supports returning raster cubes (it originally supported vector cubes, but that was removed in Vector data cubes + processes #68)
read_vector: user has the ability to upload/download/construct files in their Terrascope workspace
load_result exists, but is raster cube output only at the moment, and parameter wise it is also very raster-cube-oriented.
load_collection originally supported vector cubes, but that was removed in Vector data cubes + processes #68