array_concat: difficulty representing heterogenous arrays in numpy #407
Comments
I think the openEO data cube assumes cell values are always numbers, while leaving implicit how they are actually implemented ( Another issue is of course how to deal with categorical variables, and in particular the category labels. The |
The process is in principle fine as it is, but may need a bit of a clarification, I think. For this it would be great to hear more implementors. For numpy you could implement different levels based on the data types, right?
No, openEO data cubes can contain any (scalar) data type. How that's translated internally is up to the implementation, but the user can expect to not worry about this.
Thanks for the input - yes, that sounds reasonable! Agree that this probably wants to be clarified in the process description!
The geotrellis backend also implements automatic conversion to wider data types when needed.
@LukeWeidenwalker What exactly should we clarify here? If it's just the data type handling, it doesn't belong in the process and I could just add it to the implementation guidelines.
I was looking at https://processes.openeo.org/#array_concat:
It is allowed in openEO, you can remove it of course from your implementation if not supported.
openEO itself doesn't have different numeric types so it doesn't make sense in the process description. Throwing an error depends on the implementation.
Oh, so this would be an appropriate thing to do? I was under the impression that if we expose a process, we'd need to commit to implementing all facets of it?
openEO itself is indeed flexible and we have these schema-based processes so that you can customize them if needed. This works best if the change is in the schematic part and not in the descriptions so that clients can read and handle them. In this case, it's just an example, which means it's less "prominent" to users, but it doesn't necessarily imply to a client/user that mixed types are (not) allowed. You can only mention in the description that mixed types are not allowed. The other topic is of course what openEO Platform requires in the federation contract and here it might be more strict, especially if at some point a test suite comes into play. Still, I'll make a small change to the process examples so that we don't just have a mixed type example.
@LukeWeidenwalker Here's an updated version: db242a8
Process ID: array_concat
Describe the issue:
The array_concat process currently allows concatenating arrays with arbitrary datatypes. For the NumPy/Dask implementation, this is weird, because ndarrays are expected to be homogeneous (or incur heavy performance degradation if using the object type). This means that effectively there's always implicit type casting going on under the hood, which might not be what the user expects. E.g. a string array and an int64 array, when concatenated together, will be cast to a Unicode type, rather than staying heterogeneous as required by the spec.
Proposed solution:
Disallow heterogeneous arrays from being created at all with array_concat and throw an error if this is attempted. Attempt casting between numerical types where it's safe and throw an error if it's not.
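A minimal sketch of what such a check could look like in a NumPy-based backend. The function name `strict_array_concat` and its exact rules are assumptions made for illustration, not part of openEO or of any existing implementation; it relies only on `np.result_type`, `np.can_cast` and `np.concatenate`.

```python
import numpy as np

NUMERIC_KINDS = "biuf"  # bool, signed int, unsigned int, float


def strict_array_concat(array1, array2):
    """Hypothetical helper: concatenate two arrays, refusing mixed kinds."""
    a = np.asarray(array1)
    b = np.asarray(array2)

    # Object arrays can hide arbitrary mixed content, so reject them outright.
    if a.dtype.kind == "O" or b.dtype.kind == "O":
        raise TypeError("object (heterogeneous) arrays are not supported")

    # Same kind (e.g. two string arrays, two int arrays): NumPy only widens
    # within that kind, so plain concatenation is fine.
    if a.dtype.kind == b.dtype.kind:
        return np.concatenate([a, b])

    # Different kinds are only accepted when both sides are numeric.
    if a.dtype.kind not in NUMERIC_KINDS or b.dtype.kind not in NUMERIC_KINDS:
        raise TypeError(f"cannot concatenate {a.dtype} with {b.dtype}")

    # Pick the common dtype and make sure both casts count as safe.
    common = np.result_type(a.dtype, b.dtype)
    for arr in (a, b):
        if not np.can_cast(arr.dtype, common, casting="safe"):
            raise TypeError(f"unsafe cast from {arr.dtype} to {common}")

    return np.concatenate([a.astype(common), b.astype(common)])
```

With this sketch, `strict_array_concat([1, 2], [3.5])` returns a float64 array, while `strict_array_concat(["a"], [1])` raises a `TypeError` instead of silently producing strings. Note that NumPy treats int64 to float64 as a safe cast even though very large integers lose precision, so a backend may want stricter rules than `casting="safe"`.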
Additional context:
Illustration in numpy:
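A short sketch of the behaviour described above, assuming NumPy's default promotion rules. It is not necessarily the snippet originally posted, and the variable names are made up for illustration.

```python
import numpy as np

ints = np.array([1, 2, 3], dtype=np.int64)
floats = np.array([0.5, 1.5], dtype=np.float64)
strings = np.array(["a", "b"])

# Numeric inputs are silently promoted to a common, wider dtype.
numeric = np.concatenate([ints, floats])
print(numeric.dtype)

# Mixing strings and integers casts the integers to a Unicode string dtype,
# so the result is no longer heterogeneous.
mixed = np.concatenate([strings, ints])
print(mixed.dtype, mixed.tolist())

# Only the object dtype keeps the original Python values, at the cost of
# storing one Python object per element.
hetero = np.concatenate([strings.astype(object), ints.astype(object)])
print(hetero.dtype, [type(v).__name__ for v in hetero])
```

The second case is the one the issue is about: the integers come back as strings without any warning.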
cc @ValentinaHutter