Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

array_concat: difficulty representing heterogenous arrays in numpy #407

Closed
LukeWeidenwalker opened this issue Feb 20, 2023 · 10 comments
Closed

Comments

@LukeWeidenwalker
Copy link
Contributor

Process ID: array_concat

Describe the issue:
The array_concat process currently allows concatenating arrays with arbitrary datatypes. For the Numpy/Dask implementation, this is weird, because ndarrays are expected to be homogenouous (or incur heavy performance degradation if using the object type). This means that effectively there's always implicit type casting going on under the hood, which might not be what the user expects. E.g. a string array and an int64 array, when concatenated together will be cast to a Unicode type, rather than staying heterogenuous as required by the spec.

Proposed solution:
Disallow heterogenuous arrays from being created at all with array_concat and throw an error if this is attempted. Attempt casting between numerical types where it's safe and throw an error if it's not.

Additional context:
Illustration in numpy:

>>> import numpy as np
>>> array1 = np.array(["a", "b"]) # dtype: `<U1` (Unicode with 1 char)
>>> array2 = np.array([1, 2]) # dtype: `int64`

>>> result = np.concatenate([array1, array2])
array(['a', 'b', '1', '2'], dtype='<U21')

cc @ValentinaHutter

@edzer
Copy link
Member

edzer commented Feb 20, 2023

I think the openEO data cube assumes cell values are always numbers, while leaving implicit how they are actually implemented (uint8, int64 etc or floats). It may make sense at some stage to formalise this: certain operations should result in floats rather than ints in order to not loose too much precision (sin, log and such, but also divisions).

Another issue is of course how to deal with categorical variables, and in particular the category labels. The ["a", "b"] array should be an array with values (assuming 0-based) [0, 1] and be accompanied by category labels ["a", "b"]. Concatenating such an array with an array with no or different category labels should result in an error or be done with special care.

@m-mohr
Copy link
Member

m-mohr commented Feb 20, 2023

The process is in principle fine as it is, but may need a bit of a clarification, I think. For this it would be great to hear more implementors.

For numpy you could implement different levels based on the data types, right?

  1. data types are the same: great, proceed
  2. data types are not the same but can be converted easily/without loss (e.g. uint8 to int16): convert to int16
  3. data types are not the same but can not be converted easily/without loss: Throw a warning and use the object type or throw an error and abort.

I think the openEO data cube assumes cell values are always numbers

No, openEO data cubes can contain any (scalar) data type. How that's translated internally is up to the implementation, but the user can expect to not worry about this.

@LukeWeidenwalker
Copy link
Contributor Author

For numpy you could implement different levels based on the data types, right?

  1. data types are the same: great, proceed
  2. data types are the not the same but can be converted easily/without loss (e.g. uint8 to int16): convert to int16
  3. data types are not the same but can not be converted easily/without loss: Throw a warning and use the object type or throw an error and abort.

Thanks for the input - yes, that sounds reasonable! Agree that this probably wants to be clarified in the process description!

@jdries
Copy link
Contributor

jdries commented Feb 27, 2023

The geotrellis backend also implements automatic conversion to wider data types when needed.

@m-mohr
Copy link
Member

m-mohr commented Feb 27, 2023

@LukeWeidenwalker What exactly should we clarify here? If it's just the data type handling, it doesn't belong into the process and I could just add it to the implementation guidelines.

@LukeWeidenwalker
Copy link
Contributor Author

I was looking at https://processes.openeo.org/#array_concat:

  • Remove the example that concats an integer and a string array
  • In the description mention that numeric types will be cast if safe and an error will be thrown if that's not possible

@m-mohr
Copy link
Member

m-mohr commented Feb 27, 2023

  • Remove the example that concats an integer and a string array

It is allowed in openEO, you can remove it of course from your implementation if not supported.

  • In the description mention that numeric types will be cast if safe and an error will be thrown if that's not possible

openEO itself doesn't have different numeric types so it doesn't make sense in the process description. Throwing an error depends on the implementation.
But I'll add something to the implementation guide.

@LukeWeidenwalker
Copy link
Contributor Author

LukeWeidenwalker commented Feb 27, 2023

It is allowed in openEO, you can remove it of course from your implementation if not supported.

Oh, so this would be an appropriate thing to do? I was under the impression that if we expose a process, we'd need to commit to implement all facets of it?
If that's not the case, we'll just start doing that wherever we run into limitations with the underlying technologies we're using, and log appropriate errors to inform users.

@m-mohr
Copy link
Member

m-mohr commented Feb 28, 2023

openEO itself is indeed flexible and we have these schema-based processes so that you can customize them if needed. This works best if the change is in the schematic part and not in the descriptions so that clients can read and handle them.

In this case, it's just an example, which means it's less "prominent" to users, but it doesn't necessarily imply to a client/user that mixed types are (not) allowed. You can only mention in the description that mixed types are not allowed.

The other topic is of course what openEO Platform requires in the federation contract and here it might be more strict, especially if at some point a test suite comes into play.

Still, I'll make a small change to the process examples so that we don't just have a mixed type example.

@m-mohr
Copy link
Member

m-mohr commented Feb 28, 2023

@LukeWeidenwalker Here's an updated version: db242a8
I'd propose you try to convert on a best-effort basis and otherwise throw an error.
But it's still good to bring up implementation issues whenever you run into issues as we may solve some of them!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants