-
Notifications
You must be signed in to change notification settings - Fork 52
Description
unique was added very early on in gh-25. The review discussions have focused on
- whether return values should be sorted or not (see Specifying a sort order when returning unique values #40), and
- (b) the issue with value-dependent output shape (see Boolean array indexing is not compatible with static memory allocation #84).
There is another issue that was not discussed: unique is the only function whose return type is polymorphic. It can return an array, or a length-2, 3 or 4 tuple of arrays:
unique(x, /, *, return_counts=False, return_index=False, return_inverse=False)This issue is painful for, e.g., libraries that have to support it in their JIT compiler. For svd, which previously was polymorphic too, we decided that it was better to introduce svdvals to get rid of the polymorphism. There's even a design principle that says as much in https://data-apis.org/array-api/latest/extensions/linear_algebra_functions.html: "In general, interfaces should avoid polymorphic return values (e.g., returning an array or a namedtuple, dependent on, e.g., an optional keyword argument)."
Given that there are 3 boolean keywords, it's not feasible to split unique into separate functions (there'd be 8 of them). An alternative may be:
- return a custom object with four named fields: (
unique,indices,inverse,counts). - no tuple unpacking
- if data is returned, its type is array
- for fields where the boolean keyword was
False, accessing the field is undefined behavior
Another question that may be relevant is: how often are the boolean keywords each used? The data at https://raw.githubusercontent.com/data-apis/python-record-api/master/data/typing/numpy.py may help there:
@overload
def unique(ar: numpy.ndarray, return_inverse: bool):
"""
usage.scipy: 14
usage.skimage: 5
usage.sklearn: 125
usage.statsmodels: 18
usage.xarray: 3
"""
...
@overload
def unique(ar: numpy.ndarray):
"""
usage.dask: 7
usage.matplotlib: 6
usage.orange3: 16
usage.scipy: 39
usage.seaborn: 5
usage.skimage: 48
usage.sklearn: 385
usage.statsmodels: 45
usage.xarray: 10
"""
...
@overload
def unique(ar: numpy.ndarray, return_inverse: bool, return_counts: bool):
"""
usage.skimage: 3
usage.sklearn: 2
usage.statsmodels: 1
"""
...
@overload
def unique(ar: numpy.ndarray, return_counts: bool):
"""
usage.orange3: 3
usage.scipy: 5
usage.skimage: 8
usage.sklearn: 1
usage.statsmodels: 1
"""
...
@overload
def unique(ar: numpy.ndarray, return_index: bool):
"""
usage.scipy: 5
usage.skimage: 4
usage.sklearn: 1
usage.statsmodels: 1
"""
...
@overload
def unique(ar: List[int], return_counts: bool):
"""
usage.orange3: 1
"""
...
@overload
def unique(ar: xarray.core.dataarray.DataArray):
"""
usage.xarray: 2
"""
...
@overload
def unique(ar: List[numpy.int32]):
"""
usage.statsmodels: 1
"""
...
@overload
def unique(ar: List[Literal["b", "a"]]):
"""
usage.sklearn: 2
usage.statsmodels: 1
"""
...
@overload
def unique(ar: numpy.ndarray, return_index: bool, return_inverse: bool):
"""
usage.matplotlib: 1
usage.sklearn: 6
usage.statsmodels: 3
"""
...
@overload
def unique(ar: List[Literal["4"]], return_inverse: bool):
"""
usage.statsmodels: 1
"""
...
@overload
def unique(
ar: Union[
pandas.core.series.Series,
numpy.ndarray,
pandas.core.arrays.categorical.Categorical,
List[str],
],
return_inverse: bool = ...,
return_index: bool = ...,
):
"""
usage.pandas: 22
"""
...
@overload
def unique(ar: List[float]):
"""
usage.scipy: 2
usage.sklearn: 3
"""
...
@overload
def unique(ar: numpy.ndarray, axis: int):
"""
usage.scipy: 1
"""
...
@overload
def unique(ar: pandas.core.series.Series):
"""
usage.dask: 3
usage.geopandas: 1
usage.prophet: 4
usage.seaborn: 4
usage.sklearn: 1
"""
...
@overload
def unique(ar: numpy.ma.core.MaskedArray):
"""
usage.matplotlib: 1
"""
...
@overload
def unique(ar: List[int]):
"""
usage.seaborn: 2
usage.sklearn: 26
"""
...
@overload
def unique(ar: List[numpy.int64]):
"""
usage.dask: 4
"""
...
@overload
def unique(ar: List[numpy.float64]):
"""
usage.dask: 5
usage.sklearn: 2
"""
...
@overload
def unique(ar: List[numpy.complex128]):
"""
usage.dask: 3
"""
...
@overload
def unique(ar: List[numpy.bool_]):
"""
usage.dask: 3
"""
...
@overload
def unique(ar: List[numpy.float32]):
"""
usage.dask: 2
"""
...
@overload
def unique(
ar: numpy.ndarray, return_index: bool, return_inverse: bool, return_counts: bool
):
"""
usage.dask: 2
"""
...
@overload
def unique(ar: numpy.memmap):
"""
usage.sklearn: 2
"""
...
@overload
def unique(ar: List[Literal["copyright", "beer", "pizza", "the"]]):
"""
usage.sklearn: 2
"""
...
@overload
def unique(ar: List[Literal["copyright", "beer", "burger", "pizza", "the"]]):
"""
usage.sklearn: 1
"""
...
@overload
def unique(ar: List[Literal["copyright", "beer", "burger", "the"]]):
"""
usage.sklearn: 1
"""
...
@overload
def unique(ar: List[Literal["copyright", "coke", "burger", "the"]]):
"""
usage.sklearn: 1
"""
...
@overload
def unique(ar: List[Literal["burger", "coke", "the"]]):
"""
usage.sklearn: 1
"""
...
@overload
def unique(ar: List[Literal["copyright", "celeri", "salad", "the"]]):
"""
usage.sklearn: 1
"""
...
@overload
def unique(ar: List[Literal["copyright", "water", "sparkling", "salad", "the"]]):
"""
usage.sklearn: 1
"""
...
@overload
def unique(ar: List[Literal["copyright", "celeri", "the"]]):
"""
usage.sklearn: 1
"""
...
@overload
def unique(ar: List[Literal["water", "salad", "tomato", "the"]]):
"""
usage.sklearn: 1
"""
...
@overload
def unique(ar: List[Literal["copyright", "water", "salad", "tomato", "the"]]):
"""
usage.sklearn: 1
"""
...
@overload
def unique(ar: List[Literal["three", "two", "one"]]):
"""
usage.sklearn: 3
"""
...
@overload
def unique(ar: List[Union[float, int]], return_inverse: bool):
"""
usage.sklearn: 1
"""
...
@overload
def unique(ar: List[Literal["b", "c", "a"]]):
"""
usage.sklearn: 1
"""
...
@overload
def unique(ar: List[Literal["3", "2", "1"]]):
"""
usage.sklearn: 2
"""
...
def unique(
ar: object,
return_index: bool = ...,
return_inverse: bool = ...,
return_counts: bool = ...,
):
"""
usage.dask: 29
usage.geopandas: 1
usage.matplotlib: 8
usage.orange3: 20
usage.pandas: 22
usage.prophet: 4
usage.scipy: 66
usage.seaborn: 11
usage.skimage: 68
usage.sklearn: 574
usage.statsmodels: 72
usage.xarray: 15Conclusion: a large majority of usage is unique(x), without any keywords. return_inverse is fairly often used, return_index almost never (checked by searching Pandas, sklearn et al.).
So another alternative count be to include two separate functions:
unique(x, /)- returns array of unique valuesunique_all(x, /)- returns tuple of 4 arrays (may need a better name)