
Add an option to not fail on value-based casting errors #113

Closed
tomwhite opened this issue Apr 14, 2022 · 10 comments
@tomwhite
Contributor

The new (and very useful!) document on Array API Standard Compatibility describes a category of differences from regular NumPy, called strictness, for "[things that] aren't actually required by the spec, and other conforming libraries may not follow them".

As implementations start implementing the spec, it would be useful to be able to run the tests with a flag saying don't fail on strictness errors.

@honno
Member

honno commented Apr 14, 2022

Well, ideally we don't test for strictness in the first place! But likely we haven't got this right 100% of the time—any examples off the top of your head? (Excluding signature tests, which are being reworked—strictness noted in #110 (comment).)

@tomwhite
Contributor Author

tomwhite commented Apr 14, 2022

Thanks @honno, that's good to know! I have a few errors for Dask that I thought were because of strictness, but I may be wrong. I will have another look at them and report back.

@tomwhite
Contributor Author

Most of the errors seem to be due to value-based casting differences. I've pasted a failing Array API test below, but here's a summary:

import numpy as np
import numpy.array_api as nxp
assert (np.array([1], dtype=np.uint8) + 256).dtype == np.uint16 # 256 cannot fit in uint8, so promoted to uint16
assert (nxp.asarray([1], dtype=nxp.uint8) + 256).dtype == nxp.uint8  # numpy.array_api keeps uint8 regardless of the scalar's value

I wonder if it would be possible to have a flag to ignore failures due to value-based casting differences? They can occur in many of the tests in test_operators_and_elementwise_functions.py, and I'd rather not have a blanket exclude on them all as they are testing lots of other cases too.

This would help when running the tests against the main NumPy namespace to check compliance there too. Also, there's work in numpy/numpy#21103 to change the behaviour of NumPy, but it's in an early stage.

ARRAY_API_TESTS_MODULE=dask.array pytest -v -rxXfE --max-examples=2 --disable-data-dependent-shapes --ci 'array_api_tests/test_operators_and_elementwise_functions.py::test_bitwise_and[bitwise_and(x1, x2)]'
========================================================================= test session starts =========================================================================
platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /Users/tom/opt/miniconda3/envs/dask-array-api-tests/bin/python3.8
cachedir: .pytest_cache
hypothesis profile 'xp_override' -> deadline=None, max_examples=2, database=DirectoryBasedExampleDatabase('/Users/tom/projects-workspace/array-api-tests-dask/.hypothesis/examples')
rootdir: /Users/tom/projects-workspace/array-api-tests-dask
plugins: hypothesis-6.36.1
collected 1 item                                                                                                                                                      

array_api_tests/test_operators_and_elementwise_functions.py::test_bitwise_and[bitwise_and(x1, x2)] FAILED                                                       [100%]

============================================================================== FAILURES ===============================================================================
________________________________________________________________ test_bitwise_and[bitwise_and(x1, x2)] ________________________________________________________________

ctx = BinaryParamContext(<bitwise_and(x1, x2)>)

>   ???

array_api_tests/test_operators_and_elementwise_functions.py:634: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
array_api_tests/test_operators_and_elementwise_functions.py:649: in test_bitwise_and
    binary_param_assert_against_refimpl(ctx, left, right, res, "&", refimpl)
array_api_tests/test_operators_and_elementwise_functions.py:509: in binary_param_assert_against_refimpl
    binary_assert_against_refimpl(
array_api_tests/test_operators_and_elementwise_functions.py:214: in binary_assert_against_refimpl
    scalar_o = res_stype(res[o_idx])
../dask/dask/array/core.py:1801: in __int__
    return self._scalarfunc(int)
../dask/dask/array/core.py:1798: in _scalarfunc
    return cast_type(self.compute())
../dask/dask/base.py:292: in compute
    (result,) = compute(self, traverse=False, **kwargs)
../dask/dask/base.py:575: in compute
    results = schedule(dsk, keys, **kwargs)
../dask/dask/threaded.py:81: in get
    results = get_async(
../dask/dask/local.py:508: in get_async
    raise_exception(exc, tb)
../dask/dask/local.py:316: in reraise
    raise exc
../dask/dask/local.py:221: in execute_task
    result = _execute_task(task, data)
../dask/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
../dask/dask/core.py:119: in <genexpr>
    return func(*(_execute_task(a, cache) for a in args))
../dask/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
../dask/dask/optimization.py:990: in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
../dask/dask/core.py:149: in get
    result = _execute_task(task, cache)
../dask/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
../dask/dask/utils.py:39: in apply
    return func(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (array(257, dtype=uint32), array([1], dtype=uint8)), kwargs = {}, dtype = dtype('uint32'), function = <ufunc 'bitwise_and'>, result = array([1], dtype=int16)

    def _enforce_dtype(*args, **kwargs):
        """Calls a function and converts its result to the given dtype.
    
        The parameters have deliberately been given unwieldy names to avoid
        clashes with keyword arguments consumed by blockwise
    
        A dtype of `object` is treated as a special case and not enforced,
        because it is used as a dummy value in some places when the result will
        not be a block in an Array.
    
        Parameters
        ----------
        enforce_dtype : dtype
            Result dtype
        enforce_dtype_function : callable
            The wrapped function, which will be passed the remaining arguments
        """
        dtype = kwargs.pop("enforce_dtype")
        function = kwargs.pop("enforce_dtype_function")
    
        result = function(*args, **kwargs)
        if hasattr(result, "dtype") and dtype != result.dtype and dtype != object:
            if not np.can_cast(result, dtype, casting="same_kind"):
>               raise ValueError(
                    "Inferred dtype from function %r was %r "
                    "but got %r, which can't be cast using "
                    "casting='same_kind'"
                    % (funcname(function), str(dtype), str(result.dtype))
                )
E               ValueError: Inferred dtype from function 'bitwise_and' was 'uint32' but got 'int16', which can't be cast using casting='same_kind'

../dask/dask/array/core.py:4725: ValueError
----------------------------------------------------------------------------- Hypothesis ------------------------------------------------------------------------------
Falsifying example: test_bitwise_and(
    data=data(...), ctx=BinaryParamContext(<bitwise_and(x1, x2)>),
)
Draw 1 (x1): dask.array<reshape, shape=(), dtype=uint32, chunksize=(), chunktype=numpy.ndarray>
Draw 2 (x2): dask.array<full_like, shape=(1,), dtype=uint8, chunksize=(1,), chunktype=numpy.ndarray>
========================================================================== warnings summary ===========================================================================
../dask/dask/array/array_api.py:15
  /Users/tom/projects-workspace/dask/dask/array/array_api.py:15: UserWarning: The numpy.array_api submodule is still experimental. See NEP 47.
    from numpy import array_api as nxp

-- Docs: https://docs.pytest.org/en/stable/warnings.html
======================================================================= short test summary info =======================================================================
FAILED array_api_tests/test_operators_and_elementwise_functions.py::test_bitwise_and[bitwise_and(x1, x2)] - ValueError: Inferred dtype from function 'bitwise_and' w...
==================================================================== 1 failed, 1 warning in 2.42s =====================================================================

@honno
Member

honno commented Apr 19, 2022

Ok @tomwhite, I've puzzled on this for a bit, but think I understand now 😅 Please clarify anything I might of misunderstood!

import numpy as np
import numpy.array_api as nxp
assert (np.array([1], dtype=np.uint8) + 256).dtype == np.uint16 # 256 cannot fit in uint8, so promoted to uint16
assert (nxp.asarray([1], dtype=nxp.uint8) + 256).dtype == nxp.uint8

So in this scenario, we don't test the "logic" of nxp.asarray([1], dtype=nxp.uint8) + 256. We might end up generating this example, but we skip any assertions due to:

if res.dtype != xp.bool:
    assert m is not None and M is not None  # for mypy
    if expected <= m or expected >= M:
        continue

i.e. if the reference implementation (in this case operator.add) comes up with a result which is outside the bounds of the (promoted) dtype, don't actually test anything.

But I do think test_add is lacking, as we do end up testing the dtype and shape of the output array even if an operation not supported by the Array API has occurred, such as out-of-bounds addition. Is that what you mean?

In any case, this is a problem for testing Dask right now, and as you say, for testing NumPy-proper. I should explore reworking how elements are generated again—as you see, we generate them naively (for the most part) and throw away bad examples after the fact. IIRC, I couldn't find an easy way to avoid generating inputs which aren't quite valid for many of the Array API operations, but since then we've gone through a nice refactor that might lead to a more manageable solution.

(FYI testing NumPy-proper won't work in the test suite upstream, but you can use #112)


Hmm just to clarify, that failing bitwise_and() example is due to:

>>> from dask import array as da
>>> x1 = da.asarray(257, dtype="uint32")
>>> x2 = da.asarray([1], dtype="uint8")
>>> int(da.bitwise_and(x1, x2))
ValueError: ...

But if the right argument was a 0d array as opposed to a 1d array, this works as expected:

>>> x2 = da.asarray(1, dtype="uint8")
>>> out = da.bitwise_and(x1, x2)
>>> out
dask.array<bitwise_and, shape=(), dtype=uint32, chunksize=(), chunktype=numpy.ndarray>
>>> int(out)
1

Is the issue here actually to do with broadcasting behaviour? 0d arrays acting as NumPy "scalars", and so, conversely, happening to be compliant? Or is this actually a value-based casting issue?

Another example like this, I think, can be seen in test_add, e.g.

>>> int(da.add(da.asarray(255, dtype="uint32"), da.asarray(1, dtype="uint8")))
256
>>> int(da.add(da.asarray(255, dtype="uint32"), da.asarray([1], dtype="uint8")))
0  # should be 256 - the final array is uint32 but I'm guessing uint8 as an intermediate array caused problems
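
If it helps pin down where that 0 comes from, here's a guess (assuming NumPy < 2.0 value-based casting and that the ufunc is applied chunk-wise—I haven't traced Dask's graph to confirm): the 0d uint32 chunk is treated like a scalar, 255 fits in uint8, so the chunk result is computed in uint8 and wraps before being cast back to the inferred uint32 dtype.

>>> import numpy as np
>>> chunk = np.add(np.array(255, dtype="uint32"), np.array([1], dtype="uint8"))
>>> chunk.dtype  # value-based casting picks uint8, since 255 fits in uint8
dtype('uint8')
>>> chunk.astype("uint32")  # already wrapped to 0, matching the Dask result above
array([0], dtype=uint32)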

@tomwhite
Contributor Author

Thanks for investigating @honno. Sorry I haven't been able to characterise the problem very clearly.

I should explore reworking how elements are generated again—as you see, we generate them naively (for the most part), and throw away bad examples after-the-fact.

That would be great. In general, testing that value-based casting is not happening is what we want (since that's what the spec says), but the ability to disable that check across all tests would be handy.

But I do think test_add is lacking, as we do end up testing the dtype and shape of the output array even if an operation not supported by the Array API has occurred, such as out-of-bounds addition. Is that what you mean?

I think that could be a problem, even if it's not exactly the one that I'm hitting here.

BTW I've created another branch of Dask for the Array API implementation. It includes a workflow to run the test suite here, to track how the implementation is going.

@tomwhite tomwhite changed the title Add a flag to specify strictness Add an option to not fail on value-based casting errors Apr 20, 2022
@honno honno self-assigned this Apr 20, 2022
@honno
Member

honno commented Apr 20, 2022

That would be great. In general, testing that value-based casting is not happening is what we want (since that's what the spec says), but the ability to disable that check across all tests would be handy.

I should have clarified—what seems to be the problem is that for both NumPy-proper and your dask.array branch (excited for this!), the test suite fails on value-based casting in scenarios which are out of the Array API's scope, such as out-of-bounds addition, and these scenarios shouldn't be tested in the first place. That's what I'll explore fixing first anywho.

And yep, if there still end up being areas where Dask has erroneous behaviour due to value-based casting for in-scope scenarios, I can see the utility of such a flag. I'm pretty sure this is going to be a problem with NumPy-proper, due to internal "NumPy scalar" shenanigans. But yeah, we can revisit this after I sort out the aforementioned issue.

@tomwhite
Contributor Author

Sounds great - thanks @honno!

@honno
Member

honno commented Apr 20, 2022

@tomwhite Okay, I see why you weren't sure whether what I was saying was the problem you were facing heh, as it doesn't look like fixing the problem I identified (testing output dtypes/shapes from unspecified operations) does much for your failing dask.array test cases... but thanks for inadvertently discovering that issue heh.

Soooo, to go back to the failing bitwise_and() example, could you expand on this comment? I think that'd help me understand your specific issue better... right now it doesn't look to me like the issue is value-based casting, but broadcasting discrepancies between 0d and multi-dimensional arrays. Sorry again 😅

Hmm just to clarify, that failing bitwise_and() example is due to:

>>> from dask import array as da
>>> x1 = da.asarray(257, dtype="uint32")
>>> x2 = da.asarray([1], dtype="uint8")
>>> int(da.bitwise_and(x1, x2))
ValueError: ...

But if the right argument was a 0d array as opposed to a 1d array, this works as expected:

>>> x2 = da.asarray(1, dtype="uint8")
>>> out = da.bitwise_and(x1, x2)
>>> out
dask.array<bitwise_and, shape=(), dtype=uint32, chunksize=(), chunktype=numpy.ndarray>
>>> int(out)
1

Is the issue here actually to do with broadcasting behaviour? 0d arrays acting as NumPy "scalars", and so, conversely, happening to be compliant? Or is this actually a value-based casting issue?

Another example like this, I think, can be seen in test_add, e.g.

>>> int(da.add(da.asarray(255, dtype="uint32"), da.asarray(1, dtype="uint8")))
256
>>> int(da.add(da.asarray(255, dtype="uint32"), da.asarray([1], dtype="uint8")))
0  # should be 256 - the final array is uint32 but I'm guessing uint8 as an intermediate array caused problems

@asmeurer
Member

As implementations start implementing the spec, it would be useful to be able to run the tests with a flag saying don't fail on strictness errors.

Like @honno said, the test suite shouldn't be checking anything that isn't explicitly required by the spec. If it is, that's a bug in the test suite. If you want to implement "strictness" in your library you'll have to add tests for it in your own test suite (e.g., virtually all of the numpy.array_api tests are for strictness, leaving the other tests to the array_api_tests test suite).

That would be great. In general, testing that value-based casting is not happening is what we want (since that's what the spec says), but the ability to disable that check across all tests would be handy.

"Value-based casting" is a pretty broad term. NumPy has somewhat specific instances where it applies it (with some exceptions), and presumably Dask follows NumPy (?), but there's no reason some other library might be implementing value-based casting in other ways. Really there's not a distinction in the test suite between "value-based casting" and "correct type promotion". The only difference is that "value-based casting" might be different depending on the value, but that's just a question of which values hypothesis happens to generate for the example arrays.

So it sounds like what you would really want is either

a) more general control over what types of inputs hypothesis generates for the tests (i.e., never generate 0-D inputs, because you know those have value-based casting), or

b) a better ability to ignore certain checks (like tests for type promotion) while still being able to test other things like shapes and other semantics.

Both of these are somewhat challenging to do. The first I would say is probably more challenging just due to the way hypothesis works, unless @honno can think of any clever ways to work around it. The best thing I can think of would be to manually patch the hypothesis_helpers file and/or the @given decorators for specific tests to restrict the inputs. For instance, if you know a test will fail for 0-D inputs but still want it to run on non-0-D inputs, you could manually add some assume() call to the top of the test.

However, I will also note that hypothesis itself should be reporting distinct errors, so if 0-D inputs give one error (like bad type promotion because of value-based casting) and non-0-D inputs give some other error that is checked after the type promotion check in the test, the results should show you both of those. So to some degree, this can also just be seen as an issue where we need to improve the way pytest shows the error messages so they are easier to walk through (this is something we've discussed quite a bit in the past).
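
A minimal sketch of that assume() approach (not the suite's actual code—the test body and strategies here are purely illustrative):

from hypothesis import assume, given
from hypothesis.extra import numpy as npst
import numpy as np

@given(npst.arrays(dtype=np.uint8, shape=npst.array_shapes(min_dims=0)))
def test_add_without_0d(x):
    assume(x.ndim > 0)  # discard 0-D examples instead of failing on them
    assert (x + x).dtype == np.uint8  # promotion check now only sees non-0-D inputs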

The second would require refactoring the tests so that instead of having one test function per spec function, we have multiple test functions, effectively one per assert, so that you can easily disable the checks that you already know won't work (like type promotion) while still checking the others. This would be a pretty significant reworking of the test suite, and it's one we've resisted doing because it would add a lot of complexity to the code. It's also hard because sometimes the checks actually depend on each other, so it's not as simple as just splitting a test function into multiple functions. Also, as I noted above, hypothesis does report individual errors separately, and we've specifically given each assertion error a different message so they can be reported distinctly. So as long as some inputs can pass the first checks in a test function to get to the later ones, you will see errors from those later ones.
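
To illustrate the shape of that refactor (hypothetical names and checks—not the suite's real structure), the split might look like:

from hypothesis import given
from hypothesis.extra import numpy as npst
import numpy as np
import numpy.array_api as xp  # experimental, per NEP 47

uint8_arrays = npst.arrays(dtype=np.uint8, shape=npst.array_shapes(min_dims=1))

@given(uint8_arrays)
def test_add_shape(x):  # deselectable independently of the dtype check
    assert (xp.asarray(x) + xp.asarray(x)).shape == x.shape

@given(uint8_arrays)
def test_add_dtype(x):  # the check a library with value-based casting would skip
    assert (xp.asarray(x) + xp.asarray(x)).dtype == xp.uint8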

@tomwhite
Contributor Author

Thanks for the detailed response @asmeurer.

I can see that it's not particularly straightforward to control for the failures in these type promotion edge cases. I've run the test suite a few more times against Dask with a higher setting for max examples, which has found a few more of these edge cases - and there aren't as many as I first feared. (Current list of tests to skip: https://github.com/tomwhite/dask/blob/60405597c2777f0a40216cc1f517fab987591ce0/.github/workflows/array-api.yml#L116-L145.)
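
One way to express such a skip list when invoking the suite is pytest's built-in --deselect flag (the test ID below is illustrative, not the workflow's exact list):

ARRAY_API_TESTS_MODULE=dask.array pytest array_api_tests/ \
    --deselect 'array_api_tests/test_operators_and_elementwise_functions.py::test_add[add(x1, x2)]'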

So in light of your second point (how hypothesis does report individual errors separately), I think the current behaviour is actually fine, and I'd be happy to close this issue.
