refactor: harden typetracer operations #1849
Conversation
Note that, under NumPy's rules, the result of …
Force-pushed from bd53abd to a1fa4b8.
I recognize why we'd want to be rigorous like this, but that could easily make common operations fail to be Dask-delayable. Fortunately, we have a model: what does Dask do?

```python
>>> import numpy as np, dask.array as da
>>> np.array([1, 2, 3], np.uint8) + np.uint16(10)
array([11, 12, 13], dtype=uint8)
>>> np.array([1, 2, 3], np.uint8) + np.uint16(300)
array([301, 302, 303], dtype=uint16)
```

Okay, so NumPy promotes the dtype if the value is large enough (value dependent).

```python
>>> da.array(np.array([1, 2, 3], np.uint8)) + da.array(np.uint16(10))
dask.array<add, shape=(3,), dtype=uint16, chunksize=(3,), chunktype=numpy.ndarray>
>>> da.array(np.array([1, 2, 3], np.uint8)) + da.array(np.uint16(300))
dask.array<add, shape=(3,), dtype=uint16, chunksize=(3,), chunktype=numpy.ndarray>
```

Dask promotes if the type is large enough. TypeTracer is designed first and foremost for dask-awkward; it hasn't been used in anything else and it's still a private implementation, anyway. If it acquires another purpose, we can think about generalizing it, but for now, we should think of it as 100% supporting dask-awkward. Since …
So, my overarching concern here is for … To be explicit, this currently isn't the case if we have (or add in future) any code that evaluates …

With respect to the value-dependence of …: thus far, we may never rely on this kind of behavior being identical between …

The "robust" solution here is to use the new Array API standard: https://numpy.org/neps/nep-0047-array-api-standard.html, which is effectively what …

@jpivarski I originally wrote that …
Right, we don't want a TypeTracer Array and a non-TypeTracer Array to return different results, because then the dask-awkward operation would still fail, but it would do so when it's pulling back the results and testing them against the expected Form. How about if we follow …?

The two commits that loosen TypeTracer/non-TypeTracer agreement to Type, rather than Form, are something that I've been on the fence about. dask-array only needs to know Types; ensuring that TypeTracers reproduce Forms was a decision to be strict so that we wouldn't need to tighten it (which is much harder than loosening it) if we ever find out that we need Forms. Numba and RDataFrame interfaces need to know Forms, but they can separately compile on each worker, so it's not necessary to give dask-awkward an exact Form.

dask-awkward has been out for a while, and Uproot is using/testing it, though not many users are. At some point, we'll be comfortable enough to say that Dask isn't ever going to need to know exact Forms, and we can then loosen the agreement to Type across the board. It seems to me that a1fa4b8 and 59bb711 are not necessary, so let's hold off on them until we make this decision globally. I think it will probably happen, and it's not API-breaking, so we can do it after users have been working with 2.x for a while.

Anyway, this point about loosening TypeTracer agreement to Type is unrelated to the dtype choice above, because dtype is a Type thing.
We had a discussion on Slack, and here's a summary: …

This PR will therefore walk back 59bb711 and a1fa4b8, which loosened the predictive power of typetracer, and work on standardising the nplikes.
Force-pushed from 0c69b73 to a4ea6b0.
This PR looks scary, but it's fairly straightforward so far. The plan here is still to formalise the NumpyLike type promotion rules, and implement an API that closely resembles the Array API. We won't be able to actually invoke the Array API whilst we want to support more types, but we can ensure that we look like the Array API for future ease / API decisions.

Following the Array API is so far a rewarding process; the interfaces are simple and predictably typed. The area that's currently a bit tricky is working out result dtypes; we need to define our own results here, because the Array API doesn't include complex numbers (draft) or datetime/timedeltas yet. I'm happy that we've got the approximate rules in place, but it's not clear which rules NumPy applies; it looks like these types are unlikely to be standardised by the Array API any time soon, so we'll need to decide upon this for ourselves.

Note that we don't need to implement what NumPy does here; we just need to choose something, and stick to it. Ideally, we'll choose something that's a subset of NumPy's behavior so that we can directly use the corresponding methods. Again, this is an internal part of Awkward, so we're able to impose these kinds of constraints upon ourselves. NumPy uses these rules for …

RE complex types, we currently allow float-complex mixing, as it can be a lossless conversion. I'll give this some more thought.
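As a concrete illustration of the value-independent rules being discussed (a sketch in plain NumPy, not the NumpyLike API): promotion computed from dtypes alone is deterministic, which is exactly what a typetracer needs; NumPy's legacy (pre-NEP 50) value-based casting only kicks in when actual scalar values are inspected.

```python
import numpy as np

# Type-only promotion: deterministic and value-independent
assert np.result_type(np.uint8, np.uint16) == np.dtype(np.uint16)

# Operating on length-0 arrays exercises only the type-based rules,
# since there are no values for NumPy to inspect
result = np.empty(0, dtype=np.uint8) + np.empty(0, dtype=np.uint16)
assert result.dtype == np.dtype(np.uint16)
```

Both checks hold across NumPy versions, which is what makes type-based prediction a stable target.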
Yes, complex types seem pretty straightforward, since they are exact supersets of the corresponding floating point types. For date-time types, we can refuse every type of mixing except for the few mathematical operations that are defined (with commutative symmetry for …).

I don't think there are any other mathematical operations that apply to date-time types. I looked at the NumPy reference (and also found this), and was drawn to the comment about "nonlinear time units." I wondered what they meant by that. It's the fact that NumPy accepts time units like "months" and "years," which can have a variable number of days, depending on their absolute time position. THAT is a huge can of worms. We should avoid ever computing … ourselves; let NumPy deal with the possibility that …

I think we do let NumPy handle all of this, since the above are ufuncs, and reducers like … You asked about cases like …
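For reference, the small closed set of date-time operations that NumPy defines can be checked directly (an illustrative sketch, not Awkward code):

```python
import numpy as np

start = np.array(["2021-01-01"], dtype="datetime64[D]")
end = np.array(["2021-01-31"], dtype="datetime64[D]")

# datetime64 - datetime64 -> timedelta64
delta = end - start
assert delta.dtype == np.dtype("timedelta64[D]")

# datetime64 + timedelta64 -> datetime64, with commutative symmetry
assert (start + delta == end).all()
assert (delta + start == end).all()

# datetime64 + datetime64 is refused outright
raised = False
try:
    start + end
except TypeError:
    raised = True
assert raised
```

Letting NumPy be the arbiter of which combinations are legal keeps us out of the "nonlinear time units" business entirely.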
Whilst I remember, with respect to the type inspection interface, the standard doesn't yet implement everything, but I'm planning to take this as inspiration: data-apis/array-api#425
The only concern I have is that we want to anticipate what the spec does so that we don't deviate unnecessarily. In the long-term, there may be a point where the Array API supports complex types and datetimes (I suspect a long-long way away), and at that point most of our nplike would become pass-throughs. In particular, the Array API doesn't mix int and float types. There seems to be a lean towards allowing complex and floats to mix, I imagine because the result is obvious (float64→complex128).
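To make the "pass-through" endgame concrete, here is a hypothetical sketch (the class and method names are illustrative, not Awkward's actual NumpyLike) of an nplike method reduced to a thin shim over a backing array module:

```python
import numpy as np

class ThinShimNumpyLike:
    """Hypothetical nplike whose methods simply forward to a backing module."""

    def __init__(self, module):
        self._module = module

    def add(self, x1, x2):
        # Once the backing module implements the Array API with the promotion
        # rules we want, this method needs no logic of its own.
        return self._module.add(x1, x2)

nplike = ThinShimNumpyLike(np)
out = nplike.add(np.array([1, 2], np.uint8), np.array([3, 4], np.uint16))
assert out.dtype == np.dtype(np.uint16)
```

The value of writing the shims now is that the day the Array API covers all our types, only the backing module changes.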
Yes, this is special because the …
@jpivarski this reply is turning into a long post, so I'll give it some structure. I'll probably keep editing this, so best check the web version only.

The Discussion so Far

I've also come to realise that this is a bigger can of worms than I had first anticipated. Let me summarise the worm-can here for posterity: …

How we use NumPy / …
Yes and yes: we absolutely want valueless promotion semantics, and we want it to be forward-looking, to agree with the Array API instead of NumPy. Eventually, NumPy will adhere to the Array API by default, so this is a temporary situation. Also, as you point out, we are already okay with some small differences from NumPy when they're well motivated.

How to implement it without introducing a lot of code to maintain: I hope we'll be able to create empty arrays, apply the operation to them, and read the dtype off the resulting (also empty) array. As a reminder, this was the problem: …

Suppose that we pull in …

```python
>>> import numpy as np
>>> import numpy.array_api
<stdin>:1: UserWarning: The numpy.array_api submodule is still experimental. See NEP 47.
>>> np.array_api.empty(0, dtype=np.array_api.uint8) + np.array_api.empty(0, dtype=np.array_api.uint16)
empty((0,), dtype=uint16)
```

I think we're currently using empty arrays as canaries (canary typing is some corollary to duck typing?), and when we don't put values in those arrays, even the current NumPy API does the above:

```python
>>> np.empty(0, dtype=np.uint8) + np.empty(0, dtype=np.uint16)
array([], dtype=uint16)
```

So maybe we can't just pass arrays (with values) to NumPy and assume that it will do the right type propagation, but we can use empty arrays to predict a type, then apply it to the result, right? If need be, we can use …
So far I've formalised the NumpyLike API, which looks more like the Array API, but doesn't make as many guarantees about type promotion. After our conversation on …, I realised that the most obvious thing to start with is … Then I thought about … In code, this looks like

```python
import enum
import operator

class TypeTracerTraits(str, enum.Enum):
    POSITIVE = "POSITIVE"

# Don't preserve positivity between `positive` arrays under these operations
TypeTracerArray.drop_trait_for(
    TypeTracerTraits.POSITIVE,
    operator.sub,
    operator.inv,
    operator.neg,
    operator.ge,
    operator.le,
    operator.gt,
    operator.lt,
    operator.eq,
    operator.ne,
)

# Allow non-TypeTracerArray scalars to automatically gain the positive trait, e.g. in `x + 1`
@TypeTracerArray.initialises_scalar_trait
def _initialise_scalar_trait(value):
    if isinstance(value, (int, float)) and value >= 0:
        return TypeTracerTraits.POSITIVE
```

This is the current WIP: https://github.com/scikit-hep/awkward/blob/agoose77/refactor-typetracer/src/awkward/_nplikes/typetracer.py

@jpivarski how do you feel about this code? Are you comfortable with merging the scalar and array types, and secondly with moving the "length"-ness to a runtime trait?
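To make the trait idea runnable, here is a tiny self-contained sketch (all names hypothetical; it is not the real `TypeTracerArray` API) of traits surviving trait-preserving operations and being dropped by others:

```python
import operator

# Operations that discard the POSITIVE trait in this toy model
TRAIT_DROPPING_OPS = {operator.sub, operator.neg, operator.inv}

class TracedValue:
    """Toy stand-in for a typetracer array carrying a set of traits (no data)."""

    def __init__(self, traits=frozenset()):
        self.traits = frozenset(traits)

def apply_op(op, *args):
    # Only traits shared by every operand can survive; trait-dropping
    # operations clear them. Value computation is deliberately omitted:
    # a typetracer has no values.
    traits = frozenset.intersection(*(a.traits for a in args))
    if op in TRAIT_DROPPING_OPS:
        traits = frozenset()
    return TracedValue(traits)

x = TracedValue({"POSITIVE"})
y = TracedValue({"POSITIVE"})
added = apply_op(operator.add, x, y)       # POSITIVE survives
subtracted = apply_op(operator.sub, x, y)  # POSITIVE is dropped
```

The intersection rule is the conservative choice: a trait is only asserted when every input guarantees it.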
Making … Making …

I think you can get away with just assigning …
Force-pushed from 56837d9 to 97a2589.
Force-pushed from 97a2589 to fba011f.
@jpivarski what does … mean? Are you referring to the concept that zero is neither positive nor negative? I am using the CS interpretation here; that sign exists independently of the magnitude :)
Using … is standard usage; I haven't noticed any deviation from this in CS. My absolute favorite use of precision language in documentation is Java's core library docs, which defines … Since the irrational number …

Floating point numbers have a separate sign bit, such that

```python
>>> np.array([0.0]).tobytes()
b'\x00\x00\x00\x00\x00\x00\x00\x00'
>>> np.array([-0.0]).tobytes()
b'\x00\x00\x00\x00\x00\x00\x00\x80'
```

but these two values of zero are equal to each other; there is one zero value associated with two bit patterns. (There are many distinct bit patterns for the NaN value.) However, integers (…):

```python
>>> np.array([0]).tobytes()
b'\x00\x00\x00\x00\x00\x00\x00\x00'
>>> np.array([1]).tobytes()
b'\x01\x00\x00\x00\x00\x00\x00\x00'
>>> np.array([-1]).tobytes()
b'\xff\xff\xff\xff\xff\xff\xff\xff'
```
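As a quick aside (an illustrative sketch, not from the original thread): `np.signbit` reads the stored sign bit directly, so it distinguishes `-0.0` from `0.0` even though they compare equal.

```python
import numpy as np

# -0.0 and 0.0 are equal as values...
assert np.float64(-0.0) == np.float64(0.0)

# ...but their sign bits differ, which np.signbit exposes
assert np.signbit(np.float64(-0.0))
assert not np.signbit(np.float64(0.0))
```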
Right, that's what I'm referring to. One of the most amusing classes I demonstrated in Physics involved discovering how people think floats work, vs how they actually work!
I audibly nose exhaled to this.
Hmm, we might be crossing wires; I'm talking about … Anyway 🚂, let's go for non-negative. That suits me fine!
Force-pushed from 3450e70 to e405d32.
Force-pushed from e405d32 to 5af3605.
WIP: add an API to smooth over typetracer changes
Something for me to think about here is what should …
…awkward-1.0 into agoose77/refactor-typetracer
Jim and I had a conversation about this. Since #2020, #1928, etc., we have much less need to formally build out our array abstraction. At the same time, the challenges with this PR (the large amount of new code, and the need to use our own array object) are still present. I've decided to reduce the scope of this PR; instead, we will prohibit operations on arrays that do not proceed via the nplike API, and implement the appropriate tests there such that we do not need a special array object. This can be accompanied by a runtime test to ensure we are only using nplike, or a mypy-level type check.
Closing to ... once again ... create a new PR.
Awkward's `nplike` mechanism currently exposes implementation details w.r.t. the underlying array libraries used. This poses a problem when these implementations differ, or when we need to replicate them for typetracer, which has no backing array module. To improve this, we should more strongly define the `NumpyLike` API, and move away from value-based behaviour that applies to a subset of operations. This will make it easier to ensure that our `nplike`s are well behaved and consistent, at a cost of writing more code than the simple shims that we currently define.

Primary Goals

- `NumpyLike` …

Secondary Goals

The secondary goal of this PR is to make the `UnknownScalar` object complain if it is used in a concrete context. This will break existing usages, but I think this change will make it harder to write code that has implicitly wrong behavior when a typetracer array is passed through. I.e., this change will require more `if nplike.known_data` in our content classes. I think this is a good change, though; the `NumpyLike` mechanism can't anticipate how users of the `nplike` will interpret the data, so we should surface this logic explicitly where it matters.

Implementation
The `NumpyLike` mechanism serves two purposes: …

(1) will ultimately be handled by the Array API that NumPy et al. are implementing. This stricter subset of the NumPy API provides guarantees about output types, and is designed for cross-compatibility between array libraries. It is not ready, however, for use across all NumPy versions, and does not yet cover the full suite of types and operations that we need. The latest draft adds support for complex datatypes, but datetime objects remain missing. There are several reasons why we cannot simply adopt the Array API outright:

- `datetime64` / `complexXXX`?

Therefore, the `NumpyLike` API will deviate from the Array API standard in order to support this information. In general, it will be preferable to add special NumpyLike methods, e.g. `as_contiguous()`, over new parameters to the `NumpyLike` API. In addition to new parameters, there are also parameters in the Array API specification that may not be appropriate for `nplike`. It would be preferable to define unused arguments anyway, so that we can later replace `NumpyLike` methods with thin shims over the respective Array API object, i.e. …

We could implement an Array API namespace for typetracer, such that the `TypeTracerArray` can be operated upon by third-party code that supports the Array API. There is no strong motivation for this; `NumpyLike` should remain an internal detail, and no users should ever see e.g. `TypeTracerArray` in their interactions with Awkward. Because of this, there is little argument to be made for directly using the Array API in code that currently consumes the `NumpyLike` in future, given that we still have to consider (2).

(2) will remain an important motivation for NumpyLikes. We have small additions like `is_eager` and `is_own_array` that abstract library-specific features. We could remove `is_own_array` as `array.__array_namespace__() is array_api_obj`, but these other features remain important. Furthermore, we want to flavour a layout by its nplike, so we need to keep the "expected" array API information in the layout somewhere. We don't want to have to implement the entire Array API for e.g. TypeTracer, so it's best that we go through the `NumpyLike` mechanism in order to explicitly define the available functions.

Tasks
- Make `UnknownScalar` a non-concrete object
- Make `TypeTracer.add` etc. complain for non-predictable dtypes
- `ListArray._pad_none` always adds option
- `finfo` and `iinfo` to `NumpyLike`
- `nplike.ndarray` usage with `nplike.is_own_array`
📚 The documentation for this PR will be available at https://awkward-array.readthedocs.io/en/agoose77-refactor-typetracer/ once Read the Docs has finished building 🔨